Improving the Estimation of Relevance Models Using Large External Corpora

Abstract

Information retrieval algorithms leverage various collection statistics to improve performance. Because these statistics are often computed on a relatively small evaluation corpus, we believe using larger, non-evaluation corpora should improve performance. Specifically, we advocate incorporating external corpora based on language modeling. We refer to this process as external expansion. When compared to traditional pseudo-relevance feedback techniques, external expansion is more stable across topics and up to 10% more effective in terms of mean average precision. Our results show that using a high-quality corpus that is comparable to the evaluation corpus can be as effective as, if not more effective than, using the web. Our results also show that external expansion outperforms simulated relevance feedback. In addition, we propose a method for predicting the extent to which external expansion will improve retrieval performance. Our new measure demonstrates positive correlation with improvements in mean average precision.
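
As a rough illustration of the idea summarized above, the following sketch estimates a relevance model (a distribution over expansion terms) from the top-ranked documents of an external corpus rather than from the evaluation corpus. It is not taken from the report: the function name `relevance_model`, the choice of Dirichlet smoothing, the uniform document prior, and all parameter values are assumptions made for illustration only.

```python
import math
from collections import Counter


def relevance_model(query, external_docs, top_k=10, mu=1000.0, n_terms=20):
    """Estimate P(w | Q) from top-ranked documents of an external corpus.

    Hypothetical helper: `query` is a list of tokens and `external_docs`
    is a list of token lists. Smoothing and priors are assumptions.
    """
    docs = [doc for doc in external_docs if doc]  # skip empty documents
    coll = Counter(t for doc in docs for t in doc)
    coll_len = sum(coll.values())
    # Ignore query terms the external corpus has never seen (avoids log(0)).
    terms = [q for q in query if coll[q] > 0]

    scored = []
    for doc in docs:
        tf = Counter(doc)
        # Dirichlet-smoothed query log-likelihood, log P(Q | D).
        loglik = sum(
            math.log((tf[q] + mu * coll[q] / coll_len) / (len(doc) + mu))
            for q in terms
        )
        scored.append((loglik, tf, len(doc)))

    top = sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]
    if not top:
        return {}

    # Normalized P(Q | D) over the top documents (uniform document prior).
    best = top[0][0]
    weights = [math.exp(ll - best) for ll, _, _ in top]
    z = sum(weights)

    # P(w | Q) is a weighted mixture of the documents' term distributions.
    model = Counter()
    for w_d, (_, tf, length) in zip(weights, top):
        for term, count in tf.items():
            model[term] += (w_d / z) * count / length

    # Keep the highest-probability terms and renormalize.
    kept = model.most_common(n_terms)
    total = sum(p for _, p in kept)
    return {term: p / total for term, p in kept}


external = [
    "information retrieval language model".split(),
    "retrieval feedback query expansion".split(),
    "cooking pasta recipe".split(),
]
expansion = relevance_model(["retrieval"], external, top_k=2, mu=10.0)
```

In a setup like this, the resulting term distribution would typically be combined with the original query and then run against the evaluation corpus; how the report performs that combination is not stated on this page.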

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
ADA449013

Entities

People

  • Donald Metzler
  • Fernando Diaz

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Arithmetic
  • Bayesian Networks
  • Computational Science
  • Computer Science
  • Data Sets
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Natural Language Processing
  • Precision
  • Probability
  • Probability Distributions
  • Statistics
  • Test And Evaluation

Fields of Study

  • Computer science

Readers

  • Approximation Theory
  • Information Retrieval
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks