Autocorrelation and Regularization of Query-Based Information Retrieval Scores

Abstract

Query-based information retrieval is the process of scoring documents given a short natural language query. Such systems have been developed to support searching diverse collections such as the World Wide Web, personal email archives, news corpora, and legal collections. This thesis is motivated by one of the tenets of information retrieval: the cluster hypothesis. Based on the cluster hypothesis, we define a design principle stating that retrieval scores should be locally consistent. We refer to this design principle as score autocorrelation. Our experiments show that the degree to which retrieval scores satisfy this design principle correlates positively with system performance. We use this result to define a general, black-box method for improving the local consistency of a set of retrieval scores. We refer to this process as local score regularization. We demonstrate that regularization consistently and significantly improves retrieval performance for a wide variety of baseline algorithms. Regularization is closely related to classic techniques such as pseudo-relevance feedback and cluster-based retrieval. We demonstrate that the effectiveness of these techniques may be explained by their regularizing behavior. We argue that regularization should be adopted either as a generic post-processing step or as a fundamental design principle for retrieval models.
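The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes documents are linked by a similarity graph, measures local consistency as the Pearson correlation between each document's score and the weighted average score of its neighbors, and regularizes by repeatedly blending the original scores with neighbor averages. The function names, the `alpha` parameter, and the toy data are all hypothetical and are not taken from the report.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def row_normalize(affinity: Matrix) -> Matrix:
    """Scale each row of a nonnegative affinity matrix so it sums to 1."""
    normalized = []
    for row in affinity:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def neighbor_average(weights: Matrix, scores: List[float]) -> List[float]:
    """Weighted average of neighbor scores for every document."""
    return [sum(w * s for w, s in zip(row, scores)) for row in weights]


def autocorrelation(weights: Matrix, scores: List[float]) -> float:
    """Assumed measure: correlation between scores and neighbor averages."""
    neighbors = neighbor_average(weights, scores)
    n = len(scores)
    mean_s = sum(scores) / n
    mean_n = sum(neighbors) / n
    cov = sum((s - mean_s) * (m - mean_n) for s, m in zip(scores, neighbors))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_n = sum((m - mean_n) ** 2 for m in neighbors)
    if var_s == 0 or var_n == 0:
        return 0.0  # constant scores: correlation is undefined, report 0
    return cov / sqrt(var_s * var_n)


def regularize(weights: Matrix, scores: List[float],
               alpha: float = 0.5, iterations: int = 50) -> List[float]:
    """Assumed smoother: iterate f <- (1 - alpha) * scores + alpha * W f."""
    smoothed = list(scores)
    for _ in range(iterations):
        averaged = neighbor_average(weights, smoothed)
        smoothed = [(1 - alpha) * y + alpha * a
                    for y, a in zip(scores, averaged)]
    return smoothed


# Toy collection (hypothetical): documents 0 and 1 are near-duplicates,
# as are documents 2 and 3, but the baseline scores 0 and 1 very differently.
affinity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
scores = [0.9, 0.1, 0.2, 0.3]

weights = row_normalize(affinity)
before = autocorrelation(weights, scores)
smoothed = regularize(weights, scores)
after = autocorrelation(weights, smoothed)

print("autocorrelation before:", round(before, 3))
print("autocorrelation after: ", round(after, 3))
print("regularized scores:    ", [round(s, 3) for s in smoothed])
```

In this toy case the smoother pulls the scores of the two near-duplicate documents toward each other, which raises the measured autocorrelation. Because it only needs the scores and a similarity graph, it operates as a post-processing step on any baseline, in the black-box spirit the abstract describes.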

Document Details

Document Type: Technical Report
Publication Date: Feb 01, 2008
Accession Number: ADA477497

Entities

People

  • Fernando Diaz

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Ground and Sea Platforms

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automata Theory
  • Computational Science
  • Computer Languages
  • Data Mining
  • Dimensionality Reduction
  • Information Processing
  • Information Retrieval
  • Information Science
  • Information Systems
  • Knowledge Management
  • Machine Learning
  • Natural Language Processing
  • Network Science
  • Statistics
  • Supervised Machine Learning
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Brain and Cognitive Science; Experimental Psychology; Cognitive Neuroscience
  • Computer Vision
  • Database Systems and Applications

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms