The Smoothed Dirichlet Distribution: Understanding Cross-Entropy Ranking in Information Retrieval

Abstract

Unigram language modeling is a successful probabilistic framework for Information Retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature of this approach is its use of the empirically successful cross-entropy function between the query model and document models as a document ranking function. However, this function does not follow directly from the underlying models, and so no justification for its use has been available to date. A related observation is that the naive Bayes model for text classification uses the same multinomial distribution to model documents but, in contrast, employs as its scoring function the document log-likelihood, which follows directly from the model. Curiously, the document log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR performs better than document log-likelihood, but this phenomenon remains largely unexplained.
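The asymmetry described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the report: the toy query and documents, the helper names (mle, smooth, cross_entropy, log_likelihood), and the linear-interpolation smoothing with lam = 0.5 are all assumptions made for illustration. It computes the language-modeling score, -H(query model, document model), and the reversed direction, -H(document model, query model), which is the per-token document log-likelihood under a multinomial model.

```python
import math
from collections import Counter

def mle(tokens, vocab):
    """Maximum-likelihood unigram (multinomial) model over a fixed vocabulary."""
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in vocab}

def smooth(model, background, lam=0.5):
    """Linear interpolation with a collection model (assumed smoothing choice)."""
    return {w: (1 - lam) * model[w] + lam * background[w] for w in model}

def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) log q(w); q must be nonzero wherever p is."""
    return -sum(pw * math.log(q[w]) for w, pw in p.items() if pw > 0)

def log_likelihood(tokens, model):
    """Log-probability of a token sequence under a unigram model."""
    return sum(math.log(model[w]) for w in tokens)

# Hypothetical toy data, chosen only to make the two scores easy to compare.
query = ["cross", "entropy"]
docs = {
    "d1": ["cross", "entropy", "ranking", "cross"],
    "d2": ["naive", "bayes", "text", "ranking"],
}

collection = [w for tokens in docs.values() for w in tokens]
vocab = sorted(set(collection))
background = mle(collection, vocab)

query_mle = mle(query, vocab)
query_smoothed = smooth(query_mle, background)

lm_scores = {}       # -H(query, document): the direction used in IR language modeling
reverse_scores = {}  # -H(document, query): per-token document log-likelihood
for name, tokens in docs.items():
    doc_mle = mle(tokens, vocab)
    doc_smoothed = smooth(doc_mle, background)
    lm_scores[name] = -cross_entropy(query_mle, doc_smoothed)
    reverse_scores[name] = -cross_entropy(doc_mle, query_smoothed)

for name in docs:
    print(name, round(lm_scores[name], 4), round(reverse_scores[name], 4))
```

In this sketch, log_likelihood(tokens, query_smoothed) equals len(tokens) * reverse_scores[name] for each document, which is the sense in which document log-likelihood corresponds to the reversed cross entropy. The two directions generally give different scores for the same query-document pair.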

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2006
Accession Number: ADA477407

Entities

People

  • Ramesh Nallapati

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Autonomy
  • Biomedical

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence Software
  • Computational Science
  • Data Sets
  • Generative Models
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Maximum Likelihood Estimation
  • Network Science
  • Probabilistic Models
  • Probability
  • Probability Distributions
  • Random Variables
  • Supervised Machine Learning
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Statistical Inference
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval