The Smoothed Dirichlet Distribution: Understanding Cross-Entropy Ranking in Information Retrieval

Abstract

Unigram language modeling is a successful probabilistic framework for Information Retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature of this approach is its use of the empirically successful cross-entropy function between the query model and document models as a document ranking function. However, this function does not follow directly from the underlying models, and to date no justification for its use has been available. A related observation is that the naive Bayes model for text classification uses the same multinomial distribution to model documents but, in contrast, employs document log-likelihood, which follows directly from the model, as its scoring function. Curiously, document log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR performs better than document log-likelihood, but this phenomenon remains largely unexplained.
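The asymmetry described in the abstract can be illustrated with a minimal sketch. The code below is not taken from the report: the toy vocabulary, the counts, the uniform background model, the linear-interpolation smoothing with weight `lam`, and all function names are assumptions chosen for illustration only. It computes the negative cross entropy of a document model with respect to a query model (the IR-style score) and the multinomial document log-likelihood under a query-side model (the naive Bayes-style score), and checks that the latter equals the document length times a cross entropy with the two arguments swapped.

```python
import math

def smoothed_model(counts, background, lam=0.5):
    """Mix the maximum-likelihood estimate with a background model.
    The interpolation weight lam is an illustrative assumption."""
    total = sum(counts.values())
    return {w: (1 - lam) * counts.get(w, 0) / total + lam * background[w]
            for w in background}

def neg_cross_entropy(p, q):
    """Return sum_w p(w) * log q(w), i.e. -H(p, q); zero-mass terms are skipped."""
    return sum(pw * math.log(q[w]) for w, pw in p.items() if pw > 0)

def doc_log_likelihood(doc_counts, model):
    """Multinomial log-likelihood of the document counts under `model`,
    dropping the count-only multinomial coefficient."""
    return sum(c * math.log(model[w]) for w, c in doc_counts.items())

# Hypothetical toy data.
vocab = ["ir", "model", "query", "text"]
background = {w: 1.0 / len(vocab) for w in vocab}
query_counts = {"query": 1, "model": 1}
doc_counts = {"ir": 2, "model": 1, "text": 1}

query_ml = {w: c / sum(query_counts.values()) for w, c in query_counts.items()}
doc_ml = {w: c / sum(doc_counts.values()) for w, c in doc_counts.items()}
query_model = smoothed_model(query_counts, background)
doc_model = smoothed_model(doc_counts, background)

# IR-style score: query distribution weights the log of the document model.
ir_score = neg_cross_entropy(query_ml, doc_model)

# Naive Bayes-style score: document counts weight the log of the query-side model.
nb_score = doc_log_likelihood(doc_counts, query_model)

# The log-likelihood is |D| times a cross entropy with the arguments swapped.
doc_len = sum(doc_counts.values())
swapped = doc_len * neg_cross_entropy(doc_ml, query_model)
print(ir_score, nb_score, swapped)
```

In this sketch the two scores differ only in which distribution supplies the weights and which sits inside the logarithm, which is the asymmetry the abstract refers to.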

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2006
Accession Number: ADA477407

Entities

People

Ramesh Nallapati

Organizations

University of Massachusetts Amherst
