The Smoothed Dirichlet Distribution: Understanding Cross-Entropy Ranking in Information Retrieval
Abstract
Unigram language modeling is a successful probabilistic framework for Information Retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature of this approach is its use of the empirically successful cross-entropy function between the query model and the document models as a document ranking function. However, this function does not follow directly from the underlying models, and as such no justification for its use has been available to date. A related and interesting observation is that the naive Bayes model for text classification uses the same multinomial distribution to model documents but, in contrast, employs the document log-likelihood, which follows directly from the model, as its scoring function. Curiously, the document log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR performs better than the document log-likelihood, but this interesting phenomenon remains largely unexplained.
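The asymmetry described in the abstract can be illustrated with a small sketch. It is not taken from the report: the function names, the toy documents, and the choice of Jelinek-Mercer (linear interpolation) smoothing are all assumptions made for illustration. Under the standard definitions, the IR score is the negative cross entropy of the smoothed document model relative to the query model, while the naive Bayes document log-likelihood equals the document length times the negative cross entropy taken in the opposite direction.

```python
import math
from collections import Counter

def mle(tokens, vocab):
    """Maximum-likelihood unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocab}

def smooth(model, background, lam=0.5):
    """Jelinek-Mercer smoothing (an assumption; the abstract names no scheme)."""
    return {w: (1.0 - lam) * model[w] + lam * background[w] for w in model}

def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) log q(w); q must be positive wherever p is."""
    return -sum(p[w] * math.log(q[w]) for w in p if p[w] > 0.0)

def ir_score(query_model, smoothed_doc_model):
    """Language-modeling ranking: -H(query model, document model)."""
    return -cross_entropy(query_model, smoothed_doc_model)

def doc_log_likelihood(doc_tokens, smoothed_topic_model):
    """Naive Bayes style score: log P(document | topic model)."""
    return sum(math.log(smoothed_topic_model[w]) for w in doc_tokens)

# Hypothetical toy collection; the query is assumed to use in-vocabulary words.
docs = {
    "d1": "language model retrieval language".split(),
    "d2": "image retrieval pixel".split(),
}
query = "language model".split()

collection = [w for tokens in docs.values() for w in tokens]
vocab = sorted(set(collection))
background = mle(collection, vocab)

query_mle = mle(query, vocab)
# Here the query model stands in for the naive Bayes class (topic) model.
query_smoothed = smooth(query_mle, background)

for name, tokens in docs.items():
    doc_smoothed = smooth(mle(tokens, vocab), background)
    print(name,
          "IR score:", ir_score(query_mle, doc_smoothed),
          "doc log-likelihood:", doc_log_likelihood(tokens, query_smoothed))
```

The two scores place the logarithm on different models: the IR score weights the log of the document model by query probabilities, whereas the document log-likelihood weights the log of the query (topic) model by document counts. The sketch only shows the structural difference and says nothing about which ranks better in practice.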
Document Details
- Document Type: Technical Report
- Publication Date: Jul 01, 2006
- Accession Number: ADA477407
Entities
People
- Ramesh Nallapati
Organizations
- University of Massachusetts Amherst