A Note on Topical N-Grams
Abstract
Most popular topic models, such as Latent Dirichlet Allocation, rest on the bag-of-words assumption. Text, however, is a sequence of discrete word tokens. Without word order (in other words, the nearby context in which a word occurs), word co-occurrences alone cannot capture the meaning of language accurately. Collocations of words (phrases) therefore have to be considered. Like individual words, however, phrases can be polysemous depending on context. More noticeably, a composition of two or more words is a phrase in some contexts but not in others. In this paper, the authors propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics. They present results on large text corpora.
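The abstract's point about word order can be illustrated with a minimal sketch. This is not the authors' model: the two example sentences and the helper names `unigram_counts` and `bigram_counts` are hypothetical, chosen only to show that two token sequences can share the same bag of words while differing in the adjacent word pairs a phrase-aware model would need to see.

```python
from collections import Counter

def unigram_counts(tokens):
    """Bag-of-words view: word frequencies, order discarded."""
    return Counter(tokens)

def bigram_counts(tokens):
    """Order-aware view: frequencies of adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical sentences built from the same words in a different order.
a = "the white house issued a statement".split()
b = "the house issued a white statement".split()

# The bag-of-words representations are identical ...
same_bag = unigram_counts(a) == unigram_counts(b)
# ... but only the first sequence contains the pair ("white", "house").
same_bigrams = bigram_counts(a) == bigram_counts(b)

print(same_bag, same_bigrams)
```

A model restricted to unigram counts cannot tell these two sequences apart, which is the limitation the abstract attributes to the bag-of-words assumption.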
Document Details
- Document Type: Technical Report
- Publication Date: Dec 24, 2005
- Accession Number: ADA449632
Entities
People
- Andrew McCallum
- Xuerui Wang
Organizations
- University of Massachusetts Amherst