A Note on Topical N-Grams

Abstract

Most popular topic models, such as Latent Dirichlet Allocation, rest on a bag-of-words assumption. However, text is a sequence of discrete word tokens, and if word order (in other words, the nearby context in which a word appears) is ignored, word co-occurrences alone cannot capture the meaning of language accurately. Collocations of words (phrases) therefore have to be considered. Like individual words, however, phrases can be polysemous depending on context. More notably, a composition of two or more words is a phrase in some contexts but not in others. In this paper, the authors propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics. They present results on large text corpora.
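
As a small illustration of the bag-of-words limitation the abstract describes, the following Python sketch compares two toy sentences. The sentences and the helper names `bag_of_words` and `bigrams` are hypothetical and chosen only for this example; the sketch is not the authors' model, only a demonstration that unordered counts can be identical while adjacent word pairs differ.

```python
from collections import Counter


def bag_of_words(tokens):
    """Unordered word counts: the representation a bag-of-words model sees."""
    return Counter(tokens)


def bigrams(tokens):
    """Adjacent word pairs, which preserve local word order."""
    return list(zip(tokens, tokens[1:]))


# Two toy sentences built from exactly the same words in a different order.
a = "the white house is big".split()
b = "the big house is white".split()

# The bags of words are identical, so a bag-of-words model cannot tell them apart.
same_bag = bag_of_words(a) == bag_of_words(b)

# The bigrams differ: only sentence `a` contains the candidate phrase "white house".
shared_bigrams = set(bigrams(a)) & set(bigrams(b))

print("same bag of words:", same_bag)
print("bigrams in a:", bigrams(a))
print("bigrams in b:", bigrams(b))
print("shared bigrams:", shared_bigrams)
```

Here the pair ("white", "house") appears only in the first sentence, which is the kind of order-dependent evidence that word counts alone discard.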

Document Details

Document Type: Technical Report
Publication Date: Dec 24, 2005
Accession Number: ADA449632

Entities

People

  • Andrew McCallum
  • Xuerui Wang

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automata Theory
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Generative Models
  • Hidden Markov Models
  • Information Processing
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Markov Models
  • Natural Language Processing
  • Neural Networks
  • Probability
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Statistical inference

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation