Linguistic Extensions of Topic Models

Abstract

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used on text, where documents are modeled as exchangeable groups of words. In this context, topic models discover topics, distributions over words that express a coherent theme like "business" or "politics." While one of the strengths of topic models is that they make few assumptions about the underlying data, such a general approach sometimes limits the type of problems topic models can solve. When we restrict our focus to natural language datasets, we can use insights from linguistics to create models that understand and discover richer language patterns.

In this thesis, we extend LDA in three different ways: adding knowledge of word meaning, modeling multiple languages, and incorporating local syntactic context. These extensions apply topic models to new problems, such as discovering the meaning of ambiguous words; extend topic models to new datasets, such as unaligned multilingual corpora; and combine topic models with other sources of information about documents' context.

In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution over meanings. Thus, posterior inference in this model discovers not only the topical domain of each token, as in LDA, but also the meaning associated with each token. We show that considering more topics improves performance on word sense disambiguation. LDAWN allows us to separate the representation of meaning from how that meaning is expressed as word forms.
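The generative story behind plain LDA, as summarized in the abstract, can be illustrated with a minimal sketch. This is not code from the thesis and it does not implement LDAWN; the toy vocabulary, the two hand-written topics, the `alpha` values, and the function names `sample_dirichlet` and `generate_document` are all assumptions made for illustration. Each document draws its own topic proportions, and each token first picks a topic and then a word from that topic's distribution.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document under the standard LDA generative story."""
    # Per-document topic proportions.
    theta = sample_dirichlet(alpha, rng)
    tokens = []
    for _ in range(length):
        # Choose a topic for this token, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(vocab, weights=topics[z])[0]
        tokens.append((z, w))
    return theta, tokens


# Hypothetical toy data: two topics over a six-word vocabulary.
vocab = ["market", "stock", "profit", "vote", "senate", "election"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "business"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "politics"-like topic
]

rng = random.Random(0)
theta, tokens = generate_document(topics, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
for z, w in tokens:
    print(z, w)
```

Posterior inference runs this story in reverse: given only the words, it estimates the hidden topic assignments and topic distributions. In LDAWN, as the abstract describes, the per-topic distribution over words is replaced by a distribution over meanings, so inference would also recover a sense for each token.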

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2010
Accession Number
ADA571374

Entities

People

  • Jordan Boyd-Graber

Organizations

  • Princeton University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Fish
  • Information Processing
  • Information Retrieval
  • Information Science
  • Language
  • Monte Carlo Method
  • Natural Language Processing
  • Network Science
  • Ontologies
  • Probabilistic Models
  • Probability Distributions

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation