Linguistic Extensions of Topic Models

Abstract

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used to model datasets where documents are treated as exchangeable groups of words. In this context, topic models discover topics: distributions over words that express a coherent theme like "business" or "politics." While one of the strengths of topic models is that they make few assumptions about the underlying data, such a general approach sometimes limits the types of problems topic models can solve. When we restrict our focus to natural language datasets, we can use insights from linguistics to create models that understand and discover richer language patterns. In this thesis, we extend LDA in three different ways: adding knowledge of word meaning, modeling multiple languages, and incorporating local syntactic context. These extensions apply topic models to new problems, such as discovering the meaning of ambiguous words; extend topic models to new datasets, such as unaligned multilingual corpora; and combine topic models with other sources of information about documents' context. In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution over meanings. Thus, posterior inference in this model discovers not only the topical domain of each token, as in LDA, but also the meaning associated with each token. We show that considering more topics improves performance on word sense disambiguation. LDAWN allows us to separate the representation of meaning from how that meaning is expressed as word forms.
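
As an illustration of the phrase "documents as exchangeable groups of words," the following is a minimal sketch of the standard LDA generative process, not code from the thesis. The function names (`sample_dirichlet`, `generate_corpus`), the toy vocabulary, and the hyperparameter values `alpha` and `beta` are assumptions chosen for this example only.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate toy documents from the LDA generative story.

    alpha and beta are illustrative symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Each topic is a distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Each document has its own distribution over topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Pick a topic for this token, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=topics[z])[0]
            words.append(w)
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    vocab = ["market", "stock", "trade", "vote", "senate", "law"]
    topics, docs = generate_corpus(num_docs=3, doc_len=8,
                                   vocab=vocab, num_topics=2)
    for doc in docs:
        print(" ".join(doc))
```

Because each token is drawn independently given the document's topic proportions, reordering the words in a document does not change its probability, which is what exchangeability means here. Posterior inference runs this story in reverse, recovering topics from observed words; the sketch covers only the forward direction.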

Document Details

Document Type: Technical Report
Publication Date: Sep 01, 2010
Accession Number: ADA571374

Entities

People

Jordan Boyd-Graber

Organizations

Princeton University
