Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014
Abstract
Identifying documents that contain timely and vital information for an entity of interest, a task known as vital filtering, has become increasingly important with the availability of large document collections. To efficiently filter such large text corpora in a streaming manner, we need to compactly represent previously observed entity contexts and quickly estimate whether a new document contains novel information. Existing approaches to modeling contexts, such as bag of words, latent semantic indexing, and topic models, are limited in several respects: they are unable to handle streaming data, do not model the underlying topic of each document, suffer from lexical sparsity, and/or do not accurately estimate temporal vitalness. In this paper, we introduce a word embedding-based non-parametric representation of entities that addresses the above limitations. The word embeddings provide accurate and compact summaries of observed entity contexts, which are further described by topic clusters estimated in a non-parametric manner. Additionally, we associate a staleness measure with each entity and topic cluster, dynamically estimating their temporal relevance. This combination of word embeddings, non-parametric clustering, and staleness provides an efficient yet appropriate representation of entity contexts for the streaming setting, enabling accurate vital filtering.
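The three ingredients named in the abstract (embedding-based context summaries, topic clusters whose number is not fixed in advance, and a per-cluster staleness measure) can be illustrated with a minimal sketch. The abstract does not specify the embedding model, the clustering algorithm, or how staleness is defined, so everything below is an assumption made for illustration: documents are embedded as the average of toy word vectors, clustering uses a simple cosine-threshold rule that opens a new cluster when nothing is close enough, and staleness is an exponential half-life decay. The names `StreamingTopicClusters`, `embed`, `new_cluster_threshold`, and `half_life` are hypothetical and do not come from the report.

```python
import math


class StreamingTopicClusters:
    """Illustrative streaming clusters over embedding vectors (not the report's method)."""

    def __init__(self, new_cluster_threshold=0.8, half_life=24.0):
        self.threshold = new_cluster_threshold  # hypothetical cosine cutoff
        self.half_life = half_life              # hypothetical, same unit as timestamps
        self.centroids = []   # running mean vector per cluster
        self.counts = []      # number of documents absorbed per cluster
        self.last_seen = []   # timestamp of the latest update per cluster

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def staleness(self, k, now):
        """Assumed measure: 0 right after an update, approaching 1 as the cluster goes quiet."""
        age = max(0.0, now - self.last_seen[k])
        return 1.0 - 0.5 ** (age / self.half_life)

    def observe(self, vec, now):
        """Assign vec to the closest cluster, or open a new one if none is similar enough.

        Returns (cluster_index, is_new, staleness_before_update); staleness is
        None when a new cluster is opened.
        """
        best, best_sim = None, -1.0
        for k, centroid in enumerate(self.centroids):
            sim = self._cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is None or best_sim < self.threshold:
            # The number of clusters grows with the data instead of being fixed.
            self.centroids.append(list(vec))
            self.counts.append(1)
            self.last_seen.append(now)
            return len(self.centroids) - 1, True, None
        stale = self.staleness(best, now)
        n = self.counts[best] + 1
        # Incremental mean update keeps the summary compact (one vector per cluster).
        self.centroids[best] = [c + (x - c) / n for c, x in zip(self.centroids[best], vec)]
        self.counts[best] = n
        self.last_seen[best] = now
        return best, False, stale


def embed(tokens, vectors):
    """Average the vectors of known tokens; returns None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]
```

In such a sketch, a document that opens a new cluster, or that lands in a cluster with high staleness, would be a natural candidate for a vital label; how the report actually combines these signals is not stated in the abstract.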
Document Details
- Document Type: Technical Report
- Publication Date: Nov 01, 2014
- Accession Number: ADA618587
Entities
People
- Carlos Guestrin
- Ignacio Cano
- Sameer Singh
Organizations
- University of Washington