Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014

Abstract

Identifying documents that contain timely and vital information for an entity of interest, a task known as vital filtering, has become increasingly important with the availability of large document collections. To efficiently filter such large text corpora in a streaming manner, we need to compactly represent previously observed entity contexts and quickly estimate whether a new document contains novel information. Existing approaches to modeling contexts, such as bag of words, latent semantic indexing, and topic models, are limited in several respects: they are unable to handle streaming data, do not model the underlying topic of each document, suffer from lexical sparsity, and/or do not accurately estimate temporal vitalness. In this paper, we introduce a word embedding-based non-parametric representation of entities that addresses the above limitations. The word embeddings provide accurate and compact summaries of observed entity contexts, which are further described by topic clusters that are estimated in a non-parametric manner. Additionally, we associate a staleness measure with each entity and topic cluster, dynamically estimating their temporal relevance. This combination of word embeddings, non-parametric clustering, and staleness provides an efficient yet appropriate representation of entity contexts for the streaming setting, enabling accurate vital filtering.
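The abstract gives no implementation details, so the following is only a minimal illustrative sketch of how the three ingredients it names (embedding-based context summaries, clusters whose number is not fixed in advance, and a per-cluster staleness score) might fit together in a streaming loop. Everything in it is an assumption made for illustration and not the report's method: the names mean_embedding, TopicCluster, and EntityProfile, the cosine-similarity threshold that decides when to open a new cluster, and the half-life form of the staleness score are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_embedding(tokens, embeddings):
    """Summarize a context as the average of its known word vectors.

    `embeddings` is a plain dict from word to vector (an assumed stand-in
    for a pretrained embedding table). Returns None if no token is known.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


class TopicCluster:
    """One topic cluster: a running-mean centroid plus a last-seen time."""

    def __init__(self, vec, timestamp):
        self.centroid = list(vec)
        self.count = 1
        self.last_seen = timestamp

    def update(self, vec, timestamp):
        # Incremental mean, so no past contexts need to be stored.
        self.count += 1
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, vec)]
        self.last_seen = timestamp

    def staleness(self, now, half_life):
        # Assumed form: 0 right after an update, approaching 1 when unseen.
        return 1.0 - 0.5 ** ((now - self.last_seen) / half_life)


class EntityProfile:
    """Streaming profile of one entity; clusters are created on demand."""

    def __init__(self, threshold=0.8, half_life=24.0):
        self.threshold = threshold  # assumed similarity cutoff for a new cluster
        self.half_life = half_life  # assumed staleness half-life (time units)
        self.clusters = []

    def observe(self, vec, timestamp):
        """Assign a context vector; return (cluster_index, is_new, staleness)."""
        best, best_sim = None, -1.0
        for i, cluster in enumerate(self.clusters):
            sim = cosine(vec, cluster.centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No existing cluster is similar enough: open a new one.
            self.clusters.append(TopicCluster(vec, timestamp))
            return len(self.clusters) - 1, True, 1.0
        cluster = self.clusters[best]
        stale = cluster.staleness(timestamp, self.half_life)
        cluster.update(vec, timestamp)
        return best, False, stale
```

In this sketch, a document that opens a new cluster, or that lands in a cluster with a high staleness score, would be a candidate for the "vital" label, while one that lands in a recently updated cluster would not. How such signals are actually combined into a vital-filtering decision is described in the report itself, not here.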

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2014
Accession Number
ADA618587

Entities

People

  • Carlos Guestrin
  • Ignacio Cano
  • Sameer Singh

Organizations

  • University of Washington

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Automata Theory
  • Clustering
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Science
  • Data Sets
  • Embedding
  • Filtration
  • Information Science
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Supervised Machine Learning
  • Vector Spaces
  • Visualizations

Fields of Study

  • Computer science

Readers

  • Information Retrieval
  • Neural Network Machine Learning
  • Speech Processing/Speech Recognition