Building an Entity-Centric Stream Filtering Test Collection for TREC 2012

Abstract

The Knowledge Base Acceleration track in TREC 2012 focused on a single task: filter a time-ordered corpus for documents that are highly relevant to a predefined list of entities. KBA differs from previous filtering evaluations in two primary ways: the stream corpus is >100x larger than previous filtering collections, and the use of entities as topics enables systems to incorporate structured knowledge bases (KB), such as Wikipedia, as external data sources. A successful KBA system must do more than resolve the meaning of entity mentions by linking documents to the KB: it must also distinguish centrally relevant documents that are worth citing in the entity's Wikipedia article. This combines thinking from natural language processing (NLP) and information retrieval (IR). Filtering tracks in TREC have typically used queries based on topics described by a set of keyword queries or short descriptions, and annotators have generated relevance judgments based on their personal interpretation of the topic. For TREC 2012, we selected a set of filter topics based on Wikipedia entities: 27 people and 2 organizations. Such named entities are more familiar in NLP than IR. We also constructed an entirely new stream corpus spanning 4,973 consecutive hours from October 2011 through April 2012. It contains over 400M documents, which we augmented with named entity classification tagging for the ~40% of the documents identified as English. Each document has a timestamp that places it in the stream. The 29 target entities were mentioned infrequently enough in the corpus that NIST assessors could judge the relevance of most of the mentioning documents (~91%). Judgments for documents from before January 2012 were provided to TREC teams as training data for filtering documents from the remaining hours. Run submissions were evaluated against the assessor-generated list of citation-worthy documents. We present peak F_1 scores averaged across the entities for all run submissions. High scoring system

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2012
Accession Number
ADA581248

Entities

People

  • Ce Zhang
  • Christopher Ré
  • Daniel A. Roberts
  • Feng Niu
  • Ian Soboroff
  • John R. Frank
  • Max Kleiman-Weiner

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Classification
  • Computer Languages
  • Data Sets
  • Filters
  • Information Retrieval
  • Information Science
  • Judgment
  • Language
  • Lessons Learned
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Ratings
  • Supervised Machine Learning
  • Test And Evaluation

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Information Retrieval

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval