Evaluating Stream Filtering for Entity Profile Updates for TREC 2013 (KBA Track Overview)

Abstract

The Knowledge Base Acceleration (KBA) track in TREC 2013 expanded the entity-centric filtering evaluation from TREC KBA 2012. This track evaluates systems that filter a time-ordered corpus for documents and slot fills that would change an entity profile in a predefined list of entities. We doubled the size of the KBA streamcorpus to twelve thousand contiguous hours and a billion documents from blogs, news, and Web content. We quadrupled the number of entities as query topics from structured knowledge bases (KB), such as Wikipedia and Twitter. We also added a second task component: identifying entity slot values that change over the course of the stream. This Streaming Slot Filling (SSF) subtask focuses on natural language understanding and is a step toward decomposing the profile update process undertaken by humans maintaining a knowledge base. A successful KBA system must do more than resolve the meaning of entity mentions by linking documents to the KB: it must also distinguish vitally relevant documents and new slot fills that would change a target entity's profile. This combines thinking from natural language processing (NLP) and information retrieval (IR). Filtering tracks in TREC have typically used queries based on topics described by a set of keyword queries or short descriptions, and annotators have generated relevance judgments based on their personal interpretation of the topic. For TREC 2013, we selected a set of filter topics based on Wikipedia and Twitter entities: 98 people, 19 organizations, and 24 facilities. Assessors judged ~50k documents, which included all documents that mention a name from a handcrafted list of surface form names of the 141 target entities. Judgments for documents from before February 2012 were provided to TREC teams as training data, and the remaining 12 months of data were used to measure the F_1 accuracy and scaled utility of these systems. We present peak macro-averaged F_1 scores for all run sub
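
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of one common way to compute a macro-averaged F_1 across target entities: score each entity separately, then take the unweighted mean. The abstract does not spell out the track's exact averaging procedure, so this should be read as an illustration and not as the official KBA scorer. The function names (`f1`, `macro_f1`) and the input layout (a mapping from entity to a set of document IDs) are assumptions made for this example.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(judged: dict, returned: dict) -> float:
    """Unweighted mean of per-entity F_1 scores.

    judged:   hypothetical mapping entity -> set of doc IDs judged relevant
    returned: hypothetical mapping entity -> set of doc IDs a system emitted
    """
    scores = []
    for entity, relevant in judged.items():
        retrieved = returned.get(entity, set())
        tp = len(relevant & retrieved)   # relevant and returned
        fp = len(retrieved - relevant)   # returned but not relevant
        fn = len(relevant - retrieved)   # relevant but missed
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented purely for illustration.
judged = {"entity_a": {1, 2}, "entity_b": {3}}
returned = {"entity_a": {1}, "entity_b": {3, 4}}
score = macro_f1(judged, returned)
```

Because every entity contributes equally to the mean, an entity with few judged documents influences the score as much as one with many. A "peak" score would be the best such value over a system's confidence cutoffs, which this sketch does not model.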

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2013
Accession Number
ADA600032

Entities

People

  • Ce Zhang
  • Christopher Ré
  • Daniel A. Roberts
  • Ellen Voorhees
  • Ian Soboroff
  • John R. Frank
  • Max Kleiman-Weiner
  • Nilesh Tripuraneni
  • Steven J. Bauer

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Computer Languages
  • Data Sets
  • Filters
  • Filtration
  • Information Retrieval
  • Judgment
  • Language
  • Natural Language Processing
  • Natural Languages
  • Precision
  • Social Media
  • Standards
  • Test And Evaluation
  • Thinking
  • Training

Readers

  • Computational Linguistics
  • Information Retrieval
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval