Sentence Level Information Patterns for Novelty Detection

Abstract

The detection of new information in a document stream is an important component of many potential applications. In this thesis, a new novelty detection approach based on the identification of sentence level information patterns is proposed. Given a user's information need, some information patterns in sentences, such as combinations of query words, sentence lengths, named entities and phrases, and other sentence patterns, may contain more important and relevant information than single words. The work of the thesis includes three parts. First, we redefine "what is novelty detection" in the light of the proposed information patterns. Examples of several different types of information patterns are given, corresponding to different types of user information needs. Second, we analyze why the proposed information pattern concept has a significant impact on novelty detection. A thorough analysis of sentence level information patterns is carried out on data from the TREC novelty tracks, covering sentence lengths, named entities (NEs), and sentence level opinion patterns. Finally, we present how we perform novelty detection based on information patterns, focusing on the identification of previously unseen query-related patterns in sentences. A unified pattern-based approach to novelty detection is presented for both specific NE topics and more general topics. Experiments on novelty detection were carried out on data from the TREC 2002, 2003 and 2004 novelty tracks. Experimental results show that the proposed approach significantly improves the performance of novelty detection for both specific and general topics, and therefore the overall performance across all topics, in terms of precision at top ranks. Future research directions are suggested.
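
The core idea in the abstract, flagging a sentence as novel when it contains a query-related pattern not seen earlier in the stream, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names (extract_patterns, detect_novel), the definition of a pattern as a (query word, entity-like token) pair, and the use of capitalized non-initial tokens as a crude stand-in for named entity recognition are all assumptions made only for illustration.

```python
import re


def extract_patterns(sentence, query_words):
    """Return the set of (query word, entity-like token) pairs in one sentence.

    Assumption: a capitalized token that is not sentence-initial stands in
    for a named entity; a real system would use a proper NE tagger.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Query words that occur in this sentence (case-insensitive match).
    query_hits = {t.lower() for t in tokens if t.lower() in query_words}
    # Skip the first token, which is capitalized in ordinary prose anyway.
    entities = {t for t in tokens[1:] if t[0].isupper()}
    return {(q, e) for q in query_hits for e in entities}


def detect_novel(sentences, query_words):
    """Return the sentences that contribute at least one unseen pattern."""
    query_words = {w.lower() for w in query_words}
    seen = set()
    novel = []
    for sentence in sentences:
        patterns = extract_patterns(sentence, query_words)
        if patterns - seen:  # at least one pattern not seen before
            novel.append(sentence)
        seen |= patterns
    return novel
```

Under these assumptions, a sentence that repeats an already seen pairing of a query word with an entity is treated as redundant, and a sentence with no query word at all yields no patterns and is never flagged as novel.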

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2006
Accession Number: ADA454817

Entities

People

  • Xiaoyan Li

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Biomedical
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Automated Text Summarization
  • Carbon Monoxide
  • Computational Linguistics
  • Computational Science
  • Computer Science
  • Data Analysis
  • Data Sets
  • Dielectric Gases
  • Information Retrieval
  • Information Science
  • Knowledge Management
  • Language
  • Law
  • Named Entity Recognition
  • Poisoning
  • Statistical Analysis

Fields of Study

  • Computer science

Readers

  • Brain and Cognitive Science; Experimental Psychology; Cognitive Neuroscience
  • Distributed Systems and Data Platform Development
  • Geospatial Intelligence and Artificial Intelligence Analytics