Natural Language Information Retrieval: TREC-4 Report

Abstract

In this paper we report on the joint GE/NYU natural language information retrieval project as related to the 4th Text Retrieval Conference (TREC-4). The main thrust of this project is to use natural language processing techniques to enhance the effectiveness of full-text document retrieval. During the course of the four TREC conferences, we have built a prototype IR system designed around a statistical full-text indexing and search backbone provided by NIST's Prise engine. The original Prise has been modified to allow handling of multi-word phrases, differential term weighting schemes, automatic query expansion, index partitioning and rank merging, as well as dealing with complex documents. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process the user's natural language requests into effective search queries. The overall architecture of the system is essentially the same as in TREC-3, as our efforts this year were directed at optimizing the performance of all components. A notable exception is the new massive query expansion module used in routing experiments, which replaces the prototype extension used in the TREC-3 system. On the other hand, it has to be noted that the character and the level of difficulty of TREC queries have changed quite significantly since last year's evaluation. The new TREC-4 ad-hoc queries are far shorter and less focused, and they have the flavor of information requests ("What is the prognosis of ...") rather than the search directives typical of earlier TRECs ("The relevant document will contain ..."). This makes building good search queries a more sensitive task than before. We thus decided to introduce only a minimum number of changes to our indexing and search processes,

Document Details

Document Type: Technical Report
Publication Date: Nov 01, 1995
Accession Number: ADA470538

Entities

People

  • Jose P. Carballo
  • Tomek Strzalkowski

Tags

Communities of Interest

  • Biomedical
  • Space

DTIC Thesaurus Topics

  • Agent Orange
  • Automatic
  • Classification
  • Computational Linguistics
  • Databases
  • Information Processing
  • Information Retrieval
  • Information Science
  • Language
  • Law
  • Linguistics
  • Natural Language Processing
  • Natural Languages
  • Precision
  • Statistics
  • Test And Evaluation
  • Universities

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval