Natural Language Information Retrieval: TREC-3 Report

Abstract

In this paper we report on recent developments in NYU's natural language information retrieval system, especially as related to the 3rd Text Retrieval Conference (TREC-3). The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone consisting of an indexer module, which builds inverted index files from pre-processed documents, and a retrieval engine, which searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. For the present TREC-3 effort, a total of 3.3 GBytes of text articles (Tipster disks 1 through 3) has been processed, including material from the Wall Street Journal, the Associated Press newswire, the Federal Register, Ziff Communications' Computer Library, Department of Energy abstracts, U.S. Patents and the San Jose Mercury News, totaling more than 500 million words of English. Since the TREC-2 conference, many components of the system have been redesigned so that it scales to ever-increasing amounts of data. In particular, a randomized index-splitting mechanism has been installed, which allows the system to create a number of smaller indexes that can be searched independently and efficiently.
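
The abstract does not say how the randomized index-splitting mechanism is implemented. As a rough illustration of the general idea only, the sketch below assigns each document at random to one of several small inverted indexes, searches each index independently, and merges the results. The function names, the whitespace tokenizer, and the matched-term-count ranking are all assumptions made for this example and are not taken from the report.

```python
import random
from collections import defaultdict


def build_split_indexes(docs, num_splits, seed=0):
    """Assign each document at random to one of num_splits inverted indexes.

    Hypothetical helper: docs maps a document id to its text.
    """
    rng = random.Random(seed)
    indexes = [defaultdict(set) for _ in range(num_splits)]
    for doc_id, text in docs.items():
        # Each document lands in exactly one sub-index.
        index = indexes[rng.randrange(num_splits)]
        for term in text.lower().split():
            index[term].add(doc_id)
    return indexes


def search(indexes, query):
    """Search every sub-index independently, then merge the results.

    Documents are ranked by how many query terms they contain
    (a stand-in for a real term-weighting scheme).
    """
    terms = query.lower().split()
    scores = defaultdict(int)
    for index in indexes:
        for term in terms:
            for doc_id in index.get(term, ()):
                scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "natural language retrieval",
    "d2": "language processing",
    "d3": "index splitting",
}
indexes = build_split_indexes(docs, num_splits=2)
results = search(indexes, "natural language")
```

Because every document lives in exactly one sub-index, the per-index searches do not overlap and could be run separately before merging; the merged ranking here does not depend on which sub-index a document was assigned to.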

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 1994
Accession Number
ADA470537

Entities

People

  • Jose P. Carballo
  • Mihnea Marinescu
  • Tomek Strzalkowski

Organizations

  • New York University

Tags

Communities of Interest

  • Biomedical
  • Human Systems
  • Space

DTIC Thesaurus Topics

  • Abstracts
  • Agent Orange
  • Classification
  • Databases
  • Frequency
  • Hot Spots
  • Information Retrieval
  • Information Science
  • Language
  • Law
  • Linguistics
  • Materials
  • Natural Language Processing
  • Natural Languages
  • New York
  • Statistics
  • Training

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Library and Information Science
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval