Robust Text Processing in Automated Information Retrieval

Abstract

This paper outlines a prototype text retrieval system which uses relatively advanced natural language processing techniques to enhance the effectiveness of statistical document retrieval. The backbone of our system is a traditional retrieval engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. The basic assumption of this design is that a term-based representation of content is in principle sufficient to build an effective, if not optimal, search query out of any user's request. This has been confirmed by an experiment that compared the effectiveness of queries prepared by expert users with that of queries derived automatically from an initial narrative information request. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for better retrieval but is also feasible, given appropriate resources. We report on selected preliminary results of experiments with a 500 MByte database of Wall Street Journal articles, as well as some earlier results with a smaller document collection.
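
The retrieval backbone described in the abstract (an inverted index built from pre-processed documents, then searched and ranked against a query) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `tokenize`, `build_index`, and `search` are hypothetical, the stopword list is a toy one, and the tf-idf weighting is a common textbook scheme, not necessarily the one used in the report. The natural language processing stages (term extraction, inter-term dependencies, the conceptual hierarchy) are not modeled.

```python
import math
from collections import Counter, defaultdict

# Toy stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Terms that occur in every document get idf = 0 and add nothing.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    docs = {
        1: "Stocks fell on Wall Street",
        2: "The bank raised interest rates",
        3: "Interest in stocks and bonds rose",
    }
    index = build_index(docs)
    for doc_id, score in search(index, len(docs), "interest rates"):
        print(doc_id, round(score, 3))
```

In this sketch the query is reduced to the same term representation as the documents, which mirrors the design assumption stated in the abstract that a term-based representation is enough to build an effective query.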

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1993
Accession Number
ADA460240

Entities

People

  • Tomek Strzalkowski

Organizations

  • New York University

Tags

Communities of Interest

  • Air Platforms
  • Human Systems

DTIC Thesaurus Topics

  • Abstracts
  • Classification
  • Computational Linguistics
  • Computer Science
  • Databases
  • Frequency
  • Grammars
  • Information Processing
  • Information Retrieval
  • Language
  • Linguistics
  • Natural Language Processing
  • Natural Languages
  • Notation
  • Standards
  • Statistics
  • Text Processing

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Database Systems and Applications
  • Library and Information Science

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks