Robust Text Processing in Automated Information Retrieval

Abstract

This paper outlines a prototype text retrieval system which uses relatively advanced natural language processing techniques in order to enhance the effectiveness of statistical document retrieval. The backbone of our system is a traditional retrieval engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract contents-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process user's natural language requests into effective search queries. The basic assumption of this design is that term-based representation of contents is in principle sufficient to build an effective if not optimal search query out of any user's request. This has been confirmed by an experiment that compared effectiveness of expert-user prepared queries with those derived automatically from an initial narrative information request. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for a better retrieval, but it is also doable, given appropriate resources. We report on selected preliminary results of experiments with 500 MByte database of Wall Street Journal articles, as well as some earlier results with a smaller document collection.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 1993
Accession Number: ADA460240

Entities

People

Tomek Strzalkowski

Organizations

New York University

Robust Text Processing in Automated Information Retrieval

Abstract

Document Details

Entities

People

Organizations

Tags

Communities of Interest

DTIC Thesaurus Topics

Fields of Study

Readers

Technology Areas