Improving English and Chinese Ad-hoc Retrieval: Tipster Text Phase 3

Abstract

We investigated both English and Chinese ad-hoc information retrieval (IR). One of our objectives was to study the use of term-level, phrase-level, and topical-concept-level evidence, either individually or in combination, to improve retrieval accuracy. For short queries, we studied five term-level techniques that together improve on standard ad-hoc 2-stage retrieval by some 20% to 40% in the TREC5 & 6 experiments. For long queries, we studied linguistic phrases as evidence to re-rank the output of term-level retrieval. This brings small improvements in both the TREC5 & 6 experiments, but needs further confirmation. We also investigated clustering of the output documents from term-level retrieval. Our aim is to separate relevant and irrelevant documents into different clusters, and to re-rank the output list by groups based on query and cluster-profile matching. This investigation is still ongoing. For Chinese IR, many results were confirmed or discovered. For example, accurate word segmentation is not as important as first thought, but short-word segmentation is preferable to long-word (phrase) segmentation. A simple bigram representation can give very good retrieval. A stopword list is not necessary, and the presence of non-content terms does not hurt evaluation results much; one only needs to screen out statistical stopwords of high frequency. Character indexing by itself is not competitive, but is useful for augmenting short words or bigrams. The best results were obtained by combining retrievals of bigram and short-word with character representation. Chinese IR returns better precision than English, and it is not clear whether this is a language-related or a collection-related phenomenon.
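
The bigram representation and high-frequency screening mentioned in the abstract can be illustrated with a minimal sketch. This is not the report's retrieval system: the function names, the `max_df_ratio` threshold, and the raw term-frequency scoring are all assumptions chosen for illustration, and punctuation handling is omitted.

```python
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs, max_df_ratio=0.8):
    """Build an inverted index of bigrams without a hand-made stopword list.

    max_df_ratio is a hypothetical threshold: a bigram that occurs in more
    than this fraction of the documents is treated as a statistical
    stopword and dropped from the index.
    """
    postings = defaultdict(Counter)
    for doc_id, text in enumerate(docs):
        for bigram in char_bigrams(text):
            postings[bigram][doc_id] += 1
    limit = max_df_ratio * len(docs)
    return {bg: tf for bg, tf in postings.items() if len(tf) <= limit}


def search(index, query):
    """Rank documents by summed bigram term frequency (a toy scoring rule)."""
    scores = Counter()
    for bigram in char_bigrams(query):
        for doc_id, tf in index.get(bigram, {}).items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common()]


docs = ["信息检索系统", "中文信息处理", "信息与天气"]
index = build_index(docs)
ranking = search(index, "信息检索")
```

In this toy collection the bigram "信息" appears in every document, so it is screened out by frequency alone, and the query is matched only on its remaining bigrams.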

Document Details

Document Type
Technical Report
Publication Date
Oct 01, 1998
Accession Number
ADA631830

Entities

People

  • Kui-lam Kwok

Organizations

  • Queens College

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Applied Computer Science
  • Chinese Language
  • Clustering
  • Computer Languages
  • Computer Science
  • Computer Vision
  • Frequency
  • Information Retrieval
  • Language
  • Linguistics
  • Natural Languages
  • Personality
  • Precision
  • Standards
  • Test And Evaluation

Readers

  • Brain and Cognitive Science; Experimental Psychology; Cognitive Neuroscience
  • Information Retrieval

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation