Collection Selection and Results Merging With Topically Organized U.S. Patents and TREC Data

Abstract

We investigate three issues in distributed information retrieval, considering both TREC data and U.S. Patents: (1) topical organization of large text collections, (2) collection ranking and selection with topically organized collections, and (3) results merging, particularly document score normalization, with topically organized collections. We find that it is better to organize collections topically, and that topical collections can be ranked well using either INQUERY's CORI algorithm or the Kullback-Leibler divergence (KL), but KL is far worse than CORI for collections that are not topically organized. For results merging, collections organized by topic require global idfs for the best performance. Contrary to results found elsewhere, normalized scores are not as good as global idfs for merging when the collections are topically organized.
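
To make the idea of KL-based collection ranking concrete, the following is a minimal sketch, not the report's implementation. It assumes each collection is summarized by raw term counts, scores a collection by the negative KL divergence between a maximum-likelihood query model and a collection model, and smooths the collection model with a pooled background model using a hypothetical mixing weight `lam`. The function names, the smoothing scheme, and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def kl_score(query_terms, collection_counts, background_counts, lam=0.5):
    """Return the negative KL divergence of the query model from a smoothed
    collection model. Higher scores mean a closer match.

    Assumption: linear interpolation with a pooled background model and
    weight `lam`; the report may use a different smoothing scheme.
    """
    q = Counter(query_terms)
    q_len = sum(q.values())
    c_len = sum(collection_counts.values())
    b_len = sum(background_counts.values())
    score = 0.0
    for term, qf in q.items():
        p_q = qf / q_len
        p_c = collection_counts.get(term, 0) / c_len if c_len else 0.0
        p_b = background_counts.get(term, 0) / b_len if b_len else 0.0
        p = lam * p_c + (1 - lam) * p_b
        if p == 0.0:
            # Term occurs in no collection at all; it cannot discriminate.
            continue
        score -= p_q * math.log(p_q / p)
    return score


def rank_collections(query_terms, collections):
    """Rank collections (name -> Counter of term counts) for a query,
    best match first."""
    background = Counter()
    for counts in collections.values():
        background.update(counts)
    scored = [
        (name, kl_score(query_terms, counts, background))
        for name, counts in collections.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up term counts for two hypothetical topical collections.
collections = {
    "optics": Counter({"lens": 50, "laser": 30, "claim": 20}),
    "chemistry": Counter({"polymer": 60, "acid": 30, "claim": 10}),
}
ranking = rank_collections(["laser", "lens"], collections)
```

In this toy setup the query terms are concentrated in one collection, which is the situation topical organization is meant to produce; when terms are spread evenly across collections, the per-collection models look alike and the scores separate far less.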

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2000
Accession Number
ADA439393

Entities

People

  • Leah S. Larkey
  • Margaret E. Connell

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Clustering
  • Computer Science
  • Contrast
  • Data Sets
  • Databases
  • Education
  • Frequency
  • Hard Copy
  • Information Retrieval
  • Language
  • Natural Languages
  • Precision
  • Signal Processing
  • Statistics
  • United States

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Information Retrieval
  • Statistical Inference

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms