Searching a Terabyte of Text Using Partial Replication

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we investigate using partial replication to search a terabyte of text in our distributed IR system. We use a replica selection database to direct queries to relevant replicas that maintain query effectiveness while restricting some searches to a small percentage of the data, which improves performance and scalability and reduces network latency. Using a validated simulator, we compare database partitioning to partial replication with load balancing, and find that partial replication is much more effective at decreasing query response time, even with fewer resources, and that it requires only modest query locality. We also demonstrate average query response times under 10 seconds for a variety of workloads with partial replication on a terabyte text database. We further investigate query locality with respect to time, replica size, and replica updating costs using real logs from THOMAS and Excite, and discuss the sensitivity of our results to these sample points.
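
The routing idea in the abstract, sending a query to a small replica when that replica can answer it and to the full collection otherwise, can be illustrated with a minimal sketch. This is not the report's replica selection database: the `Replica` class, the term-coverage score, and the `threshold` parameter are all hypothetical stand-ins chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A partial replica: a named subset of the collection (hypothetical structure)."""
    name: str
    size_gb: float
    terms: set = field(default_factory=set)  # assumed vocabulary summary of the replica


def coverage(query_terms, replica):
    """Fraction of the query's terms that appear in the replica's summary."""
    if not query_terms:
        return 0.0
    hits = sum(1 for t in query_terms if t in replica.terms)
    return hits / len(query_terms)


def select_target(query, replicas, full_collection, threshold=1.0):
    """Pick the smallest replica whose coverage meets the threshold.

    Falls back to the full collection when no replica qualifies, so that
    effectiveness is not traded away for speed. The threshold is an assumed knob.
    """
    query_terms = query.lower().split()
    candidates = [r for r in replicas if coverage(query_terms, r) >= threshold]
    if not candidates:
        return full_collection
    return min(candidates, key=lambda r: r.size_gb)


# Illustrative data only; sizes and vocabularies are invented for the example.
full = Replica("full", 1000.0)
hot = Replica("hot", 40.0, {"budget", "tax", "senate"})
warm = Replica("warm", 120.0, {"budget", "tax", "senate", "health", "reform"})

print(select_target("senate budget", [hot, warm], full).name)
print(select_target("health reform", [hot, warm], full).name)
print(select_target("farm subsidy", [hot, warm], full).name)
```

In this sketch a query whose terms are all covered by a small replica searches only that replica, which is the sense in which some searches touch a small percentage of the data. How well such a scheme works depends on query locality, which the report measures on real logs.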

Document Details

Document Type
Technical Report
Publication Date
Feb 01, 1999
Accession Number
ADA365715

Entities

People

  • Kathryn S. McKinley
  • Zhihong Lu

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Computer Science
  • Congress
  • Databases
  • Frequency
  • Hierarchies
  • Information Retrieval
  • Information Systems
  • Measurement
  • Networks
  • Replicas
  • Simulations
  • Simulators
  • Statistics
  • Terabytes
  • United States
  • User Interface
  • Websites

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval