Scalable Distributed Architectures for Information Retrieval

Abstract

As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. Providing scalable performance for rapidly increasing data and workloads is critical in the design of next-generation information retrieval systems. This dissertation studies scalable distributed IR architectures that provide quick response while maintaining acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers built from symmetric multiprocessors, and use partial collection replication with replica selection, as well as collection selection, to restrict the search to a small percentage of the data while maintaining retrieval accuracy.

We first investigate partial collection replication for IR systems. We examine query locality in real systems, how to select a partial replica based on relevance, how to balance load between the replicas and the original collection, and the overheads of and strategies for updating replicas. Our results show that there is sufficient query locality to justify partial replication for information retrieval. Our proposed replica selection algorithm effectively selects relevant partial replicas and is inexpensive to implement. Our evidence also indicates that partial replication achieves better performance than caching queries, because the replica selection algorithm finds similarity between non-identical queries and thus increases the observed locality.

We use a validated simulator to perform a detailed performance evaluation of distributed IR architectures. We explore how best to build parallel IR servers using symmetric multiprocessors, evaluate the performance of partial collection replication and collection selection, and compare the performance of partial collection replication with that of collection partitioning and collection selection.
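The contrast the abstract draws between caching queries and selecting a partial replica can be illustrated with a small sketch. This is not the dissertation's algorithm, which the abstract does not specify; it is a toy under stated assumptions: the names `ExactQueryCache`, `PartialReplica`, and `select_target`, the cosine term-overlap score, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExactQueryCache:
    """Query cache: hits only when the query string is identical."""

    def __init__(self):
        self._results = {}

    def put(self, query, results):
        self._results[query] = results

    def lookup(self, query):
        return self._results.get(query)


class PartialReplica:
    """Hypothetical partial replica summarized by its term counts."""

    def __init__(self, name, documents):
        self.name = name
        self.term_counts = Counter(
            term for doc in documents for term in tokenize(doc)
        )


def select_target(query, replicas, threshold=0.3):
    """Route to the best-matching replica, else to the original collection.

    The similarity measure and threshold are assumptions of this sketch.
    """
    query_terms = Counter(tokenize(query))
    best_name, best_score = "original", 0.0
    for replica in replicas:
        score = cosine(query_terms, replica.term_counts)
        if score > best_score:
            best_name, best_score = replica.name, score
    return best_name if best_score >= threshold else "original"
```

In this sketch, a cache that stored results for "information retrieval" misses on the reordered query "retrieval information", while `select_target` can still route that query to a replica whose documents share its terms. That is the sense in which selection by similarity can expose locality between non-identical queries.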

Document Details

Document Type
Technical Report
Publication Date
May 01, 1999
Accession Number
ADA365725

Entities

People

  • Zhihong Lu

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Commerce
  • Computer Science
  • Computing Devices
  • Information Retrieval
  • Internet
  • Intranet
  • Massachusetts
  • Multiprocessors
  • Replicas
  • Schools
  • Simulations
  • Simulators
  • Theses
  • Universities

Fields of Study

  • Computer science

Readers

  • Geospatial Intelligence and Artificial Intelligence Analytics
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval