Scalable Distributed Architectures for Information Retrieval
Abstract
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. Providing scalable performance for rapidly increasing data and workloads is critical in the design of next-generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers built from symmetric multiprocessors, and use partial collection replication with replica selection, as well as collection selection, to restrict the search to a small percentage of the data while maintaining retrieval accuracy. We first investigate using partial collection replication for IR systems. We examine query locality in real systems, how to select a partial replica based on relevance, how to balance load between replicas and the original collection, and the overheads of and strategies for updating replicas. Our results show that there exists sufficient query locality to justify partial replication for information retrieval. Our proposed replica selection algorithm effectively selects relevant partial replicas and is inexpensive to implement. Our evidence also indicates that partial replication achieves better performance than caching queries, because the replica selection algorithm finds similarity between non-identical queries and thus increases observed locality. We use a validated simulator to perform a detailed performance evaluation of distributed IR architectures. We explore how best to build parallel IR servers using symmetric multiprocessors, evaluate the performance of partial collection replication and collection selection, and compare the performance of partial collection replication with that of collection partitioning and collection selection.
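The claim that replica selection observes more locality than a query cache can be illustrated with a small sketch. This is not the dissertation's algorithm: the term-overlap rule, the 0.6 threshold, the class names, and the sample queries are all assumptions chosen for illustration. An exact-match cache only hits on repeated queries, while a similarity-based selector can also route non-identical queries that share most of their terms with the replica's content.

```python
def terms(query):
    """Lowercased bag of whitespace-separated query terms, as a set."""
    return frozenset(query.lower().split())


class ExactQueryCache:
    """Hits only when the same query string was seen before."""

    def __init__(self, past_queries):
        self._seen = {q.strip().lower() for q in past_queries}

    def hit(self, query):
        return query.strip().lower() in self._seen


class TermOverlapReplicaSelector:
    """Hypothetical selector: send a query to the partial replica when
    enough of its terms appear in the replica's vocabulary."""

    def __init__(self, replica_vocabulary, threshold=0.6):
        self._vocab = set(replica_vocabulary)
        self._threshold = threshold  # assumed cutoff, not from the source

    def hit(self, query):
        q = terms(query)
        if not q:
            return False
        overlap = len(q & self._vocab) / len(q)
        return overlap >= self._threshold


def observed_locality(router, queries):
    """Fraction of queries the router can serve without the full collection."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(router.hit(q) for q in queries) / len(queries)


# Illustrative data only: past queries stand in for the replica's content.
past = ["distributed information retrieval", "query locality"]
vocabulary = set().union(*(terms(q) for q in past))

incoming = [
    "distributed information retrieval",  # exact repeat
    "information retrieval systems",      # non-identical, similar
    "query locality analysis",            # non-identical, similar
    "weather forecast",                   # unrelated
]

cache_rate = observed_locality(ExactQueryCache(past), incoming)
replica_rate = observed_locality(TermOverlapReplicaSelector(vocabulary), incoming)
print(cache_rate, replica_rate)
```

On this toy workload the cache serves only the exact repeat, while the selector also serves the two similar queries. A real replica would be selected by relevance against its indexed documents rather than by a vocabulary built from past queries.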
Document Details
- Document Type: Technical Report
- Publication Date: May 01, 1999
- Accession Number: ADA365725
Entities
People
- Zhihong Lu
Organizations
- University of Massachusetts Amherst