Improving Efficiency of Indexing by Using a Hierarchical Merge Approach

Abstract

As electronic collections become larger and more numerous, systems capable of working with very large collections will be more and more in demand. On the other hand, computer main memory is relatively limited and constrained compared to the amount of data in a large collection. The requirement for larger memory while building big databases can sometimes make this resource a bottleneck for an information indexing system. INQUERY is a state-of-the-art, widely used, full-text information retrieval system. The INQUERY document indexing system consists of two main operations: parsing and merging. The subsystem responsible for parsing is called the Parser. It creates partial inverted lists by scanning, lexically analyzing, and inverting documents. A partial inverted list contains document entries for a subset of the documents in the collection. It must be combined with other partial inverted lists for the same term to create a final inverted list for the document collection. The Parser buffers partial inverted lists in main memory and flushes them to intermediate files when the buffer is full. The subsystem responsible for merging is called the Merger. After all of the documents have been parsed, the Merger combines the intermediate files to produce the final inverted lists for the collection. The aim of this project is to solve the efficiency and scalability problem of the Merger for the INQUERY indexing system. The speed performance of the old Merger (merge_btl) degrades significantly when used in building big databases under tight memory space limitations. The authors have found a better solution by using a hierarchical merge approach. Timing tests on the new Merger indicates that merge time can be significantly reduced. Hierarchical merge nicely solves the problem of scalability and greatly improves the efficiency of building very large databases for information retrieval systems.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1998
Accession Number
ADA478134

Entities

People

  • Aiqun Du
  • Jamie Callan

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Agreements
  • Availability
  • Classification
  • Commerce
  • Computers
  • Congress
  • Contracts
  • Databases
  • Efficiency
  • Information Operations
  • Information Retrieval
  • Instructions
  • Law
  • Massachusetts
  • Monitoring
  • Scalability

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Computational Linguistics
  • Computer Science/Computer Engineering/Data Science/Digital Signal Processing.

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • Microelectronics
  • Space