TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search

Abstract

For the TREC-2009 Chemical IR Track, we explore the development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure, including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances and 15 Elastic Block Store (EBS) database shards to support efficient indexing and query processing of the relatively large index generated by indexing each individual word (excluding stop words) in the 100 GB+ collection of chemical patent documents. The query processing algorithm for technology survey search and prior art search uses information extraction techniques and locally aggregated term statistics to help disambiguate candidate entities and terms in context. Query processing for prior art search automatically generates a structured query based on the relative distinctiveness of individual terms and candidate entity phrases from the query patent's claims, abstract, and title sections. For both technology survey and prior art search, we evaluated several probabilistic retrieval functions that integrate statistics of retrieved named entities with term statistics at multiple levels of document structure to identify relevant patents.
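
The idea of generating a query from the most distinctive terms in a patent's claims, abstract, and title can be illustrated with a minimal sketch. This is not the report's algorithm: it assumes that "distinctiveness" can be approximated by a smoothed inverse document frequency, and the stop-word list, the field weights, and the names `tokenize`, `idf`, and `build_query` are all hypothetical choices made for illustration. It also ignores entity phrases and works on single terms only.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and field weights; neither comes from the report.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is", "with"}
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.0}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency, used here as a stand-in for distinctiveness."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def build_query(patent, doc_freq, num_docs, top_k=5):
    """Score each term by field-weighted tf * idf and return the top_k (term, score) pairs."""
    scores = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        field_counts = Counter(tokenize(patent.get(field, "")))
        for term, tf in field_counts.items():
            scores[term] += weight * tf * idf(term, doc_freq, num_docs)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy example with made-up collection statistics.
example_patent = {
    "title": "Kinase inhibitor",
    "abstract": "A compound",
    "claims": "A compound comprising a kinase inhibitor",
}
example_doc_freq = {"compound": 900, "comprising": 950, "kinase": 10, "inhibitor": 50}
for term, score in build_query(example_patent, example_doc_freq, num_docs=1000, top_k=3):
    print(f"{term}\t{score:.2f}")
```

In this sketch, terms that are rare in the collection and appear in heavily weighted fields rise to the top, while boilerplate claim language such as "comprising" falls away. A real system would take document frequencies from the index rather than from a hand-written dictionary.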

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2009
Accession Number
ADA517743

Entities

People

  • Jay Urbain
  • Ophir Frieder

Organizations

  • University of Wisconsin–Milwaukee

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Chemical Phenomena
  • Chemistry
  • Cloud Computing
  • Coefficients
  • Computer Science
  • Databases
  • Electrical Engineering
  • Engineering
  • Information Retrieval
  • Inhibitors
  • Models
  • Natural Languages
  • Probabilistic Models
  • Standards
  • Statistics
  • Surveys

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval