Adjacency and Proximity Searching in the Science Citation Index and Google

Abstract

We have developed simple algorithms that allow adjacency and proximity searching in Google and the Science Citation Index (SCI). The SCI algorithm exploits the fact that SCI stopwords in a search phrase function as placeholders. Such a phrase serves effectively as a fixed adjacency condition determined by the number n of adjacent stopwords (i.e., retrieve all records where word A and word B are separated by exactly n words in at least one location). The algorithm integrates over search phrases with different numbers of adjacent stopwords to provide a flexible adjacency or proximity capability (i.e., retrieve all records where word A and word B are separated by n or fewer words in at least one location, where n is the maximum separation desired between A and B). The Google algorithm exploits the fact that asterisks separating words in a Google phrase function like word wildcards. The difference between two such phrases (the first containing one fewer asterisk than the second) serves effectively as a fixed adjacency or proximity condition, with the number of separating words equal to the number of asterisks in the first phrase. The algorithm integrates over these phrase differentials to provide the same flexible adjacency or proximity capability (i.e., retrieve all records where word A and word B are separated by n or fewer words in at least one location).
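
The SCI side of the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the function names, the choice of "the" as the placeholder stopword, and the quoted, OR-joined query syntax are all assumptions made for illustration, and the actual search syntax may differ. The two checker functions only model the stated semantics (A followed by B with exactly n, or at most n, intervening words) on a locally tokenized record.

```python
def sci_phrases(a, b, max_sep, stopword="the"):
    """Build one phrase per separation 0..max_sep.

    Each phrase places k copies of a stopword between a and b; the
    stopword "the" is an assumed placeholder, not taken from the report.
    """
    return [" ".join([a] + [stopword] * k + [b]) for k in range(max_sep + 1)]


def sci_query(a, b, max_sep, stopword="the"):
    """Join the fixed-separation phrases with OR (assumed syntax)."""
    return " OR ".join(f'"{p}"' for p in sci_phrases(a, b, max_sep, stopword))


def separated_by_exactly(tokens, a, b, n):
    """True if a is followed by b with exactly n words between them
    in at least one location (order a-then-b is assumed)."""
    gap = n + 1
    return any(
        tokens[i] == a and tokens[i + gap] == b
        for i in range(len(tokens) - gap)
    )


def separated_by_at_most(tokens, a, b, n):
    """True if some separation k in 0..n matches: the union over the
    fixed-separation conditions, as the abstract describes."""
    return any(separated_by_exactly(tokens, a, b, k) for k in range(n + 1))


if __name__ == "__main__":
    print(sci_query("gene", "therapy", 2))
    record = "gene based cancer therapy".split()
    print(separated_by_at_most(record, "gene", "therapy", 2))
```

Under the same assumptions, the Google variant would replace the stopwords with asterisks and combine differences of adjacent phrases rather than the phrases themselves.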

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2005
Accession Number
ADA442888

Entities

People

  • John T. Rigsby
  • Ronald Neil Kostoff
  • Ryan B. Barth

Organizations

  • Office of Naval Research

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Computer Science
  • Databases
  • Engineering
  • Information Operations
  • Information Retrieval
  • Information Science
  • Information Systems
  • Isotope Separation
  • Literature
  • Literature Surveys
  • Military Research
  • Patent Applications
  • Precision
  • Standards
  • Text Mining

Readers

  • Computational Linguistics
  • Operations Research