Intelligent Record Linkage Techniques Based on Information Retrieval, Natural Language Processing, and Machine Learning

Abstract

The objective of this STTR project is to develop an information management system that rapidly and accurately links records of related information from web-based information sources. The sheer magnitude of information available online via the Internet has overwhelmed the ability of existing search tools to produce useful query responses. Current web-search techniques typically fail to correlate relevant documents that identify the same thing in different ways, for example by synonyms or acronyms (aliases). The challenge is to find an approach that obtains highly accurate matches even when those documents share no obvious attributes with the query, while requiring minimal information from the user. Latent Semantic Analysis (LSA) is a technique for identifying both semantically similar words and semantically similar documents. On the face of it, LSA should work well for the task of discovering aliases: for a given name, LSA can produce a rank-ordered list of semantically similar words, and aliases for the name should appear high in this list. In this Phase I effort, we tested this conjecture empirically and found, surprisingly, that under a broad range of circumstances a straightforward application of LSA fails to rank the aliases highly. We then developed a two-stage algorithm that takes the output of LSA, creates a new set of pseudo-documents, and runs LSA again on these new documents. Empirical results show that this two-stage algorithm performs remarkably well in identifying aliases, even in cases where a single application of LSA fails to rank them highly. The University of Maryland (Baltimore County) is the research institute partner for this effort, under the direction of Professor Charles Nicholas and Tim Oates.
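
The ranking step described in the abstract can be illustrated with a minimal sketch, assuming a raw-count term-by-document matrix, a truncated SVD, and cosine similarity between term vectors. The abstract does not say how the pseudo-documents are built, so `pseudo_documents` below is a hypothetical construction (each term followed by its top-ranked first-stage neighbours) and should not be read as the report's actual algorithm. All function names, the toy corpus, and the parameters `k` and `top_n` are likewise assumptions made for illustration.

```python
import numpy as np

def term_doc_matrix(token_docs):
    """Build a raw-count term-by-document matrix from tokenized documents."""
    vocab = sorted({w for doc in token_docs for w in doc})
    row = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(token_docs)))
    for j, doc in enumerate(token_docs):
        for w in doc:
            A[row[w], j] += 1.0
    return vocab, A

def lsa_term_vectors(A, k):
    """Project terms into a k-dimensional latent space via truncated SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    return U[:, :k] * s[:k]

def rank_similar_terms(term, vocab, vectors):
    """Rank every other term by cosine similarity to `term`, best first."""
    i = vocab.index(term)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty vectors
    unit = vectors / norms[:, None]
    sims = unit @ unit[i]
    order = np.argsort(-sims)
    return [(vocab[j], float(sims[j])) for j in order if j != i]

def pseudo_documents(vocab, vectors, top_n):
    """ASSUMPTION: one pseudo-document per term, made of the term plus its
    top_n first-stage neighbours. The abstract does not specify this step."""
    docs = []
    for term in vocab:
        ranked = rank_similar_terms(term, vocab, vectors)[:top_n]
        docs.append([term] + [w for w, _ in ranked])
    return docs

def two_stage_rank(term, token_docs, k=2, top_n=3):
    """Run LSA, build pseudo-documents from its output, and run LSA again."""
    vocab, A = term_doc_matrix(token_docs)
    stage1 = lsa_term_vectors(A, k)
    vocab2, A2 = term_doc_matrix(pseudo_documents(vocab, stage1, top_n))
    stage2 = lsa_term_vectors(A2, k)
    return rank_similar_terms(term, vocab2, stage2)

# Toy corpus (invented for illustration only).
corpus = [
    ["ibm", "announced", "new", "mainframe"],
    ["big_blue", "announced", "new", "server"],
    ["ibm", "mainframe", "sales", "rose"],
    ["weather", "forecast", "rain", "tomorrow"],
]

single_stage = rank_similar_terms("ibm", *(lambda v, a: (v, lsa_term_vectors(a, 2)))(*term_doc_matrix(corpus)))
second_stage = two_stage_rank("ibm", corpus)
```

Both `single_stage` and `second_stage` are lists of `(term, similarity)` pairs sorted from most to least similar to the query name. In a realistic setting the matrix would usually be weighted (for example with TF-IDF) and `k` would be far larger than 2; those choices are left out here to keep the sketch short.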

Document Details

Document Type
Technical Report
Publication Date
Nov 11, 2002
Accession Number
ADA408937

Entities

People

  • Charles Nicholas
  • Eliot Li
  • Raman K. Mehra
  • Tim Oates

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies
  • Human Systems
  • Materials and Manufacturing Processes
  • Sensors

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Computer Languages
  • Computer Programs
  • Detection
  • Dimensionality Reduction
  • Education
  • Information Retrieval
  • Information Systems
  • Language
  • Learning
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Networks
  • Ontologies
  • Situational Awareness

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Regression Analysis
  • Research Science/Academic Research

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms