Exploiting Secondary Sources for Unsupervised Record Linkage

Abstract

XML, Web services, and the Semantic Web have opened the door to new and exciting information integration applications. Information sources on the web are controlled by different organizations or people, use different text formats, and contain varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities across these sources. Data from many online sources does not contain enough information to link the records accurately using state-of-the-art record linkage systems. These systems have an inherent need for learning, which most of the time requires a user in the loop, to link records accurately across datasets. In this paper we describe a novel approach that exploits additional data sources to design an unsupervised record linkage method. Our evaluation using real-world data sets shows that the performance of unsupervised learning in a record linkage system is on par with traditional supervised learning methods.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2004
Accession Number
ADA459586

Entities

People

  • Craig Knoblock
  • Martin Michalowski
  • Snehal Thakkar

Organizations

  • University of Southern California

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Acquisition
  • Air Force
  • Algorithms
  • Data Sets
  • Databases
  • Information Science
  • Learning
  • Machine Learning
  • Postal Service
  • Precision
  • Supervised Machine Learning
  • Training
  • United States
  • Unsupervised Machine Learning
  • Websites

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Enterprise Information Systems Architecture and Joint Command Capability Interoperability Support
  • Robotics and Automation

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms