Exploiting Secondary Sources for Unsupervised Record Linkage

Abstract

XML, Web services, and the Semantic Web have opened the door for new and exciting information integration applications. Information sources on the Web are controlled by different organizations or people, use different text formats, and contain varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities across these sources. Data from many online sources does not contain enough information to link the records accurately using state-of-the-art record linkage systems. These systems have an inherent need for learning, which most of the time requires a user in the loop, to link records accurately across data sets. In this paper we describe a novel approach that exploits additional data sources to design an unsupervised record linkage method. Our evaluation using real-world data sets shows that the performance of unsupervised learning in a record linkage system is on par with traditional supervised learning methods.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2004
Accession Number: ADA459586

Entities

People

Craig Knoblock
Martin Michalowski
Snehal Thakkar

Organizations

University of Southern California
