A Conditional Model of Deduplication for Multi-Type Relational Data

Abstract

Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. We evaluate the system on two citation matching datasets, for which we deduplicate both papers and venues. We show that collectively deduplicating paper and venue records yields up to a 20% error reduction in paper deduplication over competing methods.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2005
Accession Number
ADA439607

Entities

People

  • Andrew McCallum
  • Aron Culotta

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Human Systems

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Clustering
  • Computer Science
  • Data Sets
  • Databases
  • Error Analysis
  • Errors
  • Language
  • Machine Learning
  • Models
  • Natural Languages
  • Probabilistic Models
  • Probability
  • Probability Distributions
  • Random Variables
  • Relational Databases

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Fault Tolerant Diagnosis of Black and White Balloon Isolation Tests Using ¥.
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval