Duplicate Record Elimination in Large Data Files

Abstract

This paper addresses the issue of duplicate elimination in large data files in which many occurrences of the same record may appear. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system. (Author)
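The abstract does not spell out the modified merge-sort, so the following is only a minimal sketch under one assumption: duplicates are discarded as soon as they meet during a merge step, so intermediate runs shrink instead of carrying every copy to the final pass. The function names `merge_unique`, `sort_unique`, and `sort_then_scan` are hypothetical, and the in-memory lists stand in for the on-disk runs the report analyzes.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_unique(left: Sequence[T], right: Sequence[T]) -> List[T]:
    """Merge two sorted, duplicate-free runs into one sorted, duplicate-free run."""
    out: List[T] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            # Same record in both runs: keep one copy and advance both sides.
            out.append(left[i])
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def sort_unique(records: Sequence[T]) -> List[T]:
    """Bottom-up merge sort that eliminates duplicates at every merge (assumed variant)."""
    # Each single-record run is trivially sorted and duplicate-free.
    runs: List[List[T]] = [[r] for r in records]
    if not runs:
        return []
    while len(runs) > 1:
        merged: List[List[T]] = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge_unique(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])  # odd run out, carried to the next pass
        runs = merged
    return runs[0]


def sort_then_scan(records: Sequence[T]) -> List[T]:
    """Baseline named in the abstract: full sort, then one sequential pass."""
    out: List[T] = []
    for r in sorted(records):
        if not out or out[-1] != r:
            out.append(r)
    return out
```

Both functions return the same result; the difference the report studies is cost. In the sketch, `sort_unique` merges progressively smaller runs when the input holds many repeated records, while `sort_then_scan` sorts every occurrence before discarding any.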

Document Details

Document Type: Technical Report
Publication Date: Aug 01, 1981
Accession Number: ADA110052

Entities

People

  • David J. DeWitt
  • Dina Friedland

Organizations

  • University of Wisconsin-Madison, Department of Computer Science

Tags

Communities of Interest

  • Energy and Power Technologies
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Application Software
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Cost Analysis
  • Database Management Systems
  • Databases
  • Mass Storage
  • Parallel Computing
  • Parallel Processing
  • Probability
  • Relational Database Management Systems
  • Relational Databases
  • Standards
  • Test And Evaluation

Readers

  • Applied Combinatorial Optimization and Logic Circuit Design
  • Database Systems and Applications