A Method for the Removal of Redundancy in Printed Text.

Abstract

A class of methods for removing redundancy from printed text, called ID-methods, was developed. ID-methods take into account only the statistics of word occurrences in printed text. Nevertheless, models show that these methods can encode English text at a cost as low as 1.5 binary digits per character. This figure compares favorably with Shannon's upper bound on the entropy of printed English, 1.3 bits per character, which was determined by an experiment that implicitly took into account the syntactic structure and semantics of English. An encoding experiment was performed that verified the cost predictions and assessed the complexity of using ID-methods. It was found that text could be encoded at a rate on the order of a few thousand characters per second. An analysis indicates that text encoded using an ID-method could be decoded at a rate of 250,000 characters per second on a computer such as the IBM 360/75. (Author)
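
To make the "binary digits per character" cost measure concrete, the sketch below estimates what an ideal word-by-word code would cost on a sample of text, using only word-occurrence frequencies. It is not the report's ID-method, whose details the abstract does not give; the function name `word_entropy_bits_per_char` and the convention of charging one separator character per word are assumptions made for illustration.

```python
import math
from collections import Counter


def word_entropy_bits_per_char(text: str) -> float:
    """Estimate the cost, in bits per character, of coding each word
    independently with an ideal code matched to its relative frequency.

    Illustrative only: this is a zero-order word-entropy estimate,
    not the ID-method described in the report.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    total = len(words)

    # Shannon entropy of the empirical word distribution, in bits per word.
    bits_per_word = sum(
        (n / total) * math.log2(total / n) for n in counts.values()
    )

    # Assumption: charge one separator character (a space) per word,
    # so the result is per character of running text.
    chars = sum(len(w) for w in words) + total

    return bits_per_word * total / chars


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(f"{word_entropy_bits_per_char(sample):.3f} bits per character")
```

Because the frequencies are measured on the same text that is being scored, and the cost of storing the word dictionary is ignored, the estimate on a small sample will be lower than what a practical encoder could achieve.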

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 1972
Accession Number
AD0751407

Entities

People

  • Robert Donald Cullum

Organizations

  • University of Illinois Urbana–Champaign

Tags

DTIC Thesaurus Topics

  • Bits
  • Coding
  • Computers
  • Computing-Related Activities
  • Data Science
  • Information Science
  • Interdisciplinary Science
  • Mathematics
  • Personality
  • Redundancy
  • Semantics
  • Statistical Analysis
  • Statistics

Readers

  • Computational Linguistics
  • Molecular Genetics
  • Radio Communications and Signal Processing