Acquaintance: Language-Independent Document Categorization by N-Grams

Abstract

Acquaintance is a novel vector-space n-gram technique for categorizing documents. It combines the robustness of an n-gram-based algorithm with a novel vector-space model, and it is computationally simple: an unoptimized version of the algorithm processed the TREC database in a very short time. Acquaintance gauges similarity among documents on the basis of common features, permitting document categorization based on a common language, a common topic, or common subtopics. The algorithm is completely language- and topic-independent, and it remains resistant to garbling even when 10% to 15% of the characters are corrupted. Acquaintance is fully described in Damashek, 1995. The TREC-3 conference provided the first public demonstration and evaluation of this new technique, and TREC-4 provided an opportunity to test its usefulness on several types of text retrieval tasks.
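
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of comparing documents by character n-grams in a vector space. It is not the Acquaintance algorithm as specified in Damashek, 1995. The function names (ngram_profile, cosine), the n-gram length n=3, the lowercasing and whitespace normalization, and the use of cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Map each overlapping character n-gram to its relative frequency.

    Assumption: text is lowercased and runs of whitespace are collapsed.
    No dictionary, stemmer, or tokenizer is used, which is what makes an
    approach of this kind independent of the document's language.
    """
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(a, b):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference texts, one per category.
english = ngram_profile("the quick brown fox jumps over the lazy dog and the cat")
spanish = ngram_profile("el rapido zorro marron salta sobre el perro perezoso y el gato")

# A short query document is assigned to the closest reference profile.
query = ngram_profile("the dog and the fox")
best = max([("english", english), ("spanish", spanish)],
           key=lambda item: cosine(query, item[1]))[0]
```

Because a single corrupted character only disturbs the few n-grams that overlap it, most of a garbled document's profile survives intact, which is one plausible reading of the garble resistance the abstract reports. The same comparison could be applied with reference profiles built per topic instead of per language.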

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 1995
Accession Number
ADA470523

Entities

People

  • Stephen Huffman

Organizations

  • United States Department of Defense

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Acquisition
  • Algorithms
  • Databases
  • Department Of Defense
  • Filtration
  • Graph Theory
  • Hash Tables
  • Information Operations
  • Information Processing
  • Language
  • Neurobehavioral Manifestations
  • Nuclear Proliferation
  • Personality
  • Precision
  • Statistics
  • Vector Spaces

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Information Retrieval
  • Systems Analysis and Design

Technology Areas

  • Space