Language Identification by Statistical Analysis

Abstract

An analysis was conducted of English and Spanish text. The statistical analysis determined the independent probability of letters and the joint probability of various letter combinations for large samples of each language. Various methods were tested in an attempt to utilize these characteristics to identify the language of a short sample text. By use of the joint probability of various vowel-consonant relationships and the Kolmogorov-Smirnov Goodness of Fit Test, an identification system was defined that provided a significance level of 0.0077 for a sample of 107 letters (approximately 21 words). Investigation also showed that the space rate or the interword structure in each language contains a measure of intelligence and was useful in identification.
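
The general idea in the abstract (compare the vowel-consonant pair frequencies of a short sample against reference frequencies for each language, using a Kolmogorov-Smirnov style distance) can be illustrated with a minimal Python sketch. This is not the report's procedure, which is not reproduced in this record. The vowel set, the four pair categories (VV, VC, CV, CC), their ordering, the function names, and the choice of the smallest distance as the decision rule are all assumptions made for illustration. The sketch also drops spaces and punctuation, so pairs span word boundaries, and it computes only the distance statistic, not a significance level.

```python
from collections import Counter
from itertools import product

# Assumption: vowels include the accented forms common in Spanish text.
VOWELS = set("aeiouáéíóúü")

# Assumption: a fixed ordering of the four vowel/consonant pair categories.
CATEGORIES = ["".join(p) for p in product("VC", repeat=2)]  # VV, VC, CV, CC


def letter_classes(text):
    """Map each alphabetic character to 'V' or 'C'; drop everything else."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]


def pair_distribution(text):
    """Relative frequency of each adjacent vowel/consonant pair category."""
    classes = letter_classes(text)
    counts = Counter(a + b for a, b in zip(classes, classes[1:]))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text must contain at least two letters")
    return [counts[c] / total for c in CATEGORIES]


def ks_statistic(p, q):
    """Largest absolute gap between the two cumulative distributions."""
    cum_p = cum_q = 0.0
    largest_gap = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        largest_gap = max(largest_gap, abs(cum_p - cum_q))
    return largest_gap


def identify(sample, references):
    """Return the reference label with the smallest distance, plus all distances."""
    sample_dist = pair_distribution(sample)
    scores = {
        label: ks_statistic(sample_dist, pair_distribution(ref_text))
        for label, ref_text in references.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy reference strings for illustration only; real use needs large samples.
    references = {
        "english": "the quick brown fox jumps over the lazy dog near the river bank",
        "spanish": "el veloz zorro marrón salta sobre el perro perezoso cerca del río",
    }
    label, scores = identify("the dog sleeps near the fox", references)
    print(label, scores)
```

With reference strings this short the distances are not meaningful; the abstract's figures come from large samples of each language and a formal goodness-of-fit test.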

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 1974
Accession Number
ADA003518

Entities

People

  • Morton D. Rau

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Alphabets
  • California
  • Classification
  • Consonants
  • Data Science
  • Goodness Of Fit Tests
  • Identification
  • Identification Systems
  • Information Science
  • Knowledge Management
  • Language
  • Linguistics
  • Probability
  • Security
  • Statistical Analysis
  • Statistics
  • United States

Readers

  • Regression Analysis
  • Speech Processing/Speech Recognition

Technology Areas

  • Space