Automatically Detecting Authors' Native Language

Abstract

When non-native speakers learn English, their first language influences how they learn. This is known as L1-L2 language transfer, and linguistic studies have shown that this transfer can affect writing as well. A model that exploits L1-L2 language transfer to identify an author's native language would be an invaluable tool for the intelligence community as well as in the field of education. The objective of this research is therefore to determine whether an author's native language can be detected automatically from his or her writing in English using traditional machine learning techniques. For this research, we used eight collections of writings, one for each of eight groups of authors: native English speakers and speakers of Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish. Among the various feature sets used in this research, character trigrams and bag of words alone achieved accuracy higher than 80%, and an empirical analysis of the character trigrams revealed that they just model lexical usage. When content words were extracted, performance dropped, and the results revealed that the topic words were doing all the work.
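To make the two feature sets named in the abstract concrete, the following is a minimal sketch of character-trigram and bag-of-words extraction feeding a simple classifier. The abstract says only "traditional machine learning techniques", so the multinomial Naive Bayes classifier with add-one smoothing used here is an assumption chosen for illustration, not the report's method. The names `char_trigrams`, `bag_of_words`, and `NaiveBayes` are hypothetical, as are the lowercasing and whitespace tokenization.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Count overlapping character trigrams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing.

    An illustrative stand-in: the abstract does not name the classifier used.
    """

    def __init__(self, featurize):
        self.featurize = featurize                 # e.g. char_trigrams or bag_of_words
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.doc_counts = Counter()                 # label -> number of training texts
        self.vocab = set()                          # all features seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.feature_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = self.featurize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(n_docs / total_docs)
            for feat, n in feats.items():
                if feat in self.vocab:  # ignore features never seen in training
                    score += n * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Given a list of essays `texts` and a parallel list of native-language labels `labels` (both hypothetical), `NaiveBayes(char_trigrams).fit(texts, labels).predict(new_text)` returns the most probable label. Passing `bag_of_words` instead switches the feature set. Both feature sets capture word choice, so a classifier built this way can end up keying on topic vocabulary rather than on transfer effects, which is the concern the abstract raises.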

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2011
Accession Number
ADA543857

Entities

People

  • Charles S. Ahn

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy
  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Acquisition
  • Algorithms
  • Computational Science
  • Computer Programs
  • Computer Science
  • Data Sets
  • Department Of Defense
  • Feature Extraction
  • Grammars
  • Language
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Probability Distributions
  • Supervised Machine Learning

Readers

  • Neural Network Machine Learning
  • Political Science/International Relations/European Studies
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation