Authorship Discovery in Blogs Using Bayesian Classification with Corrective Scaling

Abstract

Widespread availability of free, public blog platforms has facilitated growth in the amount of individually written electronic text available online. Our research leverages an extremely large blog corpus for a study in authorship discovery, both to evaluate a traditional technique as applied to blogs and to demonstrate the implications of authorship discovery in blogs for intelligence and forensic purposes. Our study uses a Bayesian classifier with two important extensions. First, we introduce a post-classification corrective scaling technique to mitigate the over-classification of many samples to a few authors. Second, we propose an n-percent-correct threshold metric, whereby we define a correct result as one where the true author is within some small subset of the original search space rather than requiring that he or she be the single most probable author. Using this approach, we are able to reduce a search space of 2000 authors to 1% of its original size with 91% accuracy when 1000 bigrams are present, or reduce the search space to 10% of its original size with 94% accuracy when only 500 bigrams are present.
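The n-percent-correct threshold metric described in the abstract can be illustrated with a minimal Python sketch. This is not the report's implementation. It assumes that the classifier produces one score per candidate author (for example, a log-probability, where higher means more probable), and the names `within_top_percent` and `n_percent_accuracy` are hypothetical helpers introduced only for this illustration. The corrective scaling step is not shown, because the abstract does not specify how it is computed.

```python
import math


def within_top_percent(scores, true_author, percent):
    """Return True if true_author ranks within the top `percent` of candidates.

    `scores` maps each candidate author to a classifier score (assumed:
    higher is more probable, e.g. a log-probability).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    # Size of the reduced search space; always keep at least one candidate.
    keep = max(1, math.ceil(len(scores) * percent / 100))
    # Rank candidate authors from most to least probable.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_author in ranked[:keep]


def n_percent_accuracy(samples, percent):
    """Fraction of (scores, true_author) pairs that count as correct."""
    hits = sum(within_top_percent(s, a, percent) for s, a in samples)
    return hits / len(samples)


# Toy example with four hypothetical authors and made-up scores.
toy_scores = {"a": -10.0, "b": -12.0, "c": -15.0, "d": -20.0}
toy_samples = [(toy_scores, "a"), (toy_scores, "c")]
toy_accuracy = n_percent_accuracy(toy_samples, 50)
```

Under these assumptions, a threshold of 1% applied to 2000 candidate authors keeps the 20 highest-scoring authors, and a sample counts as correct if its true author is among them.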

Document Details

Document Type: Technical Report
Publication Date: Jun 01, 2008
Accession Number: ADA483774

Entities

People

  • Grant T. Gehrke

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies
  • Weapons Technologies

DTIC Thesaurus Topics

  • Accuracy
  • Artificial Intelligence
  • Computational Linguistics
  • Computational Science
  • Computer Science
  • Computers
  • Grammars
  • Language
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Online Communications
  • Probability
  • Training
  • United States
  • United States Naval Academy

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Information Retrieval
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation
  • Microelectronics
  • Space