Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff's (2001) Approach

Abstract

There is a distinct lack of tools that provide a comprehensive measure of the similarity between corpora. Finding similar corpora is necessary for the design of certain user studies investigating text processing. It is also useful for ensuring comparability between studies on document analysis conducted across classified and unclassified domains. In this study, human judgements of corpora similarity were obtained as a gold standard. These were then compared to the values provided by Kilgarriff's (2001) chi-square (χ²) statistic. The findings indicated a high level of agreement between the participants, with 77% shared variance in overall similarity judgements. The results of the χ² measure also correlated well with the human results, with a correlation of approximately 0.66. Although there are complexities associated with the χ² technique that need to be examined in further research, this study provides extremely promising results, suggesting that a statistical technique could provide results that are comparable to human judgements.
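
The abstract does not spell out how the chi-square statistic is computed, so the following is only a minimal sketch of one common reading of a chi-square corpus comparison, not the report's implementation. It assumes that the statistic is summed over the most frequent words of the two corpora combined, and that expected counts come from the null hypothesis that both corpora are samples of the same population. The function name `chi_square_similarity` and the `top_n` parameter are hypothetical, and tokenisation is left to the caller.

```python
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square score between two tokenised corpora (lower = more similar).

    Assumption: the score is the sum of (O - E)^2 / E over the top_n most
    frequent words of the joint corpus, for both corpora.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    joint = counts_a + counts_b
    total = size_a + size_b

    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        score += (counts_a[word] - expected_a) ** 2 / expected_a
        score += (counts_b[word] - expected_b) ** 2 / expected_b
    return score


if __name__ == "__main__":
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog sat on the log".split()
    print(chi_square_similarity(corpus_a, corpus_b, top_n=5))
```

A score of zero means the selected words occur in exactly the proportions expected from the corpus sizes, and larger scores indicate less similar corpora. Scores of this kind could then be correlated with human similarity ratings, which is the type of comparison the abstract reports.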

Document Details

Document Type
Technical Report
Publication Date
Apr 01, 2009
Accession Number
ADA506585

Entities

People

  • Agata McCormac
  • Kathryn Parsons
  • Marcus Butavicius

Organizations

  • Defence Science and Technology Group

Tags

DTIC Thesaurus Topics

  • Cognitive Science
  • Computational Science
  • Data Visualization
  • Information Retrieval
  • Information Science
  • Judgment
  • Language
  • Linguistics
  • Machine Translation
  • Natural Language Processing
  • Object Recognition
  • Psychology
  • Regression Analysis
  • Statistical Analysis
  • Statistical Samples
  • Statistics
  • Visualizations

Readers

  • Computational Modeling and Simulation
  • Cybersecurity