Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus

Abstract

Correlations between human and Latent Semantic Analysis (LSA) judgements of pairwise document similarity were explored on a set of 50 news documents. LSA is a modern and commonly used technique for automatically determining document similarity. LSA users must choose local and global weighting schemes, the number of factors to retain, a stop word list, and whether to add a background set of documents. Global weighting schemes had more effect than local weighting schemes. Use of a stop word list almost always improved performance. Introducing a background set of similar documents increased the larger correlations and reduced the smaller ones. The correlations ranged between approximately 0 and 0.6 depending on the LSA settings, indicating the importance of choosing the settings correctly. The low maximum correlation indicates that information presentation schemes based on LSA may often be at variance with visualisations based on human decisions, even when the best settings for a data set are used.
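
To make the choices listed in the abstract concrete, the following is a minimal sketch of two of them, term weighting and stop word removal, followed by the kind of correlation that would be computed against human ratings. It is an illustration under stated assumptions, not the report's implementation: log local weighting and entropy global weighting are used as one common pairing, the stop word list is a tiny hypothetical one, and all function names are invented for this example. The factor-reduction step of LSA (a truncated singular value decomposition of the weighted matrix) is deliberately left out, since it needs a linear algebra library; the similarities here are therefore computed in the full weighted term space.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = frozenset({"the", "a", "of", "in", "and", "to", "is"})


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def log_entropy_matrix(docs, stop_words=frozenset()):
    """Return (vocab, rows); rows[i][j] is the weighted count of term j in doc i.

    Local weight:  log(1 + count)
    Global weight: 1 + sum_i(p_i * log p_i) / log(n_docs),
                   where p_i = count in doc i / total count of the term.
    """
    counts = [Counter(tokenize(d, stop_words)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    weights = {}
    for term in vocab:
        total = sum(c[term] for c in counts)
        entropy = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                entropy += p * math.log(p)
        # With a single document the entropy weight is undefined; fall back to 1.
        weights[term] = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
    rows = [[math.log(1 + c[t]) * weights[t] for t in vocab] for c in counts]
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(rows):
    """Cosine similarity for every unordered document pair (i < j), in row order."""
    n = len(rows)
    return [cosine(rows[i], rows[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

In use, `pairwise_similarities(log_entropy_matrix(docs, STOP_WORDS)[1])` would give one model similarity per document pair, and `pearson` would compare that list with the mean human rating for the same pairs, listed in the same order. Swapping the weighting functions or the stop word list and recomputing the correlation is the kind of comparison the abstract describes.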

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2004
Accession Number
ADA427585

Entities

People

  • Brandon Pincombe

Organizations

  • Defence Science and Technology Group

Tags

Communities of Interest

  • Autonomy
  • Biomedical
  • Ground and Sea Platforms
  • Space

DTIC Thesaurus Topics

  • Applied Mathematics
  • Artificial Intelligence
  • Birds
  • Cognitive Science
  • Computational Science
  • Employment
  • Fish
  • Information Processing
  • Information Retrieval
  • Information Science
  • Natural Language Processing
  • Personnel Management
  • Terrorists
  • United States
  • United States Government
  • Weighting Functions
  • Word Lists

Fields of Study

  • Computer science

Readers

  • Electronics Engineering
  • Information Retrieval
  • Systems Analysis and Design