Unsupervised Non-topical Classification of Documents

Abstract

We describe the problem of non-topical clustering of documents, the purpose of which is to divide a set of documents into clusters that share some aspect. We present experiments on the British National Corpus that cluster documents by genre. We show that words are superior to part-of-speech information for genre clustering, but that better results can be obtained by using both. We also demonstrate that the new multi-way distributional clustering approach is highly effective for this task because it requires less feature crafting than other techniques.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
ADA479242

Entities

People

  • James Allan
  • Koji Eguchi
  • Ron Bekkerman

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Autonomy
  • C4I

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Bayesian Networks
  • Classification
  • Clustering
  • Computational Science
  • Computer Science
  • Computer Vision
  • Dimensionality Reduction
  • Feature Selection
  • Information Retrieval
  • Machine Learning
  • Models
  • Neural Networks
  • Random Variables
  • Standards
  • Vector Spaces

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Nuclear Civil Defense
  • Regression Analysis