Unsupervised Non-topical Classification of Documents

Abstract

We describe the problem of non-topical clustering of documents, the purpose of which is to divide a set of documents into clusters that share some aspect. We present experiments on the British National Corpus that cluster documents by genre. We show that words are superior to part of speech information for genre clustering, but that better results can be obtained by using both. We also demonstrate that the new multi-way distributional clustering approach is highly effective for this task because it requires less feature crafting than other techniques.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2006
Accession Number: ADA479242

Entities

People

James Allan
Koji Eguchi
Ron Bekkerman

Organizations

University of Massachusetts Amherst

Unsupervised Non-topical Classification of Documents

Abstract

Document Details

Entities

People

Organizations

Tags

Communities of Interest

DTIC Thesaurus Topics

Fields of Study

Readers