The Form is the Substance: Classification of Genres in Text

Abstract

Categorization of text in IR has traditionally focused on topic. As use of the Internet and e-mail increases, categorization has become a key area of research as users demand methods of prioritizing documents. This work investigates text classification by format style, i.e. "genre", and demonstrates, by complementing topic classification, that it can significantly improve retrieval of information. The paper compares use of presentation features to word features and the combination thereof, using Naive Bayes, C4.5 and SVM classifiers. Results show use of combined feature sets with SVM yields 92% classification accuracy in sorting seven genres.
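
The idea of combining word features with presentation features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three layout cues (average line length, presence of a question mark, presence of a digit), the two toy genres, and the four training texts are hypothetical and are not taken from the report. The classifier is a small multinomial Naive Bayes with add-one smoothing, one of the three classifier families the abstract names; it does not reproduce the report's own setup or results.

```python
import math
import re
from collections import Counter, defaultdict


def word_features(text):
    """Bag-of-words tokens, prefixed so they cannot collide with layout cues."""
    return ["w=" + tok for tok in re.findall(r"[a-z']+", text.lower())]


def presentation_features(text):
    """Hypothetical layout cues (assumed here, not the report's feature set)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    avg_len = sum(len(ln) for ln in lines) / max(len(lines), 1)
    feats = ["p=short_lines" if avg_len < 40 else "p=long_lines"]
    if "?" in text:
        feats.append("p=has_question")
    if re.search(r"\d", text):
        feats.append("p=has_digit")
    return feats


def combined_features(text):
    """Concatenate both feature sets into one list of symbolic features."""
    return word_features(text) + presentation_features(text)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over symbolic features."""

    def __init__(self, extract):
        self.extract = extract
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.class_counts[label] += 1
            for feat in self.extract(doc):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, doc):
        # Features never seen in training are ignored.
        feats = [f for f in self.extract(doc) if f in self.vocab]
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in feats:
                score += math.log((self.feat_counts[label][feat] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data: two made-up genres, two documents each.
train_docs = [
    "Hi Bob,\nCan we meet at 3?\nThanks,\nAnn",
    "Dear team,\nAre you free Friday?\nRegards,\nSam",
    "The committee reviewed the annual budget and approved the proposed "
    "spending plan for the coming year.",
    "This study examines how quarterly revenue changed across regions and "
    "summarizes the main findings.",
]
train_labels = ["email", "email", "report", "report"]

model = NaiveBayes(combined_features).fit(train_docs, train_labels)
print(model.predict("Hello Ann,\nAre we still on?\nThanks,\nBob"))
```

Passing `word_features` or `presentation_features` to `NaiveBayes` instead of `combined_features` gives the single-feature-set variants, which is the kind of comparison the abstract describes.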

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2001
Accession Number
ADA460898

Entities

People

  • Carol VanEss-Dykema
  • Nigel Dewdney
  • Richard MacMillan

Organizations

  • United States Department of Defense

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence Software
  • Bayesian Networks
  • Classification
  • Computational Linguistics
  • Computational Science
  • Data Mining
  • Data Sets
  • Electronic Mail
  • Identification
  • Information Processing
  • Information Retrieval
  • Information Science
  • Linguistics
  • Machine Learning
  • Probability
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation