The Form is the Substance: Classification of Genres in Text
Abstract
Categorization of text in information retrieval (IR) has traditionally focused on topic. As use of the Internet and e-mail increases, categorization has become a key area of research as users demand methods of prioritizing documents. This work investigates text classification by format style, i.e. "genre", and demonstrates that, by complementing topic classification, it can significantly improve retrieval of information. The paper compares the use of presentation features with word features, and the combination thereof, using Naive Bayes, C4.5 and SVM classifiers. Results show that combined feature sets with SVM yield 92% classification accuracy in sorting seven genres.
Document Details
- Document Type: Technical Report
- Publication Date: Jan 01, 2001
- Accession Number: ADA460898
Entities
People
- Carol VanEss-Dykema
- Nigel Dewdney
- Richard Macmillan
Organizations
- United States Department of Defense