Using Unlabeled Data to Improve Text Classification

Abstract

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
May 01, 2001
Accession Number
ADA461063

Entities

People

  • Kamal P. Nigam

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • Engineered Resilient Systems
  • Ground and Sea Platforms
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Data Mining
  • Electronic Mail
  • Information Processing
  • Information Retrieval
  • Information Science
  • Information Systems
  • Machine Learning
  • Network Science
  • Neural Networks
  • Ontologies
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Neural Networks