Using EM to Classify Text from Labeled and Unlabeled Documents

Abstract

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text, based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
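
The loop described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes a multinomial naive Bayes model with Laplace smoothing, documents represented as lists of word tokens, and a fixed number of iterations in place of a convergence test. The function names (`train_nb`, `posterior`, `em_naive_bayes`) are hypothetical.

```python
import math
from collections import defaultdict


def train_nb(docs, weights, n_classes, vocab):
    """M-step: fit naive Bayes from (possibly fractional) class weights.

    weights[i][c] is the degree to which document i belongs to class c.
    Laplace smoothing is an assumption of this sketch.
    """
    class_mass = [1.0] * n_classes
    word_counts = [defaultdict(float) for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for word in doc:
                word_counts[c][word] += w[c]
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_word = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + len(vocab)
        log_word.append(
            {v: math.log((word_counts[c][v] + 1.0) / denom) for v in vocab}
        )
    return log_prior, log_word


def posterior(doc, model, n_classes):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_word = model
    scores = [
        log_prior[c] + sum(log_word[c][w] for w in doc if w in log_word[c])
        for c in range(n_classes)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, labels, unlabeled, n_classes, n_iter=10):
    """Train on labeled docs, then alternate E- and M-steps over all docs."""
    vocab = sorted({w for d in labeled + unlabeled for w in d})
    # Labeled documents keep fixed one-hot class weights throughout.
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab)
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled documents.
        soft = [posterior(d, model, n_classes) for d in unlabeled]
        # M-step: retrain on every document using those labels.
        model = train_nb(labeled + unlabeled, hard + soft, n_classes, vocab)
    return model


# Toy data (invented for illustration): class 0 = sports, class 1 = politics.
labeled = [["ball", "goal"], ["vote", "law"]]
labels = [0, 1]
unlabeled = [["goal", "team"], ["ball", "team"],
             ["law", "senate"], ["vote", "senate"]]
model = em_naive_bayes(labeled, labels, unlabeled, n_classes=2)
```

In this toy data the words "team" and "senate" never appear in a labeled document, so the model can only associate them with a class through their co-occurrence with labeled words in the unlabeled pool.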

Document Details

Document Type: Technical Report
Publication Date: May 11, 1998
Accession Number: ADA350490

Entities

People

  • Andrew McCallum
  • Kamal Nigam
  • Sebastian Thrun
  • Tom M. Mitchell

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • Ground and Sea Platforms

DTIC Thesaurus Topics

  • Algorithms
  • Computational Science
  • Computer Science
  • Data Mining
  • Data Sets
  • Electronic Mail
  • Estimators
  • Generative Models
  • Information Processing
  • Information Retrieval
  • Information Science
  • Machine Learning
  • Network Science
  • Probabilistic Models
  • Probability
  • Probability Distributions
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Business Analytics
  • Neural Network Machine Learning
  • Operations Research

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks