Tandem Learning: A Learning Framework for Document Categorization

Abstract

Supervised machine learning techniques rely on the availability of ample training data in the form of labeled instances. In text, however, users can have a strong intuition about the relevance of features, that is, words that are indicative of a topic. In this work we show that users' prior knowledge about features is useful for text classification, a domain with many irrelevant and redundant features. The benefit of feature selection is more pronounced when the objective is to learn a classifier with as few training examples as possible. We demonstrate the role of feature feedback in training a classifier to suitable performance quickly. We find that aggressive feature feedback is necessary to focus the classifier during the early stages of active learning by mitigating the Hughes phenomenon. We describe an algorithm for tandem learning that begins with a couple of labeled instances and then, at each iteration, recommends features and instances for a user to label. The algorithm contains methods to incorporate feature feedback into Support Vector Machines. We design an oracle to estimate an upper bound on tandem learning performance. Tandem learning using an oracle results in much better performance than learning on only features or only instances. We find that humans can emulate the oracle to an extent that results in performance (accuracy) comparable to that of the oracle. Our unique experimental design helps factor out system error from human error, leading to a better understanding of when and why interactive feature selection works from a user perspective. We also design a set of difficulty measures that capture the inherent instance and feature complexity of a problem. We verify the robustness of our measures by showing that instance and feature complexity are highly correlated. Our complexity measures serve as a tool to understand when tandem learning is beneficial for text classification.
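
The loop described in the abstract (start from a couple of labeled instances, then at each iteration recommend both an instance and some features for the user to label) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the report: the toy documents, the two oracle callbacks standing in for the user, the perceptron used as a simple stand-in for the Support Vector Machine, uncertainty sampling for instance recommendation, top-weight terms for feature recommendation, and the `boost` factor that scales up terms the user marks as relevant.

```python
from collections import Counter

def vectorize(text, boosted, boost=5.0):
    """Bag-of-words counts; terms the user marked relevant are scaled up (assumed scheme)."""
    counts = Counter(text.lower().split())
    return {t: c * (boost if t in boosted else 1.0) for t, c in counts.items()}

def score(weights, vec):
    """Linear score of a sparse vector under sparse weights."""
    return sum(weights.get(t, 0.0) * v for t, v in vec.items())

def train_perceptron(examples, epochs=20):
    """Plain perceptron, used here only as a stand-in for an SVM."""
    weights = {}
    for _ in range(epochs):
        for vec, label in examples:  # label is +1 or -1
            if label * score(weights, vec) <= 0:
                for t, v in vec.items():
                    weights[t] = weights.get(t, 0.0) + label * v
    return weights

def tandem_learn(pool, seed_ids, label_oracle, feature_oracle, rounds=3, k=2):
    """Hypothetical tandem loop: one instance query and up to k feature queries per round."""
    labeled = {i: label_oracle(i) for i in seed_ids}
    boosted, asked = set(), set()
    for _ in range(rounds):
        examples = [(vectorize(pool[i], boosted), y) for i, y in labeled.items()]
        weights = train_perceptron(examples)
        # Instance feedback: query the unlabeled document closest to the boundary.
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if unlabeled:
            pick = min(unlabeled,
                       key=lambda i: abs(score(weights, vectorize(pool[i], boosted))))
            labeled[pick] = label_oracle(pick)
        # Feature feedback: ask about the k highest-weight terms not yet shown.
        candidates = sorted((t for t in weights if t not in asked),
                            key=lambda t: (-abs(weights[t]), t))[:k]
        for term in candidates:
            asked.add(term)
            if feature_oracle(term):
                boosted.add(term)
    # Retrain once more so the final model reflects all feedback collected.
    examples = [(vectorize(pool[i], boosted), y) for i, y in labeled.items()]
    return train_perceptron(examples), labeled, boosted

# Toy data (invented for this sketch): +1 = sports, -1 = finance.
POOL = [
    "the team won the hockey game",
    "stocks fell as the market closed",
    "hockey playoffs start this week",
    "the bank raised interest rates",
    "a late goal decided the game",
    "investors watch the market today",
]
LABELS = [1, -1, 1, -1, 1, -1]
RELEVANT = {"hockey", "game", "goal", "playoffs"}

weights, labeled, boosted = tandem_learn(
    POOL, seed_ids=[0, 1],
    label_oracle=lambda i: LABELS[i],        # simulated instance labels
    feature_oracle=lambda t: t in RELEVANT,  # simulated feature judgments
)
```

In this sketch the two callbacks play the role of the oracle mentioned in the abstract; replacing them with prompts to a person would turn the same loop into an interactive session. How the report itself weights or constrains the SVM with feature feedback is not specified in this record, so the scaling step should be read as one plausible choice.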

Document Details

Document Type
Technical Report
Publication Date
May 01, 2007
Accession Number
ADA462970

Entities

People

  • Hema Raghavan

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Accuracy
  • Artificial Intelligence Software
  • Automata Theory
  • Cognitive Science
  • Computational Science
  • Computer Languages
  • Computer Science
  • Electronic Mail
  • Experimental Design
  • Human-Computer Interaction
  • Information Retrieval
  • Information Science
  • Machine Learning
  • Network Science
  • Statistical Sampling
  • Supervised Machine Learning
  • Surveys

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks