Coupled Semi-Supervised Learning

Abstract

This thesis argues that semi-supervised learning is improved by learning many functions at once in a coupled manner. Given knowledge about constraints between the functions to be learned (e.g., f1(x) implies f2(x)), forcing the learned models to obey these constraints can yield a more constrained, and therefore easier, set of learning problems. We apply these ideas to bootstrap learning methods as well as semi-supervised logistic regression models, and show that considerable improvements are achieved in both settings. In experimental work, we focus on the problem of extracting factual knowledge from the web. This problem is an ideal case study for the general problems that we study because there is an abundance of unlabeled web page data available, and because thousands or millions of functions are discussed on the web. Chapter 3 focuses on coupling the semi-supervised learning of information extractors that extract information from free text using textual extraction patterns (e.g., "mayor of X" and "Y star quarterback X"). We present an approach in which the input to the learner is an ontology defining a set of target categories and relations to be learned, a handful of seed examples for each, and a set of constraints that couple the various categories and relations (e.g., Person and Sport are mutually exclusive). We show that, given this input and millions of unlabeled documents, a semi-supervised learning procedure that avoids labeling unlabeled data in ways that violate the constraints can achieve significant accuracy improvements over semi-supervised methods that do not avoid such violations. In Chapter 4, we apply the ideas from Chapter 3 to a different type of extraction method, wrapper induction for semi-structured web pages.
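
The mutual-exclusion coupling described above can be illustrated with a small sketch. It is not the thesis's algorithm: the function name `coupled_promote`, its arguments, and the all-or-nothing filtering rule are assumptions made for illustration. In one bootstrapping round, a candidate instance is accepted for a category only if it is neither a known nor a proposed instance of any category declared mutually exclusive with it.

```python
from collections import defaultdict


def coupled_promote(candidates, mutex_pairs, known):
    """One illustrative round of constraint-filtered promotion (hypothetical helper).

    candidates:  dict mapping category -> set of instances proposed this round
    mutex_pairs: iterable of (cat_a, cat_b) pairs declared mutually exclusive
    known:       dict mapping category -> set of already accepted instances (e.g. seeds)
    Returns a new dict mapping category -> accepted instances after this round.
    """
    # Build a symmetric lookup: category -> categories it excludes.
    excluded = defaultdict(set)
    for a, b in mutex_pairs:
        excluded[a].add(b)
        excluded[b].add(a)

    # Start from copies of the known sets so the inputs are not mutated.
    promoted = {cat: set(items) for cat, items in known.items()}

    for cat, proposed in candidates.items():
        for inst in proposed:
            # A conflict exists if an exclusive category already holds,
            # or is simultaneously proposing, the same instance.
            conflict = any(
                inst in known.get(other, set())
                or inst in candidates.get(other, set())
                for other in excluded[cat]
            )
            if not conflict:
                promoted.setdefault(cat, set()).add(inst)
    return promoted
```

With seeds `{"Person": {"Tom Brady"}, "Sport": {"football"}}` and the constraint that Person and Sport are mutually exclusive, a round that proposes "football" as a Person would reject that candidate, whereas an uncoupled learner would accept it and could drift from there.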

Document Details

Document Type: Technical Report
Publication Date: May 01, 2010
Accession Number: ADA528596

Entities

People

  • Andrew Carlson

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Automata Theory
  • Birds
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Data Mining
  • Geography
  • Information Science
  • Machine Learning
  • Named Entity Recognition
  • Natural Language Processing
  • Network Science
  • Neural Networks
  • Ontologies
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Theoretical Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks