Learning to Extract Symbolic Knowledge from the World Wide Web

Abstract

The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e.g., Company, Person, Employee, Product) and relations (e.g., Employed_By, Produced_By) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This paper describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system that has created a knowledge base describing university people, courses, and research projects.
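
The two inputs named in the abstract, an ontology and a set of labeled hypertext regions, can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the report's implementation: the names `Ontology`, `LabeledRegion`, and `NaiveBayesPageClassifier` are hypothetical, the domain and range given for the relation (spelled here with an underscore) are guesses, hypertext is reduced to plain whitespace-separated words, and a toy bag-of-words naive Bayes classifier stands in for the "several machine learning algorithms" the abstract mentions without naming.

```python
from __future__ import annotations

import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Ontology:
    """First input: the classes and relations of interest (hypothetical layout)."""
    classes: list[str]
    # relation name -> (domain class, range class); the pairs are assumptions
    relations: dict[str, tuple[str, str]]


@dataclass
class LabeledRegion:
    """Second input: one training example."""
    text: str   # a region of hypertext, simplified here to plain text
    label: str  # the ontology class this region is an instance of


class NaiveBayesPageClassifier:
    """Toy bag-of-words classifier with add-one smoothing (a stand-in learner)."""

    def __init__(self, ontology: Ontology) -> None:
        self.ontology = ontology
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.class_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def train(self, examples: list[LabeledRegion]) -> None:
        for ex in examples:
            # Training labels must come from the ontology.
            if ex.label not in self.ontology.classes:
                raise ValueError(f"unknown class: {ex.label}")
            words = ex.text.lower().split()
            self.word_counts[ex.label].update(words)
            self.class_counts[ex.label] += 1
            self.vocab.update(words)

    def classify(self, text: str) -> str | None:
        """Return the most probable class, or None if nothing was trained."""
        words = text.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each word
            score = math.log(n / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Under these assumptions, a caller would build an `Ontology` with classes such as `"Person"` and `"Company"`, pass a list of `LabeledRegion` objects to `train`, and then call `classify` on the text of an unseen page. Extracting relation instances from hyperlinks, which the abstract also describes, is outside the scope of this sketch.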

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 1998
Accession Number
ADA356047

Entities

People

  • Andrew McCallum
  • Dan DiPasquo
  • Dayne Freitag
  • Mark Craven
  • Tom M. Mitchell

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Accuracy
  • Bayesian Networks
  • Computer Science
  • Computers
  • Data Sets
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Neural Networks
  • Ontologies
  • Probabilistic Models
  • Probability
  • Test Sets
  • Websites
  • World Wide Web
  • XML

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks