Populating the Semantic Web

Abstract

The vision of the Semantic Web is that a vast store of online information "meaningful to computers will unleash a revolution of new possibilities". Unfortunately, the vast majority of information on the Web is formatted to be easily read by human users, not by computer applications. To make the vision of the Semantic Web a reality, tools for automatically annotating Web content with semantic labels will be required. We describe the ADEL system, which automatically extracts records from Web sites and semantically labels their fields. The system exploits similarities in the layout of Web pages to learn the grammar that generated those pages, and then uses this grammar to extract structured records from them. The ADEL system also exploits the fact that sites in the same domain provide the same or similar data. By collecting labeled examples of data during the training stage, we are able to learn structural descriptions of data fields and later use these descriptions to semantically label new data fields. We show that in a used-car shopping domain, ADEL achieves 64% precision and 89% recall in extracting and labeling data columns.
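The labeling idea in the abstract (learn structural descriptions of fields from labeled examples, then use them to label new columns) can be illustrated with a minimal sketch. This is not ADEL's actual algorithm, which this record does not specify. It assumes a field's "structural description" can be approximated by the set of coarse token-type patterns seen in training. The names `token_pattern`, `learn_descriptions`, `label_column`, and the `threshold` parameter are hypothetical and were chosen for this illustration.

```python
import re
from collections import Counter


def token_pattern(value):
    """Reduce a field value to a coarse token-type signature (illustrative only)."""
    # Split into words, numbers (digits with embedded , or .), and single punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|\d[\d,.]*|[^\sA-Za-z\d]", value)
    signature = []
    for tok in tokens:
        if tok[0].isdigit():
            signature.append("NUM")
        elif tok[0].isalpha():
            signature.append("CAPS" if tok[0].isupper() else "LOWER")
        else:
            signature.append(tok)  # keep punctuation literally
    return tuple(signature)


def learn_descriptions(labeled_examples):
    """Training stage: count the patterns observed for each semantic label."""
    return {
        label: Counter(token_pattern(v) for v in values)
        for label, values in labeled_examples.items()
    }


def label_column(values, descriptions, threshold=0.5):
    """Assign the label whose known patterns cover the largest share of the column.

    Returns None when the column is empty or no label covers at least
    `threshold` of its values (an assumed cutoff, not taken from the report).
    """
    if not values:
        return None
    best_label, best_score = None, 0.0
    for label, patterns in descriptions.items():
        hits = sum(1 for v in values if token_pattern(v) in patterns)
        score = hits / len(values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Hypothetical training data for a used-car domain.
training = {
    "price": ["$12,500", "$8,999", "$23,000"],
    "year": ["1999", "2001", "2003"],
    "make": ["Honda", "Toyota", "Ford"],
}
descriptions = learn_descriptions(training)
print(label_column(["$4,250", "$17,900"], descriptions))
print(label_column(["2002", "1998"], descriptions))
```

A pattern-coverage score is only one possible design. It cannot separate fields that share a pattern (for example, year and mileage when both are bare numbers), which hints at why labeling precision in such a system can be noticeably lower than recall.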

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2004
Accession Number: ADA457907

Entities

People

  • Cenk Gazen
  • Craig Knoblock
  • Kristina Lerman
  • Steven Minton

Organizations

  • University of Southern California

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Air Force
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computer Languages
  • Databases
  • Grammars
  • Induction Systems
  • Information Science
  • Information Systems
  • Language
  • Machine Learning
  • Models
  • Multiagent Systems
  • Neural Networks
  • Probabilistic Models
  • Semi-Supervised Learning
  • Websites

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Vision
  • Information Retrieval