WIDELink: A Bootstrapping Approach to Identifying, Modeling and Linking On-Line Data Sources

Abstract

A link discovery system must be able to augment its knowledge base by collecting information from diverse, distributed sources. We have developed a system, WideLink, that automatically extracts data from online sources, integrates it into a domain model by labeling it, and links it with facts already stored in a knowledge base. The challenge is to locate, extract, and integrate the data that comes from online sources. We addressed these problems with a bootstrapping approach in which the system leverages previously gathered data, as well as the underlying structure that many online data sources share, to identify and incorporate new data sources. WideLink systematically explores the structure of online sites so that it can retrieve pages on demand from complex web sites (e.g., sites with forms or embedded navigational structures). The system uses knowledge derived from previously gathered examples to help analyze new types of pages. Using examples of the type of information it is looking for, and characteristic patterns learned from those examples, WideLink can recognize relevant data from new sources, assign it to semantic categories within the domain model, and link it with previously learned facts.
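
The pattern-based labeling step described in the abstract can be illustrated with a minimal sketch. This is not the report's algorithm: it assumes a deliberately simple stand-in in which each value is reduced to a coarse sequence of token types, and a new column of data is assigned the semantic category whose known examples share the most patterns with it. All names here (`token_pattern`, `learn_patterns`, `label_column`, the `threshold` parameter) and the sample data are hypothetical.

```python
import re
from collections import Counter


def token_pattern(value):
    """Reduce a string to a coarse tuple of token types (NUM, CAP, LOW, or the punctuation itself)."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", value)
    types = []
    for tok in tokens:
        if tok.isdigit():
            types.append("NUM")
        elif tok.isalpha():
            types.append("CAP" if tok[0].isupper() else "LOW")
        else:
            types.append(tok)  # keep punctuation as its own type
    return tuple(types)


def learn_patterns(examples):
    """Count the token patterns seen for each semantic category (hypothetical model)."""
    return {
        label: Counter(token_pattern(v) for v in values)
        for label, values in examples.items()
    }


def label_column(values, patterns, threshold=0.5):
    """Return (label, score) for the best-matching category, or (None, score) below threshold."""
    best_label, best_score = None, 0.0
    for label, counts in patterns.items():
        matched = sum(1 for v in values if token_pattern(v) in counts)
        score = matched / len(values) if values else 0.0
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Previously gathered data (illustrative values only).
known = {
    "phone": ["(310) 555-0100", "(213) 555-0199"],
    "zip": ["90292", "90089"],
}
patterns = learn_patterns(known)

# Data extracted from a new, unlabeled source.
new_column = ["(626) 555-0142", "(818) 555-0175"]
label, score = label_column(new_column, patterns)

# Bootstrapping step: newly labeled data becomes training data for later sources.
if label is not None:
    known[label].extend(new_column)
    patterns = learn_patterns(known)
```

In this sketch the bootstrapping is the final step: once a new column is labeled, its values are folded back into the example set, so later sources are matched against a larger pool of patterns. A real system would likely need a more robust similarity measure than exact pattern matching.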

Document Details

Document Type
Technical Report
Publication Date
Jul 01, 2005
Accession Number
ADA436343

Entities

People

  • Cenk Gazen
  • Craig Knoblock
  • Kristina Lerman
  • Steven Minton

Organizations

  • University of Southern California

Tags

Communities of Interest

  • Autonomy
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Air Force Research Laboratories
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Data Sets
  • Hidden Markov Models
  • Information Science
  • Language
  • Machine Learning
  • Markov Models
  • Navigation
  • Probabilistic Models
  • Probability
  • Standards
  • Unsupervised Machine Learning
  • Web Service
  • Websites

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Gulf War Illness and Chronic Multisymptom Illness in Veterans
  • Systems Analysis and Design