WIDELink: A Bootstrapping Approach to Identifying, Modeling and Linking On-Line Data Sources
Abstract
A link discovery system must be able to augment its knowledge base by collecting information from diverse, distributed sources. We have developed a system, WideLink, that automatically extracts data from online sources, integrates it into a domain model by labeling it, and links it with facts already stored in a knowledge base. The challenge is to locate, extract, and integrate data that comes from online sources. We address these problems with a bootstrapping approach in which the system leverages previously gathered data, as well as the underlying structure that many online data sources share, to identify and incorporate new data sources. WideLink systematically explores the structure of online sites so that it can retrieve pages on demand from complex web sites (e.g., sites with forms or embedded navigational structures). The system uses knowledge derived from previously gathered examples to help analyze new types of pages. Using examples of the type of information it is looking for, together with characteristic patterns learned from those examples, WideLink can recognize relevant data in new sources, assign it to semantic categories within the domain model, and link it with previously learned facts.
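As a rough illustration of the idea of labeling new data with patterns learned from earlier examples, the Python sketch below learns coarse token patterns per semantic category and assigns a new column of values to the category whose patterns it matches most often. This is not WideLink's actual algorithm: the function names, the token classes, the category names, and the sample values are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict


def token_pattern(value):
    """Map a string to a coarse pattern of token types (hypothetical scheme)."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", value)
    parts = []
    for tok in tokens:
        if tok.isdigit():
            parts.append(f"NUM{len(tok)}")   # digits, keyed by length
        elif tok.isalpha():
            parts.append("CAP" if tok[0].isupper() else "LOW")
        else:
            parts.append(tok)                # punctuation kept literally
    return " ".join(parts)


def learn_patterns(examples):
    """Count the patterns seen for each category in previously gathered data."""
    patterns = defaultdict(Counter)
    for label, values in examples.items():
        for value in values:
            patterns[label][token_pattern(value)] += 1
    return patterns


def label_column(values, patterns):
    """Return (label, score): the category matching the largest share of values."""
    best_label, best_score = None, 0.0
    for label, counts in patterns.items():
        hits = sum(1 for v in values if token_pattern(v) in counts)
        score = hits / len(values) if values else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical previously gathered examples, grouped by semantic category.
known = {
    "phone": ["310-555-0142", "213-555-0199"],
    "zip": ["90292", "90089"],
}
patterns = learn_patterns(known)

# Values extracted from a hypothetical new source.
phone_guess = label_column(["626-555-0111", "818-555-0123"], patterns)
zip_guess = label_column(["91125"], patterns)
unknown_guess = label_column(["hello world"], patterns)
```

In this sketch a column that matches no learned pattern is left unlabeled (`None`) rather than forced into a category, which keeps unfamiliar data from polluting the domain model.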
Document Details
- Document Type: Technical Report
- Publication Date: Jul 01, 2005
- Accession Number: ADA436343
Entities
People
- Cenk Gazen
- Craig Knoblock
- Kristina Lerman
- Steven Minton
Organizations
- University of Southern California