Web-Scale Search-Based Data Extraction and Integration
Abstract
In the current age of abundant, digitized geographic data, the classic, manual approach to geospatial feature discovery and gazetteer creation is cost-prohibitive. While geographic data has become increasingly prevalent on the open Web, it remains largely unstructured and difficult to study. The GeoEngine project has developed generalizable methods for automatic gazetteer generation based on the ample but unstructured data on the open Web. GeoEngine solves this problem with a three-tiered architecture: automatic data discovery and extraction, machine-based semantic aggregation and human validation. GeoEngine has produced specific but generalizable solutions in the following areas: sub-city feature discovery in domestic and foreign locales; neighborhood boundary discovery and refinement; physical feature gazetteer generation and attribute addition; Wikipedia traversal, extraction and auto-correction; and a comprehensive "Places Profile" of Afghanistan. These methods allow for fast, automated gazetteer generation and support geospatial research by leveraging the abundance of unstructured data on the open Web, and they provide new ways of thinking about old problems in geographic information systems.
Document Details
- Document Type: Technical Report
- Publication Date: Oct 17, 2011
- Accession Number: ADA554205
Entities
People
- Govind Kabra
- Kevin C. Chang
- Truman Shuck