Data Discovery and Collection in Support of Data Analytics
Abstract
The goal of this effort was to support the D3M domain discovery systems, achieve acceptable recall rates against ground truth datasets in program evaluations, and deliver an easily retrainable, model-agnostic data discovery, collection, and extraction system that could be centrally provided and leveraged across multiple programs. Doing so required addressing a number of challenges inherited from the underlying web crawling technologies, including reliable handling of dynamic content, anti-bot mechanisms such as CAPTCHA puzzles, and other obstacles such as soft 404 errors, parked domains, and page loading delays.
Document Details
- Document Type: Technical Report
- Publication Date: Jul 01, 2022
- Accession Number: AD1174919
Entities
People
- Jason Hopper