Data Discovery and Collection in Support of Data Analytics

Abstract

The goal of this effort was to support the D3M domain discovery systems, achieve acceptable recall rates against ground-truth datasets in program evaluations, and deliver an easily retrainable, model-agnostic data discovery, collection, and extraction system that could be centrally provided and leveraged across multiple programs. This required addressing a number of challenges inherited from the underlying web crawling technologies, including reliable handling of dynamic content, anti-bot mechanisms such as CAPTCHA puzzles, and other nuisances like soft 404 errors, parked domains, and page loading delays.

Document Details

Document Type
Technical Report
Publication Date
Jul 01, 2022
Accession Number
AD1174919

Entities

People

  • Jason Hopper

Tags

Communities of Interest

  • Autonomy
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Air Force
  • Air Force Research Laboratories
  • Artificial Intelligence Software
  • Computer Languages
  • Computer Programming
  • Computer Programs
  • Computers
  • Computing System Architectures
  • Contracts
  • COVID-19
  • Cross Domain
  • Data Analysis
  • Data Sets
  • Extraction
  • HTML
  • Information Science
  • Machine Learning
  • Neural Networks
  • New York
  • Standards
  • United States

Fields of Study

  • Computer science

Readers

  • Brain and Cognitive Science; Experimental Psychology; Cognitive Neuroscience
  • Distributed Systems and Data Platform Development