Real-Time News Analysis (RTNA) Scraper Assessment

Abstract

An assessment was conducted to evaluate the performance of the Real-Time News Analysis (RTNA) Scraper, an application used to extract article body text from online news sources. Performance was evaluated by determining the integrity of the scraped text, a metric obtained by calculating the output's similarity to text manually selected from the same articles by a human control group. The Levenshtein edit-distance algorithm was used to calculate a normalized similarity score for each pair of scraped and manually selected article texts; the normalized scores served as direct indicators of integrity. The Scraper was found to perform unacceptably overall because the majority of scraped articles showed integrity loss exceeding the established threshold. The results were insufficiently detailed to give causal explanations for the Scraper's observed performance. No recommendations were made for improving the application; however, a protocol for a follow-on assessment was outlined in detail.
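
The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the normalization formula or the threshold value, so the code below assumes a common convention (one minus the edit distance divided by the length of the longer text). The function names `levenshtein` and `normalized_similarity` are hypothetical and are not taken from the report.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between two strings.

    Uses the standard dynamic-programming recurrence, keeping only
    two rows of the table in memory at a time.
    """
    # Keep the shorter string on the inner loop to minimize row size.
    if len(a) < len(b):
        a, b = b, a

    # Distance from the empty prefix of `a` to every prefix of `b`.
    previous = list(range(len(b) + 1))

    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of `b`
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current

    return previous[-1]


def normalized_similarity(scraped: str, reference: str) -> float:
    """Map edit distance to a score in [0, 1], where 1.0 means identical.

    ASSUMPTION: normalization by the longer text's length; the report
    may define its normalized score differently.
    """
    longest = max(len(scraped), len(reference))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(scraped, reference) / longest
```

Under this assumed convention, a scraped article would be judged against whatever integrity threshold the assessment established by comparing `normalized_similarity(scraped_text, manual_text)` with that threshold. The two-row table keeps memory proportional to the shorter text, which matters for article-length inputs, although running time still grows with the product of the two lengths.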

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2007
Accession Number
ADA473948

Entities

People

  • Ann E. Brodeen
  • Christine E. Slocum

Organizations

  • United States Army Research Laboratory

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Computer Programming
  • Data Sets
  • Decoding
  • Department Of Defense
  • Directories
  • Dynamic Programming
  • Filters
  • Java Programming Language
  • Language
  • Markup Languages
  • Military Research
  • Personality
  • Preprocessing
  • Programming Languages
  • Standards
  • Web Browsers

Readers

  • Computational Linguistics
  • Organizational Process Management (OPM)
  • Regression Analysis