Methods for Evaluating Text Extraction Toolkits: An Exploratory Investigation

Abstract

Text extraction tools are vital for obtaining the textual content of computer files and for using the electronic text in a wide variety of applications, including search and natural language processing. However, when extraction tools fail, they convert once-reliable electronic text into garbled text, or no text at all. The techniques and tools for validating the accuracy of these text extraction tools are conspicuously absent from academia and industry. This paper contributes to closing this gap. We discuss an exploratory investigation into a method and a set of tools for evaluating a text extraction toolkit. Although this effort focuses on the popular open-source Apache Tika toolkit and the govdocs1 corpus, the method generally applies to other text extraction toolkits and corpora.

Document Details

Document Type: Technical Report
Publication Date: Jan 22, 2015
Accession Number: ADA626579

Entities

People

  • Paul M. Herceg
  • Timothy B. Allison

Organizations

  • MITRE Corporation

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Accuracy
  • Application Software
  • Applied Computer Science
  • Attachment
  • Coefficients
  • Computer Programs
  • Computers
  • Corporations
  • Data Sets
  • Digital Information
  • Language
  • Machine Translation
  • Models
  • Operating Systems
  • Probability
  • Statistical Samples
  • Statistical Sampling

Fields of Study

  • Computer science

Readers

  • Computer Science
  • Distributed Systems and Data Platform Development
  • Theoretical Analysis

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • Microelectronics