Reliable Electronic Text: The Elusive Prerequisite for a Host of Human Language Technologies

Abstract

Electronic text for use by human language technologies originates from a number of sources: direct keyboard entry, optical character recognition, speech recognition, and text-containing computer files. In particular, text-containing computer files may elude processing by an array of human language technology applications (e.g., search, language ID, machine translation, and text analytics). This paper brings to light the effort required to extract electronic text from these files, preserve its integrity, and, for some use cases, preserve its structure. It explores a series of specific human language technologies, highlighting the following aspects for each: relevant use cases, the impact of text extraction or conversion errors, the criticality of dependable text extraction and reliable electronic text, and the importance of experimentation and/or testing prior to use. Overall, this paper promotes the successful use of human language technology by equipping the reader to be discerning about the use of human language technology applications with text-containing files.

Document Details

Document Type
Technical Report
Publication Date
Sep 30, 2010
Accession Number
ADA546707

Entities

People

  • Catherine N. Ball
  • Paul M. Herceg

Organizations

  • MITRE Corporation

Tags

DTIC Thesaurus Topics

  • Character Recognition
  • Computational Science
  • Computer Languages
  • Computers
  • Data Mining
  • Digital Information
  • Html
  • Machine Translation
  • Models
  • Named Entity Recognition
  • Natural Language Processing
  • Ontologies
  • Recognition
  • Text Analytics
  • Text Mining
  • Translations
  • Word Processors

Fields of Study

  • Computer science
  • Engineering

Readers

  • Computer Science/Computer Engineering/Data Science/Digital Signal Processing
  • Geochemistry
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - DoD AI Strategy
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation
  • Microelectronics