A Methodology for End-to-End Evaluation of Arabic Document Image Processing Software

Abstract

This paper describes a methodology for end-to-end evaluation of Arabic document image processing software. Various software solutions have been proposed for digitizing and understanding noisy, complex Arabic document images. Solutions based on optical character recognition (OCR) have been available for decades; however, this technology is often tailored to the most common document image type: clean, monolingual documents. Real-world documents often involve multiple languages, handwriting, logos, signatures, pictures, stylized text, and other complicating features. They also contain noise introduced by document aging, reproduction, or exposure to environmental factors. Document image processing solutions are maturing to deal with such complexities. Such systems include image clean-up algorithms and page segmentation, followed by various recognition or digitization algorithms: OCR, handwritten word recognition (HWR), logo identification, signature identification, and sub-image or picture identification. Indexing the digitized document renditions in a search engine enables ad hoc querying of the collection. Some researchers have proposed semi-automation, a process in which human readers interpret complex documents and record a spoken rendition; the audio recordings are then processed by a spoken document retrieval (SDR) system, which employs automatic speech recognition (ASR) for digitization and an information retrieval solution to enable ad hoc queries. To handle foreign-language content, machine translation may be included in any of the aforementioned document image processing systems. This array of approaches results in widely varying performance. This paper discusses a methodology for evaluating the end-to-end retrieval performance of these systems in the ad hoc use case. The methodology can be easily tailored to other languages and to other document formats (e.g., audio and video).
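The abstract does not say which retrieval measures the methodology uses. As an illustration only, the sketch below assumes the standard ad hoc retrieval measures (precision, recall, and mean average precision) computed from a system's ranked results and a set of human relevance judgments. All function names, query IDs, and document IDs are hypothetical and are not taken from the report.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(ranked: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved document IDs.

    Assumes `ranked` holds unique document IDs, best match first.
    """
    top_k = ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Mean of average precision over every judged query.

    `runs` maps query ID to a ranked result list; `qrels` maps query ID to
    the set of document IDs judged relevant. A query with no results scores zero.
    """
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(query_id, []), relevant)
              for query_id, relevant in qrels.items()]
    return sum(scores) / len(scores)
```

Because the measures depend only on the ranked output and the relevance judgments, the same scoring could in principle be applied to an OCR-based pipeline, a spoken-rendition pipeline, or a variant with machine translation, which is what makes an end-to-end comparison possible under these assumptions.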

Document Details

Document Type: Technical Report
Publication Date: Jun 01, 2006
Accession Number: ADA468394

Entities

People

  • Catherine N. Ball
  • Paul M. Herceg

Organizations

  • MITRE Corporation

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Audio Files
  • Automated Speech Recognition
  • Character Recognition
  • Data Analysis
  • Data Sets
  • Extraction
  • Foreign Languages
  • Identification
  • Image Processing
  • Images
  • Information Retrieval
  • Language
  • Optical Character Recognition
  • Recognition
  • Test And Evaluation
  • Word Recognition

Fields of Study

  • Computer science
  • Engineering

Readers

  • Distributed Systems and Data Platform Development
  • Information Retrieval
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation