Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module

Abstract

Optical character recognition (OCR) is the process of converting an image of a document into text. While progress in OCR research has enabled low error rates for English text in low-noise images, performance is still poor for noisy images and documents in other languages. We intend to create a post-OCR processing module for noisy Arabic documents which can correct OCR errors before passing the resulting Arabic text to a translation system. To this end, we are evaluating an Arabic-script OCR engine on documents with the same content but varying levels of image quality. We have found that OCR text accuracy can be improved with different stages of pre-OCR image processing: (1) filtering out low-contrast images to avoid hallucination of characters, (2) removing marks from images with cleanup software to prevent their misrecognition, and (3) zoning multi-column images with segmentation software to enable recognition of all zones. The specific errors observed in OCR will form the basis of training data for our post-OCR correction module.
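As an illustration of how OCR text accuracy could be compared across image-quality levels, the following is a minimal sketch that scores OCR output against a reference transcription using character-level edit distance. The abstract does not state which accuracy metric the report uses, so the metric, the function names (edit_distance, character_accuracy), and the sample strings are all assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    reference = "كتاب"   # hypothetical ground-truth text (4 code points)
    ocr_output = "كتب"   # hypothetical OCR output with one dropped letter
    print(character_accuracy(reference, ocr_output))
```

Counting in code points keeps the sketch simple; Arabic text with diacritics or presentation forms would likely need Unicode normalization before scoring, which is not shown here.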

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2011
Accession Number
ADA554465

Entities

People

  • Brian Kjersten

Organizations

  • United States Army Research Laboratory

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Accuracy
  • Channel Models
  • Character Recognition
  • Contrast
  • Errors
  • Fungi
  • Identification
  • Image Processing
  • Language
  • Low Noise
  • Noise
  • Optical Character Recognition
  • Personality
  • Recognition
  • Test And Evaluation
  • Training
  • Translations

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Image Processing and Computer Vision
  • Speech Processing/Speech Recognition