Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module

Abstract

Optical character recognition (OCR) is the process of converting an image of a document into text. While progress in OCR research has enabled low error rates for English text in low-noise images, performance remains poor for noisy images and for documents in other languages. We intend to create a post-OCR processing module for noisy Arabic documents that can correct OCR errors before passing the resulting Arabic text to a translation system. To this end, we are evaluating an Arabic-script OCR engine on documents with the same content but varying levels of image quality. We have found that OCR text accuracy can be improved with three stages of pre-OCR image processing: (1) filtering out low-contrast images to avoid hallucination of characters, (2) removing marks from images with cleanup software to prevent their misrecognition, and (3) zoning multi-column images with segmentation software to enable recognition of all zones. The specific errors observed in OCR will form the basis of training data for our post-OCR correction module.
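As an illustration of how OCR text accuracy might be measured when the same content is recognized at different image qualities, the sketch below computes a character-level accuracy from the Levenshtein edit distance between a reference transcription and the OCR output. The abstract does not state which metric the report uses, so the metric and the function names `edit_distance` and `character_accuracy` are assumptions made only for this example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# Hypothetical example: the OCR output drops one of four Arabic letters.
reference_text = "كتاب"
ocr_output = "كتب"
accuracy = character_accuracy(reference_text, ocr_output)
```

Comparing this score across versions of the same document at different image qualities would show how much each pre-OCR step helps. Because the distance is counted in code points, Arabic text containing diacritics or presentation forms would probably need Unicode normalization before scoring.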

Document Details

Document Type: Technical Report
Publication Date: Sep 01, 2011
Accession Number: ADA554465

Entities

People

Brian Kjersten

Organizations

United States Army Research Laboratory
