The Bible, Truth, and Multilingual OCR Evaluation

Abstract

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these algorithms across languages. This difficulty arises because most evaluation methodologies rely on a document image dataset in each of the languages, and it is hard to find document datasets in different languages that are similar in content and layout. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 1998
Accession Number
ADA458666

Entities

People

  • Philip Resnik
  • Tapas Kanungo

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computational Linguistics
  • Computational Science
  • Computer Vision
  • Data Sets
  • Image Processing
  • Information Retrieval
  • Information Science
  • Language
  • Linguistics
  • Machine Translation
  • Natural Language Processing
  • Pattern Recognition
  • Recognition
  • Test And Evaluation

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Vision
  • Educational Psychology