Portable Language-Independent Adaptive Translation from OCR. Phase 1

Abstract

The objective of MADCAT is to produce a robust, highly accurate transcription engine that ingests documents of multiple types and produces English transcriptions of their content. To address the technical challenges implicit in that goal, the BBN-led team proposed a system that integrates several major operations, including (1) pre-processing and image enhancement, (2) page segmentation, (3) text recognition, and (4) metadata extraction. In Phase 1 of the MADCAT effort, we made significant improvements in all of these areas and developed an end-to-end system for processing the Phase 1 evaluation data. The evaluation system exceeded the Phase 1 program goal of 40% accuracy on 70% of the documents. Below, we summarize the work performed by the BBN-led team in Phase 1, highlighting our accomplishments in each technical area and identifying the performers in that area.

Document Details

Document Type
Technical Report
Publication Date
Apr 01, 2009
Accession Number
ADA500251

Entities

People

  • Prem Natarajan

Organizations

  • BBN Technologies

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Computational Science
  • Computer Vision
  • Data Sets
  • Databases
  • Decoding
  • Department Of Defense
  • Feature Extraction
  • Hidden Markov Models
  • Information Science
  • Language
  • Machine Learning
  • Markov Models
  • Probability
  • Supervised Machine Learning
  • Test Sets
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Software Engineering
  • Speech Processing/Speech Recognition