Portable Language-Independent Adaptive Translation from OCR

Abstract

This quarter, we redesigned the Shape-DNA based rule line cleaning algorithm to minimize degradation of the shapes of text characters. Recall that in the Shape-DNA based cleaning approach, projection onto the Shape-DNA space produces a rule line distance image that is used to clean the rule lines. However, this cleaning process can and does remove portions of legitimate text characters that resemble rule lines. Therefore, instead of using the rule line distance image to clean rule lines directly, we now use it to model the rule lines present in the document. Specifically, by applying the Hough transform to the rule line distance image, we compute a set of rule line model parameters. In addition, we estimate the average thickness of the rule lines from the original input image. Finally, we use both the model parameters and the thickness estimate with a sliding window to clean the rule lines. Figure 2 shows an example comparing the new rule line cleaning algorithm with the previous version of Shape-DNA cleaning.

This reporting period, we also improved the restoration algorithm that removes the artifacts introduced by rule line cleaning. Like the rule line cleaning algorithm, the Shape-DNA based restoration algorithm includes an off-line training process, in which text character shapes are learned from about 100 handwritten text images (with no rule lines) and a Shape-DNA database is computed from the shape patterns. Restoration then proceeds by projecting shape blocks from the input image onto the database and searching for the closest shape pattern in the database. Unlike our previous version, where Shape-DNA restoration was applied to the entire image, we now use the estimated rule line model parameters to constrain the restoration to the local proximity of the detected rule lines.
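The model-then-clean idea in the abstract can be illustrated with a minimal sketch. This is not the report's implementation: the Shape-DNA projection is not reproduced, so a plain binary mask stands in for the rule line distance image, and the function names, the median-based thickness estimate, the near-horizontal restriction, and the tolerance `tol` are all assumptions made for illustration. The sketch fits a line with a Hough transform, estimates thickness from the original image, then slides along the line and erases only those vertical runs that are about as thick as the rule line, so that text strokes crossing the line survive.

```python
import numpy as np

def hough_lines(mask, n_theta=180, top_k=1, min_sep=3):
    """Vote in (rho, theta) space; return up to top_k (rho, theta) peaks."""
    ys, xs = np.nonzero(mask)
    thetas = np.deg2rad(np.linspace(0, 180, n_theta, endpoint=False))
    diag = int(np.ceil(np.hypot(*mask.shape)))
    # rho = x*cos(theta) + y*sin(theta), shifted by diag to stay non-negative
    rhos = np.round(np.outer(xs, np.cos(thetas))
                    + np.outer(ys, np.sin(thetas))).astype(int) + diag
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    for t in range(n_theta):
        acc[:, t] = np.bincount(rhos[:, t], minlength=2 * diag + 1)
    lines = []
    for _ in range(top_k):
        r, t = np.unravel_index(np.argmax(acc), acc.shape)
        if acc[r, t] == 0:
            break
        lines.append((int(r) - diag, float(thetas[t])))
        # crude non-maximum suppression: blank out nearby rho values
        acc[max(r - min_sep, 0):r + min_sep + 1, :] = 0
    return lines

def line_rows(rho, theta, width):
    """Row of the modeled line at each column (near-horizontal lines only)."""
    if abs(np.sin(theta)) < 0.5:
        raise ValueError("sketch assumes near-horizontal rule lines")
    xs = np.arange(width)
    return np.round((rho - xs * np.cos(theta)) / np.sin(theta)).astype(int)

def vertical_run(img, x, y):
    """Length and bounds of the vertical foreground run through (y, x)."""
    h = img.shape[0]
    if not (0 <= y < h) or not img[y, x]:
        return 0, y, y
    top = bot = y
    while top > 0 and img[top - 1, x]:
        top -= 1
    while bot < h - 1 and img[bot + 1, x]:
        bot += 1
    return bot - top + 1, top, bot

def estimate_thickness(img, rho, theta):
    """Median vertical run along the line (median is an assumption here)."""
    rows = line_rows(rho, theta, img.shape[1])
    runs = [vertical_run(img, x, y)[0] for x, y in enumerate(rows)]
    runs = [r for r in runs if r > 0]
    return int(np.median(runs)) if runs else 0

def clean_rule_line(img, rho, theta, thickness, tol=1):
    """Slide along the line; erase runs no thicker than thickness + tol."""
    out = img.copy()
    for x, y in enumerate(line_rows(rho, theta, img.shape[1])):
        run, top, bot = vertical_run(img, x, y)
        if 0 < run <= thickness + tol:
            out[top:bot + 1, x] = False
    return out

# Synthetic page: a 2-pixel rule line crossed by one vertical text stroke.
img = np.zeros((40, 60), dtype=bool)
img[20:22, :] = True      # rule line
img[10:31, 30] = True     # text stroke crossing the rule line
(rho, theta), = hough_lines(img, top_k=1)
thickness = estimate_thickness(img, rho, theta)
cleaned = clean_rule_line(img, rho, theta, thickness)
```

In this toy case the rule line is erased everywhere except where the stroke crosses it, which is the behavior the abstract motivates: a direct mask-based erase would have cut the stroke in two. A real document would need skew handling, multiple lines (`top_k` greater than 1), and the restoration step, none of which are attempted here.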

Document Details

Document Type
Technical Report
Publication Date
Oct 15, 2009
Accession Number
ADA510335

Entities

People

  • Prem Natarajan

Organizations

  • BBN Technologies

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence Software
  • Computer Vision
  • Data Sets
  • Databases
  • Department Of Defense
  • Detection
  • Frequency Domain
  • Governments
  • Hidden Markov Models
  • Language
  • Machine Learning
  • Models
  • Recognition
  • Standards
  • Supervised Machine Learning
  • Training

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Image Processing and Computer Vision
  • Microwave Engineering

Technology Areas

  • Space
  • Space - Space Objects