Machine Printed Text and Handwriting Identification in Noisy Document Images

Abstract

In this paper we address the problem of identifying text in noisy document images. We focus on segmenting and distinguishing handwriting from machine printed text because: 1) handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content, and 2) the segmentation and recognition techniques required for machine printed and handwritten text differ significantly. A novel aspect of our approach is that we treat noise as a separate class and model it using selected features. Trained Fisher classifiers are used to distinguish machine printed text and handwriting from noise, and we further exploit context to refine the classification. A Markov Random Field (MRF) based approach is used to model the geometrical structure of the printed text, handwriting, and noise in order to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.
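
As a rough illustration of the classification step mentioned in the abstract, the sketch below fits a two-class Fisher linear discriminant with NumPy. It is not the report's implementation: the function names `fit_fisher` and `predict_fisher`, the two-dimensional synthetic features, the small ridge term, and the midpoint threshold are all assumptions made for this example, and the report's actual features and multi-class setup are not reproduced here.

```python
import numpy as np


def fit_fisher(X0, X1):
    """Fit a two-class Fisher linear discriminant.

    X0, X1: arrays of shape (n_samples, n_features), one per class.
    Returns the projection vector w and a decision threshold.
    """
    m0 = X0.mean(axis=0)
    m1 = X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Small ridge term for numerical stability (an assumption of this sketch).
    S_w += 1e-6 * np.eye(S_w.shape[0])
    # Fisher direction: w is proportional to S_w^{-1} (m1 - m0).
    w = np.linalg.solve(S_w, m1 - m0)
    # Threshold at the midpoint of the projected class means (assumption).
    threshold = 0.5 * (w @ m0 + w @ m1)
    return w, threshold


def predict_fisher(X, w, threshold):
    """Return 1 where the projection exceeds the threshold, else 0."""
    return (X @ w > threshold).astype(int)


# Synthetic stand-ins for feature vectors: class 0 = noise, class 1 = text.
rng = np.random.default_rng(0)
noise_feats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
text_feats = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

w, threshold = fit_fisher(noise_feats, text_feats)
labels_noise = predict_fisher(noise_feats, w, threshold)
labels_text = predict_fisher(text_feats, w, threshold)
```

In a pipeline like the one the abstract describes, the output of such a classifier would only be an initial labeling; the contextual refinement and MRF-based correction are separate steps not shown here.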

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2003
Accession Number
ADA459230

Entities

People

  • David S. Doermann
  • Huiping Li
  • Yefeng Zheng

Organizations

  • University of Maryland

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Abstracts
  • Classification
  • Computer Vision
  • Formal Languages
  • Handwriting
  • Identification
  • Image Processing
  • Image Recognition
  • Information Operations
  • Instructions
  • Language
  • Recognition
  • Two Dimensional
  • Universities

Fields of Study

  • Computer science
  • Engineering

Readers

  • Computer Science
  • Neural Network Machine Learning
  • Speech Processing/Speech Recognition