A Methodology for Empirical Performance Evaluation of Page Segmentation Algorithms

Abstract

Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) systems. While numerous page segmentation algorithms have been proposed, there is relatively little literature on comparative evaluation, empirical or theoretical, of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: 1) automatic training of algorithms with free parameters and 2) statistical and error analysis of experimental results. In this thesis, we use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: 1) we first create mutually exclusive training and test datasets with groundtruth, 2) we then select a meaningful and computable performance metric, 3) an optimization procedure is then used to search automatically for the optimal parameter values of the segmentation algorithms, 4) the segmentation algorithms are then evaluated on the test dataset, and finally 5) a statistical error analysis is performed to give the statistical significance of the experimental results. The automatic training of algorithms is posed as an optimization problem, and a direct search method, the simplex method, is used to search for a set of optimal parameter values. A paired-model statistical analysis and an error analysis are conducted to provide confidence intervals for the experimental results and to interpret the behavior of the algorithms. This methodology is applied to the evaluation of five page segmentation algorithms, of which three are representative research algorithms and the other two are well-known commercial products, on 978 images from the University of Washington III dataset. It is found that the performances of the Voronoi, Docstrum and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which in turn is significantly better than that of the X-Y cut algorithm.
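
The training and analysis steps described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation. It assumes that "the simplex method" refers to a Nelder-Mead style direct search (a simplified variant is shown), that the performance metric is expressed as an error to minimize, and that the paired analysis can be approximated by a normal-theory confidence interval on per-image score differences, which is only reasonable when many images are available. The names `nelder_mead`, `toy_training_error` and `paired_confidence_interval` are hypothetical, and the quadratic objective is a stand-in for running a segmentation algorithm on the training set.

```python
import math
from statistics import NormalDist, mean, stdev


def nelder_mead(f, x0, step=1.0, max_iter=500, tol=1e-8):
    """Minimize f with a simplified Nelder-Mead simplex direct search."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex displaced along each axis.
    simplex = [list(x0)] + [
        [x0[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Stop once every vertex has collapsed onto the best one.
        size = max(abs(v[j] - best[j]) for v in simplex for j in range(n))
        if size < tol:
            break
        # Centroid of all vertices except the worst.
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line through the centroid and the worst vertex.
            return [centroid[j] + t * (worst[j] - centroid[j]) for j in range(n)]

        reflected = along(-1.0)
        if f(reflected) < f(best):
            expanded = along(-2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best vertex.
                simplex = [best] + [
                    [(best[j] + v[j]) / 2 for j in range(n)] for v in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_training_error(params):
    # Hypothetical stand-in for the error of a segmentation algorithm
    # on the training set as a function of two free parameters.
    a, b = params
    return (a - 3.0) ** 2 + (b + 1.0) ** 2


def paired_confidence_interval(scores_a, scores_b, level=0.95):
    """Normal-approximation interval for the mean per-image score difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    centre = mean(diffs)
    return centre - half_width, centre + half_width


trained_params = nelder_mead(toy_training_error, [0.0, 0.0])
```

Under these assumptions, two algorithms would be judged significantly different on the test set when the interval returned by `paired_confidence_interval` excludes zero. Pairing by image removes the variation caused by some pages being harder than others for every algorithm.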

Document Details

Document Type: Technical Report
Publication Date: Dec 01, 1999
Accession Number: ADA458685

Entities

People

  • Song Mao
  • Tapas Kanungo

Organizations

  • University of Maryland

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Character Recognition
  • Computer Vision
  • Detection
  • Error Analysis
  • Errors
  • False Alarms
  • Information Operations
  • Language
  • Mathematics
  • Military Research
  • Optical Character Recognition
  • Recognition
  • Simplex Method
  • Statistical Analysis
  • Test And Evaluation

Fields of Study

  • Computer science

Readers

  • Business Analytics
  • Computer Vision
  • Systems Analysis and Design