A Statistical Approach to Retrieving Historical Manuscript Images without Recognition

Abstract

Collections of handwritten historical documents in libraries and elsewhere are often of interest to researchers, students, and the general public. Convenient access to such corpora generally requires an index that allows one to locate the individual text units (pages, sentences, lines) relevant to a given query, which is usually provided as text. Several solutions are possible: manual annotation (very expensive), handwriting recognition (poor results), and word spotting, an image matching approach (computationally expensive). In this work, the authors present a novel retrieval approach for historical document collections that does not require recognition. They assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, they extract discrete feature vectors and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e., a word image), they can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well, with a mean average precision of 89% for 4-word queries on a subset of George Washington's manuscripts.
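The estimation step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the report: the abstract does not give the estimator, so the smoothed mixture over training images used here, the function name `label_posteriors`, the smoothing weights `lam_w` and `lam_f`, and the toy feature tokens such as "width:3" are all assumptions made for illustration only.

```python
from collections import Counter

def label_posteriors(train, query_feats, lam_w=0.1, lam_f=0.1):
    """Return P(label | query_feats) for every label seen in training.

    train: list of (label, [discrete feature tokens]) pairs.
    Assumed estimator (not from the report): the joint P(label, feats)
    is a uniform mixture over training images, with each image's label
    and feature distributions smoothed against collection-wide counts.
    """
    n = len(train)
    label_bg = Counter(label for label, _ in train)
    feat_bg = Counter(f for _, feats in train for f in feats)
    total_feats = sum(feat_bg.values())

    joint = {}
    for w in label_bg:
        p = 0.0
        for label, feats in train:
            counts = Counter(feats)
            size = max(len(feats), 1)  # guard against empty feature lists
            # Smoothed probability that this training image carries label w.
            p_w = (1 - lam_w) * (1.0 if label == w else 0.0) + lam_w * label_bg[w] / n
            # Smoothed probability that this image generates the query features.
            p_f = 1.0
            for f in query_feats:
                p_f *= (1 - lam_f) * counts[f] / size + lam_f * feat_bg[f] / total_feats
            p += p_w * p_f / n
        joint[w] = p

    z = sum(joint.values())
    if z == 0.0:
        # No query feature was ever seen in training: fall back to uniform.
        return {w: 1.0 / len(joint) for w in joint}
    # Condition on the observed features by normalizing the joint.
    return {w: p / z for w, p in joint.items()}

# Hypothetical toy training set: (label, discretized features of the word image).
train = [
    ("alexandria", ["width:3", "height:1", "descenders:0"]),
    ("alexandria", ["width:3", "height:1", "descenders:1"]),
    ("orders", ["width:1", "height:0", "descenders:1"]),
    ("the", ["width:0", "height:0", "descenders:0"]),
]

# Features of an unlabeled word image, in the same hypothetical vocabulary.
query = ["width:3", "height:1", "descenders:0"]
post = label_posteriors(train, query)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Because every label in the training vocabulary receives a probability, a text query can in principle be scored against unlabeled word images without committing to a single recognized transcription, which is the sense in which retrieval here proceeds "without recognition".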

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2003
Accession Number: ADA478157

Entities

People

  • R. Manmatha
  • Toni M. Rath
  • Victor Lavrenko

Organizations

  • Naval Information Warfare Systems Command

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Automatic
  • Coefficients
  • Computer Vision
  • Data Sets
  • Databases
  • Handwriting
  • Information Retrieval
  • Language
  • Object Recognition
  • Precision
  • Probability
  • Probability Distributions
  • Recognition
  • Statistical Samples
  • Statistical Sampling
  • Training
  • Vocabulary

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Vision
  • Library and Information Science