Retrieving Historical Manuscripts Using Shape

Abstract

Convenient access to handwritten historical document collections in libraries generally requires an index that allows one to locate individual text units (pages, sentences, lines) relevant to a given query, which is usually provided as text. Currently, such collections are annotated and organized through extensive manual labor, because handwriting recognition approaches perform poorly on old documents. In this work, we present a novel retrieval approach for historical document collections that does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we extract discrete feature vectors and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e., a word image), we can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well, with a mean average precision of 89% for 4-word queries on a subset of George Washington's manuscripts. We also show that the approach may be extended to general shapes: the same model and a similar feature set are used for retrieval on two different shape datasets.
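The estimation step described in the abstract (a joint distribution over discrete features and word labels, then conditional label probabilities for a new feature vector) can be illustrated with a minimal sketch. This is not the relevance model from the report: it swaps in a simpler naive Bayes-style estimate that assumes features are conditionally independent given the label, with additive smoothing. The function names, the `alpha` parameter, and the feature tokens such as `w_wide` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples, alpha=1.0):
    """Count labels and per-label feature tokens.

    samples: iterable of (label, [discrete feature tokens]).
    alpha:   additive smoothing constant (an assumption of this sketch).
    """
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for label, feats in samples:
        label_counts[label] += 1
        feat_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, feat_counts, vocab, alpha


def label_posteriors(model, feats):
    """Return P(label | feats) for every label seen in training."""
    label_counts, feat_counts, vocab, alpha = model
    total = sum(label_counts.values())
    log_joint = {}
    for label, n in label_counts.items():
        # Smoothed P(feature | label), assuming independent features.
        denom = sum(feat_counts[label].values()) + alpha * len(vocab)
        lp = math.log(n / total)
        for f in feats:
            lp += math.log((feat_counts[label][f] + alpha) / denom)
        log_joint[label] = lp
    # Normalize the joint scores in log space to get conditionals.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {label: math.exp(v - m) / z for label, v in log_joint.items()}


# Toy training set: hypothetical discretized features per word image.
samples = [
    ("washington", ["w_wide", "h_tall", "asc_2"]),
    ("washington", ["w_wide", "h_tall", "asc_1"]),
    ("the", ["w_narrow", "h_tall", "asc_2"]),
]
model = train(samples)
posteriors = label_posteriors(model, ["w_wide", "h_tall", "asc_2"])
```

For retrieval, each word image in a collection could be scored by the conditional probability of the query label under such a model, and the images ranked by that score; how the report itself performs ranking is not specified in this record.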

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2003
Accession Number
ADA477913

Entities

People

  • R. Manmatha
  • Toni M. Rath
  • Victor Lavrenko

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Computer Vision
  • Data Sets
  • Databases
  • Handwriting
  • Information Retrieval
  • Language
  • Object Recognition
  • Precision
  • Probability
  • Probability Distributions
  • Recognition
  • Rotation
  • Standards
  • Statistical Samples
  • Statistical Sampling
  • Training
  • Vocabulary

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Vision