Full-Text Access to Historical Newspapers

Abstract

Newspapers are rich records of U.S. history. Because older newspapers are deteriorating, the National Endowment for the Humanities is archiving 19th-century newspapers on microfilm. Although microfilm is a good preservation method, it gives researchers and the general public only limited access. We are building a system to provide universal access to digital images and full-text content of historical newspapers. The system has three main components: a) an Optical Character Recognition (OCR) module that converts digitized images into searchable text and identifies regions, b) an Information Retrieval module that applies linguistic information to aid in segmentation, indexing, and retrieval of the noisy OCR'd text, and c) a User Interface module that allows historians and educators to query and view retrieved documents. Thus far, we have developed two OCR techniques targeted at processing historical newspapers, and we have built a user interface that searches the OCR output and superimposes matches on a page image from the newspaper.
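
As a rough illustration of the last step described above (searching noisy OCR output and locating matches on the page image), the following Python sketch may help. It is not the report's method: the abstract does not describe the actual OCR or retrieval techniques, so the `OcrWord` record, the `find_matches` function, the 0.8 similarity threshold, and the sample page data are all hypothetical. Approximate matching here uses the standard library's `difflib.SequenceMatcher`, chosen only because it tolerates character-level OCR errors.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class OcrWord:
    """One recognized word plus its bounding box on the page image (hypothetical format)."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(words, query, threshold=0.8):
    """Return the OCR words whose text is close enough to the query term.

    The threshold is an assumed value; a real system would tune it.
    """
    return [w for w in words if similarity(w.text, query) >= threshold]


# Made-up OCR output for one line of a page; "hist0ry" simulates an OCR error.
page = [
    OcrWord("Local", (10, 10, 60, 25)),
    OcrWord("hist0ry", (65, 10, 130, 25)),
    OcrWord("records", (135, 10, 200, 25)),
]

hits = find_matches(page, "history")
for word in hits:
    # A user interface could draw a highlight rectangle at word.box.
    print(word.text, word.box)
```

The bounding boxes returned for each hit are what a viewer would need in order to overlay highlights on the scanned page.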

Document Details

Document Type: Technical Report
Publication Date: Apr 01, 1999
Accession Number: ADA458699

Entities

People

  • Robert B. Allen
  • Tapas Kanungo

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Counter WMD

DTIC Thesaurus Topics

  • Abstracts
  • Character Recognition
  • Digital Images
  • History
  • Humanities
  • Image Processing
  • Images
  • Information Operations
  • Information Retrieval
  • Language
  • Newspapers
  • Optical Character Recognition
  • Periodicals
  • Recognition
  • Standards
  • User Interface

Readers

  • Computer Science/Computer Engineering/Data Science/Digital Signal Processing
  • Library and Information Science
  • Military History of the United States in the 20th Century

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval