Parsing and Tagging of Bilingual Dictionaries

Abstract

Bilingual dictionaries hold great potential as a source of lexical resources for training and testing automated systems for optical character recognition, machine translation, and cross-language information retrieval. In this paper, we describe a system for extracting term lexicons from printed bilingual dictionaries. Our work was divided into three phases: dictionary segmentation, entry tagging, and generation. In segmentation, pages are divided into logical entries based on structural features learned from selected examples. The extracted entries are associated with functional labels and passed to a tagging module, which associates linguistic labels with each word or phrase in the entry. The output of the system is a structure that represents the entries from the dictionary. We have used this approach to parse a variety of dictionaries with both Latin and non-Latin alphabets, and we demonstrate the results of term lexicon generation for retrieval from a collection of French news stories using English queries.
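
The three phases named in the abstract can be pictured with a toy sketch. This is not the report's method: the report learns segmentation from structural features of page images, whereas the code below substitutes a fixed line split and a single regular expression. Every name here (`Entry`, `segment`, `tag`, `generate`, `ENTRY_RE`) and the assumed entry layout "headword, optional part-of-speech abbreviation, comma-separated translations" are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    """A tagged dictionary entry (hypothetical structure)."""
    headword: str
    pos: Optional[str]
    translations: List[str]


def segment(page_text: str) -> List[str]:
    """Phase 1 stand-in: assume each non-blank line is one entry."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


# Assumed layout: headword, optional abbreviation such as "n.", translations.
ENTRY_RE = re.compile(
    r"^(?P<head>[^\s,;]+)\s+(?:(?P<pos>[a-z]+\.)\s+)?(?P<trans>.+)$"
)


def tag(entry_text: str) -> Entry:
    """Phase 2 stand-in: label the parts of one entry with a fixed pattern."""
    match = ENTRY_RE.match(entry_text)
    if match is None:
        raise ValueError(f"unrecognized entry layout: {entry_text!r}")
    translations = [t.strip() for t in match.group("trans").split(",") if t.strip()]
    return Entry(match.group("head"), match.group("pos"), translations)


def generate(entries: List[Entry]) -> Dict[str, List[str]]:
    """Phase 3 stand-in: build a headword -> translations term lexicon."""
    lexicon: Dict[str, List[str]] = {}
    for entry in entries:
        lexicon.setdefault(entry.headword, []).extend(entry.translations)
    return lexicon


page = "chat n. cat\nmaison n. house, home\nbonjour hello\n"
lexicon = generate([tag(text) for text in segment(page)])
```

With the sample page above, `lexicon` maps each French headword to its list of English translations. A lexicon of this shape is the kind of output that could feed a cross-language retrieval experiment.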

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2003
Accession Number
ADA459226

Entities

People

  • Burcu Karagol-Ayan
  • David S. Doermann
  • Douglas W. Oard
  • Huanfeng Ma
  • Jianqiang Wang

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Character Recognition
  • Computational Science
  • Computer Vision
  • Dictionaries
  • Feature Extraction
  • Hidden Markov Models
  • Identification
  • Information Retrieval
  • Language
  • Machine Learning
  • Markov Models
  • Models
  • Probability
  • Recognition
  • Separators
  • Statistics

Fields of Study

  • Computer science

Readers

  • Computational Linguistics

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation