Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day

Abstract

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models is accomplished for fine-grained POS tag spaces. Experiments show high-accuracy, fine-grained tag resolution with minimal new human effort.
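The abstract mentions weighted Levenshtein alignments without giving details. As a rough illustration only, the sketch below computes a weighted edit distance in which vowel-for-vowel substitutions are cheaper than other substitutions, so that an inflected form with an internal vowel change still scores as close to its root. The function names, the cost values, and the English word pairs are all assumptions made for this example; they are not taken from the report, and the report's generative model is not reproduced here.

```python
# Illustrative sketch only: cost values and names are hypothetical,
# not the weights or model used in the report.

VOWELS = set("aeiou")


def vowel_aware_cost(a, b):
    """Hypothetical substitution cost: identical letters are free,
    vowel-to-vowel changes are cheap, everything else costs 1."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


def weighted_levenshtein(source, target, sub_cost, indel_cost=1.0):
    """Return the minimum total cost of edits turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # Deleting every source letter, or inserting every target letter.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost

    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # delete source[i-1]
                dist[i][j - 1] + indel_cost,  # insert target[j-1]
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


if __name__ == "__main__":
    # Internal vowel change versus regular suffixation.
    print(weighted_levenshtein("sing", "sang", vowel_aware_cost))
    print(weighted_levenshtein("walk", "walked", vowel_aware_cost))
```

With these assumed costs, a stem-internal vowel change is cheaper than a two-letter suffix, which is one simple way a single distance measure could cover both irregular and regular patterns.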

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2002
Accession Number: ADA460572

Entities

People

  • David Yarowsky
  • Silviu Cucerzan

Organizations

  • Johns Hopkins University

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Accuracy
  • Agreements
  • Books
  • Computer Science
  • Costs
  • Dictionaries
  • Errors
  • Foreign Languages
  • Grammars
  • Language
  • Learning
  • Probability
  • Sequences
  • Supervised Machine Learning
  • Supervision
  • Test And Evaluation
  • Training

Fields of Study

  • Linguistics

Readers

  • Computational Linguistics

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation
  • Space