A Formal Model of Ambiguity and its Applications in Machine Translation

Abstract

Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations and algorithms for manipulating them are discussed in depth in depth. To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation it is shown that translating a set of transcription hypotheses yields better translations compared to a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well again leading to improved translation quality.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2010
Accession Number
ADA630855

Entities

People

  • Christopher J. Dyer

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Air Platforms
  • Autonomy
  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Automata Theory
  • Bayesian Networks
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Grammars
  • Information Processing
  • Information Science
  • Language
  • Linguistics
  • Machine Learning
  • Markov Models
  • Natural Language Processing
  • Probabilistic Models
  • Probability
  • Probability Distributions
  • Two Dimensional

Readers

  • Computational Linguistics
  • Operations Research
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Machine Translation