Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?

Abstract

We report on our experience with building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus, and the results of a task-based pilot evaluation of statistical MT systems trained on sets of ca. 1300 and ca. 5000 parallel sentences of Tamil and English data. Our results show that even with apparently incomprehensible system output, humans without any knowledge of Tamil can achieve performance rates as high as 86% accuracy for topic identification, 93% recall for document retrieval, and 64% recall on question answering (plus an additional 14% partially correct answers).

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2001
Accession Number
ADA460337

Entities

People

  • Ulrich Germann

Organizations

  • University of Southern California

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Classification
  • Coding
  • Computational Linguistics
  • Computational Science
  • Decoding
  • Information Processing
  • Information Science
  • Language
  • Language Translation
  • Linguistics
  • Machine Translation
  • Natural Language Processing
  • Sri Lanka
  • Test Sets
  • Translations

Readers

  • Computational Linguistics
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation