The Web as a Parallel Corpus

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

Document Details

Document Type
Technical Report
Publication Date
Jul 01, 2002
Accession Number
ADA459234

Entities

People

  • Noah A. Smith
  • Philip Resnik

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Autonomy
  • C4I

DTIC Thesaurus Topics

  • Abstracts
  • Artificial Intelligence
  • Classification
  • Contracts
  • Formal Languages
  • Information Operations
  • Instructions
  • Internet
  • Language
  • Low Density
  • Natural Language Processing
  • Natural Languages
  • Supervised Machine Learning
  • Universities
  • World Wide Web

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Database Systems and Applications
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation