Learning Best Practices for Building Automatic Transcription and Translation Systems

Abstract

Automatic speech recognition (ASR) and machine translation (MT) have direct and immediate applications in Department of Defense (DoD) operational capabilities and in intelligence gathering. Both have received substantive DoD funding. In recent years, research has led to human-like performance in some languages and yielded widely available consumer offerings. Despite such advances, for many languages that interest the DoD, limited training data exists and there is insufficient commercially driven interest to yield operational performance. To support work in such languages, the DoD continues to fund programs that support improvements in the underlying machine learning approaches used for MT and ASR (often in combination with other research goals). The University of Southern California's Information Sciences Institute (USC-ISI) is actively involved in several such programs. In parallel to supporting algorithmic improvements, these DoD efforts have yielded benchmark datasets for diverse languages. Additional benchmark datasets are available from European Union projects (e.g., ParaCrawl) and research initiatives (e.g., WMT). However, while funded research supports ongoing technical advances, compute limitations mean that research advances are often tested only on program-specific datasets. For example, in a year, a program might focus on 2-3 languages and a downstream task that leverages ASR/MT (e.g., cross-lingual search; extracting "database records" from non-English text). While research aims to be language independent, researchers do not know how broadly applicable techniques (and challenges) are across languages (as opposed to characteristic of some subset of human languages), and furthermore they have a limited understanding of which approaches work under which conditions for which languages. Thus, when faced with an operational need in a new language, technologists are forced to take their best guess.
Running today's state-of-the-art tools on a much larger number of languages would help the DoD be prepared to apply human-language-based sensors and analysis in the face of an ever-changing world. Given the need to rapidly apply technologies in a new region, it would increase the probability that language technology already exists and, more importantly, provide an empirical basis for generalizing to new languages. The primary limitation in running experiments on additional languages is neither experimenter time nor data: to meet existing program needs we have developed a robust experimental framework, and benchmark data exists from prior programs. The limitation is rather machine power: as an example, in current research we find that training a state-of-the-art ASR system for a new language requires ~4TB of permanent storage with ~6TB of experiment space, and ~48 GPUs while identifying the best set of parameters. USC-ISI proposes equipment purchases to support testing our ASR and MT on a more diverse set of languages. This will result in (a) a report that benchmarks state-of-the-art tools; (b) models and software available to the DoD through existing delivery channels; and (c) journal submissions detailing the results and commentary on the current best approaches to achieve state-of-the-art performance, highlighting cross-lingual commonalities and differences. The proposed experiments will be run by research teams that include a mix of faculty, researchers, and students, thus providing educational and mentoring opportunities to students in the areas of human language technology and machine learning. After the proposed period of performance, these compute resources will be incorporated into the existing USC-ISI-wide shared compute pool and will be available to researchers and students working on other DoD and other agency-funded programs at USC-ISI, thus supporting future research opportunities.

Document Details

Document Type
DoD Grant Award
Publication Date
Mar 30, 2020
Source ID
W911NF2010030

Entities

People

  • Marjorie Freedman

Organizations

  • Army Contracting Command
  • United States Army
  • University of Southern California

Tags

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Distributed Systems and Data Platform Development
  • Research Science/Academic Research

Technology Areas

  • AI & ML
  • AI & ML - DoD AI Strategy
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks
  • Space