Leveraging Small-Lexicon Language Models

Abstract

This final report describes the "Leveraging Small-Lexicon Language Models" project, contracted for 18 months under DARPA LORELEI. We focused on Asia-Pacific, a global hotspot of disaster risk with high language density but few electronic data resources and little off-the-shelf language technology. CRCL provided data and initial analysis for five major language families with varied typology: Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai, and Sino-Tibetan, which together include about 2,000 languages. We delivered more than 1,000 lects from some 500 distinct ISO 639-3 codes, including over 850,000 lexemes. Data came mainly from relatively small, high-quality print lexicons developed for linguistic purposes (language sketch, survey, and comparative analysis); these are the only resources widely available throughout the region. Primary effort went to normalizing phonological transcription and semantic glossing (using the MetaForm and MetaGloss frameworks we devised), identifying cognate sets, and producing various types of phonological and semantic analysis of the lexicons. We also distributed a multilingual humanitarian assistance and disaster relief (HA/DR) thesaurus of disaster-related terms. All language materials are available for re-use under the CC 4.0 license.

Document Details

Document Type
Technical Report
Publication Date
Dec 31, 2016
Accession Number
AD1031925

Entities

People

  • Cooper, Doug

Tags

DTIC Thesaurus Topics

  • Computational Linguistics
  • Computer Languages
  • Data Sets
  • Dictionaries
  • Disasters
  • Formal Languages
  • Geographic Regions
  • Grammars
  • Humanitarian Assistance
  • Language
  • Linguistics
  • Machine Translation
  • Materials
  • Natural Language Processing
  • Recognition
  • Semantics
  • Social Media

Readers

  • Computational Linguistics
  • Software Engineering

Technology Areas

  • Microelectronics