Leveraging Small-Lexicon Language Models

Abstract

This final report describes the "Leveraging Small-Lexicon Language Models" project, contracted for 18 months under DARPA LORELEI. We focused on the Asia-Pacific region, a global hotspot of disaster risk with high language density but few electronic data resources and little off-the-shelf language technology. CRCL provided data and initial analysis for five major families with varied typology: Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai, and Sino-Tibetan (together these include about 2,000 languages). We delivered data for more than 1,000 lects representing some 500 distinct ISO 639-3 codes, including over 850,000 lexemes. The data came mainly from smallish, high-quality print lexicons developed for linguistic purposes (language sketch, survey, and comparative analysis); these are the only resources that are widely available throughout the region. Our primary effort went to normalizing phonological transcription and semantic glossing (using the MetaForm and MetaGloss frameworks we devised), identifying cognate sets, and producing various types of phonological and semantic analysis of the lexicons. We also distributed a multilingual HA/DR thesaurus of disaster-related terms. All language materials are available for re-use under the CC 4.0 license.

Document Details

Document Type: Technical Report
Publication Date: Dec 31, 2016
Accession Number: AD1031925

Entities

People

Cooper, Doug
