Leveraging Small-Lexicon Language Models
Abstract
This final report describes the "Leveraging Small-Lexicon Language Models" project, contracted for 18 months under DARPA LORELEI. We focused on Asia-Pacific, a global hotspot of disaster risk with high language density but few electronic data resources and little off-the-shelf language technology. CRCL provided data and initial analysis for five major language families with varied typology: Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai, and Sino-Tibetan, which together include about 2,000 languages. We delivered data for more than 1,000 lects covering some 500 distinct ISO 639-3 codes, totaling over 850,000 lexemes. The data came mainly from small, high-quality print lexicons developed for linguistic purposes (language sketch, survey, and comparative analysis); these are the only resources widely available throughout the region. Our primary effort went to normalizing phonological transcription and semantic glossing (using the MetaForm and MetaGloss frameworks we devised), identifying cognate sets, and producing various types of phonological and semantic analysis of the lexicons. We also distributed a multilingual HA/DR thesaurus of disaster-related terms. All language materials are available for re-use under the CC 4.0 license.
Document Details
- Document Type: Technical Report
- Publication Date: Dec 31, 2016
- Accession Number: AD1031925
Entities
People
- Cooper, Doug