Finding Relevant Data in a Sea of Languages
Abstract
A cross-language search engine combines language identification, machine translation, information retrieval, and query-biased summarization techniques to enable English monolingual analysts to find foreign language documents relevant to their investigations. "About 6,000 languages are currently spoken in the world today," says Elizabeth Salesky of Lincoln Laboratory's Human Language Technology (HLT) Group. "Within the law enforcement community, there are not enough multilingual analysts who possess the necessary level of proficiency to understand and analyze content across these languages," she continues. Salesky and her colleagues are now working to solve this problem of too many languages and too few specialized analysts for law enforcement agencies, and their work also has potential applications for the Department of Defense and the Intelligence Community. The research team is taking advantage of major advances in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken foreign languages can be used more efficiently. "With HLT, an equivalent of 20 times more foreign language analysts are at your disposal," says Salesky. One area in which Laboratory researchers are focusing their efforts is cross-language information retrieval (CLIR). The Cross-LAnguage Search Engine, or CLASE, is a CLIR tool developed by the HLT Group for the Federal Bureau of Investigation (FBI). CLASE is a fusion of Laboratory research in language identification, machine translation, information retrieval, and query-biased summarization. CLASE enables English monolingual analysts to help search for and filter foreign language documents, tasks that have traditionally been restricted to foreign language analysts. Laboratory researchers considered three algorithmic approaches to CLIR that have emerged in the HLT research community:
Document Details
- Document Type: Technical Report
- Publication Date: Apr 26, 2016
- Accession Number: AD1033691
Entities
People
- Elizabeth E. Salesky
- Jennifer Drexler
- Michael A. Coury
Organizations
- MIT Lincoln Laboratory