Finding Relevant Data in a Sea of Languages

Abstract

A cross-language search engine combines language identification, machine translation, information retrieval, and query-biased summarization techniques to enable English monolingual analysts to find foreign language documents relevant to their investigations. "About 6,000 languages are currently spoken in the world today," says Elizabeth Salesky of Lincoln Laboratory's Human Language Technology (HLT) Group. "Within the law enforcement community, there are not enough multilingual analysts who possess the necessary level of proficiency to understand and analyze content across these languages," she continues. This problem of too many languages and too few specialized analysts is one Salesky and her colleagues are now working to solve for law enforcement agencies, but their work has potential application for the Department of Defense and Intelligence Community. The research team is taking advantage of major advances in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken foreign languages can be used more efficiently. "With HLT, an equivalent of 20 times more foreign language analysts are at your disposal," says Salesky. One area in which Laboratory researchers are focusing their efforts is cross-language information retrieval (CLIR). The Cross-LAnguage Search Engine, or CLASE, is a CLIR tool developed by the HLT Group for the Federal Bureau of Investigation (FBI). CLASE is a fusion of Laboratory research in language identification, machine translation, information retrieval, and query-biased summarization. CLASE enables English monolingual analysts to help search for and filter foreign language documents, tasks that have traditionally been restricted to foreign language analysts. Laboratory researchers considered three algorithmic approaches to CLIR that have emerged in the HLT research community:

Document Details

Document Type
Technical Report
Publication Date
Apr 26, 2016
Accession Number
AD1033691

Entities

People

  • Elizabeth E. Salesky
  • Jennifer Drexler
  • Michael A. Coury

Organizations

  • MIT Lincoln Laboratory

Tags

Communities of Interest

  • Human Systems

DTIC Thesaurus Topics

  • Accuracy
  • Automated Speech Recognition
  • Automated Text Summarization
  • Department Of Defense
  • Foreign Languages
  • Identification
  • Identification Systems
  • Information Retrieval
  • Language
  • Law Enforcement
  • Machine Translation
  • Natural Language Processing
  • Online Communications
  • Recognition
  • Translations
  • United States
  • United States Government

Readers

  • Computational Linguistics

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation