The Effect of Bilingual Term List Size on Dictionary-Based Cross-Language Information Retrieval

Abstract

Bilingual term lists are extensively used as a resource for dictionary-based Cross-Language Information Retrieval (CLIR), in which the goal is to find documents written in one natural language based on queries that are expressed in another. This paper identifies eight types of terms that affect retrieval effectiveness in CLIR applications through their coverage by general-purpose bilingual term lists, and reports results from an experimental evaluation of the coverage of 35 bilingual term lists in a news retrieval application. Retrieval effectiveness was found to be strongly influenced by term list size for lists that contain between 3,000 and 30,000 unique terms per language. Supplemental techniques for named entity translation were found to be useful with even the largest lexicons. The contribution of named entity translation was evaluated in a cross-language experiment involving English and Chinese. Smaller effects were observed from deficiencies in the coverage of domain-specific terminology when searching news stories.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
ADA447948

Entities

People

  • Dina Demner-Fushman
  • Douglas W. Oard

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Biomedical
  • Cyber

DTIC Thesaurus Topics

  • Ablation
  • Abstracts
  • Chinese Language
  • Computational Complexity
  • Computational Linguistics
  • Computer Science
  • Dictionaries
  • English Language
  • Information Retrieval
  • Language
  • Linguistics
  • Natural Languages
  • Privatization
  • Translations
  • Universities
  • Vocabulary
  • Words (Language)

Fields of Study

  • Computer science

Readers

  • Information Retrieval
  • Speech Processing/Speech Recognition
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation