Arabic Information Retrieval at UMass in TREC-10

Abstract

The University of Massachusetts took on the TREC-10 cross-language track with no prior experience with Arabic and no Arabic speakers among our researchers or students. We intended to implement some standard approaches and to extend a language modeling approach to handle co-occurrences. Given the lack of resources (training data, electronic bilingual dictionaries, and stemmers) and our unfamiliarity with Arabic, we had our hands full carrying out standard approaches to monolingual and cross-language Arabic retrieval, and did not submit any runs based on novel approaches. We submitted three monolingual runs and one cross-language run. We first describe the models, techniques, and resources we used, then describe each run in detail. Our official runs performed moderately well, placing in the second tier (3rd or 4th place). Since submitting these results, we have improved normalization and stemming, improved dictionary construction, expanded Arabic queries, improved estimation and smoothing in language models, and added combination of evidence, substantially increasing performance.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2006
Accession Number: ADA456273

Entities

People

  • Leah S. Larkey
  • Margaret E. Connell

Organizations

  • University of Massachusetts Amherst

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Coding
  • Computer Science
  • Dictionaries
  • Information Operations
  • Information Retrieval
  • Language
  • Machine Translation
  • Personality
  • Precision
  • Probability
  • Probability Distributions
  • Standards
  • Stemming
  • Translations
  • Universities
  • Word Lists

Readers

  • Computational Linguistics
  • Computational Modeling and Simulation

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation
  • Microelectronics