RMIT University at TREC 2008: Legal Track

Abstract

This paper reports on the participation of RMIT University in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system and substantially hinder the retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effectiveness. In this study, we investigate the effects on retrieval performance of OCR error minimization, carried out through de-hyphenation of terms and the removal of corrupted or "noise" terms. Our results indicate that removing noise terms can lead to significant savings in index size.
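
The abstract names two cleaning steps, de-hyphenation and noise-term removal, but does not state the rules used for either. The Python sketch below is only an illustration under assumed rules, not the report's method: `dehyphenate`, `is_noise`, `index_terms`, and the thresholds `max_len` and `max_run` are all hypothetical names and values chosen for this example.

```python
import re

# Assumption: an end-of-line hyphen appears as "-" plus a newline between two
# runs of lowercase letters. The report's actual rule is not given here.
_LINE_BREAK_HYPHEN = re.compile(r"([a-z]+)-\s*\n\s*([a-z]+)")


def dehyphenate(text):
    """Join word halves that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


def is_noise(term, max_len=20, max_run=3):
    """Hypothetical noise test for a lowercase alphanumeric token.

    A term is flagged when it is overlong, mixes letters with digits,
    or repeats one character more than max_run times in a row.
    """
    if len(term) > max_len:
        return True
    if not (term.isalpha() or term.isdigit()):
        return True
    return re.search(r"(.)\1{%d,}" % max_run, term) is not None


def index_terms(text):
    """Lowercase, de-hyphenate, tokenize, and drop terms flagged as noise."""
    cleaned = dehyphenate(text.lower())
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [t for t in tokens if not is_noise(t)]
```

Dropping flagged terms before indexing shrinks the vocabulary, which is the kind of mechanism by which noise removal could reduce index size; how much it saves depends on the collection and on the rules actually used.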

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2008
Accession Number
ADA512673

Entities

People

  • Andrew Turpin
  • Falk Scholer
  • Ying Zhang

Organizations

  • RMIT University

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Computer Science
  • Dictionaries
  • Hard Copy
  • Information Operations
  • Information Retrieval
  • Information Systems
  • Language
  • Metadata
  • Natural Languages
  • Precision
  • Schools
  • Standards
  • Universities

Readers

  • Approximation Theory
  • Computational Linguistics
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval