PARADISE Based Search Engine at TREC 2009 Web Track

Abstract

This paper describes the PARADISE search engine as used in the TREC 2009 Web Track. PARADISE stands for Platform for Applying, Researching And Developing Intelligent Search Engine; it is a search engine platform developed by the SEWM group at Peking University. The system is designed to support both English and Chinese information retrieval. For this year's Web Track, the system preprocessed and indexed five hundred million web pages. In the preprocessing stage, page templates were removed, character encodings were identified and unified, and anchor text and inlink information were extracted with the MapReduce framework (Hadoop, in this system). For retrieval, our runs used an extension of BM25. This model distinguishes terms from different fields and integrates both term counts and position information. Some web-based features are also considered.
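The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of a field-aware BM25, in the style commonly called BM25F. It is not the authors' model: the field names, the field weights, and the values of K1 and B are assumptions made for illustration, and the position information mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter

# Hypothetical field weights and BM25 parameters; none of these values
# come from the report.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}
K1 = 1.2
B = 0.75


def weighted_tf(doc, term):
    """Sum the term's count in each field, scaled by that field's weight."""
    return sum(
        FIELD_WEIGHTS[field] * Counter(tokens)[term]
        for field, tokens in doc.items()
    )


def weighted_len(doc):
    """Document length with each field's token count scaled by its weight."""
    return sum(FIELD_WEIGHTS[field] * len(tokens) for field, tokens in doc.items())


def score(query, doc, docs):
    """BM25-style score of doc for query, computed over the collection docs."""
    n = len(docs)
    avg_len = sum(weighted_len(d) for d in docs) / n
    total = 0.0
    for term in query:
        # Document frequency: number of documents containing the term in any field.
        df = sum(1 for d in docs if weighted_tf(d, term) > 0)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = weighted_tf(doc, term)
        # Length normalization uses the field-weighted document length.
        norm = 1 - B + B * weighted_len(doc) / avg_len
        total += idf * tf * (K1 + 1) / (tf + K1 * norm)
    return total
```

Each document is a dictionary mapping a field name to its list of tokens. With these assumed weights, a query term found in a title contributes more than the same term found only in anchor text or body text, which is one simple way to distinguish terms from different fields.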

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2009
Accession Number
ADA517729

Entities

People

  • Dongdong Shan
  • Dongsheng Zhao
  • Hongfei Yan
  • Jing He

Organizations

  • Peking University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Coding
  • Computational Processes
  • Computer Science
  • Computing-Related Activities
  • Distributed Computing
  • Elimination
  • Information Operations
  • Information Retrieval
  • Language
  • Link Analysis
  • Preprocessing
  • Schools
  • Standards
  • Test Beds
  • Universities

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Library and Information Science

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval