Query-Structure Based Web Page Indexing

Abstract

Indexing is a crucial technique for dealing with the massive amount of data present on the web. In our third participation in the Web track, at TREC 2012, we explore the idea of building an efficient query-based indexing system over a web page collection. Our prototype examines trends in user queries and indexes texts accordingly, using particular attributes available in the documents. This paper provides an in-depth description of our approach to indexing web documents efficiently; in particular, the topics present in the web documents are discovered with the help of knowledge available in Wikipedia. The well-defined articles in Wikipedia are shown to be valuable as a training set when indexing web pages. Our complex index structure also records information from titles and URLs, and takes web domains into account. Our approach is designed to close gaps, observed for some queries, in our approaches from the previous two years. Our framework is able to efficiently index the 50 million pages in subset B of the ClueWeb09 collection. Our preliminary experiments on the TREC 2012 test queries showed that our indexing scheme is robust and efficient for both indexing and retrieving relevant web pages, in both the ad-hoc and diversity tasks.
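
As an illustration of an index that records titles, URLs, and web domains separately, the following is a minimal Python sketch of a field-aware inverted index. It is not the authors' implementation: the class name `FieldIndex`, the tokenizer, and the field weights (title 3.0, URL 2.0, body 1.0) are all assumptions chosen only to make the idea concrete, and the Wikipedia-based topic discovery described above is not modeled.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


class FieldIndex:
    """Hypothetical inverted index that keeps one posting set per field."""

    def __init__(self):
        # term -> field name -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))
        # document id -> web domain taken from the URL
        self.domains = {}

    def add(self, doc_id, url, title, body):
        parsed = urlparse(url)
        self.domains[doc_id] = parsed.netloc.lower()
        fields = {
            "title": title,
            "url": parsed.netloc + " " + parsed.path,
            "body": body,
        }
        for field, text in fields.items():
            for term in tokenize(text):
                self.postings[term][field].add(doc_id)

    def search(self, query, weights=None):
        # Assumed weights: a title or URL match counts more than a body match.
        weights = weights or {"title": 3.0, "url": 2.0, "body": 1.0}
        scores = defaultdict(float)
        for term in tokenize(query):
            # .get avoids creating empty entries for unseen terms
            for field, docs in self.postings.get(term, {}).items():
                for doc_id in docs:
                    scores[doc_id] += weights[field]
        # Highest score first; ties broken by document id
        return sorted(scores, key=lambda d: (-scores[d], d))
```

In this sketch a page whose title and URL both contain a query term ranks above a page that mentions the term only in its body, and the stored domain of each page stays available for later filtering or diversification.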

Document Details

Document Type: Technical Report
Publication Date: Nov 01, 2012
Accession Number: ADA581513

Entities

People

  • Diana Inkpen
  • Falah H. Al-akashi

Organizations

  • University of Ottawa

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Civil Rights
  • Computer Science
  • Computing-Related Activities
  • Dictionaries
  • Electrical Engineering
  • Electronic Mail
  • Engineering
  • Frequency
  • Hash Tables
  • Information Operations
  • Information Retrieval
  • Link Analysis
  • Mathematics
  • Storage
  • Vocabulary

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Information Retrieval