Facet Classification of Blogs: Know-Center at the TREC 2009 Blog Distillation Task

Abstract

In this paper, we outline the experiments we carried out for the TREC 2009 Blog Distillation Task. Our system is based on a plain-text index extracted from the XML feeds of the TREC Blogs08 dataset. We used this index to retrieve candidate blogs for the given topics. The retrieved blogs were then classified using a Support Vector Machine trained on a manually labelled subset of the TREC Blogs08 dataset. Our experiments comprised three runs, each on a different feature set: nouns, stylometric properties, and punctuation statistics. Facet identification with this approach was successful, although a significant number of candidate blogs were not retrieved at all.
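
To make the third feature set more concrete, the following is a minimal sketch of how punctuation statistics for a blog post might be computed before being passed to a classifier. It is an illustration only, not the authors' implementation: the abstract does not say which marks were counted or how counts were normalised, so the set `PUNCTUATION_MARKS`, the function name `punctuation_features`, and the per-character normalisation are all assumptions.

```python
from collections import Counter

# Hypothetical choice of marks; the abstract does not list the ones used.
PUNCTUATION_MARKS = ".,;:!?\"'()-"


def punctuation_features(text):
    """Return the relative frequency of each mark in PUNCTUATION_MARKS.

    Counts are divided by the total number of characters so that long and
    short posts are comparable (an assumed normalisation). An empty text
    yields a vector of zeros.
    """
    total = len(text)
    if total == 0:
        return [0.0] * len(PUNCTUATION_MARKS)
    counts = Counter(ch for ch in text if ch in PUNCTUATION_MARKS)
    # One value per mark, in the fixed order of PUNCTUATION_MARKS.
    return [counts[mark] / total for mark in PUNCTUATION_MARKS]
```

Calling `punctuation_features(post_text)` gives one fixed-length vector per post. Vectors of this kind, paired with the manually assigned facet labels, are the sort of input a Support Vector Machine could be trained on.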

Document Details

Document Type: Technical Report
Publication Date: Nov 01, 2009
Accession Number: ADA517854

Entities

People

  • Andreas Juffinger
  • Elisabeth Lex
  • Michael Granitzer

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Classification
  • Computer Languages
  • Data Mining
  • Distillation
  • Feature Selection
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Online Communications
  • Precision
  • Standards
  • Statistics
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Database Systems and Applications
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval