Adaptive Web-page Content Identification

Abstract

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages 80-97% of the time, depending on experimental factors such as whether duplicate documents are removed and whether the model is applied to unseen sources.
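
The sequence labeling framing in the abstract can be illustrated with a minimal sketch. It assumes a page has already been split into an ordered list of text blocks, and it shows only the decoding step of a linear-chain model (Viterbi search over "content" and "other" labels). The feature names, the weights, and the example blocks are hypothetical and hand-set for illustration; they are not taken from the report, where a trained Conditional Random Field would learn such weights from labeled pages.

```python
import re

LABELS = ("other", "content")

def block_features(text):
    """Hypothetical per-block features (illustrative, not from the report)."""
    words = text.split()
    return {
        "long": len(words) >= 10,
        "has_sentence_end": bool(re.search(r"[.!?]\s*$", text)),
        "has_pipe_menu": "|" in text,
    }

# Hand-set illustrative weights; a trained CRF would estimate these from data.
EMIT = {
    "content": {"long": 2.0, "has_sentence_end": 1.0, "has_pipe_menu": -2.0},
    "other": {"long": -1.0, "has_sentence_end": -0.5, "has_pipe_menu": 1.5},
}
# Transition scores reward staying in the same label, so content forms runs.
TRANS = {
    ("other", "other"): 0.5, ("other", "content"): -0.5,
    ("content", "content"): 1.0, ("content", "other"): -0.5,
}

def emit_score(label, feats):
    """Sum the weights of the features that fire for this block."""
    return sum(w for name, w in EMIT[label].items() if feats[name])

def viterbi(blocks):
    """Return the highest-scoring label sequence for the ordered blocks."""
    feats = [block_features(b) for b in blocks]
    if not feats:
        return []
    best = [{lab: emit_score(lab, feats[0]) for lab in LABELS}]
    back = []
    for f in feats[1:]:
        scores, ptrs = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS[(p, lab)])
            scores[lab] = best[-1][prev] + TRANS[(prev, lab)] + emit_score(lab, f)
            ptrs[lab] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final label to recover the path.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

blocks = [
    "Home | World | Sports | Login",
    "The city council voted on Tuesday to approve a new budget for road repairs.",
    "Officials said the work would begin next month.",
    "Copyright 2007 | Terms of Use",
]
labels = viterbi(blocks)
print(list(zip(labels, blocks)))
```

Because each label is scored jointly with its neighbors rather than block by block, the decoder favors contiguous runs of content, which is the property that makes a sequence model a natural fit for this task.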

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2007
Accession Number: ADA470494

Entities

People

  • Ben Wellner
  • John Gibson
  • Susan Lubar

Organizations

  • MITRE Corporation

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence Software
  • Computer Languages
  • Data Sets
  • Extraction
  • Gaussian Distributions
  • Identification
  • Language
  • Machine Learning
  • Markov Models
  • Models
  • Natural Language Processing
  • Natural Languages
  • Probabilistic Models
  • Probability
  • Supervised Machine Learning
  • Websites

Fields of Study

  • Computer science

Readers

  • Applied Combinatorial Optimization and Logic Circuit Design
  • Computational Linguistics
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks