Adaptive Web-page Content Identification

Abstract

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of web-pages anywhere from 80-97% of the time depending on experimental factors such as ensuring the absence of duplicate documents and application of the model against unseen sources.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2007
Accession Number: ADA470494

Entities

People

Ben Wellner
John Gibson
Susan Lubar

Organizations

MITRE Corporation

Adaptive Web-page Content Identification

Abstract

Document Details

Entities

People

Organizations

Tags

DTIC Thesaurus Topics

Fields of Study

Readers

Technology Areas