What's in a URL? Genre Classification from URLs

Abstract

The importance of URLs in the representation of a document cannot be overstated. Shorthand mnemonics such as wiki or blog are often embedded in a URL to convey its functional purpose or genre. Other mnemonics have evolved from use (e.g., a WordPress particle is strongly suggestive of blogs). Can we leverage this predictive power to induce the genre of a document from the representation of its URL? This paper presents a methodology for webpage genre classification from URLs which, to our knowledge, has not been previously attempted. Experiments using machine learning techniques to evaluate this claim show promising results, and a novel algorithm for character n-gram decomposition is provided. Such a capability could be useful to improve personalized search results, disambiguate content, efficiently crawl the Web in search of relevant documents, and construct behavioral profiles from clickstream data without parsing the entire document.
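
To make the idea of character n-gram features concrete, the following is a minimal Python sketch of a generic decomposition of a URL into character n-grams. It is not the report's novel algorithm, which this record does not describe. The function name url_char_ngrams, the n-gram range of 3 to 5, and the choice to split on non-alphanumeric characters are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_char_ngrams(url, n_min=3, n_max=5):
    """Count character n-grams in the host and path of a URL.

    Hypothetical helper for illustration only; the n-gram range and
    the tokenization rule are assumptions, not taken from the report.
    """
    parsed = urlparse(url.lower())
    # Ignore the scheme, query string, and fragment; keep host and path.
    text = parsed.netloc + parsed.path
    # Split on punctuation so n-grams do not span separators like "." or "/".
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    counts = Counter()
    for token in tokens:
        for n in range(n_min, n_max + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    features = url_char_ngrams("http://myblog.wordpress.com/2012/01/post")
    # Genre-suggestive particles such as "blog" surface as features.
    print(features["blog"], features["press"])
```

Counts of this kind could then be passed as a feature vector to any standard supervised classifier. Which classifiers and feature weightings the authors actually evaluated is not stated in this record.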

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2012
Accession Number
ADA599843

Entities

People

  • David W. Aha
  • Myriam Abramson

Organizations

  • United States Naval Research Laboratory

Tags

Communities of Interest

  • Autonomy
  • Cyber

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Classification
  • Cognitive Science
  • Computational Linguistics
  • Computational Science
  • Data Mining
  • Feature Extraction
  • Information Processing
  • Information Science
  • Language
  • Learning
  • Linguistics
  • Machine Learning
  • Personality
  • Probability
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Computational Linguistics
  • Information Retrieval

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks