New Framework for Cross-Domain Document Classification

Abstract

Automatic text document classification is a fundamental problem in machine learning. Given the dynamic nature and the exponential growth of the World Wide Web, one needs the ability to classify not only a massive number of documents, but also documents that belong to wide variety of domains. Some examples of the domains are e-mails, blogs, Wikipedia articles, news articles, newsgroups, online chats, etc. It is the difference in the writing style that differentiates these domains. Text documents are usually classified using supervised learning algorithms that require large set of pre-labeled data. This requirement, of labeled data, poses a challenge in classifying documents that belong to different domains. Our goal is to classify text documents in the testing domain without requiring any labeled documents from the same domain. Our research develops specialized cross-domain learning algorithms based the distributions over words obtained from a collection of text documents by topic models such as Latent Dirichlet Allocation (LDA). Our major contributions include (1) empirically showing that conventional supervised learning algorithms fail to generalize their learned models across different domains and (2) development of novel and specialized cross-domain classification algorithms that show an appreciable improvement over conventional methods used for cross-domain classification that is consistent for different datasets. Our research addresses many real-world needs. Since massive number of new types of text documents is generated daily, it is crucial to have the ability to transfer learned information from one domain to another domain. Cross-domain classification lets us leverage information learned from one domain for use in the classification of documents in a new domain.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2011
Accession Number
ADA543816

Entities

People

  • Anjum Gupta

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automata Theory
  • Bayesian Networks
  • Computational Science
  • Computer Languages
  • Computer Programs
  • Data Mining
  • Dimensionality Reduction
  • Generative Models
  • Information Science
  • Machine Learning
  • Natural Language Processing
  • Network Science
  • Ontologies
  • Probability Distributions
  • Recreation
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Neural Network Machine Learning.
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks