World Knowledge as Indirect Supervision for Document Clustering

Abstract

One of the key obstacles to making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are therefore how to adapt world knowledge to specific domains and how to represent it for learning. In this article, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and we represent the data together with world knowledge as a heterogeneous information network. We then propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base that is automatically extracted from Wikipedia and maps its knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
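
As a rough illustration of the representation the abstract mentions, documents and typed entities can be stored as typed nodes and typed links of a heterogeneous information network. The Python sketch below is not the authors' implementation: the toy documents, the entity types, and the helper name `build_hin` are all hypothetical, and the entities are assumed to have already been linked and disambiguated against a knowledge base.

```python
from collections import defaultdict

# Hypothetical toy input: each document lists (entity, type) pairs that an
# entity linker is assumed to have already resolved against a knowledge base.
docs = {
    "d1": [("Barack Obama", "Person"), ("United States", "Country")],
    "d2": [("United States", "Country"), ("NASA", "Organization")],
}

def build_hin(docs):
    """Return typed nodes and typed document-entity links of a toy network."""
    node_type = {}            # node name -> node type
    edges = defaultdict(set)  # (source type, target type) -> set of (source, target)
    for doc_id, entities in docs.items():
        node_type[doc_id] = "Document"
        for name, etype in entities:
            node_type[name] = etype
            # One link type per entity type, e.g. ("Document", "Country").
            edges[("Document", etype)].add((doc_id, name))
    return node_type, dict(edges)

node_type, edges = build_hin(docs)
```

Keeping one link set per pair of node types is what makes the network heterogeneous in this sketch: a clustering method can then treat document-person links differently from document-country links. How the article's algorithm actually uses these links, or the sub-type constraints, is not specified in the abstract and is not modeled here.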

Document Details

Document Type
Pub Defense Publication
Publication Date
Dec 26, 2016
Source ID
10.1145/2953881

Entities

People

  • Chenguang Wang
  • Dan Roth
  • Jiawei Han
  • Ming Zhang
  • Yangqiu Song

Organizations

  • Australian RL Commission
  • Defense Advanced Research Projects Agency
  • Hong Kong University of Science and Technology
  • National Institute of General Medical Sciences
  • National Science Foundation
  • Peking University
  • University of Illinois Urbana–Champaign

Tags

Fields of Study

  • Computer science

Readers

  • Artificial Intelligence
  • Computational Linguistics
  • Distributed Systems and Data Platform Development