Deriving Concept Hierarchies From Text

Abstract

This paper presents a means of automatically deriving a hierarchical organization of concepts from a set of documents without the use of training data or standard clustering techniques. Instead, salient words and phrases extracted from the documents are organized hierarchically using a type of co-occurrence known as subsumption. The resulting structure is displayed as a series of hierarchical menus. When the hierarchy is generated from a set of retrieved documents, a user browsing the menus is given a detailed overview of the documents' content in a manner distinct from existing overview and summarization techniques. The methods used to build the structure are simple but appear to be effective: a small-scale user study reveals that the generated hierarchy possesses properties expected of such a structure, in that general terms are placed at the top levels and lead to related, more specific terms below. The formation and presentation of the hierarchy are described, along with the user study and some other informal evaluations.

The organization of a set of documents into a concept hierarchy derived automatically from the set itself is undoubtedly one goal of information retrieval. Were this goal to be achieved, the documents would be organized into a form somewhat like existing manually constructed subject hierarchies, such as the Library of Congress categories or the Dewey Decimal system, the only difference being that the categories would be customized to the set of documents itself. For example, from a collection of media-related articles, the category "Entertainment" might appear near the top level; below it, amongst others, one might find the category "Movies", a type of entertainment; and below that, there could be the category "Actors & Actresses", an aspect of movies. As can be seen, the arrangement of the categories provides an overview of the topic structure of those articles.
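
The abstract names subsumption but does not spell out the test, so the following is only an illustrative sketch. It assumes the commonly cited form of the rule: a term x subsumes a term y when P(x | y) is at least some threshold (0.8 here) and P(y | x) is below 1, with both probabilities estimated from document co-occurrence. The function name `subsumption_pairs`, the `threshold` parameter, and the toy documents are hypothetical and are not taken from the report.

```python
from collections import defaultdict
from itertools import permutations


def subsumption_pairs(doc_terms, threshold=0.8):
    """Return (parent, child) pairs where parent subsumes child.

    Assumed rule (not quoted from the report): x subsumes y when
    P(x | y) >= threshold and P(y | x) < 1, estimated from the
    number of documents in which the terms occur.
    """
    # Map each term to the set of document indices that contain it.
    postings = defaultdict(set)
    for doc_id, terms in enumerate(doc_terms):
        for term in set(terms):
            postings[term].add(doc_id)

    pairs = []
    # Test every ordered pair of distinct terms.
    for x, y in permutations(sorted(postings), 2):
        both = len(postings[x] & postings[y])
        p_x_given_y = both / len(postings[y])
        p_y_given_x = both / len(postings[x])
        if p_x_given_y >= threshold and p_y_given_x < 1.0:
            pairs.append((x, y))
    return pairs


# Toy collection echoing the "Entertainment" / "Movies" / "Actors" example.
docs = [
    ["entertainment", "movies", "actors"],
    ["entertainment", "movies"],
    ["entertainment", "music"],
    ["entertainment"],
]
pairs = subsumption_pairs(docs)
for parent, child in pairs:
    print(f"{parent} -> {child}")
```

On this toy input the general term "entertainment" ends up as a parent of "movies", which in turn is a parent of "actors". The sketch does not prune transitive links (such as "entertainment" to "actors") or select salient terms, both of which a full hierarchy builder would need.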

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2005
Accession Number
ADA439413

Entities

People

  • Bruce Croft
  • Mark R Sanderson

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Abstracts
  • Automobiles
  • Classification
  • Clustering
  • Commerce
  • Computer Science
  • Electric Automobiles
  • Electric Vehicles
  • Frequency
  • Hierarchies
  • Information Retrieval
  • Poliomyelitis
  • Standards
  • Test And Evaluation
  • Thesauri
  • United States
  • Vehicles

Readers

  • Computational Linguistics
  • Library and Information Science
  • Theoretical Analysis

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval