Creating, Using and Updating Thesauri Files for AutoMap and ORA

Abstract

AutoMap [1] is text analysis software that performs Network Text Analysis by running an automated process on a corpus of raw text data to generate one or more meta-networks which include the nodes and links representing relations among entities described. Automap uses thesaurus files [1] when creating meta-networks. These thesaurus files are list which allows the association of words or phrases found in texts with abstract concepts and/or node classes used in the extracted meta-networks. Over time, a large number of thesauri have been created. Many of the extant thesauri contain entries that are relevant to new text analysis projects. But thesaurus re-use is difficult due to the number of thesauri. In this report, we describe one approach to making thesaurus re-use easier by combining and reconciling multiple thesauri into one under user control. With this approach, the process of creating a Meta network out of a raw corpus of text data is more efficient and the user is able to perform a more accurate analysis of the Meta network, as the individual thesauri files can be merged to create a single and large Universal or Master Thesaurus containing all the general abstract concepts, along with several different Domain-specific thesauri. In the following report, we first discuss the differences between a Universal thesaurus and the domain or the project specific thesauri. We then go on to discuss the evolution in the formats of the thesauri used by AutoMap, followed by a discussion of the standard Dynamic Network Analysis (DNA) meta-ontology [1]. We then detail the process used to create a single universal/master thesaurus and several different Domain thesauri. The process involves a mix of two major processes which we refer to as the Split routine and the Merge routine. We shall discuss the Split routine and the merge routine algorithm along with the process that has been used to merge and create a single thesaurus file by combining a large number of thesauri files.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Jul 26, 2012
Accession Number: ADA566559

Entities

People

Abhinav Sangal
Kathleen Carley
Michael K. Martin
Neal Altman

Organizations

Carnegie Mellon University

Creating, Using and Updating Thesauri Files for AutoMap and ORA

Abstract

Document Details

Entities

People

Organizations

Tags

Communities of Interest

DTIC Thesaurus Topics

Fields of Study

Readers