Creating, Using and Updating Thesauri Files for AutoMap and ORA

Abstract

AutoMap [1] is text analysis software that performs Network Text Analysis: it runs an automated process on a corpus of raw text to generate one or more meta-networks, whose nodes and links represent the entities described in the text and the relations among them. AutoMap uses thesaurus files [1] when creating meta-networks. A thesaurus file is a list that associates words or phrases found in texts with abstract concepts and/or node classes used in the extracted meta-networks. Over time, a large number of thesauri have been created, and many of them contain entries that are relevant to new text analysis projects. However, the sheer number of thesauri makes re-use difficult. In this report, we describe one approach to making thesaurus re-use easier: combining and reconciling multiple thesauri into one under user control. The individual thesaurus files are merged into a single, large Universal (or Master) thesaurus containing all the general abstract concepts, alongside several Domain-specific thesauri. This makes creating a meta-network from a raw text corpus more efficient and allows the user to analyze the meta-network more accurately. We first discuss the differences between a Universal thesaurus and domain- or project-specific thesauri. We then discuss the evolution of the thesaurus formats used by AutoMap, followed by the standard Dynamic Network Analysis (DNA) meta-ontology [1]. Finally, we detail the process used to create a single Universal/Master thesaurus and several Domain thesauri. The process combines two major routines, which we refer to as the Split routine and the Merge routine. We describe the algorithms for both routines, along with the process used to combine a large number of thesaurus files into a single thesaurus file.
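
As a rough illustration of what combining and reconciling thesauri under user control could look like, the Python sketch below merges several phrase-to-concept mappings and sets aside disagreeing entries for a user to resolve. It is not the report's Merge routine or AutoMap's file format: the Entry type, the merge_thesauri function, the case-insensitive matching, and the sample node class labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    concept: str      # abstract concept the phrase maps to
    node_class: str   # node class label (sample values here are hypothetical)


def merge_thesauri(thesauri):
    """Merge phrase -> Entry mappings.

    Returns (merged, conflicts). Phrases whose entries disagree across
    thesauri are kept out of `merged` and listed in `conflicts` so that
    a user can decide which mapping to keep.
    """
    merged = {}
    conflicts = {}
    for thesaurus in thesauri:
        for phrase, entry in thesaurus.items():
            key = phrase.strip().lower()  # assumption: case-insensitive match
            if key in conflicts:
                conflicts[key].add(entry)
            elif key not in merged:
                merged[key] = entry
            elif merged[key] != entry:
                # Disagreement: move both candidates to the conflict list.
                conflicts[key] = {merged.pop(key), entry}
    return merged, conflicts


# Two small, made-up thesauri that agree on one phrase and disagree on another.
first = {
    "United States": Entry("USA", "location"),
    "Dr.": Entry("doctor", "agent"),
}
second = {
    "united states": Entry("USA", "location"),
    "dr.": Entry("drive", "location"),
}
merged, conflicts = merge_thesauri([first, second])
```

In this sketch, identical entries collapse into one, while the two readings of "dr." are left for the user to reconcile rather than being resolved automatically.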

Document Details

Document Type
Technical Report
Publication Date
Jul 26, 2012
Accession Number
ADA566559

Entities

People

  • Abhinav Sangal
  • Kathleen Carley
  • Michael K. Martin
  • Neal Altman

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical
  • C4I

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Analysis
  • Data Management
  • Data Sets
  • Directories
  • Military Research
  • Observation
  • Ontologies
  • Personal Information Managers
  • Specialization
  • Standards
  • Thesauri

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Distributed Systems and Data Platform Development
  • Library and Information Science