Uncovering and Managing the Impact of Methodological Choices for the Computational Construction of Socio-Technical Networks from Texts

Abstract

This thesis is motivated by the need for scalable and reliable methods and technologies that support the construction of network data from text data. The resulting data can be used to answer substantive and graph-theoretical questions about socio-technical networks. One main limitation of constructing network data from text data is that validating the resulting networks can be difficult or even infeasible, e.g., for covert, historical, and large-scale networks. This thesis addresses this problem by identifying how the coding choices that must be made when extracting network data from text data affect the structure of networks and network analysis results. My findings suggest that conducting reference resolution on text data can alter the identity and weight of 76% of the nodes and 23% of the links, and can cause major changes in the values of commonly used network metrics. Also, performing reference resolution prior to relation extraction leads to the retrieval of completely different sets of key entities than when this pre-processing technique is not applied. Based on the outcome of the presented experiments, I recommend strategies for avoiding or mitigating the identified issues in practical applications. When extracting socio-technical networks from texts, the set of relevant node classes might go beyond the classes that are typically supported by tools for named entity extraction. I address this lack of technology by developing an entity extractor that combines two components: an ontology for socio-technical networks that originates from the social sciences, is theoretically grounded, and has been empirically validated in prior work; and a supervised machine learning technique that is based on probabilistic graphical models.
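
To illustrate why reference resolution can change node and link weights, the following is a minimal sketch, not the method used in the thesis. It builds a sentence-level co-occurrence network from a toy list of entity mentions, once as extracted and once after mapping aliases and pronouns to a canonical name. The function `build_network`, the `alias_map` dictionary, and all sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import combinations


def build_network(sentences, alias_map=None):
    """Count node mentions and co-occurrence links.

    sentences: list of lists of entity mentions (one inner list per sentence).
    alias_map: optional dict mapping a surface form to its canonical name
               (a stand-in for the output of a reference resolution step).
    """
    alias_map = alias_map or {}
    node_weights = Counter()
    link_weights = Counter()
    for mentions in sentences:
        # Replace each mention with its canonical name, if one is known.
        resolved = [alias_map.get(m, m) for m in mentions]
        node_weights.update(resolved)
        # Link every pair of distinct entities that share a sentence.
        for a, b in combinations(sorted(set(resolved)), 2):
            link_weights[(a, b)] += 1
    return node_weights, link_weights


# Invented toy data: three sentences with their extracted mentions.
sentences = [
    ["Dr. Smith", "Acme"],
    ["Smith", "Acme"],
    ["she", "Jones"],
]
# Hypothetical resolution result: both forms refer to "Smith".
alias_map = {"Dr. Smith": "Smith", "she": "Smith"}

raw_nodes, raw_links = build_network(sentences)
res_nodes, res_links = build_network(sentences, alias_map)

print("without resolution:", dict(raw_nodes), dict(raw_links))
print("with resolution:   ", dict(res_nodes), dict(res_links))
```

In this toy case, resolution merges three surface forms into one node, which raises that node's weight and consolidates links that were previously split across aliases. Changes of this kind then propagate into network metrics computed on the result.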

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2012
Accession Number
ADA600419

Entities

People

  • Jana Diesner

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical
  • C4I
  • Energy and Power Technologies
  • Engineered Resilient Systems
  • Ground and Sea Platforms

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Cognitive Science
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Data Mining
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Named Entity Recognition
  • Natural Language Processing
  • Network Science
  • Ontologies
  • Self Organizing Systems
  • Social Networking Services

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks