Leveraging Topic Models to Develop Metrics for Evaluating the Quality of Narrative Threads Extracted from News Stories

Abstract

Analysts and software systems are increasingly tasked with making sense of a growing amount of data to help their organizations make decisions involving risk and uncertainty. A key enabler of this work is the ability to quickly discover structure in large amounts of text such as news stories and blogs. Recent work in this area has shown that it is possible to automatically link documents from a corpus to build a narrative structure, called a story chain, without the need for prior domain knowledge. This approach is unsupervised and discovers large numbers of story chains of variable quality. In this paper, we describe and evaluate methods to identify the most coherent and informative story chains. We explore two types of topic-model-based analytics. The first type is a measure of representativeness that captures how well a story chain represents the corpus from which it was generated. This is done by comparing the topics found over time in a story chain against those expressed in the corpus during the same time period. Our hypothesis is that story chains whose topic expression is similar to that of the corpus will convey narratives that are central to the corpus. This type of analytic could help an analyst quickly focus on the key narratives in a large corpus of documents. The second type is a measure of the quality of a story chain and is composed of topic consistency and topic persistence measures. Our hypothesis is that high-quality chains would be composed of sequences of stories that have clearly defined primary topics that persist across significant portions of the story chain. We used these analytics to predict the clarity of story chains within one of four categories: (1) very clear narrative, (2) somewhat clear narrative, (3) somewhat unclear narrative, and (4) very unclear narrative. We found that we were able to train a data model to label story chains with the same label as human coders 77% of the time.
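The abstract does not give formulas for its analytics, so the following is only a minimal sketch of how they might be computed, not the report's actual method. It assumes that each story and each corpus time window has already been reduced to a topic probability vector by some topic model. The use of Jensen-Shannon divergence as the similarity measure, the particular definitions of consistency and persistence, and all function names are illustrative assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two
    discrete distributions. Assumed similarity basis, not from the report."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def representativeness(chain_topics, corpus_topics):
    """Mean topic similarity (1 - JSD) between chain and corpus over the
    time windows they share. Both arguments map window -> topic vector."""
    shared = sorted(set(chain_topics) & set(corpus_topics))
    if not shared:
        return 0.0
    sims = [1.0 - js_divergence(chain_topics[t], corpus_topics[t]) for t in shared]
    return sum(sims) / len(sims)


def topic_consistency(story_topics):
    """Hypothetical consistency: mean weight of each story's primary topic."""
    return sum(max(d) for d in story_topics) / len(story_topics)


def topic_persistence(story_topics):
    """Hypothetical persistence: longest run of consecutive stories sharing
    the same primary topic, as a fraction of the chain length."""
    primary = [max(range(len(d)), key=d.__getitem__) for d in story_topics]
    best = run = 1
    for prev, cur in zip(primary, primary[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best / len(primary)


# Toy chain of four stories over two topics (made-up numbers).
chain = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
persistence = topic_persistence(chain)
consistency = topic_consistency(chain)
```

Under these assumptions, scores such as these could serve as input features for a classifier that predicts the four clarity labels; the abstract does not say which features or model the authors used.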

Document Details

Document Type
Technical Report
Publication Date
Oct 23, 2015
Accession Number
AD1057785

Entities

People

  • Alicia Ruvinsky
  • Jason Schlachter
  • Luis Asencios Reynoso
  • Naren Ramakrishnan
  • Sathappan Muthiah

Organizations

  • Lockheed Martin Advanced Technology Center

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Bayesian Networks
  • Consistency
  • Data Analysis
  • Data Mining
  • Data Sets
  • English Language
  • Governments
  • Language
  • Machine Learning
  • Models
  • Networks
  • Neural Networks
  • Probabilistic Models
  • Probability
  • Test And Evaluation
  • Text Analytics
  • Unsupervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Artificial Intelligence
  • Distributed Systems and Data Platform Development
  • Military History of the United States in the 20th Century