Fine-Tuning A Multilingual Language Model to Prune Automated Event Data

Abstract

Every day, an enormous volume of written and transcribed media is produced, making it impossible for intelligence analysts to sift through it all without a large human workforce. However, multilingual language models can help analysts select media articles relevant to their problem set, even those written in a foreign or low-resource language, by filtering out irrelevant articles. The Global Database of Events, Language, and Tone (GDELT) is a near-real-time media database that releases new collections of open-source articles every 15 minutes, but its automated event coding often produces a high number of false positives.

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2023
Accession Number
AD1213533

Entities

People

  • Seth W. Kyler

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Energy and Power Technologies
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Automata Theory
  • Big Data
  • California
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Computer Science
  • Computers
  • Data Science
  • Data Sets
  • Early Warning Systems
  • Intelligence Analysts
  • Language
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Social Media
  • Supervised Machine Learning
  • United States
  • Warning Systems

Readers

  • Artificial Intelligence
  • Geospatial Intelligence and Artificial Intelligence Analytics
  • Speech Processing/Speech Recognition