Finding Malicious Cyber Discussions in Social Media

Abstract

Security analysts gather essential information on cyber attacks, exploits, vulnerabilities, and victims by manually searching social media sites. This effort can be dramatically reduced using natural language machine learning techniques. Using a new English text corpus containing more than 250k discussions from Stack Exchange, Reddit, and Twitter on cyber and non-cyber topics, we demonstrate the ability to detect more than 90% of the cyber discussions with fewer than 1% false alarms. If an original searched document corpus includes only 5% cyber documents, then our processing provides an enriched corpus for analysts where 83% to 95% of the documents are on cyber topics. Good performance was obtained using TF-IDF features and logistic regression. A classifier trained using prior historical data accurately detected 86% of emergent Heartbleed discussions, and retrospective experiments demonstrate that classifier performance remains stable for up to a year without retraining.
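
The enrichment figures in the abstract follow from Bayes-style arithmetic on the prevalence, detection rate, and false-alarm rate. The sketch below is only an illustration of that arithmetic, not code from the report: the function name `enriched_fraction` is hypothetical, and the 0.25% false-alarm rate in the second call is an assumed value chosen to show how the upper end of the quoted range could arise.

```python
def enriched_fraction(prevalence, detection_rate, false_alarm_rate):
    """Fraction of flagged documents that are truly on cyber topics.

    prevalence:       share of cyber documents in the searched corpus
    detection_rate:   share of cyber documents the classifier flags
    false_alarm_rate: share of non-cyber documents the classifier flags
    """
    true_hits = prevalence * detection_rate
    false_hits = (1.0 - prevalence) * false_alarm_rate
    return true_hits / (true_hits + false_hits)


# Figures taken from the abstract: 5% cyber documents,
# 90% detection, 1% false alarms.
baseline = enriched_fraction(0.05, 0.90, 0.01)

# Assumption for illustration only: a 0.25% false-alarm rate
# (this value does not appear in the abstract).
lower_false_alarms = enriched_fraction(0.05, 0.90, 0.0025)

print(f"1% false alarms:    {baseline:.1%} of flagged documents are cyber")
print(f"0.25% false alarms: {lower_false_alarms:.1%} of flagged documents are cyber")
```

At the abstract's boundary values (exactly 90% detection and 1% false alarms) the enriched fraction comes out just under 83%, so the low end of the quoted range corresponds to operating slightly better than those bounds. Because non-cyber documents outnumber cyber ones 19 to 1 at 5% prevalence, small changes in the false-alarm rate move the enriched fraction far more than comparable changes in the detection rate.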

Document Details

Document Type
Technical Report
Publication Date
Feb 02, 2016
Accession Number
AD1033861

Entities

People

  • Alyssa C. Mensch
  • David J. Weller-Fahy
  • Joseph P. Campbell
  • Richard P. Lippmann
  • William M. Campbell

Organizations

  • Massachusetts Institute of Technology

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computer Languages
  • Computer Networks
  • Computer Security
  • Cyberattacks
  • Cybersecurity
  • Denial Of Service Attack
  • Department Of Defense
  • Detectors
  • False Alarms
  • Feature Selection
  • Generative Models
  • Information Science
  • Internet
  • Language
  • Machine Learning
  • Media
  • Neural Networks
  • Online Communications
  • Ontologies
  • Preprocessing
  • Probability
  • Social Media
  • Social Networking Services
  • Supervised Machine Learning
  • Training
  • United States Government
  • Warning Systems

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Cybersecurity
  • Information Retrieval

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • Cyber