Frequency-Based Feature Extraction for Malware Classification

Abstract

Traditional signature-based malware detection is effective, but it can only identify known malicious programs. This thesis applies machine-learning techniques to identify previously unknown malware in a set of Windows executable programs. We computed histograms of the 4-, 8-, and 16-bit sequence values contained in each program, then evaluated the effectiveness of using these histograms, in part or in full, as feature vectors for machine-learning experiments. We also explored how an offset at the beginning of each program affects classifier performance. We show that a machine-learning classifier can be learned from these features, with an F-measure in excess of 90% attained in one of our experiments. Using only part of the histogram as the feature vector did not significantly affect classifier performance up to a point, nor did including an offset. Our results also suggest that features derived from histograms are better suited to tree-based algorithms than to Bayesian methods.
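The histogram features described in the abstract can be illustrated with a short sketch. The abstract does not specify how the bit sequences are windowed, how the offset is measured, or which part of the histogram is kept, so the code below makes its own assumptions: non-overlapping windows, big-endian pairing of bytes for 16-bit values, an offset counted in bytes from the start of the file, counts normalized to relative frequencies, and a partial histogram taken as the first `keep` bins. The function name `bit_sequence_histogram` and its parameters are hypothetical and are not taken from the thesis.

```python
from typing import List, Optional


def bit_sequence_histogram(
    data: bytes,
    bits: int = 8,
    offset: int = 0,
    keep: Optional[int] = None,
) -> List[float]:
    """Relative-frequency histogram of `bits`-wide values in `data`.

    Assumptions (not from the thesis): windows do not overlap, 16-bit
    values are read big-endian, `offset` is a byte count skipped at the
    start, and `keep` truncates the histogram to its first `keep` bins.
    """
    if bits not in (4, 8, 16):
        raise ValueError("bits must be 4, 8, or 16")
    if offset < 0:
        raise ValueError("offset must be non-negative")

    payload = data[offset:]
    counts = [0] * (1 << bits)  # one bin per possible value

    if bits == 4:
        # Each byte contributes its high and low nibble.
        for b in payload:
            counts[b >> 4] += 1
            counts[b & 0x0F] += 1
    elif bits == 8:
        for b in payload:
            counts[b] += 1
    else:
        # Pair consecutive bytes; a trailing odd byte is dropped.
        for i in range(0, len(payload) - 1, 2):
            counts[(payload[i] << 8) | payload[i + 1]] += 1

    total = sum(counts)
    if total == 0:
        hist = [0.0] * len(counts)
    else:
        hist = [c / total for c in counts]

    return hist if keep is None else hist[:keep]
```

A feature vector for one program would then be, for example, `bit_sequence_histogram(open(path, "rb").read(), bits=8, offset=64, keep=128)`, where `path`, the offset of 64 bytes, and the choice of 128 bins are placeholder values chosen only for illustration. Normalizing the counts keeps vectors comparable across files of different sizes, which is a design choice of this sketch rather than a detail confirmed by the abstract.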

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 2018
Accession Number
AD1069562

Entities

People

  • Jonathan P. Erwert

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Cyber

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Forensics
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Mining
  • Digital Data
  • Information Science
  • Machine Languages
  • Machine Learning
  • Malware
  • Network Science
  • Operating Systems
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Cybersecurity
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • Cyber