Machine Learning Based Malware Detection

Abstract

Current antivirus software is effective at detecting well-known threats but cannot keep up with either the rate at which new malware is authored or modern antivirus avoidance techniques, such as polymorphic code. Some studies have investigated augmenting current antivirus techniques with machine learning, which could potentially detect some previously unknown malware. However, previously proposed methods either do not detect malware with satisfactory performance, or have only been tested on laboratory software databases whose results cannot be reliably extrapolated to realistic conditions. This work explores several aspects of machine learning based malware detection. First, we propose an approach that learns primarily from program metadata, particularly header data in the 32-bit Windows Portable Executable (PE32) file format. We identify learning methods that learn effectively from this metadata, explore which metadata features can be trivially modified and are therefore not appropriate for malware detection, test the approach on approximately realistic datasets, and find that it performs favorably compared to Windows API imports, another category of file characteristic that shows promise for machine learning based malware detection. Additionally, we identify and examine the drastic performance drop that occurs when test datasets contain a realistically low proportion of malware rather than an even split between malware and benign software.
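The sensitivity to malware proportion noted in the abstract can be illustrated with a short base-rate calculation. The sketch below is not taken from the report: the abstract does not say which metric was used or what rates were measured, so the choice of precision as the metric, the detection rate, and the false-positive rate are all hypothetical values chosen only to show the effect.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Return the fraction of flagged files that are actually malware.

    tpr        -- true-positive rate (share of malware that is flagged)
    fpr        -- false-positive rate (share of benign files that is flagged)
    prevalence -- share of malware in the test set
    """
    true_alarms = tpr * prevalence
    false_alarms = fpr * (1.0 - prevalence)
    flagged = true_alarms + false_alarms
    if flagged == 0:
        return 0.0  # nothing flagged, so precision is reported as 0 here
    return true_alarms / flagged


# Hypothetical detector, NOT figures from the report.
TPR = 0.95
FPR = 0.05

# Same detector, evaluated on test sets with decreasing malware share.
for prevalence in (0.5, 0.1, 0.01):
    p = precision_at_prevalence(TPR, FPR, prevalence)
    print(f"malware share {prevalence:.2f}: precision {p:.3f}")
```

Under these assumed rates, precision is 0.95 on an evenly split test set but falls to roughly 0.16 when only 1% of files are malware, even though the detector itself is unchanged. This is one plausible mechanism behind a performance drop of the kind the abstract describes.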

Document Details

Document Type
Technical Report
Publication Date
May 18, 2015
Accession Number
ADA619747

Entities

People

  • Zane A. Markel

Organizations

  • United States Naval Academy

Tags

Communities of Interest

  • Autonomy
  • Cyber

DTIC Thesaurus Topics

  • Algorithms
  • Anti-Virus Software
  • Computer Programming
  • Computer Science
  • Computers
  • Data Science
  • Databases
  • Detection
  • Information Science
  • Machine Learning
  • Malware
  • Mathematical Models
  • Operating Systems
  • Python Programming Language
  • Training
  • Transient Response Analysis
  • United States Naval Academy

Fields of Study

  • Computer science

Readers

  • Cybersecurity
  • Database Systems and Applications
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Cyber