On the Feasibility of Training an AI to Understand Programs: FY23 Cyber Security Line-Supported Program

Abstract

This study investigated the feasibility of training an AI to perform program understanding. Specifically, the AI would take mechanically extracted features of programs as input and output English word-and-sentence descriptions of their functionality. This output would be expected to aid a reverse engineer in investigating the capabilities and vulnerabilities of a piece of software. The input features might be static, meaning they are gleaned only from inspection of the software, or dynamic, meaning they are extracted from program executions. In this seedling study, we investigated recent publications, existing datasets, data sources, and embeddings for binaries and English prose. As part of the study, we constructed a novel dataset, which will be made available to the research community for general use. In brief, the results are twofold. First, the dataset we constructed from over a million Stack Overflow pages is not of high enough quality to be used in training an AI for program understanding. Second, there is some evidence that the embedding used for English prose is too coarse for our purpose, conflating concepts we had hoped it would distinguish. The report concludes with ideas for future investigations, including using our dataset quality measures to identify or weight higher-quality exemplars, and using prose extracted from source and auto-generated web searches.

Document Details

Document Type
Technical Report
Publication Date
Apr 26, 2023
Accession Number
AD1201174

Entities

People

  • Alexander M. Interrante-Grant
  • Andrew T. Davis
  • Heather N. Preslier
  • Timothy R. Leek

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Cyber

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Automated Text Summarization
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Computer Programs
  • Computers
  • Cybersecurity
  • High Level Languages
  • Information Systems
  • Language
  • Machine Learning
  • Malware
  • Natural Languages
  • Neural Networks
  • Recurrent Neural Networks

Fields of Study

  • Computer science

Readers

  • Economics
  • Geospatial Intelligence and Artificial Intelligence Analytics
  • Neural Network Machine Learning

Technology Areas

  • Cyber