On the Feasibility of Training an AI to Understand Programs: FY23 Cyber Security Line-Supported Program

Abstract

This study investigated the possibility of training an AI on the task of program understanding. Specifically, the AI would take mechanically extracted features of programs as input and output English word-and-sentence descriptions of their functionality. This output would be expected to aid a reverse engineer in investigating the capabilities and vulnerabilities of a piece of software. The input features might be static, meaning they are gleaned only from inspection of the software, or dynamic, meaning they are extracted from program executions. In this seedling study, we investigated a number of recent publications, existing datasets, data sources, and embeddings for binaries and English prose. As part of our study, we constructed a novel dataset, which will be made available to the research community for general use. In brief, the results of this study are twofold. First, the dataset we constructed from over a million Stack Overflow pages is not of high enough quality to be used in training an AI for program understanding. Second, there is some evidence that the embedding used for English prose is too coarse for our purpose, conflating concepts we had hoped it would distinguish. This report concludes with ideas for future investigations, including using our dataset quality measures to identify or weight higher-quality exemplars, and using prose extracted from source and auto-generated web searches.

Document Details

Document Type: Technical Report
Publication Date: Apr 26, 2023
Accession Number: AD1201174

Entities

People

Alexander M. Interrante-Grant
Andrew T. Davis
Heather N. Preslier
Timothy R. Leek

Organizations

Massachusetts Institute of Technology
