On the Feasibility of Training an AI to Understand Programs: FY23 Cyber Security Line-Supported Program
Abstract
This study investigated the feasibility of training an AI to perform program understanding. Specifically, the AI would take as input mechanically extracted features of programs and output English word-and-sentence descriptions of their functionality. This output would be expected to aid a reverse engineer in investigating the capabilities and vulnerabilities of a piece of software. The input features might be static, meaning they are gleaned only from inspection of the software, or dynamic, meaning they are extracted from program executions. In this seedling study, we surveyed a number of recent publications, existing datasets, data sources, and embeddings for binaries and English prose. As part of the study, we constructed a novel dataset, which will be made available to the research community for general use. In brief, the results are twofold. First, the dataset we constructed from over a million Stack Overflow pages is not of high enough quality to be used in training an AI for program understanding. Second, there is some evidence that the embedding used for English prose is too coarse for our purpose, conflating concepts we had hoped it would distinguish. This report concludes with ideas for future investigations, including using our dataset quality measures to identify or weight higher-quality exemplars, and using prose extracted from source and auto-generated web searches.
Document Details
- Document Type: Technical Report
- Publication Date: Apr 26, 2023
- Accession Number: AD1201174
Entities
People
- Alexander M. Interrante-Grant
- Andrew T. Davis
- Heather N. Preslier
- Timothy R. Leek
Organizations
- Massachusetts Institute of Technology