LLVM Intermediate Representation for Code Weakness Identification
Abstract
Recent efforts in code weakness identification focus on training statistical machine learning (ML) models that use source code text as the feature space, along with more structural features such as abstract syntax trees. LLVM intermediate representation (IR) can aid ML models by standardizing code, reducing vocabulary size, and removing some context sensitivity regarding syntax and memory. We investigate the benefit of using LLVM IR to train statistical and ML models, including bag-of-words models, BiLSTMs, and a few varieties of transformer models. We compare these LLVM IR-based models to their C source-based counterparts on two different data sets: synthetic data and more natural data. We find that although LLVM IR features do not yield more accurate models than their C-based counterparts, we are able to identify context-specific LLVM IR and C tokens that help indicate the presence of weaknesses. Additionally, we find that for a given data set, bag-of-words models can be powerful indicators of whether any statistical or ML model is beneficial for code weakness identification, before more complex and time-consuming models are used.
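To make the bag-of-words idea concrete, the following is a minimal sketch of how LLVM IR text might be turned into token counts. It is not the report's pipeline. The tokenizer, the `<REG>` and `<NUM>` placeholder tokens, and the hand-written IR snippet are all assumptions chosen for illustration. Mapping registers and numeric literals to placeholders is one plausible way to shrink the vocabulary, in the spirit of the standardization described above.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the report's method):
# registers/globals, integer literals, words, and single punctuation marks.
TOKEN_RE = re.compile(r"[%@][\w.]+|\d+|\w+|[^\s\w]")


def tokenize(ir_text):
    """Split LLVM IR text into tokens, normalizing registers and numbers."""
    tokens = []
    for tok in TOKEN_RE.findall(ir_text):
        if tok.startswith("%"):
            tokens.append("<REG>")  # collapse local register names
        elif tok.isdigit():
            tokens.append("<NUM>")  # collapse integer literals
        else:
            tokens.append(tok)
    return tokens


def bag_of_words(ir_text):
    """Return token counts; token order is discarded."""
    return Counter(tokenize(ir_text))


# Hand-written, illustrative IR (typed-pointer syntax): a store past the end
# of a 10-byte stack buffer.
SAMPLE_IR = """
%1 = alloca [10 x i8], align 1
%2 = getelementptr inbounds [10 x i8], [10 x i8]* %1, i64 0, i64 12
store i8 65, i8* %2, align 1
"""

bow = bag_of_words(SAMPLE_IR)
for token, count in bow.most_common(5):
    print(token, count)
```

Count vectors like `bow` could then be fed to any simple classifier. Note that the normalization here is deliberately lossy: replacing every literal with `<NUM>` hides the out-of-bounds index itself, so a real feature pipeline would need to decide which details to keep.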
Document Details
- Document Type: Technical Report
- Publication Date: Jul 08, 2022
- Accession Number: AD1178536
Entities
People
- David Svoboda
- Shannon K. Gallagher
- William E. Klieber
Organizations
- Carnegie Mellon University