LLVM Intermediate Representation for Code Weakness Identification

Abstract

Recent work on code weakness identification focuses on training statistical machine learning (ML) models that use source code text as the feature space, alongside more structural features such as abstract syntax trees. LLVM intermediate representation (IR) can aid ML models by standardizing code, reducing vocabulary size, and removing some context sensitivity regarding syntax and memory. We investigate the benefit of LLVM IR for training statistical and machine learning models, including bag-of-words models, BiLSTMs, and a few varieties of transformer models. We compare these LLVM IR-based models to counterparts trained on C source code on two different data sets: synthetic data and more natural data. We find that while LLVM IR features do not yield more accurate models than their C-based counterparts, we are able to identify context-specific LLVM IR and C tokens that help indicate the presence of weaknesses. Additionally, for a given data set, we find that bag-of-words models can be powerful indicators of whether any statistical or ML model is beneficial for code weakness identification, before turning to more complex and time-consuming models.
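
To make the bag-of-words idea concrete, the following is a minimal sketch of turning LLVM IR text into token counts. It is not taken from the report: the IR fragment, the tokenizer pattern, the `%VAR` placeholder, and the `bag_of_words` helper are all assumptions made for illustration. Mapping local names to one placeholder is one possible way to shrink the vocabulary.

```python
import re
from collections import Counter

# Hypothetical LLVM IR fragment, written for illustration only
# (it is not drawn from the report's data sets).
IR_SNIPPET = """
%buf = alloca [10 x i8], align 1
%ptr = getelementptr inbounds [10 x i8], [10 x i8]* %buf, i64 0, i64 %idx
store i8 0, i8* %ptr, align 1
"""

# Assumed tokenizer: local identifiers (%name), words and type names,
# integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"%[\w.]+|[A-Za-z_][\w.]*|\d+|[^\s\w]")


def bag_of_words(text, normalize_locals=True):
    """Count tokens in `text`, ignoring their order.

    When `normalize_locals` is true, every token starting with '%' is
    replaced by the placeholder '%VAR' (an assumption of this sketch),
    so differently named locals share one vocabulary entry.
    """
    tokens = TOKEN_RE.findall(text)
    if normalize_locals:
        tokens = ["%VAR" if t.startswith("%") else t for t in tokens]
    return Counter(tokens)


counts = bag_of_words(IR_SNIPPET)
print(counts.most_common(5))
```

The resulting counts can serve as a feature vector for a simple classifier. Instruction tokens such as `getelementptr` or `store` are the kind of tokens whose frequencies such a model would weigh.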

Document Details

Document Type: Technical Report
Publication Date: Jul 08, 2022
Accession Number: AD1178536

Entities

People

David Svoboda
Shannon K. Gallagher
William E. Klieber

Organizations

Carnegie Mellon University
