LLVM Intermediate Representation for Code Weakness Identification

Abstract

Recent work on code weakness identification focuses on training statistical machine learning (ML) models that use source code text as the feature space, in addition to more structural features such as abstract syntax trees. LLVM intermediate representation (IR) can aid ML models by standardizing code, reducing vocabulary size, and removing some context sensitivity regarding syntax and memory. We investigate the benefit of using LLVM IR to train statistical and machine learning models, including bag-of-words models, BiLSTMs, and a few varieties of transformer models. We compare these LLVM IR-based models to models trained on C source code, using two different data sets: synthetic data and more natural data. We find that although LLVM IR features do not yield more accurate models than their C-based counterparts, we are able to identify context-specific LLVM IR and C tokens that help indicate the presence of weaknesses. Additionally, we find that, for a given data set, bag-of-words models can be powerful indicators of whether any statistical or ML model is beneficial for code weakness identification, before resorting to more complex and time-consuming models.
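
As a rough illustration of the bag-of-words idea mentioned in the abstract, the sketch below counts tokens in a small C fragment and in a hand-written LLVM IR fragment. It is not the report's pipeline: the tokenizer regex, the function names (tokenize, bag_of_words), and both snippets are assumptions made up for this example, and the IR was written by hand rather than produced by a compiler.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the report's): identifiers,
# optionally prefixed with LLVM's % or @ sigils, then integers, then any
# single non-space character such as punctuation.
TOKEN_RE = re.compile(r"[%@]?[A-Za-z_][A-Za-z0-9_.]*|\d+|\S")


def tokenize(code):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def bag_of_words(code):
    """Map each token to the number of times it occurs, ignoring order."""
    return Counter(tokenize(code))


# Illustrative C fragment (hypothetical).
C_SNIPPET = "char buf[8]; strcpy(buf, src);"

# Hand-written LLVM IR loosely corresponding to the C above (hypothetical).
IR_SNIPPET = """
%buf = alloca [8 x i8], align 1
%call = call i8* @strcpy(i8* %arraydecay, i8* %src)
"""

c_bow = bag_of_words(C_SNIPPET)
ir_bow = bag_of_words(IR_SNIPPET)

# The count vectors are the features a simple classifier would consume.
print("C tokens: ", sorted(c_bow.items()))
print("IR tokens:", sorted(ir_bow.items()))
```

Token counts like these discard ordering entirely, which is why such a model is cheap to fit and can serve as a quick check before investing in sequence models.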

Document Details

Document Type: Technical Report
Publication Date: Jul 08, 2022
Accession Number: AD1178536

Entities

People

  • David Svoboda
  • Shannon K. Gallagher
  • William E Klieber

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Accuracy
  • Artificial Intelligence Software
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Computer Programs
  • Data Sets
  • Deep Learning
  • Language
  • Linguistics
  • Machine Learning
  • Neural Networks
  • Recurrent Neural Networks
  • Software Development
  • Supervised Machine Learning
  • Test Sets

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Neural Network Machine Learning
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks
  • Space