Advancing the Knowledge about Software Vulnerability Analysis via Large-Scale Benchmarking
Abstract
Research Objectives: The overarching goal of this research is to close the gap between the ever-growing body of security measurement and defense tools and the need for a science of software vulnerability analysis, through large-scale benchmarking that includes objective, quantitative measurements and fair, explainable comparisons. To reach that goal, the central objective is to develop a holistic framework that closely integrates data-driven code transformation, manipulation, and validation techniques with novel data mining and machine learning methods customized to work effectively with program data, so as to enable massive generation of diverse and realistic benchmarks for evaluating a wide range of vulnerability measurement and defense techniques. Metrics and measures will also be developed to enable scientific and explainable assessments of those techniques.
Research Methods: To build the proposed framework, we will first learn vulnerable code patterns from existing vulnerable program samples through code mining in open-source software repositories, and derive vulnerability logic rules to represent the mined patterns. Next, we will mine the version histories of vulnerable programs in their repositories, locating their fixed versions to identify vulnerability-inducing changes, so as to infer the patterns of the code changes that led to the vulnerable versions. These directly mined patterns will be generalized through rule mutation analysis. With the mined and generalized vulnerability logic rules and vulnerability-inducing change patterns as inputs, we will then develop machine learning techniques that generate a large, diverse set of vulnerable code samples. Further program analyses will verify the validity of the generated code samples and retrofit them when necessary, resulting in the final executable benchmark programs. These automatically generated vulnerable benchmarks will enable large-scale assessments of given vulnerability defense techniques via their associated tools. To that end, a set of quantitative metrics and measures is also needed for fair, scientific comparisons; these will be developed through analytic decomposition of the design space of state-of-the-art vulnerability analyzers.
Research Significance: In recognition of the fatal consequences of software vulnerabilities for our society, a large and still-growing number of defense techniques against particular types of vulnerabilities have been devised. Yet there is a critical lack of scientific assessments of these techniques, which impedes a systematic understanding of their strengths and limitations and, in turn, fundamental advancement in software security assurance. This project will potentially fill this gap by offering a holistic framework for evaluating vulnerability analysis techniques at a large scale. To that end, multiple challenges will be addressed, including (1) the scarcity of empirical examples of automatically generating complete, executable programs, (2) the technical difficulty of approximating realistic programs in terms of the kinds of code vulnerabilities covered, and (3) the challenges that machine learning and data mining face in inferring complex vulnerability logic rules and vulnerability-inducing code-change patterns and in using these as inputs to generate diverse yet realistic code samples, especially with a limited number of training samples. Addressing these challenges represents major advances in both software engineering and artificial intelligence, while producing new knowledge and understanding of how to leverage their deep synergy to empower information and software assurance.
Document Details
- Document Type: DoD Grant Award
- Publication Date: Jun 25, 2021
- Source ID: W911NF2110027
Entities
People
- Haipeng Cai
Organizations
- Army Contracting Command
- United States Army
- Washington State University