Learn-2-Reason (Phase Two): Advanced Applications of Probabilistic Binary Analysis

Abstract

Motivation. Computing systems in Naval environments rely heavily on COTS and legacy software for which source code is unavailable. Many such systems are overly complex and include functionality that is irrelevant to Naval operations. Their code bases are often outdated and contain various software vulnerabilities. Efforts to remove unused features (to lower complexity, reduce attack surface, and improve efficiency), as well as efforts to harden the software, must be built on a precise, robust, and scalable static binary analysis and transformation infrastructure. The key challenge of binary analysis is to recover high-level semantic information. Reverse engineering is by nature imprecise and uncertain. Traditional analyses have to make conservative assumptions when facing uncertainty. The accumulated imprecision quickly degrades their usability and causes various precision and scalability issues. Interestingly, statistical learning techniques are often able to produce useful results when the low-level analyses cannot. This is mainly because statistical learning techniques are good at collecting and integrating various hints and at making predictions based on past patterns and experience. On the downside, statistical learning can hardly take program semantics into consideration and hence cannot "connect the dots".

Proposed Research. Recently, a new paradigm, called "Learn-2-Reason", has been proposed to bridge the gap between statistical learning and formal reasoning for more effective system modeling and analysis. It is not a simple combination of the two: they influence, interact with, and hence improve each other by sharing their inputs and exchanging their knowledge. We propose to instantiate the paradigm in binary analysis, as the two align perfectly.
Statistical learning has unique advantages in dealing with the inherent uncertainty in binary analysis, while formal reasoning allows connecting the dots so that the predictions from the learning models can be propagated, aggregated, and cross-validated. The fusion and interplay between learning and reasoning can be achieved by probabilistic inference. Both learning results and formal reasoning rules are encoded as probabilistic constraints (or conditional probabilities) in a Probabilistic Graphical Model (PGM). PGM inference produces a probability distribution that indicates the most likely results and maximizes the satisfaction of constraints.

In the first phase of the project, we developed a number of probabilistic binary analysis primitives, including probabilistic disassembly, variable identification and type inference, points-to analysis (value-set analysis), CFG/PDG construction, and probabilistic binary rewriting. We also developed a technique to perform UI-driven application reduction without source code. In the second phase of the project, we propose to improve the machine learning models used in the aforementioned primitives. These models are overly simplistic and have become the dominant cause of false positives and false negatives. We plan to leverage rigorous formal reasoning on source code to produce ground truth for model training. We also propose to leverage the improved infrastructure to address a number of difficult binary analysis challenges, including probabilistic forced execution to expose hidden malicious behavior, probabilistic differential analysis of multiple binaries, and probabilistic input-structure-aware fuzzing that does not require an explicit grammar.

Innovative Claims. The intellectual merit lies in the following. (1) It will demonstrate the power of "Learn-2-Reason" by achieving breakthroughs in binary analysis.
While traditional analysis is reaching its ceiling, the innovative coupling with learning will enable the next generation of binary analysis. (2) The proposed analysis primitives will deliver unprecedented precision and robustness. (3) A highly scalable and effective probabilistic inference technique will be d
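
The idea of encoding both learned hints and reasoning rules as probabilistic constraints, and then running inference over them, can be illustrated with a toy sketch. The example below is not the project's implementation: the offsets, prior probabilities, rule strengths, and function names are all hypothetical, and exact enumeration stands in for whatever scalable PGM inference the project uses. It treats three candidate instruction-start offsets as Boolean variables, takes made-up "learned" priors, adds two soft rules (fall-through and overlap exclusion), and computes posterior marginals.

```python
from itertools import product

# Hypothetical candidate instruction-start offsets in a tiny code region.
OFFSETS = [0, 1, 3]

# Assumed learned priors P(offset is a true instruction start). The values
# are invented: offset 0 is nearly certain (say, a known entry point), the
# other two are undecided.
PRIOR = {0: 0.99, 1: 0.5, 3: 0.5}

# Assumed reasoning rules as soft constraints: (kind, a, b, strength).
#   "implies":   if a is an instruction start, so is b (a falls through to b).
#   "exclusive": a and b overlap, so at most one is a true instruction start.
RULES = [
    ("implies", 0, 3, 0.95),
    ("exclusive", 0, 1, 0.99),
]


def rule_weight(kind, va, vb, strength):
    """Return `strength` if the rule holds, else 1 - strength."""
    if kind == "implies":
        satisfied = (not va) or vb
    elif kind == "exclusive":
        satisfied = not (va and vb)
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    return strength if satisfied else 1.0 - strength


def joint_weight(assignment):
    """Unnormalized weight of one full true/false assignment."""
    weight = 1.0
    for offset, value in assignment.items():
        weight *= PRIOR[offset] if value else 1.0 - PRIOR[offset]
    for kind, a, b, strength in RULES:
        weight *= rule_weight(kind, assignment[a], assignment[b], strength)
    return weight


def marginals():
    """Posterior P(offset is an instruction start), by exact enumeration."""
    total = 0.0
    true_mass = {offset: 0.0 for offset in OFFSETS}
    for values in product([False, True], repeat=len(OFFSETS)):
        assignment = dict(zip(OFFSETS, values))
        weight = joint_weight(assignment)
        total += weight
        for offset, value in assignment.items():
            if value:
                true_mass[offset] += weight
    return {offset: mass / total for offset, mass in true_mass.items()}


if __name__ == "__main__":
    for offset, prob in marginals().items():
        print(f"offset {offset}: P(instruction start) = {prob:.3f}")
```

With these invented numbers, the strong hint at offset 0 propagates through the rules: the posterior for offset 3 rises well above its 0.5 prior, while the overlapping candidate at offset 1 drops well below it. Enumeration is exponential in the number of variables, so it only serves to show the encoding; it says nothing about how inference would scale to real binaries.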

Document Details

Document Type
DoD Grant Award
Publication Date
Nov 26, 2019
Source ID
N000142012733

Entities

People

  • Xiangyu Zhang

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Virginia

Tags

Fields of Study

  • Computer science

Readers

  • Artificial Intelligence
  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference