Improving adversarial robustness of deep neural networks

Abstract

There is growing concern about the vulnerability of deep learning, the current state-of-the-art machine learning (ML) approach, to both test/inference-time and training-time attacks. In particular, two broad classes of deep neural network (DNN) vulnerabilities are adversarial inputs (small perturbations of good inputs at inference time) and backdoors (corruption of the network during training by injecting adversarial training data). Since the backdoor problem formulation considers a stronger adversary (compared to the adversarial test input based problem formulation) who can control the training data as well as the inputs during testing, we view the backdoor detection and mitigation problem as a superset of the adversarial input detection problem. To this end, the proposed project seeks to develop a robust methodology to defend against potentially backdoored DNNs (BadNets), i.e., to detect the presence of adversarial triggers (backdoors) and to accurately recover the correct output label even when presented with a poisoned/adversarial input. The proposed approach has an offline phase and an online phase. The key idea in the offline phase of the proposed methodology is to learn new models based on the BadNet's behavior, but using only clean validation inputs for this learning. The new models seek to leverage the significant amount of work already done in training the BadNet in two ways: (1) learning new models initialized with the BadNet's weights; and (2) making use of the BadNet's hidden layer activations from one or more hidden layers by learning models of how the hidden layer activations relate to the input and output of the DNN. However, since these new models are trained only on clean validation inputs, the intent is for them to preserve only the BadNet's "good" behavior while forgetting its "bad" behavior.
These models, learned using the BadNet and clean validation data, enable probabilistic detection of poisoned inputs during testing and therefore division of the online test inputs into two subsets: a likely-clean subset and a likely-poisoned subset, which is then quarantined. Using this separation of the online test data, our online defense seeks to learn the distinction between the validation and quarantined datasets and to learn a transformation from the validation data distribution to the quarantined data distribution. For this purpose, the online phase of our proposed methodology uses a cycle-consistent generative adversarial network (CycleGAN), which enables transforming clean inputs to their backdoored versions. Because the correct output labels are known for these backdoored versions generated from clean inputs, this enables fine-tuning of the DNN during online testing, thereby gradually reducing the adversary's success rate over time during online deployment. The efficacy of our approach will be tested on several publicly available data sets with backdoors that are impossible to detect using existing state-of-the-art techniques.
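
The division of online test inputs into likely-clean and quarantined subsets can be illustrated with a minimal sketch. The abstract does not specify the detection criterion, so the rule below is an assumption made purely for illustration: an input is quarantined when a model trained only on clean validation data assigns low probability to the label the BadNet predicted. The function name `split_test_inputs`, the `threshold` parameter, and the toy probabilities are hypothetical and do not come from the project.

```python
from typing import List, Sequence, Tuple


def split_test_inputs(
    badnet_labels: Sequence[int],
    clean_model_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> Tuple[List[int], List[int]]:
    """Return (likely_clean_indices, quarantined_indices).

    Assumed rule (not from the source): quarantine an input when the
    clean-trained model gives the BadNet's predicted label a probability
    below `threshold`.
    """
    likely_clean: List[int] = []
    quarantined: List[int] = []
    for i, (label, probs) in enumerate(zip(badnet_labels, clean_model_probs)):
        # Probability the clean-trained model assigns to the BadNet's label.
        agreement = probs[label]
        if agreement >= threshold:
            likely_clean.append(i)
        else:
            quarantined.append(i)
    return likely_clean, quarantined


# Toy example: three test inputs, three classes (made-up numbers).
badnet_labels = [0, 1, 2]
clean_model_probs = [
    [0.90, 0.05, 0.05],  # agrees with the BadNet's label 0
    [0.10, 0.80, 0.10],  # agrees with the BadNet's label 1
    [0.70, 0.20, 0.10],  # disagrees with the BadNet's label 2
]
clean_idx, quarantine_idx = split_test_inputs(badnet_labels, clean_model_probs)
print(clean_idx, quarantine_idx)  # [0, 1] [2]
```

Under this assumed rule, the quarantined indices would form the dataset that the online phase contrasts with the clean validation data; the CycleGAN training and fine-tuning steps are not sketched here.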

Document Details

Document Type
DoD Grant Award
Publication Date
Jun 25, 2021
Source ID
W911NF2110155

Entities

People

  • Siddharth Garg

Organizations

  • Army Contracting Command
  • New York University
  • United States Army

Tags

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks