Hiding Adversarial Attacks on Convolutional Neural Networks Using Manipulated Explanations

This is the repository for my M.Sc. thesis, in which I combined adversarial attacks on Fashion-MNIST images with an adversarial fine-tuning procedure to manipulate the visual explanations produced by the explainability techniques Grad-CAM and Guided Backpropagation.


eXplainable Artificial Intelligence (XAI) has attracted a lot of attention in the last few years due to the black-box nature of many deep learning models. We want to better understand the decision-making process of our models and which input features influence the output. Some have also suggested that explainability techniques could be used for auditing machine learning models, e.g. to check whether they exhibit biases. However, researchers have recently found that some explainability techniques can be manipulated to produce explanations for images that have nothing to do with the image content (see the papers by Dombrowski et al., Ghorbani et al. and Heo et al.).

In my thesis I empirically investigated whether it is possible to hide a manipulation of the input data from a potential audit that uses explainability techniques. The idea behind this is that a malicious actor could try to make a visual explanation of a manipulated image look very similar to the explanation of the non-manipulated counterpart. This way, an auditor would not be able to tell that a manipulation has taken place by examining the explanations.

To achieve this, I first used an established adversarial attack technique called DeepFool to manipulate the original images in the Fashion-MNIST data set in order to create adversarial images that are misclassified by a pre-trained CNN model. Then, I came up with a training procedure, also called adversarial fine-tuning, that is based on the original and adversarial images, as well as their explanation maps that are created using the two techniques Grad-CAM and Guided Backpropagation. The three goals of the fine-tuning were to...

  1. create visual explanations for the adversarial images after the fine-tuning that look similar to the explanations of the original images before the fine-tuning,
  2. keep the network's classification performance on the original images approximately stable, and
  3. ensure that the adversarial images are consistently misclassified after the fine-tuning.


Conda environment

In order to set up the conda environment:

  1. Clone the repo and cd into the project directory, then create the hiding_adversarial_attacks environment with the help of conda:
    conda env create -f environment.yml
  2. Activate the new environment with:
    conda activate hiding_adversarial_attacks

Download the model weights

  1. Download the whole directory called weights from Google Drive.
  2. Extract the zip file inside the project's models directory.

You should now see two directories: pre-manipulation and post-manipulation. The first one contains the pre-trained Fashion-MNIST classifier checkpoint fashion-mnist-model.ckpt that everything builds upon. If you just want to use this checkpoint, you can skip step 1 of Running the code. The weights in the post-manipulation directory are the results of my thesis. They can be used in step 4.

The models made available here are a subset of the ones used in my thesis. Please open an issue should you require more.

Download the Fashion-MNIST data set

  1. Download the Fashion-MNIST data set files from Google Drive.
  2. Move the files inside the project directory data/external/FashionMNIST/processed. The final result should look something like this:
├── data
│   ├── external
│   │   └── FashionMNIST
│   │       └── processed
│   │           ├──
│   │           └──

(Optional) Logging to Neptune

This project enables you to use Neptune for logging training and test results to the cloud. However, this is turned off by default, so the logs are only saved locally. If you're fine with this, you do not need to change anything.

If you want to use Neptune for logging, head over to their website, create an account and a project. Note down the project name and API key and export them as environment variables:

   export NEPTUNE_API_TOKEN=<your-token>
   export NEPTUNE_PROJECT=<your-username/your-project-name>

When running step 1 and step 4 below, add the following to the Python commands: neptune_offline_mode=False.
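Before switching off offline mode, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch; the helper name `missing_neptune_vars` is hypothetical and not part of this project.

```python
import os

# The two variable names come from the export commands above.
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_neptune_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Both Neptune variables are set.")
```

If the script reports missing variables, re-run the export commands in the same shell session from which you start the Python commands.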

Logs are also saved locally, together with other training and testing output such as images and csv files. All logs are saved to the project's logs directory, but each of the following steps creates its own subdirectory.

Running the code

The whole process consists of the following 4 steps, all of which are configurable:

  1. Training the Fashion-MNIST classifier (optional: you can skip this and just use the pre-trained weights).
  2. Adversarially attacking the Fashion-MNIST data set (optional: you can also skip this and download the attacked data).
  3. Creating the initial visual explanation maps (optional: can also be skipped and the data downloaded).
  4. Running the adversarial fine-tuning based on the pre-trained model, the adversarial images and the visual explanations.

1) Training the Fashion-MNIST classifier

If you want to skip this, please download the model weights in the pre-manipulation directory as described previously.


If you want to replicate the exact results, you first need to download the Fashion-MNIST data set as described above.

Not so quick

Activate the conda environment, cd into the src/hiding_adversarial_attacks directory and run the following:

python data_set=FashionMNIST classifier=FashionMNISTClassifier

This will train a Fashion-MNIST CNN classifier for 25 epochs using a learning rate of 1.5 with the Adadelta optimizer, static learning rate decay factor of 0.85 and batch size of 64. You can check out other configuration presets for the training in the file src/hiding_adversarial_attacks/config/

Logs and model checkpoints are saved to the directory logs/train_classifier.

To test your model run the following command and add the path to your model checkpoint:

python \
test=True \
data_set=FashionMNIST \
classifier=FashionMNISTClassifier \

⚠️ Beware that all special characters (e.g. "=") in the checkpoint path need to be escaped with a backslash ("\")!
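Escaping a long path by hand is error-prone, so a small helper can do it. This is a sketch only: it assumes the escape character is a backslash and that "=" is the only character that needs escaping. The function name and the example path are hypothetical.

```python
def escape_override_value(value: str, special_chars: str = "=") -> str:
    """Prefix every character listed in `special_chars` with a backslash.

    Assumption: "=" is the only special character by default; extend
    `special_chars` if your paths contain others.
    """
    return "".join("\\" + ch if ch in special_chars else ch for ch in value)


if __name__ == "__main__":
    # Hypothetical checkpoint path, for illustration only.
    path = "logs/train_classifier/run=1/model.ckpt"
    print(escape_override_value(path))
```

The printed string can then be pasted after the checkpoint key on the command line.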

2) Adversarially attacking the Fashion-MNIST data set


If you want to skip this step, download all of the 8 PyTorch tensors (*.pt) from Google Drive. Move the *.pt files inside the local project directory data/preprocessed/adversarial/data-set=FashionMNIST--attack=DeepFool--eps=0.105--cp-run=HAA-1728.

Not so quick

If you want to create your own adversarial data, cd into the src/hiding_adversarial_attacks directory. Run the following command, adding the path to your trained FashionMNIST checkpoint (*.ckpt) file and a custom identifier that will be appended to the directory where the resulting tensors will be stored after the attack (I used the ID of the Neptune run of the model training (HAA-1728), but you're free to choose whatever you like):

python \
data_set=FashionMNIST \
classifier=FashionMNISTClassifier \
checkpoint=<path-to-checkpoint> \
checkpoint_run=<custom-checkpoint-identifier> \

This will run an adversarial attack called DeepFool on the Fashion-MNIST data set and create 8 PyTorch tensors in a directory at data/preprocessed/adversarial.
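Judging by the directory name used for the downloadable data (data-set=FashionMNIST--attack=DeepFool--eps=0.105--cp-run=HAA-1728), the output directory appears to encode the data set, the attack, the epsilon and the checkpoint identifier. The sketch below only mirrors that visible pattern; the function name is hypothetical and the project's own naming code may differ.

```python
from pathlib import PurePosixPath


def adversarial_dir(data_set: str, attack: str, eps: float, checkpoint_run: str) -> PurePosixPath:
    """Build the expected output directory for one attack configuration.

    Assumption: the naming pattern is inferred from the example directory
    name in this README, not taken from the project's source code.
    """
    name = f"data-set={data_set}--attack={attack}--eps={eps}--cp-run={checkpoint_run}"
    return PurePosixPath("data/preprocessed/adversarial") / name


if __name__ == "__main__":
    print(adversarial_dir("FashionMNIST", "DeepFool", 0.105, "HAA-1728"))
```

This can be handy for locating the results of a run when you pass a different checkpoint_run identifier.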

You can also change the attack.epsilons parameter by replacing or adding different values. This will affect the adversarial attack size: the larger the values, the more visible the changes will be in the attacked images. For more configuration options, see the file src/hiding_adversarial_attacks/config/

Logs and model checkpoints are saved to the directory logs/run_attack_on_data.

3) Creating the initial visual explanation maps


If you want to skip this step, you can download the pre-created explanation maps from Google Drive. Please download both directories exp=GradCAM--l=conv2--ra=False and exp=GuidedBackprop as *.zip files and unzip them inside data/preprocessed/adversarial/data-set=FashionMNIST--attack=DeepFool--eps=0.105--cp-run=HAA-1728.

Not so quick

If you want to create the explanation maps yourself, first cd into the src/hiding_adversarial_attacks directory. Run the following command for creating the Grad-CAM explanation maps, adding the path to your Fashion-MNIST classifier checkpoint and the directory containing the results of the adversarial attack from step 2 (in case you downloaded the data, set this to data_path=<path-to-local-project-root>/data/preprocessed/adversarial/data-set=FashionMNIST--attack=DeepFool--eps=0.105--cp-run=HAA-1728).

python \
explainer=GradCamExplainer \
classifier=FashionMNISTClassifier \
data_set=AdversarialFashionMNIST \
checkpoint=<path-to-checkpoint> \

For creating Guided Backpropagation explanations, replace explainer="GradCamExplainer" with explainer="GuidedBackpropExplainer". Also check out the file src/hiding_adversarial_attacks/config/ for more configuration options and additional explainability algorithms.

Logs and model checkpoints are saved to the directory logs/explanation.

4) Running the adversarial fine-tuning

⚠️ Please make sure you have completed the previous steps (at least downloading the required data and model checkpoint).

Quick (testing only)

If you just want to run the test stage on the manipulated models provided (see Download the model weights above), then you can run the following code to test the manipulation for Grad-CAM and Fashion-MNIST class Sandal (ID: 5):

python \
test=True explainer="GradCamExplainer" similarity_loss="PCC" \
data_set="AdversarialFashionMNISTWithExplanations" classifier="FashionMNISTClassifier" \
data_path=<path-to-the-explanations-directory-from-step-3> \
explanations_path=<path-to-the-explanations-directory-from-step-3> \
classifier_checkpoint=<path-to-checkpoint> \
checkpoint=["<path-to-local-project-root>/models/post-manipulation/Grad-CAM/Sandal/final-model.ckpt"] \
included_classes=[5] \
normalize_explanations=True \

Make sure you add the correct paths for data_path, explanations_path, classifier_checkpoint (checkpoint of original model from step 1) and checkpoint (path to manipulated model that you downloaded). If you have a GPU available you can speed things up by setting gpus=1.

If you want to run the test for class Coat, make sure to change the checkpoint to checkpoint=["<path-to-local-project-root>/models/post-manipulation/Grad-CAM/Coat/final-model.ckpt"] and set included_classes=[4].

For testing Guided Backpropagation, change the explainer to explainer="GuidedBackpropExplainer" and specify the corresponding model checkpoint in the models/post-manipulation directory.
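The test command above sets similarity_loss="PCC". This README does not spell out the abbreviation; assuming it stands for the Pearson correlation coefficient, the similarity between an original and a manipulated explanation map could be measured roughly as in the following pure-Python sketch. It operates on flattened maps and is not the project's implementation.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long, flattened maps."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Covariance and variances, all without the common 1/n factor (it cancels).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Toy 2x2 "explanation maps", flattened row by row (made-up values).
    original_map = [0.1, 0.4, 0.3, 0.9]
    manipulated_map = [0.2, 0.5, 0.2, 0.8]
    print(pearson_correlation(original_map, manipulated_map))
```

A value close to 1 means the two maps highlight the same regions, which is what the first fine-tuning goal asks for.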

Not so quick (training and testing)

To replicate the adversarial manipulation of Grad-CAM explanations of class Sandal that I used in my thesis, you can use the following command:

python \
classifier="FashionMNISTClassifier" \
data_set="AdversarialFashionMNISTWithExplanations" \
explainer="GradCamExplainer" \
similarity_loss="PCC" \
data_path=<path-to-the-explanations-directory-from-step-3> \
explanations_path=<path-to-the-explanations-directory-from-step-3> \
classifier_checkpoint=<path-to-checkpoint> \
included_classes=[5] \
max_epochs=30 \
lr=0.00003 \
gamma=0.7 \
steps_lr=5 \
weight_decay=0.0 \
loss_weight_similarity=1.0 \
ce_class_weight=1 \
batch_size=64 \
normalize_explanations=True \
convert_to_softplus=True \

If you have a GPU available, you can change the last line to gpus=1.
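The lr, gamma and steps_lr values are consistent with a step-wise learning rate decay, in which the learning rate is multiplied by gamma every steps_lr epochs. Assuming that interpretation (it is not confirmed here), the schedule resulting from the command above can be sketched as follows. The function name is hypothetical.

```python
def step_lr_schedule(lr: float, gamma: float, steps_lr: int, max_epochs: int):
    """Learning rate per epoch under step decay.

    Assumption: the rate is multiplied by `gamma` once every `steps_lr` epochs.
    """
    return [lr * gamma ** (epoch // steps_lr) for epoch in range(max_epochs)]


if __name__ == "__main__":
    # Values taken from the Grad-CAM / Sandal command above.
    schedule = step_lr_schedule(lr=0.00003, gamma=0.7, steps_lr=5, max_epochs=30)
    for epoch in range(0, 30, 5):
        print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.3e}")
```

Under this assumption the learning rate is reduced five times over the 30 epochs, which is worth keeping in mind when changing max_epochs or steps_lr.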

The table below specifies the hyperparameters used for the different explainability techniques and Fashion-MNIST target classes used in my thesis. If you want to run the respective training, make sure to replace all of the hyperparameters in the command above accordingly. You should also specify the correct class ID: included_classes=[5] for Sandal and included_classes=[4] for Coat.

Guided Backpropagation
Guided Backpropagation
lr                      0.00003  0.00035  0.00017  0.00006
loss_weight_similarity  1        1        2        1
ce_class_weight         1        7        130      3
batch_size              64       64       64       64
weight_decay            0.00     0.00     0.00     0.001
steps_lr                5        5        8        8
gamma                   0.7      0.7      0.3      0.7
max_epochs              30       30       30       30

The outputs of your training run will be saved at logs/manipulate_model. The manipulated model checkpoints can then be found inside the run's checkpoints directory.

In order to perform a test run on your manipulated model, you can use the command under the Quick section, adding the checkpoint of your trained model, as well as the correct included_classes and explainer.

Project Organization

├──              <- List of developers and maintainers.
├──            <- Changelog to keep track of new features and fixes.
├── LICENSE.txt             <- License as chosen on the command-line.
├──               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├──                <- Use `python develop` to install for development or
│                              create a distribution with `python bdist_wheel`.
├── src
│   └── hiding_adversarial_attacks <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `py.test`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.


This project has been set up using PyScaffold 4.0 and the dsproject extension 0.6. For details and usage information on PyScaffold see

hiding-adversarial-attacks's People


stefaniestoppel avatar
