This repository contains code for the sa-bAbI code generator and for open-source code transformation.
Our team project is based on cmu-sei/sa-bAbI. We add five new kinds of vulnerabilities and try applying the model to some transformed open-source code.
Contributors:
A minimal example that takes you through an end-to-end analysis of generated code.
Git clone or unzip the archive and descend into the working directory.
$ unzip SA-bAbI-<version>.zip
$ cd sa-babi-<version>
Build the docker images and set up the workspace.
$ docker-compose build
$ mkdir working
Generate two datasets, a training set and a test set. The SA_SEED variable is optional, but is used here for reproducibility.
$ SA_SEED=0 ./sa_e2e.sh working/sa-train-1000 1000
$ SA_SEED=1 ./sa_e2e.sh working/sa-test-100 100
Alternatively, use ./sa_e2e_no_tools.sh, which skips the static analysis tools and is therefore much faster.
$ SA_SEED=0 ./sa_e2e_no_tools.sh working/sa-train-1000 1000
$ SA_SEED=1 ./sa_e2e_no_tools.sh working/sa-test-100 100
Within the output directory you will find:
- alerts/ containing CSV files of the alerts each static analyzer found per file
- clang_sa/ containing XML output from clang
- cppcheck/ containing XML output from cppcheck
- frama-c/ containing text output from frama-c
- src/ containing the generated sa-bAbI .c files
- tokens/ containing the tokenized .c files
- tool_confusion_matrix.csv reporting results on the whole dataset
- tool_confusion_matrix_sound.csv reporting results on the sound subsample of the dataset
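To confirm that a run produced everything listed above, the output directory can be checked entry by entry. The following is a minimal sketch and not part of the repository: the helper name `missing_outputs` is hypothetical, and the entry names are copied from the list above.

```python
from pathlib import Path

# Entries this README lists for the output directory of sa_e2e.sh.
EXPECTED_DIRS = ["alerts", "clang_sa", "cppcheck", "frama-c", "src", "tokens"]
EXPECTED_FILES = ["tool_confusion_matrix.csv", "tool_confusion_matrix_sound.csv"]


def missing_outputs(working_dir):
    """Return the expected entries absent from working_dir (hypothetical helper)."""
    root = Path(working_dir)
    # Directories are reported with a trailing slash, as in the list above.
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

For example, `missing_outputs("working/sa-train-1000")` returns an empty list when every listed entry is present. A run of ./sa_e2e_no_tools.sh skips the analyzers, so it will presumably report the tool-related entries as missing.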
Build and activate the conda environment for the deep learning component
$ cd pipeline
$ conda env create -f environment.yml
$ source activate sa_babi
Train the deep learning model on the training data.
Note that the pipeline/constants.py file is already set up for working/sa-train-1000. If you wish to use different data, modify pipeline/constants.py appropriately before running the following command.
$ python train.py
The current validation script only serves the needs of the arXiv.org paper. TODO: Develop a more general validation script.
- Predict directly: we have placed models, vocal.pkl, and instances.npy in working/sa-train-10000. These files are used for prediction, so you can predict without retraining.
$ ./sa_gen_tokens.sh working/sa-test-trueC
$ cd pipeline
$ source activate sa_babi
$ python test_trueC.py
- Retrain the model and then predict: generate new sa-bAbI training data, train the model, and then predict.
$ SA_SEED=0 ./sa_e2e_no_tools.sh working/sa-train-10000 10000
$ cd pipeline
$ source activate sa_babi
$ python train.py
$ cd ..
$ ./sa_gen_tokens.sh working/sa-test-trueC
$ cd pipeline
$ source activate sa_babi
$ python test_trueC.py
This repository, in part, contains an automated system for:
- Generating source code
- Generating tokens from the source
- Running open-source static analysis tools on the source
- Scoring the tool outputs against ground truth
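The scoring step can be pictured as comparing, for each tool, the locations it flagged against the locations labeled as vulnerable. The sketch below is an illustration under stated assumptions, not the repository's scoring code: it assumes alerts and ground truth are both reduced to (file, line) pairs, and the names `confusion_counts`, `flagged`, `truth`, and `all_sites` are hypothetical.

```python
def confusion_counts(flagged, truth, all_sites):
    """Count TP/FP/FN/TN for one tool.

    flagged   -- set of (file, line) pairs the tool alerted on (assumed format)
    truth     -- set of (file, line) pairs labeled vulnerable (assumed format)
    all_sites -- set of every (file, line) pair that is scored
    """
    flagged = flagged & all_sites  # ignore alerts outside the scored sites
    tp = len(flagged & truth)                 # flagged and vulnerable
    fp = len(flagged - truth)                 # flagged but safe
    fn = len((truth & all_sites) - flagged)   # vulnerable but not flagged
    tn = len(all_sites - flagged - truth)     # safe and not flagged
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Made-up example data, for illustration only.
sites = {("a.c", 3), ("a.c", 7), ("b.c", 2), ("b.c", 9)}
truth = {("a.c", 3), ("b.c", 2)}
flagged = {("a.c", 3), ("b.c", 9)}
print(confusion_counts(flagged, truth, sites))
# prints {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```

The four counts always sum to the number of scored sites, which is a useful sanity check when aggregating over a dataset.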
This system is built with docker and docker compose.
This system has been primarily run on Linux (Ubuntu 16.04), but should work on any Linux-like system (including Mac), assuming the following are present:
- docker
- docker-compose
- bash
- realpath (obtainable through the coreutils homebrew package on Mac)
The necessary docker images can be built with:
docker-compose build
The sa_e2e.sh script runs the tool pipeline end-to-end. The usage for this script is:
bash sa_e2e.sh <working_dir> <num_instances>
Where <working_dir> is the path to a local directory where outputs will be stored (it will be created if it doesn't exist), and <num_instances> is the number of testcases that will be generated.
After this script is run, <working_dir> will contain:
- Generated source files in the src directory
- Tokenized source files in the tokens directory
- Raw tool outputs in directories named with tool names (e.g. cppcheck)
- Aggregated tool alerts in the alerts directory. The alerts are in a common CSV format.
- Confusion matrices for tools in tool_confusion_matrix.csv and tool_confusion_matrix_sound.csv
The testcases are randomly generated based on a seed. By default, this seed is set to a random value, but you can set it to a specific value by setting the SA_SEED environment variable, e.g.
SA_SEED=10 bash sa_e2e.sh <working_dir> <num_instances>
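When the pipeline is driven from a script rather than typed at a shell, the seed can be passed through the environment in the same way. This is a minimal sketch, assuming sa_e2e.sh is in the current directory; the helper names `build_e2e_invocation` and `run_e2e` are hypothetical and not part of the repository.

```python
import os
import subprocess


def build_e2e_invocation(working_dir, num_instances, seed=None):
    """Return (argv, env) for one sa_e2e.sh run (hypothetical helper)."""
    argv = ["bash", "sa_e2e.sh", str(working_dir), str(num_instances)]
    env = dict(os.environ)
    if seed is None:
        # Leave SA_SEED unset so the script falls back to a random seed.
        env.pop("SA_SEED", None)
    else:
        env["SA_SEED"] = str(seed)
    return argv, env


def run_e2e(working_dir, num_instances, seed=None):
    """Run the pipeline; raises CalledProcessError if the script fails."""
    argv, env = build_e2e_invocation(working_dir, num_instances, seed)
    subprocess.run(argv, env=env, check=True)
```

For example, `run_e2e("working/sa-train-1000", 1000, seed=0)` mirrors the command `SA_SEED=0 bash sa_e2e.sh working/sa-train-1000 1000`.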
Initial release