Giter Club home page Giter Club logo

tailor's Introduction

Tailor

This repository contains the artifact of our ICSE'23 paper titled

Learning graph-based code representations for source-level functional similarity detection

Introduction

Tailor is a new code representation learning framework tailed to detect code functional similarity. Build upon a customized graph neural network, Tailor explicitly models graph-structured code features from code property graphs to provide better program functionality classification.

Overview

Getting Started

Requirement

  • Hardware requirement: GPU: two GPUs, each with 32GB memory; Physical memory: 64 GB
  • OS requirement: Ubuntu (Our implementation has been tested using Ubuntu 20.04 and 18.04)
  • We provide two ways to install Tailor: (1) Build on a docker image provided by us: You need to install docker (2) Build on your machine from scratch: You need to install Miniconda and the corresponding CUDA Version 10.0. (This implementation has been tested using Python 3.6.5 and tensorflow 1.14)

Install

We strongly encourage you to use the docker image provided by us. This is the image we used in our evaluation. We have set up the environment and installed all the dependencies so you can reproduce the results easily.

Build on a docker image provided by us:

  • Download Tailor docker image (tailor_image.tar) from Zenodo: https://doi.org/10.5281/zenodo.7533280
  • Make sure you have installed docker in Ubuntu
  • Load Tailor image: $ docker load < tailor_image.tar
  • Initialize a container: $ docker run -it --gpus all tailor_image bash (If you encounter the problem of docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]., please install the nvidia-container-toolkit and nvidia-docker2 packages using apt. Note: If you encounter the problem of "Unable to locate package nvidia-container-toolkit", please refer to this solution)
  • Go to Tailor: $ cd /home/Tailor

Build on your machine from scratch:

  • Download source code and data (Tailor) from https://github.com/jun-zeng/Tailor or Zenodo: https://doi.org/10.5281/zenodo.7533280
  • Make sure you have installed Miniconda and CUDA 10.0 in Ubuntu
  • Create Tailor virtual environment: conda env create -f environment.yml
  • Activate Tailor virtual environment: conda activate tailor
  • Compile Cython: cd cpgnn && python setup.py build_ext --inplace

Status

We apply for the functional and available badges in this artifact evaluation. To receive these badges, we have publicly released all the source code and data of Tailor. Besides, we have documented step-to-step guidance to run experiments and reproduce results reported in our paper. We have also prepared a docker image to set up our experimental environment.

Repo Structure

.
├── cpg             # Build code property graph (CPG) from source code
├── cpgnn           # Learn code representations from CPG
├── datasets        # Experimental Dataset
├── log             # Evaluation Logs
└── README.md

Run Experiments and Validate Results

Usage Guidance

  • Code property graph (CPG) construction

    • Show usage and help:
    python driver.py -h
    
    usage: driver [-h] [-l LOGGING] [--src_path SRC_PATH] 
                  [--clone_classification CLONE_CLASSIFICATION] 
                  [--encoding] [--encode_path ENCODE_PATH] 
                  [--iresult_path IRESULT_PATH] [--store_iresult] 
                  [--load_iresult] [--statistics] [--lang LANG]
    • Notes: To drive code property graph construction, you need to specify the path to retrieve code fragments (--src_path) and the path to store CPG results (--encode_path). Besides, for C programs, you should specify code clone or classification (--clone_classification).
  • CPG-based neural network (CPGNN)

    • Show usage and help:
    python main_oj.py -h
    
    usage: driver [-h] [-l LOGGING] [--gpu_id GPU_ID] [--dataset DATASET]
              [--splitlabel] [--pretrain PRETRAIN] [--model_type MODEL_TYPE]
              [--adj_type ADJ_TYPE] [--epoch EPOCH] [--lr LR] [--regs [REGS]]
              [--opt_type OPT_TYPE] [--mess_dropout [MESS_DROPOUT]]
              [--early_stop] [--type_dim TYPE_DIM]
              [--word2vec_window WORD2VEC_WINDOW]
              [--word2vec_count WORD2VEC_COUNT]
              [--word2vec_worker WORD2VEC_WORKER] [--word2vec_save]
              [--embed_init EMBED_INIT] [--layer_size [LAYER_SIZE]]
              [--agg_type [AGG_TYPE]] [--save_model]
              [--classification_num CLASSIFICATION_NUM]
              [--clone_threshold CLONE_THRESHOLD]
              [--batch_size_clone BATCH_SIZE_CLONE]
              [--clone_test_unsupervised] [--clone_test_supervised]
              [--clone_val_size CLONE_VAL_SIZE]
              [--clone_test_size CLONE_TEST_SIZE] [--classification_test]
              [--class_val_size CLASS_VAL_SIZE]
              [--class_test_size CLASS_TEST_SIZE]
              [--batch_size_classification BATCH_SIZE_CLASSIFICATION]
              [--classification_visualization] [--report REPORT]
    • Notes: Simplicity is one of our guiding principles in designing CPGNN, where only two hyper-parameters --- the embedding size of type/token symbols (--type_dim) and the number of propagation iterations (--layer_size) --- are needed to be revised towards more effective directions. Note that you need to define two GPUs for training (e.g., --gpu_id 0,1).

Dataset

Tailor has been evaluated on two public datasets: OJClone and BigCloneBench.

  • OJClone dataset

    • OJClone dataset is collected from a pedagogical online judge system, which contains 52,000 C programs belonging to 104 programming tasks. In our experiments, we use the first 15 tasks for code clone detection and all the tasks for source code classification.
  • BigCloneBench dataset

    • The BigCloneBench dataset is a popular code clone benchmark collected from 25,000 Java software systems, which contains 6,000,000 clone pairs and 260,000 non-clone pairs. Additionally, clone pairs can be categorized as Type-1, Type-2, Strongly Type-3, Moderately Type-3, and Weakly Type-3/Type-4 according to line-level and token-level similarity. In our experiments, we sample 20,000 pairs for each clone types as positive samples and 20,000 non-clone pairs as negative samples. Note that the BigCloneBench dataset is not used in source code classification due to the lack of ground truth.

Reproducibility

To demonstrate the reproducibility of experimental results reported in the paper and facilitate future research, we provide the best parameter settings in the script, and show our evaluation logs. Please note that we also include the scripts and training logs for the ablation studies in the paper (e.g., how different GNN variants affect Tailor?).

Additionally, we include two examples of code property graphs build upon C and Java programs.

  • OJClone dataset

    • CPG Construction and Encoding for Code Clone Detection (~ 8 mins)
    # If you want to skip this procedure, you can download oj_clone_encoding.tar.gz from https://drive.google.com/file/d/1sQuMFwuelufxoP3_iAbpYO2rfdc3OzPU
    # Make sure you have decompressed datasets/ojclone.tar.gz
    cd cpg
    python driver.py --lang c --clone_classification clone --src_path ../datasets/ojclone --statistics --encoding --encode_path ../cpgnn/data/oj_clone_encoding
    • Code Clone Detection (~ 2 hours)
    # Make sure you have oj_clone_encoding under cpgnn/data
    cd cpgnn
    python main_oj.py --clone_test_supervised --epoch 30 --classification_num 15 --clone_threshold 0.5 --dataset oj_clone_encoding --type_dim 16 --layer_size [32,32,32,32,32] --batch_size_clone 512 --gpu_id 0,1 --report clone_oj
    • CPG Construction and Encoding for Source Code Classification (~ 40 mins)
    # If you want to skip this procedure, you can download oj_classification_encoding.tar.gz from https://drive.google.com/file/d/1u9s4K43NluxFMVLKoqFoLVJIKRIANuA2
    # Make sure you have decompressed datasets/ojclassification.tar.gz
    cd cpg
    python driver.py --lang c --clone_classification classification --src_path ../datasets/ojclassification --statistics --encoding --encode_path ../cpgnn/data/oj_classification_encoding
    • Source Code Classification (~ 8 hours)
    # Make sure you have oj_classification_encoding under cpgnn/data.
    cd cpgnn
    python main_oj.py --classification_test --epoch 251 --classification_num 104 --dataset oj_classification_encoding --type_dim 16 --layer_size [32,32,32,32,32] --batch_size_classification 384 --gpu_id 0,1 --report classification_oj
  • BigCloneBench dataset

    • CPG Construction and Encoding for Code Clone Detection (~ 30 mins)
    # If you want to skip this procedure, you can download bcb_clone_encoding.tar.gz from https://drive.google.com/file/d/1AQmGqxsMavWbd0fkphHRNr-nXic9wcj8
    # Make sure you have decompressed datasets/bigclonebench.tar.gz
    cd cpg
    python driver.py --lang java --src_path ../datasets/bigclonebench --statistics --encoding --encode_path ../cpgnn/data/bcb_clone_encoding
    • Code Clone Detection (~ 4 hours)
    # Make sure you have bcb_clone_encoding under Tailor/cpgnn/data
    cd cpgnn
    python main_bcb.py --clone_test_supervised --epoch 30 --clone_threshold 0.5 --dataset bcb_clone_encoding --type_dim 16 --layer_size [32,32,32,32] --batch_size_clone 384 --gpu_id 0,1 --report clone_bcb

License

Tailor is released under the GPL-3.0 license.

Publication

Please consider to cite our paper if you use the code in your research project.

@inproceedings{Tailor23ICSE,
  author    = {Jiahao Liu and
               Jun Zeng and
               Xia Wang and
               Zhenkai Liang},
  title     = {Learning Graph-based Code Representations for Source-level Functional Similarity Detection},
  booktitle = {ICSE},
  year      = {2023}
}

Contact

Jun ZENG ([email protected]) and Jiahao LIU ([email protected])

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.