Giter Club home page Giter Club logo

codegen's Introduction

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and model evaluation.

We provide reference implementations of the following papers:

We also provide pre-trained models for language modeling, translation and deobfuscation.

You can find some documentation for each projects in the docs folder:

Dependencies

Run install_env.sh. We use black code formatter.

Data

Source code processors

This repository contains programming languages processors for C++, Java and Python. These processors include:

  • tokenization and detokenization
  • obfuscation
  • function extractions

These processors are based on TreeSitter parsers. As these parsers are available in more than 30 programming languages, one can easily create a new programming language processor.

Example of code tokenization:

from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor

java_code = r"""class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}"""
java_processor = JavaProcessor(root_folder="<YOUR_TREESITER_FOLDER>")
tokenized_java_code = java_processor.tokenize_code(java_code)
print(tokenized_java_code)

BPE

This repository provides wrappers for fast BPE and Roberta BPE at file level.

Dataset Preprocessing

This repository contains a pipeline to create programming languages datasets. Now it supports four datasets modes:

  • Monolingual (ex: Java source code)
  • Monolingual Functions (ex: Java functions)
  • Monolingual Obfuscated (ex: Obfuscated Java source code.)
  • Monolingual Obfuscated Functions (ex: Obfuscated Java functions)

First, download C++ / Java / Python source code from Google BigQuery. To run our preprocessing pipeline, you need to donwload the raw source code on your machine in a JSON format. A sample of it is given here.

The pipeline does the following:

  • Source code extraction from json (.json.gz) and tokenization (.tok)
  • Train BPE codes and vocab
  • Apply BPE (.bpe)
  • Binarization (.pth)
  • Symlink folder with appropriate file names for .pth (XLM-syml). To be given as data_path argument for training.

To run the pipeline :

python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \                            # folder containing json.gz
--langs java cpp python  \               # languages to process
--mode monolingual_functions \           # dataset mode
--bpe_mode=fast_bpe \                    # BPE mode. by default it is fast_BPE. can be roberta_bpe
--local=True \                           # Run on your local machine if True. If False run on a cluster (requires submitit setup)
--train_splits=1                         # Number of trainings splits

If you give several languages, the BPE codes and vocab will be learned commonly on these languages , so that you will have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on every language separatly. These tests test the pipeline on different modes. It will give you an overview of the possible options.

Also, we provide the BPE codes and vocabulary here. These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as fastbpe_code_path and fastbpe_vocab_path arguments.

In TransCoder and DOBF readmes, we provide the commands to preprocess the respective datasets.

Model

Overview

In this repository, we provide code to train transformer-based models (code based on XLM repository). The available training tasks are the following:

  • Masked Language Model (MLM)
  • Causal Language Model (CLM)
  • Supervised Machine translation (MT)
  • Classification
  • Deobfuscation = DOBF
  • Unsupervised Machine translation = TransCoder (Denoising auto encoding AE + Back Translation BT)

We evaluate our models with metrics adapted to each task (e.g. computation accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).

Also, we provide wrappers to fine-tune and evaluate our models on CodeXGLUE benchmark.

Download models

You can donwload the following models :

Re train specific models

To have details on how to retrain specific models, please refer to the README specific to each model.

References

TransCoder model (NeurIPS 2020)

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample Unsupervised Translation of Programming Languages.

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

DOBF

[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec , G. Lample DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.

@article{roziere2021dobf,
  title={{DOBF}: A Deobfuscation Pre-Training Objective for Programming Languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},
  journal={arXiv preprint arXiv:2102.07492},
  year={2021}
}

TransCoder-ST

[3] B. Roziere, J.M. Zhang, F. Charton, M. Harman, G. Synnaeve, G. Lample Leveraging Automated Unit Tests for Unsupervised Code Translation.

@article{roziere2021leveraging,
  title={Leveraging Automated Unit Tests for Unsupervised Code Translation},
  author={Roziere, Baptiste and Zhang, Jie M and Charton, Francois and Harman, Mark and Synnaeve, Gabriel and Lample, Guillaume},
  journal={arXiv preprint arXiv:2110.06773},
  year={2021}
}

*Equal Contribution

License

The validation and test parallel datasets from GeeksForGeeks, and the evaluation scripts under data/transcoder_evaluation_gfg are released under the Creative Commons Attribution-ShareAlike 2.0 license. See https://creativecommons.org/licenses/by-sa/2.0/ for more information.

The rest of the CodeGen repository is under the MIT license. See LICENSE for more details.

codegen's People

Contributors

brozi avatar alexshypula avatar malachaux avatar bzz avatar yazdanbakhsh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.