
codegen's Introduction

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and model evaluation.

We provide reference implementations of the following papers:

We also provide pre-trained models for language modeling, translation and deobfuscation.

You can find some documentation for each project in the docs folder:

Dependencies

Run install_env.sh. We use the black code formatter.

Data

Source code processors

This repository contains programming language processors for C++, Java and Python. These processors include:

  • tokenization and detokenization
  • obfuscation
  • function extraction

These processors are based on TreeSitter parsers. As these parsers are available for more than 30 programming languages, one can easily create a new programming language processor.

Example of code tokenization:

from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor

java_code = r"""class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}"""
java_processor = JavaProcessor(root_folder="<YOUR_TREESITTER_FOLDER>")
tokenized_java_code = java_processor.tokenize_code(java_code)
print(tokenized_java_code)
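Detokenization reverses this step. A minimal sketch, assuming the processor exposes a detokenize_code method as the Java and C++ processors do (check the processor class for the exact API):

# Reconstruct source code from the tokenized output above.
# If your version of tokenize_code returns a list of tokens rather than a string,
# join them with spaces before calling detokenize_code.
detokenized_java_code = java_processor.detokenize_code(tokenized_java_code)
print(detokenized_java_code)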

BPE

This repository provides wrappers for fastBPE and RoBERTa BPE at the file level.
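The wrappers live under codegen_sources/preprocessing/bpe_modes and drive the fastBPE binary built in codegen_sources/model/tools/fastBPE. Purely as an illustration (the preprocessing pipeline below normally calls this for you), the binary can also be invoked directly on a tokenized file; paths and the number of codes are placeholders:

# learn BPE codes from a tokenized file, then apply them
codegen_sources/model/tools/fastBPE/fast learnbpe 50000 <TOKENIZED_FILE> > <CODES_FILE>
codegen_sources/model/tools/fastBPE/fast applybpe <OUTPUT_BPE_FILE> <TOKENIZED_FILE> <CODES_FILE>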

Dataset Preprocessing

This repository contains a pipeline to create programming language datasets. It currently supports four dataset modes:

  • Monolingual (ex: Java source code)
  • Monolingual Functions (ex: Java functions)
  • Monolingual Obfuscated (ex: Obfuscated Java source code)
  • Monolingual Obfuscated Functions (ex: Obfuscated Java functions)

First, download C++ / Java / Python source code from Google BigQuery. To run our preprocessing pipeline, you need to download the raw source code to your machine in JSON format. A sample of it is given here.
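Each line of the .json.gz input is one JSON record describing a source file. Purely as an illustration (the linked sample defines the exact fields the pipeline expects), a BigQuery export line looks roughly like this, with at least the repository name and the file content:

{"repo_name": "some_user/some_repo", "path": "src/Main.java", "content": "class HelloWorld { ... }"}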

The pipeline does the following:

  • Source code extraction from json (.json.gz) and tokenization (.tok)
  • Train BPE codes and vocab
  • Apply BPE (.bpe)
  • Binarization (.pth)
  • Symlink folder with appropriate file names for .pth (XLM-syml). To be given as data_path argument for training.

To run the pipeline:

python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \                            # folder containing json.gz
--langs java cpp python  \               # languages to process
--mode monolingual_functions \           # dataset mode
--bpe_mode=fast \                        # BPE mode. It is fast by default, can be roberta
--local=True \                           # Run on your local machine if True. If False run on a cluster (requires submitit setup)
--train_splits=1                         # Number of training splits

If you give several languages, the BPE codes and vocabulary will be learned jointly on these languages, so that you have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on each language separately. These tests exercise the pipeline in its different modes and give an overview of the possible options.

Also, we provide the BPE codes and vocabulary here. These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as fastbpe_code_path and fastbpe_vocab_path arguments.
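For example, a preprocessing run that reuses the downloaded codes and vocabulary instead of learning new ones could look like this (the paths are placeholders for wherever you stored the files):

python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \
--langs java cpp python \
--mode monolingual_functions \
--bpe_mode=fast \
--local=True \
--train_splits=1 \
--fastbpe_code_path <PATH_TO_BPE_CODES> \
--fastbpe_vocab_path <PATH_TO_BPE_VOCAB>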

In TransCoder and DOBF readmes, we provide the commands to preprocess the respective datasets.

Model

Overview

In this repository, we provide code to train transformer-based models (code based on the XLM repository). The available training tasks are the following:

  • Masked Language Model (MLM)
  • Causal Language Model (CLM)
  • Supervised Machine translation (MT)
  • Classification
  • Deobfuscation = DOBF
  • Unsupervised Machine translation = TransCoder (denoising auto-encoding (AE) + back-translation (BT))

We evaluate our models with metrics adapted to each task (e.g. computational accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).

Also, we provide wrappers to fine-tune and evaluate our models on CodeXGLUE benchmark.

Download models

You can download the following models:
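Once downloaded, a model can be used directly for inference. For example, translating a C++ file to Java with a TransCoder checkpoint (the input file and beam size are placeholders):

python -m codegen_sources.model.translate \
--src_lang cpp \
--tgt_lang java \
--model_path TransCoder_model_1.pth \
--beam_size 10 < <YOUR_CPP_FILE>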

Retrain specific models

To have details on how to retrain specific models, please refer to the README specific to each model.

References

TransCoder model (NeurIPS 2020)

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample. Unsupervised Translation of Programming Languages.

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

DOBF

[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec, G. Lample. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.

@article{roziere2021dobf,
  title={{DOBF}: A Deobfuscation Pre-Training Objective for Programming Languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},
  journal={arXiv preprint arXiv:2102.07492},
  year={2021}
}

TransCoder-ST

[3] B. Roziere, J.M. Zhang, F. Charton, M. Harman, G. Synnaeve, G. Lample. Leveraging Automated Unit Tests for Unsupervised Code Translation.

@article{roziere2021leveraging,
  title={Leveraging Automated Unit Tests for Unsupervised Code Translation},
  author={Roziere, Baptiste and Zhang, Jie M and Charton, Francois and Harman, Mark and Synnaeve, Gabriel and Lample, Guillaume},
  journal={ICLR},
  year={2022}
}

TransCoder-IR

@article{szafraniec2022code,
  title={Code translation with Compiler Representations},
  author={Szafraniec, Marc and Roziere, Baptiste and Leather, Hugh and Charton, Francois and Labatut, Patrick and Synnaeve, Gabriel},
  journal={ICLR},
  year={2023}
}

* Equal Contribution

License

The validation and test parallel datasets from GeeksForGeeks, and the evaluation scripts under data/transcoder_evaluation_gfg are released under the Creative Commons Attribution-ShareAlike 2.0 license. See https://creativecommons.org/licenses/by-sa/2.0/ for more information.

The rest of the CodeGen repository is under the MIT license. See LICENSE for more details.

codegen's People

Contributors

alexshypula, bigfootjon, brozi, bzz, malachaux, marcszafraniec, yazdanbakhsh


codegen's Issues

Transcoder fails with error: CUBLAS_STATUS_NOT_INITIALIZED

After updating NVIDIA driver to 510.06, CUDA is finally recognized.

(codeGen_env) usr1@mak:~/projects/CodeGen$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
**Is CUDA available: True**
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080
Nvidia driver version: 510.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.7.0
[pip3] torchaudio==0.7.0a0+ac17b64
[pip3] torchvision==0.8.1
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.0.3               h15472ef_9    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            12_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            12_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            12_linux64_mkl    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640
[conda] numpy                     1.19.5           py36hfc0c790_2    conda-forge
[conda] pytorch                   1.7.0           py3.6_cuda11.0.221_cudnn8.0.3_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.7.0                      py36    pytorch
[conda] torchvision               0.8.1                py36_cu110    pytorch
(codeGen_env) usr1@mak:~/projects/CodeGen$

Successfully compiled cuda extensions but had to comment out this section in apex to suppress a runtime error as per the output suggestion. This may (or may not) be the cause of CUBLAS_STATUS_NOT_INITIALIZED (see below).

apex - commented out

   if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
   ...

Cuda Ext Compilation

(codeGen_env) usr1@mak:~/projects/CodeGen/apex$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/pip/_internal/commands/install.py:245: UserWarning: Disabling all use of wheels due to the use of --build-option / --global-option / --install-option.
  cmdoptions.check_install_build_global(options)
Using pip 21.3.1 from /home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/pip (python 3.6)
Processing /home/usr1/projects/CodeGen/apex
  Running command python setup.py egg_info

  torch.__version__  = 1.7.0

  running egg_info
  creating /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info
  writing /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/dependency_links.txt

  writing top-level names to /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
  /home/usr1/projects/CodeGen/apex/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
    warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
  Preparing metadata (setup.py) ... done
Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
    Running command /home/usr1/anaconda3/envs/codeGen_env/bin/python3.6 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/usr1/projects/CodeGen/apex/setup.py'"'"'; __file__='"'"'/home/usr1/projects/CodeGen/apex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-tezgj0d4/install-record.txt --single-version-externally-managed --compile --install-headers /home/usr1/anaconda3/envs/codeGen_env/include/python3.6m/apex

    torch.__version__  = 1.7.0

    /home/usr1/projects/CodeGen/apex/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
    from /usr/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/usr1/projects/CodeGen/apex/setup.py", line 159, in <module>
        check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
      File "/home/usr1/projects/CodeGen/apex/setup.py", line 103, in check_cuda_torch_binary_vs_bare_metal
        "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.0.
    In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

Translation
cpp --> Java
The source file is C (not C++), about 600 lines of code like this (nothing fancy):

int DBG_change(int level_offset)
{
	int retval = 0;

	if ((debug_level += level_offset) < MSG_DEBUG_LEVEL_NONE)
	{
		debug_level = MSG_DEBUG_LEVEL_NONE;
		debug_status = 0;
	}

	if (! debug_file_ptr && debug_level >= MSG_DEBUG_LEVEL_NONE)
	{
		retval = DBG_setup((char *) NULL, (char *) NULL);
	}
	else if (debug_file_ptr && debug_level <= MSG_DEBUG_LEVEL_NONE)
		DBG_close();

	return (retval);
}

static void DBG_set_level(char *app_name)
{
	char *envptr;
	char debug_env[MAX_ENVVARLEN + 1];

	if ((app_name != NULL) && *app_name)
	{
		strncpy(debug_env, app_name, MAX_ENVVARLEN);
		strncat(debug_env, "_DEBUG_LEVEL", MAX_ENVVARLEN - strlen(debug_env));
	}
	else
	{
		strcpy(debug_env, "DEBUG_LEVEL");
	}

	envptr = (char *) getenv(debug_env);

	if (envptr == NULL)
		debug_level = MSG_DEBUG_LEVEL_NONE;
	else if (! strcmp(envptr, "TRUE")  || ! strcmp(envptr, "true"))
		debug_level = MSG_DEBUG_LEVEL_ON_MIN;
	else if (! strcmp(envptr, "MIN")  || ! strcmp(envptr, "min"))
		debug_level = MSG_DEBUG_LEVEL_ON_MIN;
	else if (! strcmp(envptr, "NORM")  || ! strcmp(envptr, "norm"))
		debug_level = MSG_DEBUG_LEVEL_ON_NORM;
	else if (! strcmp(envptr, "MAX")   || ! strcmp(envptr, "max"))
		debug_level = MSG_DEBUG_LEVEL_ON_MAX;
	else if (! strcmp(envptr, "FALSE") || ! strcmp(envptr, "false"))
		debug_level = MSG_DEBUG_LEVEL_NONE;
	else if (isdigit(envptr[0]))
	{
		if ((debug_level = (int) atoi(envptr)) <= MSG_DEBUG_LEVEL_NONE)
			debug_level = MSG_DEBUG_LEVEL_NONE;
	}
	else
	{
		debug_level = MSG_DEBUG_LEVEL_NONE;
	}
}

python -m codegen_sources.model.translate
Error: CUBLAS_STATUS_NOT_INITIALIZED

(codeGen_env) usr1@mak:~/projects/CodeGen$ python -m codegen_sources.model.translate --src_lang cpp --tgt_lang java --model_path TransCoder_model_1.pth --beam_size 10 < csrc.c
adding to path /home/usr1/projects/CodeGen
INFO - 12/13/21 16:50:08 - 0:00:05 - ============ Model Reloading
INFO - 12/13/21 16:50:08 - 0:00:05 - Reloading encoder from TransCoder_model_1.pth ...
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 12/13/21 16:50:13 - 0:00:10 - Reloading decoders from TransCoder_model_1.pth ...
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of parameters (encoder): 143239641
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of parameters (decoders): 168442329
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of decoders: 1
...
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/usr1/projects/CodeGen/codegen_sources/model/translate.py", line 276, in <module>
    beam_size=params.beam_size,
  File "/home/usr1/projects/CodeGen/codegen_sources/model/translate.py", line 192, in translate
    enc1 = self.encoder("fwd", x=x1, lengths=len1, langs=langs1, causal=False)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 433, in forward
    return self.fwd(**kwargs)
  File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 526, in fwd
    attn = self.attentions[i](tensor, attn_mask, use_cache=use_cache)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 243, in forward
    q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

It's worth noting I haven't trained any models; just using TransCoder_model_1.pth as-is and doing a simple cpp->Java translation as a test. What is the significance of this error?

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Outputted files have erroneous characters [bug on mac os]

Hi, I was trying to reproduce the results in the TransCoder paper, but ran into some issues when it computed the computational accuracy.
It seems that when writing things such as IDs or output programs to files, or printing outputs in the log, the text has some issues.

For example, in ids.java_sa-python_sa.text.txt (created by create_reference_files in evaluator.py), the lines look like “CHECK_@@ WHE@@ THER_@@ GI@@ V@@ EN_@@ NUMBER_@@ EV@@ EN_@@ O@@ DD”.

The scripts output by the model (e.g. in eval_scripts/java_sa-python_sa.test/) that are used to compute the computational accuracy similarly have erroneous characters and spaces (which cause syntax errors), e.g.:

def f_filled is_@@ ap ( arr , n ) :
    if n == 1 : return True
    arr.sort ( )
    d = arr [ 1 ] - arr [ 0 ]
    for i in range ( 2 , n ) :
        if arr [ i ] - arr [ i - 1 ] != d : return False
    return True

If it is relevant, I am on Mac OS Catalina with python 3.9 and this is the command I have been running to evaluate the TransCoder models provided:

python codegen_sources/model/train.py \
--eval_only True \
--reload_model 'TransCoder_model_1.pth,TransCoder_model_2.pth' \
--data_path "test_dataset" \
--exp_name transcoder \
--dump_path 'dump' \
--lgs 'java_sa-python_sa'  \
--bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa'  \
--ae_steps 'python_sa,java_sa'  \
--mt_steps 'java_sa-python_sa,python_sa-java_sa' \
--encoder_only False \
--emb_dim 1024 \
--n_heads 8 \
--n_layers 0 \
--n_layers_encoder 6  \
--n_layers_decoder 6 \
--eval_bleu true \
--eval_computation true \
--has_sentences_ids true

Thank you.

Can this be used to translate between SQL dialects?

I am working on a project where I have to convert Teradata SQL queries to Redshift queries. This then has to be generalized to other dialects as well, e.g. MySQL, BigQuery, PGSQL, etc.

  • So could this model be used to achieve this?
  • How should I proceed with it?

Question for validation and test sets

Hi,

As the paper "Unsupervised Translation of Programming Languages" mentions, there are 852 parallel functions. So I checked the data in this repo folder and found (each file contains one function with its unit test cases; there are actually 852 distinct filenames across the python/java/cpp folders):

  • 698 cpp functions
  • 717 java functions
  • 702 python functions

This is different from the test/validation dataset sizes mentioned in Table 4 of the paper:

  • c++ 466/231
  • java 234/481
  • python 237/463

Another question is about the number of function pairs in Table 5 of the paper. I wonder why C++ -> Java has 481 test functions while Java -> C++ only has 466 test functions. If my understanding is right, there should be the same number of tests for a given pair of parallel functions (Java to C++ or C++ to Java). Why is the count of test functions different for C++ -> Java and Java -> C++ (the same question applies to the other language pairs)?

really thanks

Clarification questions

Hi, I have a few questions regarding TransCoder's training data and optimization setting.

  1. From the paper, it is clear that TransCoder is trained using Standalone functions during the DAE+BT training stage. But is TransCoder only trained using Standalone functions in the MLM stage too?
  2. During the MLM stage, only the encoder part of TransCoder is pre-trained, right?
  3. For the MLM pre-training, max_epoch and epoch_size are set to 100k. If I understand correctly, epoch_size basically refers to the number of instances used in each epoch. Is it correct? Also, for MLM pre-training, the following are set:
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10' 

So I am assuming TransCoder pre-training is stopped based on the stopping_criterion. Before the MLM pre-training was stopped, how many optimization steps had been executed?

  4. Unlike the MLM pre-training stage, no stopping_criterion is set for the DAE+BT training stage, and epoch_size is set to 50000 while max_epoch is set to 10000000. So when does the training stop? How many optimization steps were executed during this stage?

Ablation on data size

Hi, appreciate the amazing work in unsupervised code translation!
I wonder if you have done an ablation study on the training data size of TransCoder, since the unsupervised model needs much more training data (over 500M functions for 3 languages) than existing code PLMs like CodeT5 (8.35M functions for 7 languages).
How does TransCoder perform if less data is provided?

Preprocess complete but running into issues when training

Hello,
I ran the preprocess step and it appeared to complete successfully; however, after running the following command to train

python codegen_sources/model/train.py --exp_name transcoder --dump_path './model1' --data_path '/content/CodeGen/data/test_dataset/XLM-syml' --split_data_accross_gpu local --bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa' --ae_steps 'python_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 12 --n_layers_decoder 6 --emb_dim 768 --n_heads 12 --lgs 'java_sa-python_sa' --max_vocab 64000 --gelu_activation true --roberta_mode true --lgs_mapping 'java_sa:java_obfuscated,python_sa:python_obfuscated' --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 50000 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_python_sa-java_sa_mt_comp_acc'

I ran into the following error.

INFO - 08/04/21 18:32:16 - 0:00:00 - ============ Monolingual data (python_sa)
INFO - 08/04/21 18:32:16 - 0:00:00 - Loading data from /content/CodeGen/data/test_dataset/XLM-syml/train.python_sa.0.pth ...
INFO - 08/04/21 18:32:16 - 0:00:00 - 37514 words (50775 unique) in 250 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 08/04/21 18:32:16 - 0:00:00 - Selecting 64000 most frequent words ...
INFO - 08/04/21 18:32:16 - 0:00:00 - Maximum vocabulary size: 64000. Dictionary size: 50775 -> 50775 (removed 0 words).
INFO - 08/04/21 18:32:16 - 0:00:00 - Now 0 unknown words covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Removed 0 empty sentences.
INFO - 08/04/21 18:32:17 - 0:00:01 - Removed 0 empty sentences.
INFO - 08/04/21 18:32:17 - 0:00:01 - Removed 111 too long sentences.

INFO - 08/04/21 18:32:17 - 0:00:01 - Loading data from /content/CodeGen/data/test_dataset/XLM-syml/valid.python_sa.pth ...
INFO - 08/04/21 18:32:17 - 0:00:01 - 3088 words (50775 unique) in 25 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Selecting 64000 most frequent words ...
INFO - 08/04/21 18:32:17 - 0:00:01 - Maximum vocabulary size: 64000. Dictionary size: 50775 -> 50775 (removed 0 words).
INFO - 08/04/21 18:32:17 - 0:00:01 - Now 0 unknown words covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Removed 0 empty sentences.

INFO - 08/04/21 18:32:17 - 0:00:01 - Loading data from /content/CodeGen/data/test_dataset/XLM-syml/test.python_sa.pth ...
INFO - 08/04/21 18:32:17 - 0:00:01 - 8581 words (50775 unique) in 43 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Selecting 64000 most frequent words ...
INFO - 08/04/21 18:32:17 - 0:00:01 - Maximum vocabulary size: 64000. Dictionary size: 50775 -> 50775 (removed 0 words).
INFO - 08/04/21 18:32:17 - 0:00:01 - Now 0 unknown words covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Removed 0 empty sentences.

INFO - 08/04/21 18:32:17 - 0:00:01 - ============ Parallel data (java_sa-python_sa)
INFO - 08/04/21 18:32:17 - 0:00:01 - Loading data from /content/CodeGen/data/test_dataset/XLM-syml/valid.java_sa-python_sa.java_sa.pth ...
INFO - 08/04/21 18:32:17 - 0:00:01 - 66397 words (64461 unique) in 470 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Selecting 64000 most frequent words ...
INFO - 08/04/21 18:32:17 - 0:00:01 - Maximum vocabulary size: 64000. Dictionary size: 64461 -> 64000 (removed 461 words).
INFO - 08/04/21 18:32:17 - 0:00:01 - Now 0 unknown words covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Loading data from /content/CodeGen/data/test_dataset/XLM-syml/valid.java_sa-python_sa.python_sa.pth ...
INFO - 08/04/21 18:32:17 - 0:00:01 - 63023 words (64461 unique) in 470 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 08/04/21 18:32:17 - 0:00:01 - Selecting 64000 most frequent words ...
INFO - 08/04/21 18:32:17 - 0:00:01 - Maximum vocabulary size: 64000. Dictionary size: 64461 -> 64000 (removed 461 words).
INFO - 08/04/21 18:32:17 - 0:00:01 - Now 0 unknown words covering 0.00% of the data.
Traceback (most recent call last):
  File "codegen_sources/model/train.py", line 701, in <module>
    main(params)
  File "codegen_sources/model/train.py", line 556, in main
    data = load_data(params)
  File "/content/CodeGen/codegen_sources/model/src/data/loader.py", line 582, in load_data
    load_para_data(params, data)
  File "/content/CodeGen/codegen_sources/model/src/data/loader.py", line 278, in load_para_data
    set_dico_parameters(params, data, src_data["dico"])
  File "/content/CodeGen/codegen_sources/model/src/data/loader.py", line 117, in set_dico_parameters
    assert data["dico"] == dico
AssertionError

I was having this original issue with TransCoder as well (I didn't end up solving it; I saw this came out and switched to try CodeGen). It seems to be related to the parallel data; however, I pulled that from your source, so I'm a little bit perplexed and not sure where to look.

Timeout Error

I am trying to reproduce the results in the TransCoder paper. For that I have started with the preprocessing part. For some of the files, I get the following error. I saw that the timeout for extract_data_for_line is set to 60 seconds, hence these timeout exceptions occur. Is this OK or should I extend the timeout for this function?

INFO - 08/29/21 22:58:02 - 0:14:58 - Timeout error extracting data
INFO - 08/29/21 23:02:49 - 0:19:45 - Timeout error extracting data
INFO - 08/29/21 23:08:52 - 0:25:48 - Timeout error extracting data
INFO - 08/29/21 23:12:51 - 0:29:47 - Timeout error extracting data
INFO - 08/29/21 23:13:18 - 0:30:14 - Timeout error extracting data
INFO - 08/29/21 23:13:57 - 0:30:53 - Timeout error extracting data
INFO - 08/29/21 23:14:50 - 0:31:46 - Timeout error extracting data
INFO - 08/29/21 23:15:03 - 0:31:59 - Timeout error extracting data
INFO - 08/29/21 23:15:56 - 0:32:52 - Timeout error extracting data
INFO - 08/29/21 23:16:23 - 0:33:18 - Timeout error extracting data

Parallel datasets

Hi,
I am trying to create a POC using CodeGen to translate code written in vb to Java and vice-versa.
I downloaded the training data for vb and java using Google BigQuery. Also, I have completed the preprocessing step using commands:

  1. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10
  2. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

As a result, the following files were created inside the folder XLM-syml:

  1. test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth
  2. train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa [0-9]].pth
  3. valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth

Post that, I trained the MLM model using the following command:
python codegen_sources/model/train.py --exp_name mlm_vb_java_fast_mono_updated_v0 --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --mlm_steps 'vb_sa,java_sa' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 6 --emb_dim 1024 --n_heads 8 --lgs 'vb_sa-java_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --amp 2 --fp16 true --batch_size 16 --bptt 512 --epoch_size 200 --max_epoch 100000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

However, when I try to train the TransCoder model using the following command, I get an AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth error.
Command:
python codegen_sources/model/train.py --exp_name transcoder_vb_java_updated_v1 --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --split_data_accross_gpu local --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' --ae_steps 'vb_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 6 --n_layers_decoder 6 --emb_dim 1024 --n_heads 8 --lgs 'java_sa-vb_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' --reload_encoder_for_decoder true --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 100 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --has_sentences_ids true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_vb_-java_mt_comp_acc' --lgs_mapping 'vb_sa:vb,java_sa:java'

Could you please help me with how to get these parallel datasets?
Also, is there something/some step that I am missing or doing incorrectly?


[Question]Making use of TreeSitterLangProcessor

Hi,

I am trying to use TransCoder to translate between JavaScript and Python, so I am trying to build a JavaScript processor for the processing.py pipeline (as mentioned in #42 (comment)).

To build the processor I need a tokenizer (mentioned in #48 (comment)); I want to use ANTLR4, which is a parser that contains a lexer.

However, I don't understand how to make use of the TreeSitterLangProcessor class. I tried to reference the java processor and cpp processor and found that I need to provide three init params (specific to one language) by inheriting the TreeSitterLangProcessor:

question 1 - what do those init params represent?

I don't understand how to find out what values should be stored in TOKEN2CHAR and ast_nodes_type_string for a new language (JavaScript). For example:

  • why "STOKEN00" refers to "//" in JAVA_TOKEN2CHAR, where does the mapping come from?
  • why ast_nodes_type_string in java processor has 'character_literal' while the java processor seems to call the same thing as 'char_literal'? how can I find out what to save in the ast_nodes_type_string for a JavaScript processor?

question 2 - when should I use TreeSitterLangProcessor and when not?

Why doesn't the Python processor make use of the TreeSitterLangProcessor class? In which cases is it better to use the TreeSitterLangProcessor class, and in which cases is it better not to?

question 3 - why do I need a tokenizer given TreeSitterLangProcessor?

As mentioned in question 1, it seems that if I inherit my JavaScript processor from TreeSitterLangProcessor, those three init params are the only things I need to provide myself, and the rest (tokenize and detokenize) is handled by TreeSitterLangProcessor.

Then why would I need a JavaScript tokenizer (mentioned in #48 (comment) ) such as ANTLR4?


Hope I described my questions clearly; sorry that I am still confused about this after two issues regarding adding a new language.

Thanks for the awesome paper and well-structured repository, and thanks in advance for anyone's help!

Memory Usage Preprocessing

When running preprocessing on all the data, it seems that the job consumes almost all the system memory plus the swap. It makes the whole job very slow (after 1-2 days only 10-100 files have been processed). I am wondering if there is any way to put a limit on the number of files that are loaded into memory concurrently.

I see that there is an option job_mem when the job is running on clusters, but not when the job is running locally.

Training script stuck (using eval only)

I'm trying to run

python codegen_sources/model/train.py --eval_only True --reload_model 'TransCoder_model_2.pth,TransCoder_model_2.pth' --data_path "test_dataset" --exp_name transcoder --dump_path 'dump' --lgs 'java_sa-python_sa'  --bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa'  --ae_steps 'python_sa,java_sa'  --mt_steps 'java_sa-python_sa,python_sa-java_sa' --encoder_only False --emb_dim 1024 --n_heads 8 --n_layers 0 --n_layers_encoder 6  --n_layers_decoder 6 --eval_bleu true --eval_computation true --has_sentences_ids true

but it is unable to find the following files in the test_dataset (downloaded from the transcoder doc)

test_dataset/train.java_sa.pth not found
test_dataset/valid.java_sa.pth not found
test_dataset/test.java_sa.pth not found
test_dataset/train.python_sa.pth not found
test_dataset/valid.python_sa.pth not found
test_dataset/test.python_sa.pth not found
test_dataset/train.java_sa-python_sa.java_sa.pth not found
test_dataset/train.java_sa-python_sa.python_sa.pth not found

and gets stuck after this log message:

SLURM job: False
0 - Number of nodes: 1
0 - Node ID        : 0
0 - Local rank     : 0
0 - Global rank    : 0
0 - World size     : 1
0 - GPUs per node  : 1
0 - Master         : True
0 - Multi-node     : False
0 - Multi-GPU      : False
0 - Hostname       : <host>

Any ideas why this might be happening?

I also tried running this translation script, which similarly seems to get stuck:

python -m codegen_sources.model.translate --src_lang python --tgt_lang java --model_path TransCoder_model_2.pth.1 --beam_size 1 < hello.py
adding to path /srv/home/akshit/CodeGen
INFO - 09/27/21 05:56:56 - 0:00:06 - ============ Model Reloading
INFO - 09/27/21 05:56:56 - 0:00:06 - Reloading encoder from TransCoder_model_2.pth.1 ...

Code translation inference optimization

I noticed the inference time for code translation is kind of slow. I assume it only uses the CPU when using translate.py? I cannot find any information about whether it can use a GPU to speed up the inference time.

Question for model training

Hi,
I'm trying to train the models from scratch.
One question about Unsupervised Translation of Programming Languages:

As I understand it, there are 3 steps:
step 1: train the XLM model on all Python/Java/C++ code
step 2: initialize the encoder and decoder parameters from step 1 and train the unsupervised DAE task on code functions
step 3: back-translation step

So are the final models generated by the above 3 steps, or are step 2 and step 3 trained at the same time?

Thanks

Preprocess step completes but .pth files are not generated as expected in XLM folder

Hello,
After running the preprocess step with the command below:

python -m codegen_sources.preprocessing.preprocess /path/data/mydata2 --langs java cpp --mode monolingual_functions --bpe_mode=fast --local=True --train_splits=1 --fastbpe_code_path=/path/data/bpe/cpp-java-python/ --fastbpe_vocab_path=/path/data/bpe/cpp-java-python/

the XLM-syml folder is generated with file names like:
test.cpp_cl.pth
test.cpp_sa.pth
test.java_cl.pth
....
train.cpp_cl.0.pth
train.cpp_sa.0.pth
...

But using this folder/these files in the training step (MLM) gives an error like:

XLM-syml/train.java.pth not found
XLM-syml/valid.java.pth not found
XLM-syml/test.java.pth not found
XLM-syml/train.cpp.pth not found
XLM-syml/valid.cpp.pth not found
XLM-syml/test.cpp.pth not found

Train command:
python train.py --exp_name mlm --dump_path '/path/CodeGen/data/models' --data_path '/path/data/mydata2/XLM-syml' --split_data_accross_gpu local --mlm_steps 'java,cpp' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 12 --emb_dim 768 --n_heads 12 --lgs 'java-cpp' --max_vocab 64000 --gelu_activation true --roberta_mode false --amp 2 --fp16 true --batch_size 8 --bptt 512 --epoch_size 1000 --max_epoch 2000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

I think the files are generated with suffixes like _cl.pth or _sa.pth, which are not being picked up in the training step? Or am I doing something wrong?

Thanks

Small Training Dataset

Since tokenization of the whole dataset takes a lot of time, I decided to create a small dataset with only 10-20 of the json.gz files. Once training starts, it gives the following error. Is it because the tokenization/BPE has not seen this character?

File "/CodeGen/codegen_sources/model/train.py", line 701, in <module> main(params) File "/CodeGen/codegen_sources/model/train.py", line 609, in main trainer.mlm_step( File 
"/CodeGen/codegen_sources/model/src/trainer.py", line 1005, in mlm_step show_batch( File 
"/CodeGen/codegen_sources/model/src/utils.py", line 74, in show_batch f"{label} sent: 
{restore_segmentation_sentence(source_sentence, roberta_mode)}" File "/CodeGen/codegen_sources/model/src/utils.py", 
line 563, in restore_segmentation_sentence return restore_roberta_segmentation_sentence(sentence) File 
"/CodeGen/codegen_sources/model/src/utils.py", line 601, in restore_roberta_segmentation_sentence res = 
bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace") File 
"/CodeGen/codegen_sources/model/src/utils.py", line 601, in <listcomp> res = bytearray([byte_decoder[c] for c in 
text]).decode("utf-8", errors="replace") KeyError: '郞'

problem with transcoder

I get the error below:
File "TransCoder/translate.py", line 171, in
translator = Translator(params)
File "TransCoder/translate.py", line 83, in init
encoder, decoder = build_model(self.reloaded_params, self.dico)
File "/content/TransCoder/XLM/src/model/init.py", line 181, in build_model
enc_path, map_location=lambda storage, loc: storage.cuda(params.local_rank))
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 830, in restore_location
result = map_location(storage, location)
File "/content/TransCoder/XLM/src/model/init.py", line 181, in
enc_path, map_location=lambda storage, loc: storage.cuda(params.local_rank))
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 71, in _cuda
with torch.cuda.device(device):
File "/usr/local/lib/python3.7/dist-packages/torch/cuda/init.py", line 225, in enter
self.prev_idx = torch.cuda.current_device()
File "/usr/local/lib/python3.7/dist-packages/torch/cuda/init.py", line 432, in current_device
_lazy_init()
File "/usr/local/lib/python3.7/dist-packages/torch/cuda/init.py", line 172, in _lazy_init
torch._C._cuda_init()

Portion of translations from select_successful_tests.py fails testcases

I have had issues improving the pre-trained model after fine-tuning. If I run this sanity check on the output dataframe from select_successful_tests.py in this zipped file, I find that on a subset of 500 translations, only around 53% are successful.

My suggestion is to remove the following line
https://github.com/facebookresearch/CodeGen/blob/main/codegen_sources/test_generation/select_successful_tests.py#L187

I will also submit a PR to correct this.

test_pairs.py.zip

Training MLM

I have followed the preprocessing step for both monolingual and monolingual_functions. The generated files in XLM-syml are as follows:

  • train.[cpp | java | python]_cl.[0..NGPU].pth
  • train.[cpp | java | python]_monolingual.[0..NGPU].pth
  • train.[cpp | java | python]_sa.[0..NGPU].pth
  • test.[cpp | java | python]_cl.pth
  • test.[cpp | java | python]_monolingual.pth
  • test.[cpp | java | python]_sa.pth
  • valid.[cpp | java | python]_cl.pth
  • valid.[cpp | java | python]_monolingual.pth
  • valid.[cpp | java | python]_sa.pth

However, whenever I start the training using the script in the README (copied below), I get the following file-not-found error. It seems to me that the script is looking for different files.

Error

File "/CodeGen/codegen_sources/model/train.py", line 697, in <module> check_data_params(params) File "/CodeGen/codegen_sources/model/src/data/loader.py", line 470, in check_data_params assert all( AssertionError: [['/.../transcoder_data/train_data_small/XLM-syml/train.java.pth', '/.../transcoder_data/train_data_small/XLM-syml/valid.java.pth', '/.../transcoder_data/train_data_small/XLM-syml/test.java.pth'], ['/.../transcoder_data/train_data_small/XLM-syml/train.python.pth', '/.../transcoder_data/train_data_small/XLM-syml/valid.python.pth', '/.../transcoder_data/train_data_small/XLM-syml/test.python.pth']]

Training Scripts

python3 -m torch.distributed.launch --nproc_per_node=$NGPU codegen_sources/model/train.py \
--exp_name mlm \
--dump_path '/.../transcoder_data/train_data_small_dump' \
--data_path '/.../transcoder_data/train_data_small/XLM-syml' \
--mlm_steps 'java,python' \
--add_eof_to_stream true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--encoder_only true \
--n_layers 12  \
--emb_dim 768  \
--n_heads 12  \
--lgs 'java-python' \
--max_vocab 64000 \
--gelu_activation true \
--roberta_mode true \
--amp 2  \
--fp16 true  \
--batch_size 32 \
--bptt 512 \
--epoch_size 100000 \
--max_epoch 100000 \
--split_data_accross_gpu global \
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
--save_periodic 0 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'

Code clustering

Can I use the DOBF model for code clustering to find similar code patterns? If so, can you suggest which model to begin with, or point to any examples?

cannot import name 'tokenize_v14_international' from 'sacrebleu'

I am trying to run the preprocessing.py file with the following arguments:
run codegen_sources/preprocessing/new_preprocess.py data/test_dataset --mode obfuscation --langs python --mode obfuscation --train_splits 7 --job_mem 250 --tokenization_timeout 400 --bpe_timeout 220 --train_bpe_timeout 400 --bpe_mode fast --fastbpe_use_vocab False --fastbpe_vocab_path CodeGen/data/newtest_dataset --keep_comments False --fastbpe_code_path C:/Users/sushantk/Anaconda3/CodeGen/codegen_sources/model/tools/fastBPE --ncodes 40000 --percent_test_valid 2

and I am getting the following error
ImportError Traceback (most recent call last)
~\Anaconda3\CodeGen\codegen_sources\preprocessing\new_preprocess.py in
13 from codegen_sources.preprocessing.bpe_modes.fast_bpe_mode import FastBPEMode
14 from codegen_sources.preprocessing.bpe_modes.roberta_bpe_mode import RobertaBPEMode
---> 15 from codegen_sources.preprocessing.dataset_modes.monolingual_functions_mode import (
16 MonolingualFunctionsMode,
17 )

~\Anaconda3\CodeGen\codegen_sources\preprocessing\dataset_modes\monolingual_functions_mode.py in
12
13 import submitit
---> 14 from codegen_sources.preprocessing.dataset_modes.dataset_mode import DatasetMode
15 from codegen_sources.preprocessing.lang_processors.lang_processor import LangProcessor
16 from codegen_sources.preprocessing.obfuscation.utils_deobfuscation import REPLACE_DICT

~\Anaconda3\CodeGen\codegen_sources\preprocessing\dataset_modes\dataset_mode.py in
27 from codegen_sources.preprocessing.bpe_modes.bpe_mode import BPEMode
28 from codegen_sources.preprocessing.obfuscation.utils_deobfuscation import SEPARATOR
---> 29 from codegen_sources.preprocessing.lang_processors.cpp_processor import CppProcessor
30 from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor
31 from codegen_sources.preprocessing.lang_processors.python_processor import (

~\Anaconda3\CodeGen\codegen_sources\preprocessing\lang_processors\cpp_processor.py in
5 # LICENSE file in the root directory of this source tree.
6 #
----> 7 from codegen_sources.preprocessing.lang_processors.tree_sitter_processor import (
8 TreeSitterLangProcessor,
9 NEW_LINE,

~\Anaconda3\CodeGen\codegen_sources\preprocessing\lang_processors\tree_sitter_processor.py in
6 #
7 from codegen_sources.preprocessing.lang_processors.lang_processor import LangProcessor
----> 8 from codegen_sources.preprocessing.lang_processors.tokenization_utils import (
9 process_string,
10 replace_tokens,

~\Anaconda3\CodeGen\codegen_sources\preprocessing\lang_processors\tokenization_utils.py in
6 #
7 import re
----> 8 from sacrebleu import tokenize_v14_international
9
10 # IMPORTED

ImportError: cannot import name 'tokenize_v14_international' from 'sacrebleu' (C:\Users\sushantk\Anaconda3\lib\site-packages\sacrebleu\__init__.py)

Could anyone help me with this error: failed to learn bpe.

When I run the command

python -m codegen_sources.preprocessing.preprocess /home/dina/CodeGen/data/test_dataset --langs java cpp python --mode monolingual_functions --bpe_mode=fast --local=True --train_splits=1
####### Error ########
INFO - 11/18/21 08:01:43 - 0:01:18 - training bpe on /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb...
Traceback (most recent call last):
File "/home/dina/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dina/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dina/CodeGen/codegen_sources/preprocessing/preprocess.py", line 214, in
preprocess(args)
File "/home/dina/CodeGen/codegen_sources/preprocessing/preprocess.py", line 102, in preprocess
dataset.learn_bpe(ncodes=args.ncodes, executor=cluster_train_bpe)
File "/home/dina/CodeGen/codegen_sources/preprocessing/dataset_modes/dataset_mode.py", line 589, in learn_bpe
self._learn_bpe(ncodes, executor)
File "/home/dina/CodeGen/codegen_sources/preprocessing/dataset_modes/monolingual_functions_mode.py", line 123, in _learn_bpe
job.result()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/core.py", line 263, in result
r = self.results()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/core.py", line 291, in results
raise job_exception # pylint: disable=raising-bad-type
submitit.core.utils.FailedJobError: Job (task=0) failed during processing with trace:

Traceback (most recent call last):
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/utils.py", line 122, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/dina/CodeGen/codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py", line 53, in learn_bpe_file
assert (
AssertionError: failed to learn bpe on /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb, command: /home/dina/CodeGen/codegen_sources/model/tools/fastBPE/fast learnbpe 50000 /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb > /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.codes


You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:

  • /home/dina/CodeGen/data/test_dataset/log/5615_0_log.err
  • /home/dina/CodeGen/data/test_dataset/log/5615_0_log.out

Conda Environment

Hi,

Having some dependency issues with the environment setup. Would someone be able to share a working Conda environment export on Mac and/or Linux?

I believe you just need to run conda env export > environment.yml.

In particular, the main issue I'm having seems to be with finding cudatoolkit=11.0, and often conda is not able to solve the environment when I run the line in the setup script: conda install pytorch torchvision torchaudio cudatoolkit=11.0 six scikit-learn stringcase transformers ply slimit astunparse submitit

Unable to load module file for cuda

Hi,

I have been trying to configure CodeGen on WSL2. I have installed CUDA toolkit 11.0. However, while running the module load cuda/11.0 command, I am getting an 'ERROR: Unable to locate a modulefile for 'cuda/11.0'' error.
Is there any dependency that needs to be installed?

(screenshot: cuda_error)

Could CodeGen run on Windows?

I am trying to run CodeGen on Windows, but an error occurred when I installed fastBPE with pip3. I do not know how to solve it. Could anyone help me?

The error log list here:

Collecting fastBPE
Using cached fastBPE-0.1.0.tar.gz (35 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: fastBPE
Building wheel for fastBPE (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
package init file 'fastBPE/__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\fastBPE
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.9\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl: command line warning D9025: overriding '/Os' with '/Ot'
cl: command line warning D9002: ignoring unknown option '-std=c++11'
cl: command line warning D9002: ignoring unknown option '-Of'
cl: command line warning D9002: ignoring unknown option '-Oa'
cl: command line warning D9002: ignoring unknown option '-pthread'
fastBPE.cpp
C:\Users\leehsiang\AppData\Local\Temp\pip-install-4v3gqqtr\fastbpe_bf352c1da4044f1babb8708a79be8d35\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for fastBPE
Running setup.py clean for fastBPE
Failed to build fastBPE
WARNING: Ignoring invalid distribution -umpy (c:\users\leehsiang\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution - (c:\users\leehsiang\appdata\local\programs\python\python39\lib\site-packages)
Installing collected packages: fastBPE
Running setup.py install for fastBPE ... error
error: subprocess-exited-with-error

× Running setup.py install for fastBPE did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running install
running build
running build_py
package init file 'fastBPE_init_.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\fastBPE
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.9\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl: command line warning D9025: overriding '/Os' with '/Ot'
cl: command line warning D9002: ignoring unknown option '-std=c++11'
cl: command line warning D9002: ignoring unknown option '-Of'
cl: command line warning D9002: ignoring unknown option '-Oa'
cl: command line warning D9002: ignoring unknown option '-pthread'
fastBPE.cpp
C:\Users\leehsiang\AppData\Local\Temp\pip-install-4v3gqqtr\fastbpe_bf352c1da4044f1babb8708a79be8d35\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> fastBPE

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

How to add new languages

I am wondering if I can add new languages for code translation; for example, I want to translate COBOL code to Python.

Do you have any tips? Could you briefly explain what I should do?
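
Not an authoritative answer, but as a rough sketch: adding a language generally means providing a tree-sitter grammar for it plus a thin processor class following the existing TreeSitter-based ones. The module path below comes from the repository; the class and constructor argument names are assumptions, so mirror cpp_processor.py or java_processor.py for the real signature.

from codegen_sources.preprocessing.lang_processors.tree_sitter_processor import (
    TreeSitterLangProcessor,
)

# Hypothetical special-token mappings; a real COBOL processor would define its own.
COBOL_TOKEN2CHAR = {"STOKEN00": "*>"}  # e.g. the COBOL inline comment marker
COBOL_CHAR2TOKEN = {char: token for token, char in COBOL_TOKEN2CHAR.items()}

class CobolProcessor(TreeSitterLangProcessor):
    # NOTE: the argument names below are assumptions, not the repo's verified signature.
    def __init__(self, root_folder):
        super().__init__(
            language="cobol",                      # requires a tree-sitter-cobol grammar
            ast_nodes_type_string=["comment", "string_literal"],
            stokens_to_chars=COBOL_TOKEN2CHAR,
            chars_to_stokens=COBOL_CHAR2TOKEN,
            root_folder=root_folder,
        )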

How to use new BPE codes and vocab to generate parallel data

I have a few questions.

  1. I have used the data in the folder https://github.com/facebookresearch/CodeGen/tree/main/data/test_dataset and learned BPE codes and vocab in Monolingual Functions mode. How can I use the .tok files in the zip file https://dl.fbaipublicfiles.com/transcoder/test_set/transcoder_test_set.zip to generate files like test.cpp_sa-java_sa.cpp_sa.pth with my BPE codes and vocab? (See the sketch after this list.)

  2. What are the contents of the file test.cpp_sa-java_sa.cpp_sa.pth? Also, what is the difference between the files test.cpp_sa-java_sa.cpp_sa.pth and test.cpp_sa-java_sa.java_sa.pth?

  3. I first preprocessed the data in Monolingual mode, learned BPE codes, and then did my MLM training. Then I preprocessed the data in Monolingual Functions mode and learned new BPE codes and vocab. My question is: which vocab did you use to train CodeGen? Also, why are two different sets of BPE codes learned?
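
On question 1, a minimal sketch of just the BPE-application step, using fastBPE's applybpe subcommand (the binary, codes and vocab paths below follow the repository's standard locations and may need adjusting; the .tok file names are placeholders; the later binarization into .pth files is done by the repository's preprocessing/training scripts and is not shown here):

import subprocess

FAST = "codegen_sources/model/tools/fastBPE/fast"   # compiled fastBPE binary
CODES = "data/bpe/cpp-java-python/codes"
VOCAB = "data/bpe/cpp-java-python/vocab"

def apply_bpe(tok_file, bpe_file):
    # fastBPE usage: fast applybpe <output> <input> <codes> [vocab]
    subprocess.run([FAST, "applybpe", bpe_file, tok_file, CODES, VOCAB], check=True)

apply_bpe("path/to/test_set_file.tok", "path/to/test_set_file.bpe")  # placeholder names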

/usr/bin/ld: cannot find -lc++ - distutils.errors.DistutilsExecError: command '/usr/bin/cc'

I am trying to use the pretrained model to translate from C++ to Java, but I have had no luck after installation; I get this error:

`
python -m codegen_sources.model.translate --src_lang cpp --tgt_lang java --model_path models/TransCoder_model_1.pth --beam_size 1 < zcpp_sample.cpp
adding to path <>/workspaces/git_web/facebook_transcoder/CodeGen
INFO - 10/02/21 08:26:08 - 0:00:04 - ============ Model Reloading
INFO - 10/02/21 08:26:08 - 0:00:04 - Reloading encoder from models/TransCoder_model_1.pth ...
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 10/02/21 08:26:10 - 0:00:07 - Reloading decoders from models/TransCoder_model_1.pth ...
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (encoder): 143239641
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (decoders): 168442329
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of decoders: 1

Input cpp function:
#include
using namespace std;

int main()
{
cout<<"My First Program. Helllllllo."<<endl;

return 0;

}
/usr/bin/ld: cannot find -lc++
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 206, in link
self.spawn(linker + ld_args)
File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 910, in spawn
spawn(cmd, dry_run=self.dry_run)
File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/spawn.py", line 91, in spawn
raise DistutilsExecError(
distutils.errors.DistutilsExecError: command '/usr/bin/cc' failed with exit code 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 254, in
output = translator.translate(
File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 139, in translate
src_lang_processor = LangProcessor.processors[lang1](
File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/cpp_processor.py", line 25, in init
super().init(
File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 40, in init
self.create_treesiter_parser()
File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 48, in create_treesiter_parser
Language.build_library(
File "<>/anaconda3/envs/pyt/lib/python3.9/site-packages/tree_sitter/init.py", line 72, in build_library
compiler.link_shared_object(object_paths, output_path)
File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 713, in link_shared_object
self.link(CCompiler.SHARED_OBJECT, objects,
File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 208, in link
raise LinkError(msg)
distutils.errors.LinkError: command '/usr/bin/cc' failed with exit code 1
`

I ran the following command in my environment:

cc --version
cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

How are the two checkpoints selected in TransCoder?

For the TransCoder model, two separate checkpoints are provided for the following language directions.

  • TransCoder_model_1 for C++ -> Java, Java -> C++ and Java -> Python
  • TransCoder_model_2 for C++ -> Python, Python -> C++ and Python -> Java

Did you train TransCoder separately for the two direction settings above? Or were the checkpoints selected based on validation performance for those directions?

TypeError when testing DOBF

Hi,

I am new to this project and just playing around with your pretrained models. I am able to obfuscate and deobfuscate a one-liner only after commenting out https://github.com/facebookresearch/CodeGen/blob/main/codegen_sources/preprocessing/obfuscation/bobskater_obfuscator.py#L427-L436. I wonder if I am missing some flag, not listed in the help manual, that turns off this code.

command: python -m codegen_sources.model.deobfuscate --lang python --model_path ~/xcellent-ml/trained_models/dobf.pth --beam_size 1 < ~/xcellent-ml/dataset_python/doall.py
The error message is:
adding to path /home/zujunt/xcellent-ml/CodeGen
INFO - 12/02/21 15:36:29 - 0:00:02 - ============ Model Reloading
INFO - 12/02/21 15:36:29 - 0:00:02 - Reloading encoder from /home/zujunt/xcellent-ml/trained_models/dobf.pth ...
WARNING - 12/02/21 15:36:33 - 0:00:06 - Lang java_dictionary matched to pretrained java_dictionary lang embedding.
WARNING - 12/02/21 15:36:33 - 0:00:06 - Lang java_obfuscated matched to pretrained java_obfuscated lang embedding.
WARNING - 12/02/21 15:36:33 - 0:00:06 - Lang python_dictionary matched to pretrained python_dictionary lang embedding.
WARNING - 12/02/21 15:36:33 - 0:00:06 - Lang python_obfuscated matched to pretrained python_obfuscated lang embedding.
INFO - 12/02/21 15:36:35 - 0:00:07 - Reloading decoders from /home/zujunt/xcellent-ml/trained_models/dobf.pth ...
WARNING - 12/02/21 15:36:35 - 0:00:08 - Lang java_dictionary matched to pretrained java_dictionary lang embedding.
WARNING - 12/02/21 15:36:35 - 0:00:08 - Lang java_obfuscated matched to pretrained java_obfuscated lang embedding.
WARNING - 12/02/21 15:36:35 - 0:00:08 - Lang python_dictionary matched to pretrained python_dictionary lang embedding.
WARNING - 12/02/21 15:36:35 - 0:00:08 - Lang python_obfuscated matched to pretrained python_obfuscated lang embedding.
INFO - 12/02/21 15:36:37 - 0:00:10 - Number of parameters (encoder): 125677911
INFO - 12/02/21 15:36:37 - 0:00:10 - Number of parameters (decoders): 97334103
INFO - 12/02/21 15:36:37 - 0:00:10 - Number of decoders: 1

INFO - 12/02/21 15:36:37 - 0:00:10 - Roberta BPE mode use Roberta pretrained codes and vocab /home/zujunt/xcellent-ml/CodeGen/data/bpe/roberta-base-vocab.
Original Code:
fruits = ["apple", "banana", "cherry"]
Traceback (most recent call last):
File "/home/zujunt/python/python-3.7.0/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/model/deobfuscate.py", line 235, in
input, lang=params.lang, beam_size=params.beam_size,
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/model/deobfuscate.py", line 140, in deobfuscate
input = obfuscator(input)[0]
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/preprocessing/lang_processors/python_processor.py", line 195, in obfuscate_code
res, dico = obfuscateString(code, obfuscateNames=True, removeDocstrings=False)
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/preprocessing/obfuscation/bobskater_obfuscator.py", line 457, in obfuscateString
sAst = transformer.visit(sAst)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/ast.py", line 262, in visit
return visitor(node)
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/preprocessing/obfuscation/bobskater_obfuscator.py", line 441, in generic_visit
super().generic_visit(node)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/ast.py", line 317, in generic_visit
value = self.visit(value)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/ast.py", line 262, in visit
return visitor(node)
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/preprocessing/obfuscation/bobskater_obfuscator.py", line 441, in generic_visit
super().generic_visit(node)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/ast.py", line 317, in generic_visit
value = self.visit(value)
File "/home/zujunt/python/python-3.7.0/lib/python3.7/ast.py", line 262, in visit
return visitor(node)
File "/home/zujunt/xcellent-ml/CodeGen/codegen_sources/preprocessing/obfuscation/bobskater_obfuscator.py", line 435, in generic_visit
+ "]"
TypeError: can only concatenate str (not "NoneType") to str

After commenting out the above-mentioned code, I think I got the correct output:
[screenshot of the deobfuscated output]

Thank you for your help!

KeyError: 'dico'

I was trying to run evaluation on the pretrained model, but I keep getting this error:

INFO - 10/06/21 01:07:36 - 0:00:00 - ============ Data summary

INFO - 10/06/21 01:07:36 - 0:00:00 - MEMORY (before build modules) : 13.222007751464844
Traceback (most recent call last):
File "codegen_sources/model/train.py", line 701, in
main(params)
File "codegen_sources/model/train.py", line 563, in main
encoder, decoder = build_model(params, data["dico"])
KeyError: 'dico'

Reports missing after running create_tests.py

While running create_tests.py with the recent changes to the pipeline, numerous test reports were not available in es-consolidated-report/statistics.csv. For example, if I had 2500 examples in Java in a directory such as java_000000000000_sa_tok, I would often see only 900-1200 lines in the CSV file, whereas there should be close to 2500 reports.

On further inspection, the script uses ThreadPoolExecutor. According to the documentation, if no argument is provided for max_workers, ThreadPoolExecutor creates a pool with 5 times the number of available processors. Because test case generation is generally CPU-bound (not IO-bound) and because the default time budget is relatively short, running too many workers at once seems a likely cause of the undesired behavior. Indeed, when I set max_workers to the number of available CPUs, my statistics.csv looks closer to what is expected (i.e. for 2500 examples, I usually get over 2400 lines in the CSV file).
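
A minimal sketch of the fix described above (the function and variable names here are illustrative, not those actually used in create_tests.py): cap the pool at the number of available CPUs instead of relying on the default.

import os
from concurrent.futures import ThreadPoolExecutor

def run_test_generation(inputs, generate_tests_fn):
    # Test generation is CPU-bound, so cap the workers at the CPU count.
    max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(generate_tests_fn, inputs))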

I will shortly submit a PR to close this.

Question regarding Backtranslation

Hi,

I have a basic question to understand why the backtranslation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use backtranslation (BT) on a large collection of monolingual data.

In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder has no notion of cross-language generation at the beginning of BT, it would presumably generate the sequence in the same language (Java input to Java output, instead of Python output). Feeding the generated sequence back in to translate to the original sequence would then not help the model learn translation. So how does back-translation provide the learning bias needed to perform translation?
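
A schematic of the standard online back-translation step being discussed (illustrative only, not the repo's actual training loop; the model object and its methods are hypothetical placeholders):

def backtranslation_step(model, java_batch):
    # 1) Translate Java -> Python with the current model (no gradient).
    #    Early in training this output may largely copy the input.
    synthetic_python = model.generate(java_batch, tgt_lang="python_sa")

    # 2) Train the model to reconstruct the original Java from the (possibly noisy)
    #    synthetic Python. Even imperfect Python gives the Python -> Java direction
    #    a supervised signal, and the two directions improve each other iteratively.
    loss = model.supervised_loss(src=synthetic_python, src_lang="python_sa",
                                 tgt=java_batch, tgt_lang="java_sa")
    loss.backward()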

Recently, I tried to apply BT to our model, PLBART, to teach it to perform translation. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw that it generates exactly the input sequence, even though generation is conditioned on a prefix token for the target language (Python). For example:

# input
static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java] 

# output
[python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }

As you can see above, exactly the same sequence is generated. PLBART is pre-trained via Denoising Autoencoding (DAE), thus it doesn't have any clue about cross-language generation. I am curious, how does TransCoder learn from BT?

If I am not wrong, TransCoder uses a language embedding with each input token (REF). Do you think that can make a difference? Also, can you shed some light on the TransCoder architecture? It seems that TransCoder does not have a typical sequence-to-sequence structure.

Fine-tuning TransCoder

Hi,

We recently proposed a small-scale program translation dataset, called AVATAR. We want to fine-tune TransCoder on the translation task but we didn't find any documentation on that. Can you provide some guidelines on fine-tuning TransCoder?

Running into training issues with valid cpp_sa-java_sa example/ Hypothesis

This is a follow-up to issue #5. I followed the directions there to use the provided BPE codes and vocab and am still having issues. Specifically, I am getting this

outputing hypotheses in ./model_1/transcoder/q3nn5plz4y/hypotheses/hyp0.cpp_sa-java_sa.valid_beam0.txt
compute_comp_acc
Traceback (most recent call last):
File "codegen_sources/model/train.py", line 701, in <module>
main(params)
File "codegen_sources/model/train.py", line 665, in main
scores = evaluator.run_all_evals(trainer)
File "/content/CodeGen/codegen_sources/model/src/evaluation/evaluator.py", line 299, in run_all_evals
spans,
File "/content/CodeGen/codegen_sources/model/src/evaluation/evaluator.py", line 939, in evaluate_mt
roberta_mode=params.roberta_mode,
File "/content/CodeGen/codegen_sources/model/src/evaluation/evaluator.py", line 1051, in compute_comp_acc
roberta_mode,
File "/content/CodeGen/codegen_sources/model/src/utils.py", line 392, in eval_function_output
ids = read_file_lines(id_path)
File "/content/CodeGen/codegen_sources/model/src/utils.py", line 455, in read_file_lines
with open(hyp_path, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './model_1/transcoder/q3nn5plz4y/hypotheses/ids.cpp_sa-java_sa.valid.txt'

after I run the following commands

python -m codegen_sources.preprocessing.preprocess /content/CodeGen/data/test_dataset/ --langs cpp java python --mode=monolingual_functions --local=True --fastbpe_vocab_path=/content/CodeGen/data/bpe/cpp-java-python/vocab --fastbpe_code_path=/content/CodeGen/data/bpe/cpp-java-python/codes --bpe_mode=fast --train_splits=1 --percent_test_valid=20

python codegen_sources/model/train.py --exp_name transcoder --dump_path './model_1' --data_path '/content/CodeGen/data/test_dataset/XLM-syml' --split_data_accross_gpu local --bt_steps 'cpp_sa-java_sa-cpp_sa,java_sa-cpp_sa-java_sa' --ae_steps 'cpp_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 12 --n_layers_decoder 6 --emb_dim 768 --n_heads 12 --lgs 'java_sa-cpp_sa' --max_vocab 64000 --gelu_activation true --roberta_mode False --lgs_mapping 'java_sa:java_obfuscated,cpp_sa:cpp_obfuscated' --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 50000 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_cpp_sa-java_sa_mt_comp_acc'

At this point I am using the provided data everywhere and am following the directions in the README, so I am not sure what the issue is. I am experimenting with different flag values but am struggling to find the problem.

Pandas error while creating online test data

I am following the instructions for TransCoder-ST and got this error:

########## Creating Tests ##########
Running on the remaining 0 among 1 files
/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/submitit/core/core.py:628: UserWarning: Received an empty job array
warnings.warn("Received an empty job array")
0it [00:00, ?it/s]
Traceback (most recent call last):
File "codegen_sources/test_generation/create_tests.py", line 279, in
output_selected_tests_summary(out_folder)
File "codegen_sources/test_generation/create_tests.py", line 189, in output_selected_tests_summary
csv = pd.read_csv(csv_file)
File "/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in init
self._make_engine(self.engine)
File "/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/opt/conda/envs/codeGen_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.cinit
pandas.errors.EmptyDataError: No columns to parse from file

DOBF for cpp

In the DOBF readme, an example is given with the languages cpp, java and python. But I wonder whether the obfuscation code for cpp is implemented or not.

Problem with unrecognized arguments

When I run CodeGen's preprocessing script, it asks for various arguments, and when I pass those arguments it reports them as unrecognized. Below is my command.
%run codegen_sources/preprocessing/preprocess data/test_dataset 20 python obfuscation 8 500 200 400 roberta False data/bpe/cpp-java-python/vocab False data/bpe/cpp-java-python --1

usage: preprocess.py [-h] [--local LOCAL]
[--local_parallelism LOCAL_PARALLELISM]
[--langs LANGS [LANGS ...]]
[--mode {obfuscation,monolingual,monolingual_functions,obfuscation_functions}]
[--train_splits TRAIN_SPLITS] [--job_mem JOB_MEM]
[--tokenization_timeout TOKENIZATION_TIMEOUT]
[--bpe_timeout BPE_TIMEOUT]
[--train_bpe_timeout TRAIN_BPE_TIMEOUT]
[--bpe_mode {fast,roberta}]
[--fastbpe_use_vocab FASTBPE_USE_VOCAB]
[--fastbpe_vocab_path FASTBPE_VOCAB_PATH]
[--keep_comments KEEP_COMMENTS]
[--fastbpe_code_path FASTBPE_CODE_PATH] [--ncodes NCODES]
[--percent_test_valid PERCENT_TEST_VALID]
input_path
preprocess.py: error: unrecognized arguments: 20 python obfuscation 8 500 200 400 roberta False data/bpe/cpp-java-python/vocab False data/bpe/cpp-java-python --1

Could anyone help me with this error? It fails to learn BPE.

INFO - 11/18/21 08:01:43 - 0:01:18 - training bpe on /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb...
Traceback (most recent call last):
File "/home/dina/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dina/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dina/CodeGen/codegen_sources/preprocessing/preprocess.py", line 214, in
preprocess(args)
File "/home/dina/CodeGen/codegen_sources/preprocessing/preprocess.py", line 102, in preprocess
dataset.learn_bpe(ncodes=args.ncodes, executor=cluster_train_bpe)
File "/home/dina/CodeGen/codegen_sources/preprocessing/dataset_modes/dataset_mode.py", line 589, in learn_bpe
self._learn_bpe(ncodes, executor)
File "/home/dina/CodeGen/codegen_sources/preprocessing/dataset_modes/monolingual_functions_mode.py", line 123, in _learn_bpe
job.result()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/core.py", line 263, in result
r = self.results()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/core.py", line 291, in results
raise job_exception # pylint: disable=raising-bad-type
submitit.core.utils.FailedJobError: Job (task=0) failed during processing with trace:

Traceback (most recent call last):
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/dina/.local/lib/python3.8/site-packages/submitit/core/utils.py", line 122, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/dina/CodeGen/codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py", line 53, in learn_bpe_file
assert (
AssertionError: failed to learn bpe on /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb, command: /home/dina/CodeGen/codegen_sources/model/tools/fastBPE/fast learnbpe 50000 /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.tok.shuf.50gb > /home/dina/CodeGen/data/test_dataset/cpp-java-python.sa-cl.codes


You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:

  • /home/dina/CodeGen/data/test_dataset/log/5615_0_log.err
  • /home/dina/CodeGen/data/test_dataset/log/5615_0_log.out

Attempt to download MLM pre-trained model - access denied

wget https://dl.fbaipublicfiles.com/transcoder/pre_trained_models/mlm.pth

--2021-08-20 11:35:20-- https://dl.fbaipublicfiles.com/transcoder/pre_trained_models/mlm.pth
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-08-20 11:35:20 ERROR 403: Forbidden.

pre-processing memory error

If you get a memory allocation error when pre-processing the whole dataset, try opening each json or tok file and processing it line by line with multiprocessing. In my case no memory error occurred and the processing time was similar.
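
A minimal sketch of that idea (file names and the per-line function are placeholders): stream the file line by line and hand the lines to a process pool, so only a small window of the file is ever held in memory.

import gzip
from multiprocessing import Pool

def process_line(line):
    # Placeholder for the real per-line work (e.g. parsing a JSON record or tokenizing).
    return line.strip()

def process_file(in_path, out_path, workers=8):
    open_fn = gzip.open if in_path.endswith(".gz") else open
    with open_fn(in_path, "rt") as fin, open(out_path, "w") as fout, Pool(workers) as pool:
        # imap streams results in order and keeps memory usage roughly constant.
        for result in pool.imap(process_line, fin, chunksize=1000):
            fout.write(result + "\n")

if __name__ == "__main__":
    process_file("python.000.json.gz", "python.000.processed")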

How to extract functions?

I have downloaded the raw source code on my machine, for example, python.000000000000.json.gz.
But when I run the preprocessing pipeline, there is an error:
AssertionError: failed to learn bpe on /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb, command: /home/linjiayi/CodeGen-master/codegen_sources/model/tools/fastBPE/fast learnbpe 50000 /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb > /data2/linjiayi/CodeGen-master/data/python.sa-cl.codes

After I changed the filename to python.000.json.gz, it ran successfully. Actually, I only want to extract functions from the raw source code. So, I have a few questions:

  1. Can data downloaded from BigQuery only be named in python.*[0-4][0-9][0-9].json.gz format?

  2. When I export data from a BigQuery table into Google Storage using wildcards, the file is named python.000000000000.json.gz. How do I save a file name with only three zeros? Did you download the file and rename it?

  3. After I run the preprocessing pipeline, a single python.000.json.gz generates a lot of files.

[screenshot of the generated files]

The `.sa.tok` files are standalone functions and the `.cl.tok` files are class functions. What's in the other files?
  4. Can the preprocessing pipeline extract a description for each function? Which file is the description saved in?

  5. The format of an extracted function is:
    robertglen/flask | def test_explicit_instance_paths ( modules_tmpdir ) : NEW_LINE INDENT with pytest . raises ( ValueError ) as excinfo : NEW_LINE INDENT flask . Flask ( __name__ , instance_path = ' instance ' ) NEW_LINE DEDENT assert ' must ▁ be ▁ absolute ' in str ( excinfo . value ) NEW_LINE app = flask . Flask ( __name__ , instance_path = str ( modules_tmpdir ) ) NEW_LINE assert app . instance_path == str ( modules_tmpdir ) NEW_LINE DEDENT
    There are NEW_LINE INDENT, NEW_LINE DEDENT and NEW_LINE tokens in the code. Do they affect the parsing of the code? (See the sketch below.)
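
On that last question, a small sketch (the detokenize_code method name and the constructor call are assumptions based on the Java tokenization example; check python_processor.py for the actual interface): NEW_LINE, INDENT and DEDENT are layout tokens added by the tokenizer and are meant to be reversible, so they should not prevent the code from being parsed once it is detokenized.

from codegen_sources.preprocessing.lang_processors.python_processor import PythonProcessor

# Assumption: PythonProcessor is constructed like the other processors; adjust if it differs.
python_processor = PythonProcessor(root_folder="<YOUR_TREESITER_FOLDER>")

# A tokenized function with the layout markers discussed above.
tokenized = "def add_one ( x ) : NEW_LINE INDENT return x + 1 NEW_LINE DEDENT"

# Detokenization should rebuild valid Python source from the tokens.
print(python_processor.detokenize_code(tokenized))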
