
CodeT5 and CodeT5+

Official research release for CodeT5 and CodeT5+ models for Code Understanding and Generation from Salesforce Research, which are introduced by the following papers:

Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi

In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities:

  • Text-to-code generation: generate code based on a natural language description.
  • Code autocompletion: complete a whole function given the target function name.
  • Code summarization: generate a natural language summary of a function.

CodeT5 demo
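The demo above uses fine-tuned checkpoints, but the pre-trained checkpoint can already fill in masked code spans out of the box. Below is a minimal sketch along the lines of the usage documented for Salesforce/codet5-base on Hugging Face; treat it as illustrative rather than the demo's actual pipeline:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Mask one span with a sentinel token and let the model fill it in.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))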

What's New: 🎉

May 2023

CodeT5+ paper and models are released!🔥
paper | code | model | blog

Sep 2022

Our CodeRL paper has been accepted to NeurIPS 2022!
paper | code | blog

July 2022

We release two large-sized CodeT5 checkpoints at HuggingFace: Salesforce/codet5-large and Salesforce/codet5-large-ntp-py, which are introduced by the CodeRL paper.

Oct 2021

We release fine-tuned checkpoints for all the downstream tasks covered in the paper. In addition, we release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarization.

Sep 2021

CodeT5 paper accepted to EMNLP 2021 and models are released!
paper | code | model | model card | blog

Citation

If you find this code to be useful for your research, please consider citing:

@inproceedings{
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C. H.},
    booktitle={EMNLP},
    year={2021},
}

@inproceedings{
    le2022coderl,
    title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
    author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
    booktitle={NeurIPS},
    year={2022}
}

@article{
    wang2023codet5plus,
    title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
    author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
    journal={arXiv preprint},
    year={2023}
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

  • violence, hate, and division,
  • environmental destruction,
  • abuse of human rights, or
  • the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!


CodeT5 Issues

Recreating the performance from the README's gif?

Hi there, I am trying to recreate the suggestion from the gif in the README. Using the suggested code in the README, I have the following:

from transformers import RobertaTokenizer, T5ForConditionalGeneration
import os

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = """
// convert from one currency to another
"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=256)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

However, this does not generate the suggested code; it only gets as far as "public static".
What am I doing wrong?
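An aside rather than an official answer: the generation in the gif presumably comes from a checkpoint fine-tuned on the Concode text-to-code task, while Salesforce/codet5-base is only pre-trained. A later issue on this page loads the fine-tuned binary over the base architecture; a minimal sketch of that approach, where concode_codet5_base.bin is a placeholder for the downloaded fine-tuned checkpoint:

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Load the fine-tuned text-to-code (Concode) weights over the base architecture.
# "concode_codet5_base.bin" is a placeholder path for the released checkpoint.
model.load_state_dict(torch.load("concode_codet5_base.bin", map_location="cpu"))
model.eval()

text = "convert from one currency to another"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=256, num_beams=5)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))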

unable to regenerate output

Hi,
Thanks for sharing such useful models. I am trying to regenerate the output shown in the gif.
As per the gif, the input is // convert one currency to another.
To generate the output I used the codet5-base model from Hugging Face and replaced the binary file with concode_codet5_base (from the fine-tuned models).
But if you check the output below, it is quite different from what is shown in the gif.
(screenshot of the generated output omitted)

My questions are:

  1. Is my approach the correct way to run the model?
  2. If not, what are the recommended settings to get the correct output for text-to-code generation?

Script for finetuning refine task ?

Hi,
I would like to fine-tune a different dataset on the refine task, but I can't seem to find the script for the refinement task.

It would help me a lot if you could point me to it.

Thanks in advance.

Regarding Code Generation task.

Can I use it for code generation?
For example, if I give a query such as "Add two numbers", it should generate the code for that.
If yes, could you please suggest how I can prepare the dataset for this task, or whether I can use the dataset you mentioned?

Thank you

Hi, I have a small question about the fine-tuning dataset for code summarization

I realize that CodeT5 has already seen the code-comment pairs from CodeSearchNet as its input and output during pre-training, as mentioned in the paper: "Specifically, we regard the NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them."
The model then uses the code-comment pairs from CodeSearchNet again to fine-tune the code summarization task; won't that be a problem, given that the model has already seen the data?
I'm new to DL, so please forgive me if this is a naive question.

Can you release pre-training code of CodeT5?

Hi!
I would like to know how methods like MIP (Masked Identifier Prediction), IT (Identifier Tagging), and BDG (Bimodal Dual Generation) are implemented. I'd appreciate it if you could release the pre-training code of CodeT5.
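Not the released pre-training code, but for intuition, here is a toy sketch of the T5-style masked span prediction setup that these objectives build on; an MIP-like variant would mask identifier tokens instead of random spans:

import random

def mask_random_span(tokens, span_len=3):
    # Replace one random span with a sentinel token; the target reproduces it.
    start = random.randrange(0, max(1, len(tokens) - span_len))
    source = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return " ".join(source), " ".join(target)

source, target = mask_random_span("def add ( a , b ) : return a + b".split())
print(source)  # code with one span replaced by <extra_id_0>
print(target)  # the masked span, wrapped in sentinel tokens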

Multitask Pre-training

Dear authors of CodeT5,
Thanks for contributing such an amazing model to the community.
In the paper, it is said that CodeT5 was pre-trained using multiple tasks.
I'm wondering how these tasks were arranged: did you pre-train on all tasks at once and combine the losses, or did you pre-train on each task one by one?
Thank you very much for your help :)

Kind regards
Michael
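As a generic illustration only (not a statement about what the authors actually did), one common way to combine several pre-training objectives is to sample one task per step and optimize a single shared model:

import random

def multitask_step(model, optimizer, batches_by_task):
    # batches_by_task maps a task name (e.g. "msp", "it", "bdg") to a batch of
    # tensors ready for a seq2seq forward pass. Purely illustrative.
    task = random.choice(list(batches_by_task))
    loss = model(**batches_by_task[task]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return task, loss.item()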

Pre-training dataset

Hi,
thank you for this amazing model!
I was wondering if you could share with us the 8.3M-method dataset used for pre-training.

Thank you very much!
Matteo

Multi-task training scripts?

Hi! In the paper you mention multi-task setup, but I can't find anything related to it in the code.

Have you released it? If not, what are your plans for it?

UPD: What is the command for code generation fine-tuning? run_exp.py only has these tasks 'summarize', 'concode', 'translate', 'refine', 'defect', 'clone'

'tuple' object has no attribute 'loss'

Hi, I want to run CodeT5-base on code generation task. I run the command:
python run_exp.py --model_tag codet5_base --task concode --sub_task none

There is an error: 'tuple' object has no attribute 'loss'.

I tried changing
outputs = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)
to
outputs, _ = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)

but then I get a different error: too many values to unpack (expected 2).

What should I do?
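A hedged note rather than an official fix: different transformers versions return either a plain tuple or a ModelOutput object from the forward pass. Reusing the variables from the snippet above, a version-agnostic way to read the loss is:

outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
# Old versions return a tuple whose first element is the loss; newer ones
# return a ModelOutput with a .loss attribute.
loss = outputs[0] if isinstance(outputs, tuple) else outputs.loss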

The dual-gen pre-trained model

Hi,
I'm fine-tuning the CodeT5 model on my dataset for the code generation task. In the paper, the CodeT5-base model with dual-gen achieved the best performance, but no model with dual-gen has been released. So:
Is the CodeT5-base model released on Hugging Face the model with dual-gen?
If not, could you please release the pre-trained model with dual-gen?

run_exp.py not working.

run_exp.py does not return any errors or results. It seems like it's not running any models at all.

Question for Translation task and Failed to reproduce Translation results

hi,

For the C#-Java translation task, I see that CodeBLEU is not reported in the paper. Could you share the scores, or publish the translation results?
The CodeBLEU score is important for this task.

Thanks

I downloaded the released model and ran inference on the Java-C# translation task. I got the results below, which do not match those in the paper:

cs to java translation

from transformers import RobertaTokenizer, T5ForConditionalGeneration
tokenizer = RobertaTokenizer.from_pretrained(
    'path/to/codet5/cs_java')
model = T5ForConditionalGeneration.from_pretrained(
    'path/to/codet5/cs_java')
model = model.to("cuda")

def predict(samples):
    results = []
    for sample in samples:
        input_ids = tokenizer(sample, return_tensors="pt").input_ids.to("cuda")
        generated_ids = model.generate(input_ids, max_length=510)
        rst = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
        results.append(rst)
    return results

cs = [line.strip() for line in open("./test.java-cs.txt.cs", "r")]
results = predict(cs)

scores

BLEU: 77.79
ngram match: 0.7778875426766637, weighted ngram match: 0.7859241463045725, syntax_match: 0.9075318329182916, dataflow_match: 0.9004485422377274
CodeBLEU score: 0.8429480160343139
EM: 0.649, = 649/1000

java to cs translation

BLEU: 81.57
ngram match: 0.8157761914953569, weighted ngram match: 0.827130874395443, syntax_match: 0.8968348170128586, dataflow_match: 0.9094303577631122
CodeBLEU score: 0.8622930601666927
EM: 0.618, = 618/1000

Missing 'Train.jsonl' in Python Summarization

Hi, thanks for your excellent work.

I am fine-tuning the code summarization task and found that 'train.jsonl' is missing from the Python summarization data.

I note that there is a small difference between your dataset and the data in CodeXGLUE.

Could you please upload your 'train.jsonl'?

Loss in run_gen.py

Hi, I am trying to figure out how the loss is calculated, for example here. I assume it is some distance between generated_ids and target_ids with attention masks, but could you point out what the code and formula look like? Thank you very much!
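For orientation only: when labels are passed to Hugging Face's T5ForConditionalGeneration, the returned loss is token-level cross-entropy between the decoder logits and the labels, with positions marked -100 ignored. A rough equivalent under that convention (not the literal run_gen.py code):

import torch.nn.functional as F

def seq2seq_loss(logits, labels):
    # logits: (batch, target_len, vocab_size); labels: (batch, target_len),
    # with ignored positions (e.g. padding) set to -100.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch*len, vocab)
        labels.view(-1),                   # flatten to (batch*len,)
        ignore_index=-100,
    )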

Reproducing Translation results

Hi,
Thank you for releasing the model and this repository!

I am trying to reproduce the Java->C# translation results from the paper using CodeT5-base.
I ran it according to the instructions,
and in the 15th epoch I got dev results of:

[15] Best bleu+em: 150.38 (bleu: 82.18, em: 68.20)

The model early-stopped itself and evaluated on the test set, and these are the results on the test set:

[best-bleu] bleu-4: 83.83, em: 63.7000, codebleu: 0.0000

However, the results reported in the paper are bleu: 84.03 and EM: 65.90.

The BLEU result is sufficiently close to the reported one, but EM is 2.2% below the paper's number.
Do you have an idea whether the default settings in this repository differ from those used in the paper, or is this just training randomness?

These are the settings from my logs:

03/10/2022 15:22:47 - INFO - __main__ -   Namespace(task='translate', sub_task='java-cs', lang='c_sharp', eval_task='', model_type='codet5', add_lang_ids=False, data_num=-1, start_epoch=0, num_train_epochs=100, patience=5, cache_path='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/cache_data', summary_dir='tensorboard', data_dir='/projects/tir4/users/urialon/CodeT5/data', res_dir='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/prediction', res_fn='results/translate_codet5_base.txt', add_task_prefix=False, save_last_checkpoints=True, always_save_model=True, do_eval_bleu=True, model_name_or_path='Salesforce/codet5-base', output_dir='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100', load_model_path=None, train_filename=None, dev_filename=None, test_filename=None, config_name='', tokenizer_name='Salesforce/codet5-base', max_source_length=320, max_target_length=256, do_train=True, do_eval=True, do_test=True, do_lower_case=False, no_cuda=False, train_batch_size=25, eval_batch_size=25, gradient_accumulation_steps=1, learning_rate=5e-05, beam_size=10, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, save_steps=-1, log_steps=-1, max_steps=-1, eval_steps=-1, train_steps=-1, warmup_steps=1000, local_rank=-1, seed=1234)
03/10/2022 15:22:47 - WARNING - configs -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, cpu count: 32
03/10/2022 15:22:52 - INFO - models -   Finish loading model [223M] from Salesforce/codet5-base
03/10/2022 15:23:14 - INFO - utils -   Read 10300 examples, avg src len: 13, avg trg len: 15, max src len: 136, max trg len: 118
03/10/2022 15:23:14 - INFO - utils -   [TOKENIZE] avg src len: 45, avg trg len: 56, max src len: 391, max trg len: 404
03/10/2022 15:23:14 - INFO - utils -   Load cache data from saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/cache_data/train_all.pt
/home/ualon/.conda/envs/3090/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
03/10/2022 15:23:14 - INFO - __main__ -   ***** Running training *****
03/10/2022 15:23:14 - INFO - __main__ -     Num examples = 10300
03/10/2022 15:23:14 - INFO - __main__ -     Batch size = 25
03/10/2022 15:23:14 - INFO - __main__ -     Batch num = 412
03/10/2022 15:23:14 - INFO - __main__ -     Num epoch = 100

Thanks!
Uri

Minor typo in README

In the introduction it says "fine-tine" where I believe the correct phrase is fine-tune.

Use the python code generation example in the paper

Hi,
I was trying to find the model to fulfill the example in the paper "Generate Python: increment value" -> "def inc_value(x):...".

May I know if there is a released model to do this (no need to fine-tune)? In the readme file the only text-to-code generation model seems to be the concode model, which generates Java code.

If there is not, which model would you recommend fine-tuning to fulfill this task?

Thanks!

Prediction on new data

Can you provide a small example of CodeT5's prediction on the Concode dataset for the concode task? I am not sure how to make predictions on new data or which command-line arguments to use.

tokenizer suggestion

Hi, thanks for sharing your great work!

Following the link from the Hugging Face Transformers documentation, I think it would be better to save the tokenizer with tokenizer.save rather than tokenizer.save_model.

That is,

tokenizer.save_model("./salesforce", "codet5")

change this to tokenizer.save("tokenizer.json")

Then, you can use transformers.PreTrainedTokenizerFast rather than tokenizers.Tokenizer at

tokenizer = ByteLevelBPETokenizer.from_file(
    "./salesforce/codet5-vocab.json",
    "./salesforce/codet5-merges.txt"
)

like this:

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

Maximum sequence length in the clone detection task

Hi,
Thanks for sharing this great work.

I have a question about the maximum sequence length of the inputs of the clone detection task. In the paper, it is mentioned that the maximum source and target sequence lengths are set to be 512 and 256, respectively. However, in the clone detection task script, the src_len and trg_len are both set to 400:

CodeT5/sh/run_exp.py

Lines 60 to 64 in ad787aa

elif task == 'clone':
    # Read 901028 examples, avg src len: 120, avg trg len: 123, max src len: 5270, max trg len: 5270
    # [TOKENIZE] avg src len: 318, avg trg len: 323, max src len: 15111, max trg len: 15111
    src_len = 400
    trg_len = 400

Then these input strings are tokenized, padded, and concatenated in convert_clone_examples_to_features, producing an 800-token input:

CodeT5/_utils.py

Lines 69 to 71 in ad787aa

code1 = tokenizer.encode(source_str, max_length=args.max_source_length, padding='max_length', truncation=True)
code2 = tokenizer.encode(target_str, max_length=args.max_source_length, padding='max_length', truncation=True)
source_ids = code1 + code2

Could you please explain how this works considering the mentioned maximum length limit?

How to use fine-tuned model with my own dataset and task?

Hi,

I fine-tuned a model using my own dataset, task and subtask. Say the task is called "own_task" and the subtask is "c", since it is about C scripts. Now I have models saved as pytorch_model.bin in saved_models/own_task/c/codet5_small/checkpoint-best-ppl/ and checkpoint-last/. Then I try to load the model using

from transformers import T5Config, RobertaTokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("dir_saved_model")

It fails with an error about a missing config.json file. Is this the correct way to load the model? Do I need to generate a config.json for the model manually, or can it be done automatically?

Thank you very much.
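A hedged workaround, not necessarily the intended workflow: since the checkpoint directory only contains the weights, one option is to instantiate the base architecture from the Hub and load the fine-tuned state dict on top of it:

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")
state_dict = torch.load(
    "saved_models/own_task/c/codet5_small/checkpoint-best-ppl/pytorch_model.bin",
    map_location="cpu",
)
model.load_state_dict(state_dict)
model.eval()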

Inference for java code summarization

Is it possible to do code summarization for raw Java code?

I can't find an example of inference for code summarization. Could you please provide one?
E.g., I expect something like the following code:

from transformers import RobertaTokenizer,  WHICH_MODELTO_USE

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = WHICH_MODELTO_USE.from_pretrained('Salesforce/codet5-base')

java_code = 'int i = 0; ++i;  int b = runSomeFunction(i); extract(b);'
code_summarization = model.predict(java_code)
print(code_summarization)

The expected result is the following:
'Extracts and returns max value'

Is it possible to make such a prediction? The problem is that I can't understand how you translate the code into the vector that will be used to predict the summary without the pre-training procedures.

Could you please provide an example?
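A minimal sketch, assuming the multilingual summarization checkpoint Salesforce/codet5-base-multi-sum (released Oct 2021, see What's New above) is the intended model; the exact output depends on the checkpoint:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

java_code = "int i = 0; ++i; int b = runSomeFunction(i); extract(b);"
input_ids = tokenizer(java_code, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=30)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))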

How to do Python to Java translation?

Hello,

Is it possible to translate Python source code to Java source code using CodeT5? The current model seems to perform only Java to C# translation.

Doubts regarding tokenizer.

Can you please tell me what
paths = ['train_code.txt', 'train_doc.txt']

refers to in the train_tokenizer.py file?

Is that the training code data and docstring data in txt format?

Also, how can I run inference using the trained model? The training script only generates model.bin; it does not produce any config, tokenizer, etc.

Thank you

Is it possible to download the model and use it locally from Hugging Face?

Thanks for uploading the latest model for code summarization (https://huggingface.co/Salesforce/codet5-base-multi-sum).
I need to download the model (with wget) and then use it as a local cache.

When I try to load the tokenizer with tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum'), I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
    info = self.model_info(
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/hf_api.py", line 585, in model_info
    r = requests.get(path, headers=headers, timeout=timeout)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/Salesforce/codet5-base-multi-sum (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

I cannot use that approach since I am behind a proxy and can only use curl or wget; every request from inside Python is blocked.

Could you please tell me how I can download and use a cached model and tokenizer?
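A hedged workaround: from_pretrained also accepts a local directory, so the repository files (e.g. config.json, pytorch_model.bin, and the tokenizer files such as vocab.json and merges.txt) can be fetched with wget or curl and loaded offline; the directory path below is a placeholder:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

local_dir = "/path/to/codet5-base-multi-sum"  # directory holding the downloaded files
tokenizer = RobertaTokenizer.from_pretrained(local_dir)
model = T5ForConditionalGeneration.from_pretrained(local_dir)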

Does the released pre-trained model include the dual generation pre-training?

Dear authors,

I noticed in the paper that you pre-train the T5 model with identifier-aware denoising for 100 epochs and further pre-train with bimodal dual generation for 50 epochs. I was wondering whether the released model includes only the first 100 epochs or the whole 150 epochs.

Thanks in advance for your clarification

How to train on multiple GPUs

Can I train models on multiple GPUs? Right now I can train on only one GPU, because the argument is 'the index of gpu', which represents an index rather than a count.

Pretrained model for prediction

Can you kindly elaborate on how we can use the fine-tuned checkpoints for prediction on new data in the concode task?
Say this is my prediction data:
{"code": "public integer sum(Integer arg0,Integer arg1) {return result;}", "nl": "Add two integers. concode_field_sep int sum concode_field_sep int result"}
If I understand correctly, concode is supposed to complete these functions. However, I am not sure how to generate a prediction on this sample data.
I tried replacing the test file containing the original test data with this sample test data and then ran this command:
python run_exp.py --model_tag codet5_small --task concode --sub_task none
This command starts with training, then evaluation, and finally testing. However, I am only interested in prediction. Is there a way to directly generate predictions from the model fine-tuned on concode?
Kindly let me know if I am doing something wrong.

Multi-task models

Dear authors,

First, congrats on the amazing paper!

In the paper, you train models on downstream tasks with a multi-task learning approach. When I looked into the available models, I could not find the ones fine-tuned with multi-task learning. I also could not find the scripts for it, but I saw in issue #7 that you are already planning to release them.

In summary, my question is: are there any plans to release the multi-task models?

Thanks in advance for your clarification!

How to use a fine-tuned Checkpoint?

Hi,

This might seem like a basic question, but I want to be sure I'm on the right track here. I am trying to use the code generation fine-tuned checkpoint, but since the binary file isn't sufficient to load the model in Hugging Face, I downloaded the Hugging Face codet5-base directory and replaced pytorch_model.bin with the binary file corresponding to the concode task. Is this the correct way to go about this?

About AI coding assistant demo

Hi, the newly added AI coding assistant demo is cool! I have a few questions about it:

  1. Did you make the CodeT5 model into a VS Code plugin? How did you do that?

  2. When demonstrating code generation, editing the comment generates the corresponding code snippet. Aren't the inputs to the code generation model a natural language description and the class environment?


The format of the data in the Concode dataset is

{
    "code": "int function ( double [ ] arg0 , double [ ] arg1 ) { int loc0 = arg0 . length - arg1 . length ; outer : for ( int loc1 = 0 ; loc1 <= loc0 ; loc1 ++ ) { for ( int loc2 = 0 ; loc2 < arg1 . length ; loc2 ++ ) { if ( ne ( arg0 [ loc1 + loc2 ] , arg1 [ loc2 ] ) ) { continue outer ; } } return ( loc1 ) ; } return ( - 1 ) ; }",
    "nl": "searches for the first subsequence of a that matches sub elementwise . elements of sub are considered to match elements of a if they pass the #eq test . concode_field_sep double max_ratio concode_elem_sep double min_ratio concode_elem_sep boolean off concode_field_sep boolean isElemMatch concode_elem_sep int compare concode_elem_sep boolean isSubset concode_elem_sep boolean ne concode_elem_sep boolean lt concode_elem_sep boolean gte concode_elem_sep void set_rel_diff concode_elem_sep boolean eq concode_elem_sep boolean lte concode_elem_sep boolean gt"
}

Is it possible to generate an accurate code snippet by typing only comments, without the class environment? Doesn't the loss of context information affect the quality of the generated code?

Resource requirements for Fine-Tuning

Dear sir,

We're currently trying to fine-tune CodeT5 and would like to know the minimum as well as recommended hardware requirements for doing so.

Reading data for code generation.

For the code generation task, should I use the data reading method used for concode:

def read_concode_examples(filename, data_num):
    """Read examples from filename."""
    examples = []

    with open(filename) as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(
                Example(
                    idx=idx,
                    source=x["nl"].strip(),
                    target=x["code"].strip()
                )
            )
            idx += 1
            if idx == data_num:
                break
    return examples

OR
the data reading method used for code summarization (here using docstring_tokens as the source and code_tokens as the target)?

def read_summarize_examples(filename, data_num):
    """Read examples from filename."""
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            js = json.loads(line)
            if 'idx' not in js:
                js['idx'] = idx
            code = ' '.join(js['code_tokens']).replace('\n', ' ')
            code = ' '.join(code.strip().split())
            nl = ' '.join(js['docstring_tokens']).replace('\n', '')
            nl = ' '.join(nl.strip().split())
            examples.append(
                Example(
                    idx=idx,
                    source=nl,
                    target=code,
                )
            )
            if idx + 1 == data_num:
                break
    return examples

Any suggestions?

Also, do I need to change args.max_source_length = 256 and args.max_target_length = 128 for the code generation task?

Which tokenizer to use for customized python summary data?

Hi,
I am fine-tuning the CodeT5 base model. I see in exp_with_args.sh that the RobertaTokenizer is used for the Python summarization task. However, the data you shared here does not look like it was generated by the RobertaTokenizer: the RobertaTokenizer tokenizes a space as Ġ (see e.g. here), but in the data you uploaded there is no such Ġ in code_token or string_token.

Could you comment on this? Thank you very much!

Finetune Task For Code Completion

Hi, is there a fine-tuning task for code completion? Maybe I can implement the Masked Span Prediction task as a fine-tuning task for code generation (GPT-style)?

Unable to match results on code generation

I am unable to get decent performance (close to the test-set performance reported in the paper) on the validation set for code generation using your fine-tuned checkpoint. I am getting a BLEU score of 29.49 and an EM of 12.65. Here is my code. Am I doing something wrong?

from datasets import load_dataset

class Example(object):
    def __init__(self, idx, source, target ):
        self.idx = idx
        self.source = source
        self.target = target

def read_examples(split):
    dataset = load_dataset('code_x_glue_tc_text_to_code')[split]
    examples = []
    for eg in dataset:
        examples.append(Example(idx = eg['id'], source=eg['nl'], target=eg['code']))
    return examples

examples = read_examples('validation')

from transformers import RobertaTokenizer, T5ForConditionalGeneration
import torch
from tqdm import tqdm
import os

os.environ["CUDA_VISIBLE_DEVICES"]="0"

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

model.load_state_dict(torch.load('finetuned_models_concode_codet5_base.bin'))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (device)
model.to(device)

preds = []
for eg in tqdm(examples):
    input_ids = tokenizer(eg.source, return_tensors="pt").input_ids.to(device)
#     print (len(input_ids[0]))
    generated_ids = model.generate(input_ids, max_length=200, num_beams=5)
    preds.append(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    
import sys
import numpy as np
from bleu import _bleu

accs = []
with open("test.output",'w') as f, open("test.gold",'w') as f1:
    for ref,gold in zip(preds,examples[:len(preds)]):
        f.write(ref+'\n')
        f1.write(gold.target+'\n')    
        accs.append(ref.strip().split()==gold.target.split())

print (np.mean(accs), _bleu('test.gold', 'test.output'))

Dataset for fine-tuning on python for Code generation task

Dear Sir,

For the text-to-code generation task, the model is fine-tuned on the Concode Java dataset, but I want to fine-tune the model on a Python dataset. While figuring out how to do this, I came across the following issue: https://github.com/salesforce/CodeT5/issues/36, where it is mentioned that we can fine-tune on the Python subset of CodeSearchNet.

However, the Python subset of CodeSearchNet contains various fields such as repo, path, url, original string, etc., whereas the Concode dataset contains only two fields for each function: code and nl. Can you please guide me on how to create a similar dataset for Python so that I can fine-tune the text-to-code generation task on Python?
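Not an official recipe, but one straightforward way to project CodeSearchNet-style records onto the two-field Concode layout is sketched below; the file names are placeholders and the field names ("docstring", "code") follow the CodeSearchNet release:

import json

def convert(csn_path, out_path):
    # Map each CodeSearchNet record to the {"nl", "code"} layout used by Concode.
    with open(csn_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            js = json.loads(line)
            example = {
                "nl": " ".join(js["docstring"].split()),  # natural-language description
                "code": " ".join(js["code"].split()),     # target Python function
            }
            fout.write(json.dumps(example) + "\n")

convert("python_train_0.jsonl", "concode_style_train.json")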

Make exp_with_args.sh runnable by any user

The user wang.y is hard-coded into sh/exp_with_args.sh. It would be great if one could just clone the repository, install the dependencies, and run exp_with_args.sh.
