The chainofeverything from sert121

Does Adapting Chain-of-Thought Help?

Installation:

$ pip install -r requirements.txt

Evaluation for Codellama and GPT3.5 models:

HumanEval

2.1 Generate: produce the test results with the model of your choice, either the codellama 7b or the 34b Instruct variant:
```
$ python eval_codellama_humaneval.py --model_name codellama/CodeLlama-7b-Instruct-hf --length 100
```
OR if you plan to evaluate GPT3.5
```
$ python eval_gpt_humaneval.py 
```
2.2 Evaluate: to evaluate on more or fewer than 100 problems, change the length on line 57 of human-eval/human_eval/evaluation.py to match the length used for generation.
```
$ evaluate_functional_correctness results/codellama/humaneval_CodeLlama-7b-Instruct-hf_100.jsonl
```
for GPT3.5
```
$ evaluate_functional_correctness results/openai/gpt3.5-turbo_100.jsonl
```
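Because the length in evaluation.py has to match the number of generated samples, it can help to count the entries in a results file before evaluating. The following is a minimal sketch, not part of the repository: the helper name `count_samples` is hypothetical, and it assumes each line of the `.jsonl` file is a JSON object with a `task_id` field, as in the standard human-eval sample format.

```python
import json
from pathlib import Path


def count_samples(jsonl_path):
    """Return (number of records, number of distinct task_ids) in a JSONL file.

    Assumption: every non-empty line is a JSON object with a "task_id" key.
    """
    task_ids = []
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            task_ids.append(record.get("task_id"))
    return len(task_ids), len(set(task_ids))


if __name__ == "__main__":
    # Path taken from the evaluation command above; adjust to your own run.
    total, distinct = count_samples(
        "results/codellama/humaneval_CodeLlama-7b-Instruct-hf_100.jsonl"
    )
    print(f"{total} samples covering {distinct} distinct tasks")
```

If the number of distinct tasks differs from the length configured in evaluation.py, adjust one of the two before running evaluate_functional_correctness.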
2.3 Change prompt: different prompts are provided in eval_codellama_humaneval.py (more details below).

MBPP

3.1 Generate:
```
$ python eval_codellama_mbpp.py --model_name codellama/CodeLlama-7b-Instruct-hf --length 100
```
OR if you plan to evaluate GPT3.5
```
$ python eval_gpt_mbpp.py 
```
3.2 Evaluate: you may need to change the model's output directory on line 254 of eval_mbpp.py to match your choice of model.
```
$ python eval_mbpp.py
```
File info:
1. eval_gpt_mbpp.py: includes several functions, such as wrap_code_template, wrap_code_template_baseline, wrap_with_steps, one_shot_pseudocode, one_shot_steps, and zero_shot_pseudocode, that construct prompts for the different settings (one-shot, with and without steps and pseudocode). It also includes some other variations we tried with GPT.
2. eval_gpt_humaneval.py: includes functions similar to the above, for testing GPT on HumanEval.
3. eval_codellama_humaneval.py: includes similar functions. construct_codellama_prompt, construct_codellama_prompt_v2, construct_codellama_prompt_oneshot_examples, construct_codellama_comment_prompt_one_shot_psuedocode, and construct_codellama_comment_prompt_one_shot create the prompts for the baseline, zero-shot steps/pseudocode, and one-shot steps or pseudocode evaluations.
4. eval_codellama_mbpp.py: contains similar prompt functions, such as construct_codellama_prompt, construct_codellama_pseudo_prompt, construct_codellama_pseudo_prompt_example, and construct_codellama_prompt_steps.
5. humaneval_steps_magicoder.json: the step-by-step examples (solving process) used in the one-shot steps/solving-process prompt for HumanEval. These were generated by Magicoder.
6. mbpp_examples_magicoder_reform_v1.json: the step-by-step examples (solving process) used in the one-shot steps/solving-process prompt for MBPP. These were generated by Magicoder.
7. humaneval_actualpsuedocode_magicoder_reform_v1.json: the collected set of pseudocode generated by Magicoder for the HumanEval dataset. Each pseudocode was generated using the prompt specified in the paper.
8. mbpp_actualpsuedocode_magicoder_reform_v1.json: the collected set of pseudocode generated by Magicoder for the MBPP dataset. Each pseudocode was generated using the prompt specified in the paper.
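To illustrate how the prompt settings listed above differ (baseline, zero-shot steps or pseudocode, and one-shot steps or pseudocode), here is a rough sketch of a single prompt builder. It is not the repository's code: the function name `build_prompt`, the mode names, and the instruction wording are all hypothetical, and the actual templates live in the functions named in the file list.

```python
# Hypothetical instruction texts; the real prompts are defined in the eval_*.py scripts.
INSTRUCTIONS = {
    "baseline": "Write a Python function that solves the task.",
    "steps": "First list the steps needed to solve the task, then write the Python function.",
    "pseudocode": "First write pseudocode for the task, then write the Python function.",
}


def build_prompt(task_description, mode="baseline", example=None):
    """Assemble a prompt string for one task.

    mode    -- "baseline", "steps", or "pseudocode" (hypothetical names)
    example -- optional worked example; passing one makes the prompt one-shot,
               omitting it makes the prompt zero-shot
    """
    if mode not in INSTRUCTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    parts = []
    if example is not None:
        parts.append("Example:\n" + example.strip())
    parts.append("Task:\n" + task_description.strip())
    parts.append(INSTRUCTIONS[mode])
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Zero-shot steps prompt, followed by a one-shot pseudocode prompt.
    print(build_prompt("Return the sum of two integers.", mode="steps"))
    print()
    print(
        build_prompt(
            "Return the sum of two integers.",
            mode="pseudocode",
            example="Task: negate a number\nPseudocode: return 0 - x",
        )
    )
```

In this sketch the one-shot example would come from files such as humaneval_steps_magicoder.json; how those files are structured is not described here, so loading them is left out.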
Misc
- To run certain scripts you need an OpenAI token and a Hugging Face token to call the required APIs. The OpenAI token can be exported with export OPENAI_API_KEY=<OPENAIKEY> (no spaces around the equals sign). The Hugging Face TOKEN can be set in the respective files, using the variable defined at the beginning of each file that uses it (the codellama scripts).
- The above evaluation was performed on an 80GB A100 GPU, with 128GB RAM, along with 24 CPUs. We acknowledge Mila for supporting us with the compute resources.
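For the token setup described under Misc, a script can fail early with a clear message when the exported key is missing. This is a small sketch under assumptions: the helper `read_token` is hypothetical and not part of the repository; only the variable name OPENAI_API_KEY comes from the text above.

```python
import os


def read_token(env_var):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it first, e.g. export {env_var}=<value>"
        )
    return token


if __name__ == "__main__":
    # OPENAI_API_KEY is the variable named in this README.
    api_key = read_token("OPENAI_API_KEY")
    print("OpenAI key loaded, length:", len(api_key))
```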
