
human-eval's Introduction

HumanEval: Hand-Written Evaluation Set

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

Installation

Make sure to use Python 3.7 or later:

$ conda create -n codex python=3.7
$ conda activate codex

Check out and install this repository:

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval

Usage

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:

{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}

We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.

Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
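
For reference, here is a minimal sketch of what generate_one_completion might look like when backed by a Hugging Face causal language model. The model name ("gpt2"), the generation settings, and the prompt-stripping step are illustrative assumptions, not part of this repository:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute your own code model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate_one_completion(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The samples.jsonl format expects the completion only, without the prompt;
    # stripping by prompt length works for most tokenizers but may need adjustment.
    return text[len(prompt):]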

To evaluate the samples, run

$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}

This script provides more fine-grained information in a new file ending in <input_path>_results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed".
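
If you want to post-process these results yourself, the file is plain JSON Lines and can be read with the standard json module. A minimal sketch, assuming the per-row fields passed and result described above (the file name comes from the command shown earlier):

import json

# Summarize the per-sample results file written by evaluate_functional_correctness.
passed = total = 0
with open("samples.jsonl_results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        total += 1
        passed += bool(row["passed"])

print(f"{passed}/{total} samples passed")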

As a quick sanity-check, the example samples should yield 0.5 pass@1.

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00,  1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}

Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=<comma-separated-values-here>. For other options, see

$ evaluate_functional_correctness --help

However, we recommend that you use the default values for the rest.
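
For reference, pass@k is estimated per task as 1 - C(n - c, k) / C(n, k), where n is the number of generated samples for that task and c is the number that pass, and the per-task estimates are then averaged; this is why at least k samples per task are required. A numerically stable sketch of the per-task estimate:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k for one task with n samples, c of which pass.
    # Equivalent to 1 - C(n - c, k) / C(n, k), computed without large binomials.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))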

Known Issues

While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.

malloc: can't allocate region

Citation

Please cite using the following bibtex entry:

@article{chen2021codex,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  year={2021},
  eprint={2107.03374},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

human-eval's People

Contributors

heewooj, linyxus, qimingyuan


human-eval's Issues

Finetuning With HumanEval

Hi, thanks for the great work. Most of the models are evaluated in a zero-shot manner. I would really like to know whether the dataset can also be used for fine-tuning.

evaluate_functional_correctness can't run

I created a conda environment with python3.7 using the exact same command in the doc. Then, I used openai's text-davinci-002 to generate a samples.jsonl file with 3 results for each problem.

Calling evaluate_functional_correctness samples.jsonl, I got the error message as below. I also tried to evaluate the example results with evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl, and got the same error.

I wonder how to fix it?

Error message:
Reading samples...
6it [00:00, 7427.93it/s]
Running test suites...
0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/miniconda3/envs/codex/bin/evaluate_functional_correctness", line 33, in
sys.exit(load_entry_point('human-eval', 'console_scripts', 'evaluate_functional_correctness')())
File "/opt/miniconda3/envs/codex/bin/evaluate_functional_correctness", line 25, in importlib_load_entry_point
return next(matches).load()
File "/opt/miniconda3/envs/codex/lib/python3.8/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/opt/miniconda3/envs/codex/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/Users/boyuanchen/Desktop/human-eval/human_eval/evaluate_functional_correctness.py", line 30, in
sys.exit(main())
File "/Users/boyuanchen/Desktop/human-eval/human_eval/evaluate_functional_correctness.py", line 27, in main
fire.Fire(entry_point)
File "/opt/miniconda3/envs/codex/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/miniconda3/envs/codex/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/miniconda3/envs/codex/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/boyuanchen/Desktop/human-eval/human_eval/evaluate_functional_correctness.py", line 22, in entry_point
results = evaluate_functional_correctness(sample_file, k, n_workers, timeout, problem_file)
File "/Users/boyuanchen/Desktop/human-eval/human_eval/evaluation.py", line 75, in evaluate_functional_correctness
result = future.result()
File "/opt/miniconda3/envs/codex/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/opt/miniconda3/envs/codex/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/opt/miniconda3/envs/codex/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/boyuanchen/Desktop/human-eval/human_eval/execution.py", line 77, in check_correctness
p.start()
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/miniconda3/envs/codex/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'check_correctness.<locals>.unsafe_execute'

Why not allow contribution?

I am curious as to why this dataset is not open for contributions to keep it evolving. Yes, "164 hand-written programming problems" is a good start, but more is certainly better, especially since all the problems seem to focus on algorithms. By opening this up for contributions, you crowd-source the problem. Obviously, contributions would have to meet a certain standard to avoid degrading the quality of the dataset, but that is not hard to achieve.

Another concern with allowing contributions is that it might make the dataset harder to reference in papers, but surely a simple versioning scheme can solve this. Authors could then say something like: "We achieved X% accuracy on HumanEval version Y."

Codex Training Data

Is the training data used to train Codex publicly available? If it is, where is it available for download?

pass@k on filtered samples

Hi,

Thank you for the great work!

I have 2 questions about the computation of the pass@k metric after applying filtering on the APPS benchmark.

  1. Will the total array in the code snippet below contain the number of filtered samples that passed the example test cases (from the problem statement), i.e. each number <= N_original_samples (=1000)?

    total = np.array(total)

  2. In cases where the number of filtered samples is less than k (=[1,5]), how do you compute the pass@k metric? For example, when N_filtered_samples = 1 and k = 5, can we assume execution results of 4 failures plus 1 pass/failure (depending on the final unit-test result of that filtered sample)?

Prompt used in APPS

Thank you for the very interesting work!

I have one question about the natural language prompt used in APPS.

Did you directly use the original prompts used in the APPS benchmark? (as coded here in the generate_prompt function).

You mentioned in the paper that 'we append a single input/output example from the task description to the docstring as a formatting hint.' How did you do this precisely? Did you need to construct a new prompt, including a function signature and a docstring with input/output example?

Task 145 makes no sense

The prompt, canonical solution and tests for task 145 are:

def order_by_points(nums):
    """
    Write a function which sorts the given list of integers
    in ascending order according to the sum of their digits.
    Note: if there are several items with similar sum of their digits,
    order them based on their index in original list.

    For example:
    >>> order_by_points([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]
    >>> order_by_points([]) == []
    """
    def digits_sum(n):
        neg = 1
        if n < 0: n, neg = -1 * n, -1 
        n = [int(i) for i in str(n)]
        n[0] = n[0] * neg
        return sum(n)
    
    return sorted(nums, key=digits_sum)


def check(candidate):
    # Check some simple cases
    assert candidate([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]
    assert candidate([1234,423,463,145,2,423,423,53,6,37,3457,3,56,0,46]) == [0, 2, 3, 6, 53, 423, 423, 423, 1234, 145, 37, 46, 56, 463, 3457]
    assert candidate([]) == []
    assert candidate([1, -11, -32, 43, 54, -98, 2, -3]) == [-3, -32, -98, -11, 1, 2, 43, 54]
    assert candidate([1,2,3,4,5,6,7,8,9,10,11]) == [1, 10, 2, 11, 3, 4, 5, 6, 7, 8, 9]
    assert candidate([0,6,6,-76,-21,23,4]) == [-76, -21, 0, 4, 23, 6, 6]

    # Check some edge cases that are easy to work out by hand.
    assert True, "This prints if this assert fails 2 (also good for debugging!)"

This makes no sense for negative inputs. For example, look at the first test:

assert candidate([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]

One reasonable interpretation for "digit sum" for a negative integer would be just the digit sum of the absolute value. Another would be the negative of the digit sum of the absolute value. But neither of those rules seem to be used here. Instead, if we look at the canonical solution, the interpretation seems to be that we should think of the minus sign as applying only to the first digit in the number, so a number like -186 breaks down as (-1, 8, 6) and has a digit sum of 13.

Theoretically it would be possible to infer this rule from the example given in the prompt, but it's still extremely difficult to guess the intended meaning here, even for a human. This may be a deliberate design choice, but I think it's worth flagging here so people can at least be aware of it.
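
As a quick illustration of the canonical interpretation, here is a small check that reuses the digits_sum helper from the canonical solution above:

def digits_sum(n):
    # Same rule as the canonical solution: the minus sign is applied
    # only to the leading digit of the number.
    neg = 1
    if n < 0:
        n, neg = -1 * n, -1
    digits = [int(i) for i in str(n)]
    digits[0] *= neg
    return sum(digits)

assert digits_sum(-186) == -1 + 8 + 6  # i.e. 13, per the interpretation described above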

Why does the phi model output the same result for all samples at temperature 0.8?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_one_completion(prompt: str):
    torch.set_default_device("cuda")
    model = AutoModelForCausalLM.from_pretrained("//phi-1", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("//phi-1", trust_remote_code=True)
    # inputs = tokenizer("'''" + prompt + "'''", return_tensors="pt", return_attention_mask=False)
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    # outputs = model.generate(**inputs, max_length=200, max_new_tokens=430)
    outputs = model.generate(**inputs, max_length=200, temperature=0.8, do_sample=True)
    completion = tokenizer.batch_decode(outputs)[0]
    return completion

This is my generation code. Regardless of the value of num_samples_per_task, it returns the same answer for each question.

Error in tests for HumanEval/163

In the prompt, it is stated that "Knowing that (a) is less than 100." Why are there test cases like assert candidate(11 * 13 * 7) == True in the testing?

Evaluation doesn't work on Windows

After getting a score of 0 every time, I looked at the samples.jsonl_results.jsonl file and the result for each is this: "failed: module 'signal' has no attribute 'setitimer'"

This seems like a Windows/Unix issue.

Error in canonical solution and tests for HumanEval/163

The following solution and tests don't seem to match the description. Can you explain why we are bounding lower and upper to be between 2 and 8?

def generate_integers(a, b):
    """
    Given two positive integers a and b, return the even digits between a
    and b, in ascending order.

    For example:
    generate_integers(2, 8) => [2, 4, 6, 8]
    generate_integers(8, 2) => [2, 4, 6, 8]
    generate_integers(10, 14) => []
    """

    lower = max(2, min(a, b))
    upper = min(8, max(a, b))

    return [i for i in range(lower, upper+1) if i % 2 == 0]


def check(candidate):

    # Check some simple cases
    assert candidate(2, 10) == [2, 4, 6, 8], "Test 1"
    assert candidate(10, 2) == [2, 4, 6, 8], "Test 2"
    assert candidate(132, 2) == [2, 4, 6, 8], "Test 3"
    assert candidate(17, 89) == [], "Test 4"

Entry point error while installing the package

Hi,

I am having trouble installing the package.

Installation Command: pip install git+http://github.com/openai/human-eval#egg=human-eval

Error: For req: human-eval. Invalid script entry point: <ExportEntry evaluate_functional_correctness = human_eval.evaluate_functional_correctness:None []>

Possible Solution: Manually edit the setup.py file and change the entry point to the main function instead of None.

Is it possible to include this change in the new version?

Why use ThreadPoolExecutor given the GIL?

In evaluation.py, the code uses a ThreadPoolExecutor at the top level, and each worker thread then uses the multiprocessing package. Why not use a ProcessPoolExecutor at the top level? Is this a performance consideration?
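
For context, the pattern being asked about looks roughly like this. This is a simplified sketch, not the repository's exact code: a thread pool dispatches tasks, and each worker launches a separate process so the untrusted program can be killed on timeout.

import multiprocessing
from concurrent.futures import ThreadPoolExecutor

def _target(program, result):
    try:
        exec(program, {})  # untrusted code; sandbox before enabling in practice
        result.append("passed")
    except BaseException as e:
        result.append(f"failed: {e}")

def run_in_subprocess(program: str, timeout: float) -> str:
    manager = multiprocessing.Manager()
    result = manager.list()
    p = multiprocessing.Process(target=_target, args=(program, result))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.kill()
    return result[0] if result else "timed out"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(run_in_subprocess, "x = 1 + 1", 3.0) for _ in range(8)]
        print([f.result() for f in futures])

In this sketch the threads spend most of their time blocked on p.join(), so the GIL is not the bottleneck; the actual execution happens in the child processes.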

Evaluation.py failing on KeyError: 'test/0'

I tried running:

evaluate_functional_correctness ./data/example_samples.jsonl

Getting the following error:

File "/Users/brianriviere/projects/human-eval/human_eval/evaluation.py", line 65, in evaluate_functional_correctness
   args = (problems[task_id], completion, timeout, completion_id[task_id])
KeyError: 'test/0'

Is there something I'm not doing correctly?
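
One likely cause: the example samples use task IDs like "test/0", which appear in data/example_problem.jsonl rather than the full HumanEval set, which is why the README's sanity check passes --problem_file=data/example_problem.jsonl explicitly. A quick sketch (standard library only, paths taken from the commands above) to verify that every sampled task_id appears in the problem file being used:

import json

def load_task_ids(path):
    # Collect the task_id field from every non-empty line of a .jsonl file.
    with open(path) as f:
        return {json.loads(line)["task_id"] for line in f if line.strip()}

sample_ids = load_task_ids("data/example_samples.jsonl")
problem_ids = load_task_ids("data/example_problem.jsonl")
print("task_ids missing from the problem file:", sample_ids - problem_ids or "none")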

Question about the generate_one_completion

Hey, I want to ask one question about 'generate_one_completion' in the Usage section of your README. What is it? Is it just any Python function with complete logic?

{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.

Here is a nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

Will this be helpful for people reading the paper?

I was reading your paper and was interested in seeing what the test data looks like.
You provide it in .jsonl format, which is fine if you are doing a careful analysis of the test data.

But for quick readers, parsing the test data takes some effort, and they might not want to do it.

So I parsed the data.jsonl into human-readable Prompt and Canonical Solution sections: https://gist.github.com/jalotra/f4d119e8108fcd44b932103d73140381.

Would it add value to this repo to include this for a quick read?

Bug report for execution.py

This part of execution.py:

            try:
                exec_globals = {}
                with swallow_io():
                    with time_limit(timeout):
# WARNING
# This program exists to execute untrusted model-generated code. Although
# it is highly unlikely that model-generated code will do something overtly
# malicious in response to this test suite, model-generated code may act
# destructively due to a lack of model capability or alignment.
# Users are strongly encouraged to sandbox this evaluation suite so that it 
# does not perform destructive actions on their host or network. For more 
# information on how OpenAI sandboxes its code, see the accompanying paper.
# Once you have read this disclaimer and taken appropriate precautions, 
# uncomment the following line and proceed at your own risk:
#                         exec(check_program, exec_globals)
                result.append("passed")
            except TimeoutException:
                result.append("timed out")
            except BaseException as e:
                result.append(f"failed: {e}")

            # Needed for cleaning up.
            shutil.rmtree = rmtree
            os.rmdir = rmdir
            os.chdir = chdir

should be fixed, since it causes an error due to the indentation: as shipped, the exec call is commented out, leaving the with time_limit(timeout): block without an indented body, which raises an IndentationError.

Problems with installation instructions

After setting up my environment with conda, I ran into problems with

$ git checkout https://github.com/openai/human-eval
$ pip install -e human-eval

First, git checkout should be a git clone.

Then, when I run the pip command, I get

ERROR: human-eval is not a valid editable requirement. It should either be a path to a local project or a VCS URL (beginning with bzr+http, bzr+https, bzr+ssh, bzr+sftp, bzr+ftp, bzr+lp, bzr+file, git+http, git+https, git+ssh, git+git, git+file, hg+file, hg+http, hg+https, hg+ssh, hg+static-http, svn+ssh, svn+http, svn+https, svn+svn, svn+file).

Why is pass@1 = 1.0 when using "evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl"?

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 7047.28it/s]
Running test suites...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 98.99it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 21826.39it/s]
{'pass@1': 1.0}

Error in the prompt of HumanEval/47

def median(l: list):
    """Return median of elements in the list l.
    >>> median([3, 1, 2, 4, 5])
    3
    >>> median([-10, 4, 6, 1000, 10, 20])
    15.0
    """

Maybe the output of median([-10, 4, 6, 1000, 10, 20]) should be 8, not 15.0. Can you please check it?

Here's a check with the statistics library:

>>> import statistics
>>> statistics.median([-10, 4, 6, 1000, 10, 20])
8.0

AttributeError: Can't pickle local object 'check_correctness.<locals>.unsafe_execute'

When I run "evaluate_functional_correctness sample.jsonl --problem_file=problem.jsonl", it has the following problem.

Can u help me? thx
Detail log.


Reading samples...
1it [00:00, 2118.34it/s]
Running test suites...
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/Users/user/opt/anaconda3/envs/py38/bin/evaluate_functional_correctness", line 33, in
sys.exit(load_entry_point('human-eval', 'console_scripts', 'evaluate_functional_correctness')())
File "/Users/user/opt/anaconda3/envs/py38/bin/evaluate_functional_correctness", line 25, in importlib_load_entry_point
return next(matches).load()
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/Users/user/Desktop/Program/LLM/human-eval/human_eval/evaluate_functional_correctness.py", line 28, in
sys.exit(main())
File "/Users/user/Desktop/Program/LLM/human-eval/human_eval/evaluate_functional_correctness.py", line 25, in main
fire.Fire(entry_point)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/user/Desktop/Program/LLM/human-eval/human_eval/evaluate_functional_correctness.py", line 20, in entry_point
results = evaluate_functional_correctness(sample_file, k, n_workers, timeout, problem_file)
File "/Users/user/Desktop/Program/LLM/human-eval/human_eval/evaluation.py", line 75, in evaluate_functional_correctness
result = future.result()
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/user/Desktop/Program/LLM/human-eval/human_eval/execution.py", line 73, in check_correctness
p.start()
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'check_correctness.<locals>.unsafe_execute'

Reproducing raw GPT-Neo (125M and 1.3B) on the HumanEval dataset

Hi, we reproduced the performance of the raw GPT-Neo models (125M and 1.3B) on the HumanEval dataset and found that it was much lower than what is reported in the Codex paper. Do you have any plans to publish the raw GPT-Neo results on HumanEval? In addition, are there any tricks in the reproduction process? Thanks!

Our reproduced results: [image]

The officially reported results: [image]

Looking forward to your reply!

Evaluations timing out

Hello,

I've been messing around with the benchmark for the last couple of days and everything has been running perfectly. All of a sudden the evaluations started to time out and I can't figure out why; I've been trying to track down the issue for the last few days. Has anyone come across something similar?

Better explanation:
After running the benchmark in generation+evaluation mode, the generation works perfectly and all the samples look correct, but every time the threads start to evaluate, they time out, giving terrible results for all models. I'm running the evaluations with:

WizardLM/WizardCoder-33B-V1.1
BitsandBytes 4bit quantization
Temperature 0.2
Top_k 0
Top_p 0.95
Nr_samples 1
Batch size 1
Max_len_generation 1024

Error in canonical solution of HumanEval/95 (check_dict_case)

def check_dict_case(dict: Dict[str, str]) -> bool:
    """
    Given a dictionary, return True if all keys are strings in lower 
    case or all keys are strings in upper case, else return False.
    The function should return False is the given dictionary is empty.
    Examples:
    # >>> check_dict_case({"a":"apple", "b":"banana"})
    # True
    # >>> check_dict_case({"a":"apple", "A":"banana", "B":"banana"})
    # False
    # >>> check_dict_case({"a":"apple", 8:"banana", "a":"apple"})
    # False
    # >>> check_dict_case({"Name":"John", "Age":"36", "City":"Houston"})
    # False
    # >>> check_dict_case({"STATE":"NC", "ZIP":"12345" })
    True
    """
    if len(dict.keys()) == 0:
        return False
    else:
        state = "start"
        for key in dict.keys():

            if isinstance(key, str) == False:
                state = "mixed"
                break
            if state == "start":
                if key.isupper():
                    state = "upper"
                elif key.islower():
                    state = "lower"
                else:
                    break
            elif (state == "upper" and not key.isupper()) or (state == "lower" and not key.islower()):
                    state = "mixed"
                    break
            else:
                break
        return state == "upper" or state == "lower"

In the solution above, the last break statement should instead be a continue.

With the current test,

assert candidate({"p":"pineapple", "A":"banana", "B":"banana"}) == False

the current solution happens to work, because the case switch occurs at the second element and Python preserves the order in which the key/value pairs are iterated.

If the test were changed to

assert candidate({"A":"banana", "B":"banana", "p":"pineapple"}) == False

this solution would have failed the test.
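
A quick way to see this is to run a lightly reformatted copy of the canonical solution above on the reordered test case; with the trailing break, iteration stops after the second upper-case key, the lower-case "p" is never examined, and the function wrongly returns True:

def check_dict_case(d):
    if len(d.keys()) == 0:
        return False
    state = "start"
    for key in d.keys():
        if not isinstance(key, str):
            state = "mixed"
            break
        if state == "start":
            if key.isupper():
                state = "upper"
            elif key.islower():
                state = "lower"
            else:
                break
        elif (state == "upper" and not key.isupper()) or (state == "lower" and not key.islower()):
            state = "mixed"
            break
        else:
            break  # changing this to continue gives the expected False
    return state == "upper" or state == "lower"

print(check_dict_case({"A": "banana", "B": "banana", "p": "pineapple"}))  # True, but should be False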
