MultiPL-E's Issues

Evaluation with a container stops halfway without error message

Hi, thanks for providing the nice multi-lingual evaluation framework.

The generation went smoothly. When it came to evaluation, I used my Mac to run evaluation with a container (following this tutorial). The execution stops here: 53%|█████▎ | 504/955 [00:09<00:08, 53.40it/s]. Is there a good way to tell what went wrong? FYI, below is the anonymized podman debug log:

DEBU[0000] DoRequest Method: GET URI: http://d/v4.8.1/libpod/_ping 
DEBU[0000] DoRequest Method: GET URI: http://d/v4.8.1/libpod/networks/pasta/exists 
DEBU[0000] Loading registries configuration "/etc/containers/registries.conf" 
DEBU[0000] DoRequest Method: POST URI: http://d/v4.8.1/libpod/images/pull 
DEBU[0000] User or group ID mappings not available: open /proc/self/uid_map: no such file or directory 
DEBU[0000] User or group ID mappings not available: open /proc/self/uid_map: no such file or directory 
DEBU[0000] User mount xxx:/xxx options [rw] 
DEBU[0000] DoRequest Method: GET URI: http://d/v4.8.1/libpod/images/multipl-e-eval/json 
DEBU[0000] DoRequest Method: POST URI: http://d/v4.8.1/libpod/containers/create 
DEBU[0000] Enabling signal proxying                     
DEBU[0000] Enabling signal proxying                     
DEBU[0000] DoRequest Method: GET URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d/json 
DEBU[0000] DoRequest Method: POST URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d/attach 
DEBU[0000] Copying standard streams of container "41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d" in non-terminal mode 
DEBU[0000] DoRequest Method: POST URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d/start 
 53%|█████▎    | 504/955 [00:09<00:08, 55.18it/s]DEBU[0017] DoRequest Method: POST URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d/wait 
DEBU[0019] DoRequest Method: POST URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d/shouldrestart 
DEBU[0019] DoRequest Method: DELETE URI: http://d/v4.8.1/libpod/containers/41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d 
DEBU[0019] Container 41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d does not exist: no container with ID or name "41237923cc99a526b678300f112a4f2cb5d2feed497078ed0a8a6086cbea635d" found: no such container

As my Ubuntu server doesn't easily let me run Docker/Podman, as a workaround:
--> I tried running "evaluation w/o a container"; this did finish, but the scores for some languages are 0 because I didn't install certain libraries.
--> Could you provide an installation script for properly preparing the environments, so that evaluation is consistent with and without a container?

Let me know if you need more info. Thanks in advance!

All non-multiline commented prompts currently broken

The issue in #111 is not isolated to R; it affects most, if not all, languages whose prompts use single-line (non-multiline) comments. The problem lies in the regex in the translator: only the first line of the docstring ends up commented (a possible fix is sketched after the example). Example from C++:

#include<assert.h>
#include<bits/stdc++.h>
// Given a string s, count the number of uppercase vowels in even indices.
For example:
count_upper('aBCdEf') returns 1
count_upper('abcdefg') returns 0
count_upper('dBBE') returns 0
long count_upper(std::string s) {
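
For reference, here is a minimal sketch of one way the translator could prefix every docstring line with the language's single-line comment token instead of only the first one. The helper name and default token are hypothetical, not the actual MultiPL-E code:

# Hypothetical helper: prefix every docstring line with the single-line
# comment token, so multi-line prompts stay commented in languages like C++.
def comment_docstring(docstring: str, line_comment: str = "//") -> str:
    return "\n".join(
        f"{line_comment} {line}" if line.strip() else line_comment
        for line in docstring.splitlines()
    )

print(comment_docstring(
    "Given a string s, count the number of uppercase vowels in even indices.\n"
    "For example:\n"
    "count_upper('aBCdEf') returns 1"
))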

Quantized model is not supported - Calling cuda() is not supported for 4-bit or 8-bit quantized models

config.json: 100% 1.10k/1.10k [00:00<00:00, 4.08MB/s]
low_cpu_mem_usage was None, now set to True since model is quantized.
model.safetensors: 100% 4.13G/4.13G [04:15<00:00, 16.1MB/s]
generation_config.json: 100% 111/111 [00:00<00:00, 590kB/s]
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "/content/MultiPL-E/automodel.py", line 179, in
main()
File "/content/MultiPL-E/automodel.py", line 168, in main
model = Model(
File "/content/MultiPL-E/automodel.py", line 16, in init
).cuda()
File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 454, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2543, in cuda
raise ValueError(
ValueError: Calling cuda() is not supported for 4-bit or 8-bit quantized models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

Workaround:

  1. Remove .cuda() from line 16 of automodel.py in the MultiPL-E folder.
  2. Add "load_in_4bit": True to model_kwargs on line 164 of automodel.py (see the sketch below).

Fix some easy paths to wrong evaluation results

As a newcomer to BigCode, I wanted to use MultiPL-E to confirm pass@k performance for BigCode models before starting experiments. I loosely followed the tutorial and advice from Slack, and ran into a few surprises in the reported pass@k values.

What I ran (with relevant outputs only):

# setup
$ git clone https://github.com/nuprl/MultiPL-E
$ cd MultiPL-E/
$ mkdir tutorial
$ python3 -m inference --model-name inference.bigcode_dedupaltcomments --root-dataset humaneval --lang py     --temperature 0.2 --batch-size 20 --completion-limit 20 --output-dir-prefix tutorial
$ podman run --rm --network none -v ./tutorial:/tutorial:rw multipl-e-eval --dir /tutorial --output-dir /tutorial --recursive
# surprises start here:
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded/
Dataset,Pass@k,Estimate
,10,0.25
,100,1.00
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded
Dataset,Pass@k,Estimate
humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded,1,0.18

Notice that the last two commands differ only in the "/" at the end.
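
One hedged guess at a fix: derive the dataset name with pathlib, so a trailing "/" cannot produce an empty name. This is only a sketch; single_experiment_pass_k.py may parse the path differently:

from pathlib import Path

def dataset_name(results_dir: str) -> str:
    # Path() normalizes away a trailing slash, so both spellings yield the same name.
    return Path(results_dir).name

assert dataset_name("tutorial/humaneval-py-example-0.2-reworded/") == \
       dataset_name("tutorial/humaneval-py-example-0.2-reworded")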

Perl Unit test comparing float values

Example test: HumanEval_71_triangle_area

sub testhumaneval {
    my $candidate = \&triangle_area;
        if(eq_deeply($candidate->(3, 4, 5),6.0)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(1, 2, 10),-1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(4, 8, 5),8.18)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2, 2, 2),1.73)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(1, 2, 3),-1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(10, 5, 7),16.25)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2, 6, 3),-1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(1, 1, 1),0.43)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2, 2, 10),-1)) {
        print "ok!" }else{
        exit 1;
        }
}

testhumaneval();

An output value of 6.00 would not pass the first test, eq_deeply(6.00, 6.0). It seems the same unit-test library has an is_deeply_float function that may be useful here (link).

Could I get all statistics?

I want to get all the statistics from this wonderful research, or at least those for the HumanEval dataset.

Could you point me to the material or a link to follow? It seems that the paper does not report the exact values behind the figure.

Bad filenames when using gz

I get these filenames:

mbpp_123_amicable_numbers_sum.json.gz

mbpp_123_amicable_numbers_sum.json.results.json.gz

I think the latter is not intended.

Turn translator into a library

For ongoing research, we need to turn the translators (dataset_builder) into a library. It will cause conflicts, so let's merge #79 before we do that.

Task HumanEval/092 has contradictory tests in Rust

The Rust version of HumanEval/092 contains the following lines:

assert_eq!(candidate(3.0, 4.0, 7.0), true);
assert_eq!(candidate(3.0, 4.0, 7.0), false);

(I think this is row 67 of the Hugging Face dataset for MultiPL-E, but I haven't checked.)

This obviously makes the tests unsatisfiable. It seems like this was a type-casting issue when translating from Python; the original tests read:

assert candidate(3,4,7)==True, "This prints if this assert fails 9 (also good for debugging!)"
assert candidate(3.0,4,7)==False, "This prints if this assert fails 10 (also good for debugging!)"

Small issues with Swift prompt signatures

Issue 1

The indentation for extension is funky here:

    def add_protocol_conformance(self, t: str, p: str, body: str):
        conf_str: str = f"""
extension {t}: {p} {body}
        """
        self.protocol_conformances.add(conf_str)

Instead, this should fix it:

    def add_protocol_conformance(self, t: str, p: str, body: str):
        conf_str: str = f"""
extension {t}: {p} {body}
"""
        self.protocol_conformances.add(conf_str)

The resulting extension and function will then be nicely aligned at the base indentation level.

Issue 2

Not sure if this was intentional, but other languages certainly import some default libraries in their prompts, so maybe these are needed here:

import Swift
import Foundation

Quite a few functions use pow, ceil, round, floor, or sqrt, and these can all be found in the libraries above.

load_dataset() doesn't work without specifying dataset_revision?

Hello!

Loading any section of the dataset like this:

from datasets import load_dataset
d = load_dataset('nuprl/MultiPL-E', 'humaneval-lua', download_mode='force_redownload')

results in an ExpectedMoreDownloadedFiles error, but using the revision number from your completions.py works:

d = load_dataset('nuprl/MultiPL-E', 'humaneval-lua', download_mode='force_redownload', revision = "bf4f3c31a1e0a164b7886c9eb04f82534edf4ce9")

Is this intended?

Thanks a lot in advance!

Error evaluating TS/Java

Hi,

I'm using MultiPL-E to evaluate models on different languages (the latest version, and it passed make test). But I found that the scores on Java and TypeScript are not quite aligned with the trend on the other languages. I checked some outputs and noticed that a lot of the errors are SyntaxError, even though the code is logically correct.
(1) Is it possible that something is wrong with the environment? For example, import org.javatuples.*; is very suspicious, since the code often runs fine after removing this import.
(2) Besides status, could we also log the complete error output from the compiler/executor, so we can get a better idea of which errors happened?

Thank you!
Rui

Example of Java (HumanEval_28_concatenate):

    {
      "program": "import java.util.*;\nimport java.lang.reflect.*;\nimport org.javatuples.*;\nimport java.security.*;\nimport java.math.*;\nimport java.io.*;\nimport java.util.stream.*;\nclass Problem {\n    // Concatenate array list of strings into a single string\n    // >>> concatenate((new ArrayList<String>(Arrays.asList())))\n    // (\"\")\n    // >>> concatenate((new ArrayList<String>(Arrays.asList((String)\"a\", (String)\"b\", (String)\"c\"))))\n    // (\"abc\")\n    public static String concatenate(ArrayList<String> strings) {\n        StringBuilder sb = new StringBuilder();\n        for (String s : strings) {\n            sb.append(s);\n        }\n        return sb.toString();\n    }\n    public static void main(String[] args) {\n    assert(concatenate((new ArrayList<String>(Arrays.asList()))).equals((\"\")));\n    assert(concatenate((new ArrayList<String>(Arrays.asList((String)\"x\", (String)\"y\", (String)\"z\")))).equals((\"xyz\")));\n    assert(concatenate((new ArrayList<String>(Arrays.asList((String)\"x\", (String)\"y\", (String)\"z\", (String)\"w\", (String)\"k\")))).equals((\"xyzwk\")));\n    }\n\n}\n",
      "timestamp": 1690088869,
      "stdout": "",
      "stderr": "",
      "exit_code": -1,
      "status": "SyntaxError"
    },

Example of TS (HumanEval_0_has_close_elements):

    {
      "program": "//Check if in given array of numbers, are any two numbers closer to each other than\n// given threshold.\n// >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n// false\n// >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n// true\nfunction has_close_elements(numbers: number[], threshold: number): boolean {\n  for (let i = 0; i < numbers.length - 1; i++) {\n    for (let j = i + 1; j < numbers.length; j++) {\n      if (Math.abs(numbers[i] - numbers[j]) < threshold) {\n        return true;\n      }\n    }\n  }\n  return false;\n}\n\ndeclare var require: any;\nconst assert = require('node:assert');\n\n\nfunction test() {\n  let candidate = has_close_elements;\n  assert.deepEqual(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3),true);\n  assert.deepEqual(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05),false);\n  assert.deepEqual(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95),true);\n  assert.deepEqual(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8),false);\n  assert.deepEqual(candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1),true);\n  assert.deepEqual(candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0),true);\n  assert.deepEqual(candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5),false);\n}\n\ntest();",
      "timestamp": 1690088212,
      "stdout": "",
      "stderr": "",
      "exit_code": -1,
      "status": "SyntaxError"
    },

Stop tokens for Java do not allow completions that produce several top-level methods.

Example Java signature:

import java.util.*;
import java.lang.reflect.*;
import org.javatuples.*;
import java.security.*;
import java.math.*;
import java.io.*;
import java.util.stream.*;
class Problem {
    public static boolean hasCloseElements(ArrayList<Float> numbers, float threshold) {

The current stop sequence for Java is

"\n    }\n"

Semantically, it means to stop once the current function has been generated. However, for HumanEval at least, there are multiple instances where the canonical Python solution contains nested functions. This means that for Java (where nested functions are illegal), the model may need to generate helper functions in addition to the main function.

I found the following stop tokens to work better: they stop when needed, while still allowing the model to generate multiple functions:

"public static void main"
"\n}"

This way, the model stops either when it begins to generate the main method, or at the end of the Problem class.

This, however, may break the philosophy of the testing if only one function is expected to be generated; in that case, the problems with nested functions may simply be much harder for models to solve in Java.

Environment for evaluating C#

Hi there,

I'm trying to reproduce the results on C# (c-sharp, c#, cs). I'm running it on an Ubuntu virtual machine.

  1. I tried the recommended podman way, but I don't think it ran correctly (finished instantly, no score output, no other informative output).
  2. Then I chose to run it without a container, and it says:
  File "/export/home/project/codeai/MultiPL-E/evaluation/src/eval_cs.py", line 24, in eval_script
    build = subprocess.run(["csc", "/d:DEBUG", "-r:System.Numerics.dll", path, f"/out:{binaryname}"], capture_output=True)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'csc'

AFAIK, csc won't work on Linux machines. Can you share more details about how to configure the C# runtime environment? Also, the Toolchains page says Conda is required. Does Conda provide anything for executing C#?

Thank you!
Rui

Better isolation for evaluation

At the moment, we rely on mounting a directory of JSON files into the container for evaluation. I don't think this is safe in general, and certain languages (e.g., Bash) are very likely to produce code that may accidentally delete all results.

I just did a run on Discovery and got dozens of garbage files in my working directory.

I think we should consider moving to a model where src/main.py runs outside a container, but executes a single program in a container.
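
A minimal sketch of that model, reusing the podman flags already shown on this page; the image name and the --file entrypoint flag are assumptions, not the current evaluator's interface:

import subprocess

def run_one_in_container(program_path: str, timeout: int = 30) -> dict:
    # Mount only the single program, read-only; no network; remove the container afterwards.
    cmd = [
        "podman", "run", "--rm", "--network", "none",
        "-v", f"{program_path}:/program:ro",
        "multipl-e-eval",       # assumed image name, as used elsewhere on this page
        "--file", "/program",   # hypothetical entrypoint flag
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}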

Doesn't check for cuda support before attempting to execute on GPU.

Most models default to running on CUDA without a proper check that a CUDA-enabled GPU is present on the system. This causes issues when running tests for actual models on local machines that lack GPU support. I believe we should check for CUDA support before attempting to execute on it: probably use torch to check whether CUDA is available and then use the .to(DEVICE) method instead.
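
A minimal sketch of the suggested check (plain torch, not the exact automodel.py code):

import torch

def pick_device() -> torch.device:
    # Prefer CUDA when a GPU is available, otherwise fall back to CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

DEVICE = pick_device()
# model = model.to(DEVICE)   # instead of an unconditional model.cuda()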

Java program evaluation error with javatuples.Pair class

Hi,

I am evaluating Java completions, and it seems that whenever the completion uses the javatuples.Pair class, the evaluation throws the following error:

Error: Unable to initialize main class Problem
Caused by: java.lang.NoClassDefFoundError: org/javatuples/Pair

I tried two ways of evaluating: a local environment, where I downloaded the javatuples-1.2.jar file and set the CLASSPATH env var in evaluation/src/eval_java.py, and the existing podman container ghcr.io/nuprl/multipl-e-evaluation. Both yield the same result for 9 out of 158 programs in HumanEval. Is this supposed to happen? It seems like an evaluation-script fault.
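
For what it's worth, here is a sketch of passing the classpath explicitly when compiling and running the generated program. The jar path is a placeholder, and eval_java.py's real invocation may differ:

import subprocess

JAVATUPLES_JAR = "/usr/share/java/javatuples-1.2.jar"  # placeholder path

def compile_and_run(java_file: str, classname: str = "Problem"):
    # Put the javatuples jar on the classpath for both compilation and execution;
    # compiled classes are written to the current directory via -d.
    cp = f".:{JAVATUPLES_JAR}"
    subprocess.run(["javac", "-cp", cp, "-d", ".", java_file],
                   check=True, capture_output=True)
    return subprocess.run(["java", "-cp", cp, classname],
                          capture_output=True, text=True)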

code generated with wrong end of string place

I used the Code Llama - Instruct 13B model to generate code for the MultiPL-E TypeScript language.

In some test cases, e.g. HumanEval_109_move_one_ball, the generated code does not end correctly, like this:

console.log(move_one_ball([3, 4, 5, 1, 2]));
console.log(move_one_ball([3, 5, 4, 1,

That causes a syntax error. I don't know whether it comes from the Code Llama model or from MultiPL-E itself.

Can you please double check? Other test cases, like HumanEval_112_reverse_delete, have the same error.

cc @arjunguha

Scala tests comparing optional value

Example: HumanEval_90_next_smallest

Program that failed:

import scala.math._
import scala.collection.mutable._
object Problem {
    def nextSmallest(lst : List[Long]) : Option[Long] = {
        val sortedList = lst.sorted.distinct
        if (sortedList.length < 2) None else Some(sortedList(1))
    }
    def main(args: Array[String]) = {
    // Current test: fails because `2l` is not the same as `Some(2l)`
    assert(nextSmallest((List[Long](5l.toLong, 1l.toLong, 4l.toLong, 3l.toLong, 2l.toLong))).equals(2l));
    // Modified test: comparing optional output with optional value
    assert(nextSmallest((List[Long](1l.toLong, 2l.toLong, 3l.toLong, 4l.toLong, 5l.toLong))).equals(Some(2l)));
    //assert(nextSmallest((List[Long]())).equals(None));
    //assert(nextSmallest((List[Long](1l.toLong, 1l.toLong))).equals(None));
    //assert(nextSmallest((List[Long](1l.toLong, 1l.toLong, 1l.toLong, 1l.toLong, 0l.toLong))).equals(1l));
    //assert(nextSmallest((List[Long](1l.toLong, 1l.toLong))).equals(None));
    //assert(nextSmallest((List[Long](-35l.toLong, 34l.toLong, 12l.toLong, -45l.toLong))).equals(-35l));
    }

}

There was a similar fix a few weeks ago for a different language, but I can't remember which language that was.

C++ test float comparison

Example: HumanEval45_triangle_area

float triangle_area(long a, long b, long c) {
   // C++ program
}
int main() {
    auto candidate = triangle_area;
    assert(candidate((3), (4), (5)) == (6.0));
    assert(candidate((1), (2), (10)) == (float(-1)));
    assert(candidate((4), (8), (5)) == (8.18));
    assert(candidate((2), (2), (2)) == (1.73));
    assert(candidate((1), (2), (3)) == (float(-1)));
    assert(candidate((10), (5), (7)) == (16.25));
    assert(candidate((2), (6), (3)) == (float(-1)));
    assert(candidate((1), (1), (1)) == (0.43));
    assert(candidate((2), (2), (10)) == (float(-1)));
}

When comparing float outputs, the tests often fail (in this case, the 3rd test failed for me), because I think C++ treats a literal like 8.18 as a double, which then does not compare equal to the program's output (a float).

There is at least one other failure point, in HumanEval_4_mean_absolute_deviation, but there could be many more (a possible translator-side tolerance check is sketched after the example).

float mean_absolute_deviation(std::vector<float> numbers) {
    // C++ program
}
int main() {
    auto candidate = mean_absolute_deviation;
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0}))) == (0.5));
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0, (float)3.0, (float)4.0}))) == (1.0));
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0, (float)3.0, (float)4.0, (float)5.0}))) == (1.2));
}
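
One possible translator-side fix, sketched as a standalone helper: emit a tolerance check for floating-point return types instead of ==. The helper name, the tolerance, and the type check are assumptions; the real translator (humaneval_to_cpp.py) may expose this differently:

def cpp_assert(left: str, right: str, return_type: str) -> str:
    # Hypothetical helper: for float/double results, emit a tolerance check
    # (requires <cmath>) so double literals like 8.18 don't spuriously
    # mismatch a float result; keep exact == for everything else.
    if return_type in ("float", "double"):
        return f"    assert(std::fabs({left} - {right}) < 1e-2);"
    return f"    assert({left} == {right});"

print(cpp_assert("candidate((4), (8), (5))", "(8.18)", "float"))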

Perl Unit test when expecting "False/0" output

I have a feeling this may have been debated before, but testing boolean values in Perl may need improvement.

Example: HumanEval_92_any_int

sub any_int {
    my($x, $y, $z) = @_;
    # some perl program that returns 0/1
}
use Test::Deep;


sub testhumaneval {
    my $candidate = \&any_int;
        if(eq_deeply($candidate->(2, 3, 1),1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2.5, 2, 3),"")) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(1.5, 5, 3.5),"")) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2, 6, 2),"")) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(4, 2, 2),1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2.2, 2.2, 2.2),"")) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(-4, 6, 2),1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(2, 1, 1),1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(3, 4, 7),1)) {
        print "ok!" }else{
        exit 1;
        }
        if(eq_deeply($candidate->(3.0, 4, 7),"")) {
        print "ok!" }else{
        exit 1;
        }
}

testhumaneval();

It seems that, at the moment, when the program is expected to output False, the result is compared against "" with eq_deeply. Many Perl generations, though, return 0/1, and the following comparison between 0 and "" evaluates to false:

eq_deeply(0, "") # -> False

Maybe one solution is to use the output of these functions directly as the condition of the if statement for that unit test (only when the output is expected to be boolean):

if($candidate->(2, 3, 1)) {   #expect True
        print "ok!" }else{
        exit 1;
        }
if(!$candidate->(2.5, 2, 3)) {   #expect False
        print "ok!" }else{
        exit 1;
        }

Racket unit test numerical equivalence

Example program: HumanEval_99_closest_integer

This is the current test:

    (check-equal? (candidate "14.5") 15)

which outputs:

--------------------
FAILURE
name:       check-equal?
location:   problem.rkt:27:4
actual:     15.0
expected:   15
--------------------

Here are some alternatives we may consider (source):

    (check = (candidate "14.5") 15)
    (check-= (candidate "14.5") 15 0.01)
    (check-within (candidate "14.5") 15 0.01)

All of them would pass with the same inputs. The second and third versions check equivalence within a small error range.

PHP test indexed array comparison

example problem: HumanEval_26_remove_duplicates

function test(): void {
    print_r(candidate(array(1, 2, 3, 2, 4, 3, 5))); // this gives
//    Array
//(
//    [0] => 1
//    [4] => 4
//    [6] => 5
//)
    if (candidate(array(1, 2, 3, 2, 4, 3, 5)) !== array(1, 4, 5)) { throw new Exception("Test failed!"); }
}

Sometimes the built-in array processing functions in PHP (array_filter, for example) keep the original indices of the array, making this test fail, but I feel this should arguably be considered a pass (again, this may have been debated before, but just throwing it out there).

A possible solution is what was discussed here:

function array_equal($a, $b) {
    return (
         is_array($a) 
         && is_array($b) 
         && count($a) == count($b) 
         && array_diff($a, $b) === array_diff($b, $a)
    );
}

// and then for tests
if (!array_equal(candidate(array(1, 2, 3, 2, 4, 3, 5)),array(1, 4, 5))) {throw new Exception("Test failed!"); };

Warning: Bash performance results artificially low

Failure Case 1

Not sure if this is an expected failure case of the Bash unit tests. Here is an example, HumanEval_45_triangle_area:

#!/bin/bash
#
#
# $1 is an integer
# $2 is an integer
triangle_area() {
    echo "$1 * $2 / 2.0" | bc -l
}


candidate() {
    triangle_area "$@"
}

set -e
run_test() {
    [[ $(candidate "5" "3") = "7.5" ]]
    [[ $(candidate "2" "2") = "2.0" ]]
    [[ $(candidate "10" "8") = "40.0" ]]
}

run_test

If we print the output of $(candidate "5" "3"), it is "7.500000000", which differs from the expected "7.5", so the test fails. Maybe use something like bc to compare the numeric values instead of comparing strings?

Failure Case 2

HumanEval_42_incr_list

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}


candidate() {
    incr_list "$@"
}

set -e
run_test() {
#    [[ $(candidate "") = "" ]]
    echo $(candidate "3 2 1")   # prints -> 4 3 2\n
#    [[ $(candidate "3 2 1") = "4 3 2" ]]
    echo $(candidate "5 2 5 2 3 3 9 0 123")  # prints -> 6 3 6 3 4 4 10 1 124\n
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124" ]]
}

run_test

The first test passes; the second and third tests fail, so I printed out the output of each case.

I tried adding the newline character \n to the end of the expected values, and that didn't work. My lack of Bash knowledge gives me no idea how it might be fixed, but I don't think this should fail.

Java transpiled test failing with optional output

For example, with HumanEval_90_next_smallest, the transpiled Java signature is:

public static Optional<Long> nextSmallest(ArrayList<Long> lst) {

However, the unit test below does not wrap the expected value in Optional.of when the output is non-empty:

    public static void main(String[] args) {
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)1l, (long)2l, (long)3l, (long)4l, (long)5l)))).equals(2l));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)5l, (long)1l, (long)4l, (long)3l, (long)2l)))).equals(2l));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList()))).equals(Optional.empty()));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)1l, (long)1l)))).equals(Optional.empty()));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)1l, (long)1l, (long)1l, (long)1l, (long)0l)))).equals(1l));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)1l, (long)1l)))).equals(Optional.empty()));
    assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)-35l, (long)34l, (long)12l, (long)-45l)))).equals(-35l));
    }

where the first assert should have been:

assert(nextSmallest((new ArrayList<Long>(Arrays.asList((long)1l, (long)2l, (long)3l, (long)4l, (long)5l)))).equals(Optional.of(2l)));

Otherwise, no generated function with output type Optional<Long> can satisfy any of these unit tests. I believe there are (at least) 5 instances of this error:

HumanEval_90_next_smallest
HumanEval_162_string_to_md5
HumanEval_136_largest_smallest_integers
HumanEval_12_longest
HumanEval_128_prod_signs

This looks like it may extend to the C++ transpiler as well, so it is probably best if the author can make some quick corrections here.

Thanks so much!

Strange Scala unit test translation for Tuple output type with extra parentheses

Example: HumanEval_155_even_odd_count

import scala.math._
import scala.collection.mutable._
object Problem {
    def evenOddCount(num : Long) : Tuple2[Long, Long] = {
        // Some program here
    } 
    def main(args: Array[String]) = {
    assert(evenOddCount((7l)).equals(((0l, 1l))));
    assert(evenOddCount((-78l)).equals(((1l, 1l))));
    assert(evenOddCount((3452l)).equals(((2l, 2l))));
    assert(evenOddCount((346211l)).equals(((3l, 3l))));
    assert(evenOddCount((-345821l)).equals(((3l, 3l))));
    assert(evenOddCount((-2l)).equals(((1l, 0l))));
    assert(evenOddCount((-45347l)).equals(((2l, 3l))));
    assert(evenOddCount((0l)).equals(((1l, 0l))));
    }

}

There is an extra level of parentheses around these expected tuple values that doesn't match what the programs output. At first I thought this was mistranslated for all programs with output type Tuple2, but it seems to apply only to this one program.

Performance issue with Julia/R languages

Hello!

First, I would like to thank you for your time and effort invested in developing this tool.
I am writing to report an issue that I have encountered while evaluating Julia and R code on the HumanEval dataset. I have noticed that these two languages are very "expensive" in terms of the resources required to run the test cases.

In particular, it appears that increasing the number of functions to test per problem also increases CPU utilization, as if it launches a new process for each function to test. To avoid this problem, I am running the docker container with the option "--cpus 6". However, this leads to lots of timeouts, significantly impacting the final pass rate.

I also experimented with other languages, such as Lua, but found no specific issues.

Do you have any clue or suggestion about how I can fix this problem?

Thanks in advance!

Run generations of multiple languages at once

Hi, I am wondering whether this framework can run generations for multiple languages at once.
It seems that every time I execute automodel.py, I must pass a single --lang parameter (a simple wrapper loop is sketched below).
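
As far as I can tell, one --lang per run is expected. A simple wrapper can loop over languages; this is just a sketch, with the model name as a placeholder and the flags copied from the commands shown elsewhere on this page:

import subprocess

LANGS = ["py", "js", "java", "cpp"]  # any subset of the supported languages

for lang in LANGS:
    subprocess.run(
        ["python3", "automodel.py",
         "--name", "bigcode/some-model",   # placeholder model name
         "--root-dataset", "humaneval",
         "--lang", lang,
         "--temperature", "0.2",
         "--batch-size", "20",
         "--completion-limit", "20",
         "--output-dir-prefix", f"out-{lang}"],
        check=True,
    )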

R unit tests atomic vector comparison

Hi Arjun and Co.,

I was testing generations for R and realized that when a program outputs a vector that we expect to be atomic (all elements have the same type), the unit tests still compare it to a list type (which allows different element types within). The downside is that when a program outputs c(1, 2) and it is checked against list(1, 2), the unit test says it's incorrect.

An example unit test can be found in even_odd_palindrome:

candidate <- even_odd_palindrome
    if(!identical(candidate(123), list(8, 13))){quit('no', 1)}
    if(!identical(candidate(12), list(4, 6))){quit('no', 1)}
    if(!identical(candidate(3), list(1, 2))){quit('no', 1)}
    if(!identical(candidate(63), list(6, 8))){quit('no', 1)}
    if(!identical(candidate(25), list(5, 6))){quit('no', 1)}
    if(!identical(candidate(19), list(4, 6))){quit('no', 1)}
    if(!identical(candidate(9), list(4, 5))){quit('no', 1)}
    if(!identical(candidate(1), list(0, 1))){quit('no', 1)}
}

where we are clearly expecting an atomic vector of length 2, with both elements of the same type (integers, since they are counts).

There are a couple of other places where this happens, causing perfectly good generations to fail the unit tests. Could y'all help look into how to fix it in the transpiler?

This might be one of the reasons R performance is absurdly low.

Thanks so much!

Here I am updating a list of all programs whose unit tests are affected:

HumanEval_5_intersperse
HumanEval_6_parse_nested_parens
HumanEval_7_filter_by_substring
HumanEval_8_sum_product
HumanEval_9_rolling_max
HumanEval_20_find_closest_elements
HumanEval_21_rescale_to_unit
HumanEval_22_filter_integers
HumanEval_25_factorize
HumanEval_29_filter_by_prefix
HumanEval_30_get_positive
HumanEval_33_sort_third
HumanEval_37_sort_even
HumanEval_58_common
HumanEval_62_derivative
HumanEval_68_pluck
HumanEval_70_strange_sort_list
HumanEval_74_total_match
HumanEval_81_numerical_letter_grade
HumanEval_88_sort_array
HumanEval_96_count_up_to
HumanEval_100_make_a_pile
HumanEval_101_words_string
HumanEval_105_by_length
HumanEval_113_odd_count
HumanEval_117_select_words
HumanEval_120_maximum
HumanEval_123_get_odd_collatz
HumanEval_125_split_words
HumanEval_130_tri
HumanEval_148_bf
HumanEval_152_compare
HumanEval_155_even_odd_count
HumanEval_159_eat
HumanEval_163_generate_integers

Reported pass@k silently wrong for n<k

pass@k = 1 should be evidence that in k generations by this model, at least 1 is very likely to pass the test.

However, the estimator as defined returns 1 even when there are 0 passes among 99 tries if k=100. Nothing in the callers prevents using too small an n; in fact, someone in a hurry is quite likely to use a small n (as I did in the original issue, oops).

Note, in contrast, how huggingface/evaluate deals correctly with the n<k case: if that happens for any result, pass@k for that k is omitted from the dictionary.

Originally posted by @daniel-vainsencher in #31 (comment)
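
For reference, a sketch of the standard unbiased estimator with an explicit guard for n < k, mirroring how huggingface/evaluate simply does not report that k; how this would be wired into single_experiment_pass_k.py is left open:

import numpy as np

def pass_at_k(n: int, c: int, k: int):
    """Unbiased pass@k estimate from n samples with c correct; None if n < k."""
    if n < k:
        return None  # not enough samples to estimate pass@k
    if n - c < k:
        return 1.0   # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=20, c=3, k=10))    # a real estimate
print(pass_at_k(n=20, c=3, k=100))   # None instead of a misleading 1.0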

Left padding leaves stray tokens in the output, causing compile errors

I think a recent update to automodel.py broke the generated code.

Testcase: HumanEval_0_has_close_elements
Model used: Code Llama 13B Instruct
Prompt:

//Check if in given array of numbers, are any two numbers closer to each other than
// given threshold.
// >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
// false
// >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
// true
function has_close_elements(numbers: number[], threshold: number): boolean {

Generated code:

an {
  for (let i = 0; i < numbers.length; i++) {
    for (let j = i + 1; j < numbers.length; j++) {
      if (Math.abs(numbers[i] - numbers[j]) < threshold) {
        return true;
      }
    }
  }
  return false;
}

The code an { is redundant. The whole pipeline worked before (around 1 month ago), but currently every model I test returns redundant code, with around 4-5 padding tokens left at the beginning.

I switched back to automodel.py from commit 334b49c4d3f4b9c4082b7c724b1d6095075cc13b (auto bf16) and it works fine again.

Can you please double check?

Unable to load weights from pytorch checkpoint file

I tried to run !python3 automodel.py --name mhhmm/typescript-instruct-20k --root-dataset humaneval --lang ts --temperature 0.2 --batch-size 20 --completion-limit 20 --output-dir-prefix typescript on Google Colab and it returns this error.

My model: https://huggingface.co/mhhmm/typescript-instruct-20k

OSError: Unable to load weights from pytorch checkpoint file for '/root/.cache/huggingface/hub/models--mhhmm--typescript-instruct-20k/snapshots/9ab8785b81f9e7aca9c614ca0e66e746e3c48224/pytorch_model-00001-of-00006.bin' at '/root/.cache/huggingface/hub/models--mhhmm--typescript-instruct-20k/snapshots/9ab8785b81f9e7aca9c614ca0e66e746e3c48224/pytorch_model-00001-of-00006.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Did I do something wrong? Crying for help ...

R unit test comparison between integer and double

This is a very nuanced difference, but it is actually the cause of at least 10% of the R unit-test issues I have seen in HumanEval:

Example: HumanEval_60_sum_to_n:

sum_to_n <- function(n) {
  return(sum(0:n))

}
test_humaneval <- function() {
candidate <- sum_to_n
    if(!identical(candidate(1), 1)){quit('no', 1)}
    if(!identical(candidate(6), 21)){quit('no', 1)}
    if(!identical(candidate(11), 66)){quit('no', 1)}
    if(!identical(candidate(30), 465)){quit('no', 1)}
    if(!identical(candidate(100), 5050)){quit('no', 1)}
}
test_humaneval()

This does not work because sum returns an integer, and we compare it with a double in the expected output. The identical comparator requires the variable types to be the same; in these cases, the type should not matter.

My suggestion would be to change identical to the == comparator, but only for these single numeric value comparisons. So in this case:

candidate <- sum_to_n
    if(!(candidate(1)== 1)){quit('no', 1)}
    if(!(candidate(6)== 21)){quit('no', 1)}
    if(!(candidate(11)== 66)){quit('no', 1)}
    if(!(candidate(30)== 465)){quit('no', 1)}
    if(!(candidate(100)== 5050)){quit('no', 1)}
}

@mhyee Maybe you have an idea how to fix this in the transpiler code?
CC @arjunguha
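
A hedged sketch of what that could look like on the transpiler side, assuming the R translator has (or could grow) a deep_equality-style hook like the C# one quoted below, plus some way to detect that the expected value is a single number; the helper and flag names here are hypothetical:

def r_assert(candidate_call: str, expected: str, is_scalar_number: bool) -> str:
    # Hypothetical helper: use == (numeric comparison) for scalar numbers so an
    # integer result compares equal to a double literal; keep identical() otherwise.
    if is_scalar_number:
        return f"    if(!({candidate_call} == {expected})){{quit('no', 1)}}"
    return f"    if(!identical({candidate_call}, {expected})){{quit('no', 1)}}"

print(r_assert("candidate(6)", "21", True))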

C# test sequence equality

Issue 1: comparing lists

example program: HumanEval_9_rolling_max

// current comparison, returns False
Debug.Assert((new List<long>(new long[]{(long)1L, (long)2L, (long)3L, (long)4L})).Equals((new List<long>(new long[]{(long)1L, (long)2L, (long)3L, (long)4L}))));
// proposed comparison, returns True
Debug.Assert((new List<long>(new long[]{(long)1L, (long)2L, (long)3L, (long)4L})).SequenceEqual((new List<long>(new long[]{(long)1L, (long)2L, (long)3L, (long)4L}))));

This should be changed to .SequenceEqual, as Equals checks for reference equality, whereas SequenceEqual checks element-wise equality (source).

Issue 2: comparing dictionaries

example program HumanEval_111_histogram

Dictionary<string,long> dic1 = new Dictionary<string,long>(){{"a", 2L}, {"b", 2L}};
Dictionary<string,long> dic2 = new Dictionary<string,long>(){{"a", 2L}, {"b", 2L}};
// current comparison, returns False
Console.WriteLine((dic1).Equals(dic2));
// proposed comparison, returns True
Console.WriteLine(dic1.Count == dic2.Count && !dic1.Except(dic2).Any());

source

Proposed change

Currently, in humaneval_to_cs.py, we have:

    def deep_equality(self, left: str, right: str) -> str:
        """
        All tests are assertions that compare deep equality between left and right.
        Use ==  for primitive types and Equals for objects
        """
        #Empty the union declarations
        self.union_decls = {}
        if self.is_primitive_type(self.translated_return_type):
            return f"    Debug.Assert({left} == {right});"
        else:
            return f"    Debug.Assert({left}.Equals({right}));"

Instead, we can change it to this:

    def deep_equality(self, left: str, right: str) -> str:
        """
        All tests are assertions that compare deep equality between left and right.
        Use ==  for primitive types and Equals for objects
        """
        #Empty the union declarations
        self.union_decls = {}
        if self.is_primitive_type(self.translated_return_type):
            return f"    Debug.Assert({left} == {right});"
        elif self.list_type in self.translated_return_type:
            return f"    Debug.Assert({left}.SequenceEqual({right}));"
        elif self.dict_type in self.translated_return_type:
            return f"    Debug.Assert({left}.Count == {right}.Count && !{left}.Except({right}).Any()));"
        else:
            return f"    Debug.Assert({left}.Equals({right}));"

leetcode dataset not found

I noticed that this repository has already added its own dataset for leetcode, but it is not yet enabled to be one of the "available" datasets.

Command:

python3 automodel.py --name codellama/CodeLlama-7b-Instruct-hf --root-dataset leetcode --lang py --temperature 0.8 --batch-size 5 --completion-limit 20 --output-

Error:

ValueError: BuilderConfig 'leetcode-py' not found. Available: ['humaneval-cpp-keep', 'humaneval-cpp-transform', 'humaneval-cpp', 'humaneval-cpp-remove', 'humaneval-cs-keep', 'humaneval-cs-transform', 'humaneval-cs', 'humaneval-cs-remove', 'humaneval-d-keep', 'humaneval-d-transform', 'humaneval-d', 'humaneval-d-remove', 'humaneval-go-keep', 'humaneval-go-transform', 'humaneval-go', 'humaneval-go-remove', 'humaneval-java-keep', 'humaneval-java-transform', 'humaneval-java', 'humaneval-java-remove', 'humaneval-jl-keep', 'humaneval-jl-transform', 'humaneval-jl', 'humaneval-jl-remove', 'humaneval-js-keep', 'humaneval-js-transform', 'humaneval-js', 'humaneval-js-remove', 'humaneval-lua-keep', 'humaneval-lua-transform', 'humaneval-lua', 'humaneval-lua-remove', 'humaneval-php-keep', 'humaneval-php-transform', 'humaneval-php', 'humaneval-php-remove', 'humaneval-pl-keep', 'humaneval-pl-transform', 'humaneval-pl', 'humaneval-pl-remove', 'humaneval-py-keep', 'humaneval-py-transform', 'humaneval-py', 'humaneval-py-remove', 'humaneval-r-keep', 'humaneval-r-transform', 'humaneval-r', 'humaneval-r-remove', 'humaneval-rb-keep', 'humaneval-rb-transform', 'humaneval-rb', 'humaneval-rb-remove', 'humaneval-rkt-keep', 'humaneval-rkt-transform', 'humaneval-rkt', 'humaneval-rkt-remove', 'humaneval-rs-keep', 'humaneval-rs-transform', 'humaneval-rs', 'humaneval-rs-remove', 'humaneval-scala-keep', 'humaneval-scala-transform', 'humaneval-scala', 'humaneval-scala-remove', 'humaneval-sh-keep', 'humaneval-sh-transform', 'humaneval-sh', 'humaneval-sh-remove', 'humaneval-swift-keep', 'humaneval-swift-transform', 'humaneval-swift', 'humaneval-swift-remove', 'humaneval-ts-keep', 'humaneval-ts-transform', 'humaneval-ts', 'humaneval-ts-remove', 'mbpp-cpp-keep', 'mbpp-cpp', 'mbpp-cs-keep', 'mbpp-cs', 'mbpp-d-keep', 'mbpp-d', 'mbpp-go-keep', 'mbpp-go', 'mbpp-java-keep', 'mbpp-java', 'mbpp-jl-keep', 'mbpp-jl', 'mbpp-js-keep', 'mbpp-js', 'mbpp-lua-keep', 'mbpp-lua', 'mbpp-php-keep', 'mbpp-php', 'mbpp-pl-keep', 'mbpp-pl', 'mbpp-py-keep', 'mbpp-py', 'mbpp-r-keep', 'mbpp-r', 'mbpp-rb-keep', 'mbpp-rb', 'mbpp-rkt-keep', 'mbpp-rkt', 'mbpp-rs-keep', 'mbpp-rs', 'mbpp-scala-keep', 'mbpp-scala', 'mbpp-sh-keep', 'mbpp-sh', 'mbpp-swift-keep', 'mbpp-swift', 'mbpp-ts-keep', 'mbpp-ts']

(Screenshot attachment: MultiPL-E_dataset_error)
