gemini-benchmark's People

Contributors

aashiqmuhamed, cabreraalex, dhgarrette, eltociear, neubig, oootttyyy, snat1505027, sparkier, yuzc19

gemini-benchmark's Issues

Improve README

It'd be good to add to the README:

  • A more comprehensive description of what we did
  • The main results table from the paper
  • Links to all of the Zeno reports

Check results of re-run experiments

The re-run experiments from #31 and #32 should be checked for any issues, and their results compared against previously reported numbers (e.g., in the Gemini paper and the Mixtral blog post). @neubig will do this.

  • Knowledge-based QA
    • Our MMLU COT five-shot in Zeno: gemini-pro=65.22, gpt-3.5-turbo=67.55, gpt-4-turbo=80.48, mixtral=68.81
    • Reports from Chain-of-Thought Hub: gemini-pro=???, gpt-3.5-turbo=67.3, gpt-4=???, mixtral=???
    • Note: Our implementation is based on Chain-of-Thought Hub, a widely used benchmarking setup for LLMs. Our GPT-3.5-Turbo results are comparable to Chain-of-Thought Hub's, so we believe the implementation is correct. However, our scores are slightly lower overall than those reported in the Gemini and Mixtral papers, suggesting that different prompts might raise the scores somewhat across the board.
  • Reasoning
    • Our BBH on Zeno: gemini-pro=71.11, gpt-3.5-turbo=71.02, gpt-4-turbo=83.90, mixtral=60.76
    • Previous paper BBH: gemini-pro=75.0, gpt-3.5-turbo=66.6, gpt-4=83.1, mixtral=NA
    • Note: We have not yet received an example from the Gemini team of how to reproduce their numbers, so we are waiting on that.
  • Mathematics
    • Our GSM8K on Zeno: gemini-pro=71.34/76.42, gpt-3.5-turbo=74.60/78.01, gpt-4=92.95/???, mixtral=71.49/???
    • Previous paper GSM8K: gemini-pro=??? (only CoT@32 reported), gpt-3.5-turbo=57.1, gpt-4=92.0, mixtral=58.4
    • Note: We have successfully reproduced Gemini's results from the paper, but using a different prompt than the Eleuther one (the first number uses the Eleuther prompt, the second the Google prompt). We are re-running with the new prompt from Google and will likely switch to it if it achieves reasonable performance for the other tested models.
  • Code Generation
    • Our HumanEval on Zeno: gemini-pro=60.4, gpt-3.5-turbo=73.8, gpt-4=77.4, mixtral=30.5
    • Previous paper HumanEval: gemini-pro=67.7, gpt-3.5-turbo=48.1, gpt-4=67.0, mixtral=40.5 (ref)
    • We have improved all results by using a temperature of 0 and max_tokens of 512.
    • Note 1: We need to figure out why our Gemini results are lower than those reported by Google. We are currently running a sanity check to see whether querying Vertex AI directly could be the issue.
    • Note 2: Graham's Mixtral results are lower than Zichun's, probably because the model keeps generating an explanation after the code. To fix this, we can add additional "stop" tokens (see the sketch after this list).
  • Machine Translation
    • Waiting on #32
  • Web Navigation Agents
    • Our WebArena on Zeno: gemini-pro=7.12, gpt-3.5-turbo=8.87, gpt-4=13.76, mixtral=1.39
    • Previous paper WebArena: gemini-pro=???, gpt-3.5-turbo=8.75, gpt-4=14.41, mixtral=???
    • Note: Pending web-arena-x/webarena#81, we will make sure that the Zeno results match the paper results.
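
As a concrete reference for the decoding settings and stop tokens mentioned in the Code Generation notes above, here is a minimal sketch of what the litellm-based generation call could look like. This is illustrative only: the model identifiers and stop strings are assumptions, not the exact ones used in our scripts.

```python
import litellm

# Hypothetical sketch of the generation call: deterministic decoding with a
# token cap, plus extra stop strings so Mixtral does not append a natural-
# language explanation after the function body.
def generate_completion(model: str, prompt: str) -> str:
    response = litellm.completion(
        model=model,  # e.g. "gemini-pro", "gpt-3.5-turbo", "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # temperature of 0, as noted above
        max_tokens=512,   # max_tokens of 512, as noted above
        stop=["\ndef ", "\nclass ", "\nif __name__"],  # assumed stop strings for HumanEval-style prompts
    )
    return response.choices[0].message.content
```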

Code Generation Evals should parse code from LM response

Hi! Thanks for this great and extensive benchmarking work. I was looking at Section 6 in the paper and found this graph intriguing, since it states that gpt-3.5-turbo is better than gpt-4-turbo on the ODEX benchmark. It was even more interesting because this happens on slices of the dataset that include very popular libraries: pandas, numpy, etc.

[figure: ODEX benchmark results by library slice, from Section 6 of the paper]

Issue: Code not extracted from LM response

From the Zeno dashboard, it looks like the code from gpt-4-turbo might be correct (in fact, the same as gpt-3.5-turbo's) but failed because there is a natural-language description after the code, causing a syntax error.

[screenshot: Zeno dashboard example showing the gpt-4-turbo output]

Question: Should we be prompting better (e.g., with code delimiters ```) and parsing the code from the response before execution and evaluation?
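
If we go that route, a minimal extraction step (hypothetical, not the repo's current code) could look like this, assuming models are prompted to wrap code in ``` fences:

```python
import re

def extract_code(response: str) -> str:
    """Return the first fenced code block from an LM response.

    Falls back to the raw response when no fence is found, so outputs
    that are already plain code are left untouched.
    """
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

# Example: the trailing natural-language explanation is dropped before execution.
raw = "```python\ndf = pd.DataFrame(data)\n```\nThis creates a DataFrame from the dict."
print(extract_code(raw))  # -> df = pd.DataFrame(data)
```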

Instruction Following Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)
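
For the Zeno-related items in this checklist (and the parallel checklists for the other tasks), a rough sketch of what the dataset and system upload could look like with the zeno_client package is below; the project name, view, column names, and metric are placeholders rather than our actual configuration.

```python
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

# Hypothetical upload sketch; names and columns are placeholders.
client = ZenoClient("YOUR_ZENO_API_KEY")

project = client.create_project(
    name="Instruction Following",  # placeholder project name
    view="text-classification",    # placeholder view
    metrics=[ZenoMetric(name="accuracy", type="mean", columns=["correct"])],
)

# Upload the shared dataset once...
data_df = pd.DataFrame({"id": [0, 1], "instruction": ["...", "..."]})
project.upload_dataset(data_df, id_column="id", data_column="instruction")

# ...then upload each model's outputs as a separate system.
system_df = pd.DataFrame({"id": [0, 1], "output": ["...", "..."], "correct": [1, 0]})
project.upload_system(system_df, name="gpt-3.5-turbo", id_column="id", output_column="output")
```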

Update paper and re-release

Pending #35, we should make a full pass of the paper to ensure consistency, then re-release it on arXiv. @neubig will mark off sections when they have done a final check.

  • Intro
  • Experimental Setup
  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Agent
  • Conclusion

Once each section is done, we should:

  • Do a final read-through for formatting
  • Re-submit to arXiv

Text understanding checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Re-run Mistral Evals w/ official Mistral-Instruct

We used a third-party model instead of the official Mixtral model in our original evaluation: https://twitter.com/arthurmensch/status/1737138144854606314
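
Assuming the generation code goes through litellm, the change should mostly be a matter of pointing at the official instruction-tuned checkpoint served by Together. A hypothetical sketch (the exact model identifier should be checked against Together's catalog):

```python
import litellm

# Hypothetical sketch: swap the third-party DiscoLM checkpoint for the official
# instruction-tuned Mixtral served through Together's API.
MIXTRAL_INSTRUCT = "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1"

response = litellm.completion(
    model=MIXTRAL_INSTRUCT,
    messages=[{"role": "user", "content": "..."}],
    temperature=0,
)
print(response.choices[0].message.content)
```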

We should:

  1. Re-run the Mixtral evals for tasks with the official Mixtral Instruct model
  2. Update the numbers, figures, and discussion in each task section of the paper
  3. Update the Zeno report to match the paper content

Here is a checklist for the tasks; please check off each task when this is done!

  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Instruction Following

Question regarding Mixtral evaluation

Hi folks,

Thanks for the paper and open-sourcing your evaluation code.

The Mixtral evaluation numbers looked a bit strange to me, so I checked which model was used for evaluation as this was not reported in the paper (base/instruction-tuned/etc.). If I understand correctly based on this line, it seems this model was used for evaluation: https://huggingface.co/DiscoResearch/DiscoLM-mixtral-8x7b-v2.

Please note that this is a highly experimental model, published before Mistral's official release and fine-tuned on some random datasets. This is also indicated in the model card:

"This model is still an early Alpha with experimental code and we can't guarantee that all values are correct."

It would be great if you could reevaluate the Mixtral model with the official instruction-tuned version: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1. This model is also available on Together's API: https://docs.together.ai/docs/inference-models.

That way, one should normally obtain evaluation scores which are in line with what has been reported here: https://github.com/open-compass/MixtralKit. For example, they report 67.1 on BigBench-Hard compared to 41.76 in your paper. Note that this is a higher number than Gemini-Pro, which scores 65.58 according to your paper. They also report 65.7 on BigBench-Hard compared to 58.45 in your paper, etc.

Thanks!

Kind regards,

Niels

Mathematical Reasoning Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Code Generation Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Re-run with Gemini's safety settings off

Gemini has safety filters on by default, but this may hurt downstream accuracy. We should try re-running the Gemini evals with the safety settings turned off (reference: BerriAI/litellm#1190).
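
Per the litellm issue linked above, relaxing the filters would roughly mean passing explicit safety settings through the completion call. The sketch below is an assumption about how that kwarg is exposed and should be verified against the linked issue and the Vertex AI documentation.

```python
import litellm

# Hypothetical sketch: relax every Gemini safety category so benign benchmark
# prompts are not blocked by the default filters.
SAFETY_SETTINGS_OFF = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]

response = litellm.completion(
    model="gemini-pro",
    messages=[{"role": "user", "content": "..."}],
    safety_settings=SAFETY_SETTINGS_OFF,  # assumed pass-through kwarg (see the litellm issue above)
)
```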

We should:

  1. Re-run the Gemini evals for tasks with safety settings off (we will name this model gemini-pro and the model with safety settings on gemini-pro-filtered)
  2. Update the numbers, figures, and discussion in each task section of the paper with the new gemini-pro numbers
  3. Where appropriate (maybe in MMLU and Translation?) have a limited discussion of the effect of safety filtering
  4. Update the Zeno report to match the paper content

Here is a checklist for the tasks; please check off each task when this is done!

  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Instruction Following

Common-sense QA checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Machine Translation Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)
