gemini-benchmark's People

Contributors

aashiqmuhamed, cabreraalex, dhgarrette, eltociear, neubig, oootttyyy, snat1505027, sparkier, yuzc19

gemini-benchmark's Issues

Improve README

It'd be good to add to the README:

  • A more comprehensive description of what we did
  • The main results table from the paper
  • Links to all of the Zeno reports

Check results of re-run experiments

The re-run experiments from #31 and #32 should be checked for any issues, and their results compared against previously reported numbers (e.g., in the Gemini paper and the Mixtral blog post). @neubig will do this.

  • Knowledge-based QA
    • Our MMLU COT five-shot in Zeno: gemini-pro=65.22, gpt-3.5-turbo=67.55, gpt-4-turbo=80.48, mixtral=68.81
    • Reports from Chain-of-Thought Hub: gemini-pro=???, gpt-3.5-turbo=67.3, gpt-4=???, mixtral=???
    • Note: Our implementation is based on Chain-of-Thought Hub, a widely used benchmarking setup for LLMs. Our GPT-3.5-Turbo results are comparable to Chain-of-Thought Hub's, so we believe the implementation is correct. However, our scores are slightly lower overall than those reported in the Gemini and Mixtral papers, suggesting that different prompts might raise the scores somewhat across the board.
  • Reasoning
    • Our BBH on Zeno: gemini-pro=71.11, gpt-3.5-turbo=71.02, gpt-4-turbo=83.90, mixtral=60.76
    • Previous paper BBH: gemini-pro=75.0, gpt-3.5-turbo=66.6, gpt-4=83.1, mixtral=NA
    • Note: We have not yet received an example from the Gemini team of how to reproduce their numbers, so we are waiting on that.
  • Mathematics
    • Our GSM8K on Zeno: gemini-pro=71.34/76.42, gpt-3.5-turbo=74.60/78.01, gpt-4=92.95/???, mixtral=71.49/???
    • Previous paper GSM8K: gemini-pro=??? (only CoT@32 reported), gpt-3.5-turbo=57.1, gpt-4=92.0, mixtral=58.4
    • Note: We have successfully reproduced Gemini's results from the paper, but using a different prompt than the Eleuther one (the first number uses the Eleuther prompt, the second the Google prompt). We are re-running with the new prompt from Google and will likely switch to it if it achieves reasonable performance for the other tested models.
  • Code Generation
    • Our HumanEval on Zeno: gemini-pro=60.4, gpt-3.5-turbo=73.8, gpt-4=77.4, mixtral=30.5
    • Previous paper HumanEval: gemini-pro=67.7, gpt-3.5-turbo=48.1, gpt-4=67.0, mixtral=40.5 (ref)
    • We have improved all results by using a temperature of 0 and max_tokens of 512.
    • Note 1: We need to figure out why our Gemini results are lower than those reported by Google. We are currently running a sanity check to see whether querying Vertex AI directly could be the issue.
    • Note 2: Graham's Mixtral results are lower than Zichun's, probably because the model keeps generating an explanation after the code. To fix this, we can add additional "stop" tokens (see the sketch after this list).
  • Machine Translation
    • Waiting on #32
  • Web Navigation Agents
    • Our WebArena on Zeno: gemini-pro=7.12, gpt-3.5-turbo=8.87, gpt-4=13.76, mixtral=1.39
    • Previous paper WebArena: gemini-pro=???, gpt-3.5-turbo=8.75, gpt-4=14.41, mixtral=???
    • Note: Pending web-arena-x/webarena#81, we will make sure that the Zeno results match the paper results.
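
As a concrete reference for the decoding settings and stop tokens mentioned in the Code Generation notes above, here is a minimal sketch of what the litellm-based generation call could look like. This is illustrative only: the model identifiers and stop strings are assumptions, not the exact ones used in our scripts.

```python
import litellm

# Hypothetical sketch of the generation call: deterministic decoding with a
# token cap, plus extra stop strings so Mixtral does not append a natural-
# language explanation after the function body.
def generate_completion(model: str, prompt: str) -> str:
    response = litellm.completion(
        model=model,  # e.g. "gemini-pro", "gpt-3.5-turbo", "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # temperature of 0, as noted above
        max_tokens=512,   # max_tokens of 512, as noted above
        stop=["\ndef ", "\nclass ", "\nif __name__"],  # assumed stop strings for HumanEval-style prompts
    )
    return response.choices[0].message.content
```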

Code Generation Evals should parse code from LM response

Hi! Thanks for this great and extensive benchmarking work. I was looking at Section 6 in the paper and found this graph intriguing, since it states that gpt-3.5-turbo is better than gpt-4-turbo on the ODEX benchmark. It was even more interesting because this happens on slices of the dataset that include very popular libraries: pandas, numpy, etc.

[figure: ODEX benchmark results by library slice, from Section 6 of the paper]

Issue: Code not extracted from LM response

From the Zeno dashboard, it looks like the code from gpt-4-turbo might be correct (in fact, the same as gpt-3.5-turbo's) but failed because there is a natural-language description after the code, causing a syntax error.

[screenshot: Zeno dashboard example showing the gpt-4-turbo output]

Question: Should we be prompting better (e.g., with code delimiters ```) and parsing the code from the response before execution and evaluation?
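
If we go that route, a minimal extraction step (hypothetical, not the repo's current code) could look like this, assuming models are prompted to wrap code in ``` fences:

```python
import re

def extract_code(response: str) -> str:
    """Return the first fenced code block from an LM response.

    Falls back to the raw response when no fence is found, so outputs
    that are already plain code are left untouched.
    """
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

# Example: the trailing natural-language explanation is dropped before execution.
raw = "```python\ndf = pd.DataFrame(data)\n```\nThis creates a DataFrame from the dict."
print(extract_code(raw))  # -> df = pd.DataFrame(data)
```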

Instruction Following Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)
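
For the Zeno-related items in this checklist (and the parallel checklists for the other tasks), a rough sketch of what the dataset and system upload could look like with the zeno_client package is below; the project name, view, column names, and metric are placeholders rather than our actual configuration.

```python
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

# Hypothetical upload sketch; names and columns are placeholders.
client = ZenoClient("YOUR_ZENO_API_KEY")

project = client.create_project(
    name="Instruction Following",  # placeholder project name
    view="text-classification",    # placeholder view
    metrics=[ZenoMetric(name="accuracy", type="mean", columns=["correct"])],
)

# Upload the shared dataset once...
data_df = pd.DataFrame({"id": [0, 1], "instruction": ["...", "..."]})
project.upload_dataset(data_df, id_column="id", data_column="instruction")

# ...then upload each model's outputs as a separate system.
system_df = pd.DataFrame({"id": [0, 1], "output": ["...", "..."], "correct": [1, 0]})
project.upload_system(system_df, name="gpt-3.5-turbo", id_column="id", output_column="output")
```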

Update paper and re-release

Pending #35, we should make a full pass of the paper to ensure consistency, then re-release it on arXiv. @neubig will mark off sections when they have done a final check.

  • Intro
  • Experimental Setup
  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Agent
  • Conclusion

Once each section is done, we should:

  • Do a final read-through for formatting
  • Re-submit to arXiv

Text understanding checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Re-run Mistral Evals w/ official Mistral-Instruct

We used a third-party model instead of the official Mixtral model in our original evaluation: https://twitter.com/arthurmensch/status/1737138144854606314
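
Assuming the generation code goes through litellm, the change should mostly be a matter of pointing at the official instruction-tuned checkpoint served by Together. A hypothetical sketch (the exact model identifier should be checked against Together's catalog):

```python
import litellm

# Hypothetical sketch: swap the third-party DiscoLM checkpoint for the official
# instruction-tuned Mixtral served through Together's API.
MIXTRAL_INSTRUCT = "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1"

response = litellm.completion(
    model=MIXTRAL_INSTRUCT,
    messages=[{"role": "user", "content": "..."}],
    temperature=0,
)
print(response.choices[0].message.content)
```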

We should:

  1. Re-run the Mixtral evals for tasks with the official Mixtral Instruct model
  2. Update the numbers, figures, and discussion in each task section of the paper
  3. Update the Zeno report to match the paper content

Here is a checklist for the tasks; please check off each task when this is done!

  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Instruction Following

Question regarding Mixtral evaluation

Hi folks,

Thanks for the paper and open-sourcing your evaluation code.

The Mixtral evaluation numbers looked a bit strange to me, so I checked which model was used for evaluation as this was not reported in the paper (base/instruction-tuned/etc.). If I understand correctly based on this line, it seems this model was used for evaluation: https://huggingface.co/DiscoResearch/DiscoLM-mixtral-8x7b-v2.

Please note that this is a highly experimental model, published before Mistral's official release and fine-tuned on some random datasets. This is also indicated in the model card:

"This model is still an early Alpha with experimental code and we can't guarantee that all values are correct."

It would be great if you could reevaluate the Mixtral model with the official instruction-tuned version: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1. This model is also available on Together's API: https://docs.together.ai/docs/inference-models.

That way, one should normally obtain evaluation scores which are in line with what has been reported here: https://github.com/open-compass/MixtralKit. For example, they report 67.1 on BigBench-Hard compared to 41.76 in your paper. Note that this is a higher number than Gemini-Pro, which scores 65.58 according to your paper. They also report 65.7 on BigBench-Hard compared to 58.45 in your paper, etc.

Thanks!

Kind regards,

Niels

Mathematical Reasoning Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Code Generation Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Re-run with Gemini's safety settings off

Gemini has safety filters on by default, but this may hurt downstream accuracy. We should try re-running the Gemini evals with the safety settings turned off (reference: BerriAI/litellm#1190).
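
Per the litellm issue linked above, relaxing the filters would roughly mean passing explicit safety settings through the completion call. The sketch below is an assumption about how that kwarg is exposed and should be verified against the linked issue and the Vertex AI documentation.

```python
import litellm

# Hypothetical sketch: relax every Gemini safety category so benign benchmark
# prompts are not blocked by the default filters.
SAFETY_SETTINGS_OFF = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]

response = litellm.completion(
    model="gemini-pro",
    messages=[{"role": "user", "content": "..."}],
    safety_settings=SAFETY_SETTINGS_OFF,  # assumed pass-through kwarg (see the litellm issue above)
)
```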

We should:

  1. Re-run the Gemini evals for tasks with safety settings off (we will name this model gemini-pro and the model with safety settings on gemini-pro-filtered)
  2. Update the numbers, figures, and discussion in each task section of the paper with the new gemini-pro numbers
  3. Where appropriate (maybe in MMLU and Translation?) have a limited discussion of the effect of safety filtering
  4. Update the Zeno report to match the paper content

Here is a checklist for the tasks; please check off each task when this is done!

  • Knowledge-based QA
  • Reasoning
  • Mathematics
  • Code Generation
  • Translation
  • Web Instruction Following

Common-sense QA checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)

Machine Translation Checklist

  • Output generation code that supports litellm checked into the repo
  • System outputs for OpenAI models checked in to the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group.
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked in to the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis is done of the results and text and examples are added to the paper
  • (Optional) Also created results for Mixtral (through Together)
