
arena-hard-auto's People

Contributors

codingwithtim, dmitrysarov, infwinston, karthik-nexusflow, r4dm, sxjscience, xukai92

arena-hard-auto's Issues

Evaluate local models

Hi! Thanks for releasing this awesome benchmark :)

I was interested in evaluating this benchmark with local models that I have trained, or with models available on Hugging Face. From what I understand, it appears that I would need to build the generation pipeline myself, possibly using tools like vLLM or similar services. Am I missing anything here?
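
For what it's worth, below is a minimal sketch of the usual setup, assuming the local model is served through vLLM's OpenAI-compatible server so an OpenAI-style generation pipeline can simply point at it. The model name, port, and prompt are placeholders, not values from this repo.

    # Minimal sketch (not the repo's own pipeline): query a model served locally by
    # vLLM's OpenAI-compatible server. Model name, port, and prompt are hypothetical.
    # Assumed server start: python -m vllm.entrypoints.openai.api_server --model my-org/my-local-model --port 8000
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",   # local vLLM endpoint (assumption)
        api_key="EMPTY",                       # vLLM ignores the key, but the client requires one
    )

    response = client.chat.completions.create(
        model="my-org/my-local-model",         # must match the name the server registered
        messages=[{"role": "user", "content": "Write a one-line summary of the Bradley-Terry model."}],
        temperature=0.0,
        max_tokens=256,
    )
    print(response.choices[0].message.content)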

Bradley-Terry model

First of all, thanks for your work.
Maybe I have misunderstood, but I could not find a Bradley-Terry model implementation in your code; instead you seem to be doing something interesting with logistic regression coefficients.
Can you point to the source of the idea behind this?
And do you think a Bradley-Terry model would perform worse than this logistic-regression trick?
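
For reference, the connection being asked about is a standard one: the Bradley-Terry maximum-likelihood fit can be written as a plain logistic regression over pairwise outcomes, so the "LogReg coefficient" approach is typically an implementation of Bradley-Terry rather than a replacement for it. A minimal sketch (made-up battles, not the repo's exact code):

    # Each battle between model i and model j becomes one row: +1 in column i, -1 in
    # column j, label 1 if model i won. The fitted coefficients are the models'
    # relative log-strengths. Battle data below is invented for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    models = ["model_a", "model_b", "model_c"]
    battles = [                        # (left, right, 1 if left won else 0)
        ("model_a", "model_b", 1),
        ("model_a", "model_c", 1),
        ("model_b", "model_c", 0),
        ("model_b", "model_a", 0),
    ]

    idx = {m: k for k, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for r, (left, right, left_won) in enumerate(battles):
        X[r, idx[left]], X[r, idx[right]] = 1.0, -1.0
        y[r] = left_won

    # fit_intercept=False because only strength differences matter; sklearn's default
    # L2 penalty adds a little shrinkage toward equal strengths.
    lr = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y)
    print(dict(zip(models, lr.coef_[0].round(3))))   # relative log-strengths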

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Hi! Thanks for releasing this amazing benchmark; we have been using it heavily for our development. When I run show_result.py, the following warning occurs, and it seems that the logistic regression model failed to converge. Should I be concerned about it?

/scratch/gpfs/mengzhou/anaconda3/envs/handbook/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
bootstrap:   2%|▍         | 4/200 [00:07<05:28,  1.68s/it]
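
For reference, the warning comes from scikit-learn's LogisticRegression hitting its iteration cap during the rating fit. If it needs silencing, a hedged sketch of the usual remedy (standard scikit-learn options, not necessarily the repo's exact constructor call) is to raise max_iter wherever show_result.py builds the regression:

    # scikit-learn's default max_iter is 100; raising it usually removes this warning.
    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression(fit_intercept=False, max_iter=2000, tol=1e-6)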

CI results different for same model answer copy

Hey Team,

Thanks for sharing the benchmarks! I'm testing the scripts by simply copying a model answer along with its judgment and changing the model id in each jsonl file, but I got different CI results, as shown below:

gpt-4-0613-copy                  | score: 37.9  | 95% CI: (-2.8, 2.7)  | average #tokens: 354
gpt-4-0613                       | score: 37.9  | 95% CI: (-2.7, 3.0)  | average #tokens: 354

Is it related to the bootstrapping? Thanks!

Best regards,
QQ
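
For context, this kind of wobble is expected when the bootstrap resampling is not seeded: two statistically identical sets of battles bootstrapped independently give slightly different percentile intervals. A toy illustration (synthetic data, not the repo's code):

    # Bootstrapping the same data twice with an unseeded RNG gives slightly different
    # percentile intervals, the same kind of wobble seen between gpt-4-0613 and its copy.
    import numpy as np

    outcomes = np.random.default_rng(0).binomial(1, 0.38, size=500)   # synthetic battle outcomes

    def bootstrap_ci(data, rounds=100, rng=None):
        rng = rng if rng is not None else np.random.default_rng()     # unseeded by default
        means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(rounds)]
        return np.percentile(means, [2.5, 97.5])

    print(bootstrap_ci(outcomes))   # one 95% interval
    print(bootstrap_ci(outcomes))   # a slightly different interval for the identical data
    # Constructing a fresh np.random.default_rng(42) for each call would make the two match.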

Local model as a judge

Hello! Thank you very much for your work! I have a question.

How do I use a local model as a judge? Generating responses with vLLM works fine, but if I set the same local model as the evaluator, the code throws an error:

openai.NotFoundError: Error code: 404 - {'object': 'error', 'message': 'The model ` Meta-Llama-3-70B-Instruct-GPTQ` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}.

Of course, I set judge_model: Meta-Llama-3-70B-Instruct-GPTQ in judge_config.yaml, and api_config.yaml is configured correctly, because generation works with the same model.
Do I need to fix something in the code to make a local model work as a judge? Thanks!
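
One hedged debugging step (not part of the repo): the 404 text above shows the model name with a leading space, which may just be formatting in this transcript but could also indicate a stray space in judge_config.yaml. Listing the model ids the local OpenAI-compatible server actually registered, and comparing them character-for-character with the judge_model value, usually settles it; the endpoint URL below is a placeholder.

    # List the model ids the local server serves; repr() makes stray spaces visible.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for m in client.models.list():
        print(repr(m.id))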

[Discussion] Methodology for bootstrapping with replacement to obtain separability confidence intervals

Hi,

I was studying how arena-hard gets confidence intervals from pairwise comparisons as this is used directly to measure "separability", defined as the percentage of non-overlapping intervals.

The original blog post says: "The 95% confidence interval is computed via 100 rounds of bootstrapping".

I pulled up the code and it looks like you are indeed:

  1. calculating a new elo score over multiple rounds
  2. using the same set of battles each round, but
  3. randomly sampling 100% of the battles with replacement, i.e. df.sample(frac=1.0, replace=True).

I understand that step 3) gives you a slightly different set of battles each time, however, the bootstrap sets will be nearly identical to the original except for whichever items are randomly selected more than once.

I don't doubt that bootstrapping is a widely used technique overall; I'm mainly wondering whether df.sample(frac=1.0, replace=True) is the most accepted "correct" way to do it.

Not being a statistics expert, my hunch is that bootstrapping with k-fold cross-validation, where the battles are separated into random buckets (without replacement), would be a better expression of variance and thus a stronger test for confidence intervals.
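
For reference, step 3 is the textbook nonparametric bootstrap, and each resample typically drops a sizeable fraction of the original rows (about 1/e, roughly 37 percent, on average), so resamples differ from the original set by more than just the duplicated items. A toy check with pandas (illustrative only):

    # How much of the original battle set a single frac=1.0, replace=True resample retains.
    import pandas as pd

    df = pd.DataFrame({"battle_id": range(1000)})
    resample = df.sample(frac=1.0, replace=True)
    print(f"unique rows kept: {resample['battle_id'].nunique() / len(df):.2%}")   # typically ~63%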

[Q] About hosting `arena-hard-v0.1/question.json` in the Hugging Face Hub

Description

Hi there! Awesome job with Arena Hard 👏🏻 I just opened this issue because we at @argilla-io are currently exploring the use of distilabel for running benchmarks such as Arena Hard, and we were wondering whether you have any issue with https://huggingface.co/datasets/alvarobartt/lmsys-arena-hard-v0.1 being hosted on the Hugging Face Hub on our end.

We uploaded it there because we couldn't find the dataset per se in the Hub, but could only find https://huggingface.co/spaces/lmsys/arena-hard-browser with the question.jsonl file there.

If there's already a dataset, or if you'd like us to transfer ours to your org, we'll happily do so. We hope this is not an issue, but just let us know!

Congrats again on the awesome work evaluating LLMs!

[Bug] Temperature is always `0.0`

The temperature for answer generation is always 0.0, since the category of every question is arena-hard-v0.1, which implicitly means the temperature defaults to 0.0. I wanted to flag this since it might be undesirable behavior.
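
A minimal sketch of the behavior described above; the mapping name and values are hypothetical, not the repo's exact code.

    # Category-to-temperature lookup that falls back to 0.0 for unknown categories,
    # which is what happens when every question is tagged "arena-hard-v0.1".
    temperature_config = {"writing": 0.7, "math": 0.0, "coding": 0.0}   # hypothetical mapping

    def get_temperature(category: str) -> float:
        return temperature_config.get(category, 0.0)

    print(get_temperature("arena-hard-v0.1"))   # 0.0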

[Feature] support arena-hard in opencompass

Hi, thanks for such robust work!
We now support the Arena-Hard dataset in OpenCompass. OpenCompass is an evaluation platform that can partition tasks and supports different model inference backends, thereby accelerating the model evaluation process.
By combining your dataset with OpenCompass, it is now possible to select a model and run inference and evaluation in a single step.
OpenCompass also supports changing the judge model or setting multiple judge models.
The demo config in OpenCompass is here: https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_arena_hard.py
Feel free to try it in OpenCompass, and we can further collaborate to strengthen the open-source community's LLM evaluation work.

Multi-threaded generation support?

I found judgment generation to be really time-consuming: evaluating one model takes more than 1.5 hours using gpt-4-1106-preview with a parallel count of 8. Is this expected behavior?

If so, does gen_judgement support multi-threaded generation across multiple API endpoints to balance the load? For example, with 5 endpoints I could generate 40 judgments at the same time, which should significantly accelerate the evaluation process.

If this is already supported, please tell me how to do it. Simply adding multiple endpoints for the judge model in api_config.yaml does not seem to work.

Much thanks.
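
Whether or not gen_judgement supports this out of the box, the general pattern is straightforward; below is a hedged sketch (not the repo's code) that spreads requests across several OpenAI-compatible endpoints with a thread pool, matching the 5 endpoints x 8 parallel = 40 arithmetic above. Endpoint URLs, prompts, and worker counts are hypothetical.

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    endpoints = [f"http://worker{i}:8000/v1" for i in range(5)]          # 5 hypothetical servers
    clients = [OpenAI(base_url=url, api_key="EMPTY") for url in endpoints]

    def judge_one(task):
        i, prompt = task
        client = clients[i % len(clients)]                               # round-robin by task index
        resp = client.chat.completions.create(
            model="gpt-4-1106-preview",                                  # judge model, as in the issue
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return resp.choices[0].message.content

    prompts = [f"[hypothetical judging prompt #{i}]" for i in range(40)]
    with ThreadPoolExecutor(max_workers=40) as pool:                     # 5 endpoints x 8 parallel each
        judgments = list(pool.map(judge_one, enumerate(prompts)))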

Discrepancy in Scores When Switching GPT Model Versions

I recently judged the model answers provided here and decided to switch the GPT version from gpt-4-1106-preview to gpt-4-0125-preview, because I can only access instances of that version.
After making this change, I observed a discrepancy of over 440 points (out of 1000) in the score compared to the judgment benchmarks listed in your documentation.

Could you please advise on how to address this issue or suggest any solutions that might help mitigate this discrepancy?

Majority of questions are coding questions!

Thanks for the great work. The questions are so hard that I couldn't answer any of them.

However, I found that a large majority of the questions are coding and CS-related, which would favor code-oriented models and disadvantage general-purpose models, despite the fact that this benchmark is advertised as a general-purpose MT-Bench replacement.

What do you think?

Models testing themselves will always be biased.

The prompt is formatted such that the judge is supposed to answer the question before judging. If the model is judging itself, then it will compare its own answer with its own answer, which will be mostly the same. That holds true for models of the same family as well: GPT-4-turbo-preview will have nearly the same answers as GPT-4-turbo, and likewise for the Claude suite.

Naturally, the judge is going to prefer the answer that most resembles its own. In the end, I'm wondering whether having the judge produce its own answer is necessary.

Markdown Rendering Issue

Environment

python3.10, gradio 3.40.0

Issue

When browsing some questions (I forget which), I encounter this backend gradio error:

ValueError: 
\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}
^
ParseFatalException: Unknown symbol: \ket, found '\'  (at char 0), (line:1, col:1)
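
The parse error indicates that whatever LaTeX renderer the browser hands this string to does not know the physics-package \ket macro. A hedged workaround sketch (not the repo's code) is to rewrite \ket{...} into explicit \left| ... \right\rangle notation, which standard renderers understand, before the text is displayed:

    # Nested braces inside \ket{...} are not handled by this simple regex.
    import re

    def replace_ket(text: str) -> str:
        return re.sub(r"\\ket\{([^{}]*)\}", r"\\left| \1 \\right\\rangle", text)

    print(replace_ket(r"\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}"))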

QA browser does not work properly for me

Environment

python3.10, with gradio==4.26.0, gradio_client==0.15.1

Typo

Well, first there is a typo in the file name: it should be "browser" instead of "broswer".

How to reproduce it

First, gradio is not listed in the requirements, so I had to install it manually to run the QA browser.

After invoking python qa_broswer.py, something abnormal appears:

Namespace(host='0.0.0.0', port=None, share=False, config_file='config/judge_config.yaml')
<...>/python3.10/site-packages/gradio/components/dropdown.py:179: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: gpt-3.5-turbo-0613 or set allow_custom_value=True.
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

When accessing the link,

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  <skip some not very relevant messages>
  File "xxx/python3.10/site-packages/gradio/routes.py", line 797, in process_msg
    return f"data: {orjson.dumps(message.model_dump()).decode('utf-8')}\n\n"
TypeError: Type is not JSON serializable: Dropdown

And this is what I observed (screenshot omitted).
