lm-sys / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
License: Apache License 2.0
Hi! Thanks for releasing this awesome benchmark :)
I am interested in evaluating this benchmark with local models that I have trained, or with models available on Hugging Face. From what I understand, I would need to build the generation pipeline myself, possibly using tools like vLLM or a similar serving stack. Am I missing anything here?
First of all, thanks for your work.
Maybe I have misunderstood, but I could not find a Bradley-Terry model implementation in your code; instead you are doing something interesting with logistic-regression coefficients.
Can you point to the source of the idea behind this?
And do you think the Bradley-Terry model would perform worse than this logistic-regression trick?
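For what it's worth, the two formulations coincide: the Bradley-Terry model P(i beats j) = sigmoid(theta_i - theta_j) is exactly a logistic regression whose feature vector has +1 at model i and -1 at model j, so the LR coefficients are the BT strengths up to an additive constant. A small from-scratch sketch on synthetic battles (illustrative only, not the repo's implementation):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(battles, n_models, lr=0.05, epochs=1000):
    """battles: list of (winner_idx, loser_idx).
    Gradient ascent on the Bradley-Terry log-likelihood, which is
    the same objective as logistic regression on +/-1 features."""
    theta = [0.0] * n_models
    for _ in range(epochs):
        grad = [0.0] * n_models
        for w, l in battles:
            p = sigmoid(theta[w] - theta[l])  # P(winner beats loser)
            grad[w] += 1 - p
            grad[l] -= 1 - p
        for i in range(n_models):
            theta[i] += lr * grad[i] / len(battles)
    # Anchor model 0 at 0: BT strengths are only identified up to a shift.
    return [t - theta[0] for t in theta]

# Synthetic data: model 1 is truly 1.0 "strength units" above model 0,
# so it wins each battle with probability sigmoid(1.0) ~ 0.73.
rng = random.Random(0)
battles = []
for _ in range(2000):
    battles.append((1, 0) if rng.random() < sigmoid(1.0) else (0, 1))

est = fit_bradley_terry(battles, 2)
print(est)  # est[1] should recover a value close to 1.0
```

The same fit could be done with sklearn's LogisticRegression on a +1/-1 design matrix, which is presumably what the "LR trick" in the code amounts to.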
Hi! Thanks for releasing this amazing benchmark; we have been using it heavily in our development. When I run show_result.py, the following warning occurs, and it seems that the logistic regression model failed to converge. Should I be concerned about it?
bootstrap: 2%|████▍ | 4/200 [00:07<05:28, 1.68s/it]
/scratch/gpfs/mengzhou/anaconda3/envs/handbook/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Hey Team,
Thanks for sharing the benchmark! I'm testing the scripts by simply copying a model's answers and judgments and changing the model IDs in each jsonl file, but I got different CI results, as shown below:
gpt-4-0613-copy | score: 37.9 | 95% CI: (-2.8, 2.7) | average #tokens: 354
gpt-4-0613 | score: 37.9 | 95% CI: (-2.7, 3.0) | average #tokens: 354
Is it related to the bootstrapping? Thanks!
Best regards,
QQ
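The difference between the two intervals above is consistent with bootstrap randomness: each round resamples the battles with replacement, so even two byte-identical copies of the same data produce slightly different intervals under different random draws. A minimal stdlib sketch (synthetic win/loss data, not Arena-Hard's code):

```python
import random
import statistics

def bootstrap_ci(outcomes, rounds=200, seed=0):
    """Percentile 95% CI of the mean win rate via bootstrap resampling."""
    rng = random.Random(seed)
    means = []
    for _ in range(rounds):
        # Resample the battles with replacement, same size as the original.
        resample = rng.choices(outcomes, k=len(outcomes))
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int(0.025 * rounds)]
    hi = means[int(0.975 * rounds)]
    return lo, hi

# Two *identical* outcome lists (1 = win, 0 = loss) ...
battles = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 50
# ... still yield slightly different intervals with different random draws.
print(bootstrap_ci(battles, seed=1))
print(bootstrap_ci(battles, seed=2))
```

So yes, a small wobble in the CI bounds between a model and its copy is expected; the point estimate (score 37.9) matches exactly because it is computed on the full, unresampled data.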
Hello! Thank you very much for your work! I have a question.
How do I use a local model as a judge? Generating responses with vLLM works fine, but if I set the same local model as the evaluator, the code throws an error:
openai.NotFoundError: Error code: 404 - {'object': 'error', 'message': 'The model ` Meta-Llama-3-70B-Instruct-GPTQ` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}.
Of course, I changed judge_model in judge_config.yaml to Meta-Llama-3-70B-Instruct-GPTQ,
and api_config.yaml is configured correctly, since generation works with the same model.
Do I need to fix something in the code to make a local model work as a judge? TY!!
Currently, the only generation parameter that can be set is temperature:
https://github.com/lm-sys/arena-hard/blob/main/config/gen_answer_config.yaml#L5
However, it would be useful to also be able to set other parameters, such as repetition_penalty, ideally at the model level. These could then be passed along to the API endpoints during generation.
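One way the per-model layering could look is global defaults overridden per model. This is a hypothetical sketch: the key names ("repetition_penalty", "my-local-model", etc.) are illustrative, not Arena-Hard-Auto's actual config schema:

```python
# Global generation defaults, applied to every model unless overridden.
GLOBAL_DEFAULTS = {"temperature": 0.0, "max_tokens": 4096}

# Per-model overrides (hypothetical model ID and parameter names).
MODEL_OVERRIDES = {
    "my-local-model": {"repetition_penalty": 1.1, "temperature": 0.7},
}

def build_gen_kwargs(model_id):
    """Merge global defaults with any per-model overrides."""
    kwargs = dict(GLOBAL_DEFAULTS)
    kwargs.update(MODEL_OVERRIDES.get(model_id, {}))
    return kwargs

print(build_gen_kwargs("my-local-model"))
```

The merged kwargs would then be forwarded to whichever API endpoint serves that model.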
AFAIK it is the best open-source model, no?
Also, I would like to see Claude 3.5, GPT-4o, and Qwen2.
Hi,
I was studying how arena-hard gets confidence intervals from pairwise comparisons as this is used directly to measure "separability", defined as the percentage of non-overlapping intervals.
The original blog post says: "The 95% confidence interval is computed via 100 rounds of bootstrapping".
I pulled up the code, and it looks like you are indeed resampling via
df.sample(frac=1.0, replace=True)
I understand that this resampling step gives you a slightly different set of battles each time; however, the bootstrap sets will be nearly identical to the original except for whichever items are randomly selected more than once.
I don't doubt that bootstrapping overall is a widely used technique -- I'm mainly wondering whether df.sample(frac=1.0, replace=True)
is the most accepted, "correct" way to do it.
As a non-statistician, my hunch is that bootstrapping combined with k-fold cross-validation, where the battles are separated into random buckets (without replacement), would be a better expression of variance and thus a stronger test for confidence intervals.
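One quantitative note on "nearly identical to the original": a size-n resample with replacement contains, on average, only about 63% of the distinct items, since each item is omitted with probability (1 - 1/n)^n → 1/e. A quick stdlib check of this (no pandas needed):

```python
import random

def unique_fraction(n, rounds=100, seed=0):
    """Average fraction of distinct items appearing in a size-n
    resample drawn with replacement (the stdlib analogue of
    df.sample(frac=1.0, replace=True))."""
    rng = random.Random(seed)
    fracs = []
    for _ in range(rounds):
        resample = [rng.randrange(n) for _ in range(n)]
        fracs.append(len(set(resample)) / n)
    return sum(fracs) / len(fracs)

# For large n this converges to 1 - 1/e ~ 0.632, i.e. each bootstrap
# resample omits roughly 37% of the original battles.
print(unique_fraction(2000))
```

So the resamples differ from the original set more than intuition might suggest, which is why df.sample(frac=1.0, replace=True) is the textbook nonparametric bootstrap rather than a degenerate one.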
Hi there! Awesome job with Arena Hard 👏🏻 I opened this issue because we at @argilla-io are currently exploring the use of distilabel for running benchmarks such as Arena Hard, and we were wondering whether you have any issue with https://huggingface.co/datasets/alvarobartt/lmsys-arena-hard-v0.1 being hosted in the Hugging Face Hub on our end.
We uploaded it there because we couldn't find the dataset itself on the Hub; we could only find https://huggingface.co/spaces/lmsys/arena-hard-browser with the question.jsonl file in it.
If there's already a dataset, or if you'd like us to transfer ours to your org, we'll happily do so. We hope this is not an issue, but just let us know!
Congrats again on the awesome job evaluating LLMs!
The temperature for answer generation is always 0.0, since the category of all questions is arena-hard-v0.1; this implicitly means the temperature is defaulted to 0.0.
Wanted to flag this since I think it might be undesirable behavior.
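A minimal reconstruction of the behavior being described (the names here are illustrative, not the repo's actual code): if temperature is looked up by question category and every Arena-Hard question has category "arena-hard-v0.1", the lookup always misses and silently falls back to the default:

```python
# Hypothetical per-category temperature table.
temperature_by_category = {"writing": 0.7, "math": 0.0}

def get_temperature(category):
    # Any category not in the table silently defaults to 0.0.
    return temperature_by_category.get(category, 0.0)

print(get_temperature("arena-hard-v0.1"))  # always falls through to 0.0
```

A missing-key warning, or keying the temperature on something other than the (constant) category, would make the configured value actually take effect.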
It looks like the winners are not correctly specified in the get_battles_from_judgment function. Can you take a look and fix it?
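For reference, a hedged sketch of what a correct verdict-to-winner mapping would look like. The label set mirrors Arena-Hard's "A>B"-style verdicts, but this is an illustration, not the repo's actual get_battles_from_judgment code:

```python
def label_to_winner(label, model_a, model_b):
    """Map a judge verdict label to the winning model (or a tie)."""
    if label in ("A>B", "A>>B"):
        return model_a
    if label in ("B>A", "B>>A"):
        return model_b
    if label == "A=B":
        return "tie"
    raise ValueError(f"unrecognized verdict label: {label}")

print(label_to_winner("B>>A", "gpt-4", "my-model"))
```

A bug here would flip or drop wins, which directly distorts the Bradley-Terry scores downstream.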
Thanks for your great work. Can I request evaluation of new models to add to the leaderboard?
Hi, thanks for such robust work!
We now support the ArenaHard dataset in OpenCompass. OpenCompass is an evaluation platform that can partition tasks and supports different model inference backends, thereby accelerating the model evaluation process.
By combining the strengths of your dataset and OpenCompass, it is now possible to select a model and perform rapid inference and evaluation in one step.
Besides, OpenCompass also supports changing the judge model or setting multiple judge models.
The demo config in OpenCompass is here: https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_arena_hard.py
You're welcome to try OpenCompass, and we can further collaborate to strengthen the LLM evaluation work of the open-source community.
I found the judgment generation to be really time-consuming: evaluating a model costs more than 1.5 hours using gpt-4-1106-preview with a parallel count of 8. Is this expected behavior?
If yes, does gen_judgement support multi-threaded generation across multiple API endpoints to balance the load? For example, with 5 endpoints I could generate 40 judgments at the same time, which should significantly accelerate the evaluation process.
If this is not supported, please tell me how to achieve it; simply adding multiple endpoints for the judge model in api_config.yaml does not seem to work.
Many thanks.
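A sketch of one way to do the load balancing being asked about: round-robin the questions over the endpoints and keep a fixed number of requests in flight per endpoint. Here `judge_one` is a stand-in for the real judging API call, and the endpoint URLs are placeholders:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoint URLs; real ones would come from api_config.yaml.
ENDPOINTS = [f"http://host{i}:8000/v1" for i in range(5)]

def judge_one(endpoint, question_id):
    # Real code would send the judging request to `endpoint` here.
    return (endpoint, question_id)

def judge_all(question_ids, workers_per_endpoint=8):
    """Distribute judgments round-robin over ENDPOINTS, with
    workers_per_endpoint concurrent requests per endpoint."""
    cycle = itertools.cycle(ENDPOINTS)
    jobs = [(next(cycle), qid) for qid in question_ids]
    with ThreadPoolExecutor(
        max_workers=workers_per_endpoint * len(ENDPOINTS)
    ) as pool:
        # map preserves input order, so results line up with question_ids.
        return list(pool.map(lambda job: judge_one(*job), jobs))

results = judge_all(range(10))
```

With 5 endpoints and 8 workers each, this yields the 40 concurrent judgments mentioned above; whether gen_judgement supports this natively is a separate question for the maintainers.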
I recently judged the model answers provided here and decided to switch the GPT version from gpt-4-1106-preview to gpt-4-0125-preview, because I can only access instances of that version.
After making this change, I observed a discrepancy of over 440 points (out of 1000) in the score compared to the judgment benchmarks listed in your documentation.
Could you please advise on how to address this issue, or suggest any solutions that might help mitigate the discrepancy?
Thanks for the great work. The questions are so hard that I couldn't answer any of them myself.
However, I found that a large majority of the questions are coding- and CS-related, which would favor code-oriented models and disadvantage other general-purpose models, despite this benchmark being advertised as a general-purpose MT-Bench replacement.
What do you think?
The prompt is formatted such that the judge is supposed to answer the question itself before judging. If the model is judging itself, then it will compare its own answer with... its own answer, which will be mostly the same. That holds true for models of the same family as well: GPT-4-turbo-preview will have nearly the same answers as GPT-4-turbo, and likewise for the Claude suite.
Naturally, the judge is going to prefer the answer that most resembles its own. In the end, I'm wondering whether this judging setup is sound.
python3.10, gradio 3.40.0
When browsing some questions (I forget which), I encounter this backend gradio error:
ValueError:
\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}
^
ParseFatalException: Unknown symbol: \ket, found '\' (at char 0), (line:1, col:1)
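The ParseFatalException comes from pyparsing, which matplotlib's mathtext parser uses; mathtext does not know the physics-package macro \ket. One possible workaround (an assumption on my part, not a tested gradio fix) is to rewrite \ket{x} as |x\rangle before the string reaches the renderer:

```python
import re

def expand_ket(tex):
    """Rewrite \\ket{x} as |x\\rangle, which mathtext can parse."""
    return re.sub(r"\\ket\{([^{}]*)\}", r"|\1\\rangle", tex)

print(expand_ket(
    r"\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}"
))
```

This handles non-nested \ket arguments, which covers the failing expression above.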
The prompt template on GitHub compares two models instead of scoring them.
python3.10, with gradio==4.26.0, gradio_client==0.15.1
Well, first there is a typo in the file name: it should be "browser" instead of "broswer".
Second, gradio is not listed in the requirements, so I had to install it manually to run the QA browser.
After invoking python qa_broswer.py
, there is something abnormal:
Namespace(host='0.0.0.0', port=None, share=False, config_file='config/judge_config.yaml')
<...>/python3.10/site-packages/gradio/components/dropdown.py:179: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: gpt-3.5-turbo-0613 or set allow_custom_value=True.
warnings.warn(
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
When accessing the link,
ERROR: Exception in ASGI application
Traceback (most recent call last):
<skip some not very relevant messages>
File "xxx/python3.10/site-packages/gradio/routes.py", line 797, in process_msg
return f"data: {orjson.dumps(message.model_dump()).decode('utf-8')}\n\n"
TypeError: Type is not JSON serializable: Dropdown
And this is what I observed