lm-sys / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
License: Apache License 2.0
Hi! Thanks for releasing this awesome benchmark :)
I am interested in evaluating this benchmark with local models that I have trained, or with models available on Hugging Face. From what I understand, I would need to build the generation pipeline myself, possibly using tools like vLLM or a similar serving stack. Am I missing anything here?
First of all, thanks for your work.
Maybe I have misunderstood, but I could not find a Bradley-Terry model implementation in your code; instead you are doing something interesting with logistic-regression coefficients.
Can you point to the source of the idea behind this?
And do you think the Bradley-Terry model would perform worse than this logistic-regression trick?
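For what it's worth, the two formulations coincide: the Bradley-Terry model P(i beats j) = sigmoid(theta_i - theta_j) is exactly a logistic regression whose feature vector has +1 at model i and -1 at model j, so the LR coefficients are the BT strengths up to an additive constant. A small from-scratch sketch on synthetic battles (illustrative only, not the repo's implementation):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(battles, n_models, lr=0.05, epochs=1000):
    """battles: list of (winner_idx, loser_idx).
    Gradient ascent on the Bradley-Terry log-likelihood, which is
    the same objective as logistic regression on +/-1 features."""
    theta = [0.0] * n_models
    for _ in range(epochs):
        grad = [0.0] * n_models
        for w, l in battles:
            p = sigmoid(theta[w] - theta[l])  # P(winner beats loser)
            grad[w] += 1 - p
            grad[l] -= 1 - p
        for i in range(n_models):
            theta[i] += lr * grad[i] / len(battles)
    # Anchor model 0 at 0: BT strengths are only identified up to a shift.
    return [t - theta[0] for t in theta]

# Synthetic data: model 1 is truly 1.0 "strength units" above model 0,
# so it wins each battle with probability sigmoid(1.0) ~ 0.73.
rng = random.Random(0)
battles = []
for _ in range(2000):
    battles.append((1, 0) if rng.random() < sigmoid(1.0) else (0, 1))

est = fit_bradley_terry(battles, 2)
print(est)  # est[1] should recover a value close to 1.0
```

The same fit could be done with sklearn's LogisticRegression on a +1/-1 design matrix, which is presumably what the "LR trick" in the code amounts to.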
Hi! Thanks for releasing this amazing benchmark; we have been using it heavily in our development. When I run show_result.py, the following warning occurs, and it seems that the logistic regression model failed to converge. Should I be concerned about it?
bootstrap: 2%|████▍ | 4/200 [00:07<05:28, 1.68s/it]
/scratch/gpfs/mengzhou/anaconda3/envs/handbook/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Hey Team,
Thanks for sharing the benchmark! I'm testing the scripts by simply copying a model's answers and judgments and changing the model IDs in each jsonl file, but I got different CI results, as shown below:
gpt-4-0613-copy | score: 37.9 | 95% CI: (-2.8, 2.7) | average #tokens: 354
gpt-4-0613 | score: 37.9 | 95% CI: (-2.7, 3.0) | average #tokens: 354
Is it related to the bootstrapping? Thanks!
Best regards,
QQ
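The difference between the two intervals above is consistent with bootstrap randomness: each round resamples the battles with replacement, so even two byte-identical copies of the same data produce slightly different intervals under different random draws. A minimal stdlib sketch (synthetic win/loss data, not Arena-Hard's code):

```python
import random
import statistics

def bootstrap_ci(outcomes, rounds=200, seed=0):
    """Percentile 95% CI of the mean win rate via bootstrap resampling."""
    rng = random.Random(seed)
    means = []
    for _ in range(rounds):
        # Resample the battles with replacement, same size as the original.
        resample = rng.choices(outcomes, k=len(outcomes))
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int(0.025 * rounds)]
    hi = means[int(0.975 * rounds)]
    return lo, hi

# Two *identical* outcome lists (1 = win, 0 = loss) ...
battles = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 50
# ... still yield slightly different intervals with different random draws.
print(bootstrap_ci(battles, seed=1))
print(bootstrap_ci(battles, seed=2))
```

So yes, a small wobble in the CI bounds between a model and its copy is expected; the point estimate (score 37.9) matches exactly because it is computed on the full, unresampled data.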
Hello! Thank you very much for your work! I have a question.
How do I use a local model as a judge? Generating responses with vLLM works fine, but if I set the same local model as the evaluator, the code throws an error:
openai.NotFoundError: Error code: 404 - {'object': 'error', 'message': 'The model ` Meta-Llama-3-70B-Instruct-GPTQ` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}.
Of course, I changed judge_model in judge_config.yaml to Meta-Llama-3-70B-Instruct-GPTQ,
and api_config.yaml is configured correctly, since generation works with the same model.
Do I need to fix something in the code to make a local model work as a judge? TY!!
Currently, the only generation parameter that can be set is temperature:
https://github.com/lm-sys/arena-hard/blob/main/config/gen_answer_config.yaml#L5
However, it would be useful to also be able to set other parameters, such as repetition_penalty, ideally at the model level. These could then be passed along to the API endpoints during generation.
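One way the per-model layering could look is global defaults overridden per model. This is a hypothetical sketch: the key names ("repetition_penalty", "my-local-model", etc.) are illustrative, not Arena-Hard-Auto's actual config schema:

```python
# Global generation defaults, applied to every model unless overridden.
GLOBAL_DEFAULTS = {"temperature": 0.0, "max_tokens": 4096}

# Per-model overrides (hypothetical model ID and parameter names).
MODEL_OVERRIDES = {
    "my-local-model": {"repetition_penalty": 1.1, "temperature": 0.7},
}

def build_gen_kwargs(model_id):
    """Merge global defaults with any per-model overrides."""
    kwargs = dict(GLOBAL_DEFAULTS)
    kwargs.update(MODEL_OVERRIDES.get(model_id, {}))
    return kwargs

print(build_gen_kwargs("my-local-model"))
```

The merged kwargs would then be forwarded to whichever API endpoint serves that model.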
AFAIK it is the best open-source model, no?
Also, I would like to see Claude 3.5, GPT-4o, and Qwen2.
Hi,
I was studying how arena-hard gets confidence intervals from pairwise comparisons as this is used directly to measure "separability", defined as the percentage of non-overlapping intervals.
The original blog post says: "The 95% confidence interval is computed via 100 rounds of bootstrapping".
I pulled up the code, and it looks like you are indeed resampling via
df.sample(frac=1.0, replace=True)
I understand that this resampling step gives you a slightly different set of battles each time; however, the bootstrap sets will be nearly identical to the original except for whichever items are randomly selected more than once.
I don't doubt that bootstrapping overall is a widely used technique -- I'm mainly wondering whether df.sample(frac=1.0, replace=True)
is the most accepted, "correct" way to do it.
As a non-statistician, my hunch is that bootstrapping combined with k-fold cross-validation, where the battles are separated into random buckets (without replacement), would be a better expression of variance and thus a stronger test for confidence intervals.
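One quantitative note on "nearly identical to the original": a size-n resample with replacement contains, on average, only about 63% of the distinct items, since each item is omitted with probability (1 - 1/n)^n → 1/e. A quick stdlib check of this (no pandas needed):

```python
import random

def unique_fraction(n, rounds=100, seed=0):
    """Average fraction of distinct items appearing in a size-n
    resample drawn with replacement (the stdlib analogue of
    df.sample(frac=1.0, replace=True))."""
    rng = random.Random(seed)
    fracs = []
    for _ in range(rounds):
        resample = [rng.randrange(n) for _ in range(n)]
        fracs.append(len(set(resample)) / n)
    return sum(fracs) / len(fracs)

# For large n this converges to 1 - 1/e ~ 0.632, i.e. each bootstrap
# resample omits roughly 37% of the original battles.
print(unique_fraction(2000))
```

So the resamples differ from the original set more than intuition might suggest, which is why df.sample(frac=1.0, replace=True) is the textbook nonparametric bootstrap rather than a degenerate one.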
Hi there! Awesome job with Arena Hard 👏🏻 I opened this issue because we at @argilla-io are currently exploring the use of distilabel for running benchmarks such as Arena Hard, and we were wondering whether you have any issue with https://huggingface.co/datasets/alvarobartt/lmsys-arena-hard-v0.1 being hosted in the Hugging Face Hub on our end.
We uploaded it there because we couldn't find the dataset itself on the Hub; we could only find https://huggingface.co/spaces/lmsys/arena-hard-browser with the question.jsonl file in it.
If there's already a dataset, or if you'd like us to transfer ours to your org, we'll happily do so. We hope this is not an issue, but just let us know!
Congrats again on the awesome job evaluating LLMs!
The temperature for answer generation is always 0.0, since the category of all questions is arena-hard-v0.1; this implicitly means the temperature is defaulted to 0.0.
Wanted to flag this since I think it might be undesirable behavior.
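A minimal reconstruction of the behavior being described (the names here are illustrative, not the repo's actual code): if temperature is looked up by question category and every Arena-Hard question has category "arena-hard-v0.1", the lookup always misses and silently falls back to the default:

```python
# Hypothetical per-category temperature table.
temperature_by_category = {"writing": 0.7, "math": 0.0}

def get_temperature(category):
    # Any category not in the table silently defaults to 0.0.
    return temperature_by_category.get(category, 0.0)

print(get_temperature("arena-hard-v0.1"))  # always falls through to 0.0
```

A missing-key warning, or keying the temperature on something other than the (constant) category, would make the configured value actually take effect.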
It looks like the winners are not correctly specified in the get_battles_from_judgment function. Can you take a look and fix it?
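For reference, a hedged sketch of what a correct verdict-to-winner mapping would look like. The label set mirrors Arena-Hard's "A>B"-style verdicts, but this is an illustration, not the repo's actual get_battles_from_judgment code:

```python
def label_to_winner(label, model_a, model_b):
    """Map a judge verdict label to the winning model (or a tie)."""
    if label in ("A>B", "A>>B"):
        return model_a
    if label in ("B>A", "B>>A"):
        return model_b
    if label == "A=B":
        return "tie"
    raise ValueError(f"unrecognized verdict label: {label}")

print(label_to_winner("B>>A", "gpt-4", "my-model"))
```

A bug here would flip or drop wins, which directly distorts the Bradley-Terry scores downstream.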
Thanks for your great work. Can I request evaluation of new models to add to the leaderboard?
Hi, thanks for such robust work!
We now support the ArenaHard dataset in OpenCompass. OpenCompass is an evaluation platform that can partition tasks and supports different model inference backends, thereby accelerating the model evaluation process.
By combining the strengths of your dataset and OpenCompass, it is now possible to select a model and perform rapid inference and evaluation in one step.
Besides, OpenCompass also supports changing the judge model or setting multiple judge models.
The demo config in OpenCompass is here: https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_arena_hard.py
You're welcome to try OpenCompass, and we can further collaborate to strengthen the LLM evaluation work of the open-source community.
I found the judgment generation to be really time-consuming: evaluating a model costs more than 1.5 hours using gpt-4-1106-preview with a parallel count of 8. Is this expected behavior?
If yes, does gen_judgement support multi-threaded generation across multiple API endpoints to balance the load? For example, with 5 endpoints I could generate 40 judgments at the same time, which should significantly accelerate the evaluation process.
If this is not supported, please tell me how to achieve it; simply adding multiple endpoints for the judge model in api_config.yaml does not seem to work.
Many thanks.
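A sketch of one way to do the load balancing being asked about: round-robin the questions over the endpoints and keep a fixed number of requests in flight per endpoint. Here `judge_one` is a stand-in for the real judging API call, and the endpoint URLs are placeholders:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoint URLs; real ones would come from api_config.yaml.
ENDPOINTS = [f"http://host{i}:8000/v1" for i in range(5)]

def judge_one(endpoint, question_id):
    # Real code would send the judging request to `endpoint` here.
    return (endpoint, question_id)

def judge_all(question_ids, workers_per_endpoint=8):
    """Distribute judgments round-robin over ENDPOINTS, with
    workers_per_endpoint concurrent requests per endpoint."""
    cycle = itertools.cycle(ENDPOINTS)
    jobs = [(next(cycle), qid) for qid in question_ids]
    with ThreadPoolExecutor(
        max_workers=workers_per_endpoint * len(ENDPOINTS)
    ) as pool:
        # map preserves input order, so results line up with question_ids.
        return list(pool.map(lambda job: judge_one(*job), jobs))

results = judge_all(range(10))
```

With 5 endpoints and 8 workers each, this yields the 40 concurrent judgments mentioned above; whether gen_judgement supports this natively is a separate question for the maintainers.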
I recently judged the model answers provided here and decided to switch the GPT version from gpt-4-1106-preview to gpt-4-0125-preview, because I can only access instances of that version.
After making this change, I observed a discrepancy of over 440 points (out of 1000) in the score compared to the judgment benchmarks listed in your documentation.
Could you please advise on how to address this issue, or suggest any solutions that might help mitigate the discrepancy?
Thanks for the great work. The questions are so hard that I couldn't answer any of them myself.
However, I found that a large majority of the questions are coding- and CS-related, which would favor code-oriented models and disadvantage other general-purpose models, despite this benchmark being advertised as a general-purpose MT-Bench replacement.
What do you think?
The prompt is formatted such that the judge is supposed to answer the question itself before judging. If the model is judging itself, then it will compare its own answer with... its own answer, which will be mostly the same. That holds true for models of the same family as well: GPT-4-turbo-preview will have nearly the same answers as GPT-4-turbo, and likewise for the Claude suite.
Naturally, the judge is going to prefer the answer that most resembles its own. In the end, I'm wondering whether this judging setup is sound.
python3.10, gradio 3.40.0
When browsing some questions (I forget which), I encounter this backend gradio error:
ValueError:
\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}
^
ParseFatalException: Unknown symbol: \ket, found '\' (at char 0), (line:1, col:1)
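The ParseFatalException comes from pyparsing, which matplotlib's mathtext parser uses; mathtext does not know the physics-package macro \ket. One possible workaround (an assumption on my part, not a tested gradio fix) is to rewrite \ket{x} as |x\rangle before the string reaches the renderer:

```python
import re

def expand_ket(tex):
    """Rewrite \\ket{x} as |x\\rangle, which mathtext can parse."""
    return re.sub(r"\\ket\{([^{}]*)\}", r"|\1\\rangle", tex)

print(expand_ket(
    r"\ket{\psi} = \frac{\ket{00} + \ket{01} + \ket{10}}{\sqrt{3}}"
))
```

This handles non-nested \ket arguments, which covers the failing expression above.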
The prompt template on GitHub compares two models instead of scoring them.
python3.10, with gradio==4.26.0, gradio_client==0.15.1
Well, first there is a typo in the file name: it should be "browser" instead of "broswer".
Second, gradio is not listed in the requirements, so I had to install it manually to run the QA browser.
After invoking python qa_broswer.py
, there is something abnormal:
Namespace(host='0.0.0.0', port=None, share=False, config_file='config/judge_config.yaml')
<...>/python3.10/site-packages/gradio/components/dropdown.py:179: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: gpt-3.5-turbo-0613 or set allow_custom_value=True.
warnings.warn(
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
When accessing the link,
ERROR: Exception in ASGI application
Traceback (most recent call last):
<skip some not very relevant messages>
File "xxx/python3.10/site-packages/gradio/routes.py", line 797, in process_msg
return f"data: {orjson.dumps(message.model_dump()).decode('utf-8')}\n\n"
TypeError: Type is not JSON serializable: Dropdown
And this is what I observed