allenai / reward-bench Goto Github PK

View Code? Open in Web Editor NEW

303.0 4.0 33.0 26.06 MB

RewardBench: the first evaluation tool for reward models.

Home Page: https://huggingface.co/spaces/allenai/reward-bench

License: Apache License 2.0

Python 99.28% Makefile 0.12% Dockerfile 0.40% Shell 0.21%

preference-learning rlhf

reward-bench's Introduction

RewardBench: Evaluating Reward Models

Leaderboard 📐 | RewardBench Dataset | Existing Test Sets | Results 📊 | Paper📝

RewardBench is a benchmark designed to evaluate the capabilities and safety of reward models (including those trained with Direct Preference Optimization, DPO). The repository includes the following:

Common inference code for a variety of reward models (Starling, PairRM, OpenAssistant, DPO, and more).
Common dataset formatting and tests for fair reward model inference.
Analysis and visualization tools.

The two primary scripts to generate results (more in scripts/):

scripts/run_rm.py: Run evaluations for reward models.
scripts/run_dpo.py: Run evaluations for direct preference optimization (DPO) models (and other models using implicit rewards, such as KTO).
scripts/train_rm.py: A basic RM training script built on TRL.

Quick Usage

RewardBench let's you quickly evaluate any reward model on any preference set. To install for quick usage, install with pip as:

pip install rewardbench

Then, run a following:

rewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8

For a DPO model, pass --ref_model={} and the script will automatically route there. Automatically uses Tokenizers chat templates, but can also use fastchat conv templates.

To run the core Reward Bench evaluation set, run:

rewardbench --model={yourmodel}

Examples:

Normal operation

rewardbench --model=OpenAssistant/reward-model-deberta-v3-large-v2 --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw

DPO model from local dataset (note --load_json)

rewardbench --model=Qwen/Qwen1.5-0.5B-Chat --ref_model=Qwen/Qwen1.5-0.5B --dataset=/net/nfs.cirrascale/allennlp/jacobm/herm/data/berkeley-nectar-binarized-preferences-random-rejected.jsonl --load_json

Experimental: Generative RMs can be run from the pip install by running:

pip install rewardbench[generative]

And then:

rewardbench-gen --model={}

For more information, see scripts/run_generative.py. The extra requirement for local models is VLLM and the requesite API for API models (OpenAI, Anthropic, and Together are supported).

Full Installation

To install from source, please install torch on your system, and then install the following requirements.

pip install -e .

Add the following to your .bashrc:

export HF_TOKEN="{your_token}"

Contribute Your Model

For now, in order to contribute your model to the leaderboard, open an issue with the model name on HuggingFace (you can still evaluate local models with RewardBench, see below). If custom code is needed, please open a PR that enables it in our inference stack (see rewardbench/models for more information).

Evaluating Models

For reference configs, see scripts/configs/eval_configs.yaml. For reference on Chat Templates, many models follow the base / sft model terminology here. A small model for debugging is available at natolambert/gpt2-dummy-rm.

The core scripts automatically evaluate our core evaluation set. To run these on existing preference sets, add the argument --pref_sets.

Running Reward Models

To run individual models with scripts/run_rm.py, use any of the following examples:

python scripts/run_rm.py --model=openbmb/UltraRM-13b --chat_template=openbmb --batch_size=8
python scripts/run_rm.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia
python scripts/run_rm.py --model=PKU-Alignment/beaver-7b-v1.0-cost --chat_template=pku-align --batch_size=16
python scripts/run_rm.py --model=IDEA-CCNL/Ziya-LLaMA-7B-Reward --batch_size=32 --trust_remote_code --chat_template=Ziya

To run these models with AI2 infrastructure, run:

python scripts/submit_eval_jobs.py

Or for example, the best of N sweep on the non-default image:

python scripts/submit_eval_jobs.py --eval_on_bon --image=nathanl/herm_bon

Note: for AI2 users, you must set beaker secret write HF_TOKEN <your_write_token_here> to make the scripts work.

Models using the default abstraction AutoModelForSequenceClassification.from_pretrained can also be loaded locally. Expanding this functionality is TODO. E.g.

python scripts/run_rm.py --model=/net/nfs.cirrascale/allennlp/hamishi/EasyLM/rm_13b_3ep --chat_template=tulu --batch_size=8

Running DPO Models

And for DPO:

python scripts/run_dpo.py --model=stabilityai/stablelm-zephyr-3b --ref_model=stabilityai/stablelm-3b-4e1t --batch_size=8
python scripts/run_dpo.py --model=stabilityai/stablelm-2-zephyr-1_6b --ref_model=stabilityai/stablelm-2-1_6b --batch_size=16

Ensembling RMs

For reward models already in RewardBench, you can run an offline ensemble test to approximate using multiple reward models in your system. To try this, you can run:

python analysis/run_ensemble_offline.py --models sfairXC/FsfairX-LLaMA3-RM-v0.1 openbmb/Eurus-RM-7b Nexusflow/Starling-RM-34B

Running Generative RMs (LLM-as-a-judge)

Local and API models are supported. For example, run OpenAI's models like:

python scripts/run_generative.py --model=gpt-3.5-turbo-0125

Local models are loaded from HuggingFace, though some are also available via Together's API. Run Llama 3 locally with

python scripts/run_generative.py --model=meta-llama/Llama-3-70b-chat-hf --force_local

Or, with Together's API with:

python scripts/run_generative.py --model=meta-llama/Llama-3-70b-chat-hf

We are adding support for generative ensembles (only via API for now), run with:

python scripts/run_generative.py --model gpt-3.5-turbo-0125 claude-3-sonnet-20240229 meta-llama/Llama-3-70b-chat-hf

Note: these must be an odd number of models > 1.

Creating Best of N (BoN) rankings

To create the ranking across the dataset, run (best_of 8 being placeholder, 16 should be fine as eval logic will handle lower best of N numbers):

python scripts/run_bon.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia --best_of=8 --debug

Getting Leaderboard Section Scores

Important: We use prompt-weighed scores for the sections Chat, Chat Hard, Safety, and Reasoning (with math equalized to code here) to avoid assigning too much credit to small subsets (e.g. MT Bench ones). Use the following code to compute the scores for each category, assuming RewardBench is installed:

from rewardbench.constants import EXAMPLE_COUNTS, SUBSET_MAPPING
from rewardbench.utils import calculate_scores_per_section

metrics = {
  "alpacaeval-easy": 0.5,
  "alpacaeval-hard": 0.7052631578947368,
  "alpacaeval-length": 0.5894736842105263,
  "chat_template": "tokenizer",
  "donotanswer": 0.8235294117647058,
  "hep-cpp": 0.6280487804878049,
  "hep-go": 0.6341463414634146,
  "hep-java": 0.7073170731707317,
  "hep-js": 0.6646341463414634,
  "hep-python": 0.5487804878048781,
  "hep-rust": 0.6463414634146342,
  "llmbar-adver-GPTInst": 0.391304347826087,
  "llmbar-adver-GPTOut": 0.46808510638297873,
  "llmbar-adver-manual": 0.3695652173913043,
  "llmbar-adver-neighbor": 0.43283582089552236,
  "llmbar-natural": 0.52,
  "math-prm": 0.2953020134228188,
  "model": "PKU-Alignment/beaver-7b-v1.0-cost",
  "model_type": "Seq. Classifier",
  "mt-bench-easy": 0.5714285714285714,
  "mt-bench-hard": 0.5405405405405406,
  "mt-bench-med": 0.725,
  "refusals-dangerous": 0.97,
  "refusals-offensive": 1,
  "xstest-should-refuse": 1,
  "xstest-should-respond": 0.284
}

# Calculate and print the scores per section
scores_per_section = calculate_scores_per_section(EXAMPLE_COUNTS, SUBSET_MAPPING, metrics)
print(scores_per_section)

Repository structure

├── README.md                   <- The top-level README for researchers using this project
├── analysis/                   <- Directory of tools to analyze RewardBench results or other reward model properties
├── rewardbench/                <- Core utils and modeling files
|   ├── models/                     ├── Standalone files for running existing reward models
|   └── *.py                        └── RewardBench tools and utilities
├── scripts/                    <- Scripts and configs to train and evaluate reward models
├── tests                       <- Unit tests
├── Dockerfile                  <- Build file for reproducible and scaleable research at AI2
├── LICENSE
├── Makefile                    <- Makefile with commands like `make style`
└── setup.py                    <- Makes project pip installable (pip install -e .) so `alignment` can be imported

Maintenance

This section is designed for AI2 usage, but may help others evaluating models with Docker.

Updating the docker image

When updating this repo, the docker image should be rebuilt to include those changes. For AI2 members, please update the list below with any images you use regularly. For example, if you update scripts/run_rm.py and include a new package (or change a package version), you should rebuild the image and verify it still works on known models.

To update the image, run these commands in the root directory of this repo:

docker build -t <local_image_name> . --platform linux/amd64
beaker image create <local_image_name> -n <beaker_image_name>

Notes: Do not use the character - in image names for beaker,

When updating the Dockerfile, make sure to see the instructions at the top to update the base cuda version.

In development, we have the following docker images (most recent first as it's likely what you need). TODO: Update it so one image has VLLM (for generative RM only) and one without. Without will load much faster.

nathanl/rb_v17 (with VLLM): add support for vllm + llm as a judge, rb_v16 is similar without prometheus and some OpenAI models
nathanl/rb_v12: add support for llama3
nathanl/rewardbench_v10: add support for mightbe/Better-PairRM via jinja2
nathanl/rewardbench_v8: add support for openbmb/Eurus-RM-7b and starcoder2
nathanl/rewardbench_v5: improve saving with DPO script
nathanl/rewardbench_v4: fix EOS token bug on FastChat models (GH #90)
nathanl/rewardbench_v2: fix beaver cost model
nathanl/rewardbench_v1: release version

Citation

Please cite our work with the following:

@misc{lambert2024rewardbench,
      title={RewardBench: Evaluating Reward Models for Language Modeling}, 
      author={Nathan Lambert and Valentina Pyatkin and Jacob Morrison and LJ Miranda and Bill Yuchen Lin and Khyathi Chandu and Nouha Dziri and Sachin Kumar and Tom Zick and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
      year={2024},
      eprint={2403.13787},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

reward-bench's People

Contributors

Stargazers

Watchers

reward-bench's Issues

Visualization requests

Some things to add:

Pareto distribution of any Section or Subset

Comment anything else (or just watch my notes)

Improve per-token reward tool

Todos:

Right way to store data for multiple models on the same prompt
Ability to handle chat template
Ability to randomly sample from known datasets (e.g. alpacaeval)
Way to visualize one or multiple models together

Pref Sets updates

Add id column
Remove summarize prompted and reflect that in the leaderboard

Add a new mistral RM model

Thank you for your work! Can you please test my RM hendrydong/Mistral-RM-for-RAFT-GSHF-v0 in the leaderboard?

My local results are as below:

{"model": "hendrydong/Mistral-RM-for-RAFT-GSHF-v0", "model_type": "Seq. Classifier", "chat_template": "tokenizer", "alpacaeval-easy": 0.99, "alpacaeval-hard": 1.0, "alpacaeval-length": 0.9473684210526315, "donotanswer": 0.6470588235294118, "hep-cpp": 0.9390243902439024, "hep-go": 0.9573170731707317, "hep-java": 0.9695121951219512, "hep-js": 0.9390243902439024, "hep-python": 0.9451219512195121, "hep-rust": 0.9329268292682927, "llmbar-adver-GPTInst": 0.3804347826086957, "llmbar-adver-GPTOut": 0.5957446808510638, "llmbar-adver-manual": 0.34782608695652173, "llmbar-adver-neighbor": 0.4701492537313433, "llmbar-natural": 0.9, "math-prm": 0.5503355704697986, "mt-bench-easy": 1.0, "mt-bench-hard": 0.7837837837837838, "mt-bench-med": 0.975, "refusals-dangerous": 0.75, "refusals-offensive": 0.96, "xstest-should-refuse": 0.9805194805194806, "xstest-should-respond": 0.888}

Set default chat template to None

and subsiquent logic. Much more explict.

New LLaMA-3 Seq. Classfier Model

Hi, congrats to your impactful work again.

We found that the LLaMA-3 model also performs well as the Seq. Classfier. Can you please include our latest LLaMA-3-RM into the reward-bench? We uploaded it at sfairXC/FsfairX-LLaMA3-RM-v0.1.

Thanks in advance!

Add Nvidia RMs (and Nemo compatibility)

See below!
https://huggingface.co/models?library=nemo&search=RM

This involves converting to HF format or adding nemo compatibility, if anyone has time!

Add generative models to pip install (probably with optional dependencies)

Set up OpenRouter for llm-as-a-judge

Need to get legal approval to use Gemini API, but try this https://openrouter.ai/docs#quick-start

stanfordnlp/SteamSHP-flan-t5 performance on SHP and HH-RLHF Helpful

Hi, thanks for this great work, its really interessting and helpful!

I was a bit surprised by the stanfordnlp/SteamSHP-flan-t5-xl and stanfordnlp/SteamSHP-flan-t5-large performance on the SHP dataset in Table 12, because their self reported accuracy is 0.7278 and 0.7203 respectively. Do you know the reason for this difference?

(AFAIK, their reported average also includes the performance on HH-RLHF helpful-base, but I dont think that should drag the performance down that much?)

Vice versa, the HH-RLHF helpful scores in Table 12 are much lower than the reported ones on huggingface (0.731 vs 0.633 and 0.731 vs 0.629).

Is eval set on huggingface the eval set or train set?

Hi @natolambert et al,

We are reading the paper and the 2.98K filtered dataset at huggingface.

https://huggingface.co/datasets/allenai/reward-bench

I am curious if the huggingface 2.98K filtered data is the actual evaluation data used to evaluate on the leaderboard?

Cause I looked into the code and saw this line in utils.py.

CORE_EVAL_SET = "ai2-adapt-dev/rm-benchmark-dev"
EXTRA_PREF_SETS = "allenai/pref-test-sets"

When I went to ai2-adapt-dev, I saw that it is a private dataset.

Asking cause we're hoping to know if we can/should train on the huggingface dataset for our reward model to fairly compare on the leaderboard.

Thanks!

Experiment with human vs gpt4 data

With the human data AI2 has or a dataset like no_robots, we could test if a RM prefers the human or model answers to a completion.

[Model Request] mightbe/Better-PairRM

https://huggingface.co/mightbe/Better-PairRM

I made few fixes on PairRM's dataset filter process and truncate etc..

As a result I got at least 15% performance.

Code and prompt template on huggingface repo.

Thank you.

Check EOS token on FastChat models

TLDR:

HF tokenizer files add EOS token at the end of instructions https://huggingface.co/HuggingFaceH4/zephyr-7b-beta/blob/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/tokenizer_config.json#L34
DPO code manually adds EOS token https://github.com/huggingface/trl/blob/6c2f829bb7408660b0e581cde56fbff0980b9d7b/trl/trainer/dpo_trainer.py#L668

Seems like FastChat models are not getting an EOS token. Could effect the non-DPO models that are using FastChat chat templates minorly

`pad_token_id` issue

Hi,

I encountered some problem about pad_token_id.

I trained a TinyLlama reward model by modifying TRL sample code. I want to use this benchmark for evaluation. I add the following code into REWARD_MODEL_CONFIG:

"TinyLlama/TinyLlama-1.1B-Chat-v0.5": {
        "model_builder": AutoModelForSequenceClassification.from_pretrained,
        "pipeline_builder": pipeline,
        "quantized": False,
        "custom_dialogue": False,
        "model_type": "Seq. Classifier",
    },
}

And I run the evaluation by python scripts/run_rm.py --model=TinyLlama/TinyLlama-1.1B-Chat-v0.5 --chat_template=TinyLlama --do_not_save.

The error message looks like this:

Traceback (most recent call last):
  File "/work/hank0316/reward-bench/scripts/run_rm.py", line 337, in <module>
    main()
  File "/work/hank0316/reward-bench/scripts/run_rm.py", line 200, in main
    results_rej = reward_pipe(dataset["text_rejected"], **reward_pipeline_kwargs)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/text_classification.py", line 156, in __call__
    result = super().__call__(*inputs, **kwargs)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1198, in __call__
    outputs = list(final_iterator)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1123, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/pipelines/text_classification.py", line 187, in _forward
    return self.model(**model_inputs)
  File "/home/hank0316/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/hank0316/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hank0316/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1401, in forward
    raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
ValueError: Cannot handle batch sizes > 1 if no padding token is defined.

I think this error message means that there's no pad_token_id in the model config. The pad_token_id exists in the tokenizer. As a result, I add two line of code under line 185 of scripts/run_rm.py:

    elif reward_pipe.model.config.pad_token_id is None:
        reward_pipe.model.config.pad_token_id = reward_pipe.tokenizer.pad_token_id

I would appreciate it if someone could review this modification to confirm that it correctly addresses the issue. Additionally, if there are any concerns or alternative approaches to consider, please let me know.

Thank you for your attention to this matter.

Best regards,
Hank

Add `rewardbench` on pypi + basic release management

adding kto as a separate category

can KTO be added as a separate model type on the leaderboard?

Check Qwen model

Results are quite odd, initial support added in #47 following tokenizer documentation and FAQ of the model.

Results are worse than expected.

Experiment request: DPO with different betas

TLDR: How much does beta impact a DPO model accuracy by controlling KL distance?

Add PoLL for generative RM

See this paper: essentially ensemble LLM as a judge https://arxiv.org/abs/2404.18796

Fix score saving PairRM and SteamSHP

The logic for saving per-prompt scores is breaking these two models, see the following beaker logs.
https://beaker.org/ex/01HQ6VG7H3XPYRVP3S76ZXB126
https://beaker.org/ex/01HQ6VG7GMRN9TNWNZH2WTMKG0

Clarification Needed on DPO Reward Evaluation

Thank you for providing such a valuable benchmark.

I am seeking clarification on the model/reference specifications for DPO rewards, which are not readily apparent in either the paper or the leaderboard. For example, it is unclear whether models like Llama-3-8B-Instruct, Qwen, and Zephyr were evaluated without or with references. If references were used, could you please provide guidance on how to access the reference model?

Thank you for your assistance.

adding Archangel models (dpo, kto, sft+dpo, sft+kto)

The Archangel suite of models contain DPO, SFT+DPO, KTO, SFT+KTO models which can also be used as reward models: https://huggingface.co/collections/ContextualAI/archangel-65bd45029fa020161b052430

For each method, there are seven models available: pythia-{1.4, 2.8, 6.9, 12.0}B and llama-{7, 13, 30}B, all of which have been aligned under nearly identical settings on {Anthropic HH, Open Assistant, SHP 1.0} data.

The implied reward for both DPO- and KTO-aligned models is $\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$, where $\pi_\text{ref}$ is the reference model

The reference model for each set of models in Archangel is as follows:

for the SFT+DPO model ContextualAI/archangel_sft-dpo_{model}, the reference is ContextualAI/archangel_sft_{model}
for the SFT+KTO model ContextualAI/archangel_sft-kto_{model}, the reference is ContextualAI/archangel_sft_{model}
for the DPO model w/o SFT ContextualAI/archangel_dpo_llama7b, the reference is huggyllama/llama-7b, which can be found in the _name_or_path field in config.json
for the KTO model w/o SFT ContextualAI/archangel_kto_llama7b, the reference is huggyllama/llama-7b, which can be found in the _name_or_path field in config.json

Add new model Mistral-7B-instruct-Unified-Feedback

Thank you for your great work! Could you add my reward model on the leader board? I have tested it locally.

Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback

Clean up / enhance DPO code

Make it so you can run inference over individual text prompts (rather than chosen + rejected)
Clean up nograd/detach (see https://twitter.com/shxf0072/status/1771220126655811610), but should be pretty obvious
Add per-model log prob computation.

Support Nous Mixtral

Trust remote tokenizer added in #50, but still having unclear issues hanging on model loading

DATASET TRACKING

A quick place to list datasets we need to add still.

Existing Test Sets

Completed:

New Eval Tools

Science Data from David Wadden

Completed:

HumanEvalPack
Many others earlier in process

(this does not yet include filtering)

Check beaver cost model

Quoting an author (I think):

Great work, this is an long due effort in this field. Though it's a bit unexpected to see beaver-cost model performed poorly on safety-related dataset. Have you checked if you have got the signs worked out? Because in our setting negative reward means safer and should be chosen.

Truncation of long sequences

We should double check it's handled as left truncation, similar to

reward_tokenizer.truncation_side = "left"

Output leaderboard scores when running `run_rm.py`

e.g. integrate this https://github.com/allenai/reward-bench?tab=readme-ov-file#getting-leaderboard-section-scores

Dataset v2 discussion & feedback

Hey! Post any questions or complaints on the dataset. We'll log our internal goals and limitations here too.

It was pointed out by Rishabh Agarwal that the PRM Math subset has two structural issues. 1) we added newlines to the human reference answers (debatably could be called a bug). 2) with GPT4 always as rejected, some models may be biased there.

Add new model weqweasdas/RM-Mistral-7B

Hi! Congratulations on your amazing work!

I am wondering whether it is convenient for you to add this reward model in your leaderboard:

https://huggingface.co/weqweasdas/RM-Mistral-7B

many thanks in advance!

Rename Starling 34B

Change pointer of model from to berkeley-nest to Nexusflow/Starling-RM-34B

multi gpu inference with run_rm.py

Hello Nathan,

Thank you for this valuable resource! I strongly think that we needed more standardized benchmarks to evaluate reward/evaluator models.

I think submit_eval_jobs.py (using AI2's beaker) supports multi gpu inference but run_rm.py doesn't at the moment.
I was wondering if this intended (correct me if I'm wrong)!

Best,
Seungone

Save reward scores for each prompt

Add this feature for future research!

Multiple styles of computing reward with DPO

Currently matches the paper, but we should add the ability to normalize by length:

Divide by length of response (chosen or rejected).
Take a norm-style approach which is a length weighted average.

Do we need to add system prompt when training/evaluating RM?

Hi,

I want to know if we should add a system prompt when training and evaluating our reward models. For instance, I have supervised fine-tuned LLaMA2-7B on the Alpaca SFT 10k subset, and I use the same chat template for the tokenizer as the one in LLaMA2-7B-Chat. The system prompt used in Meta's paper is shown in the screenshot below, which is included manually during my SFT procedure.

My understanding is that since we add a system prompt when collecting responses from LLaMA2 during RLHF, the system prompt becomes part of the overall prompt seen by LLaMA2. Therefore, it seems logical to include the system prompt in the query sent to the reward model to better assess the quality of the generated responses. However, if we apply the tokenizer's chat template, as is the default behavior in Reward-Bench, the system prompt might not be included.

Could you please clarify if the system prompt should be added during the training and evaluation of reward models?

Thank you!

[Core team] Migrate Prior Sets to 50% weight

Fixes:

Leaderboard loading / display
Repo loading / table printing (for paper)
Replace all column names "Average" with "Score"
Update paper

cc @ljvmiranda921

Saving bug (non breaking)

We don't use our own sub_path correctly in saving results. It works, but is confusing.
See:

Tbh i'm surprised the code still works as expected (lol)

Generative RM

To use models like GPT4 and others as a baseline, we need a script that generates a response to which is better.
I'm not sure if we want to include this yet.

An example model is Auto J.

Even with temperature = 0, there are lots of ways for this to seem unnecessary and non-deterministic (unless trained with DPO).

Best of N benchmark

Take a few chat models as the “base set”, say 1-3, like tulu 2 7b and tulu 2 13b (maybe olmo-instruct)
Generate ~8 completions per prompt in AlpacaEval (this is the heldout set)
Use each RM to choose the best-of-1 from that set, then run alpaca eval on the outputs
Score the delta for each RM in the batch on a set task (alpacaeval) and set base model (tulu)
Could do this with MTBench, but two turn is harder

Obvi flaws, but that seems WAY better than nothing.