
simpo's Introduction

Simple Preference Optimization (SimPO)

This repository contains the code and released models for our paper SimPO: Simple Preference Optimization with a Reference-Free Reward. We propose a simpler and more effective preference optimization algorithm than DPO (Direct Preference Optimization) without using a reference model. SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings.
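At a glance, SimPO uses the length-normalized average log probability of a response as an implicit, reference-free reward and trains with a target reward margin. A sketch of the objective, following the notation in the paper (β is the reward scale, γ the target margin):

% SimPO implicit reward: length-normalized log-probability, no reference model
\[
  r_{\mathrm{SimPO}}(x, y) \;=\; \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
\]
% SimPO loss on preference pairs (y_w preferred over y_l), with target margin gamma
\[
  \mathcal{L}_{\mathrm{SimPO}} \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
  \;-\; \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) \;-\; \gamma \right) \right]
\]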

🔗 Quick Links

Released Models

Below is the full list of models that we evaluate in our preprint.

AE2 LC and AE2 WR denote the AlpacaEval 2 length-controlled and raw win rates (%), and AH denotes the Arena-Hard win rate (%).

| Model | Checkpoint | AE2 LC | AE2 WR | AH |
|---|---|---|---|---|
| Mistral Base 7B SFT | alignment-handbook/zephyr-7b-sft-full | 8.4 | 6.2 | 1.3 |
| Mistral Base 7B DPO (Zephyr) | princeton-nlp/Mistral-7B-Base-SFT-DPO | 15.1 | 12.5 | 10.4 |
| Mistral Base 7B IPO | princeton-nlp/Mistral-7B-Base-SFT-IPO | 11.8 | 9.4 | 7.5 |
| Mistral Base 7B KTO | princeton-nlp/Mistral-7B-Base-SFT-KTO | 13.1 | 9.1 | 5.6 |
| Mistral Base 7B ORPO | kaist-ai/mistral-orpo-beta | 14.7 | 12.2 | 7.0 |
| Mistral Base 7B R-DPO | princeton-nlp/Mistral-7B-Base-SFT-RDPO | 17.4 | 12.8 | 9.9 |
| Mistral Base 7B SimPO | princeton-nlp/Mistral-7B-Base-SFT-SimPO | 21.4 | 20.8 | 16.6 |
| Mistral Instruct 7B SFT | mistralai/Mistral-7B-Instruct-v0.2 | 17.1 | 14.7 | 12.6 |
| Mistral Instruct 7B DPO | princeton-nlp/Mistral-7B-Instruct-DPO | 26.8 | 24.9 | 16.3 |
| Mistral Instruct 7B IPO | princeton-nlp/Mistral-7B-Instruct-IPO | 20.3 | 20.3 | 16.2 |
| Mistral Instruct 7B KTO | princeton-nlp/Mistral-7B-Instruct-KTO | 24.5 | 23.6 | 17.9 |
| Mistral Instruct 7B ORPO | princeton-nlp/Mistral-7B-Instruct-ORPO | 24.5 | 24.9 | 20.8 |
| Mistral Instruct 7B R-DPO | princeton-nlp/Mistral-7B-Instruct-RDPO | 27.3 | 24.5 | 16.1 |
| Mistral Instruct 7B SimPO | princeton-nlp/Mistral-7B-Instruct-SimPO | 32.1 | 34.8 | 21.0 |
| Llama3 Base 8B SFT | princeton-nlp/Llama-3-Base-8B-SFT | 6.2 | 4.6 | 3.3 |
| Llama3 Base 8B DPO | princeton-nlp/Llama-3-Base-8B-SFT-DPO | 18.2 | 15.5 | 15.9 |
| Llama3 Base 8B IPO | princeton-nlp/Llama-3-Base-8B-SFT-IPO | 14.4 | 14.2 | 17.8 |
| Llama3 Base 8B KTO | princeton-nlp/Llama-3-Base-8B-SFT-KTO | 14.2 | 12.4 | 12.5 |
| Llama3 Base 8B ORPO | princeton-nlp/Llama-3-Base-8B-SFT-ORPO | 12.2 | 10.6 | 10.8 |
| Llama3 Base 8B R-DPO | princeton-nlp/Llama-3-Base-8B-SFT-RDPO | 17.6 | 14.4 | 17.2 |
| Llama3 Base 8B SimPO | princeton-nlp/Llama-3-Base-8B-SFT-SimPO | 22.0 | 20.3 | 23.4 |
| Llama3 Instruct 8B SFT | meta-llama/Meta-Llama-3-8B-Instruct | 26.0 | 25.3 | 22.3 |
| Llama3 Instruct 8B DPO | princeton-nlp/Llama-3-Instruct-8B-DPO | 40.3 | 37.9 | 32.6 |
| Llama3 Instruct 8B IPO | princeton-nlp/Llama-3-Instruct-8B-IPO | 35.6 | 35.6 | 30.5 |
| Llama3 Instruct 8B KTO | princeton-nlp/Llama-3-Instruct-8B-KTO | 33.1 | 31.8 | 26.4 |
| Llama3 Instruct 8B ORPO | princeton-nlp/Llama-3-Instruct-8B-ORPO | 28.5 | 27.4 | 25.8 |
| Llama3 Instruct 8B R-DPO | princeton-nlp/Llama-3-Instruct-8B-RDPO | 41.1 | 37.8 | 33.1 |
| Llama3 Instruct 8B SimPO | princeton-nlp/Llama-3-Instruct-8B-SimPO | 44.7 | 40.5 | 33.8 |

Please refer to the generate.py script for detailed instructions on loading the model with the appropriate chat template.
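As a quick start, here is a minimal sketch of loading one of the released checkpoints with its chat template; the model name and sampling settings are illustrative only, and generate.py remains the reference for the exact decoding setup used in our evaluations:

# Minimal sketch (assumes transformers is installed and a GPU is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Llama-3-Instruct-8B-SimPO"  # any checkpoint from the table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is preference optimization?"}]
# The tokenizer's chat template inserts the correct special tokens for this checkpoint.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))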

Install Requirements

Our codebase is built upon the alignment-handbook repo. The following steps will guide you through the installation process.

First, create a Python virtual environment using e.g. Conda:

conda create -n handbook python=3.10 && conda activate handbook

Next, install PyTorch v2.2.2. Since this is hardware-dependent, we direct you to the PyTorch Installation Page.

You can then install the remaining package dependencies of alignment-handbook as follows:

git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .

You will also need Flash Attention 2 installed, which can be done by running:

python -m pip install flash-attn --no-build-isolation

Training Scripts

We provide four training config files for the four training setups reported in our paper. The training config is set for 8xH100 GPUs. You may need to adjust num_processes and per_device_train_batch_size based on your computation environment.

  • Mistral-Base:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
  • Mistral-Instruct:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-instruct-simpo.yaml
  • Llama3-Base:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-base-simpo.yaml
  • Llama3-Instruct:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-instruct-simpo.yaml

Evaluation

We follow the official implementations for evaluation on AlpacaEval 2, Arena-Hard, and MT-Bench; more details can be found under the eval directory.

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Yu ([email protected]). If you encounter any problems when using the code or want to report a bug, feel free to open an issue! Please describe the problem in detail so we can help you better and more quickly.

Citation

Please cite our paper if you find the repo helpful in your work:

@article{meng2024simpo,
  title={{SimPO}: Simple Preference Optimization with a Reference-Free Reward},
  author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
  year={2024}
}

simpo's People

Contributors

cameron-chen, crispstrobe, xiamengzhou, yumeng5


simpo's Issues

On Length Normalization in the Code

Congratulations on creating such elegant and effective work! Could you please indicate where the length normalization of the response log probability, as mentioned in the paper, is implemented in the code? I have checked simpo_trainer but couldn't find this part of the code. Thank you for your response.

def simpo_loss(
    self,
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """Compute the SimPO loss for a batch of policy model log probabilities.

    Args:
        policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)

    Returns:
        A tuple of three tensors: (losses, chosen_rewards, rejected_rewards).
        The losses tensor contains the SimPO loss for each example in the batch.
        The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    gamma_logratios = self.gamma / self.beta
    pi_logratios = pi_logratios.to(self.accelerator.device)
    logits = pi_logratios - gamma_logratios

    if self.loss_type == "sigmoid":
        losses = (
            -F.logsigmoid(self.beta * logits) * (1 - self.label_smoothing)
            - F.logsigmoid(-self.beta * logits) * self.label_smoothing
        )
    elif self.loss_type == "hinge":
        losses = torch.relu(1 - self.beta * logits)
    else:
        raise ValueError(
            f"Unknown loss type: {self.loss_type}. Should be one of ['sigmoid', 'hinge']"
        )

    chosen_rewards = self.beta * policy_chosen_logps.to(self.accelerator.device).detach()
    rejected_rewards = self.beta * policy_rejected_logps.to(self.accelerator.device).detach()

    return losses, chosen_rewards, rejected_rewards
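For context, the length normalization does not live inside simpo_loss itself: the log probabilities passed in are already averaged over response tokens upstream. A minimal sketch of that averaging, assuming it mirrors the average_log_prob=True path of a get_batch_logps-style helper (names here are illustrative, not the repo's exact code):

# Sketch of per-token averaging over the response tokens (the length normalization).
import torch

def batch_logps(logits: torch.Tensor, labels: torch.Tensor, average_log_prob: bool = True) -> torch.Tensor:
    """Sum (or average) the label-token log-probabilities, ignoring masked positions."""
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]
    loss_mask = labels != -100                 # -100 marks prompt/padding tokens
    labels[labels == -100] = 0                 # dummy index so gather stays valid
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    if average_log_prob:
        # Length normalization: divide the summed log-prob by the number of response tokens.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    return (per_token_logps * loss_mask).sum(-1)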

Why is the AlpacaEval 2 score of meta-llama/Meta-Llama-3-8B-Instruct in the paper higher than that on the leaderboard?

Hi,

as described in the title, this is a bit strange. The score I obtained locally for meta-llama/Meta-Llama-3-8B-Instruct is an LC win rate / raw win rate of 22.87% / 22.77%, which is consistent with the officially verified 22.9% / 22.6% on the leaderboard. However, in the paper and in the README, the reported score is 26% / 25%, around 3-4% higher.

Meanwhile, I have also tested other baselines released in this repo, such as DPO and KTO, and their scores are consistent with the paper. This should rule out minor differences, such as decoding temperature, as the cause of the inconsistency for Meta-Llama-3-8B-Instruct.

Interestingly, the score for my locally trained DPO is also around 4% off from the released DPO checkpoint.

Did you use a different checkpoint from the currently available version of meta-llama/Meta-Llama-3-8B-Instruct as the reference policy in your experiments?

I am currently writing a paper and hope to cite your work as a baseline.
So, I would really appreciate it if you can help me out on this issue. Thank you in advance!

Question about the length-normalization

Hi, thank you for the interesting work.

I have a question regarding the length-normalization in the “scripts/simpo_trainer.py” file. It appears that the key difference in the concatenated_forward function between simpo_trainer and dpo_trainer is that simpo_trainer uses average_log_prob. I assume that setting average_log_prob=True averages over loss_mask.sum(-1), contributing to the length-normalization in simpo. Does this interpretation make sense?

Thanks!

Your comparison is unfair because it uses different chat templates

In your checkpoint: princeton-nlp/Mistral-7B-Base-SFT-SimPO, the chat_template is:

"chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",

However, the base model alignment-handbook/zephyr-7b-sft-full has a different template:

"chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}",

Why did you change the chat_template when you fine-tuned the alignment-handbook/zephyr-7b-sft-full?

Based on my experiments, the training chat template largely affects performance on AlpacaEval 2 (the most important result in your paper).
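For anyone comparing, a quick way to inspect the template each checkpoint actually ships with is to read the tokenizer's chat_template attribute; a minimal sketch:

# Print and compare the chat templates of the SFT base and the SimPO checkpoint.
from transformers import AutoTokenizer

for name in ["alignment-handbook/zephyr-7b-sft-full", "princeton-nlp/Mistral-7B-Base-SFT-SimPO"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print(tok.chat_template)
    print()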

Length normalization in DPO and other variants

Given how crucial length normalization seems to be to the success of SimPO, and how easy it is to change in these implementations, did you experiment with length-normalized DPO, or with using length normalization in any of the other DPO variants? (It's interesting to note, as you do in the paper, that ODPO also uses length normalization.)

There has been some discussion of this here (eric-mitchell/direct-preference-optimization#48) and here (eric-mitchell/direct-preference-optimization#40).

I'm trying to investigate this myself, just curious if it has already been tried. Thank you.

Originally posted by @yakazimir in #4 (comment)

Unable to reproduce the results of DPO

Hi,

Thank you for your work.

Could you please tell me the hyperparameters for Mistral Base 7B DPO (Zephyr), i.e., vanilla DPO?

I tried beta=0.01 and beta=0.1 but cannot reproduce the performance. For instance,

it only achieved 0.45 for TruthfulQA, whereas Mistral Base 7B DPO achieves 0.53.

Thank you very much.

Training leads to model collapse

Hi, interesting work! I immediately trained and tested it on my downstream tasks. However, the generated results showed that the model collapsed, producing a lot of repetition similar to "Hi Hi Hi Hi Hi". Upon checking the training log, I found that although the margin and accuracy increased, both the chosen reward and the rejected reward decreased rapidly.

To rule out the influence of my private dataset and task format, I trained on a small subset of the public dataset hiyouga/DPO-En-Zh-20k, and the training dynamics were similar: both rewards decreased rapidly. Have you encountered similar issues, and how can I resolve them?

Training Argument:

batch 128
max_step 1000
simpo_gamma 0.5
beta 2.5
optim paged_adamw_32bit

(Training curves attached: Reward/Reject, Reward/Accept, Reward/Margin, Reward/Accuracy, and Training Loss.)

Confusing Code logic

Could anyone explain why the parent class DPOTrainer is initialized twice in SimPOTrainer's constructor (see the attached screenshot)?

Question about apply_chat_template to prompt

In this repo, I noticed that the chat template is applied to the input prompt:

example["text_prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)

However, in your PR to trl, it seems that chat_template was not used for the prompt.
https://github.com/huggingface/trl/blob/223ce737d651a78a9b54ee1a4472fc4e4eb61760/examples/scripts/simpo.py#L95
And I have found this to be common in other examples in trl.
I am not sure what the difference is between these two practices and what the motivation is for each. Could you kindly tell me?
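To make the two practices concrete, here is a small illustrative sketch (the example messages and checkpoint are placeholders, not the repo's exact preprocessing) showing what the prompt looks like with and without the chat template applied:

# Compare a raw prompt string with a chat-templated prompt (illustrative example only).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-Instruct-8B-SimPO")
prompt_messages = [{"role": "user", "content": "Summarize the SimPO paper in one sentence."}]

raw_prompt = prompt_messages[0]["content"]                                     # practice in the trl example script
templated_prompt = tok.apply_chat_template(prompt_messages, tokenize=False)    # practice in this repo
print(repr(raw_prompt))
print(repr(templated_prompt))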

Mismatch of results

Hi, I use the following command to run the code

CUDA_VISIBLE_DEVICES=3,5,6 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run.py training_configs/mistral-7b-base-simpo.yaml

I found that it occupies 70GB per A100 card, even when the batch size is set to 1.

Could you help me with this? Why does it occupy so much GPU memory even with deepspeed_zero3?

Question about RRHF

Hi,

Very insightful work!
I have a question about the relation between RRHF [1] and SimPO.
Could you please give a brief explanation?

Thanks!

[1] RRHF: Rank responses to align language models with human feedback. In NeurIPS, 2023.

Usage on Custom Dataset

Please guide me on how to run SimPO on a custom dataset containing both positive and negative examples.
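For reference, here is a minimal sketch of the preference format the preprocessing expects, assuming the usual prompt/chosen/rejected layout with chat-style message lists (column names follow the UltraFeedback-style convention used by alignment-handbook recipes; double-check against the preprocessing in run_simpo.py):

# Build a tiny preference dataset with prompt / chosen / rejected columns.
from datasets import Dataset

examples = {
    "prompt": ["What is the capital of France?"],
    "chosen": [[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},      # preferred (positive) response
    ]],
    "rejected": [[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "France does not have a capital."},      # dispreferred (negative) response
    ]],
}
ds = Dataset.from_dict(examples)
ds.push_to_hub("<your-username>/my-preference-dataset")  # placeholder name; point the YAML config's dataset mixer at it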

About the version of alignment-handbook

Hi,

Thanks for your great work!

I am trying to run the code and reproduce the results. I found that the latest version of alignment-handbook may not match the current code, since I encounter this error:

ImportError: cannot import name 'maybe_insert_system_message' from 'alignment'

accelerator.prepare() CUDA out of memory

Hi, I run the code with ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-instruct-simpo.yaml using 1 A100-80G.

accelerator.prepare() raises a CUDA out-of-memory error (screenshot attached).

I encountered the same problem on 2 A800-80G.

Any ideas on how to resolve this?
Thanks

ValueError: Unknown split "train". Should be one of ['train_iteration_1', 'test_iteration_1', 'train_iteration_2', 'test_iteration_2',

The training script I used was:
HF_ENDPOINT=https://hf-mirror.com ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-instruct-simpo.yaml

Then the following bug occurred:

File "/opt/conda/envs/simpo/lib/python3.10/site-packages/datasets/load.py", line 2621, in load_dataset
  return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
File "/opt/conda/envs/simpo/lib/python3.10/site-packages/datasets/arrow_reader.py", line 480, in _rel_to_abs_instr
  **raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "train". Should be one of ['train_iteration_1', 'test_iteration_1', 'train_iteration_2', 'test_iteration_2', 'train_iteration_3', 'test_iteration_3'].**
  ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
File "/opt/conda/envs/simpo/lib/python3.10/site-packages/datasets/builder.py", line 1266, in as_dataset
  datasets = map_nested(
image
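One possible workaround (hedged, since it depends on which dataset the config points to): the dataset on the Hub only exposes the iteration splits listed in the error, so the split has to be named explicitly instead of "train" (e.g. via the split-related fields in the YAML config, or directly when loading). A minimal sketch:

# Load an explicitly named split instead of the missing "train" split.
# The dataset name below is a placeholder for whatever your YAML config references.
from datasets import load_dataset

ds = load_dataset("your-org/your-preference-dataset", split="train_iteration_1")
print(ds)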

TRL version should be 0.8.6

Hello,

Thanks for your great work! As TRL has released new versions, arguments like ref_model_init_kwargs no longer match.

I backtracked from the time of your release, so I think trl should be 0.8.6. It would be great to add a note or a requirements.txt for newcomers.

Best,
Blake
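Until such a note lands, pinning the version inferred in the report above is a reasonable stopgap (0.8.6 is the reporter's guess, not an officially published requirement):

python -m pip install trl==0.8.6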

Unable to reproduce the mt bench results in the paper

When I downloaded the checkpoint you published on Hugging Face, I used the following commands to test MT-Bench. The 7.04 I obtained with GPT-4 evaluation is not consistent with the 7.3 in your article. How can I replicate the MT-Bench result?

model: https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SimPO
testset: MT-bench (https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/)

python3 gen_model_answer.py --model-path llm_models/Mistral-7B-Base-SFT-SimPO  --model-id Mistral-7B-Base-SFT-SimPO
python3 gen_judgment.py --model-list Mistral-7B-Base-SFT-SimPO --parallel 10

Upstream `SimPOTrainer` to TRL

Congrats on the fantastic work! Thank you for releasing the code and models. Would you be interested in contributing your SimPOTrainer to trl? This could make community adoption easier.

Length normalization

Thanks for your work and source code :)
I am a little confused. I didn't see a length-normalization parameter. Did I miss something? Or did you find during implementation that it may be better without length normalization?

If length normalization is not performed, can this be understood as DPO with the reference model's log-probability difference between the better and worse responses replaced by a fixed margin?
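For reference, a side-by-side of the two objectives' logits (following the standard definitions in the DPO and SimPO papers) that makes the "fixed margin" reading concrete:

% DPO: the chosen-vs-rejected gap is offset by the reference model's own log-ratio, which varies per example.
\[
  u_{\mathrm{DPO}} = \beta\Big[\big(\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)\big)
  - \big(\log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\big)\Big]
\]
% SimPO without length normalization: the reference term becomes a fixed target margin gamma.
\[
  u_{\mathrm{SimPO,\ no\text{-}LN}} = \beta\big(\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)\big) - \gamma
\]
% With length normalization (the actual SimPO), each log-probability is additionally divided by the response length |y|.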

Repeated Addition of Assistant Turn in Prompt/Chosen/Rejected Text Using `apply_chat_template`

When using the apply_chat_template function in alignment-handbook and the author's code, the assistant's turn is repeatedly added to the prompt, chosen, and rejected texts. This behavior might be causing discrepancies between SFT and DPO/SimPO, potentially influencing performance outcomes.

Code Snippet:

example["text_prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)

Observed Behavior:

The logs indicate that each section (prompt, chosen, rejected) includes an 'assistant' turn, which seems to be added automatically by the apply_chat_template function. This could lead to an accumulation of assistant responses, affecting the structure of the training data used for models.

Example Log Output:

2024-05-27 11:45:50 - INFO - __main__ - Prompt sample 3278 of the raw training set:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

A middle aged man presents with progressive atrophy and weakness of hands and forearms. On examination he is found to have slight spasticity of the legs, generalized hyper-reflexia and increased signal in the cortico-spinal tracts on T2 weighted MRI. The most likely diagnosis is -

A. Multiple selerosis
B. Amyotrophic latetal sclerosis
C. Subacute combined degeneration
D. Progressive spinal muscular atrophy<|eot_id|><|start_header_id|>assistant<|end_header_id|>


2024-05-27 11:45:50 - INFO - __main__ - Chosen sample 3278 of the raw training set:

<|start_header_id|>assistant<|end_header_id|>

The patient presents with progressive atrophy and weakness of hands and forearms, which suggests a motor neuron disease. The presence of slight spasticity of the legs and generalized hyper-reflexia indicates involvement of the upper motor neurons. The increased signal in the cortico-spinal tracts on T2 weighted MRI also supports this finding.

Given these features, the most likely diagnosis is Amyotrophic Lateral Sclerosis (ALS), which is a disease that affects both upper and lower motor neurons.

So, the answer is B.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


2024-05-27 11:45:50 - INFO - __main__ - Rejected sample 3278 of the raw training set:

<|start_header_id|>assistant<|end_header_id|>

Let's break down the clues:

* Progressive atrophy and weakness of hands and forearms: This suggests a peripheral nerve disorder, possibly affecting the motor neurons.
* Slight spasticity of the legs: This could be due to an upper motor neuron lesion, but it's a subtle finding.
* Generalized hyper-reflexia: This is a strong indication of an upper motor neuron lesion, as it suggests that the reflexes are being facilitated by the abnormal signal transmission.
* Increased signal in the cortico-spinal tracts on T2 weighted MRI: This is a key finding, as it suggests that the cortico-spinal tracts are affected, which is a hallmark of upper motor neuron lesions.

Considering these clues, the most likely diagnosis is:

So, the answer is B.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The prompt and chosen/rejected are directly concatenated:


    def build_tokenized_answer(self, prompt, answer):
        """
        Llama tokenizer does satisfy `enc(a + b) = enc(a) + enc(b)`.
        It does ensure `enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]`.
        Reference:
            https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
        """

        full_tokenized = self.tokenizer(prompt + answer, add_special_tokens=False)
        prompt_input_ids = self.tokenizer(prompt, add_special_tokens=False)["input_ids"]

I look forward to any insights or suggestions on how to address this issue. Thank you!

QLORA -4bit

Does SimPO support QLoRA fine-tuning in 4-bit for Llama-3-8B?
If so, how do I set it up?
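Since SimPOTrainer subclasses TRL's DPOTrainer (per the constructor discussed above), the usual 4-bit + LoRA setup should in principle carry over; here is a minimal sketch under that assumption, not an officially supported configuration in this repo:

# Hedged sketch: load Llama-3-8B in 4-bit and define LoRA adapters for SimPO-style training.
# Assumes bitsandbytes and peft are installed; peft_config support is assumed to mirror DPOTrainer.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# The resulting model and peft_config would then be passed to the trainer,
# e.g. SimPOTrainer(model=model, peft_config=peft_config, ...), if supported.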

The outputs of the reproduced model have "<|start_header_id|>assistant<|end_header_id|>" at the beginning

Why does the output of my reproduced model in the Llama-3-8B-Instruct-SimPO experiment have "<|start_header_id|>assistant<|end_header_id|>" at the beginning?

Example output on alpaca_eval:
{
"dataset":"helpful_base",
"instruction":"I am going to make pumpkin pie for the first time. Can you help me?",
"output":"<|start_header_id|>assistant<|end_header_id|>>Pumpkin pie making for the first time! I'd be delighted to guide you through the process.\n\nMaking a classic pumpkin pie from scratch can be a fun and rewarding experience. Here's a simple recipe and some tips to ensure your pie turns out delicious:\n\nIngredients:\n\nFor the crust:\n\n* 2 1/4 cups all-purpose flour\n* 1 tsp salt\n* 1/2 cup cold unsalted butter, cut into small cubes\n* 1/4 cup ice water\n\nFor the filling:\n\n* 1 cup cooked, mashed pumpkin (canned or fresh, cooked and mashed)\n* 1 1/2 cups heavy cream\n* 1/2 cup granulated sugar\n* 1/2 tsp salt\n* 1/2 tsp ground cinnamon\n* 1/4 tsp ground nutmeg\n* 1/4 tsp ground ginger\n* 2 large eggs\n\nInstructions:\n\n1. Make the crust: In a large bowl, combine flour and salt. Add the cubed butter and use a pastry blender or your fingers to work it into the flour until the mixture resembles coarse crumbs. Gradually add ice water, stirring with a fork until the dough comes together in a ball. Wrap and refrigerate for at least 30 minutes.\n2. Preheat and roll out the crust: Preheat your oven to 425\u00b0F (220\u00b0C). On a lightly floured surface, roll out the chilled dough to a thickness of about 1/8 inch. Transfer the dough to a 9-inch pie dish and trim the edges to fit.\n3. Prepare the filling: In a separate bowl, whisk together pumpkin, heavy cream, sugar, salt, cinnamon, nutmeg, and ginger until well combined. Beat in the eggs until smooth.\n4. Fill and bake the pie: Pour the pumpkin mixture into the pie crust. Bake for 15 minutes, then reduce the oven temperature to 350\u00b0F (180\u00b0C) and continue baking for an additional 30-40 minutes, or until the filling is set and the crust is golden brown.\n5. Cool and serve: Let the pie cool on a wire rack for at least 2 hours before serving. Enjoy!\n\nTips:\n\n* Use fresh, high-quality ingredients, including real pumpkin puree and heavy cream.\n* Don't overmix the crust dough or filling, as this can lead to a tough pie.\n* Blind baking (lining the crust with parchment paper and weights) can help prevent shrinkage, but it's not necessary for a single-crust pie like this recipe.\n* If you're unsure about the pie's doneness, check for a jewel-toned orange filling and a firm, set texture.\n\nI hope this helps you make a delicious pumpkin pie for your occasion! If you have any specific questions or concerns, feel free to ask.",
"generator":"Llama-3-Instruct-8B-SimPO"
},

Unable to reproduce the results of SFT

Hi, thanks again for the interesting work.

I followed the hyperparameter settings for SFT outlined in the paper (learning rate of 2e-5, batch size of 128, and cosine learning rate scheduling), but I am still unable to train an SFT model that achieves evaluation results similar to those reported in the SimPO paper. For AlpacaEval 2.0, my SFT model achieves an LC of 4.80 and a WR of 2.89. Could you provide more details about the SFT training process?

Thanks!
