
alpaca-rlhf's People

Contributors

l294265421


alpaca-rlhf's Issues

Fix pad_token_id bug

Thank you very much for your code!
Regarding
alpaca_rlhf/deepspeed_chat/training/utils/data/data_utils.py#DataCollatorRLHF#__call__
Fix pad_token_id bug
I have one point of confusion. Looking at the last line of class PromptDataset(Dataset) in data_utils.py, the step-3 __getitem__ returns
self.prompt_dataset[idx]["input_ids"], self.prompt_dataset[idx]["attention_mask"], self.pad_token_id
so data[-1][-1] should already be self.pad_token_id, and the original author's code should not have a bug.
I hope the author can confirm whether the bug here is what I understand it to be.
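
For reference, a minimal sketch of the pieces under discussion, simplified from the snippet quoted above (the DataCollatorRLHF body is reduced to the single expression in question; this is not the repo's full code):

from torch.utils.data import Dataset

class PromptDataset(Dataset):
    def __init__(self, prompt_dataset, pad_token_id):
        self.prompt_dataset = prompt_dataset
        self.pad_token_id = pad_token_id

    def __len__(self):
        return len(self.prompt_dataset)

    def __getitem__(self, idx):
        # step-3 branch: each sample is (input_ids, attention_mask, pad_token_id)
        return (self.prompt_dataset[idx]["input_ids"],
                self.prompt_dataset[idx]["attention_mask"],
                self.pad_token_id)

# Inside DataCollatorRLHF.__call__, `data` is a list of such tuples, so
# data[-1][-1] picks the third element of the last sample, i.e. self.pad_token_id.
def pad_token_id_from_batch(data):
    return data[-1][-1]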

A question about setting tokens

Why is tokenizer.pad_token_id set to 0?
In the LLaMA vocabulary, the proposed pad token "<0x00>" has id 3 and the unk token "<unk>" has id 0.
Why not set it to 3 here?
I think it should be set to tokenizer.pad_token_id = 3.
I hope someone can answer this for me, thanks.
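
For reference, a short sketch of the two choices being discussed (assumption: the decapoda-research/llama-7b-hf checkpoint and its SentencePiece vocabulary, where id 0 is the unk token and id 3 is the byte token "<0x00>"):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
print(tokenizer.convert_ids_to_tokens([0, 3]))  # expected: ['<unk>', '<0x00>'] for the LLaMA vocab

# Choice used in the code under discussion: reuse the unk token (id 0) as padding.
tokenizer.pad_token_id = 0

# Choice proposed in this issue: use the "<0x00>" byte token instead.
# tokenizer.pad_token_id = 3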

v100 step3 oom

Hello, when I run step 3 I keep getting an OOM error. The details are as follows:
File "/opt/conda/envs/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 6; 31.75 GiB total capacity; 30.95 GiB already allocated; 21.75 MiB free; 30.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am using 8 V100 GPUs. I have set every batch-size-related argument to 1, reduced max_seq_len, set lora_dim to 1, and set lora_module_name to q_proj. Is there anything else I can do to reduce memory usage?
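
One thing the error message itself suggests is setting max_split_size_mb to reduce allocator fragmentation. A minimal sketch of how to do that (it only mitigates fragmentation; it does not add memory):

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before CUDA is initialized

import torch  # import torch (and deepspeed) only after the variable is set

Exporting the same variable in the shell before the deepspeed launch command has the same effect.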

Some questions about deepspeed.initialize

Hello, I'd like to ask a question. I noticed that once deepspeed.initialize is called, the model's weights appear to be empty. Is this expected? It is especially confusing because _generate_sequence is executed at one point before training even starts.
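
A hedged sketch of one likely explanation: if ZeRO stage 3 is enabled, DeepSpeed partitions the parameters across ranks during deepspeed.initialize, so each rank's parameter tensors look empty afterwards; generation (e.g. _generate_sequence) still works because the engine gathers the needed shards on the fly. The model and ds_config names below stand for whatever is built in the repo's step 3:

import deepspeed

engine, *_ = deepspeed.initialize(model=model, config=ds_config,
                                  model_parameters=model.parameters())

p = next(engine.module.parameters())
print(p.shape)  # often torch.Size([0]) under ZeRO-3: the weight is sharded, not lost

with deepspeed.zero.GatheredParameters([p]):
    print(p.shape)  # full shape while temporarily gathered on this rank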

Steps

Hi, how are you? First of all, thank you for providing this repo. I have a few questions about the steps.

Do we run every step here one by one, in order?

Or do we pick just one of these steps and evaluate the results accordingly?

Also, I want to build a chatbot in a conversational-AI style. How should the data be formatted for this? The generated turns are kept as history, but how do we represent that history in the data? I have a rough idea in my head; can you help me with this too? (See the sketch below for one possible format.)
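
A hypothetical sketch of one way to flatten conversation history into a single prompt (the "Human:"/"Assistant:" markers are illustrative, not necessarily the exact format this repo's data pipeline expects):

def build_prompt(history, user_message):
    """history: list of (user_turn, assistant_turn) pairs from earlier in the conversation."""
    parts = []
    for user_turn, assistant_turn in history:
        parts.append(f"Human: {user_turn}")
        parts.append(f"Assistant: {assistant_turn}")
    parts.append(f"Human: {user_message}")
    parts.append("Assistant:")
    return "\n".join(parts)

print(build_prompt([("Hi!", "Hello, how can I help?")], "Tell me a joke."))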

If there is anything I haven't thought of, or anything you would like to add, I would appreciate it.

Thank you for everything.

stop at step2 evaluation_reward

Firstly, thank you for your contributions. Training consistently pauses (but does not exit) at evaluation_reward during step 2, so I am wondering whether something is wrong. Perhaps the condition args.global_rank == 0 is unnecessary? Any suggestions would be greatly appreciated. Thank you.
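
A hedged sketch of the failure mode the question hints at (assumption: evaluation_reward performs a collective such as dist.all_reduce to average metrics across ranks; that is not confirmed from the listing above):

import torch
import torch.distributed as dist

def evaluation_reward(model, dataloader, device):
    total = torch.zeros(1, device=device)
    # ... accumulate rewards from `model` over `dataloader` into `total` ...
    dist.all_reduce(total)  # every rank must reach this line
    return (total / dist.get_world_size()).item()

# Hangs: only rank 0 calls the collective, the other ranks never match it.
#   if args.global_rank == 0:
#       score = evaluation_reward(rm_model, eval_dataloader, device)
#
# Safe: all ranks evaluate, and only rank 0 logs the result.
#   score = evaluation_reward(rm_model, eval_dataloader, device)
#   if args.global_rank == 0:
#       print("eval reward:", score)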

GPU memory OOM when training on V100

Hello, when I train the SFT and RM steps on V100s, I run out of GPU memory and cannot proceed. The error message is:
OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 1; 31.75 GiB total capacity; 29.88 GiB already allocated; 11.75 MiB free; 29.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have already reduced per_device_train_batch_size and per_device_eval_batch_size to 1, but it still reports insufficient GPU memory. Is there any way to solve this?
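
Beyond batch size 1, one common lever on 32 GB V100s is ZeRO optimizer-state CPU offload. A hedged sketch of a generic DeepSpeed config (not this repo's exact ds_config builder):

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,  # recover the effective batch size via accumulation
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_clipping": 1.0,
}
# Pass this dict (or the equivalent command-line zero-stage/offload options) to deepspeed.initialize.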

How good are the training results?

Hi, author. After RLHF, how well does the model perform, and in what respects does it improve?

Reward model training hangs on V100

step: 82 loss:0.83251953125, correct_predictions: 0.0, reward: -0.50390625 r_reward: -0.487060546875
step: 83 loss:0.76611328125, correct_predictions: 0.0, reward: -0.492919921875 r_reward: -0.492431640625
step: 84 loss:0.7578125, correct_predictions: 0.0, reward: -0.5439453125 r_reward: -0.5361328125
step: 85 loss:0.83251953125, correct_predictions: 1.0, reward: -0.464111328125 r_reward: -0.467529296875
step: 86 loss:1.537109375, correct_predictions: 1.0, reward: -0.509765625 r_reward: -0.51708984375
step: 87 loss:0.6142578125, correct_predictions: 0.0, reward: -0.5087890625 r_reward: -0.48291015625
step: 88 loss:0.5380859375, correct_predictions: 0.0, reward: -0.451171875 r_reward: -0.44921875
[2023-05-17 14:28:38,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=7, lr=[0.0004994156634161006], mom=[(0.9, 0.95)]
[2023-05-17 14:28:38,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=14.808713117576154, CurrSamplesPerSec=14.958361511223531, MemAllocated=12.34GB, MaxMemAllocated=22.87GB
step: 89 loss:0.67333984375, correct_predictions: 1.0, reward: -0.435302734375 r_reward: -0.43994140625
step: 90 loss:0.35107421875, correct_predictions: 1.0, reward: -0.421875 r_reward: -0.457275390625
step: 91 loss:0.7763671875, correct_predictions: 1.0, reward: -0.439453125 r_reward: -0.442138671875
step: 92 loss:0.69091796875, correct_predictions: 1.0, reward: -0.440185546875 r_reward: -0.46826171875
step: 93 loss:0.355712890625, correct_predictions: 1.0, reward: -0.432373046875 r_reward: -0.455078125
step: 94 loss:0.607421875, correct_predictions: 1.0, reward: -0.425537109375 r_reward: -0.427734375
step: 95 loss:0.87060546875, correct_predictions: 0.0, reward: -0.4775390625 r_reward: -0.468017578125
step: 96 loss:0.7841796875, correct_predictions: 1.0, reward: -0.39013671875 r_reward: -0.404541015625
step: 97 loss:1.23828125, correct_predictions: 0.0, reward: -0.40869140625 r_reward: -0.36572265625
step: 98 loss:0.87890625, correct_predictions: 0.0, reward: -0.445556640625 r_reward: -0.42333984375
[2023-05-17 14:28:43,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=7, lr=[0.0004992664502959351], mom=[(0.9, 0.95)]
[2023-05-17 14:28:43,805] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=14.80846616069343, CurrSamplesPerSec=14.749032318751523, MemAllocated=12.34GB, MaxMemAllocated=22.87GB
step: 99 loss:0.7666015625, correct_predictions: 0.0, reward: -0.384033203125 r_reward: -0.382080078125


Hello, sorry to bother you. When training the reward model on V100s, the run always gets stuck at step 99 and stops making progress; the program neither raises an error nor exits.
nvidia-smi shows the GPUs are still fully occupied. Could you please take a look?
P.S. The program runs quite fast up to that point, but after step 99 it no longer makes any progress.

This is my launch command:
nohup deepspeed --num_gpus 8 /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/main.py --data_output_path /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/data_output --model_name_or_path decapoda-research/llama-7b-hf --num_padding_at_beginning 0 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 5e-4 --num_train_epochs 1 --gradient_accumulation_steps 1 --num_warmup_steps 0 --zero_stage 2 --deepspeed --output_dir /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/data_output --lora_dim 2 --lora_module_name q_proj,k_proj --only_optimize_lora > nohup1.txt &
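
For a run that hangs silently like this, one way to see where each worker is stuck is the standard-library faulthandler module (assumption: you can add a couple of lines near the top of main.py; this is a debugging aid, not a fix):

import faulthandler
import signal

# After this, `kill -USR1 <worker pid>` dumps every thread's Python stack to stderr,
# which usually shows whether the process is blocked inside a collective op.
faulthandler.register(signal.SIGUSR1, all_threads=True)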

Should tokens after EOS in the generated answer be masked in Step 3?

Hello! Thank you very much for your open-source contribution!
While reading your code, I noticed that in the RLHF step you mask out all tokens that come after the EOS generated by the model. I have a question about this:

For the critic model, even though it knows that everything after EOS is padding, it still passes the pad positions' hidden states through the FC head and outputs a score for them, rather than skipping the score computation;
for the actor model, it only sees the value assigned to each token and is supervised with the value and the reward, but it does not know whether that token was masked or whether it is padding;
so wouldn't supervising the actor with scores computed on pad tokens make things less accurate?
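
For reference, a hedged sketch of generic PPO-style masking (not necessarily this repo's exact code): if the critic scores every position, a mask over the post-EOS/pad positions can zero their contribution to both losses, so pad scores never supervise the actor or the critic.

import torch

def masked_mean(x, mask):
    return (x * mask).sum() / mask.sum().clamp(min=1)

values = torch.randn(2, 8)       # critic scores for every position, pads included
advantages = torch.randn(2, 8)
log_ratio = torch.randn(2, 8)    # actor log-prob ratio (simplified, unclipped surrogate)
action_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0],
                            [1, 1, 1, 1, 1, 0, 0, 0]], dtype=torch.float)  # 0 after EOS

actor_loss = masked_mean(-advantages * log_ratio, action_mask)
returns = advantages + values.detach()
critic_loss = masked_mean((values - returns) ** 2, action_mask)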

Step 3: Actor model and Reward model use different tokenizers

Hi author, first of all thanks for open-sourcing this.
When training stage 3 on a 40 GB GPU, I cannot load both actor model = llama-7b and reward model = llama-7b without OOM, so I tried switching the reward model to the smaller bloom-1.7b. However, the two models do not share a tokenizer. In step 3, at the create-model stage, different tokenizers are loaded; when computing critic_loss, do I need to convert the data into the critic tokenizer's representation first, and only then compute the critic loss? Or is it fine to compute the critic loss directly on data processed with the actor tokenizer?
Thanks again!
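
For reference, a hedged sketch of one way to bridge mismatched vocabularies: decode the actor's generated token ids back to text and re-tokenize with the critic/reward tokenizer before the critic forward pass (the checkpoint paths below are placeholders):

from transformers import AutoTokenizer

actor_tok = AutoTokenizer.from_pretrained("path/to/llama-7b-actor")     # hypothetical path
critic_tok = AutoTokenizer.from_pretrained("path/to/bloom-1b7-critic")  # hypothetical path

def to_critic_inputs(actor_token_ids):
    # actor token ids are meaningless in the critic's vocabulary, so go through text
    texts = actor_tok.batch_decode(actor_token_ids, skip_special_tokens=True)
    return critic_tok(texts, return_tensors="pt", padding=True)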

element 0 of tensors does not require grad and does not have a grad_fn

My launch script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed /data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py --data_path /data/bill.bi/RLHFDataset --data_output_path /data/bill.bi/tmp/ --actor_model_name_or_path decapoda-research/llama-7b-hf --tokenizer_name_or_path /data/bill.bi/tmp/rlhf/critic --critic_model_name_or_path /data/bill.bi/tmp/rlhf/critic --num_padding_at_beginning 0 --per_device_train_batch_size 4 --actor_learning_rate 9.85e-6 --critic_learning_rate 5e-6 --ppo_epochs 1 --gradient_accumulation_steps 1 --num_warmup_steps 0 --actor_zero_stage 2 --critic_zero_stage 2 --deepspeed --critic_gradient_checkpointing --actor_gradient_checkpointing --output_dir /data/bill.bi/tmp/rlhf/final --actor_lora_dim 8 --actor_lora_module_name q_proj,k_proj,gate_proj,up_proj --critic_lora_dim 8 --critic_lora_module_name q_proj,k_proj,gate_proj,up_proj --only_optimize_lora --max_prompt_seq_len 1024 1>train_step3.log 2>&1

When running step 3, I hit this error. The full stack trace is below:

Traceback (most recent call last):
File "/data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 563, in
main()
File "/data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 476, in main
actor_loss, critic_loss = trainer.train_rlhf(exp_data)
File "/data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 187, in train_rlhf
self.actor_model.backward(actor_loss)
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1862, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1901, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/data/bill.bi/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
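
One commonly reported cause of this exact error when gradient checkpointing is combined with a mostly frozen base model (--only_optimize_lora) is that the checkpointed inputs no longer require grad, so the loss ends up without a grad_fn. A hedged sketch of the usual workaround (an assumption about the cause, not a confirmed diagnosis for this repo):

# `model` stands for the actor/critic transformers model created in step 3.
model.enable_input_require_grads()  # transformers PreTrainedModel helper

# Equivalent manual hook:
# def make_inputs_require_grad(module, inputs, output):
#     output.requires_grad_(True)
# model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)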

Training question

When the model is loaded, is any model parallelism applied? I find that even a 7B model run directly in deepspeed-chat exhausts GPU memory, and my GPU is an A100 80G.
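
A hedged note: ZeRO stages 1 and 2 only shard optimizer states and gradients, so every GPU still holds the full model; ZeRO stage 3 additionally partitions the parameters across GPUs, which is usually what is needed when even batch size 1 does not fit. A generic DeepSpeed config sketch (not this repo's exact ds_config):

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                # partition parameters across GPUs
        "offload_param": {"device": "cpu"},        # optional: also push params to CPU
        "offload_optimizer": {"device": "cpu"},
    },
}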
