huggingface / alignment-handbook Goto Github PK

View Code? Open in Web Editor NEW

4.2K 111.0 358.0 236 KB

Robust recipes to align language models with human and AI preferences

Home Page: https://huggingface.co/HuggingFaceH4

License: Apache License 2.0

MDX 0.07% Makefile 1.03% Python 95.02% Shell 3.88%

llm rlhf transformers

alignment-handbook's People

Contributors

Stargazers

Watchers

Forkers

evelynmitchell anakin87 touristshaun zhaohongtian bbosnjak dumpmemory bluekiji77 cygwynd akaanirban hbcbh1999 abhishek085 bigdatasciencegroup yuan505 awesome-software sidraina89 melanie531 codeaudit wayan123 leejodie ssghost snoopycn sankeerthrao rkp64 jamesliu standardgalactic davidlanz zhangzhuobys ftgreat thzdyjy proj-odin chenin-wang audiebant rajarampanigrahi k2m5t2 sambar1729 expert68 sebastianschramm sdeva14 juangon zeroxclem hugodemenez hololeo milan-chicago tuhinmallick piyushmishra908 suryatmodulus edberg21 rodrigomasiniai rubenszimbres brinsby mansurul11 a7mad-magdy77 morganmcg1 askly-ai chesketh76 arian-askari myaniu kashif jaredquekjz riiduan moxmoussa lionelchg yibit silasdao tiamat-tech dayadaya222 tupleleap quijoteshin dineshdyne janleyva sprklinginfo archiemorph choltha mbrukman hhy5277 spencerx 007aniketkumar jaeahn2010 ebbot-ai hgmt96 techventurebuilder jeffara tcapelle 5l1v3r1 josephrp cnp-ciimar yusuf-jkhan1 chenyangml girrajjangid zhuango supermario-ai techthiyanes munirabobaker haksoomoon bknallas bigdot123456 polya20 oplatek mdmmn378 dongwang218

alignment-handbook's Issues

Did you use RMSprop or AdamW as the optimizer?

Hi to whoever is reading this 🤗

Question

After reading the Zephyr pre-printed paper https://arxiv.org/pdf/2310.16944.pdf and going through the configuration files here, I saw that there was a mismatch between the optimizer used in https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/dpo/config_full.yaml, and the one reported in the paper, AdamW.

So the question is, did you use RMSprop to run the full DPO fine-tuning or AdamW with no weight decay as stated in the paper?

Thanks in advance!

Running on single GPU(16GB)

Hi,

What is the best way to run this on my high performance laptop?
Should this somehow work? Can i calculate how many days/weeks it will run?

Thanks in advance

Specs:

OS: Win 11 (WSL2)
CPU: Intel Core i7 12850HX
Make: Lenovo Thinkpad P16 gen 1
Memory: 128GB DDR5-4800 (2400MHz)
GPU: Nvidia RTX A5500 16GB

I found that this command would work on my laptop it seems:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1

how now run it for 1-2 hours ish:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1
INFO:root:Using nproc_per_node=1.
2023-11-27 15:41:33.914308: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-11-27 15:41:33.941565: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-27 15:41:34.582753: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-11-27 15:41:35,164] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.11/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
2023-11-27 15:41:35 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2023-11-27 15:41:35 - INFO - main - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='mistralai/Mistral-7B-v0.1', model_revision='main', model_code_revision=None, torch_dtype='auto', trust_remote_code=False, use_flash_attention_2=True, use_peft=True, lora_r=64, lora_alpha=16, lora_dropout=0.1, lora_target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'], lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=True, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2023-11-27 15:41:35 - INFO - main - Data parameters DataArguments(chat_template=None, dataset_mixer={'HuggingFaceH4/ultrachat_200k': 1.0}, dataset_splits=['train_sft', 'test_sft'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
2023-11-27 15:41:35 - INFO - main - Training/evaluation parameters SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1024,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=zephyr-7b-sft-lora,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/zephyr-7b-sft-lora/runs/Nov27_15-41-35,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_seq_length=2048,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=data/zephyr-7b-sft-lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=data/zephyr-7b-sft-lora,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.NO,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Overwrite dataset info from restored data version if exists.
2023-11-27 15:41:38 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:38 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
2023-11-27 15:41:38 - INFO - datasets.builder - Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:38 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Overwrite dataset info from restored data version if exists.
2023-11-27 15:41:40 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
2023-11-27 15:41:40 - INFO - datasets.builder - Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-91f7f728fecb2505.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-91f7f728fecb2505.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-83009ff6f17d65d0.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-83009ff6f17d65d0.arrow
2023-11-27 15:41:40 - INFO - main - Training on the following datasets and their proportions: ['train : 207865', 'test : 23110']
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer.model
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer.json
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/special_tokens_map.json
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer_config.json
Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-3e95fae9b410a2c7.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-3e95fae9b410a2c7.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-84dc14e69dab5370.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-84dc14e69dab5370.arrow
2023-11-27 15:41:40 - INFO - main - Sample 167621 of the processed training set:
........
2023-11-27 15:41:40 - INFO - main - *** Load pretrained model ***
2023-11-27 15:41:40 - INFO - main - *** Model loaded! ***
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-27 15:41:40,964 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/config.json
[INFO|configuration_utils.py:777] 2023-11-27 15:41:40,964 >> Model config MistralConfig {
"_name_or_path": "mistralai/Mistral-7B-v0.1",
"architectures": [
"MistralForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.35.0",
"use_cache": false,
"vocab_size": 32000
}

[INFO|modeling_utils.py:3121] 2023-11-27 15:41:40,972 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/pytorch_model.bin.index.json
[INFO|modeling_utils.py:3184] 2023-11-27 15:41:40,974 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:1222] 2023-11-27 15:41:40,974 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-27 15:41:40,976 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"use_cache": false
}

[INFO|modeling_utils.py:3257] 2023-11-27 15:41:41,631 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.75s/it][INFO|modeling_utils.py:3950] 2023-11-27 15:41:51,332 >> All model checkpoint weights were used when initializing MistralForCausalLM.

[INFO|modeling_utils.py:3958] 2023-11-27 15:41:51,332 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:751] 2023-11-27 15:41:51,488 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/generation_config.json
[INFO|configuration_utils.py:791] 2023-11-27 15:41:51,488 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}

[INFO|training_args.py:1784] 2023-11-27 15:41:51,646 >> PyTorch: setting up devices
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:247: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
warnings.warn(
[INFO|trainer.py:593] 2023-11-27 15:41:52,619 >> Using auto half precision backend
2023-11-27 15:41:52 - INFO - main - *** Train ***
[INFO|trainer.py:1723] 2023-11-27 15:41:53,614 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-27 15:41:53,614 >> Num examples = 207,865
[INFO|trainer.py:1725] 2023-11-27 15:41:53,614 >> Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-27 15:41:53,614 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-27 15:41:53,614 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:1730] 2023-11-27 15:41:53,614 >> Gradient Accumulation steps = 1024
[INFO|trainer.py:1731] 2023-11-27 15:41:53,614 >> Total optimization steps = 202
[INFO|trainer.py:1732] 2023-11-27 15:41:53,616 >> Number of trainable parameters = 54,525,952
0%| | 0/202 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-27 15:41:54,956 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2377 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|logging.py:314] 2023-11-27 15:41:55,018 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
[WARNING|logging.py:329] 2023-11-27 15:41:55,763 >> The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
0%|▌ | 1/202 [4:36:47<927:14:16, 16607.25s/it]{'loss': 1.1453, 'learning_rate': 1.9998790632601496e-05, 'epoch': 0.0}
0%|▌ | 1/202 [4:36:47<927:14:16, 16607.25s/it]

Question about DPO learning rate - comparison to neural-chat-7b-v3 training

The learning rate default in the dpo recipe config is set to 5e-7 and https://huggingface.co/Intel/neural-chat-7b-v3 was trained with a learning rate of 1e-4 (using of course a different data set https://huggingface.co/datasets/Open-Orca/SlimOrca).

However, I am wondering about the significant difference in lr and yet both models seem to perform well. Any insights about that, that you can share?
Thank you

Differences between alpha and beta models

Hi!

I am wondering if there is an official documentation of differences between zephyr-7b-alpha and zephyr-7b-beta.

According to this blog

Zephyr beta trains for more DPO epochs (than Zephyr alpha) leading to better chat results!

According to this linkedin post

Compared to Zephyr-Alpha, they filtered the data to get rid of issues related to incorrect casing and weird starts for some answers.

The above sources are from 3rd parties. I'm curious if there's an official reference for these differences. The Zephyr paper doesn't appear to cover them. Thanks a lot!

Tianlin

Hardware used for reproducing

Any information on the exact hardware used when training Zephyr 7B Beta? The deepspeed config looks like no CPU offloading was done so were larger (or many) GPUs used?

Would be great to know the hardware that was used for the sake of reproducing, as well as safely expanding to using other datasets (knowing that the current configuration with regards to batch sizes won't OOM).

Warning about max sequence length

Hi, when I ran the dpo finetuning code, I noticed that there is a warning in the logging output
[WARNING|tokenization_utils_base.py:3831] 2023-12-06 16:44:52,195 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2455 > 2048). Running this sequence through the model will result in indexing errors
I note that this is related to #32 , who seems to have the same issue.

Do I need to worry about this warning? Could anyone explain why this warning appears?

Weird DPO loss

Hi, I would like to raise some attention to issue #38.

It seems that the DPO-Lora training loss (red line) drops abruptly at the beginning of each epoch, which seems weird. (I tried Lora model global batch size 64, multi_gpu acceleration, 8GPUs, learning rate 1e-4, others same suggested)

In the mean time, the full parameter fine tunning has no such problem (official settings).

I don't know if this is normal and assume this is a bug associated with the lora model. Is there any explanations? Has anyone encountered the same issue? If your rerun loss is normal, can you share your configs?

Question about the evaluation dataset

Hi, I wonder which TruthfulQA task you are focusing on during evaluation? MC1, MC2, or generation task?

Max Sequence Length

Had a question about the max_seq_length hyper parameter.

I just started training and set the config for SFT to be the below:

# Model arguments
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: true

# Data training arguments
dataset_mixer:
  HuggingFaceH4/ultrachat_200k: 1.0
dataset_splits:
- train_sft
- test_sft
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 32
gradient_checkpointing: true
hub_model_id: custom_1_0
hub_strategy: every_save
learning_rate: 2.0e-05
log_level: info
logging_steps: 5  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 8192
max_steps: -1
num_train_epochs: 1
output_dir: data/scout_version_1_0
overwrite_output_dir: true
per_device_eval_batch_size: 8
per_device_train_batch_size: 2
push_to_hub: true
remove_unused_columns: true
report_to:
- tensorboard
save_strategy: "no"
save_total_limit: null
seed: 42
tf32: true

However, I got this warning below:

[WARNING|tokenization_utils_base.py:3831] 2023-11-15 19:02:06,747 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2576 > 2048). Running this sequence through the model will result in indexingerrors

This was also from the logs:

2023-11-15 19:01:23 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2023-11-15 19:01:23 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='mistralai/Mistral-7B-v0.1', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', trust_remote_code=False, use_flash_attention_2=True, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2023-11-15 19:01:23 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'HuggingFaceH4/ultrachat_200k': 1.0}, dataset_splits=['train_sft', 'test_sft'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
2023-11-15 19:01:23 - INFO - __main__ - Training/evaluation parameters SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=32,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=scout_version_1_0,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/scout_version_1_0/runs/Nov15_19-01-23_ip-10-10-9-47.ec2.internal,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_seq_length=8192,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=adamw_torch,
optim_args=None,
output_dir=data/scout_version_1_0,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=2,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=data/scout_version_1_0,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)

Any thoughts as to what might be happening? Ran into the same thing when trying Llama 7B-hf which has a default token length of 4096.

How to QLoRA training with ZeRO-3 on two or more GPUs?

I added a 4-bit load after the command LoRA training with ZeRO-3 on two or more GPUs to achieve a mix of QLoRA and ZeRO-3. But the program encountered the following error:
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7f2ec8daf900>
The command is:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=2 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true

How to perform full parameter finetuning without A100 GPUs

Hi, thank you for your great work! I'd like to reproduce full parameter fine-tuning of dpo training. However I only have 10 * Nvidia A40 GPUs (46 Gbs memory each).

I tried the command

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml

and it reported OOM error, even if I set batch size to 1.

I don't mind the program runs a bit slower (e.g., use smaller batchsize and more gradient accumulation steps). However, I don't know if there is a way to successfully deploy the full-dpo code.

Can you help me, please?

Also, I'm wondering how large is the performance gap between lora and full parameter finetunning.

Reproducing of Lora Model Result on MT-Bench

Recently, I attempted to fit the DPO on my own dataset.
Initially, I tried to reproduce the results of your LORA model( 7.43 on MT-Bench).
However, I encountered some issues.
Despite using all your parameters and data, here are my results on MT-Bench:

Model	MT-Bench
Zephyr-SFT-Lora-Own	6.37
Zephyr-DPO-Lora-Own	6.95

Then, I downloaded your models from here, and the results were nearly the same as mine.

Model	MT-Bench
Zephyr-SFT-Lora	6.4
Zephyr-DPO-Lora	6.93

DPO does help improve performance on MT-Bench, but I can't achieve a score of 7.43. Is there any difference between the model described in your paper and the model available on your homepage?
Or could it be the difference between the full and LORA?

By the way, I truly love the "yaml style" argument parser; it's clear and elegant!
@edbeeching @lewtun

Memory Issue with 7b Model Fine-Tuning on 6 H100 GPUs

Hello everyone, I'm encountering a memory issue while fine-tuning a 7b model (such as Mistral) using a repository I found. Despite having 6 H100 GPUs at my disposal, I run into out-of-memory errors when using a batch size of 4. Interestingly, when I use libraries like Axolotl for similar tasks, I don't face this problem. Could anyone provide insights or suggestions on how to resolve these memory issues with the specific repository I'm using for fine-tuning? Any help would be greatly appreciated!

How to specify another GPU to run rather than cuda:0?

I tried to modify the --gpu_ids paramater in recipes/accelerate_configs/multi_gpu.yaml, however, it didn't work, the device was still 'cuda:0'.

Training Interruptions and Epoch Skipping with 6 Billion Parameter Model on 8 A100 GPUs

I attempted to fine-tune a 6 billion parameter model using 8 A100 GPUs, but the training process encountered interruptions. On the first attempt, it stopped at 0.15 epochs, and on the second attempt, where I started from 2 epochs, it oddly skipped some epochs, jumping from 0.15 directly to 1, and then stopped at 2.25. For more detailed information, you can check this WandB link - https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd/

Configs -

Model arguments

model_name_or_path: 01-ai/Yi-6B
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: false
trust_remote_code: true

Data training arguments

dataset_mixer:
communityai/apt-chat-micro-dataset-llm-v2-714k: 0.4
dataset_splits:

train
test
preprocessing_num_workers: 12

SFT trainer config

bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 4
gradient_checkpointing: false
hub_model_id: apt-chat-yi-6B-sft-full
hub_strategy: every_save
learning_rate: 0.00002
log_level: info
logging_steps: 50
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 4096
max_steps: -1
num_train_epochs: 2
output_dir: data/apt-chat-yi-6B-sft-full
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: true
remove_unused_columns: true
report_to:

wandb
save_strategy: "no"
save_total_limit: null
seed: 42
tf32: true

LOGS -

INFO:root:Using nproc_per_node=8.
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING]
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] *****************************************
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] *****************************************
[2023-11-14 02:09:45,328] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,584] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
[2023-11-14 02:09:45,607] [INFO] [comm.py:637:init_distributed] cdb=None
2023-11-14 02:09:45 - WARNING - main - Process rank: 7, device: cuda:7, n_gpu: 1 distributed training: True, 16-bits training: False
[2023-11-14 02:09:45,646] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,793] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,832] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,834] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,835] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
[2023-11-14 02:09:45,864] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-14 02:09:45,908] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
2023-11-14 02:09:45 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1 distributed training: True, 16-bits training: False
[2023-11-14 02:09:45,939] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-14 02:09:45,939] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-11-14 02:09:45 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2023-11-14 02:09:45 - INFO - main - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='01-ai/Yi-6B', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', trust_remote_code=True, use_flash_attention_2=False, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2023-11-14 02:09:45 - INFO - main - Data parameters DataArguments(chat_template=None, dataset_mixer={'communityai/apt-chat-micro-dataset-llm-v2-714k': 0.4}, dataset_splits=['train', 'test'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
2023-11-14 02:09:45 - INFO - main - Training/evaluation parameters SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=apt-chat-yi-6B-sft-full,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=50,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_seq_length=4096,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=2,
optim=adamw_torch,
optim_args=None,
output_dir=data/apt-chat-yi-6B-sft-full,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=data/apt-chat-yi-6B-sft-full,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2023-11-14 02:09:48 - INFO - main - Training on the following datasets and their proportions: ['train : 285436', 'test : 500']
++++++++++++++++++++++++++++++++++++++
YiTokenizer(name_or_path='01-ai/Yi-6B', vocab_size=64000, model_max_length=4096, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '', 'pad_token': ''}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
2023-11-14 02:09:53 - INFO - main - *** Load pretrained model ***
neftune_noise_alpha - 5.0
training_args - 2023-11-14 02:09:53 - INFO - main - *** Model loaded! ***
neftune_noise_alpha - 5.0
training_args - SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=apt-chat-yi-6B-sft-full,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=5,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=50,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_seq_length=4096,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=2,
optim=adamw_torch,
optim_args=None,
output_dir=data/apt-chat-yi-6B-sft-full,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=data/apt-chat-yi-6B-sft-full,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-14 02:09:53,295 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json
[INFO|configuration_utils.py:717] 2023-11-14 02:09:53,384 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json
[INFO|configuration_utils.py:777] 2023-11-14 02:09:53,386 >> Model config YiConfig {
"_name_or_path": "01-ai/Yi-6B",
"architectures": [
"YiForCausalLM"
],
"auto_map": {
"AutoConfig": "01-ai/Yi-6B--configuration_yi.YiConfig",
"AutoModel": "01-ai/Yi-6B--modeling_yi.YiModel",
"AutoModelForCausalLM": "01-ai/Yi-6B--modeling_yi.YiForCausalLM"
},
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "Yi",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 4,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"rope_theta": 5000000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.35.0",
"use_cache": true,
"vocab_size": 64000
}

[INFO|modeling_utils.py:3121] 2023-11-14 02:09:53,499 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/model.safetensors.index.json
[INFO|modeling_utils.py:1222] 2023-11-14 02:09:53,501 >> Instantiating YiForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-14 02:09:53,503 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0
}

[2023-11-14 02:10:02,797] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_enabled .................. False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_params ................... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3e8d053e50>
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] communication_data_type ...... None
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] disable_allgather ............ False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dump_state ................... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_enabled ................. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] global_rank .................. 0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_accumulation_steps .. 4
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_clipping ............ 0.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] memory_breakdown ............. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_name ............... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_params ............. None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_enabled .................. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_params ................... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] prescale_gradients ........... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_name ............... None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_params ............. None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_attention ............. None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] steps_per_print .............. inf
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_batch_size ............. 32
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 1
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] weight_quantization_config ... None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] world_size ................... 8
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_allow_untested_optimizer True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_enabled ................. True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_optimization_stage ...... 3
[2023-11-14 02:10:02,799] [INFO] [config.py:962:print_user_config] json = {
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"nvme_path": null
},
"offload_param": {
"device": "none",
"nvme_path": null
},
"stage3_gather_16bit_weights_on_model_save": true
},
"steps_per_print": inf,
"bf16": {
"enabled": true
},
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
[INFO|trainer.py:1723] 2023-11-14 02:10:02,799 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-14 02:10:02,799 >> Num examples = 285,436
[INFO|trainer.py:1725] 2023-11-14 02:10:02,799 >> Num Epochs = 2
[INFO|trainer.py:1726] 2023-11-14 02:10:02,799 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-14 02:10:02,799 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1730] 2023-11-14 02:10:02,799 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1731] 2023-11-14 02:10:02,799 >> Total optimization steps = 17,840
[INFO|trainer.py:1732] 2023-11-14 02:10:02,801 >> Number of trainable parameters = 6,061,035,520
[INFO|integration_utils.py:718] 2023-11-14 02:10:02,802 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: developer-team018 (neural-network-018). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /workspace/alignment-handbook/wandb/run-20231114_021003-8xmy6gtd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run robust-plasma-26
wandb: ⭐️ View project at https://wandb.ai/neural-network-018/huggingface
wandb: 🚀 View run at https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd
0%| | 0/17840 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-14 02:10:23,512 >> Token indices sequence length is longer than the specified maximum sequence length for this model (6114 > 4096). Running this sequence through the model will result in indexing errors
{'loss': 1.7024, 'learning_rate': 1.9999999844947046e-05, 'epoch': 0.0}
{'loss': 1.1507, 'learning_rate': 1.999961237011484e-05, 'epoch': 0.01}
{'loss': 1.0928, 'learning_rate': 1.9998449510510744e-05, 'epoch': 0.01}
{'loss': 1.0793, 'learning_rate': 1.999651151133954e-05, 'epoch': 0.02}
{'loss': 1.0867, 'learning_rate': 1.999379852284651e-05, 'epoch': 0.02}
{'loss': 1.0857, 'learning_rate': 1.999031075535873e-05, 'epoch': 0.03}
{'loss': 1.0721, 'learning_rate': 1.9986048479268788e-05, 'epoch': 0.03}
{'loss': 1.0923, 'learning_rate': 1.99810120250138e-05, 'epoch': 0.04}
{'loss': 1.0836, 'learning_rate': 1.9975201783049804e-05, 'epoch': 0.04}
{'loss': 1.0769, 'learning_rate': 1.9968618203821487e-05, 'epoch': 0.05}
{'loss': 1.0574, 'learning_rate': 1.9961261797727256e-05, 'epoch': 0.06}
{'loss': 1.042, 'learning_rate': 1.9953133135079686e-05, 'epoch': 0.06}
{'loss': 1.0554, 'learning_rate': 1.9944232846061284e-05, 'epoch': 0.07}
{'loss': 1.0735, 'learning_rate': 1.993456162067566e-05, 'epoch': 0.07}
{'loss': 1.0785, 'learning_rate': 1.992412020869401e-05, 'epoch': 0.08}
{'loss': 1.0654, 'learning_rate': 1.9912909419596993e-05, 'epoch': 0.08}
{'loss': 1.0606, 'learning_rate': 1.9900930122511993e-05, 'epoch': 0.09}
{'loss': 1.0664, 'learning_rate': 1.988818324614572e-05, 'epoch': 0.1}
{'loss': 1.0604, 'learning_rate': 1.9874669778712215e-05, 'epoch': 0.1}
{'loss': 1.0674, 'learning_rate': 1.9860390767856244e-05, 'epoch': 0.11}
{'loss': 1.042, 'learning_rate': 1.984534732057208e-05, 'epoch': 0.11}
{'loss': 1.0452, 'learning_rate': 1.9829540603117667e-05, 'epoch': 0.12}
{'loss': 1.0577, 'learning_rate': 1.9812971840924222e-05, 'epoch': 0.12}
{'loss': 1.0471, 'learning_rate': 1.979564231850122e-05, 'epoch': 0.13}
{'loss': 1.0704, 'learning_rate': 1.977755337933682e-05, 'epoch': 0.13}
{'loss': 1.0282, 'learning_rate': 1.9758706425793702e-05, 'epoch': 0.14}
{'loss': 1.0515, 'learning_rate': 1.973910291900036e-05, 'epoch': 0.15}
{'loss': 1.0548, 'learning_rate': 1.97187443787378e-05, 'epoch': 0.15}
8%|██▌ | 1368/17840 [1:50:57<19:16:41, 4.21s/it][INFO|trainer.py:3158] 2023-11-14 04:01:02,181 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 04:01:02,182 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 04:01:02,182 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s]
3%|█▍ | 2/63 [00:00<00:12, 5.02it/s]
5%|██ | 3/63 [00:00<00:12, 4.76it/s]
6%|██▊ | 4/63 [00:00<00:15, 3.89it/s]
8%|███▍ | 5/63 [00:01<00:16, 3.50it/s]
10%|████▏ | 6/63 [00:01<00:17, 3.30it/s]
11%|████▉ | 7/63 [00:01<00:17, 3.20it/s]
13%|█████▌ | 8/63 [00:02<00:17, 3.12it/s]

                                                                         {'eval_loss': 1.0247304439544678, 'eval_runtime': 4.5889, 'eval_samples_per_second': 108.959, 'eval_steps_per_second': 13.729, 'epoch': 0.15}

8%|██▌ | 1368/17840 [1:51:02<19:16:41, 4.21s/it]
14%|██████▎ | 9/63 [00:02<00:17, 3.14it/s]
{'loss': 0.9636, 'learning_rate': 1.9697632383321755e-05, 'epoch': 1.0}
{'loss': 0.9026, 'learning_rate': 1.96757685694803e-05, 'epoch': 1.01}
{'loss': 0.8808, 'learning_rate': 1.965315463222695e-05, 'epoch': 1.01}
{'loss': 0.8712, 'learning_rate': 1.9629792324729302e-05, 'epoch': 1.02}
{'loss': 0.8967, 'learning_rate': 1.960568345817306e-05, 'epoch': 1.03}
{'loss': 0.8676, 'learning_rate': 1.9580829901621666e-05, 'epoch': 1.03}
{'loss': 0.8723, 'learning_rate': 1.9555233581871366e-05, 'epoch': 1.04}
{'loss': 0.9122, 'learning_rate': 1.9528896483301866e-05, 'epoch': 1.04}
{'loss': 0.8687, 'learning_rate': 1.9501820647722458e-05, 'epoch': 1.05}
{'loss': 0.8726, 'learning_rate': 1.947400817421375e-05, 'epoch': 1.05}
{'loss': 0.8505, 'learning_rate': 1.944546121896493e-05, 'epoch': 1.06}
{'loss': 0.8458, 'learning_rate': 1.9416181995106585e-05, 'epoch': 1.07}
{'loss': 0.8721, 'learning_rate': 1.9386172772539162e-05, 'epoch': 1.07}
{'loss': 0.8676, 'learning_rate': 1.9355435877756957e-05, 'epoch': 1.08}
{'loss': 0.8826, 'learning_rate': 1.9323973693667762e-05, 'epoch': 1.08}
{'loss': 0.8607, 'learning_rate': 1.929178865940815e-05, 'epoch': 1.09}
{'loss': 0.8561, 'learning_rate': 1.925888327015434e-05, 'epoch': 1.09}
{'loss': 0.8687, 'learning_rate': 1.9225260076928783e-05, 'epoch': 1.1}
{'loss': 0.874, 'learning_rate': 1.919092168640239e-05, 'epoch': 1.1}
{'loss': 0.8563, 'learning_rate': 1.915587076069243e-05, 'epoch': 1.11}
{'loss': 0.8445, 'learning_rate': 1.9120110017156172e-05, 'epoch': 1.12}
{'loss': 0.8646, 'learning_rate': 1.908364222818019e-05, 'epoch': 1.12}
{'loss': 0.8479, 'learning_rate': 1.9046470220965457e-05, 'epoch': 1.13}
{'loss': 0.8788, 'learning_rate': 1.9008596877308157e-05, 'epoch': 1.13}
{'loss': 0.9, 'learning_rate': 1.8970025133376252e-05, 'epoch': 1.14}
{'loss': 0.8791, 'learning_rate': 1.893075797948188e-05, 'epoch': 1.14}
{'loss': 0.9254, 'learning_rate': 1.889079845984951e-05, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:25<17:42:31, 4.22s/it][INFO|trainer.py:3158] 2023-11-14 05:52:30,316 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 05:52:30,317 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 05:52:30,317 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s]
3%|█▍ | 2/63 [00:00<00:10, 6.07it/s]
5%|██ | 3/63 [00:00<00:14, 4.20it/s]
6%|██▊ | 4/63 [00:01<00:16, 3.63it/s]
8%|███▍ | 5/63 [00:01<00:17, 3.37it/s]
10%|████▏ | 6/63 [00:01<00:17, 3.23it/s]
11%|████▉ | 7/63 [00:02<00:17, 3.16it/s]
13%|█████▌ | 8/63 [00:02<00:17, 3.06it/s]

{'eval_loss': 1.0676991939544678, 'eval_runtime': 4.5191, 'eval_samples_per_second': 110.641, 'eval_steps_per_second': 13.941, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:30<17:42:31, 4.22s/it]
14%|██████▎ | 9/63 [00:02<00:17, 3.09it/s]
[INFO|trainer.py:1955] 2023-11-14 05:52:34,837 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 13352.0365, 'train_samples_per_second': 42.755, 'train_steps_per_second': 1.336, 'train_loss': 0.9719247023264567, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:30<20:28:20, 4.88s/it]
***** train metrics *****
epoch = 1.15
train_loss = 0.9719
train_runtime = 3:42:32.03
train_samples = 285436
train_samples_per_second = 42.755
train_steps_per_second = 1.336
2023-11-14 05:52:34 - INFO - main - *** Evaluate ***
[INFO|trainer.py:3158] 2023-11-14 05:52:34,843 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 05:52:34,843 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 05:52:34,844 >> Batch size = 1
14%|██████▎ | 9/63 [00:02<00:16, 3.23it/s]
***** eval metrics *****
epoch = 1.15
eval_loss = 1.0677
eval_runtime = 0:00:04.48
eval_samples = 500
eval_samples_per_second = 111.451
eval_steps_per_second = 14.043
2023-11-14 05:52:39 - INFO - main - *** Save model ***
[INFO|trainer.py:2881] 2023-11-14 05:52:43,590 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full
[INFO|configuration_utils.py:461] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
[INFO|configuration_utils.py:564] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json
[INFO|modeling_utils.py:2201] 2023-11-14 05:52:51,334 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2428] 2023-11-14 05:52:51,336 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-11-14 05:52:51,337 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json
[INFO|trainer.py:2881] 2023-11-14 05:52:55,599 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full
[INFO|configuration_utils.py:461] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
[INFO|configuration_utils.py:564] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json
[INFO|modeling_utils.py:2201] 2023-11-14 05:53:06,302 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2428] 2023-11-14 05:53:06,303 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-11-14 05:53:06,304 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json
2023-11-14 05:55:20 - INFO - main - Model saved to data/apt-chat-yi-6B-sft-full
[INFO|modelcard.py:452] 2023-11-14 05:55:21,054 >> Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'communityai/apt-chat-micro-dataset-llm-v2-714k', 'type': 'communityai/apt-chat-micro-dataset-llm-v2-714k'}}
[INFO|configuration_utils.py:461] 2023-11-14 05:55:21,057 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
2023-11-14 05:55:21 - INFO - main - Pushing to hub...

How do I get the training scrips to utilize all my GPUs?

Hello there,

I'm running this script:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml

... but on my machine with 2x3090s ... only GPU 0 is being utilized.

What do I need to change to utlize both of my 3090s for the training run?

Thanks

Train on emails

Working as an cms admin in my company.
I have around 1 million emails back and forth to our customers.
How can i utilize the emails to make a chatbot that will understand our products?(our products is not known on the internet)

(understand that i can fine-tune on the data id i structure it like question/answer, but can i train on unstructured emails?)

Thanks in advance.

High VRAM usage with ZeRO 3.

Bug report that I'm hoping to fix with #51

Right now VRAM usage is high, even when using CPU offloading for parameters with ZeRO stage 3. My guess is that this is because if a GPU is detected, the model is being moved to it, regardless of deepspeed settings.

Why zephyr-7b-dpo-lora is finetuned from mistralai/Mistral-7B-v0.1 instead of zepher-7b-sft model?

There is a misalignment between zephyr-7b-dpo-lora and zephyr-7b-dpo-full.
The former one is finetuned from mistralai/Mistral-7B-v0.1.
The latter is finetuned from zephyr-7b-dpo-full.

I wonder what causes this misalignment ?

Also, have you benchmarked performance improvement of the lora finetunning script? In my experiment, lora finetunning seems do not provide any performance improvement compared with the base model on MT-bench. I think maybe some parameters are incorrect.

SFT training doesn't fully go through all samples

Current training uses ConstantLengthDataset. This dataset return fixed length of tokens (2048) in every step, however, the total number of steps are calculated based on the number of samples. I checked some samples and found that quite a few of them are much longer than 2048 (~7000), this means that some of the samples have never been seen in one epoch of training.

Could you please verify if my understanding is correct?

Thanks, appreciate.

code release?

Could you please release the releated codes?

The deepspeed full finetunning get stuck.

I ran this command
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 0 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

INFO:root:Using nproc_per_node=8.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.elastic.rendezvous.static_tcp_rendezvous:Creating TCPStore as the c10d::Store implementation
[2023-11-14 08:13:14,714] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 08:13:14,899] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 08:13:14,914] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
......
/git/trl/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
[2023-11-14 08:13:19,132] [INFO] [comm.py:637:init_distributed] cdb=None
/git/trl/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
[2023-11-14 08:13:19,235] [INFO] [comm.py:637:init_distributed] cdb=None

The program is stuck after outputting the above log info. I do not know what's wrong since there is no error message. Can you help me with that?

[process exited with code 1 (0x00000001)]

Just wanted to report a crash while training.

Error message: [process exited with code 1 (0x00000001)]

Command i used to start the process: ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1

Explanation: Ran the process for several days, then my wife disconnected my laptop from the power source and moved the pc from the livingroom to another room(as the pc was so noisy), and then it seemed to crash. Not sure if it was triggered by the power source disconnect, or if it just happened around that time.

I will just try to run it again.

Log:

2023-11-27 15:41:40 - INFO - main - *** Load pretrained model ***
2023-11-27 15:41:40 - INFO - main - *** Model loaded! ***
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-27 15:41:40,964 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/config.json
[INFO|configuration_utils.py:777] 2023-11-27 15:41:40,964 >> Model config MistralConfig {
"_name_or_path": "mistralai/Mistral-7B-v0.1",
"architectures": [
"MistralForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.35.0",
"use_cache": false,
"vocab_size": 32000
}

[INFO|modeling_utils.py:3121] 2023-11-27 15:41:40,972 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/pytorch_model.bin.index.json
[INFO|modeling_utils.py:3184] 2023-11-27 15:41:40,974 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:1222] 2023-11-27 15:41:40,974 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-27 15:41:40,976 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"use_cache": false
}

[INFO|modeling_utils.py:3257] 2023-11-27 15:41:41,631 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.75s/it][INFO|modeling_utils.py:3950] 2023-11-27 15:41:51,332 >> All model checkpoint weights were used when initializing MistralForCausalLM.

[INFO|modeling_utils.py:3958] 2023-11-27 15:41:51,332 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:751] 2023-11-27 15:41:51,488 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/generation_config.json
[INFO|configuration_utils.py:791] 2023-11-27 15:41:51,488 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}

[INFO|training_args.py:1784] 2023-11-27 15:41:51,646 >> PyTorch: setting up devices
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:247: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
warnings.warn(
[INFO|trainer.py:593] 2023-11-27 15:41:52,619 >> Using auto half precision backend
2023-11-27 15:41:52 - INFO - main - *** Train ***
[INFO|trainer.py:1723] 2023-11-27 15:41:53,614 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-27 15:41:53,614 >> Num examples = 207,865
[INFO|trainer.py:1725] 2023-11-27 15:41:53,614 >> Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-27 15:41:53,614 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-27 15:41:53,614 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:1730] 2023-11-27 15:41:53,614 >> Gradient Accumulation steps = 1024
[INFO|trainer.py:1731] 2023-11-27 15:41:53,614 >> Total optimization steps = 202
[INFO|trainer.py:1732] 2023-11-27 15:41:53,616 >> Number of trainable parameters = 54,525,952
0%| | 0/202 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-27 15:41:54,956 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2377 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|logging.py:314] 2023-11-27 15:41:55,018 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
[WARNING|logging.py:329] 2023-11-27 15:41:55,763 >> The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
0%|▌ | 1/202 [4:36:47<927:14:16, 16607.25s/it]{'loss': 1.1453, 'learning_rate': 1.9998790632601496e-05, 'epoch': 0.0}
{'loss': 1.1416, 'learning_rate': 1.9969780438256295e-05, 'epoch': 0.02}
{'loss': 1.1346, 'learning_rate': 1.987930439740757e-05, 'epoch': 0.05}
{'loss': 1.1079, 'learning_rate': 1.9729118706714377e-05, 'epoch': 0.07}
{'loss': 1.0977, 'learning_rate': 1.95201310753273e-05, 'epoch': 0.1}
{'loss': 1.0881, 'learning_rate': 1.925360460617242e-05, 'epoch': 0.12}
{'loss': 1.0713, 'learning_rate': 1.8931150161867917e-05, 'epoch': 0.15}
{'loss': 1.0523, 'learning_rate': 1.855471662881164e-05, 'epoch': 0.17}
{'loss': 1.0533, 'learning_rate': 1.8126579138282502e-05, 'epoch': 0.2}
{'loss': 1.0427, 'learning_rate': 1.764932531574648e-05, 'epoch': 0.22}
{'loss': 1.0307, 'learning_rate': 1.7125839641475074e-05, 'epoch': 0.25}
{'loss': 1.0395, 'learning_rate': 1.65592860169994e-05, 'epoch': 0.27}
{'loss': 1.0268, 'learning_rate': 1.595308864276666e-05, 'epoch': 0.3}
{'loss': 1.0304, 'learning_rate': 1.531091132257275e-05, 'epoch': 0.32}
{'loss': 1.0264, 'learning_rate': 1.4636635319853274e-05, 'epoch': 0.34}
{'loss': 1.0232, 'learning_rate': 1.3934335899667526e-05, 'epoch': 0.37}
{'loss': 1.0094, 'learning_rate': 1.3208257698153677e-05, 'epoch': 0.39}
{'loss': 1.0238, 'learning_rate': 1.2462789068320016e-05, 'epoch': 0.42}
{'loss': 1.013, 'learning_rate': 1.1702435557223988e-05, 'epoch': 0.44}
{'loss': 1.022, 'learning_rate': 1.0931792674840718e-05, 'epoch': 0.47}
{'loss': 1.0153, 'learning_rate': 1.0155518119203511e-05, 'epoch': 0.49}
{'loss': 1.0143, 'learning_rate': 9.378303625685196e-06, 'epoch': 0.52}
{'loss': 1.0191, 'learning_rate': 8.604846610560771e-06, 'epoch': 0.54}
{'loss': 1.0176, 'learning_rate': 7.839821780235168e-06, 'epoch': 0.57}
{'loss': 1.0169, 'learning_rate': 7.0878528777274814e-06, 'epoch': 0.59}
60%|█████████████████████████████████████████████████████████████████████▍ | 121/202 [142:29:15<78:43:12, 3498.68s/it]
[process exited with code 1 (0x00000001)]
You can now close this terminal with Ctrl+D, or press Enter to restart.

Impossible to load local pretrained model for DPO Alignment

Hi,

I'm using Alignment-Handbook to finetune mistral on a custom dataset.
I've made the SFT Full Fine Tuning and it ended up well, and saved my pretrained model to ./data/zephyr-7b-sft-full.
Now i want to align my model with DPO, everything is okay with the loading of the dataset but it's impossible to load the pretrained model saved on my local machine.
Althought i've tried with absolute and relative path, it always throw this error :

/root/miniconda3/lib/python3.11/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
Traceback (most recent call last):
  File "/workspace/work/finetuning/alignment-handbook/scripts/run_dpo.py", line 224, in <module>
    main()
  File "/workspace/work/finetuning/alignment-handbook/scripts/run_dpo.py", line 119, in main
    if is_adapter_model(model, model_args.model_revision):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/alignment/model_utils.py", line 99, in is_adapter_model
    repo_files = list_repo_files(model_name_or_path, revision=revision)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './data/zephyr-7b-sft-full'. Use `repo_type` argument if needed.

and no matter how much I change, nothing happens.

Here is the /dpo/config_full.yaml file :

# Model arguments
model_name_or_path: ./data/zephyr-7b-sft-full

# Data training arguments
# For definitions, see: src/h4/training/config.py
# Data training arguments
dataset_mixer:
  /workspace/work/finetuning/tickets_dpo/: 1.0
dataset_splits:
- train
- test
preprocessing_num_workers: 12

# DPOTrainer arguments
bf16: true
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 1
gradient_checkpointing: true
#hub_model_id: zephyr-7b-dpo-full
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 1024
num_train_epochs: 5
optim: rmsprop
output_dir: data/zephyr-7b-dpo-full
per_device_train_batch_size: 8
per_device_eval_batch_size: 4
push_to_hub: false
save_strategy: "no"
save_total_limit: null
seed: 42
warmup_ratio: 0.1

Please help :)

What is the expected "global batch size"?

In the recipes README there is this statement:

If you scale up/down the number of GPUs, we recommend also scaling up the per-device batch size or number of gradient accumulation steps to keep the global batch size constant (and thus replicate our results).

Q: What is the expected "global batch size"?

For example, I'm trying to run this on 2x3090s and need to know what the expected global batch size is so I can adjust the accumulation steps and per device train batch size.

Thanks much!

Is there available SFT fine tuning for zephyr-7B families?

Hello, I am so impressed by your models. I tried fine tuning your models with my data and the evaulation_loss is not optimized as shown in the image above. In particular, the blue line is the llama-13b model, and you can see that the zephyr models are performing worse than the llama models when fine tuning, even though the MT performance is much better. Do you have any idea why this is? The script used in my job is based on basic SFTtrainer example on trl library https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py

Thank you!

LoRA + FlashAttention2 speed up？

When fine-tuning Mistral with LoRA, do you think FlashAttention2 helps in speeding up the process? If yes, how significant is the acceleration? Where is the primary acceleration achieved?

Running on single RTX4090

Hi,

Followed the install. Tried to run in shell # Step 1 - SFT. for single gpu as per instructions. i.e --num_processes=1

Kicked up the below error. Is alignment module something we need to seperately pip install?

'''
from alignment import (
ModuleNotFoundError: No module named 'alignment'

'''

Thanks in advance! Jason.

Reproducing SFT results.

I was looking at the logs of your training (from this json file) and realized that the scheduling is messed up.

It's related to the ConstantLength dataset, not computing its actual length. When I train this model, the progress bar and the total number of iterations are calculated from the underlying H4 Dataset (around 208k samples) instead of the packed version that has around 139k packed sequences of 2048.
This affects the scheduler, which does not perform any warmup. I have an 8xA100 node, so I am running 2x grad accum for an adequate batch size 512.

I am sure you are missing a warmup_ratio: 0.1 on the sft configs

~~It would be beneficial to have access to the training logs.~~ I found them on Tensorboard :(

You can follow my training here: https://wandb.ai/capecape/zephyr/runs/zhfrhnr5

PD: When using trl, I manually compute the total number of train steps beforehand to adequately pass the warmup steps to the scheduler. I know the ConstantLength dataset is a generator that yields batches without knowing beforehand how many samples it will have.

Release dSFT data preparation (self-instruct) code?

Role of `prompt` field in SFT

I am planning to run SFT on real chatlogs so naturally I don't have the prompt field like in the Ultrachat dataset. AFAICT, this field is not used to perform SFT so I think I can keep it as an empty string. The code that converts the datapoint into a single string only uses the messages field:

https://github.com/huggingface/alignment-handbook/blob/61a11a5c7d66179ed0a930b0dd12e532fce701dd/src/alignment/data.py#L36C1-L42

Am I missing something here?

DPO loss

I am training DPO with lora, the loss has weird behavior: will decrease sharply at the beginning of each epoch. I wonder if you have same issue before?

Training Finishes Prematurely after Max Length increases

Has anyone else experienced cases where the training finishes early as max length increases?

Ran this script on a custom dataset with the following config. No CUDA errors. It just moved to evaluation before it should have. Also running on a 8XA100 cluster (40GB).

model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: auto
use_flash_attention_2: true

# LoRA arguments
use_peft: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj

# Data training arguments
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 128
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: custom-model
hub_strategy: every_save
learning_rate: 2.0e-05
log_level: info
logging_steps: 5  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 8192
max_steps: -1
num_train_epochs: 1
output_dir: data/zephyr-7b-sft-lora
overwrite_output_dir: true
per_device_eval_batch_size: 8
per_device_train_batch_size: 4
push_to_hub: false
report_to:
- tensorboard
save_strategy: "no"
save_total_limit: null
seed: 42

Got this result:

[INFO|trainer.py:1723] 2023-11-16 03:59:23,969 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-16 03:59:23,969 >>   Num examples = 199,500
[INFO|trainer.py:1725] 2023-11-16 03:59:23,969 >>   Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-16 03:59:23,969 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1729] 2023-11-16 03:59:23,969 >>   Total train batch size (w. parallel, distributed & accumulation) = 4,096
[INFO|trainer.py:1730] 2023-11-16 03:59:23,969 >>   Gradient Accumulation steps = 128
[INFO|trainer.py:1731] 2023-11-16 03:59:23,969 >>   Total optimization steps = 48
[INFO|trainer.py:1732] 2023-11-16 03:59:23,972 >>   Number of trainable parameters = 54,525,952
  0%|                                                                                               | 0/48 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-16 03:59:25,691 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2576 > 2048). Running this sequence through the model will result in indexingerrors
[WARNING|logging.py:314] 2023-11-16 03:59:26,260 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad`method to get a padded encoding.
{'loss': 1.004, 'learning_rate': 1.9978589232386036e-05, 'epoch': 0.02}
{'loss': 0.9924, 'learning_rate': 1.946930129495106e-05, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:42:30<14:42:30, 1231.40s/it][INFO|trainer.py:3158] 2023-11-16 05:48:55,043 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-16 05:48:55,043 >>   Num examples = 10500
[INFO|trainer.py:3163] 2023-11-16 05:48:55,043 >>   Batch size = 8
{'eval_loss': 0.9804360270500183, 'eval_runtime': 113.7535, 'eval_samples_per_second': 92.305, 'eval_steps_per_second': 1.4
51, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:51:24<14:42:30, 1231.40s/it]
[INFO|trainer.py:1955] 2023-11-16 05:50:48,798 >>

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 6684.8263, 'train_samples_per_second': 29.844, 'train_steps_per_second': 0.007, 'train_loss': 1.0607780635356903, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:51:24<15:58:09, 1336.96s/it]
***** train metrics *****
  epoch                    =        0.1
  train_loss               =     1.0608
  train_runtime            = 1:51:24.82
  train_samples            =     199500
  train_samples_per_second =     29.844
  train_steps_per_second   =      0.007
2023-11-16 05:50:48 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3158] 2023-11-16 05:50:48,801 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-16 05:50:48,801 >>   Num examples = 10500
[INFO|trainer.py:3163] 2023-11-16 05:50:48,801 >>   Batch size = 8
 12%|█████████▊                                                                           | 19/165 [01:44<13:21,  5.49s/it]
***** eval metrics *****
  epoch                   =        0.1
  eval_loss               =     0.9817
  eval_runtime            = 0:01:52.87
  eval_samples            =      10500
  eval_samples_per_second =     93.026
  eval_steps_per_second   =      1.462
2023-11-16 05:52:41 - INFO - __main__ - *** Save model ***

Even after lowering it to 4096 tokens, it still ended early, but this time after 20%
When running on the default dataset, same thing occurred but this time at 33%.

Thoughts?

Why does the alignment-handbook account for user & system Inputs in loss calculation

I noticed that the alignment-handbook doesn't ignore the loss calculated from both the user and system inputs Based on my knowledge, many SFT choose to ignore these. I'm curious about the reasoning behind this difference.

Great initiative - can you please share timeline?

Looking forward to it.

Questions about data filtering for zephyr-7b-beta's UltraChat version

I noticed in the model card for zephyr-7b-beta that you mentioned "removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful," resulting in a filtered 200k UltraChat version. Could you please elaborate on the criteria used for this filtering? Would it be possible to open-source the corresponding script? I'm wondering if this could assist me in cleaning and filtering my own SFT data.

"RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1" (DPO + LoRA)

So I'm attempting to run the DPO LoRA script and I'm getting this error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1

... when the model.merge_and_load() runs here:

base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
model.eval()
model = model.merge_and_unload()

Any ideas?

Typo in example commands in zephyr recipe readme

There are some typos in the readme with the zephyr recipe

step 2 dpo for full: path to config_full.yaml is not correct; should be recipes/zephyr-7b-beta/dpo/config_full.yaml
step 1 sft for LoRA: path to config_lora.yaml is not correct; should be recipes/zephyr-7b-beta/sft/config_lora.yaml

Windows installation

It is possible to download and use this entire repo on windows, with the exception of deepspeed. After trying to install the alignment-notebook package I found you can simply remove deepspeed from the setup.py and all the lines with deepspeed, and remove it. Doing so allows you to run “pip install .” With out any errors. Doing this will make multi-gpu’s not work, but that was kinda obvious for anyone who is trying to use multi gpu support. Otherwise everything else in the repo works just great!

SFT checkpoint of zephyr-7b

Hi!

Thanks for releasing the Zephyr-7b-alpha model! I am wondering if it is possible to access the SFT checkpoint of Zephyr-7b (the model prior to DPO training).

As I understand, the SFT training of Zephyr-7b was done based on this script. This is very useful! Since the SFT training script is open-source, could you also release a SFT model checkpoint? Having access to the SFT checkpoint would help us investigate the impact of DPO without requiring significant GPU resources.

Thanks for your consideration!
Tianlin

Misalignment between config_lora.yaml and the model card

Hi, I noticed that in the model card. It says Adam optimizer is used.
However, in the config_lora.yaml file, it uses optim: rmsprop. Could you tell me which one is the actual training configuration?

I don't if there are other hyperparameters I didn't notice. Can you align the scripts with the correct model training configuration, please?

What about the system prompt?

It seems that the system prompt is left to be \n or rather blank.

Inspecting UltraChat (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k?row=5), seems that no system prompt is added to the dataset.

There must be something that I missed in regards to addition of system prompts to the dataset for training, especially since the officially deployed model is able to adhere to system prompt intent (like 'You are a pirate', etc)

Global batch size question

Hi!

Thanks again for the awesome repo. I have a small question regarding the global batch size of DPO training reported in the paper vs used in the code base.

In the paper, it mentions that, for DPO, "We train all models with a global batch size of 32". This is consistent to the the hyperparam of HuggingFaceH4/zephyr-7b-beta.

In the codebase, we are suggested to use 8 GPUs to reproduce zephyr-7b-beta here.

You will require 8 GPUs (80GB of VRAM) to train the full model.

Since per_device_train_batch_size=8 in the recipes/zephyr-7b-beta/dpo/config_full.yaml, this means that the global batch size is 64, and not 32, when using 8 GPUs. While this is different from the paper, the global batchsize = 64 setting is consistent with the hyperparam in alignment-handbook/zephyr-7b-dpo-full.

My guess is that global batchsize = 32 or 64 would give similar performance, say, on MT-bench? Could you confirm it? Many thanks! I am about to launch some experiments, and I wish to get the details correct so as to reproduce the results from the paper as closely as possible 🙏.

Integrate feedback from the community

This issue collects links of community feedback on the type of content to include in the handbook. Feel free to post a comment below with other ideas / requests!

https://twitter.com/waydegilliam/status/1731426451759714352

Question about "ModuleNotFoundError: No module named 'alignment'"

Prectise in python shell is successed, but error is exist when I used "accelarate launch"....

Tokenizer model_max_length

Hello,

I was seeing warning during finetuning Mistral and tracked this line here

https://github.com/huggingface/alignment-handbook/blob/main/src/alignment/model_utils.py#L71

Because Mistral's tokenizer model max length has a large number so the model_max_length set as 2048. However my training data consists sequence length longer than that, e.g. 4000 characters. Would this be a problem?

Thank you!

Get this error on run_sft.py when calling "trainer.push_to_hub": [Rank 0] Watchdog caught collective operation timeout

Here's the call I'm using to run the script:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file examples/hf-alignment-handbook/configs/accelerate_configs/deepspeed_zero3.yaml --num_processes=2 examples/hf-alignment-handbook/run_sft.py examples/hf-alignment-handbook/configs/training_configs/zephyr-7b-beta/config_lora_sft.yaml --load_in_4bit=true

Here's the full trace of the error:

2023-12-01 00:05:43 - INFO - __main__ - Pushing to hub...
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
[2023-12-01 00:35:50,246] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 1817792) of binary: /home/wgilliam/mambaforge/envs/llms/bin/python3.11
Traceback (most recent call last):
  File "/home/wgilliam/mambaforge/envs/llms/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    deepspeed_launcher(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/launch.py", line 695, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Any ideas how to resolve?

Thanks

Is it possible to visualize the loss graph during fine-tuning from Mistral to Zephyr?

I really appreciate you sharing this valuable �project.
Is there a way to view the loss graph when fine-tuning from Mistral to Zephyr? This could be a valuable resource for research.

Training using a custom prompt format and custom dataset

Thank you for sharing the training code.

Can you please point the files or the places needs attention when following the same recipe for to train the model with different prompt template and custom dataset. (Not the chat)

Any suggestions here?

Regards