
badam's Issues

Can BlockOptimizer switch learning rates?

Thank you for your excellent work!

I have a quick question: When using the BlockOptimizer to switch blocks, is it possible to simultaneously switch other optimizer parameters, such as the learning rate and weight decay rate?
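
Not an official answer, but since BlockOptimizer wraps a plain torch optimizer, one way to experiment with this is to keep a handle on the base optimizer and edit its param_groups at every switch interval. Below is a minimal sketch under assumptions: model, train_loader, and compute_loss are placeholders, the 0.9 decay factor is purely illustrative, and whether edited values survive a block switch should be verified against the current BlockOptimizer implementation.

import torch
from badam import BlockOptimizer

# Base optimizer whose param_groups we will edit between block switches.
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

K = 50  # must match switch_block_every
optimizer = BlockOptimizer(
    base_optimizer=base_optimizer,      # can be any torch.optim.Optimizer
    named_parameters_list=list(model.named_parameters()),
    switch_block_every=K,
    switch_mode="ascending",
    block_prefix_list=None,
)

for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch)   # placeholder loss computation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Illustrative per-block schedule: decay the learning rate and the weight
    # decay each time the active block is about to switch. Check that
    # BlockOptimizer does not overwrite these values when it rebuilds its
    # param groups at the switch.
    if (step + 1) % K == 0:
        for group in base_optimizer.param_groups:
            group["lr"] *= 0.9
            group["weight_decay"] *= 0.9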

With the same data, BAdam is much slower than LoRA

On the same dataset, training with LoRA is much faster than with BAdam, but according to the paper BAdam should be the faster one.
My DeepSpeed config file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

The configuration in LLaMA-Factory is as follows:
stage: pt
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 2
deepspeed: examples/deepspeed/ds_z3_config_badam.json
badam_mode: layer

BAdam pre-training takes too long, even longer than full-parameter pre-training

This is the PT (pre-training) stage, not SFT.
I have already made the change.
BAdam is version 1.2.2.
These are the current results.
LoRA (target: all parameters):
0%|▏ | 2/2019 [01:57<32:50:42, 58.62s/it]
BAdam, full parameters:
1%|▉ | 49/8076 [1:17:24<211:39:01, 94.92s/i

And this is full-parameter pre-training:
0%|▏ | 3/2019 [04:01<45:43:10, 81.64s/it]
So why does BAdam take so long?
The model is Qwen1.5-7B.

How to reproduce the loss curves of BAdam and LoRA from the paper

Hello, I am very interested in your work and tried to reproduce the experiments in the paper.
I followed the experiment setup described in Section 3.1 of the paper. The learning rate is 1e-5, the batch size is 8, the gradient accumulation is 15, and the LoRA rank is 100.

Here is my badam config
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type full \
    --use_badam \
    --badam_switch_mode ascending \
    --badam_switch_block_every 10 \
    --badam_verbose 2 \
    --output_dir ../../saves/LLaMA2-7B/badam/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 15 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 0 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --report_to wandb \
    --num_train_epochs 7 \
    --plot_loss \
    --bf16

Here is my lora config
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_rank 100 \
    --output_dir ../../saves/LLaMA2-7B/lora/lora-llama-2-7b-r16-batch8-acc15-epoch7-1e-5 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 15 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 0 \
    --save_steps 1000 \
    --report_to wandb \
    --learning_rate 1e-5 \
    --num_train_epochs 7 \
    --plot_loss \
    --bf16


When I ran the experiment, I found that the BAdam run used about 23 GB of GPU memory, while the LoRA run used about 39 GB on an A100-80G (but only about 24 GB on an RTX 3090-24G). I am wondering if my experiment config is correct.

I expected training loss curves like this:
[expected loss-curve figure from the paper]
But what I actually got was this (purple for BAdam, blue for LoRA):
[actual loss-curve figure]

Floating point exception when running sft with lora

Hi, I'm wondering whether you have any ideas about the "Floating point exception (core dumped)" that occurs when running SFT with LoRA in this project. I tried using float32 when stepping the optimizer, but it didn't seem to help. Is it a problem with the torch environment or something else?
Your research and projects have really helped me a lot, thank you so much!
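
Not a fix, but one way to narrow down a hard crash like this is Python's built-in faulthandler, which prints a Python traceback when a fatal signal such as SIGFPE is delivered. A small sketch; train_bash.py stands in for whatever entry script is actually being run.

# Run the script as: python -X faulthandler train_bash.py ...
# or, equivalently, add these two lines at the very top of the entry script:
import faulthandler
faulthandler.enable()  # dump a Python traceback on SIGFPE, SIGSEGV, SIGABRT, ...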

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After modifying a training script, I get an error saying there is no gradient. Here is the code I modified:

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset_merge if training_args.do_train else None,
        eval_dataset=eval_dataset_merge if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics if training_args.predict_with_generate else None,
        # optimizers = (optimizer, None)
    )

    if model_args.use_badam:
        try:
            from badam import BlockOptimizer
        except ImportError:
            # raise ImportError("Unable to import badam module. Please install badam or disable its usage.")
            logger.info("Unable to import badam module. Please install badam or disable its usage.")

        # Optimizer
        trainer.create_optimizer_and_scheduler(num_training_steps=training_args.max_steps)
        original_optimizer = trainer.optimizer

        # before training, add this line to wrap the original optimizer
        trainer.optimizer = BlockOptimizer(
            base_optimizer=original_optimizer,  # can be any torch.Optimizer
            named_parameters_list=list(model.named_parameters()),
            # Switch to a new block every 100 updates (the $K$ Adam steps in the
            # paper). It can be set adaptively as $K = n/(BD)$, where $n$ is the
            # number of training data points, $B$ is the batch size, and $D$ is
            # the number of blocks in BAdam; see the "Hyperparameter Suggestion"
            # section for a detailed explanation of this hyperparameter.
            switch_block_every=100,
            # Update order of the blocks: "random" (random reshuffling),
            # "ascending" (input layer to output layer), or "descending"
            # (output layer to input layer). The default is "random".
            switch_mode="random",
            verbose=2,  # information level; prints trainable parameters when set to 2
            block_prefix_list=None,
        )


The error message is as follows. Does anyone know how to solve this? Thanks a lot!

Traceback (most recent call last):
  File "/hy-tmp/mnmt/m2m/scripts/run_translation_fast.py", line 700, in <module>
    main()
  File "/hy-tmp/mnmt/m2m/scripts/run_translation_fast.py", line 616, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/accelerate/accelerator.py", line 2011, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
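
One thing that may be worth checking, as a hedged suggestion rather than a confirmed fix: if gradient checkpointing is enabled while only one block of parameters is trainable, the checkpointed inputs may not require grad, which yields exactly this RuntimeError. The BlockOptimizerRatio issue below reports that model.enable_input_require_grads(), a standard transformers PreTrainedModel method, resolved a similar error. A minimal sketch:

# Call this before trainer.train(), so the embedding outputs require grad and
# the backward graph is not empty while most parameters are frozen.
if model_args.use_badam:
    model.enable_input_require_grads()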

Is multi-GPU training supported?

I reproduced Llama3-8B with very good results and would like to try the 70B model. Will multi-GPU training be supported?

Error when using BlockOptimizerRatio

Thanks for your impressive work!

I have tried to run BlockOptimizer in my code, which uses the Hugging Face Trainer and DeepSpeed ZeRO-3. At first there was an error, which was solved by model.enable_input_require_grads(). The results were not good, so I changed the optimizer. However, when I switch to BlockOptimizerRatio, a new error occurs:

When I use DeepSpeed ZeRO-3, the error is:
RuntimeError: copy_() between dense and sparse Tensors is not implemented! Found self type = CUDABFloat16Type and src type = SparseCUDABFloat16Type

If I do not use DeepSpeed, the error is:
NotImplementedError: Cannot access storage of SparseTensorImpl

How can I solve it and use BlockOptimizerRatio in my own code?
Thank you
