
badam's Issues

Can BlockOptimizer switch learning rates?

Thank you for your excellent work!

I have a quick question: When using the BlockOptimizer to switch blocks, is it possible to simultaneously switch other optimizer parameters, such as the learning rate and weight decay rate?
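
Not an official answer, but since BlockOptimizer wraps a plain torch optimizer, one way to experiment with this is to keep a handle on the base optimizer and edit its param_groups at every switch interval. Below is a minimal sketch under assumptions: model, train_loader, and compute_loss are placeholders, the 0.9 decay factor is purely illustrative, and whether edited values survive a block switch should be verified against the current BlockOptimizer implementation.

import torch
from badam import BlockOptimizer

# Base optimizer whose param_groups we will edit between block switches.
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

K = 50  # must match switch_block_every
optimizer = BlockOptimizer(
    base_optimizer=base_optimizer,      # can be any torch.optim.Optimizer
    named_parameters_list=list(model.named_parameters()),
    switch_block_every=K,
    switch_mode="ascending",
    block_prefix_list=None,
)

for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch)   # placeholder loss computation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Illustrative per-block schedule: decay the learning rate and the weight
    # decay each time the active block is about to switch. Check that
    # BlockOptimizer does not overwrite these values when it rebuilds its
    # param groups at the switch.
    if (step + 1) % K == 0:
        for group in base_optimizer.param_groups:
            group["lr"] *= 0.9
            group["weight_decay"] *= 0.9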

With the same data, BAdam is much slower than LoRA

On the same dataset, training with LoRA is much faster than with BAdam, but according to the paper BAdam should be the faster one.
My DeepSpeed config file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

The configuration in LLaMA-Factory is as follows:
stage: pt
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 2
deepspeed: examples/deepspeed/ds_z3_config_badam.json
badam_mode: layer

BAdam pre-training takes too long, even longer than full-parameter pre-training

This is the PT (pre-training) stage, not SFT.
I have already made the change.
BAdam is version 1.2.2.
These are the current results.
LoRA (target: all parameters):
0%|▏ | 2/2019 [01:57<32:50:42, 58.62s/it]
BAdam, full parameters:
1%|▉ | 49/8076 [1:17:24<211:39:01, 94.92s/i

And this is full-parameter pre-training:
0%|▏ | 3/2019 [04:01<45:43:10, 81.64s/it]
So why does BAdam take so long?
The model is Qwen1.5-7B.

How to reproduce the loss curves of BAdam and LoRA from the paper

Hello, I am very interested in your work and tried to reproduce the experiments in the paper.
I followed the experiment setup described in Section 3.1 of the paper. The learning rate is 1e-5, the batch size is 8, the gradient accumulation is 15, and the LoRA rank is 100.

Here is my badam config
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type full \
    --use_badam \
    --badam_switch_mode ascending \
    --badam_switch_block_every 10 \
    --badam_verbose 2 \
    --output_dir ../../saves/LLaMA2-7B/badam/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 15 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 0 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --report_to wandb \
    --num_train_epochs 7 \
    --plot_loss \
    --bf16

Here is my lora config
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_rank 100 \
    --output_dir ../../saves/LLaMA2-7B/lora/lora-llama-2-7b-r16-batch8-acc15-epoch7-1e-5 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 15 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 0 \
    --save_steps 1000 \
    --report_to wandb \
    --learning_rate 1e-5 \
    --num_train_epochs 7 \
    --plot_loss \
    --bf16


When I ran the experiment, I found that the BAdam run used about 23 GB of GPU memory, while the LoRA run used about 39 GB on an A100-80G (but only about 24 GB on an RTX 3090-24G). I am wondering if my experiment config is correct.

I expected training loss curves like this:
[expected loss-curve figure from the paper]
But what I actually got was this (purple for BAdam, blue for LoRA):
[actual loss-curve figure]

Floating point exception when running sft with lora

Hi, I'm wondering whether you have any ideas about the "Floating point exception (core dumped)" that occurs when running SFT with LoRA in this project. I tried using float32 when stepping the optimizer, but it didn't seem to help. Is it a problem with the torch environment or something else?
Your research and projects have really helped me a lot, thank you so much!
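
Not a fix, but one way to narrow down a hard crash like this is Python's built-in faulthandler, which prints a Python traceback when a fatal signal such as SIGFPE is delivered. A small sketch; train_bash.py stands in for whatever entry script is actually being run.

# Run the script as: python -X faulthandler train_bash.py ...
# or, equivalently, add these two lines at the very top of the entry script:
import faulthandler
faulthandler.enable()  # dump a Python traceback on SIGFPE, SIGSEGV, SIGABRT, ...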

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After modifying a training script, I get an error saying there is no gradient. Here is the code I modified:

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset_merge if training_args.do_train else None,
        eval_dataset=eval_dataset_merge if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics if training_args.predict_with_generate else None,
        # optimizers = (optimizer, None)
    )

    if model_args.use_badam:
        try:
            from badam import BlockOptimizer
        except ImportError:
            # raise ImportError("Unable to import badam module. Please install badam or disable its usage.")
            logger.info("Unable to import badam module. Please install badam or disable its usage.")

        # Optimizer
        trainer.create_optimizer_and_scheduler(num_training_steps=training_args.max_steps)
        original_optimizer = trainer.optimizer

        # before training, add this line to wrap the original optimizer
        trainer.optimizer = BlockOptimizer(
            base_optimizer=original_optimizer,  # can be any torch.Optimizer
            named_parameters_list=list(model.named_parameters()),
            # Switch to a new block every 100 updates (the $K$ Adam steps in the
            # paper). It can be set adaptively as $K = n/(BD)$, where $n$ is the
            # number of training data points, $B$ is the batch size, and $D$ is
            # the number of blocks in BAdam; see the "Hyperparameter Suggestion"
            # section for a detailed explanation of this hyperparameter.
            switch_block_every=100,
            # Update order of the blocks: "random" (random reshuffling),
            # "ascending" (input layer to output layer), or "descending"
            # (output layer to input layer). The default is "random".
            switch_mode="random",
            verbose=2,  # information level; prints trainable parameters when set to 2
            block_prefix_list=None,
        )


The error message is as follows. Does anyone know how to solve this? Thanks a lot!

Traceback (most recent call last):
  File "/hy-tmp/mnmt/m2m/scripts/run_translation_fast.py", line 700, in <module>
    main()
  File "/hy-tmp/mnmt/m2m/scripts/run_translation_fast.py", line 616, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/accelerate/accelerator.py", line 2011, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/miniconda3/envs/nmt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
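
One thing that may be worth checking, as a hedged suggestion rather than a confirmed fix: if gradient checkpointing is enabled while only one block of parameters is trainable, the checkpointed inputs may not require grad, which yields exactly this RuntimeError. The BlockOptimizerRatio issue below reports that model.enable_input_require_grads(), a standard transformers PreTrainedModel method, resolved a similar error. A minimal sketch:

# Call this before trainer.train(), so the embedding outputs require grad and
# the backward graph is not empty while most parameters are frozen.
if model_args.use_badam:
    model.enable_input_require_grads()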

Is multi-GPU training supported?

I reproduced Llama3-8B with very good results and would like to try the 70B model. Will multi-GPU training be supported?

Error when using BlockOptimizerRatio

Thanks for your impressive work!

I have tried to run BlockOptimizer in my code, which uses the Hugging Face Trainer and DeepSpeed ZeRO-3. At first there was an error, which was solved by model.enable_input_require_grads(). The results were not good, so I changed the optimizer. However, when I switch to BlockOptimizerRatio, a new error occurs:

When I use DeepSpeed ZeRO-3, the error is:
RuntimeError: copy_() between dense and sparse Tensors is not implemented! Found self type = CUDABFloat16Type and src type = SparseCUDABFloat16Type

If I do not use DeepSpeed, the error is:
NotImplementedError: Cannot access storage of SparseTensorImpl

How can I solve it and use BlockOptimizerRatio in my own code?
Thank you
