Hi,

I am trying to execute the following command:

```
MODEL=meta-llama/Llama-2-7b-hf TASK=SQuAD MODE=ft LR=1e-2 EPS=1e-1 bash mezo.sh --non_diff --evaluation_strategy no --save_strategy no --save_model
```

It fails with a CUDA device-side assert on the very first training step. Full output below (the repeated CUDA assertion lines are truncated for readability):

```
mezo-ft-20000-16-1e-2-1e-1-0
BS: 16
LR: 1e-2
EPS: 1e-1
SEED: 0
TRAIN/EVAL STEPS: 20000/4000
MODE: ft
Extra args: --train_as_classification False
2023-09-04 22:10:09,459 - INFO - Created a temporary directory at /tmp/tmpbaz9nwqu
2023-09-04 22:10:09,460 - INFO - Writing /tmp/tmpbaz9nwqu/_remote_module_non_scriptable.py
OurArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eos_token=<EOS_TOKEN>,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=4000,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
head_tuning=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
icl_sfc=False,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.01,
length_column_name=length,
linear_probing=False,
load_best_model_at_end=True,
load_bfloat16=False,
load_float16=True,
load_int8=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=result/SQuAD-Llama-2-7b-hf-mezo-ft-20000-16-1e-2-1e-1-0/runs/Sep04_22-10-09_ip-10-156-46-59.us-east-2.compute.internal,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lora=False,
lora_alpha=16,
lora_r=8,
lp_early_stopping=False,
lr_scheduler_type=constant,
max_grad_norm=1.0,
max_length=2048,
max_new_tokens=50,
max_steps=20000,
metric_for_best_model=loss,
model_name=meta-llama/Llama-2-7b-hf,
mp_parameters=,
no_auto_device=False,
no_cuda=False,
no_eval=False,
no_reparam=True,
non_diff=True,
num_beams=1,
num_dev=500,
num_eval=1000,
num_prefix=5,
num_train=1000,
num_train_epochs=3.0,
num_train_sets=None,
only_train_option=True,
optim=adamw_hf,
optim_args=None,
output_dir=result/SQuAD-Llama-2-7b-hf-mezo-ft-20000-16-1e-2-1e-1-0,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=16,
prediction_loss_only=False,
prefix_init_by_real_act=True,
prefix_tuning=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['mlflow', 'wandb'],
result_file=None,
resume_from_checkpoint=None,
run_name=result/SQuAD-Llama-2-7b-hf-mezo-ft-20000-16-1e-2-1e-1-0,
sampling=False,
save_model=True,
save_on_each_node=False,
save_on_interrupt=False,
save_safetensors=False,
save_steps=4000,
save_strategy=no,
save_total_limit=1,
seed=42,
sfc=False,
sharded_ddp=[],
skip_memory_metrics=True,
tag=mezo-ft-20000-16-1e-2-1e-1-0,
task_name=SQuAD,
temperature=1.0,
tf32=None,
top_k=None,
top_p=0.95,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
train_as_classification=False,
train_set_seed=0,
trainer=zo,
untie_emb=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
verbose=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
zo_eps=0.1,
)
2023-09-04 22:10:09,892 - WARNING - Found cached dataset squad (/home/ec2-user/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
100%|██████████| 2/2 [00:00<00:00, 651.90it/s]
2023-09-04 22:10:15,803 - INFO - Sample train set 1500/87599
2023-09-04 22:10:15,803 - INFO - ... including dev set 500 samples
2023-09-04 22:10:15,803 - INFO - Loading model with FP16...
2023-09-04 22:10:17,146 - WARNING - The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.04s/it]
2023-09-04 22:10:23,337 - INFO - Done with 7.53s
2023-09-04 22:10:23,376 - INFO - Tokenizing training samples...
2023-09-04 22:10:27,600 - INFO - Done with 4.22s
/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2023-09-04 22:10:27,610 - INFO - ***** Running training *****
2023-09-04 22:10:27,610 - INFO - Num examples = 1000
2023-09-04 22:10:27,610 - INFO - Num Epochs = 318
2023-09-04 22:10:27,610 - INFO - Instantaneous batch size per device = 16
2023-09-04 22:10:27,610 - INFO - Total train batch size (w. parallel, distributed & accumulation) = 16
2023-09-04 22:10:27,610 - INFO - Gradient Accumulation steps = 1
2023-09-04 22:10:27,610 - INFO - Total optimization steps = 20000
2023-09-04 22:10:27,611 - INFO - Number of trainable parameters = 6738415616
wandb: Currently logged in as: raahul. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.9 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/wandb/run-20230904_221028-to4hrtwj
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sunny-thunder-76
wandb: ⭐️ View project at https://wandb.ai/raahul/huggingface
wandb: 🚀 View run at https://wandb.ai/raahul/huggingface/runs/to4hrtwj
  0%|          | 0/20000 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [106,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same `srcIndex < srcSelectDimSize` assertion is repeated for many more blocks and threads; a plain-text duplicate of the traceback shown below is also omitted here ...]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/run.py:525 in <module>              │
│ │
│ 522 │ │ │ write_metrics_to_file(metrics, "result/" + result_file_tag(args) + "-onetrai │
│ 523 │
│ 524 if __name__ == "__main__":                                                                   │
│ ❱ 525 │ main() │
│ 526 │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/run.py:491 in main │
│ │
│ 488 │ │ │ │ │ dev_samples = None │
│ 489 │ │ │ │ │
│ 490 │ │ │ │ # Training │
│ ❱ 491 │ │ │ │ framework.train(train_samples, dev_samples if dev_samples is not None el │
│ 492 │ │ │ │ │
│ 493 │ │ │ │ if not args.no_eval: │
│ 494 │ │ │ │ │ metrics = framework.evaluate([], eval_samples) # No in-context learn │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/run.py:428 in train │
│ │
│ 425 │ │ if self.args.resume_from_checkpoint is not None: │
│ 426 │ │ │ last_checkpoint = self.args.resume_from_checkpoint │
│ 427 │ │ │
│ ❱ 428 │ │ trainer.train(resume_from_checkpoint=last_checkpoint) │
│ 429 │ │ │
│ 430 │ │ # Explicitly save the model │
│ 431 │ │ if self.args.save_model: │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py:1662 in train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/trainer.py:529 in │
│ _inner_training_loop │
│ │
│ 526 │ │ │ │ │
│ 527 │ │ │ │ # MeZO added: estimate gradient │
│ 528 │ │ │ │ if args.trainer == "zo": │
│ ❱ 529 │ │ │ │ │ tr_loss_step = self.zo_step(model, inputs) │
│ 530 │ │ │ │ else: │
│ 531 │ │ │ │ │ if ( │
│ 532 │ │ │ │ │ │ ((step + 1) % args.gradient_accumulation_steps != 0) │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/trainer.py:774 in zo_step │
│ │
│ 771 │ │ │
│ 772 │ │ # First function evaluation │
│ 773 │ │ self.zo_perturb_parameters(scaling_factor=1) │
│ ❱ 774 │ │ loss1 = self.zo_forward(model, inputs) │
│ 775 │ │ │
│ 776 │ │ # Second function evaluation │
│ 777 │ │ self.zo_perturb_parameters(scaling_factor=-2) │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/trainer.py:722 in zo_forward │
│ │
│ 719 │ │ model.eval() │
│ 720 │ │ if self.args.non_diff: │
│ 721 │ │ │ # Non-differentiable objective (may require autoregressive generation) │
│ ❱ 722 │ │ │ return self.zo_forward_nondiff(model, inputs) │
│ 723 │ │ │
│ 724 │ │ with torch.inference_mode(): │
│ 725 │ │ │ inputs = self._prepare_inputs(inputs) │
│ │
│ /home/ec2-user/rdp-works-llm-finetune/mezo/MeZO/large_models/trainer.py:744 in │
│ zo_forward_nondiff │
│ │
│ 741 │ │ with torch.inference_mode(): │
│ 742 │ │ │ inputs = self._prepare_inputs(inputs) │
│ 743 │ │ │ args = self.args │
│ ❱ 744 │ │ │ outputs = self.model.generate( │
│ 745 │ │ │ │ inputs["input_ids"], do_sample=args.sampling, temperature=args.temperatu │
│ 746 │ │ │ │ num_beams=args.num_beams, top_p=args.top_p, top_k=args.top_k, max_new_to │
│ 747 │ │ │ │ num_return_sequences=1, eos_token_id=[self.tokenizer.encode(args.eos_tok │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in │
│ decorate_context │
│ │
│ 112 │ @functools.wraps(func) │
│ 113 │ def decorate_context(*args, **kwargs): │
│ 114 │ │ with ctx_factory(): │
│ ❱ 115 │ │ │ return func(*args, **kwargs) │
│ 116 │ │
│ 117 │ return decorate_context │
│ 118 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/generation/utils.py:1437 in │
│ generate │
│ │
│ 1434 │ │ │ │ ) │
│ 1435 │ │ │ │
│ 1436 │ │ │ # 11. run greedy search │
│ ❱ 1437 │ │ │ return self.greedy_search( │
│ 1438 │ │ │ │ input_ids, │
│ 1439 │ │ │ │ logits_processor=logits_processor, │
│ 1440 │ │ │ │ stopping_criteria=stopping_criteria, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/generation/utils.py:2248 in │
│ greedy_search │
│ │
│ 2245 │ │ │ model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) │
│ 2246 │ │ │ │
│ 2247 │ │ │ # forward pass to get next token │
│ ❱ 2248 │ │ │ outputs = self( │
│ 2249 │ │ │ │ **model_inputs, │
│ 2250 │ │ │ │ return_dict=True, │
│ 2251 │ │ │ │ output_attentions=output_attentions, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in │
│ _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py │
│ :687 in forward │
│ │
│ 684 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 685 │ │ │
│ 686 │ │ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) │
│ ❱ 687 │ │ outputs = self.model( │
│ 688 │ │ │ input_ids=input_ids, │
│ 689 │ │ │ attention_mask=attention_mask, │
│ 690 │ │ │ position_ids=position_ids, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in │
│ _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py │
│ :536 in forward │
│ │
│ 533 │ │ │ attention_mask = torch.ones( │
│ 534 │ │ │ │ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embe │
│ 535 │ │ │ ) │
│ ❱ 536 │ │ attention_mask = self._prepare_decoder_attention_mask( │
│ 537 │ │ │ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_len │
│ 538 │ │ ) │
│ 539 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py │
│ :464 in _prepare_decoder_attention_mask │
│ │
│ 461 │ │ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] │
│ 462 │ │ combined_attention_mask = None │
│ 463 │ │ if input_shape[-1] > 1: │
│ ❱ 464 │ │ │ combined_attention_mask = _make_causal_mask( │
│ 465 │ │ │ │ input_shape, │
│ 466 │ │ │ │ inputs_embeds.dtype, │
│ 467 │ │ │ │ device=inputs_embeds.device, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py │
│ :49 in _make_causal_mask                                                                         │
│ │
│ 46 │ Make causal mask used for bi-directional self-attention. │
│ 47 │ """ │
│ 48 │ bsz, tgt_len = input_ids_shape │
│ ❱ 49 │ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=de │
│ 50 │ mask_cond = torch.arange(mask.size(-1), device=device) │
│ 51 │ mask.masked_fill(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) │
│ 52 │ mask = mask.to(dtype) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run sunny-thunder-76 at: https://wandb.ai/raahul/huggingface/runs/to4hrtwj
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230904_221028-to4hrtwj/logs
```
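
From what I can tell, `srcIndex < srcSelectDimSize` is the assertion that fires when an index handed to an embedding / `index_select` lookup is out of range (for example a token id greater than or equal to the number of rows in the embedding table), and since CUDA reports these errors asynchronously, the `_make_causal_mask` frame in the traceback is probably not where the bad index actually originated. Below is the minimal sanity check I would run first; it is only a sketch (the prompt string is a placeholder for one of the tokenized SQuAD samples, and the model is loaded the same way as in my command), not part of MeZO:

```python
import os

# Surface the failing kernel synchronously, as the error message itself suggests.
# This must be set before CUDA is initialized, i.e. before torch touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder prompt: substitute one of the tokenized SQuAD training samples.
prompt = "Question: ... Context: ... Answer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The embedding lookup asserts when any id >= the number of embedding rows,
# so compare the largest token id against the embedding table size.
num_rows = model.get_input_embeddings().num_embeddings
print("max input id:", input_ids.max().item(), "| embedding rows:", num_rows)
assert input_ids.max().item() < num_rows, "token id out of range for the embedding"
```

If the ids are all in range, rerunning the original command with `CUDA_LAUNCH_BLOCKING=1` prepended should at least make the traceback point at the op that actually fails rather than at a later, unrelated kernel.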