Hello, thanks for this great work! I tried to run the following command:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
--epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
--learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
--train_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
--unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
--conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
It raised the following exception:
Traceback (most recent call last):
  File "finetune_sdm_yaml.py", line 209, in <module>
    main_worker(parsed_args)
  File "finetune_sdm_yaml.py", line 135, in main_worker
    trainer.setup_model_for_training()
  File "/data1/tao.wu/DisCo/agent.py", line 978, in setup_model_for_training
    self.prepare_dist_model()
  File "/data1/tao.wu/DisCo/agent.py", line 205, in prepare_dist_model
    lr_scheduler=self.scheduler)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/__init__.py", line 181, in initialize
    config_class=config_class)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1196, in _configure_optimizer
    raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'torch.optim.adamw.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.
What might cause this exception? Could anyone help me out? Thanks a lot!
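For reference, the exception message itself lists three ways out: switch to deepspeed.ops.adam.DeepSpeedCPUAdam, define the optimizer in the ds-config, or set "zero_force_ds_cpu_optimizer" to false. Below is a minimal sketch of the third option. It only shows the shape of the config change: the helper name `allow_client_optimizer` and the values in `example_config` are hypothetical, and the thread does not show how DisCo actually builds its DeepSpeed config, so the place to apply this would need to be found in the repo.

```python
import json


def allow_client_optimizer(ds_config: dict) -> dict:
    """Return a shallow copy of ds_config with the flag named in the error message set to False.

    Hypothetical helper for illustration; it is not part of DisCo or DeepSpeed.
    """
    patched = dict(ds_config)  # shallow copy, the caller's dict is not modified
    patched["zero_force_ds_cpu_optimizer"] = False
    return patched


# Hypothetical ZeRO-Offload config; DisCo's real config is not shown in this thread.
example_config = {
    "train_micro_batch_size_per_gpu": 64,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

patched_config = allow_client_optimizer(example_config)
print(json.dumps(patched_config, indent=2))
```

As the message warns, keeping torch.optim.AdamW under ZeRO-Offload will in most cases yield poor performance, so this is a workaround rather than a tuning recommendation.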
Hi, have you solved this problem?
deepspeed==0.6.3
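The reply above appears to suggest pinning DeepSpeed to version 0.6.3 (for example with `pip install deepspeed==0.6.3`); the thread does not say why that helps, so treat it as a reported workaround. A small sketch for checking whether the active environment already matches that pin follows. The helper names are hypothetical, and the check assumes the package exposes a `__version__` attribute.

```python
PINNED_VERSION = "0.6.3"  # version given in the reply above


def installed_deepspeed_version():
    """Return the installed deepspeed version string, or None if it cannot be imported."""
    try:
        import deepspeed
    except ImportError:
        return None
    return getattr(deepspeed, "__version__", None)


def matches_pinned(installed, pinned=PINNED_VERSION):
    """Compare version strings, ignoring a local build suffix such as '+abc123'."""
    return installed is not None and installed.split("+")[0] == pinned


version = installed_deepspeed_version()
if version is None:
    print("deepspeed is not installed in this environment")
elif matches_pinned(version):
    print(f"deepspeed {version} matches the pinned version")
else:
    print(f"deepspeed {version} differs from the pinned version {PINNED_VERSION}")
```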