
DisCo's People

Contributors

kevinlin311tw, wangt-cn, yhzhai


DisCo's Issues

human_img_edit_gradio.ipynb run error

cf = import_filename(args.cf)
Net, inner_collect_fn = cf.Net, cf.inner_collect_fn

Here args.cf is config/ref_attn_clip_combine_controlnet/app_demo_image_edit.py, but app_demo_image_edit.py does not define Net or inner_collect_fn.
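For context, import_filename presumably loads a module from a file path, roughly like the sketch below (an assumption; the actual helper lives in the DisCo utilities), which is why the config file must define both symbols:

    import importlib.util

    def import_filename(path):
        # Load a Python file as a module, the way the training/demo scripts load configs.
        spec = importlib.util.spec_from_file_location("cf_module", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    cf = import_filename("config/ref_attn_clip_combine_controlnet/app_demo_image_edit.py")
    Net, inner_collect_fn = cf.Net, cf.inner_collect_fn  # AttributeError if the config lacks them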

About the extra TikTok-style data

Hi @Wangt-CN, thanks for the great work!
I have noticed that you collected an additional 250 TikTok-style short videos from the internet. Would you consider uploading them? That would let us compare against the released model trained on them.

How can I get the skeleton key point information from your processed data?

Hello,
I would like to obtain the skeleton key point information from your processed data.
After reading the pose image data out of the tsv file, I analyzed the pixel values and found that they do not match the RGB values in the colors defined in create_custom_dataset_tsvs.py. It looks as if the pose images were stored in the tsv with lossy compression?
Looking forward to your reply, thank you!
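A minimal sketch of the check described above, assuming the pose images are stored base64-encoded in a TSV column (the file name and column position below are hypothetical):

    import base64
    import io
    from PIL import Image

    with open("TikTok_pose_images.tsv") as f:          # hypothetical file name
        row = f.readline().rstrip("\n").split("\t")
    img = Image.open(io.BytesIO(base64.b64decode(row[-1]))).convert("RGB")  # assume image is the last column
    # Most frequent colors; a lossy (e.g. JPEG) re-encoding would show values slightly
    # off from the exact RGB entries in create_custom_dataset_tsvs.py.
    print(sorted(img.getcolors(maxcolors=img.width * img.height), reverse=True)[:10])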

Problem when training with 4090

Hi, thanks for the great work. I'm trying to run the training code, but when I run pre-training on multiple 4090 GPUs it always gets stuck without any output. Training on multiple 3090s or on a single 4090 works fine, so I strongly suspect a deadlock on the 4090s. I have narrowed the problem down to deepspeed.initialize in agent.py, but I don't know how to solve it.

Any response will be greatly appreciated.

About BatchSize in Fine-tuning with Disentangled Control

Thank you for the great work @Wangt-CN.
For fine-tuning with disentangled control on TikTokDance, the paper states that "it is trained on 8 NVIDIA V100 GPUs for 70K steps with an image size of 256 × 256 and a learning rate of 2e−4". I would like to know the value of local_batch_size in this case.
Thanks a lot.

About the training data

Thanks for sharing this great work!

I see that you uploaded the training data to Google Cloud in tsv format. It is inconvenient for me to download the data from Google Cloud. Could you please upload a copy to another cloud storage service, such as Google Drive, Aliyun, or Baidu Yun?

Thanks

'GIT/{:05d}/labels/{:04d}.txt' How to get this file?

Hello author, I would like to ask how to obtain the files referenced by self.anno_path = 'GIT/{:05d}/labels/{:04d}.txt' for images in the TikTok dataset.

    if 'youtube' in anno_pose_path:
        img_key = self.anno_list[idx % self.num_images]
    else:
        anno = list(open(anno_path))
        img_key = json.loads(anno[0].strip())['image_key']
    """
    example:
    {"num_region": 6, "image_key": "TiktokDance_00001_0002.png", "image_split": "00001", "image_read_error": false}
    {"box_id": 0, "class_name": "aerosol_can", "norm_bbox": [0.5, 0.5, 1.0, 1.0], "conf": 0.0, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
    {"box_id": 1, "class_name": "person", "norm_bbox": [0.46692365407943726, 0.4977584183216095, 0.9338473081588745, 0.995516836643219], "conf": 0.912740170955658, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
    {"box_id": 2, "class_name": "butterfly", "norm_bbox": [0.2368704378604889, 0.5088028907775879, 0.1444256454706192, 0.04199704900383949], "conf": 0.8738771677017212, "region_caption": "a brown butterfly sitting on an orange background.", "caption_conf": 0.9297735554473283}
    {"box_id": 3, "class_name": "butterfly", "norm_bbox": [0.6688584089279175, 0.5137135982513428, 0.11311062425374985, 0.05455022677779198], "conf": 0.8287128806114197, "region_caption": "a brown butterfly sitting on an orange wall.", "caption_conf": 0.9264783379302365}
    {"box_id": 4, "class_name": "blouse", "norm_bbox": [0.4692786931991577, 0.6465241312980652, 0.9283269643783569, 0.6027728319168091], "conf": 0.6851752400398254, "region_caption": "a woman wearing an orange shirt with butterflies on it.", "caption_conf": 0.9978814544264754}
    {"box_id": 5, "class_name": "short_pants", "norm_bbox": [0.44008955359458923, 0.8769687414169312, 0.8799525499343872, 0.2431662678718567], "conf": 0.6741859316825867, "region_caption": "a person wearing an orange shirt and grey sweatpants.", "caption_conf": 0.9731313580907464}
    """

Training/Validation Data Split

Hi, thanks for your great work. I checked the TikTok tsv dataset and found that you've already split it into a training set and a validation set. Since it's not easy to match each image back to the original sequence IDs of the dataset, could you please clarify which sequences of the original TikTok dataset (000 to 340) are used for training and which for validation? Thanks!

Composite Caption Linelist Creation?

Hi! Thanks for your DisCo paper and the explanation of the TSV file preparation.

In the composite yaml file, a 'caption linelist' file is referenced:
caption_linelist: train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.caption.linelist.tsv
Could you explain how this file is generated?

LPIPS evaluation should add `normalize=True`

Hi, thanks for the great work.
I noticed that the LPIPS evaluation does not pass normalize=True even though the inputs are in the [0, 1] range. Adding it changes the result from 0.292 to 0.339. Despite this increase, the result still remains significantly better than the baseline.

import os

import lpips
import numpy as np
from torchvision import transforms
from tqdm import tqdm


def compute_lpips(gen_inst_name_full, gt_inst_name_full):
    gen_inst_name_full = sorted(gen_inst_name_full)
    gt_inst_name_full = sorted(gt_inst_name_full)
    convert_tensor = transforms.ToTensor()  # yields tensors in [0, 1]
    loss_fn_vgg = lpips.LPIPS(net='vgg')
    scores = []
    for gen_path, gt_path in tqdm(zip(gen_inst_name_full, gt_inst_name_full)):
        gen_filename = os.path.splitext(os.path.basename(gen_path))[0]
        gt_filename = os.path.splitext(os.path.basename(gt_path))[0]
        assert gen_filename == gt_filename, 'file mismatch'
        # load_image is the repo's image-loading helper.
        image1 = convert_tensor(load_image(gen_path)).unsqueeze(0)
        image2 = convert_tensor(load_image(gt_path)).unsqueeze(0)
        # normalize=True is not passed, so LPIPS treats the [0, 1] inputs
        # as if they were already in [-1, 1].
        score = loss_fn_vgg(image1, image2).item()
        scores.append(score)
    score_ave = np.mean(scores)
    return score_ave

LPIPS:
https://github.com/richzhang/PerceptualSimilarity/blob/31bc1271ae6f13b7e281b9959ac24a5e8f2ed522/lpips/lpips.py#L112-L115
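For reference, the change suggested above is a one-argument fix; normalize=True tells LPIPS that the inputs are in [0, 1] and should be rescaled to [-1, 1] internally:

    # Fix suggested in this issue:
    score = loss_fn_vgg(image1, image2, normalize=True).item()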

unexpected_keys when loading data from mp_rank_00_model_states.pt

Hi, thanks for your great work.
I only changed '--root_dir', '--pretrained_model', and '--pretrained_model_path' according to my local settings. But when I run this cell in human_img_edit_gradio.ipynb:
## prepare the eval
logger.warning("Do eval_visu...")
if getattr(args, 'refer_clip_preprocess', None):
    eval_dataset = BaseDataset(args, args.val_yaml, split='val', preprocesser=model.feature_extractor)
else:
    eval_dataset = BaseDataset(args, args.val_yaml, split='val')
eval_dataloader, eval_info = make_data_loader(
    args, args.local_eval_batch_size, eval_dataset)

trainer = Agent_LDM(args=args, model=model)
trainer.eval_demo_pre()
it seems that many ControlNet weights fail to load from mp_rank_00_model_states.pt.
After the Gradio demo launches, the background and pose conditions have no effect.
Please help me out, thanks.

support multi-gpus

Hi, I've tried running the code on multiple GPUs, but it doesn't seem to utilize all the available GPU resources. Could you please provide some guidance on how I should modify the code, or which commands I should use, to enable multi-GPU support? Thank you very much for your help.

deepspeed.runtime.zero.utils.ZeRORuntimeException

Hello, thanks for this great work! When I tried to run the following command

AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
--epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
--learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
--train_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
--unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
--conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0

I got the following exception:

Traceback (most recent call last):
  File "finetune_sdm_yaml.py", line 209, in <module>
    main_worker(parsed_args)
  File "finetune_sdm_yaml.py", line 135, in main_worker
    trainer.setup_model_for_training()
  File "/data1/tao.wu/DisCo/agent.py", line 978, in setup_model_for_training
    self.prepare_dist_model()
  File "/data1/tao.wu/DisCo/agent.py", line 205, in prepare_dist_model
    lr_scheduler=self.scheduler)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/__init__.py", line 181, in initialize
    config_class=config_class)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1196, in _configure_optimizer
    raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'torch.optim.adamw.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.

I wonder what causes this exception. Could anyone help me out? Thanks a lot!
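For what it's worth, the exception message itself names two ways out. A minimal sketch of the corresponding DeepSpeed config (the keys are standard DeepSpeed options; the values are illustrative assumptions, not the repo's actual settings):

    # Either let DeepSpeed create the optimizer from the config ...
    ds_config = {
        "train_micro_batch_size_per_gpu": 64,
        "zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
        # ... or keep the client-provided torch.optim.AdamW and opt out of the check:
        # "zero_force_ds_cpu_optimizer": False,
    }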

why this effect?

Hi, amazing work!
I tried it today, but the result is quite bad: the face is severely deformed and the clothes are also changed.
[screenshot: 20230805-131420]

Can you tell me how to solve this problem?

Incorrect parameter name in config scripts

When running the Gradio Demo, this error kept appearing while it was loading the pre-trained UNet: TypeError: get_down_block() got an unexpected keyword argument 'attn_num_head_channels'

Looking at how the other parameters were named, I tried changing it to 'attention_head_dim', however that then created this error: TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'

Once I expanded the error and viewed it in full, I noticed num_attention_heads was mentioned yet this was not present in any of the scripts. Therefore, I tried changing the parameter name to this and the code ran successfully.

Hence, all instances of attn_num_head_channels in the following scripts need to be changed to num_attention_heads:

  • controlnet_main.py
  • controlnet.py
  • unet_2d_condition.py
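A minimal sketch of the mechanical rename described above (the file names come from the list; back up the files before running anything like this):

    from pathlib import Path

    # Rename the keyword argument in the three scripts listed above.
    for name in ["controlnet_main.py", "controlnet.py", "unet_2d_condition.py"]:
        path = Path(name)
        text = path.read_text()
        path.write_text(text.replace("attn_num_head_channels", "num_attention_heads"))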

change sd models

Hello, thank you for the great code. However, I have some slight concerns about the image quality,
so I wanted to ask if it's possible to replace the sd-image-variations-diffusers model with another model.
It seems difficult to make an immediate change due to the image_encoder file.

Thank you, and I hope you have a wonderful day.

Human specific finetune

Hi, thanks a lot for this great work!

I am trying to run the fine-tuning code with the provided instructions, but the code references a lot of data that I don't have (TikTok data, etc.). Is all of this data needed for fine-tuning? Could you perhaps clarify the structure of the fine-tuning data and how to reference it?

Thanks!

Multi-GPU failed to run

Hi, thanks for the great work.

My GPU: 2080ti * 10

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 mpirun -np 8 python finetune_sdm_yaml.py \
    --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir run_test \
    --local_train_batch_size 8 --local_eval_batch_size 8 --log_dir exp/tiktok_pretrain \
    --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
    --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
    --train_yaml /data/mfyan/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml \
    --val_yaml /data/mfyan/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
    --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
    --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0

The first error was RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12475 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12475 (errno: 98 - Address already in use). Then I changed the port number in utils/dist.py to something else, but the same type of error was still reported, so I changed the port number to random.randint(10000, 20000) and it worked. But then I found all 8 processes running only on GPU 0, resulting in RuntimeError: CUDA error: out of memory.

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like /home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers is not the path to a directory containing a scheduler_config.json file. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 355, in load_config
    config_file = hf_hub_download(
  File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 112, in _inner_fn
    validate_repo_id(arg_value)
  File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/DisCo/finetune_sdm_yaml.py", line 209, in <module>
    main_worker(parsed_args)
  File "/data/DisCo/finetune_sdm_yaml.py", line 90, in main_worker
    model = Net(args)
  File "/data/DisCo/config/ref_attn_clip_combine_controlnet_attr_pretraining/net.py", line 38, in __init__
    tr_noise_scheduler = DDPMScheduler.from_pretrained(
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/schedulers/scheduling_utils.py", line 139, in from_pretrained
    config, kwargs, commit_hash = cls.load_config(
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 391, in load_config
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like /home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.
How can I fix this? Thanks.

About video frame consistency

Thanks for your great work!
I am curious how this model handles video frame consistency; the paper does not seem to address this issue.

I tried video pose transfer and the result is as follows (far from what the paper shows; am I missing some steps?):

out.mp4

Some issues about deepspeed

Hi, thank you very much for your amazing work. I have successfully run the Gradio Demo using the model you provided.

However, I encountered the following error during the fine-tuning phase with fp16 DeepSpeed:
[INFO] [stage_1_and_2.py:1651:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
How should I configure deepspeed.py to solve this problem?
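This is not from the repo, but for context: that message comes from DeepSpeed's dynamic loss scaler, whose knobs live in the fp16 section of the DeepSpeed config. A minimal sketch using standard DeepSpeed keys (the values are illustrative assumptions, not the repo's defaults):

    # Hypothetical fp16 block of a DeepSpeed config.
    fp16_config = {
        "fp16": {
            "enabled": True,
            "loss_scale": 0,            # 0 = dynamic loss scaling
            "initial_scale_power": 16,  # start at 2**16 instead of 2**30
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    }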

How can I successfully run inference on multiple GPUs?

Whichever port I use for multi-GPU inference, I always get an "address already in use" error.

For example, I set "export MASTER_PORT=65530" before inference on multiple GPUs and then I will get an error as follows:

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:65530 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
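One generic way around hard-coded rendezvous port collisions (a sketch, not the repo's utility) is to let the OS pick a free port before launching:

    import os
    import socket

    def find_free_port():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))  # port 0 = let the kernel choose an unused port
            return s.getsockname()[1]

    os.environ["MASTER_PORT"] = str(find_free_port())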

There is something wrong with ./annotator/grounded-sam/run.py

I modified the args and ran
python ./annotator/grounded-sam/run.py --dataset_root ./single/ --partition 1
Under the groundsam_vis folder I got 001.png.mask.jpg and 001.png.mask.png. The original picture size is 540 x 960; 001.png.mask.png is also 540 x 960, but it is all black. 001.png.mask.jpg's foreground is yellow and its background is purple,
but its size is 1299 x 2310.

For HAP training, it stopped after step 47999

My log looks like this:
[screenshot: 1689042442307]

As for metric.json, it contains:
{"Step 0": {"eval": {"FID": 290.6777284435151, "time": "0:03:31.655149"}}, "Epoch2": {"train": {"loss_total": 0.09805237877084273, "time": "2:00:58.214095"}}, "Step2000": {"eval": {"FID": 39.84386908484396, "time": "0:04:10.178784"}}, "Epoch3": {"train": {"loss_total": 0.08273238215732012, "time": "1:05:12.567647"}}, "Step4000": {"eval": {"FID": 43.90990567108605, "time": "0:04:07.709888"}}, "Epoch4": {"train": {"loss_total": 0.08007360270170316, "time": "0:12:27.958903"}}, "Epoch5": {"train": {"loss_total": 0.07951762963632482, "time": "2:10:26.386287"}}, "Step6000": {"eval": {"FID": 47.38045255087343, "time": "0:04:09.181442"}}, "Epoch6": {"train": {"loss_total": 0.0778031051322654, "time": "1:17:38.705795"}}, "Step8000": {"eval": {"FID": 44.090458160650826, "time": "0:04:09.042723"}}, "Epoch7": {"train": {"loss_total": 0.07623744776395902, "time": "0:24:55.471577"}}, "Epoch8": {"train": {"loss_total": 0.07622077868163159, "time": "2:22:49.398254"}}, "Step10000": {"eval": {"FID": 31.904819727152358, "time": "0:04:09.004639"}}, "Epoch9": {"train": {"loss_total": 0.07504117791188147, "time": "1:30:05.510775"}}, "Step12000": {"eval": {"FID": 27.483082697985367, "time": "0:04:09.374085"}}, "Epoch10": {"train": {"loss_total": 0.07409243798521284, "time": "0:37:20.796301"}}, "Epoch11": {"train": {"loss_total": 0.07395339482515068, "time": "2:35:10.650475"}}, "Step14000": {"eval": {"FID": 31.168737757947156, "time": "0:04:09.885283"}}, "Epoch12": {"train": {"loss_total": 0.07372492651599219, "time": "1:42:29.514715"}}, "Step16000": {"eval": {"FID": 27.21106589500107, "time": "0:04:08.266375"}}, "Epoch13": {"train": {"loss_total": 0.07312383905231748, "time": "0:49:47.800272"}}, "Epoch14": {"train": {"loss_total": 0.07289745142666333, "time": "2:47:37.351242"}}, "Step18000": {"eval": {"FID": 23.106254980103927, "time": "0:04:07.923059"}}, "Epoch15": {"train": {"loss_total": 0.07242165016734459, "time": "1:54:56.220633"}}, "Step20000": {"eval": {"FID": 27.248582831371834, "time": "0:04:08.544414"}}, "Epoch16": {"train": {"loss_total": 0.07194297805632631, "time": "1:02:14.190223"}}, "Step22000": {"eval": {"FID": 24.803106175247194, "time": "0:04:07.721933"}}, "Epoch17": {"train": {"loss_total": 0.07178588963246771, "time": "0:09:32.512436"}}, "Epoch18": {"train": {"loss_total": 0.07133925958989136, "time": "2:07:23.062100"}}, "Step24000": {"eval": {"FID": 24.043788111684535, "time": "0:04:08.888000"}}, "Epoch19": {"train": {"loss_total": 0.07101746586461861, "time": "1:14:44.514627"}}, "Step26000": {"eval": {"FID": 21.995370168790316, "time": "0:04:09.391666"}}, "Epoch20": {"train": {"loss_total": 0.0711026608135349, "time": "0:21:59.766503"}}, "Epoch21": {"train": {"loss_total": 0.07048133608498951, "time": "2:19:49.997902"}}, "Step28000": {"eval": {"FID": 24.72485237611329, "time": "0:04:08.996139"}}, "Epoch22": {"train": {"loss_total": 0.07035766966611788, "time": "1:27:07.072584"}}, "Step30000": {"eval": {"FID": 23.718398035524274, "time": "0:04:08.026119"}}, "Epoch23": {"train": {"loss_total": 0.07033483951472409, "time": "0:34:28.609709"}}, "Epoch24": {"train": {"loss_total": 0.0700808005451255, "time": "2:32:20.204781"}}, "Step32000": {"eval": {"FID": 22.28186004474327, "time": "0:04:09.402551"}}, "Epoch25": {"train": {"loss_total": 0.06941193997643072, "time": "1:39:37.250015"}}, "Step34000": {"eval": {"FID": 22.008737701972393, "time": "0:04:08.751428"}}, "Epoch26": {"train": {"loss_total": 0.06975648029961369, "time": "0:46:56.325861"}}, "Epoch27": {"train": {"loss_total": 
0.06926694908420368, "time": "2:44:55.461831"}}, "Step36000": {"eval": {"FID": 20.646719371436802, "time": "0:04:09.315956"}}, "Epoch28": {"train": {"loss_total": 0.06902978069161715, "time": "1:51:58.898925"}}, "Step38000": {"eval": {"FID": 20.947136795766653, "time": "0:04:08.238444"}}, "Epoch29": {"train": {"loss_total": 0.06883154165577786, "time": "0:59:21.183679"}}, "Step40000": {"eval": {"FID": 21.91535396280665, "time": "0:04:08.135138"}}, "Epoch30": {"train": {"loss_total": 0.0685038694108908, "time": "0:06:39.512788"}}, "Epoch31": {"train": {"loss_total": 0.06833649117448559, "time": "2:04:39.496344"}}, "Step42000": {"eval": {"FID": 21.03997708430046, "time": "0:04:08.954742"}}, "Epoch32": {"train": {"loss_total": 0.06803931710874384, "time": "1:11:49.735317"}}, "Step44000": {"eval": {"FID": 20.869328025712207, "time": "0:04:08.273694"}}, "Epoch33": {"train": {"loss_total": 0.06827863852959126, "time": "0:19:08.113691"}}, "Epoch34": {"train": {"loss_total": 0.06779086924764614, "time": "2:16:59.691355"}}, "Step46000": {"eval": {"FID": 21.228485860542378, "time": "0:04:07.829478"}}, "Epoch35": {"train": {"loss_total": 0.06772437718459348, "time": "1:24:18.001555"}}, "Step48000": {"eval": {"FID": 21.43211396473822, "time": "0:04:08.974074"}}, "Epoch36": {"train": {"loss_total": 0.06746692015109836, "time": "0:31:33.721808"}}}
Does the code have an early-stopping mechanism, or did my run encounter an error?
I ran the code with:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py \
    --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo \
    --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
    --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
    --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
    --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml \
    --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
    --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
    --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0 >> log.txt 2>&1

For human-specific fine-tuning, I can't run the code

I set up my dataset like the toy_dataset you provided. However, the run fails with this problem:
Original Traceback (most recent call last):
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 569, in __getitem__
    raw_data = self.get_img_txt_pair(idx)
  File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 512, in get_img_txt_pair
    anno = list(open(anno_path))
FileNotFoundError: [Errno 2] No such file or directory: './719__242.png'


Difficulty finding the inference function

Hi, @Wangt-CN, first off, great work!!

I want to run inference through code, not Gradio. I searched for the function to do that; the closest I found is Agent_LDM, but it takes a reference foreground, background, and skeleton. Is there a function that just takes an image (with a character in it) and a skeleton, and returns the output?

Additionally, is there any function for end-to-end video generation?

Thanks

Just another: Will the code be compatible with PyTorch 2?

No module named 'torch._six'

When I run the Google Colab DisCo_Demo.ipynb, the following error occurs:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-25-7aa731641df4> in <cell line: 6>()
      4
      5 from utils.wutils_ldm import *
----> 6 from agent import Agent_LDM, WarmupLinearLR, WarmupLinearConstantLR
      7 import torch
      8 from config import BasicArgs

3 frames
/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py in <module>
     16
     17 import torch
---> 18 from torch._six import inf
     19 import torch.distributed as dist
     20

ModuleNotFoundError: No module named 'torch._six'

But 'torch._six' only exists in torch==1.7.0 or earlier. So when I then tried to install that version, the following error occurred.

!pip install pip install torch==1.7.0
Requirement already satisfied: pip in /usr/local/lib/python3.10/dist-packages (23.1.2)
Collecting install
  Downloading install-1.3.5-py3-none-any.whl (3.2 kB)
ERROR: Could not find a version that satisfies the requirement torch==1.7.0 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.7.0

Please revise the Google Colab notebook to use a torch version of 2.0.0 or above.
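For reference, a minimal compatibility patch sketch (an assumption about a possible fix, not an official one): recent PyTorch removed torch._six, so the import in deepspeed/runtime/utils.py could fall back to the standard library:

    # Hypothetical patch for the failing import at deepspeed/runtime/utils.py line 18:
    try:
        from torch._six import inf
    except ModuleNotFoundError:
        from math import inf  # same value; torch._six was removed in newer PyTorch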

Video frame 'expand' when performing FVD

Hi, thank you for your great work! I have a question about the FVD evaluation. I intend to follow this work, but I have some problems when evaluating FVD. (The other quantitative results are consistent with the paper.)

When I check the configuration of the videos generated from the gifs (in tool/metrics/utils.py, 'DatasetFVDVideoResize'), I find that each video has the shape [128, 112, 112, 3], even though the gif has only 16 frames. So I checked the ffmpeg call in tool/metrics/utils.py line 358

out, _ = (ffmpeg.input(path).output('pipe:', format='rawvideo', pix_fmt='rgb24').run(capture_stdout=True, quiet=False))

it outputs something like the following, which means it turns the 16-frame gif into a 128-frame video (which is then segmented into 8 pieces for the num_seg parameter):

Input #0, gif, from '/root/autodl-tmp/DisCo/run_test/exp/tiktok_ft/outputs//pred_gs1.5_scale-cond1.0-ref1.0_gif/TiktokDance_00337_0010png.gif': Duration: 00:00:05.28, start: 0.000000, bitrate: 866 kb/s Stream #0:0: Video: gif, bgra, 256x256, 3.03 fps, 24.25 tbr, 100 tbn, 100 tbc

Output #0, rawvideo, to 'pipe:': Metadata: encoder : Lavf58.29.100 Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 256x256, q=2-31, 38141 kb/s, 24.25 fps, 24.25 tbn, 24.25 tbc Metadata: encoder : Lavc58.54.100 rawvideo
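A quick way to confirm the frame expansion (a minimal sketch reusing the same ffmpeg call; the path and the 256x256 size are taken from the log above):

    import ffmpeg
    import numpy as np

    path = "TiktokDance_00337_0010png.gif"  # path taken from the log above
    out, _ = (ffmpeg.input(path)
              .output("pipe:", format="rawvideo", pix_fmt="rgb24")
              .run(capture_stdout=True, quiet=True))
    frames = np.frombuffer(out, np.uint8).reshape(-1, 256, 256, 3)
    print(frames.shape[0])  # prints 128 here, even though the gif itself has only 16 frames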

If I set the fps in gen_eval.sh to 25 (so the video has 16 frames), the FVD-3DRN50 becomes 96.15 (using the "More TikTok-Style Training Data (FID-FVD: 15.7)" checkpoint);
even if I don't change the fps (keeping it at 3), the FVD-3DRN50 is 20.34, which differs from the paper.

So I have 3 questions about this evaluation:

  1. Should we change the fps in gen_eval.sh?
  2. Like #25, I evaluate with FID-VID: resnet-50-kinetics.pth from "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth" (MD5 a044310dff79e2688c342d55a0b202d2) and FVD: i3d_pretrained_400.pt from "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit" (MD5 c275f5caff95bea0b712515feedad130). Are these two correct for the evaluation?
  3. In #27, the authors say the evaluation uses sequences 335-340 plus 5 OL videos, but the provided new10val_TiktokDance-poses-masks.yaml outputs 337/338/201/202/203. Maybe the correct yaml would lead to the paper's FVD results?

Thank you!

About the training data.

Thanks for your great work!
I am curious about the Human Attribute Pre-training stage: did you pre-train the model with the full-body images in SHHQ, or only with cropped upper-body images (e.g. as in the TikTok video results you show)?

Can we use any other controls instead of pose?

Great work @Wangt-CN. Currently the repo uses 2D keypoints to control the pose of the output. Is it possible to replace pose with Canny- or depth-based control? If yes, would it require only replacing the ControlNet model, or retraining the complete model? Thanks.

About the reference image

What are your criteria for choosing the reference image in 1) pre-training, 2) general fine-tuning, and 3) human-specific fine-tuning, respectively? Are they all the first images of a dataset?

Pre-training dataset

Thank you very much for such outstanding work. Will the pre-training dataset be open-sourced?

Human Specific Finetuning

Hi,

Could you please provide some more information about the human specific finetuning model?

I tried running it and generated checkpoint files; however, their dictionary keys are wildly different from those of the provided checkpoint, mp_rank_00_model_states.pt:

My checkpoint: dict_keys(['models', 'optimizer', 'epoch', 'global_step', 'scheduler'])

Checkpoint provided: dict_keys(['module', 'buffer_names', 'optimizer', 'param_shapes', 'lr_scheduler', 'sparse_tensor_module_names', 'skipped_steps', 'global_steps', 'global_samples', 'dp_world_size', 'mp_world_size', 'ds_config', 'ds_version'])

Therefore, when I try to generate new images using my checkpoint, it fails in the load_checkpoint_for_deepspeed_diff_gpu function with this message:

Traceback (most recent call last):
  File "/home/emily/DisCo/VideoGenerationModel/run.py", line 645, in <module>
    trainer.eval_demo_pre()
  File "/home/emily/DisCo/agent.py", line 422, in eval_demo_pre
    self.prepare_dist_model()
  File "/home/emily/DisCo/agent.py", line 199, in prepare_dist_model
    self.load_checkpoint_for_deepspeed_diff_gpu(self.pretrained_model)  # load pt model with default pytorch
  File "/home/emily/DisCo/agent.py", line 813, in load_checkpoint_for_deepspeed_diff_gpu
    adaptively_load_state_dict(self.model, checkpoint['module'])
KeyError: 'module'

I'm not really sure what to do about this issue, as it seems the new checkpoints are supposed to be made this way. Please advise. Many thanks :)
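As a stop-gap, a purely hypothetical workaround sketch (not the repo's intended flow) would be to load whichever key the checkpoint actually carries before calling adaptively_load_state_dict:

    import torch

    # DeepSpeed checkpoints store weights under 'module', while the custom
    # checkpoint described above uses 'models'.
    checkpoint = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
    state_dict = checkpoint["module"] if "module" in checkpoint else checkpoint["models"]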

How to try demo with my own dataset?

Hi, I'd like to try the demo with another dataset. I followed PREPRO.md and successfully ran the GroundSAM and OpenPose scripts.

The question is that OpenPose does not produce images like those in demo_data/pose_img/*.png; it outputs the keypoints as JSON and draws the skeleton directly on the original RGB images. Is there a script to generate pose images in that style? Also, the pose image is resized to 256x256; if my dataset images are larger, should I crop the pose area and resize it to 256 first?

e.g. (the OpenPose result I got)
00001.jpg.json.txt
00001.jpg

Hoping for your response, thanks!
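Not from the repo, but as a rough illustration of rendering an OpenPose JSON output as a skeleton-on-black image (the limb pairs below are only an illustrative subset of the BODY_25 connections, and the paths and size are hypothetical):

    import json
    import cv2
    import numpy as np

    # Illustrative subset of BODY_25 limb connections (not the full OpenPose set).
    LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
             (1, 8), (8, 9), (9, 10), (10, 11), (8, 12), (12, 13), (13, 14)]

    def render_pose(json_path, height, width, out_path, conf_thresh=0.1):
        with open(json_path) as f:
            people = json.load(f)["people"]
        canvas = np.zeros((height, width, 3), dtype=np.uint8)  # black background
        for person in people:
            kps = np.array(person["pose_keypoints_2d"]).reshape(-1, 3)  # (x, y, conf) per joint
            for a, b in LIMBS:
                if kps[a, 2] > conf_thresh and kps[b, 2] > conf_thresh:
                    pa = (int(kps[a, 0]), int(kps[a, 1]))
                    pb = (int(kps[b, 0]), int(kps[b, 1]))
                    cv2.line(canvas, pa, pb, (0, 255, 0), 3)
                    cv2.circle(canvas, pa, 4, (0, 0, 255), -1)
        cv2.imwrite(out_path, canvas)

    render_pose("00001.jpg.json", 960, 540, "00001_pose.png")  # hypothetical paths and size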

How should I set the learning rate on 8x V100 32GB GPUs?

These are the parameters I used:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py \
    --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo \
    --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
    --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
    --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
    --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml \
    --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
    --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
    --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
This is the loss I got after Human Attribute Pretraining:
Metering:{'loss_total': '0.0667'}: 100%|██████████| 55280/55280 [81:48:04<00:00, 5.33s/it]
[screenshot: 1689906765826]

I noticed that your paper mentions that all pre-training experiments are conducted on 4x8 NVIDIA V100 GPUs for 25K steps with image size 256×256 and learning rate 1e−3.
Because I only have one-fourth of the GPUs you used, should I reduce the learning rate to one-fourth of 1e-3?
Also, what was your loss after training at this stage? Thank you!

Question about training data structure

Hi, I would like to use a different dataset for the second fine-tuning step. How should I structure the data in the format you provide? For example, how can I obtain the files train_images.lineidx and train_images.lineidx.8b?
Could you provide a brief tutorial on how to use tsv_file_ops.py and tsv_file.py?
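Not an official recipe, but a minimal sketch under two assumptions: that .lineidx stores one decimal byte offset per TSV row, and that .lineidx.8b stores the same offsets as 8-byte little-endian integers (both inferred, so please verify against tsv_file.py):

    import struct

    def build_lineidx(tsv_path, lineidx_path, lineidx_8b_path):
        # Record the byte offset at which every row of the .tsv file starts.
        offsets, pos = [], 0
        with open(tsv_path, "rb") as f:
            for line in f:
                offsets.append(pos)
                pos += len(line)
        with open(lineidx_path, "w") as f:
            f.writelines(f"{o}\n" for o in offsets)
        with open(lineidx_8b_path, "wb") as f:
            for o in offsets:
                f.write(struct.pack("<q", o))  # 8-byte little-endian offsets

    build_lineidx("train_images.tsv", "train_images.lineidx", "train_images.lineidx.8b")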

Incorrect FID-VID and FVD

Thanks for great work. @Wangt-CN

I tried to reproduce the results using "gen_eval.sh," but I noticed that the FID-VID and FVD do not match the results reported in the paper. Can you help me with this issue? Is it possible that I am using the incorrect checkpoints?

[Screenshot 2023-07-28 14:11:42]

Downloaded checkpoints:
pth: TikTok Training Data (FID-FVD: 18.8)

FID-VID:resnet-50-kinetics.pth : "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth"

FVD: i3d_pretrained_400.pt : "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit"
