Comments (16)
Sorry, it should be "guided-diffusion_64_256_upsampler.pt". README and scripts have been updated.
from mm-diffusion.
Got it! Thanks so much for such a quick reply!
Also, I would like to know which file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" in multimodal_train.sh refers to.
Training the multimodal-generation model requires no initialization; the script has been updated accordingly.
Thanks for your reply!
Hello, sorry to disturb you again. The supplemental material of the paper says training uses 32 V100s with a batch size of 128, but the open-source training script uses a batch size of 4 on a single GPU. Could you share the training script you used in your experiments? I look forward to your reply, thank you very much.
The batch size applies to a single GPU. For example, with "--GPU 0,1,2,3,4,5,6,7" and "mpiexec -n 8 python ...", the total batch size is 4*8=32. Our training uses 4 nodes, i.e. 32 GPUs in total; you need to run the scripts across multiple nodes according to the requirements of your own cluster.
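A minimal sketch of the arithmetic above, assuming a single node with 8 GPUs and a per-GPU batch size of 4 (the multi-node launch line is only illustrative; the actual invocation depends on your own cluster and scheduler):

```shell
# Effective batch size = per-GPU batch size * number of GPU processes.
PER_GPU_BATCH=4
NUM_GPUS=8
EFFECTIVE_BATCH=$((PER_GPU_BATCH * NUM_GPUS))
echo "effective batch size: ${EFFECTIVE_BATCH}"   # prints: effective batch size: 32

# A single-node launch would then look like the following; multi-node runs
# additionally need a hostfile or scheduler integration (e.g. Slurm):
# mpiexec -n ${NUM_GPUS} python py_scripts/multimodal_train.py \
#     --batch_size ${PER_GPU_BATCH} --devices 0,1,2,3,4,5,6,7 ...
```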
Got it, thank you!
Hello, I am training on the AIST++ dataset with 8 A100 cards, each with a batch size of 12 (96 overall). After 10,000 training steps, the sampled video and audio are still pure noise, and I'm not sure why.
How long do you expect training to take before it converges to a reasonable result?
My training script is below; does it differ from your original script in any way?
#!/bin/bash
#################256 x 256 unconditional#########################################################
MODEL_FLAGS="--cross_attention_resolutions 2,4,8 --cross_attention_windows 1,4,8
--cross_attention_shift True --dropout 0.1
--video_attention_resolutions 2,4,8
--audio_attention_resolutions -1
--video_size 16,3,64,64 --audio_size 1,25600 --learn_sigma False --num_channels 128
--num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True
--use_scale_shift_norm True --num_workers 12"
# Modify --devices to your own GPU ID
TRAIN_FLAGS="--lr 0.0001 --batch_size 12
--devices 0,1,2,3,4,5,6,7 --log_interval 1 --save_interval 500 --use_db False" #--schedule_sampler loss-second-moment
DIFFUSION_FLAGS="--noise_schedule linear --diffusion_steps 1000 --save_type mp4 --sample_fn dpm_solver++"
# Modify the following paths to your own paths
DATA_DIR="/nvme/datasets/video_diffusion/AIST++_crop/train"
OUTPUT_DIR="debug"
NUM_GPUS=8
mpiexec -n $NUM_GPUS python py_scripts/multimodal_train.py --data_dir ${DATA_DIR} --output_dir ${OUTPUT_DIR} $MODEL_FLAGS $TRAIN_FLAGS $VIDEO_FLAGS $DIFFUSION_FLAGS
Looking forward to your reply, thank you!
In my experiments, meaningful results appear after about 50,000 steps.
I recommend setting --save_interval 10000 to save storage.
Set --sample_fn ddpm when testing intermediate checkpoints: before the model has converged sufficiently, accelerated sampling methods can produce worse results.
You can follow this advice and continue training from your current checkpoint.
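Concretely, the suggestions map onto two flags in the training script posted above; a sketch of the modified variables (all other values are carried over unchanged from that script):

```shell
# Save checkpoints every 10000 steps instead of every 500 to reduce storage use.
TRAIN_FLAGS="--lr 0.0001 --batch_size 12 \
  --devices 0,1,2,3,4,5,6,7 --log_interval 1 --save_interval 10000 --use_db False"

# Sample intermediate checkpoints with plain DDPM rather than dpm_solver++;
# accelerated samplers amplify errors while the model is still under-trained.
DIFFUSION_FLAGS="--noise_schedule linear --diffusion_steps 1000 --save_type mp4 --sample_fn ddpm"
```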
Thank you very much for your reply, I am sure it will help with my experiments!
Hello @ludanruan , thanks for sharing this information. I was wondering what the average time in hours is to reach meaningful results (or the average step time) on the Landscape or AIST++ datasets?
@ludanruan Perhaps my question was not clear: I meant the average training time in hours/days to reach 50,000 iterations with V100 GPUs (as reported in the paper).
Thanks, best
Thank you for the information! I needed it since I am planning to do research on this :)