researchmm / mm-diffusion
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
License: MIT License
pip install -r requirements.txt
does not work, since the file in the repo is named requirement.txt.
pip install -r requirement.txt
should fix it.
Hello, thank you for this awesome project. I have a question: the dataset consists of 10-second video clips, but why are the generated samples only one second long? How can I sample 10 seconds of audio?
May I ask which audio encoders, video encoders, and corresponding decoders are used in the model? The paper does not seem to mention them.
I am also curious about the hardware requirements for training the model. Thank you!
Hello, I encountered the following error when running the sampling command `bash ssh_scripts/multimodal_sample_sr.sh`.
Is this related to my Python version (3.8.18) or PyTorch version (1.12.1+cu116)? How should I fix it? Thank you for your help!
Traceback (most recent call last):
File "py_scripts/multimodal_train.py", line 7, in <module>
from mm_diffusion import dist_util, logger
File "/root/MM-Diffusion/mm_diffusion/dist_util.py", line 8, in <module>
import blobfile as bf
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/__init__.py", line 6, in <module>
from blobfile._ops import (
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_ops.py", line 19, in <module>
from blobfile._common import DirEntry, Stat, RemoteOrLocalPath
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_common.py", line 1025, in <module>
RemoteOrLocalPath = Union[str, BlobPathLike, os.PathLike[str]]
TypeError: 'ABCMeta' object is not subscriptable
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[46321,1],0]
Exit code: 1
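The TypeError comes from `os.PathLike[str]` inside blobfile: subscripting `os.PathLike` at runtime is only supported from Python 3.9 onward, so on Python 3.8 the import fails with `'ABCMeta' object is not subscriptable`. The practical fixes are upgrading Python to ≥3.9 or pinning an older blobfile release. A minimal version-aware sketch of the failing alias (blobfile's real alias also includes its own `BlobPathLike` type, omitted here):

```python
import os
import sys
from typing import Union


def remote_or_local_path_type():
    """Build a path-type alias that works on both old and new Python.

    On Python < 3.9, os.PathLike is a bare ABC (metaclass ABCMeta), and
    evaluating os.PathLike[str] raises:
        TypeError: 'ABCMeta' object is not subscriptable
    which is exactly the error in the traceback above.
    """
    if sys.version_info >= (3, 9):
        return Union[str, os.PathLike[str]]  # safe on 3.9+
    return Union[str, os.PathLike]  # unparameterized fallback for 3.8
```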
Hi, could I know how many training steps are used for training? Many thanks.
I have trained on the Landscape dataset for nearly 180,000 steps, but the samples I obtain are still not as clear as those from the pretrained Landscape model. Can you please tell me how many steps the Landscape model was originally trained for?
Hello Authors, thank you for the wonderful work!
While trying to reproduce the results with the pretrained models, I found a mismatch with the metrics reported in the paper.
One small bug I found in the evaluation script is that the FAD is multiplied by 1e3 instead of 1e4.
Even ignoring that, the results do not match, and I believe it could be because the shared model was trained for a different number of iterations than the one reported in the paper.
Can you provide details about the pretrained models that you have shared?
Thank you so much!
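To illustrate the scaling discrepancy described above (a hypothetical sketch, not the repo's actual evaluation code; the 1e3/1e4 factors are taken from the issue text):

```python
# The paper reports FAD scaled by 1e4; if the evaluation script multiplies
# by 1e3 instead, every reported FAD value is off by a factor of 10.
PAPER_FAD_SCALE = 1e4
SCRIPT_FAD_SCALE = 1e3  # the suspected bug


def rescale_fad(script_value: float) -> float:
    """Convert a script-scaled (1e3) FAD value to the paper's 1e4 scale."""
    return script_value / SCRIPT_FAD_SCALE * PAPER_FAD_SCALE
```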
Hello, authors!
Could you share the GPU requirements for the ssh scripts?
I currently use 2 * 3090 Ti GPUs (2 * 24 GB), but a CUDA OOM error occurs.
Are there any other ways to reduce GPU memory usage?
Hi,
I found a file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" in the training script. Is this file the same as landscape.pt among the open-source models? I am looking forward to your answer, thank you very much.
Thanks for the great work! However, directly running the following commands does not produce the same FVD/KVD/FAD scores as in the paper. May I ask for the data and configurations to reproduce the paper's results, i.e., FVD=117.20, KVD=5.78, FAD=10.72?
bash ssh_scripts/multimodal_sample_sr.sh
bash ssh_scripts/multimodal_eval.sh
When running the above commands, the output I got is:
evaluate for 2048 samples
metric:{'fvd': 338.2535400390625, 'kvd': 10.005603799600976, 'fad': 1.3610674068331718}
Thanks for your excellent work! :)
I encountered a bug. In MM-Diffusion/evaluations/fvd/fvd.py, the pretrained FVD model URL (Google Drive) returns a 404. Could you update the code? Thanks!
```python
import os
import requests

# ROOT, get_confirm_token, and save_response_content are defined
# elsewhere in evaluations/fvd/fvd.py.
def download(id, fname):
    destination = os.path.join(ROOT, fname)
    if os.path.exists(destination):
        return destination
    os.makedirs(ROOT, exist_ok=True)
    URL = 'https://drive.google.com/uc?export=download'
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:  # large files require a confirmation token
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)
    return destination
```
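One reason this endpoint breaks: Google Drive's legacy `uc?export=download` URL now often returns an HTML interstitial (or a 404) for shared files. A hedged workaround observed in the wild (not an official, stable API) is the `drive.usercontent.google.com` host with `confirm=t`; a small URL-builder sketch:

```python
def gdrive_download_url(file_id: str) -> str:
    # The legacy 'drive.google.com/uc?export=download' endpoint frequently
    # serves an HTML warning page instead of the file. The usercontent host
    # with confirm=t tends to serve the raw bytes directly; this is an
    # observed workaround, not a documented stable interface.
    return (
        "https://drive.usercontent.google.com/download"
        f"?id={file_id}&export=download&confirm=t"
    )
```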
Thanks for your excellent work :)
But I ran into a problem:
/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE'. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Traceback (most recent call last):
File "py_scripts/multimodal_sample_sr.py", line 309, in
main()
File "py_scripts/multimodal_sample_sr.py", line 125, in main
sample = dpm_solver.sample(
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1293, in sample
x = self.singlestep_dpm_solver_update(x, vec_s, vec_t, order, solver_type=solver_type, r1=r1, r2=r2)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1060, in singlestep_dpm_solver_update
return self.singlestep_dpm_solver_third_update(x, s, t, return_intermediate=return_intermediate, solver_type=solver_type, r1=r1, r2=r2)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 819, in singlestep_dpm_solver_third_update
model_s = self.model_fn(x, s)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 449, in model_fn
return self.noise_prediction_fn(x, t)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 417, in noise_prediction_fn
return self.model(x, t)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 350, in model_fn
return noise_pred_fn(x, t_continuous)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 305, in noise_pred_fn
video_output,audio_output = model(x["video"], x["audio"], t_input, **model_kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 1085, in forward
video, audio = module(video, audio, emb)#
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 45, in forward
video, audio = layer(video, audio)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 694, in forward
return self.video_conv(video), self.audio_conv(audio)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 96, in forward
video = self.video_conv_spatial(video)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
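The final RuntimeError means an fp16 ("Half") convolution was executed on the CPU: PyTorch only implements half-precision conv kernels for CUDA, so the sampling process has evidently fallen back to CPU (e.g. CUDA is not visible to it). A minimal sketch of a version of the device/dtype choice that avoids this (assumed helper, not code from the repo):

```python
import torch


def safe_device_and_dtype():
    """Pick a device/dtype pair that PyTorch conv kernels support.

    fp16 convolutions are only implemented for CUDA; running them on CPU
    raises: RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'.
    """
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16
    # CPU fallback must use full precision.
    return torch.device("cpu"), torch.float32
```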
When I download the videos from Google Drive, the clips in these two datasets have no sound. Could you share the versions of the datasets with audio? Thank you so much.
Hi,
Is it possible to generate a single consistent character from a pose sequence for about 5 seconds?
I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate a 5-second output video with a consistent character/avatar that dances, etc., driven by the controlled (pose) input.
In other words: given an OpenPose + hands + face video, I want to generate human-like animation (anything, as long as the character/avatar is consistent).
Sample Video
P.S. Any model that supports pose + hands + face can be used!
Thanks
Best regards
I would like to ask whether the computational complexity in the paper is correct.
Should it be O((SHW) * (F * T/F)) instead of O((SHW) * (S * T/F))?
I think in RS-MMA there are F audio patches (each of length T/F), and each audio patch attends to a video patch (of size SHW). Thus, the computational complexity should be O((SHW) * (F * T/F)).
May I ask if my idea is correct? Your comments will be really appreciated.
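For reference, the questioner's argument (restated here as given, not a confirmed correction of the paper) can be written out as:

```latex
% RS-MMA attention cost as argued above: the audio of length $T$ is split
% into $F$ patches of length $T/F$, and each audio patch attends to a
% video patch of size $S \cdot H \cdot W$. Summing over the $F$ patches:
\mathcal{O}\!\left( (SHW) \cdot F \cdot \tfrac{T}{F} \right)
  = \mathcal{O}\!\left( SHW \cdot T \right),
% versus the paper's stated form, where $S$ rather than $F$ multiplies $T/F$:
\mathcal{O}\!\left( (SHW) \cdot S \cdot \tfrac{T}{F} \right).
```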
In the requirement file, "mkl-fft==1.3.1 mkl-random==1.2.2" requires numpy>=1.22.3,<1.23.0, yet numpy==1.23.5 is also pinned, which causes a conflict.
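A minimal sketch of why the resolver fails (the version bounds are taken from the issue text above; the helper name is hypothetical):

```python
# mkl-fft==1.3.1 / mkl-random==1.2.2 require numpy >=1.22.3,<1.23.0,
# while requirement.txt also pins numpy==1.23.5 -- no version satisfies both.
def satisfies_mkl_bound(version: tuple) -> bool:
    """True if a numpy version tuple fits mkl-fft's declared constraint."""
    return (1, 22, 3) <= version < (1, 23, 0)

# The pinned numpy==1.23.5 violates the bound, hence the pip conflict;
# relaxing the pin to something inside the bound (e.g. 1.22.4) is one
# possible workaround, assuming the rest of the code tolerates it.
```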
Are you planning to release the source code and the checkpoints of this work in the future? Thank you.