researchmm / mm-diffusion
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
License: MIT License
pip install -r requirements.txt
does not work, since the file in the repo is named requirement.txt.
pip install -r requirement.txt
should fix it.
Hello, thank you for this awesome project. I have a question: the dataset consists of 10-second video clips, but why are the generated samples only one second long? How can I sample 10 seconds of audio?
May I ask which audio encoders, video encoders, and corresponding decoders are used in the model? The paper does not seem to mention them.
I am also curious about the hardware requirements for training the model. Thank you!
Hello, I encountered the following error when running the sampling command `bash ssh_scripts/multimodal_sample_sr.sh`.
Is this related to my Python version (3.8.18) or PyTorch version (1.12.1+cu116)? How should I fix it? Thank you for your help!
Traceback (most recent call last):
File "py_scripts/multimodal_train.py", line 7, in <module>
from mm_diffusion import dist_util, logger
File "/root/MM-Diffusion/mm_diffusion/dist_util.py", line 8, in <module>
import blobfile as bf
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/__init__.py", line 6, in <module>
from blobfile._ops import (
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_ops.py", line 19, in <module>
from blobfile._common import DirEntry, Stat, RemoteOrLocalPath
File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_common.py", line 1025, in <module>
RemoteOrLocalPath = Union[str, BlobPathLike, os.PathLike[str]]
TypeError: 'ABCMeta' object is not subscriptable
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[46321,1],0]
Exit code: 1
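The TypeError comes from `os.PathLike[str]` inside blobfile: subscripting `os.PathLike` at runtime is only supported from Python 3.9 onward, so on Python 3.8 the import fails with `'ABCMeta' object is not subscriptable`. The practical fixes are upgrading Python to ≥3.9 or pinning an older blobfile release. A minimal version-aware sketch of the failing alias (blobfile's real alias also includes its own `BlobPathLike` type, omitted here):

```python
import os
import sys
from typing import Union


def remote_or_local_path_type():
    """Build a path-type alias that works on both old and new Python.

    On Python < 3.9, os.PathLike is a bare ABC (metaclass ABCMeta), and
    evaluating os.PathLike[str] raises:
        TypeError: 'ABCMeta' object is not subscriptable
    which is exactly the error in the traceback above.
    """
    if sys.version_info >= (3, 9):
        return Union[str, os.PathLike[str]]  # safe on 3.9+
    return Union[str, os.PathLike]  # unparameterized fallback for 3.8
```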
Hi, could I know how many training steps are used for training? Many thanks.
I have trained on the Landscape dataset for nearly 180,000 steps, but the samples I obtain are still not as clear as those from the pretrained Landscape model. Can you please tell me how many steps the Landscape model was originally trained for?
Hello Authors, thank you for the wonderful work!
While trying to reproduce the results with the pretrained models, I found a mismatch with the metrics reported in the paper.
One small bug I found in the evaluation script is that the FAD is multiplied by 1e3 instead of 1e4.
Even ignoring that, the results do not match, and I believe it could be because the shared model was trained for a different number of iterations than the one reported in the paper.
Can you provide details about the pretrained models that you have shared?
Thank you so much!
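To illustrate the scaling discrepancy described above (a hypothetical sketch, not the repo's actual evaluation code; the 1e3/1e4 factors are taken from the issue text):

```python
# The paper reports FAD scaled by 1e4; if the evaluation script multiplies
# by 1e3 instead, every reported FAD value is off by a factor of 10.
PAPER_FAD_SCALE = 1e4
SCRIPT_FAD_SCALE = 1e3  # the suspected bug


def rescale_fad(script_value: float) -> float:
    """Convert a script-scaled (1e3) FAD value to the paper's 1e4 scale."""
    return script_value / SCRIPT_FAD_SCALE * PAPER_FAD_SCALE
```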
Hello, authors!
Could you share the GPU requirements for the ssh scripts?
I currently use 2 * 3090 Ti GPUs (2 * 24 GB), but a CUDA OOM error occurs.
Are there any other ways to reduce GPU memory usage?
Hi,
I found a file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" in the training script. Is this file the same as landscape.pt among the open-source models? I am looking forward to your answer, thank you very much.
Thanks for the great work! However, directly running the following commands does not produce the same FVD/KVD/FAD scores as in the paper. May I ask for the data and configurations to reproduce the paper's results, i.e., FVD=117.20, KVD=5.78, FAD=10.72?
bash ssh_scripts/multimodal_sample_sr.sh
bash ssh_scripts/multimodal_eval.sh
When running the above commands, the output I got is:
evaluate for 2048 samples
metric:{'fvd': 338.2535400390625, 'kvd': 10.005603799600976, 'fad': 1.3610674068331718}
Thanks for your excellent work! :)
I encountered a bug. In MM-Diffusion/evaluations/fvd/fvd.py, the pretrained FVD model URL (Google Drive) returns a 404. Could you update the code? Thanks!
```python
import os
import requests

# ROOT, get_confirm_token, and save_response_content are defined
# elsewhere in evaluations/fvd/fvd.py.
def download(id, fname):
    destination = os.path.join(ROOT, fname)
    if os.path.exists(destination):
        return destination
    os.makedirs(ROOT, exist_ok=True)
    URL = 'https://drive.google.com/uc?export=download'
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:  # large files require a confirmation token
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)
    return destination
```
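One reason this endpoint breaks: Google Drive's legacy `uc?export=download` URL now often returns an HTML interstitial (or a 404) for shared files. A hedged workaround observed in the wild (not an official, stable API) is the `drive.usercontent.google.com` host with `confirm=t`; a small URL-builder sketch:

```python
def gdrive_download_url(file_id: str) -> str:
    # The legacy 'drive.google.com/uc?export=download' endpoint frequently
    # serves an HTML warning page instead of the file. The usercontent host
    # with confirm=t tends to serve the raw bytes directly; this is an
    # observed workaround, not a documented stable interface.
    return (
        "https://drive.usercontent.google.com/download"
        f"?id={file_id}&export=download&confirm=t"
    )
```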
Thanks for your excellent work :)
But I ran into a problem:
/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE'. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Traceback (most recent call last):
File "py_scripts/multimodal_sample_sr.py", line 309, in
main()
File "py_scripts/multimodal_sample_sr.py", line 125, in main
sample = dpm_solver.sample(
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1293, in sample
x = self.singlestep_dpm_solver_update(x, vec_s, vec_t, order, solver_type=solver_type, r1=r1, r2=r2)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1060, in singlestep_dpm_solver_update
return self.singlestep_dpm_solver_third_update(x, s, t, return_intermediate=return_intermediate, solver_type=solver_type, r1=r1, r2=r2)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 819, in singlestep_dpm_solver_third_update
model_s = self.model_fn(x, s)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 449, in model_fn
return self.noise_prediction_fn(x, t)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 417, in noise_prediction_fn
return self.model(x, t)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 350, in model_fn
return noise_pred_fn(x, t_continuous)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 305, in noise_pred_fn
video_output,audio_output = model(x["video"], x["audio"], t_input, **model_kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 1085, in forward
video, audio = module(video, audio, emb)#
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 45, in forward
video, audio = layer(video, audio)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 694, in forward
return self.video_conv(video), self.audio_conv(audio)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 96, in forward
video = self.video_conv_spatial(video)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
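The final RuntimeError means an fp16 ("Half") convolution was executed on the CPU: PyTorch only implements half-precision conv kernels for CUDA, so the sampling process has evidently fallen back to CPU (e.g. CUDA is not visible to it). A minimal sketch of a version of the device/dtype choice that avoids this (assumed helper, not code from the repo):

```python
import torch


def safe_device_and_dtype():
    """Pick a device/dtype pair that PyTorch conv kernels support.

    fp16 convolutions are only implemented for CUDA; running them on CPU
    raises: RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'.
    """
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16
    # CPU fallback must use full precision.
    return torch.device("cpu"), torch.float32
```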
When I download the videos from Google Drive, the clips in these two datasets have no sound. Could you share the versions of the datasets with audio? Thank you so much.
Hi,
Is it possible to generate a single consistent character from a pose sequence for about 5 seconds?
I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate a 5-second output video with a consistent character/avatar that dances, etc., driven by the controlled (pose) input.
In other words: given an OpenPose + hands + face video, I want to generate human-like animation (anything, as long as the character/avatar is consistent).
Sample Video
P.S. Any model that supports pose + hands + face can be used!
Thanks
Best regards
I would like to ask whether the computational complexity in the paper is correct.
Should it be O((SHW) * (F * T/F)) instead of O((SHW) * (S * T/F))?
I think in RS-MMA there are F audio patches (each of length T/F), and each audio patch attends to a video patch (of size SHW). Thus, the computational complexity should be O((SHW) * (F * T/F)).
May I ask if my idea is correct? Your comments will be really appreciated.
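For reference, the questioner's argument (restated here as given, not a confirmed correction of the paper) can be written out as:

```latex
% RS-MMA attention cost as argued above: the audio of length $T$ is split
% into $F$ patches of length $T/F$, and each audio patch attends to a
% video patch of size $S \cdot H \cdot W$. Summing over the $F$ patches:
\mathcal{O}\!\left( (SHW) \cdot F \cdot \tfrac{T}{F} \right)
  = \mathcal{O}\!\left( SHW \cdot T \right),
% versus the paper's stated form, where $S$ rather than $F$ multiplies $T/F$:
\mathcal{O}\!\left( (SHW) \cdot S \cdot \tfrac{T}{F} \right).
```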
In the requirement file, "mkl-fft==1.3.1 mkl-random==1.2.2" requires numpy>=1.22.3,<1.23.0, yet numpy==1.23.5 is also pinned, which causes a conflict.
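A minimal sketch of why the resolver fails (the version bounds are taken from the issue text above; the helper name is hypothetical):

```python
# mkl-fft==1.3.1 / mkl-random==1.2.2 require numpy >=1.22.3,<1.23.0,
# while requirement.txt also pins numpy==1.23.5 -- no version satisfies both.
def satisfies_mkl_bound(version: tuple) -> bool:
    """True if a numpy version tuple fits mkl-fft's declared constraint."""
    return (1, 22, 3) <= version < (1, 23, 0)

# The pinned numpy==1.23.5 violates the bound, hence the pip conflict;
# relaxing the pin to something inside the bound (e.g. 1.22.4) is one
# possible workaround, assuming the rest of the code tolerates it.
```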
Are you planning to release the source code and the checkpoints of this work in the future? Thank you.