mm-diffusion's Issues

computational complexity in paper

I would like to ask whether the computational complexity in the paper is correct.

Should it be O((SHW) · (F · T/F)) instead of O((SHW) · (S · T/F))?
I think in RS-MMA there are F audio patches (each of length T/F), and each audio patch attends to a video patch (of length SHW). Thus, the computational complexity should be O((SHW) · (F · T/F)).

May I ask if my idea is correct? Your comments will be really appreciated.
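For reference, working the count out explicitly under the question's own assumptions (F audio patches of length T/F, each attending to an SHW-token video patch); this is the asker's bookkeeping, not the paper's derivation:

```latex
% F audio patches, each of length T/F, each attending to S*H*W video tokens:
\mathcal{O}\!\left( F \cdot (SHW) \cdot \tfrac{T}{F} \right) = \mathcal{O}\big( (SHW) \cdot T \big)
```

Note that the F cancels, so the proposed form simplifies to O((SHW) · T).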

Dependency problem

In the requirements file, mkl-fft==1.3.1 and mkl-random==1.2.2 require numpy>=1.22.3,<1.23.0, yet numpy==1.23.5 is also pinned, which causes a conflict.
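One hedged way out (assuming nothing in the repo actually needs numpy 1.23 features) is to pin numpy inside the range the mkl packages accept, e.g. in requirements.txt:

```
mkl-fft==1.3.1
mkl-random==1.2.2
numpy==1.22.4  # satisfies the >=1.22.3,<1.23.0 constraint from the mkl packages
```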

Training landscape dataset

I have trained on the landscape dataset for nearly 180,000 steps, yet the samples I obtain are not as clear as those from the pretrained landscape model. Could you please tell me how many steps the landscape model was originally trained for?

How many iterations have the pretrained models been trained for?

Hello Authors, thank you for the wonderful work!

While trying to reproduce the results with the pretrained models, I found a mismatch between my metrics and those reported in the paper.
One small bug I found in the evaluation script is that the FAD is multiplied by 1e3 instead of 1e4.
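For illustration only (the variable names and raw value below are hypothetical, not the script's), the discrepancy described here is a factor of 10 in the reported number:

```python
fad_raw = 1.072e-3            # hypothetical raw metric value
fad_paper  = fad_raw * 1e4    # 10.72 -- the paper's 1e4 scaling convention
fad_script = fad_raw * 1e3    # 1.072 -- what a 1e3 scaling would print instead
```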

Even ignoring that, the results do not match, and I believe it could be because the released models were trained for a different number of iterations than mentioned in the paper.

Can you provide details about the pretrained models that you have shared?

Thank you so much!

TypeError: 'ABCMeta' object is not subscriptable

Hello, I encountered the following error when running the sampling command `bash ssh_scripts/multimodal_sample_sr.sh`. Is this related to my Python version (3.8.18) or PyTorch version (1.12.1+cu116)? How should I fix it? Thank you for your help!

Traceback (most recent call last):
  File "py_scripts/multimodal_train.py", line 7, in <module>
    from mm_diffusion import dist_util, logger
  File "/root/MM-Diffusion/mm_diffusion/dist_util.py", line 8, in <module>
    import blobfile as bf
  File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/__init__.py", line 6, in <module>
    from blobfile._ops import (
  File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_ops.py", line 19, in <module>
    from blobfile._common import DirEntry, Stat, RemoteOrLocalPath
  File "/root/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/blobfile/_common.py", line 1025, in <module>
    RemoteOrLocalPath = Union[str, BlobPathLike, os.PathLike[str]]
TypeError: 'ABCMeta' object is not subscriptable
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[46321,1],0]
  Exit code:    1
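For reference, this failure is consistent with running on Python 3.8: `os.PathLike` only became subscriptable in Python 3.9, so blobfile's `os.PathLike[str]` annotation raises exactly this error on 3.8. A minimal reproduction (not repo code):

```python
import os

# On Python 3.8 this raises TypeError: 'ABCMeta' object is not subscriptable;
# Python 3.9 added __class_getitem__ to os.PathLike, so it works there.
PathType = os.PathLike[str]
```

Upgrading to Python >= 3.9, or pinning an older blobfile release that predates this annotation, should avoid it.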

Hand + Face of Human Pose

Hi,
Is it possible to generate a single character from a pose sequence for about 5 seconds?

I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate a 5-second output video with a consistent character/avatar that dances, etc., driven by the controlled pose input.

I have a video of OpenPose + hands + face, and I want to generate a human-like animation (no matter the style, just a consistent character/avatar).
Sample Video

P.S. Any model that supports pose + hands + face can be used!

Thanks
Best regards

Training steps

Hi, could I know how many training steps were used for training? Many thanks.

Training issues

Hi,
I found a file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" referenced in the training script. Is this file the same as the landscape.pt among the released models? I look forward to your answer, thank you very much.

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

Thanks for your excellent work :)
But I ran into a problem.

/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Traceback (most recent call last):
  File "py_scripts/multimodal_sample_sr.py", line 309, in <module>
    main()
  File "py_scripts/multimodal_sample_sr.py", line 125, in main
    sample = dpm_solver.sample(
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1293, in sample
    x = self.singlestep_dpm_solver_update(x, vec_s, vec_t, order, solver_type=solver_type, r1=r1, r2=r2)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 1060, in singlestep_dpm_solver_update
    return self.singlestep_dpm_solver_third_update(x, s, t, return_intermediate=return_intermediate, solver_type=solver_type, r1=r1, r2=r2)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 819, in singlestep_dpm_solver_third_update
    model_s = self.model_fn(x, s)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 449, in model_fn
    return self.noise_prediction_fn(x, t)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 417, in noise_prediction_fn
    return self.model(x, t)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 350, in model_fn
    return noise_pred_fn(x, t_continuous)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_dpm_solver_plus.py", line 305, in noise_pred_fn
    video_output, audio_output = model(x["video"], x["audio"], t_input, **model_kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 1085, in forward
    video, audio = module(video, audio, emb)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 45, in forward
    video, audio = layer(video, audio)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 694, in forward
    return self.video_conv(video), self.audio_conv(audio)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wga/wga/MM-Diffusion/mm_diffusion/multimodal_unet.py", line 96, in forward
    video = self.video_conv_spatial(video)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/wga/miniconda3/envs/mmdiffusion/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

OOM in multimodal_sample_sr.sh

Hello, authors!

Could you share the GPU requirements for the ssh scripts?
I am currently using 2 × 3090 Ti GPUs (2 × 24 GB), but a CUDA OOM error occurs.

Are there any other ways to reduce GPU memory usage?
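No authoritative numbers here, but two generic levers usually help regardless of this repo's specific flags: generate fewer samples per forward pass, and run sampling under `torch.inference_mode()` so no autograd state is retained. A hedged sketch with a placeholder `sample_fn` (not this repo's API):

```python
import torch

def sample_in_chunks(sample_fn, total: int, chunk: int):
    # Run sampling in smaller batches to bound peak GPU memory.
    outs = []
    with torch.inference_mode():  # no autograd buffers are kept
        for start in range(0, total, chunk):
            outs.append(sample_fn(batch_size=min(chunk, total - start)))
            torch.cuda.empty_cache()  # release cached blocks between chunks
    return torch.cat(outs)
```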

question about paper

May I ask which audio encoder, video encoder, and corresponding decoders are used in the model? The paper does not seem to mention them.

Reproducing results in the paper

Thanks for the great work! However, directly running the following commands does not produce the same FVD/KVD/FAD scores as in the paper. May I ask for the data and configurations needed to reproduce the paper's results, i.e., FVD=117.20, KVD=5.78, FAD=10.72?

bash ssh_scripts/multimodal_sample_sr.sh
bash ssh_scripts/multimodal_eval.sh

When running the above commands, the output I got is:

evaluate for 2048 samples
metric:{'fvd': 338.2535400390625, 'kvd': 10.005603799600976, 'fad': 1.3610674068331718}

About generating 10 seconds of audio

Hello, thank you for this awesome project. I have a question: I saw that your dataset consists of 10-second video clips, so why are the final generated samples only one second long? How can I sample 10 seconds of audio?

_pickle.UnpicklingError: invalid load key, '<'.

Thanks for your excellent work! :)
I encountered a bug. In MM-Diffusion/evaluations/fvd/fvd.py, the pretrained FVD model URL (Google Drive) returns a 404. Could you update the code? Thanks!
```python
import os
import requests

def download(id, fname):
    destination = os.path.join(ROOT, fname)  # ROOT is a module-level constant in fvd.py
    if os.path.exists(destination):
        return destination

    os.makedirs(ROOT, exist_ok=True)

    URL = 'https://drive.google.com/uc?export=download'
    session = requests.Session()

    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)  # helper defined elsewhere in fvd.py

    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)  # helper defined elsewhere in fvd.py
    return destination
```
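If the weights are still live on Drive under the same id, one hedged alternative is gdown (recent versions), which handles Drive's confirm-token flow itself. A sketch, assuming the same ROOT constant and file id:

```python
import os
import gdown

def download(id, fname):
    destination = os.path.join(ROOT, fname)
    if os.path.exists(destination):
        return destination
    os.makedirs(ROOT, exist_ok=True)
    # gdown follows Google Drive's download-confirmation redirect automatically.
    gdown.download(id=id, output=destination, quiet=False)
    return destination
```

If the file itself has been removed, only re-hosting the FVD checkpoint (and updating the id/URL) will fix the 404.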
