diffgesture's Introduction

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation (CVPR 2023)

This is the official code for Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation.

Abstract

Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations.
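
For readers unfamiliar with classifier-free guidance, the snippet below is a generic sketch of the standard sampling-time formulation; it illustrates the general idea only, and the exact guidance scheme used in DiffGesture may differ.

# Generic classifier-free guidance sketch (standard formulation, not necessarily
# the exact scheme used in DiffGesture). eps_cond / eps_uncond are the denoiser's
# noise predictions with and without the audio condition; larger w favors
# condition fidelity (gesture quality) over diversity.
def guided_noise(eps_cond, eps_uncond, w):
    return (1.0 + w) * eps_cond - w * eps_uncond

In a DDPM-style sampler, this guided prediction would replace the purely conditional noise estimate at each denoising step.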

Installation & Preparation

  1. Clone this repository and install packages:

    git clone https://github.com/Advocate99/DiffGesture.git
    pip install -r requirements.txt
    
  2. Download the pretrained fastText model from here and put crawl-300d-2M-subword.bin and crawl-300d-2M-subword.vec in data/fasttext/.

  3. Download the auto-encoders used for FGD (Fréchet Gesture Distance) evaluation, which include the following:

    For the TED Gesture dataset, we use the pretrained Auto-Encoder model provided by Yoon et al. for better reproducibility; use the checkpoint in the train_h36m_gesture_autoencoder folder.

    For the TED Expressive Dataset, the pretrained Auto-Encoder model is provided here.

    Save the models to output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin for TED Gesture and to output/TED_Expressive_output/AE-cos1e-3/checkpoint_best.bin for TED Expressive (a quick sanity check of these paths is sketched after this list).

  4. Refer to HA2G to download the two datasets.

  5. The pretrained models can be found here.
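
Before training or evaluation, it may help to confirm that the downloads from the steps above ended up where the configs expect them. Below is a minimal sanity-check sketch; it is not part of the repository's scripts, and the paths are simply copied from steps 2-3.

# Hypothetical helper, not part of DiffGesture: check that the fastText files and
# the FGD auto-encoder checkpoints are in the locations referenced above.
import os

expected_files = [
    'data/fasttext/crawl-300d-2M-subword.bin',
    'data/fasttext/crawl-300d-2M-subword.vec',
    # FGD auto-encoder for TED Gesture
    'output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin',
    # FGD auto-encoder for TED Expressive
    'output/TED_Expressive_output/AE-cos1e-3/checkpoint_best.bin',
]
for path in expected_files:
    status = 'ok' if os.path.isfile(path) else 'MISSING'
    print(f'{status:8s} {path}')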

Training

While the test metrics may vary slightly between runs, training with the given config files generally yields similar performance and normally outperforms all the comparison methods.

python scripts/train_ted.py --config=config/pose_diffusion_ted.yml
python scripts/train_expressive.py --config=config/pose_diffusion_expressive.yml

Inference

# synthesize short videos
python scripts/test_ted.py short
python scripts/test_expressive.py short

# synthesize long videos
python scripts/test_ted.py long
python scripts/test_expressive.py long

# metrics evaluation
python scripts/test_ted.py eval
python scripts/test_expressive.py eval
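
The test scripts load a trained diffusion checkpoint; according to the logs in the issues below, the TED Gesture scripts look for output/train_diffusion_ted/pose_diffusion_checkpoint_499.bin. The following is a minimal sketch for confirming that this checkpoint is present and loadable; the 'lang_model' key is inferred from the issues below, and the rest of the contents are not guaranteed, so verify against the actual file.

# Hypothetical check, not part of the repository: confirm the checkpoint the test
# scripts expect exists and can be loaded. Run from the repository root so that
# any custom classes pickled inside the checkpoint can be imported.
import torch

ckpt_path = 'output/train_diffusion_ted/pose_diffusion_checkpoint_499.bin'
ckpt = torch.load(ckpt_path, map_location='cpu')
print(list(ckpt.keys()))  # the issues below suggest this includes e.g. 'lang_model'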

Citation

If you find our work useful, please cite:

@inproceedings{zhu2023taming,
  title={Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation},
  author={Zhu, Lingting and Liu, Xian and Liu, Xuanyu and Qian, Rui and Liu, Ziwei and Yu, Lequan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10544--10553},
  year={2023}
}

Related Links

If you are interested in Audio-Driven Co-Speech Gesture Generation, we also recommend checking out our other related works:

  • Hierarchical Audio-to-Gesture, HA2G.

  • Audio-Driven Co-Speech Gesture Video Generation, ANGIE.

Acknowledgement

diffgesture's People

Contributors

advocate99, alvinliu0


diffgesture's Issues

Error while reading data from the mdb files using Pyarrow

Hi,

I am getting an error when I try to deserialize the mdb files from both the ted_gesture and ted_expressive datasets.

A sample code block to generate the error:

import lmdb
import pyarrow

# Path to the LMDB database file
db_path = '/path/to/mdb/'

# Open the LMDB environment
env = lmdb.open(db_path, readonly=True)  # Open in read-only mode

# Begin a read-only transaction
with env.begin(write=False) as txn:
    # Open a cursor to iterate over the records in the database
    cursor = txn.cursor()

    # Iterate over the records
    for key, value in cursor:
        # Deserialize the value using pyarrow
        data = pyarrow.deserialize(value)

# Close the LMDB environment
env.close()
Error Message

Traceback (most recent call last):
File "tmp.py", line 19, in
data = pyarrow.deserialize(value)
File "pyarrow/serialization.pxi", line 550, in pyarrow.lib.deserialize
File "pyarrow/serialization.pxi", line 556, in pyarrow.lib._deserialize
File "pyarrow/serialization.pxi", line 285, in pyarrow.lib.SerializedPyObject.deserialize
File "pyarrow/error.pxi", line 129, in pyarrow.lib.check_status
pyarrow.lib.ArrowSerializationError: Cannot convert string: "" to int8_t

To make sure this issue is not due to corrupted mdb files, I re-downloaded them, but unfortunately the issue persists.

The issue is similar to: #3 (comment)

Could you please help me with this issue?

Thank you in advance:)

P.S. The serialize() and deserialize() functions have been removed in the latest pyarrow version (12.0.1), so I am using pyarrow 11.0.0 (requirements.txt specifies pyarrow==0.14.1, but that version is not available).

pyarrow error

I followed the requirements on a 3090 Ti.

I tried to install pyarrow==0.14.1, but the installation fails:

ERROR: Could not find a version that satisfies the requirement pyarrow==0.14.1 (from versions: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1, 1.0.0, 1.0.1, 2.0.0, 3.0.0, 4.0.0, 4.0.1, 5.0.0, 6.0.0, 6.0.1, 7.0.0, 8.0.0, 9.0.0, 10.0.0, 10.0.1, 11.0.0)
ERROR: No matching distribution found for pyarrow==0.14.1

That version seems to be unavailable now, so I installed the latest version. When I run the following code:

video = pyarrow.deserialize(value)

I am getting the following error:

Exception has occurred: OSError
Expected IPC message of type unknown but got unknown

Do you have any suggestions, or a more detailed environment configuration?

Inference using Audio or Text

Hello, Dr. Zhu!
Thank you for your great work; it is really outstanding.
I have run the model successfully, thanks to your clear description. Now I want to use the model on arbitrary audio; how can I do that? I tried editing the data loader, but it is still unclear to me.
I also wonder whether it is possible to run the model on given text only, to make it faster, since speed matters to me.
Thank you

Inquiry on Jittering Issues with Beat Datasets

Hi

I hope this message finds you well. Firstly, I would like to extend my sincere gratitude for your significant contribution to the field of audio-driven co-speech gesture generation.

However, I have encountered an issue that I am keen to discuss with you. When benchmarking the BEAT dataset with your source code and visualizing the results, I found that your BC, FGD, and Diversity metrics were leading in performance. However, there was an issue I didn't anticipate: jittering in the visualized outputs. Even after I applied the motion-smoothing code you provided on GitHub, the jittering remained.

I am reaching out to inquire if there might be additional considerations or adjustments that should be made when working with beat datasets to mitigate this issue. Your insights or any further guidance you could provide would be immensely valuable and greatly appreciated.

Thank you very much for your time and for sharing your expertise.

Best regards

ValueError: unknown file extension: .mp4

Hello! Thanks for your outstanding work
I get this error when I run python scripts/test_ted.py long:
ValueError: unknown file extension: .mp4
It seems to come from test_ted.py, line 391.
I looked this up and found that Pillow does not support the MP4 format.
Can you give me some advice?

Missing of model checkpoint

Good morning,

Thanks for your good work. I would like to try running it, but I get the following error during inference. Do you have any idea how to fix it, or is something missing? Thanks:

FileNotFoundError: [Errno 2] No such file or directory: 'output/train_diffusion_ted/pose_diffusion_checkpoint_499.bin'

Regarding dataset path configuration issues

Hello author, I have set my own file path, but the dataset construction in the code is:

train_dataset = SpeechMotionDataset(args.train_data_path[0],
                                    n_poses=args.n_poses,
                                    subdivision_stride=args.subdivision_stride,
                                    pose_resampling_fps=args.motion_resampling_framerate,
                                    mean_dir_vec=mean_dir_vec,
                                    mean_pose=args.mean_pose,
                                    remove_word_timing=(args.input_context == 'text'))

Here args.train_data_path[0] only picks up the first letter "E" of my file path. I tried changing it to args.train_data_path, but I still get errors saying the dataset cannot be found. I don't know how to solve this problem.

how to produce the evaluation metrics numbers?

Hi, and thanks for this great work. I am trying to reproduce the metric numbers from the paper, but when I run
python scripts/test_ted.py eval
I get these logs:
python scripts/test_ted.py eval
loading checkpoint output/train_diffusion_ted/pose_diffusion_checkpoint_499.bin
epoch 499
init diffusion model
2023-12-05 10:22:03,301: PyTorch version: 1.13.1+cu117
2023-12-05 10:22:03,301: CUDA version: 11.7
2023-12-05 10:22:03,302: 1 GPUs, default cuda:0
2023-12-05 10:22:03,302: {'batch_size': 128,
'block_depth': 8,
'classifier_free': True,
'config': 'config/pose_diffusion_ted.yml',
'diff_hidden_dim': 256,
'epochs': 500,
'eval_net_path': 'output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin',
'freeze_wordembed': False,
'hidden_size': 300,
'input_context': 'audio',
'latent_dim': 128,
'learning_rate': 0.0005,
'loader_workers': 4,
'mean_dir_vec': [[0.0154009], [-0.9690125], [-0.0884354], [-0.0022264], [-0.8655276],
                 [0.4342174], [-0.0035145], [-0.8755367], [-0.4121039], [-0.9236511],
                 [0.3061306], [-0.0012415], [-0.5155854], [0.8129665], [0.0871897],
                 [0.2348464], [0.1846561], [0.8091402], [0.9271948], [0.2960011],
                 [-0.013189], [0.5233978], [0.8092403], [0.0725451], [-0.2037076],
                 [0.1924306], [0.8196916]],
'mean_pose': [[3.06e-05], [0.0004946], [0.0008437], [0.0033759], [-0.2051629],
              [-0.0143453], [0.0031566], [-0.3054764], [0.0411491], [0.0029072],
              [-0.4254303], [-0.001311], [-0.1458413], [-0.1505532], [-0.0138192],
              [-0.2835603], [0.0670333], [0.0107002], [-0.2280813], [0.112117],
              [0.2087789], [0.1523502], [-0.1521499], [-0.0161503], [0.291909],
              [0.0644232], [0.0040145], [0.2452035], [0.1115339], [0.2051307]],
'model': 'pose_diffusion',
'model_save_path': 'output/train_diffusion_ted',
'motion_resampling_framerate': 15,
'n_poses': 34,
'n_pre_poses': 4,
'name': 'pose_diffusion',
'null_cond_prob': 0.1,
'pose_dim': 27,
'pose_representation': '3d_vec',
'random_seed': -1,
'save_result_video': True,
'subdivision_stride': 10,
'test_data_path': ['data/ted_dataset/lmdb_test'],
'train_data_path': ['data/ted_dataset/lmdb_train'],
'val_data_path': ['data/ted_dataset/lmdb_val'],
'wordembed_dim': 300,
'wordembed_path': 'data/fasttext/crawl-300d-2M-subword.bin'}
2023-12-05 10:22:03,334: Reading data 'data/ted_dataset/lmdb_val'...
2023-12-05 10:22:03,334: Found the cache data/ted_dataset/lmdb_val_cache
0

Only a zero is printed, and the same happens for
python scripts/test_ted.py short
I traced part of the code and noticed that it never enters the for loop at line 178 of test_ted.py.
How can I fix this?

Some questions about experiment and code

Hi, Dr. Zhu!
This is nice work, and it inspires me a lot.
However, I have some questions.

First, during training, the loss sometimes converges to 1, especially in the first epoch, which confuses me a lot. In some training runs the loss begins to decrease after 1-4 epochs, while in a few runs it stays at 1.

Second, in some epochs the loss suddenly increases from 0.03 to 10000+ and then rapidly decreases. This phenomenon occurs irregularly, often every few tens of epochs.

Last, in the code ./scripts/model/diffusion_util.py (TransfomerModel), why is a random tensor added: self.pos_embedding = nn.Parameter(torch.randn(1, num_pose, hidden_dim))?

Why use the val dataset to do the model testing?

Hi,

If I'm not mistaken, both HA2G and DiffGesture use the val dataset for testing the model. While this doesn't really affect performance comparisons between models, shouldn't the test dataset generally be used?

Thanks in advance!

lang_model and fasttext?

Hello, I have two questions:
The first question: what is the function of lang_model? It appears in two places:

  1. lang_model = checkpoint['lang_model']
  2. with open(vocab_cache_path, 'rb') as f:
         lang_model = pickle.load(f)

The second question: is the fasttext model used during the inference process? I can't find where it is called.

[DiffGesture]

Hello, Dr. Zhu!
This is very outstanding work, and it has sparked my great interest.
But I have a small question: I found that the maximum n_poses in the ted_gesture_dataset seems to be fixed at 42. Is it correct that 34 is used during training? Is there a way to train on longer sequences (such as 64 or 128 frames)?

About Inference

Thanks for your great work! It seems that the inference code runs on the validation set. Is it possible to run it on arbitrary audio?

Issues about DataPreprocessor when running the code

When I run

python scripts/train_expressive.py --config=config/pose_diffusion_expressive.yml

An issue occurred in DataPreprocessor:

class DataPreprocessor:
    def __init__(self, clip_lmdb_dir, out_lmdb_dir, n_poses, subdivision_stride,
                 pose_resampling_fps, mean_pose, mean_dir_vec, disable_filtering=False):
        self.n_poses = n_poses
        self.subdivision_stride = subdivision_stride
        self.skeleton_resampling_fps = pose_resampling_fps
        self.mean_pose = mean_pose
        self.mean_dir_vec = mean_dir_vec
        self.disable_filtering = disable_filtering

        self.src_lmdb_env = lmdb.open(clip_lmdb_dir, readonly=True, lock=False)
        with self.src_lmdb_env.begin() as txn:
            self.n_videos = txn.stat()['entries']
TypeError: Transaction.stat() missing 1 required positional argument: 'db'

Please help me solve this issue, thanks a lot.

pretrained model

Hello, I am very interested in your research. Could you provide me with your pretrained model?

Real Pose Generate

Thanks for your great work. How can I generate real poses corresponding to the OpenPose format? I found that the output_poses generated by the inference script are quite small numbers.

About reproduction of the experiment result in Paper.

Thanks for your outstanding work! I am very interested in this method, and thank you for your hard work. However, with the uploaded code I ran training on the TED Gesture dataset and got worse results than those reported in the paper. Are there any changes needed to reproduce the best result, and does the checkpoint at the 499th epoch produce the result in the paper?

lmdb.Error: data/ted_dataset/lmdb_train: No such file or directory

Traceback (most recent call last):
File "scripts/train_ted.py", line 179, in
main({'args': _args})
File "scripts/train_ted.py", line 133, in main
train_dataset = SpeechMotionDataset(args.train_data_path[0],
File "D:\anzbao-biancheng\githubproject\DiffGesture-main\scripts\data_loader\lmdb_data_loader.py", line 86, in init
data_sampler = DataPreprocessor(lmdb_dir, preloaded_dir, n_poses_extended,
File "D:\anzbao-biancheng\githubproject\DiffGesture-main\scripts\data_loader\data_preprocessor.py", line 30, in init
self.src_lmdb_env = lmdb.open(clip_lmdb_dir, readonly=True, lock=False)
lmdb.Error: data/ted_dataset/lmdb_train: No such file or directory
