********* June 19, 2020 *********
The code and data are going through the internal review and will be released later!
********* August 26, 2020 *********
The dataset is still going through the internal review, please wait.
********* September 7, 2020 *********
The code & pose data are released!
********* April 4, 2021 *********
Two versions of codebase are released. Have a try to train your AI dancer!
This repo is the PyTorch implementation of "Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning". Our proposed approach significantly outperforms the existing SOTAs in extensive experiments, including automatic metrics and human judgements. It can generate creative long dance sequences, e.g., about 1-minute length under 15 FPS, from the input music clips, which are smooth, natural-looking, diverse, style-consistent and beat-matching with the music from test set. With the help of 3D human pose estimation and 3D animation driving, this technique can be used to drive various 3D character models such as the 3D model of Hatsune Miku (very popular virtual character in Japan), and has the great potential for the virtual advertisement video generation.
Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning. ICLR 2021.
Ruozi Huang*, Huang Hu*, Wei Wu, Kei Sawada, Mi Zhang and Daxin Jiang.
[Paper] [YouTube] [Project]
- Python 3.7
- PyTorch 1.6.0
- ffmpeg
-
We released the dance pose data and the corresponding audio data into [Google Drive]. Please put the downloaded
data/
into the project directoryDanceRevolution/
and runprepro.py
that will generate the training data directorydata/train_1min
and test data directorydata/test_1min
. The pose sequences are extracted from the collected dance videos with original 30FPS while the audio data is m4a format. After the generation is finished, you can runinterpolate_to30fps.py
to increase the 15FPS to 30FPS to produce visually smoother dance videos for generated results. -
If you plan to train the model with your own dance data (2D), please install OpenPose for the human pose extraction. After that, please follow the hierarchical structure of directory
data/
to place your own extracted data and runprepro.py
to generate the training data and test data. Note that, we develope ainterpolate_missing_keyjoints.py
script to find the missing keyjoints to reduce the noise in the pose data, which is introduced by the imperfect extraction of OpenPose. -
Due to the lack of 3D dance data in hand, we did not test our approach on 3D data. Recently, there is a wonderful paper conducted by Li et al., "Learn to Dance with AIST++: Music Conditioned 3D Dance Generation"(https://arxiv.org/abs/2101.08779), on music-conditioned 3D dance generation, which releases a large-scale 3D human dance motion dataset, AIST++. Just heard from the authors of this paper that our approach also performs well on their released 3D dataset and add our approach as one of compared baselines in their work!
We released two versions of codebases that have passed the test. In V1 version, the local self-attention module is implemented base on the longformer that provides the custom CUDA kernel to accelerate the training speed and save GPU memory for long sequence inputs. While V2 version just implements the local self-attention module via the naive PyTorch implementation, i.e., the attention mask operations. In practice, we found the performance of V2 is more stable and recommend to use V2 version. Here are some training tricks that may be helpful for you:
- Small batch sizes, such as 32 and 16, would help model to converge better and the model usually converges well at around the 3000-th epoch. It takes about 3 days to train the model well under these settings.
- Increasing sliding window size of local self-attention is beneficial to the more stable performance while the cost (e.g., training time and GPU memory usage) would become high. This point has been justified in the ablation study of encoder structures in the paper. So if you are free of GPU resource limitation, we recommend to use the large sliding window size for training.
- Ballet style
- Hiphop style
- Japanese Pop style
- Photo-Realisitc Videos by Video-to-Video
We map the generated skeleton dances to the photo-realistic videos by Video-to-Video. Specifically, We record a random dance video of a team memebr to train the vid2vid model. Then we generate photo-realistic videos by feeding the generated skeleton dances to the trained vid2vid model. Note that, our team member has authorized us the usage of her portrait in following demos.
- Driving 3D model by applying 3D human pose estimation and Unity animation to generated skeleton dances.
If you find this work useful for your research, please cite the following paper:
@inproceedings{
huang2021,
title={ Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning},
author={Ruozi Huang and Huang Hu and Wei Wu and Kei Sawada and Mi Zhang and Daxin Jiang},
booktitle={International Conference on Learning Representations},
year={2021}
}