
speech2gesture's People

Contributors

amirbar, shinydinosaur


speech2gesture's Issues

need more information about how to build data set

Hi, and thanks for this work.
Could you please share the script, or explain how you built this dataset? I am trying to build a similar dataset for another language, and it is unclear how the many files (e.g., the jpg files) and the DataFrame columns such as interval_id and frame_id were extracted.
Please help me with this.

Training Procedure to Achieve Presented Results

Hi,

Thank you for your great work.
In the paper you mentioned that you train for 300K/90K iterations with and without an adversarial loss, respectively, to achieve the results presented in Tables 1-3.
I assume that you first run the train.py script with --epochs 1000 and --lambda_gan 1.0. Then, you select the best model and run the same script again with --epochs 300 and --lambda_gan 0. The best model of the second run should achieve the presented results. Is that assumption correct, or did you use a different approach?
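
If that assumption holds, the schedule would look something like this (the --epochs and --lambda_gan flags are taken from the question above; the exact values and any other required flags are assumptions, not verified against train.py):

    python train.py --epochs 1000 --lambda_gan 1.0   # stage 1: regression + adversarial loss
    # pick the best stage-1 checkpoint, then fine-tune without the GAN term:
    python train.py --epochs 300 --lambda_gan 0      # stage 2: regression loss only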

where is the frame_df.csv.tgz

Hello, I appreciate your work!
I have a question: where is frame_df.csv.tgz? I cannot find it at the link. Please tell me how and where to get it. Thanks.

Misalignment of audio and frames

According to common/consts.py, the following constants are defined:

SR = 16000
AUDIO_SHAPE = 67267
FPS = 15
FRAMES_PER_SAMPLE = 64

From the first three constants, we can compute num_frames = AUDIO_SHAPE / SR * FPS = 67267 / 16000 * 15 ≈ 63.06, which is about one whole frame less than FRAMES_PER_SAMPLE.

We encountered this problem while testing the model on a longer audio sequence, where the misalignment is magnified.
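
The arithmetic is easy to check directly (plain Python, using only the constants above):

    SR = 16000               # audio sample rate (Hz)
    AUDIO_SHAPE = 67267      # audio samples per training example
    FPS = 15                 # pose frame rate
    FRAMES_PER_SAMPLE = 64   # pose frames per training example

    covered_frames = AUDIO_SHAPE / SR * FPS        # ~63.06 frames covered by the audio
    needed_samples = FRAMES_PER_SAMPLE / FPS * SR  # ~68266.7 samples needed for 64 frames
    print(covered_frames, needed_samples)
    # the ~0.94-frame shortfall per 64-frame window accumulates on long sequences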

Random noise in Generator

Hello Amir,

Does your generator take any random noise as input, or does it generate the same poses for the same audio input? I'm working on a similar sequence-generation project and am curious how you introduce randomness into the sequence.

Thanks.

Code Release

Hello,
When do you plan to release your code?

about gan

Hello, I appreciate your work!
I have a question: after trying your pretrained models, the result is a skeletal animation, but I want to get an animation of a human figure rather than a skeleton. Did I do something wrong? If so, please tell me which file I should run. Thanks.

How can you run your network on arbitrary audio durations at test time?

"During training, we take as input spectrograms corresponding to about 4 seconds of audio and predict 64 pose vectors, which correspond to about 4 seconds at a 15Hz frame-rate. At test time we can run our network on arbitrary audio durations"(Section 4.3).

What are the details of the testing implementation?
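
One plausible reading (an illustration, not code from the repository) is that the network is fully convolutional along the time axis, so the number of predicted poses simply scales with the length of the input audio:

    SR, FPS = 16000, 15

    def expected_pose_count(num_audio_samples):
        # 64 poses correspond to ~4.27 s of audio at 15 Hz, so a fully
        # convolutional model emits proportionally more poses for longer audio
        return int(round(num_audio_samples / SR * FPS))

    print(expected_pose_count(4 * SR))    # 60 poses for 4 s of audio
    print(expected_pose_count(60 * SR))   # 900 poses for 60 s of audio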

Enhancement - Gesture 2 Speech

Hi @amirbar,
Can the dataset be helpful for generating speech from gestures?
Sign-language applications and human-computer interaction could be among the use cases.

Some questions about the code?

I notice that the loss function of the discriminator is MSE. In my view, cross-entropy is the commonly used loss for binary classification in a discriminator, so I would like to know why MSE was chosen. I'm also confused by the output of the discriminator: it is a vector whose length differs from the input's. What does it denote?
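
For what it's worth, an MSE discriminator loss is characteristic of least-squares GANs (LSGAN), and an output vector shorter than the input usually indicates a patch-style discriminator that emits one real/fake score per temporal window. A minimal TF 1.x sketch of that combination (an illustration, not the repository's actual loss code):

    import tensorflow as tf  # TF 1.x assumed, as in this repository

    def lsgan_d_loss(d_real, d_fake):
        # d_real, d_fake: [batch, T'] patch scores, one per temporal window;
        # real patches are pushed toward 1, fake patches toward 0
        return tf.reduce_mean(tf.square(d_real - 1.0)) + tf.reduce_mean(tf.square(d_fake))

    def lsgan_g_loss(d_fake):
        # the generator tries to make every patch score look real (close to 1)
        return tf.reduce_mean(tf.square(d_fake - 1.0))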

Can you provide some help on getting the videos?

Thank you for the data you collected; however, I ran into some problems obtaining it.
We noticed that the videos of "Rock" are currently unavailable for download.
I also don't know how to download the videos of "Jon"; could you give me some help?

Missing frames_df_10_19_19.csv and pretrained models

Hi, thank you so much for your work!

I ran into some problems reproducing your work. Many pre-trained models are missing from the Google Drive: none of them has a data-00000-of-00001 file, and some contain only an .index file. https://drive.google.com/drive/folders/1yBJur-FjtMGNZTKKvEY5WuppG2yp2SJO

I then tried to retrain your model, but frames_df_10_19_19.csv is also missing from the Drive.

It seems unlikely that anyone could have reproduced the results in 2021 with these files missing. Did you remove them from the Drive?

Looking forward to your reply!

https://drive.google.com/drive/folders/1qvvnfGwas8DUBrwD4DoBnvj8anjSLldZ

Get strange results using the pre-trained model

Thank you very much for publishing this excellent work. I am trying the inference code with the provided pre-trained model. I downloaded some audio from the Ellen show and used it as the model input, but the generated poses do not look right, especially the hands. I wonder whether I messed something up or whether I need to preprocess the audio I downloaded from the web.
Another question: what will I get if I use somebody else's audio as input to Ellen's model? Will I still get reasonable results?

Thank you very much!

inference

Hi, I tried to run this:

python -m audio_to_multiple_pose_gan.predict_audio --audio_path /content/parte1_2.wav --output_path /content/ --checkpoint ????? --speaker angelica -ag audio_to_pose_gans --gans 1

I have problems with the checkpoints: the given folder of pretrained models contains no .ckp file!
Can anyone help?

Question about SPEAKERS_CONFIG.

Could you please explain how the SPEAKERS_CONFIG params in consts.py are obtained? These params seem to be used to normalize the keypoints, but I am confused about how they are determined and what the purpose of the normalization process is.
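
In case it helps: per-speaker constants like these often encode a translation and scale that map each speaker's keypoints into a shared coordinate frame. A hypothetical sketch of such a normalization (the names shift and scale are assumptions, not the actual SPEAKERS_CONFIG fields):

    import numpy as np

    def normalize_keypoints(kp, shift, scale):
        # kp: [num_joints, 2] array of (x, y) keypoints; shift/scale would
        # come from the speaker's config, removing per-speaker differences
        # in position and body size
        return (kp - np.asarray(shift)) / scale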

Insufficient driver version: 410.79.0

Hi!

I've tried to run inference on Ellen's checkpoint by installing CUDA 9.0 on Google Colab and running the command:
!python -m audio_to_multiple_pose_gan.predict_audio --audio_path speech.wav --output_path output --checkpoint checkpoint/ckpt-step-296700.ckp --speaker ellen -ag audio_to_pose_gans --gans 1

It throws the following error:

2019-08-06 01:49:17.499788: E tensorflow/stream_executor/cuda/cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-08-06 01:49:17.499934: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] possibly insufficient driver version: 410.79.0

after predict

After predicting, how can I get a video of a real person?

A few typos & missed instructions in dataset.md

This is a great project! I found a few typos / missing instructions while following the dataset.md page to prepare the dataset. I hope this is helpful to other people.

Typos:

  1. Folder structure in Download speaker data / 4. Download the speaker videos from youtube: there should be one video folder for each speaker instead of two

  2. Download crop_intervals.py rather than crop_intervals.csv

Missing Instructions

  1. youtube_dl is used in download_youtube.py, but it is not listed in requirements.txt. To fix this:
    pip install --upgrade youtube_dl

DataLossError when trying to open pre-trained checkpoints!

@amirbar I'm facing an error when I run the following command for inference on audio:

python -m audio_to_multiple_pose_gan.predict_audio --audio_path oliver_test.wav -output_path tmp_output/ --checkpoint Gestures/pretrained_models/oliver/ckpt-step-296700.ckp.data-00000-of-00001 --speaker oliver -ag audio_to_pose_gans --gans 1

This is the error output I get:
DataLossError (see above for traceback): Unable to open table file Gestures/pretrained_models/conan/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea what's causing this? (Do let me know if you want the full log)

This is how I made it work (inference only)

Installing packages:
conda create -n x python=3.6
conda activate x
conda install --file requirements.txt -y

Download audio:
youtube-dl https://www.youtube.com/watch?v=6IdXEOdRxPs -x --audio-format wav

Errors and solutions:

  • ModuleNotFoundError: No module named 'numba.decorators'
    conda install numba==0.48

  • TypeError: unsupported operand type(s) for /: 'Dimension' and 'int'
    change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1]/2))
    to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))

  • TypeError: Value passed to parameter 'shape' has DataType float32 not in list of allowed values: int32, int64
    change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))
    to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, int(shape[-1].value/2)))

  • DataLossError (see above for traceback): Unable to open table file /media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
    [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

    change: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/'
    to: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp'
    i.e., pass the checkpoint prefix, not *.ckp.(index|meta|data); see the sketch after this list
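
The reason the prefix works: a TF 1.x checkpoint is a family of files (.index, .meta, .data-*) sharing one prefix, and the restore API expects that prefix rather than any individual shard. A minimal sketch (TF 1.x assumed; the model graph must already be built):

    import tensorflow as tf

    saver = tf.train.Saver()  # assumes the model's variables are already defined
    with tf.Session() as sess:
        # pass the common prefix; TF locates the .index/.data shards itself
        saver.restore(sess, '/path/to/rock/ckpt-step-296700.ckp')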

You will have to change the paths to match your setup.
How to run:
/home/vali/system/apps/anaconda3/envs/x/bin/python -m audio_to_multiple_pose_gan.predict_audio --audio_path '/media/data/study/AI/speech2gesture/History of Rock, Part 1 by University of Rochester-6IdXEOdRxPs.wav' --output_path '/media/data/study/AI/speech2gesture/tmp/' --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp' --speaker rock -ag audio_to_pose_gans --gans 1
