
speech2gesture's People

Contributors

amirbar, shinydinosaur


speech2gesture's Issues

need more information about how to build data set

Hi, and thanks for this work.
Could you please share the script, or explain how you built this dataset? I am trying to build a similar dataset for another language, and it is unclear how the many files (e.g., the jpg files) and the DataFrame columns such as interval_id and frame_id were extracted.
Please help me with this.

Training Procedure to Achieve Presented Results

Hi,

Thank you for your great work.
In the paper you mentioned that you train for 300K/90K iterations with and without an adversarial loss, respectively, to achieve the results presented in Tables 1-3.
I assume that you first run the train.py script with --epochs 1000 and --lambda_gan 1.0. Then, you select the best model and run the same script again with --epochs 300 and --lambda_gan 0. The best model of the second run should achieve the presented results. Is that assumption correct, or did you use a different approach?
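
If that assumption holds, the schedule would look something like this (the --epochs and --lambda_gan flags are taken from the question above; the exact values and any other required flags are assumptions, not verified against train.py):

    python train.py --epochs 1000 --lambda_gan 1.0   # stage 1: regression + adversarial loss
    # pick the best stage-1 checkpoint, then fine-tune without the GAN term:
    python train.py --epochs 300 --lambda_gan 0      # stage 2: regression loss only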

where is the frame_df.csv.tgz

Hello, I appreciate your work!
I have a question: where is frame_df.csv.tgz? I cannot find it at the link. Please tell me how and where to get it. Thanks.

Misalignment of audio and frames

According to common/consts.py, the following constants are defined:

SR = 16000
AUDIO_SHAPE = 67267
FPS = 15
FRAMES_PER_SAMPLE = 64

From the first three constants, we can compute num_frames = AUDIO_SHAPE / SR * FPS = 67267 / 16000 * 15 ≈ 63.06, which is about one whole frame less than FRAMES_PER_SAMPLE.

We encountered this problem while testing the model on a longer audio sequence, where the misalignment is magnified.
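
The arithmetic is easy to check directly (plain Python, using only the constants above):

    SR = 16000               # audio sample rate (Hz)
    AUDIO_SHAPE = 67267      # audio samples per training example
    FPS = 15                 # pose frame rate
    FRAMES_PER_SAMPLE = 64   # pose frames per training example

    covered_frames = AUDIO_SHAPE / SR * FPS        # ~63.06 frames covered by the audio
    needed_samples = FRAMES_PER_SAMPLE / FPS * SR  # ~68266.7 samples needed for 64 frames
    print(covered_frames, needed_samples)
    # the ~0.94-frame shortfall per 64-frame window accumulates on long sequences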

Random noise in Generator

Hello Amir,

Does your generator take any random noise as input, or does it generate the same poses for the same audio input? I'm working on a similar sequence-generation project and am curious how you introduce randomness into the sequence.

Thanks.

Code Release

Hello,
When do you plan to release your code?

about gan

Hello, I appreciate your work!
I have a question: after trying your pretrained models, the result is a skeletal animation, but I want to get an animation of a human figure rather than a skeleton. Did I do something wrong? If so, please tell me which file I should run. Thanks.

How can you run your network on arbitrary audio durations at test time?

"During training, we take as input spectrograms corresponding to about 4 seconds of audio and predict 64 pose vectors, which correspond to about 4 seconds at a 15Hz frame-rate. At test time we can run our network on arbitrary audio durations"(Section 4.3).

What are the details of the testing implementation?
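
One plausible reading (an illustration, not code from the repository) is that the network is fully convolutional along the time axis, so the number of predicted poses simply scales with the length of the input audio:

    SR, FPS = 16000, 15

    def expected_pose_count(num_audio_samples):
        # 64 poses correspond to ~4.27 s of audio at 15 Hz, so a fully
        # convolutional model emits proportionally more poses for longer audio
        return int(round(num_audio_samples / SR * FPS))

    print(expected_pose_count(4 * SR))    # 60 poses for 4 s of audio
    print(expected_pose_count(60 * SR))   # 900 poses for 60 s of audio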

Enhancement - Gesture 2 Speech

Hi @amirbar,
Can the dataset be helpful for generating speech from gestures?
Sign-language applications and human-computer interaction could be among the use cases.

Some questions about the code?

I notice that the loss function of the discriminator is MSE. In my view, cross-entropy is the commonly used loss for binary classification in a discriminator, so I would like to know why MSE was chosen. I'm also confused by the output of the discriminator: it is a vector whose length differs from the input's. What does it denote?
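
For what it's worth, an MSE discriminator loss is characteristic of least-squares GANs (LSGAN), and an output vector shorter than the input usually indicates a patch-style discriminator that emits one real/fake score per temporal window. A minimal TF 1.x sketch of that combination (an illustration, not the repository's actual loss code):

    import tensorflow as tf  # TF 1.x assumed, as in this repository

    def lsgan_d_loss(d_real, d_fake):
        # d_real, d_fake: [batch, T'] patch scores, one per temporal window;
        # real patches are pushed toward 1, fake patches toward 0
        return tf.reduce_mean(tf.square(d_real - 1.0)) + tf.reduce_mean(tf.square(d_fake))

    def lsgan_g_loss(d_fake):
        # the generator tries to make every patch score look real (close to 1)
        return tf.reduce_mean(tf.square(d_fake - 1.0))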

Can you provide some help on getting the videos?

Thank you for the data you collected; however, I ran into some problems obtaining it.
We noticed that the videos of "Rock" are currently unavailable for download.
I also don't know how to download the videos of "Jon"; could you give me some help?

Missing frames_df_10_19_19.csv and pretrained models

Hi, thank you so much for your work!

I ran into some problems reproducing your work. Many pre-trained models are missing from the Google Drive: none of them has a data-00000-of-00001 file, and some contain only an .index file. https://drive.google.com/drive/folders/1yBJur-FjtMGNZTKKvEY5WuppG2yp2SJO

I then tried to retrain your model, but frames_df_10_19_19.csv is also missing from the Drive.

It seems unlikely that anyone could have reproduced the results in 2021 with these files missing. Did you remove them from the Drive?

Looking forward to your reply!

https://drive.google.com/drive/folders/1qvvnfGwas8DUBrwD4DoBnvj8anjSLldZ

Get strange results using the pre-trained model

Thank you very much for publishing this excellent work. I am trying the inference code with the provided pre-trained model. I downloaded some audio from the Ellen show and used it as the model input, but the generated poses do not look right, especially the hands. I wonder whether I messed something up or whether I need to preprocess the audio I downloaded from the web.
Another question: what will I get if I use somebody else's audio as input to Ellen's model? Will I still get reasonable results?

Thank you very much!

inference

Hi, I tried to run this:

python -m audio_to_multiple_pose_gan.predict_audio --audio_path /content/parte1_2.wav --output_path /content/ --checkpoint ????? --speaker angelica -ag audio_to_pose_gans --gans 1

I have problems with the checkpoints: the given folder of pretrained models contains no .ckp file!
Can anyone help?

Question about SPEAKERS_CONFIG.

Could you please explain how the SPEAKERS_CONFIG params in consts.py are obtained? These params seem to be used to normalize the keypoints, but I am confused about how they are determined and what the purpose of the normalization process is.
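
In case it helps: per-speaker constants like these often encode a translation and scale that map each speaker's keypoints into a shared coordinate frame. A hypothetical sketch of such a normalization (the names shift and scale are assumptions, not the actual SPEAKERS_CONFIG fields):

    import numpy as np

    def normalize_keypoints(kp, shift, scale):
        # kp: [num_joints, 2] array of (x, y) keypoints; shift/scale would
        # come from the speaker's config, removing per-speaker differences
        # in position and body size
        return (kp - np.asarray(shift)) / scale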

Insufficient driver version: 410.79.0

Hi!

I've tried to run inference on Ellen's checkpoint by installing CUDA 9.0 on Google Colab and running the command:
!python -m audio_to_multiple_pose_gan.predict_audio --audio_path speech.wav --output_path output --checkpoint checkpoint/ckpt-step-296700.ckp --speaker ellen -ag audio_to_pose_gans --gans 1

It throws the following error:

2019-08-06 01:49:17.499788: E tensorflow/stream_executor/cuda/cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-08-06 01:49:17.499934: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] possibly insufficient driver version: 410.79.0

after predict

After predicting, how can I get a video of a real person?

A few typos & missed instructions in dataset.md

This is a great project! I found a few typos / missing instructions while following the dataset.md page to prepare the dataset. I hope this is helpful to other people.

Typos:

  1. Folder structure in Download speaker data / 4. Download the speaker videos from youtube: there should be one video folder for each speaker instead of two

  2. Download crop_intervals.py rather than crop_intervals.csv

Missing Instructions

  1. youtube_dl is used in download_youtube.py, but it is not listed in requirements.txt. To fix this:
    pip install --upgrade youtube_dl

DataLossError when trying to open pre-trained checkpoints!

@amirbar I'm facing an error when I run the following command for inference on audio:

python -m audio_to_multiple_pose_gan.predict_audio --audio_path oliver_test.wav -output_path tmp_output/ --checkpoint Gestures/pretrained_models/oliver/ckpt-step-296700.ckp.data-00000-of-00001 --speaker oliver -ag audio_to_pose_gans --gans 1

This is the error output I get:
DataLossError (see above for traceback): Unable to open table file Gestures/pretrained_models/conan/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea what's causing this? (Do let me know if you want the full log)

This is how I made it work (inference only)

Installing packages:
conda create -n x python=3.6
conda activate x
conda install --file requirements.txt -y

Download audio:
youtube-dl https://www.youtube.com/watch?v=6IdXEOdRxPs -x --audio-format wav

Errors and solutions:

  • ModuleNotFoundError: No module named 'numba.decorators'
    conda install numba==0.48

  • TypeError: unsupported operand type(s) for /: 'Dimension' and 'int'
    change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1]/2))
    to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))

  • TypeError: Value passed to parameter 'shape' has DataType float32 not in list of allowed values: int32, int64
    change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))
    to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, int(shape[-1].value/2)))

  • DataLossError (see above for traceback): Unable to open table file /media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
    [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

    change: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/'
    to: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp'
    i.e., pass the checkpoint prefix, not *.ckp.(index|meta|data); see the sketch after this list
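
The reason the prefix works: a TF 1.x checkpoint is a family of files (.index, .meta, .data-*) sharing one prefix, and the restore API expects that prefix rather than any individual shard. A minimal sketch (TF 1.x assumed; the model graph must already be built):

    import tensorflow as tf

    saver = tf.train.Saver()  # assumes the model's variables are already defined
    with tf.Session() as sess:
        # pass the common prefix; TF locates the .index/.data shards itself
        saver.restore(sess, '/path/to/rock/ckpt-step-296700.ckp')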

You will have to change the paths to match your setup.
How to run:
/home/vali/system/apps/anaconda3/envs/x/bin/python -m audio_to_multiple_pose_gan.predict_audio --audio_path '/media/data/study/AI/speech2gesture/History of Rock, Part 1 by University of Rochester-6IdXEOdRxPs.wav' --output_path '/media/data/study/AI/speech2gesture/tmp/' --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp' --speaker rock -ag audio_to_pose_gans --gans 1
