
speech-backbones's Introduction

Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Grad-TTS

Official implementation of the Grad-TTS model based on Diffusion Probabilistic Modelling. For all details check out our paper accepted to ICML 2021 via this link.

Authors: Vadim Popov*, Ivan Vovk*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.

*Equal contribution.

SPIRAL

Official implementation of SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training. For all details check out our paper accepted to ICLR 2022 via this link.

Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu.

DiffVC

Official implementation of the paper "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme" (ICLR 2022, Oral). Link.

Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, Jiansheng Wei.

speech-backbones's People

Contributors

huawei-noah-admin, ivanvovk, wenyong-h, ytyeung


speech-backbones's Issues

ASR fine-tuning?

Hi,
I want to fine-tune this great model with my own dataset. Is it possible? Is there a pre-trained model for this?

Thanks in advance

Not able to generate audio of as good quality with LibriTTS as with LJSpeech

Hi, thank you for the great work and for releasing the pretrained models. I tried to train the Grad-TTS model on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This is true even when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

Clipping distortion of the generated waveform

Hi, thanks for sharing the code. I have tried it on different datasets, including Chinese and English ones. However, some of the generated waveforms are clipped (as if the generated mel spectrogram is too energetic at some positions?). I first tried different vocoders, including HiFi-GAN and Griffin-Lim, and the clipping still happened. Then I tried different value ranges for the mel spectrogram, including the log domain and normalization to [-1, 1], again without avoiding this phenomenon. Finally, I tried different temperature values (1.0, 1.3, 1.5), and again the clipping could not be avoided. I would like to know the possible causes of this phenomenon and how to solve it. If anyone has encountered this situation, please feel free to discuss it.
[Attached image: generated waveform with samples exceeding the valid amplitude range.]
As shown above, the waveform values at some locations are out of range, which causes clipping.
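In case it helps others debugging the same symptom, below is a minimal, repo-agnostic sketch (assuming the vocoder output wav is a float NumPy array that should stay within [-1, 1]) that reports how many samples exceed the valid range and peak-normalizes before writing the file. This only hides the clipping; it does not fix the over-energetic mel frames themselves.

import numpy as np
import soundfile as sf

def report_and_limit(wav, limit=1.0, headroom=0.95):
    # count samples whose magnitude exceeds the valid range
    n_clipped = int(np.sum(np.abs(wav) > limit))
    print(f"{n_clipped} of {wav.size} samples exceed |{limit}|")
    peak = np.abs(wav).max()
    if peak > limit:
        # rescale so the peak sits just below full scale instead of hard-clipping
        wav = wav * (headroom / peak)
    return wav

# wav = report_and_limit(wav)
# sf.write("sample_limited.wav", wav, 22050)   # 22050 assumed; use your sampling rate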

About the prior loss and MAS algorithm

Great work! I've been studying the paper and the code recently, and there is something that confuses me.

In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel spectrogram y from these Gaussians, so y is not Gaussian anymore. But I gather from Eq. (14) and the code that when you calculate the prior loss, you are actually calculating the log-likelihood of y under the Gaussian distribution with mean mu. Also, when applying MAS for duration modeling, you perform a similar likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the code of Glow-TTS: there, z is used to evaluate the Gaussian likelihood with mean mu, where z is the latent variable obtained from the mel spectrogram through a normalizing flow. That seems more reasonable to me for now, since z is Gaussian by construction.
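For reference (this is my reading of the released code, not an official statement), the prior loss does appear to be the Gaussian negative log-likelihood of the ground-truth mel y under the encoder output N(mu_y, I), roughly of the following form, where y_mask marks the valid frames and n_feats is the number of mel bins:

import math
import torch

def prior_loss(y, mu_y, y_mask, n_feats):
    # negative log-likelihood of the ground-truth mel y under N(mu_y, I),
    # masked to valid frames and averaged over all mel bins
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi))
    nll = torch.sum(nll * y_mask)
    return nll / (torch.sum(y_mask) * n_feats)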

mels_mode generation

Hi,
I have created TextGrid files in the subfolder textgrids using MFA.
I'm facing issues getting average-voice mel spectrograms into the subfolder mels_mode.
I'm using the get_avg_mels.ipynb Jupyter notebook to get the average-voice mel spectrograms.
It generates a mels_mode dictionary with phonemes as keys, but there are no further instructions on how to map them to speakers and create the mels_mode subfolder from this dictionary.
@ivanvovk @ytyeung @wenyong-h @huawei-noah-admin @zhangjiajin2 Please help.

for p in phoneme_list:
    mels_mode[p] = mode(np.asarray(mels_mode_dict[p]), 0).mode[0]
    lens[p] = np.mean(np.asarray(lens_dict[p]))

Two questions about DiffVC

Hello, thank you for sharing this excellent work. After briefly browsing the code, I have two questions:
(1) What is x_ref used for? During training it seems to be a different fragment of the same mel spectrogram as x. Which part of the paper does it correspond to?
(2) Why do we need to perform a weighted summation of mean and x? Does this mean that the reverse diffusion during inference starts from this weighted mean_x?
I'm new to diffusion models and don't quite understand the theory in the paper, so I'm sorry if these are naive questions.
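Regarding question (2), for context: in the Grad-TTS-style forward diffusion that DiffVC builds on, the noised sample at time $t$ is (as far as I understand the papers) drawn from a Gaussian whose mean is a weighted combination of the clean mel $X_0$ and the prior mean $\bar{X}$ (the "average voice" mel in DiffVC), which is presumably where the weighted sum of mean and x in the code comes from:

$$X_t \mid X_0 \sim \mathcal{N}\left(e^{-\frac{1}{2}\int_0^t \beta_s\,ds}\,X_0 + \left(1 - e^{-\frac{1}{2}\int_0^t \beta_s\,ds}\right)\bar{X},\;\left(1 - e^{-\int_0^t \beta_s\,ds}\right)I\right)$$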

Possibly missing __dict__ in the Projector class' constructor

While loading the pretrained weights of the ST2VecEncoder, I had to replace **conv_cfg_i with **conv_cfg_i.__dict__ in __init__ of the Projector class (SPIRAL/nemo/collections/asr/parts/spec2vec.py). Doing this allowed me to load all the weights and match the keys successfully. Nonetheless, I was curious to know whether I was missing some installation step.
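To make the failure mode concrete, here is a tiny self-contained illustration of why the change is needed when the per-layer conv config is a plain object rather than a mapping (ConvCfg and build_layer are stand-ins, not the actual classes in spec2vec.py):

from dataclasses import dataclass

@dataclass
class ConvCfg:                 # stand-in for the per-layer config object
    kernel_size: int = 3
    stride: int = 1

def build_layer(**kwargs):     # stand-in for the constructor called in Projector.__init__
    return kwargs

cfg = ConvCfg()
# build_layer(**cfg)           # TypeError: argument after ** must be a mapping
layer = build_layer(**cfg.__dict__)   # works: unpacks the object's attributes as kwargs
print(layer)                   # {'kernel_size': 3, 'stride': 1}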

GradTTS device compatibility

I tried to use GradTTS by moving the model to MPS (.to('mps')), but the error message in the terminal told me that MPS doesn't support 3D padding. So I tried it on the CPU; it worked, but it is too slow. Is there any way to adapt GradTTS to the Mac's MPS device?
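Not a real fix, but a possible workaround sketch: PyTorch can route operators that MPS does not implement back to the CPU if the PYTORCH_ENABLE_MPS_FALLBACK environment variable is set before torch is imported. The unsupported 3D padding would then run on the CPU while the rest stays on MPS (still slower than CUDA, and I have not verified this on GradTTS specifically).

import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")   # must be set before importing torch

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# model = GradTTS(...).to(device)   # GradTTS built as in the repo's inference script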

Diffusion loss not decreasing

Hi,
I have trained the GradTTS model on the Indian accent English dataset, and the results are pretty awesome.

Looking at the logs, I was startled to see that, unlike the other losses, the diffusion loss did not decrease throughout training and also fluctuated a lot. Can anyone explain why this is the case, and, if the diffusion loss fluctuates so much, why it is used in the total loss calculation?

I have attached my tensorboard outputs.

  • Training Diffusion Loss
  • Training Prior Loss
  • Training Duration Loss
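One likely (unconfirmed) explanation: the diffusion loss is a denoising/score-matching objective evaluated at a diffusion time t drawn at random for every batch, so its magnitude depends heavily on which t happens to be sampled, and the logged curve stays noisy even while the model improves. The toy sketch below (a generic noise-prediction loss, not the repo's exact objective) shows that structure.

import torch

def diffusion_loss_step(noise_model, x0):
    # a fresh random diffusion time for every batch: the loss magnitude depends
    # strongly on the sampled t, so the logged value fluctuates by design
    t = torch.rand(x0.shape[0], device=x0.device)
    sigma = t.view(-1, *([1] * (x0.dim() - 1)))   # toy noise scale, not the real schedule
    noise = torch.randn_like(x0)
    xt = x0 + sigma * noise
    pred_noise = noise_model(xt, t)               # network tries to recover the injected noise
    return torch.mean((pred_noise - noise) ** 2)

# loss = diffusion_loss_step(lambda x, t: torch.zeros_like(x), torch.randn(4, 80, 100))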

About end2end implementation

Hi, thank you for sharing your excellent work.
I want to ask about your end-to-end TTS model. In the paper, you state that only the decoder is changed so that it can generate a waveform (using the WaveGrad architecture), so the $\mu$ vector is no longer the mean statistics of the mel spectrogram, just hidden features. I wonder what likelihood you feed into monotonic alignment search if the model cannot access mel features (mu_x and y in your code). Did you still keep the $\mathcal{L}_{enc}$ loss that constrains the similarity between mu_y and the mel spectrogram? Furthermore, since we do not know the mean statistics of the waveform, does the SDE have to change as well? I listened to some samples from the e2e model on your website and noticed that although the audio had noise, the alignments were quite decent. Did you try to improve the audio quality with an adversarial loss like HiFi-GAN's?
Thank you.

A bug in model/tts.py

Formally, the shape of the variable y_cut_mask (created here) might not match the shape of the variable y_cut in the last dimension (which is out_size for y_cut).
To see why, look at the function sequence_mask, which is invoked to create y_cut_mask. Since the parameter max_length is not provided, the length dimension will be of size max(length) (see here). Thus, if all sequences in a batch passed to GradTTS.forward(...) are shorter than out_size, the last dimension of y_cut_mask will not match the last dimension of y_cut.
An easy experiment exposes the issue: start training GradTTS with batch_size == 1. In that case, if any sequence is shorter than out_size, training fails with a shape mismatch.
The fix I suggest is elementary: pass the parameter max_length=out_size when calling sequence_mask here.
Moreover, we had better skip cropping the mel entirely when all sequences in a batch passed to GradTTS.forward(...) are shorter than out_size. Concretely, I suggest adding the condition y_max_length > out_size here.
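If it helps, the two suggested changes would look roughly like this inside GradTTS.forward (variable names follow model/tts.py; the surrounding lines are paraphrased, not copied):

# crop random out_size-frame segments only when cropping can actually shorten something
if out_size is not None and y_max_length > out_size:
    ...
    # pass max_length so the mask is always exactly out_size frames wide,
    # even if every sequence in the batch is shorter than out_size
    y_cut_mask = sequence_mask(y_cut_lengths, max_length=out_size).unsqueeze(1).to(y_mask)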

about diffVC on Mandarin datasets

Hello, I adapted the DiffVC code to Mandarin datasets. However, the converted audio has problems with tones (tone sandhi). I want to ask whether this performance is normal.

Model training question

Hi, thanks for sharing the code.
I have a folder with wav files from different speakers. I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly should I fill them?
Is there a more detailed set of instructions somewhere?
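I cannot speak for the exact format this repo expects, but as a generic illustration of what "mels" and "embeds" folders typically contain, the sketch below computes a log-mel spectrogram with librosa and a speaker embedding with resemblyzer and stores both as .npy files. Every parameter here (22050 Hz, n_fft, hop_length, n_mels, the log offset) and the file layout are assumptions and must be replaced by whatever the repo's config actually uses.

import numpy as np
import librosa
from resemblyzer import VoiceEncoder, preprocess_wav

wav_path = "wavs/speaker1/utt1.wav"          # hypothetical input file
wav, sr = librosa.load(wav_path, sr=22050)   # 22050 is an assumption

# log-mel spectrogram (all parameters are placeholders, match them to the repo's config)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
np.save("mels/utt1_mel.npy", np.log(mel + 1e-5))

# utterance-level speaker embedding from a generic pretrained speaker encoder
encoder = VoiceEncoder()
embed = encoder.embed_utterance(preprocess_wav(wav_path))
np.save("embeds/utt1_embed.npy", embed)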

How is `out_size` in `params` determined

Hi, I am modifying the code for my own purposes. I notice here:

out_size = fix_len_compatibility(2*22050//256)

the argument is hard-coded, and I guess 22050 and 256 are the sampling rate and frame shift used for LJSpeech, right? If this is true, should I change them to other values when dealing with a different dataset?
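For what it's worth, the expression appears to mean "about two seconds of audio, measured in mel frames, rounded to a length the U-Net can process", so a dataset-agnostic version could look like this (sample_rate and hop_length are assumptions that must match your own preprocessing; fix_len_compatibility is the helper already used in the repo):

sample_rate = 22050   # replace with your dataset's sampling rate
hop_length = 256      # replace with your dataset's frame shift
# roughly two seconds of audio expressed as a number of mel frames
out_size = fix_len_compatibility(2 * sample_rate // hop_length)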

Grad-TTS in multispeaker setting

Thank you for releasing the original implementation of Grad-TTS. I would like to know whether a multi-speaker setting is available or planned for release.

I am implementing a multi-speaker setting using this repo. Would the maintainers of this repo be interested in discussing or providing feedback on a multi-speaker Grad-TTS implementation?

Regards
Ajinkya

[Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'

[Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1208, in cythonize_one
result = compile_single(pyx_file, options, full_module_name=full_module_name)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 727, in compile_single
return run_pipeline(source, options, full_module_name)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 515, in run_pipeline
err, enddata = Pipeline.run_pipeline(pipeline, source)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 355, in run_pipeline
data = run(phase, data)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 335, in run
return phase(data)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 52, in generate_pyx_code_stage
module_node.process_implementation(options, result)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 143, in process_implementation
self.generate_c_code(env, options, result)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 411, in generate_c_code
f = open_new_file(result.c_file)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Utils.py", line 76, in open_new_file
return codecs.open(path, "w", encoding="ISO-8859-1")
File "/usr/local/lib/python3.8/codecs.py", line 905, in open
file = builtins.open(filename, mode, buffering)

Not possible to build

This can't be built at all. Packages like torchaudio==0.5.1 cannot be found anymore, and there are Cython errors. Is there any chance this will be corrected? I can't get DiffVC to run because of this.

Typo in some equations in GradTTS paper

Thanks for your great work on GradTTS!
However, I recently found a tiny error in the arXiv version 2 of the Grad-TTS paper (https://arxiv.org/pdf/2105.06337.pdf). In Eq. (31) and Eq. (32) in the appendix, $X_t$ and $\mu$ are put in the wrong order, i.e. it should probably be $\mu - X_t$ rather than $X_t - \mu$. This typo likely originates from the line above Eq. (31), "In our case $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(X_t - \mu)\beta$ and ...", where it should be $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(\mu - X_t)\beta$ instead. The other parts of the paper do not seem to be affected by this, and the derivations are solid and fluent.
Again, great thanks for the work!
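Spelling the correction out as a display equation (this simply restates the claim above, in the paper's notation for $\beta_t$, $\Sigma$ and $\mu$), the forward SDE and its drift should read:

$$dX_t = \tfrac{1}{2}\,\Sigma^{-1}(\mu - X_t)\,\beta_t\,dt + \sqrt{\beta_t}\,dW_t, \qquad f(X_t, t) = \tfrac{1}{2}\,\Sigma^{-1}(\mu - X_t)\,\beta_t .$$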

Generated outputs sound robotic in some cases!

Hello!

I have trained GradTTS on quite a few datasets and observed that it produces excellent results for most voices, but the generated results for heavy voices sound very robotic! Can anyone suggest possible reasons for this?

I have attached the samples of original and generated voices.

Thank you!

  • Original male voice: org_sagar.mov
  • Generated male voice: gen_sagar.mov
  • Original female voice: org_anupama.mov
  • Generated female voice: gen_anupama.mov

Multi-GPU training and expected epochs

Hi,

First of all, thanks for the nice paper and release code. I am testing your model for a different dataset and two questions come up:

  1. What is the estimated number of epochs needed to train the model? We have experienced some degradation when the model is overtrained on (overfits?) the data.
  2. Is there a way to train the model in a multi-GPU setup? We have more GPUs available; however, the code seems to run only on the first available GPU given by the CUDA_VISIBLE_DEVICES variable. (A possible generic workaround is sketched below.)

Thanks!
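On question 2, purely as a hedged generic sketch (I have not checked that the repo's training loop works with it unchanged): wrapping the model in torch.nn.DataParallel spreads each batch over the GPUs visible through CUDA_VISIBLE_DEVICES.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))   # stand-in for GradTTS
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the module on every visible GPU; forward() splits the batch along dim 0
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()
# note: custom methods (e.g. a compute_loss helper, if the repo uses one) must then be
# reached via model.module.<method>, which may require small edits to the training script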

Attention layer in GradTTS

Hey @ivanvovk et al.

Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here:

k = k.softmax(dim=-1)

computes the softmax on the projected key values instead of computing it on the product of query and key.

Usually, I know self-attention as:

Value x Softmax(Query x Key^T / d_k)

but it seems like here it is

(Value x Softmax(Key)) x Query

=> Is it similar to self-attention? Where does it come from?

Best,
Patrick
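For anyone else puzzled by this: the layer matches the "linear attention" variant that is common in diffusion U-Nets (it appears, for example, in lucidrains' denoising-diffusion-pytorch, which this U-Net seems to follow). The softmax over key positions builds a dim-by-dim global context first and then hands it to the queries, giving roughly O(n·d^2) instead of O(n^2·d) cost. A self-contained sketch of that computation (my reading of the code, not a verified copy):

import torch

def linear_attention(q, k, v):
    # q, k, v: (batch, heads, dim, length), as in the repo's LinearAttention module
    k = k.softmax(dim=-1)                                # softmax over key positions, not over q·k^T
    context = torch.einsum('bhdn,bhen->bhde', k, v)      # (batch, heads, dim, dim) global summary
    return torch.einsum('bhde,bhdn->bhen', context, q)   # distribute the summary to every query position

q = k = v = torch.randn(2, 4, 16, 100)
print(linear_attention(q, k, v).shape)                   # torch.Size([2, 4, 16, 100])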

Different Implementation of Diffusion Model

I'm a researcher working on building a TTS model using diffusion. While looking for the implementation of this, I found this repo.

According to my understanding of the paper, both processes in the decoder's diffusion model, forward and reverse diffusion, are supposed to take place on the latent-space vector z (which is provided by the U-Net encoder part). However, the repo's implementation seems to differ from this understanding.
Could you give the reasoning behind this?

Generated Samples are noisy

I have used the pretrained model provided on the Google Drive of the official repo. When I ran inference.py from that checkpoint, the generated samples I observed were very noisy for different numbers of reverse diffusion steps (10, 20, 30, 40, 50, 70). I would appreciate any suggestions regarding this. I used the checkpoint of the LibriTTS model (not the LJSpeech one). However, on the demo page the quality of the samples is sufficiently good.
