
speech-backbones's Introduction

Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Grad-TTS

Official implementation of the Grad-TTS model based on Diffusion Probabilistic Modelling. For all details check out our paper accepted to ICML 2021 via this link.

Authors: Vadim Popov*, Ivan Vovk*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.

*Equal contribution.

SPIRAL

Official implementation of SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training. For all details check out our paper accepted to ICLR 2022 via this link.

Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu.

DiffVC

Official implementation of the paper "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme" (ICLR 2022, Oral). Link.

Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, Jiansheng Wei.

speech-backbones's People

Contributors

huawei-noah-admin, ivanvovk, wenyong-h, ytyeung


speech-backbones's Issues

ASR fine-tuning?

Hi,
I want to fine-tune this great model with my own dataset. Is it possible? Is there a pre-trained model for this?

Thanks in advance

Not able to generate audio of as good quality with LibriTTS as with LJSpeech

Hi, thank you for the great work and for releasing the pretrained models. I tried to train the Grad-TTS model on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This is true even when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

Clipping distortion of the generated waveform

Hi, thanks for sharing the code. I have tried it on different datasets, including Chinese and English ones. However, some of the generated waveforms are clipped (as if the generated mel spectrogram is too energetic at some positions?). I first tried different vocoders, including HiFi-GAN and Griffin-Lim, and the clipping still happened. Then I tried different value ranges for the mel spectrogram, including the log domain and normalization to [-1, 1], again without avoiding this phenomenon. Finally, I tried different temperature values (1.0, 1.3, 1.5), and again the clipping could not be avoided. I would like to know the possible causes of this phenomenon and how to solve it. If anyone has encountered this situation, please feel free to discuss it.
[Attached image: generated waveform with samples exceeding the valid amplitude range.]
As shown above, the waveform values at some locations are out of range, which causes clipping.
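In case it helps others debugging the same symptom, below is a minimal, repo-agnostic sketch (assuming the vocoder output wav is a float NumPy array that should stay within [-1, 1]) that reports how many samples exceed the valid range and peak-normalizes before writing the file. This only hides the clipping; it does not fix the over-energetic mel frames themselves.

import numpy as np
import soundfile as sf

def report_and_limit(wav, limit=1.0, headroom=0.95):
    # count samples whose magnitude exceeds the valid range
    n_clipped = int(np.sum(np.abs(wav) > limit))
    print(f"{n_clipped} of {wav.size} samples exceed |{limit}|")
    peak = np.abs(wav).max()
    if peak > limit:
        # rescale so the peak sits just below full scale instead of hard-clipping
        wav = wav * (headroom / peak)
    return wav

# wav = report_and_limit(wav)
# sf.write("sample_limited.wav", wav, 22050)   # 22050 assumed; use your sampling rate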

About the prior loss and MAS algorithm

Great work! I've been studying the paper and the code recently, and there is something that confuses me.

In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel spectrogram y from these Gaussians, so y is not Gaussian anymore. But I gather from Eq. (14) and the code that when you calculate the prior loss, you are actually calculating the log-likelihood of y under the Gaussian distribution with mean mu. Also, when applying MAS for duration modeling, you perform a similar likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the code of Glow-TTS: there, z is used to evaluate the Gaussian likelihood with mean mu, where z is the latent variable obtained from the mel spectrogram through a normalizing flow. That seems more reasonable to me for now, since z is Gaussian by construction.
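For reference (this is my reading of the released code, not an official statement), the prior loss does appear to be the Gaussian negative log-likelihood of the ground-truth mel y under the encoder output N(mu_y, I), roughly of the following form, where y_mask marks the valid frames and n_feats is the number of mel bins:

import math
import torch

def prior_loss(y, mu_y, y_mask, n_feats):
    # negative log-likelihood of the ground-truth mel y under N(mu_y, I),
    # masked to valid frames and averaged over all mel bins
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi))
    nll = torch.sum(nll * y_mask)
    return nll / (torch.sum(y_mask) * n_feats)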

mels_mode generation

Hi,
I have created TextGrid files in the subfolder textgrids using MFA.
I'm facing issues getting average-voice mel spectrograms into the subfolder mels_mode.
I'm using the get_avg_mels.ipynb Jupyter notebook to get the average-voice mel spectrograms.
It generates a mels_mode dictionary with phonemes as keys, but there are no further instructions on how to map them to speakers and create the mels_mode subfolder from this dictionary.
@ivanvovk @ytyeung @wenyong-h @huawei-noah-admin @zhangjiajin2 Please help.

for p in phoneme_list:
    mels_mode[p] = mode(np.asarray(mels_mode_dict[p]), 0).mode[0]
    lens[p] = np.mean(np.asarray(lens_dict[p]))

Two questions about DiffVC

Hello, thank you for sharing this excellent work. After briefly browsing the code, I have two questions:
(1) What is x_ref used for? During training it seems to be a different fragment of the same mel spectrogram as x. Which part of the paper does it correspond to?
(2) Why do we need to perform a weighted summation of mean and x? Does this mean that the reverse diffusion during inference starts from this weighted mean_x?
I'm new to diffusion models and don't quite understand the theory in the paper, so I'm sorry if these are naive questions.
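Regarding question (2), for context: in the Grad-TTS-style forward diffusion that DiffVC builds on, the noised sample at time $t$ is (as far as I understand the papers) drawn from a Gaussian whose mean is a weighted combination of the clean mel $X_0$ and the prior mean $\bar{X}$ (the "average voice" mel in DiffVC), which is presumably where the weighted sum of mean and x in the code comes from:

$$X_t \mid X_0 \sim \mathcal{N}\left(e^{-\frac{1}{2}\int_0^t \beta_s\,ds}\,X_0 + \left(1 - e^{-\frac{1}{2}\int_0^t \beta_s\,ds}\right)\bar{X},\;\left(1 - e^{-\int_0^t \beta_s\,ds}\right)I\right)$$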

Possibly missing __dict__ in the Projector class' constructor

While loading the pretrained weights of the ST2VecEncoder, I had to replace **conv_cfg_i with **conv_cfg_i.__dict__ in __init__ of the Projector class (SPIRAL/nemo/collections/asr/parts/spec2vec.py). Doing this allowed me to load all the weights and match the keys successfully. Nonetheless, I was curious to know whether I was missing some installation step.
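To make the failure mode concrete, here is a tiny self-contained illustration of why the change is needed when the per-layer conv config is a plain object rather than a mapping (ConvCfg and build_layer are stand-ins, not the actual classes in spec2vec.py):

from dataclasses import dataclass

@dataclass
class ConvCfg:                 # stand-in for the per-layer config object
    kernel_size: int = 3
    stride: int = 1

def build_layer(**kwargs):     # stand-in for the constructor called in Projector.__init__
    return kwargs

cfg = ConvCfg()
# build_layer(**cfg)           # TypeError: argument after ** must be a mapping
layer = build_layer(**cfg.__dict__)   # works: unpacks the object's attributes as kwargs
print(layer)                   # {'kernel_size': 3, 'stride': 1}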

GradTTS device compatibility

I tried to use GradTTS by moving the model to MPS (.to('mps')), but the error message in the terminal told me that MPS doesn't support 3D padding. So I tried it on the CPU; it worked, but it is too slow. Is there any way to adapt GradTTS to the Mac's MPS device?
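Not a real fix, but a possible workaround sketch: PyTorch can route operators that MPS does not implement back to the CPU if the PYTORCH_ENABLE_MPS_FALLBACK environment variable is set before torch is imported. The unsupported 3D padding would then run on the CPU while the rest stays on MPS (still slower than CUDA, and I have not verified this on GradTTS specifically).

import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")   # must be set before importing torch

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# model = GradTTS(...).to(device)   # GradTTS built as in the repo's inference script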

Diffusion loss not decreasing

Hi,
I have trained the GradTTS model on the Indian accent English dataset, and the results are pretty awesome.

Looking at the logs, I was startled to see that, unlike the other losses, the diffusion loss did not decrease throughout training and also fluctuated a lot. Can anyone explain why this is the case, and, if the diffusion loss fluctuates so much, why it is used in the total loss calculation?

I have attached my tensorboard outputs.

  • Training Diffusion Loss
  • Training Prior Loss
  • Training Duration Loss
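One likely (unconfirmed) explanation: the diffusion loss is a denoising/score-matching objective evaluated at a diffusion time t drawn at random for every batch, so its magnitude depends heavily on which t happens to be sampled, and the logged curve stays noisy even while the model improves. The toy sketch below (a generic noise-prediction loss, not the repo's exact objective) shows that structure.

import torch

def diffusion_loss_step(noise_model, x0):
    # a fresh random diffusion time for every batch: the loss magnitude depends
    # strongly on the sampled t, so the logged value fluctuates by design
    t = torch.rand(x0.shape[0], device=x0.device)
    sigma = t.view(-1, *([1] * (x0.dim() - 1)))   # toy noise scale, not the real schedule
    noise = torch.randn_like(x0)
    xt = x0 + sigma * noise
    pred_noise = noise_model(xt, t)               # network tries to recover the injected noise
    return torch.mean((pred_noise - noise) ** 2)

# loss = diffusion_loss_step(lambda x, t: torch.zeros_like(x), torch.randn(4, 80, 100))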

About end2end implementation

Hi, thank you for sharing your excellent work.
I want to ask about your end-to-end TTS model. In the paper, you state that only the decoder is changed so that it can generate a waveform (using the WaveGrad architecture), so the $\mu$ vector is no longer the mean statistics of the mel spectrogram, just hidden features. I wonder what likelihood you feed into monotonic alignment search if the model cannot access mel features (mu_x and y in your code). Did you still keep the $\mathcal{L}_{enc}$ loss that constrains the similarity between mu_y and the mel spectrogram? Furthermore, since we do not know the mean statistics of the waveform, does the SDE have to change as well? I listened to some samples from the e2e model on your website and noticed that although the audio had noise, the alignments were quite decent. Did you try to improve the audio quality with an adversarial loss like HiFi-GAN's?
Thank you.

A bug in model/tts.py

Formally, the shape of the variable y_cut_mask (created here) might not match the shape of the variable y_cut in the last dimension (which is out_size for y_cut).
To see why, look at the function sequence_mask, which is invoked to create y_cut_mask. Since the parameter max_length is not provided, the length dimension will be of size max(length) (see here). Thus, if all sequences in a batch passed to GradTTS.forward(...) are shorter than out_size, the last dimension of y_cut_mask will not match the last dimension of y_cut.
An easy experiment exposes the issue: start training GradTTS with batch_size == 1. In that case, if any sequence is shorter than out_size, training fails with a shape mismatch.
The fix I suggest is elementary: pass the parameter max_length=out_size when calling sequence_mask here.
Moreover, we had better skip cropping the mel entirely when all sequences in a batch passed to GradTTS.forward(...) are shorter than out_size. Concretely, I suggest adding the condition y_max_length > out_size here.
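If it helps, the two suggested changes would look roughly like this inside GradTTS.forward (variable names follow model/tts.py; the surrounding lines are paraphrased, not copied):

# crop random out_size-frame segments only when cropping can actually shorten something
if out_size is not None and y_max_length > out_size:
    ...
    # pass max_length so the mask is always exactly out_size frames wide,
    # even if every sequence in the batch is shorter than out_size
    y_cut_mask = sequence_mask(y_cut_lengths, max_length=out_size).unsqueeze(1).to(y_mask)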

about diffVC on Mandarin datasets

Hello, I adapted the DiffVC code to Mandarin datasets. However, the converted audio has problems with tones (tone sandhi). I want to ask whether this performance is normal.

Model training question

Hi, thanks for sharing the code.
I have a folder with wav files from different speakers. I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly should I fill them?
Is there a more detailed set of instructions somewhere?
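I cannot speak for the exact format this repo expects, but as a generic illustration of what "mels" and "embeds" folders typically contain, the sketch below computes a log-mel spectrogram with librosa and a speaker embedding with resemblyzer and stores both as .npy files. Every parameter here (22050 Hz, n_fft, hop_length, n_mels, the log offset) and the file layout are assumptions and must be replaced by whatever the repo's config actually uses.

import numpy as np
import librosa
from resemblyzer import VoiceEncoder, preprocess_wav

wav_path = "wavs/speaker1/utt1.wav"          # hypothetical input file
wav, sr = librosa.load(wav_path, sr=22050)   # 22050 is an assumption

# log-mel spectrogram (all parameters are placeholders, match them to the repo's config)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
np.save("mels/utt1_mel.npy", np.log(mel + 1e-5))

# utterance-level speaker embedding from a generic pretrained speaker encoder
encoder = VoiceEncoder()
embed = encoder.embed_utterance(preprocess_wav(wav_path))
np.save("embeds/utt1_embed.npy", embed)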

How is `out_size` in `params` determined

Hi, I am modifying the code for my own purposes. I notice here:

out_size = fix_len_compatibility(2*22050//256)

the argument is hard-coded, and I guess 22050 and 256 are the sampling rate and frame shift used for LJSpeech, right? If this is true, should I change them to other values when dealing with a different dataset?
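For what it's worth, the expression appears to mean "about two seconds of audio, measured in mel frames, rounded to a length the U-Net can process", so a dataset-agnostic version could look like this (sample_rate and hop_length are assumptions that must match your own preprocessing; fix_len_compatibility is the helper already used in the repo):

sample_rate = 22050   # replace with your dataset's sampling rate
hop_length = 256      # replace with your dataset's frame shift
# roughly two seconds of audio expressed as a number of mel frames
out_size = fix_len_compatibility(2 * sample_rate // hop_length)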

Grad-TTS in multispeaker setting

Thank you for releasing the original implementation of Grad-TTS. I would like to know whether a multi-speaker setting is available or planned for release.

I am implementing a multi-speaker setting using this repo. Would the maintainers of this repo be interested in discussing or providing feedback on a multi-speaker Grad-TTS implementation?

Regards
Ajinkya

[Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'

[Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1208, in cythonize_one
result = compile_single(pyx_file, options, full_module_name=full_module_name)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 727, in compile_single
return run_pipeline(source, options, full_module_name)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 515, in run_pipeline
err, enddata = Pipeline.run_pipeline(pipeline, source)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 355, in run_pipeline
data = run(phase, data)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 335, in run
return phase(data)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 52, in generate_pyx_code_stage
module_node.process_implementation(options, result)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 143, in process_implementation
self.generate_c_code(env, options, result)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 411, in generate_c_code
f = open_new_file(result.c_file)
File "/home/user/.local/lib/python3.8/site-packages/Cython/Utils.py", line 76, in open_new_file
return codecs.open(path, "w", encoding="ISO-8859-1")
File "/usr/local/lib/python3.8/codecs.py", line 905, in open
file = builtins.open(filename, mode, buffering)

Not possible to build

This can't be built at all. Packages like torchaudio==0.5.1 cannot be found anymore, and there are Cython errors. Is there any chance this will be corrected? I can't get DiffVC to run because of this.

Typo in some equations in GradTTS paper

Thanks for your great work on GradTTS!
However, I recently found a tiny error in the arXiv version 2 of the Grad-TTS paper (https://arxiv.org/pdf/2105.06337.pdf). In Eq. (31) and Eq. (32) in the appendix, $X_t$ and $\mu$ are put in the wrong order, i.e. it should probably be $\mu - X_t$ rather than $X_t - \mu$. This typo likely originates from the line above Eq. (31), "In our case $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(X_t - \mu)\beta$ and ...", where it should be $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(\mu - X_t)\beta$ instead. The other parts of the paper do not seem to be affected by this, and the derivations are solid and fluent.
Again, great thanks for the work!
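Spelling the correction out as a display equation (this simply restates the claim above, in the paper's notation for $\beta_t$, $\Sigma$ and $\mu$), the forward SDE and its drift should read:

$$dX_t = \tfrac{1}{2}\,\Sigma^{-1}(\mu - X_t)\,\beta_t\,dt + \sqrt{\beta_t}\,dW_t, \qquad f(X_t, t) = \tfrac{1}{2}\,\Sigma^{-1}(\mu - X_t)\,\beta_t .$$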

Generated outputs sound robotic in some cases!

Hello!

I have trained GradTTS on quite a few datasets and observed that it produces excellent results for most voices, but the generated results for heavy voices sound very robotic! Can anyone suggest possible reasons for this?

I have attached the samples of original and generated voices.

Thank you!

  • Original male voice: org_sagar.mov
  • Generated male voice: gen_sagar.mov
  • Original female voice: org_anupama.mov
  • Generated female voice: gen_anupama.mov

Multi-GPU training and expected epochs

Hi,

First of all, thanks for the nice paper and release code. I am testing your model for a different dataset and two questions come up:

  1. What is the estimated number of epochs needed to train the model? We have experienced some degradation when the model is overtrained on (overfits?) the data.
  2. Is there a way to train the model in a multi-GPU setup? We have more GPUs available; however, the code seems to run only on the first available GPU given by the CUDA_VISIBLE_DEVICES variable. (A possible generic workaround is sketched below.)

Thanks!
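On question 2, purely as a hedged generic sketch (I have not checked that the repo's training loop works with it unchanged): wrapping the model in torch.nn.DataParallel spreads each batch over the GPUs visible through CUDA_VISIBLE_DEVICES.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))   # stand-in for GradTTS
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the module on every visible GPU; forward() splits the batch along dim 0
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()
# note: custom methods (e.g. a compute_loss helper, if the repo uses one) must then be
# reached via model.module.<method>, which may require small edits to the training script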

Attention layer in GradTTS

Hey @ivanvovk et al.

Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here:

k = k.softmax(dim=-1)

computes the softmax on the projected key values instead of computing it on the product of query and key.

Usually, I know self-attention as:

Value x Softmax(Query x Key^T / d_k)

but it seems like here it is

(Value x Softmax(Key)) x Query

=> Is it similar to self-attention? Where does it come from?

Best,
Patrick
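For anyone else puzzled by this: the layer matches the "linear attention" variant that is common in diffusion U-Nets (it appears, for example, in lucidrains' denoising-diffusion-pytorch, which this U-Net seems to follow). The softmax over key positions builds a dim-by-dim global context first and then hands it to the queries, giving roughly O(n·d^2) instead of O(n^2·d) cost. A self-contained sketch of that computation (my reading of the code, not a verified copy):

import torch

def linear_attention(q, k, v):
    # q, k, v: (batch, heads, dim, length), as in the repo's LinearAttention module
    k = k.softmax(dim=-1)                                # softmax over key positions, not over q·k^T
    context = torch.einsum('bhdn,bhen->bhde', k, v)      # (batch, heads, dim, dim) global summary
    return torch.einsum('bhde,bhdn->bhen', context, q)   # distribute the summary to every query position

q = k = v = torch.randn(2, 4, 16, 100)
print(linear_attention(q, k, v).shape)                   # torch.Size([2, 4, 16, 100])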

Different Implementation of Diffusion Model

I'm a researcher working on building a TTS model using diffusion. While looking for the implementation of this, I found this repo.

According to my understanding of the paper, both processes in the decoder's diffusion model, forward and reverse diffusion, are supposed to take place on the latent-space vector z (which is provided by the U-Net encoder part). However, the repo's implementation seems to differ from this understanding.
Could you give the reasoning behind this?

Generated Samples are noisy

I have used the pretrained model provided on the Google Drive of the official repo. When I ran inference.py from that checkpoint, the generated samples I observed were very noisy for different numbers of reverse diffusion steps (10, 20, 30, 40, 50, 70). I would appreciate any suggestions regarding this. I used the checkpoint of the LibriTTS model (not the LJSpeech one). However, on the demo page the quality of the samples is sufficiently good.
