
fastspeech's Introduction

FastSpeech-Pytorch

An implementation of FastSpeech based on PyTorch.

Update (2020/07/20)

  1. Optimize the training process.
  2. Optimize the implementation of the length regulator.
  3. Use the same hyperparameters as FastSpeech2.
  4. Changes 1, 2, and 3 make the training process three times faster than before.
  5. Better speech quality.

Model

My Blog

Prepare Dataset

  1. Download and extract the LJSpeech dataset.
  2. Put the LJSpeech dataset in data.
  3. Unzip alignments.zip.
  4. Put the NVIDIA pretrained WaveGlow model in waveglow/pretrained_model and rename it waveglow_256channels.pt.
  5. Run python3 preprocess.py.
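
Assuming the steps above, the working directory should look roughly like this (the layout is inferred from the instructions and has not been verified against the repository):

    FastSpeech/
    ├── data/
    │   └── LJSpeech-1.1/
    ├── alignments/                # extracted from alignments.zip
    └── waveglow/
        └── pretrained_model/
            └── waveglow_256channels.pt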

Training

Run python3 train.py.

Evaluation

Run python3 eval.py.

Notes

  • In the FastSpeech paper, the authors use a pre-trained Transformer-TTS model to provide the alignment targets. I didn't have a well-trained Transformer-TTS model, so I use Tacotron2 instead.
  • I use the same hyperparameters as FastSpeech2.
  • Audio examples are in sample.
  • pretrained model.

Reference

Repository

Paper

fastspeech's People

Contributors

dependabot[bot], ppfliu, xcmyz


fastspeech's Issues

Implementing the length regulator with TensorFlow (discussion)

Hi, I am implementing FastSpeech with TensorFlow. While rewriting the length regulator module I ran into a problem: the TF graph is static, so I cannot get the real tensor shapes before feeding data, because I define the inputs with placeholders of shape=[None, None, None].

My code is:

    def len_regulator(self, phoneme_seqs, duration_seqs, alpha=1.0, max_mel_length=None):
        # scale the predicted durations by alpha and round to whole frames
        D = tf.keras.backend.round(tf.scalar_mul(alpha, duration_seqs))

        # grouping: num_or_size_splits needs a static shape, which is None for placeholders
        pho_splits = tf.split(phoneme_seqs, num_or_size_splits=phoneme_seqs.shape.as_list()[-1], axis=0)
        dur_splits = tf.split(D, num_or_size_splits=D.shape.as_list()[-1], axis=0)

        repeats = [tf.ones(tf.cast(r, tf.int32), dtype=tf.float32) for r in dur_splits]

        # expand each phoneme encoding according to its duration
        expanded = list()
        for i, j in zip(pho_splits, repeats):
            expanded.append(tf.multiply(i, j))
        expanded = tf.concat(expanded, axis=0)
        return expanded

Can anybody help? Thanks in advance!
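
For comparison, here is a minimal PyTorch sketch of the same expansion step, since torch.repeat_interleave sidesteps the static-shape problem entirely; the function and variable names are illustrative and not taken from this repository:

    import torch

    def length_regulate(hidden, durations, alpha=1.0):
        # hidden:    (num_phonemes, hidden_dim) encoder outputs for one utterance
        # durations: (num_phonemes,) predicted frames per phoneme
        repeats = torch.round(durations.float() * alpha).long()
        # repeat row i of `hidden` repeats[i] times along the time axis
        return torch.repeat_interleave(hidden, repeats, dim=0)

    hidden = torch.randn(5, 256)               # 5 phonemes, 256-dim encodings
    durations = torch.tensor([3, 1, 4, 2, 5])  # frames per phoneme
    print(length_regulate(hidden, durations).shape)  # torch.Size([15, 256])

Recent TensorFlow versions expose an analogous tf.repeat, which I believe avoids the need for static shapes as well.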

How to get alignment?

alignment.py returns the attention weights as a float matrix; should I convert them to int, like alignment_targets/0.npy?

Pretrained model

Hello Zhengxi,

Will you provide the pre-trained model?

  • Angelo

about alignment.py

Hi, thank you for the excellent source!

In README.md you mention "Run alignment.py, it will spend 7 hours training on NVIDIA RTX2080ti."
But this command yields only one alignment, from the text "I want to go to CMU to do research on deep learning." How do I fix alignment.py?

Why do we need a constant lr?

in train.py:

                if args.frozen_learning_rate:
                    scheduled_optim.step_and_update_lr_frozen(
                        args.learning_rate_frozen)
                else:
                    scheduled_optim.step_and_update_lr()

I am now trying to improve the quality of the generated waves by training for more steps, but:

  1. Why always use learning_rate = 1e-3? (A schedule sketch follows below.)
  2. Why is the batch_size so small? Should I increase it up to my GPU's memory limit?

Thank you~
Here is my walkthrough of your code's training process; it may be of a little help to us:
https://blog.csdn.net/u013625492/article/details/103076158
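
For context, here is a minimal sketch of the standard Transformer warmup schedule, assuming the repository's ScheduledOptim follows the usual formula when frozen_learning_rate is disabled (the parameter values are illustrative):

    def noam_lr(step, d_model=256, warmup_steps=4000):
        # Transformer schedule: linear warmup, then decay proportional to step**-0.5
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    # with d_model=256 the peak value is about 1e-3 near step 4000,
    # which may explain 1e-3 as the frozen (constant) learning rate
    print(noam_lr(4000))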

pre-trained checkpoint_12000xxx.tar cannot be decompressed?

Hi, I downloaded the pre-trained FastSpeech model, checkpoint_112000.pth.tar.

When I use the tar -xf checkpoint_112000.pth.tar command on Linux, I get this error message:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

So, how do I decompress it?
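
For what it's worth, .pth.tar is just a naming convention: checkpoints written with torch.save are not tar archives and are meant to be loaded with torch.load rather than extracted. A minimal sketch (the dictionary keys are an assumption, not confirmed against this repository):

    import torch

    # no extraction needed; load the checkpoint directly
    checkpoint = torch.load("checkpoint_112000.pth.tar", map_location="cpu")

    # inspect the keys before assuming their names
    print(checkpoint.keys())

    # typically there is a state dict under a key such as 'model' (assumption):
    # model.load_state_dict(checkpoint['model'])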

Result voice is bad compared to Tacotron2

Why is the WAV result in the results folder so bad? Is it because FastSpeech just can't produce results as good as Tacotron2's? Would the results from the paper's authors be better than yours?

Which License?

I noticed that this repository removed license.txt (MIT).
Do you plan to change the license?

Why the mask of the decoder?

class Decoder(nn.Module):
    """ Decoder """

    def forward(self, enc_seq, enc_pos, return_attns=False):
        # ....
        # -- Prepare masks
        slf_attn_mask = get_attn_key_pad_mask(seq_k=enc_pos, seq_q=enc_pos)
        non_pad_mask = get_non_pad_mask(enc_pos)

Why is enc_pos used to compute the mask? Shouldn't it be enc_seq instead?

Model does not converge

I used the LJSpeech data and alignments.zip, and set batch_size = 16 to train the model,
but it does not converge after 200k steps.
Has anyone encountered this problem?

(loss curve screenshot)

CUDA out of memory

Hello,
I tried to train FastSpeech with alignment_target and the dataset on a Tesla V100,
but I got an error like this.
Traceback (most recent call last):
File "train_accelerated.py", line 191, in
main(args)
File "train_accelerated.py", line 109, in main
length_target=alignment_target)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/FastSpeech.py", line 33, in forward
decoder_output = self.decoder(length_regulator_output, decoder_pos)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Models.py", line 141, in forward
slf_attn_mask=slf_attn_mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Layers.py", line 125, in forward
enc_input, enc_input, enc_input, mask=slf_attn_mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/SubLayers.py", line 60, in forward
output, attn = self.attention(q, k, v, mask=mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Modules.py", line 21, in forward
attn = attn.masked_fill(mask, -np.inf)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 337, in masked_fill
return self.clone().masked_fill_(mask, value)
RuntimeError: CUDA out of memory. Tried to allocate 316.50 MiB (GPU 0; 15.75 GiB total capacity; 14.15 GiB already allocated; 146.88 MiB free; 317.24 MiB cached)
Could you explain the reason to me?

about alignment.py

Hello, could you explain what the role of alignment.py is, and how to use it?

Running preprocess.py produces NaN

print(mel_outputs)
None
tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0')

Output of synthesis_waveglow()

In the results folder, there are no WaveGlow output samples. I saw there is a function synthesis_waveglow. Could you please also output the WaveGlow-synthesized result?

Error in alignment.py

@xcmyz Hello, when I run alignment.py I hit this error:

sequence size (1, 52)
alignment size (276, 52)
[[8.4814453e-01 4.4107437e-04 2.3097992e-03 ... 6.3121319e-05
6.3180923e-04 6.0485840e-02]
[5.4394531e-01 3.4637451e-02 4.3609619e-02 ... 8.9359283e-04
1.2283325e-03 7.8308105e-02]
[7.6904297e-01 5.2764893e-02 1.9561768e-02 ... 9.4366074e-04
2.1877289e-03 2.1026611e-02]
...
[9.5081329e-04 7.7486038e-07 4.0590763e-05 ... 2.9706955e-04
1.7822266e-02 8.7695312e-01]
[8.5830688e-04 5.9604645e-07 3.7193298e-05 ... 1.8274784e-04
8.3541870e-03 8.7597656e-01]
[7.3289871e-04 6.5565109e-07 3.7550926e-05 ... 1.7142296e-04
5.7220459e-03 8.7060547e-01]]
How can I solve this? Thank you.

Training slower with multiple GPUs

Hi, the training time for each step was slower when using multiple GPUs.

I'm running:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train_accelerated.py

And for single GPU:
CUDA_VISIBLE_DEVICES=0 python3 train_accelerated.py

I can fit batch_size=16 on a single GPU, so I tried batch_size=16 for both the single-GPU and multi-GPU cases.

I also tried batch_size=64 for the multi-gpu case.

default learning_rate_frozen is too big

Thanks for the author's work.
But I find the default learning_rate_frozen (1e-3) is too big, because the loss can't converge.
What is the learning_rate when you train the model?

The Mel

@xcmyz
There has been a great improvement since I tried this branch in June! Great work.

There is one thing I am completely confused about. You use the log of the mel as the feature, while Tacotron2 uses the mel in dB normalized to [-4, 4]. What are the differences? Does a Tacotron-like feature work for FastSpeech? Is the log of the mel better for FastSpeech?
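
For reference, a rough sketch of the two feature conventions being compared; the constants (min level, the [-4, 4] range, the omitted reference-level offset) follow common Tacotron-2 implementations and are assumptions rather than values read from this repository:

    import numpy as np

    def log_mel(mel, eps=1e-5):
        # "log of mel": simple dynamic-range compression, as in NVIDIA's Tacotron2
        return np.log(np.clip(mel, eps, None))

    def db_normalized_mel(mel, min_level_db=-100.0, max_abs_value=4.0, eps=1e-5):
        # "dB of mel normalized to [-4, 4]", as in several other Tacotron-2 repos
        # (a reference-level offset is often subtracted as well)
        mel_db = 20.0 * np.log10(np.clip(mel, eps, None))
        scaled = (mel_db - min_level_db) / -min_level_db        # roughly [0, 1]
        return np.clip(2.0 * max_abs_value * scaled - max_abs_value,
                       -max_abs_value, max_abs_value)

Both are monotonic compressions of the same mel magnitudes; what matters is that training and vocoding use the same convention.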

How long will the training process take?

I ran train_accelerated.py and started training.
However, the estimated time is more than 9,400,000 s (about 100 days).

Could you describe the training process and the expected training time?
How many steps do we need to get a good result?

I run the total training on 2 GPUs (Tesla V100).

UPDATE!

  1. Fix bugs in alignment;
  2. Fix bugs in transformer;
  3. Fix bugs in LengthRegulator;
  4. Change the way to process audio;
  5. Use waveglow to synthesize.

real time on CPU

Can FastSpeech synthesize speech in real time on a CPU?

Solved the slow training problem

After analyzing, I found the reason for the slow training speed: the code is not optimized for data-parallel training. I have optimized the distributed data-parallel loader and will create a pull request later.

BTW, the authors of the FastSpeech paper should publish their code soon; they say they will publish it once the paper is accepted. This repo is a very good resource, but it needs more maintenance and optimization.

I hope the author can take part more in the discussions on GitHub and Zhihu. Many thanks.

Breaks between words

In the original paper, the authors propose adding breaks between words and verify that FastSpeech can add breaks between adjacent words by lengthening the duration of the space characters in the sentence, which improves the prosody of the voice.

So, how do you add breaks between words in your code?
I'd appreciate it if you could help me.
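
One common way to do this (an illustrative sketch, not code from this repository) is to scale up the predicted durations at the positions of the space/pause symbol before running the length regulator:

    import torch

    def lengthen_pauses(durations, phoneme_ids, space_id, pause_scale=2.0):
        # durations:   (num_phonemes,) durations from the duration predictor
        # phoneme_ids: (num_phonemes,) input symbol ids
        # space_id:    id of the space symbol in the symbol table (assumed known)
        durations = durations.clone().float()
        is_space = phoneme_ids == space_id
        durations[is_space] = durations[is_space] * pause_scale
        return torch.round(durations).long()

    # the lengthened durations are then passed to the length regulator as usual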

test.wav is inaudible?

I ran alignment.py, and the test.wav returned by the get_tacotron2_alignment_test() function is inaudible.

About alignments

Hi, thanks for the good implementation, such nice work!
I have some questions about the alignments (getting the durations from the pretrained model).

I understood that the proposed d_i (duration) can be applied only when the attention index increases monotonically. Since the alignment is not a perfect diagonal, the index value does not increase monotonically when the argmax is taken. How did you solve this case?

Thank you.
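
For context, the FastSpeech paper extracts d_i by counting, for each mel frame, the phoneme the teacher attention attends to most. A minimal sketch of that counting step is below; simple argmax counting tolerates a non-monotonic alignment by construction, though whether this repository does anything more sophisticated is not confirmed here:

    import numpy as np

    def durations_from_attention(attn):
        # attn: (num_mel_frames, num_phonemes) teacher attention weights
        num_frames, num_phonemes = attn.shape
        best_phoneme = attn.argmax(axis=1)                    # phoneme index per frame
        # d_i = number of mel frames whose argmax lands on phoneme i
        durations = np.bincount(best_phoneme, minlength=num_phonemes)
        assert durations.sum() == num_frames
        return durations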

Still takes a long time to synthesize new speech

I've trained a new model and wrote a script to synthesize new speech using

mel_output, mel_output_postnet = model(src_seq, src_pos)

But it takes about 4-6 seconds; is this an expected result? Thanks!

learning rate

Hello, as far as I can tell from the lr update algorithm, the learning rate keeps getting larger. What is the reason for this design? Usually the learning rate is designed to decrease as training progresses.

Should we change sample_rate=20000 to 22050 in FastSpeech/hparams.py? Other related params may also need to change correspondingly, thank you.

First of all, thanks for your quick and great implementation.
The default sample rate is 22050 for the LJSpeech-1.1 wavs, for Tacotron2/hparams.py, and presumably for the pre-trained Tacotron2 model published by NVIDIA.
So, should we change sample_rate=20000 to 22050 in FastSpeech/hparams.py? Other related params may also need to change correspondingly.
Please check and help with this issue, thank you.
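
As a sanity check when changing the sample rate, STFT parameters are usually kept consistent in milliseconds rather than in samples. A generic illustration follows; the actual hparams names in this repository may differ:

    def frame_params(sample_rate, frame_shift_ms=12.5, frame_length_ms=50.0):
        # convert frame shift/length from milliseconds to samples
        hop_length = int(sample_rate * frame_shift_ms / 1000)
        win_length = int(sample_rate * frame_length_ms / 1000)
        return hop_length, win_length

    print(frame_params(22050))  # (275, 1102)
    print(frame_params(20000))  # (250, 1000)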

WaveGlow synthesis result

I tried TTS with WaveGlow as follows, but I got a noisy result.
Could you explain the reason to me?
def synthesis_waveglow(text_seq, model, waveglow, alpha=1.0, mode=""):
    denoiser = Denoiser(waveglow)
    text = text_to_sequence(text_seq, hp.text_cleaners)
    text = text + [0]
    text = np.stack([np.array(text)])
    text = torch.from_numpy(text).long().to(device)

    pos = torch.stack([torch.Tensor([i + 1 for i in range(text.size(1))])])
    pos = pos.long().to(device)

    model.eval()
    with torch.no_grad():
        _, mel_postnet = model(text, pos, alpha=alpha)
    with torch.no_grad():
        # wav = waveglow.infer(mel_postnet, sigma=0.666)
        wav = waveglow.infer(torch.transpose(mel_postnet, 1, 2).type(torch.cuda.HalfTensor), sigma=0.666)
    print("Wav Have Been Synthesized.")

    if not os.path.exists("results"):
        os.mkdir("results")

    wav_denoised = denoiser(wav, strength=0.01)[:, 0]
    # audio.save_wav(wav[0].data.cpu().numpy(), os.path.join(
    #     "results", text_seq + mode + ".wav"))
    audio.save_wav(wav_denoised[0].cpu().numpy(), os.path.join(
        "results", text_seq + mode + ".wav"))

Thank you

TypeError when running synthesis.py

Hi xcmyz,

I have trained a model to step 172000 using train.py, and I want to use this model for synthesis,
but when I run synthesis.py I get a TypeError during synthesis:
"TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'"
Did I miss something during training or synthesis?

thanks

Here is the full error log:

(Tacotron) [wann31828@glogin1 FastSpeech]$ python synthesis.py
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Model Have Been Loaded.
Traceback (most recent call last):
File "synthesis.py", line 91, in
synthesis_griffin_lim(words, model, alpha=1.0, mode="normal")
File "synthesis.py", line 45, in synthesis_griffin_lim
mel, mel_postnet = model(text, pos, alpha=alpha)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'

minor issue that might prevent some trouble later on

Hello,
Thanks a lot for publishing this amazing repo.
I just have a little concern that I would like to let you know about.
I see that you have uploaded some papers under the 'paper' directory.
I don't think that is a safe thing to do with regard to copyright.
The copyright terms for each paper might differ, but just to be safe, how about replacing the papers with links to them, or just their titles?
