
fastspeech's Introduction

FastSpeech-Pytorch

An implementation of FastSpeech based on PyTorch.

Update (2020/07/20)

  1. Optimize the training process.
  2. Optimize the implementation of the length regulator.
  3. Use the same hyperparameters as FastSpeech2.
  4. Changes 1, 2, and 3 make the training process three times faster than before.
  5. Better speech quality.

Model

My Blog

Prepare Dataset

  1. Download and extract the LJSpeech dataset.
  2. Put the LJSpeech dataset in data.
  3. Unzip alignments.zip.
  4. Put the NVIDIA pretrained WaveGlow model in waveglow/pretrained_model and rename it waveglow_256channels.pt.
  5. Run python3 preprocess.py.
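
Assuming the steps above, the working directory should look roughly like this (the layout is inferred from the instructions and has not been verified against the repository):

    FastSpeech/
    ├── data/
    │   └── LJSpeech-1.1/
    ├── alignments/                # extracted from alignments.zip
    └── waveglow/
        └── pretrained_model/
            └── waveglow_256channels.pt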

Training

Run python3 train.py.

Evaluation

Run python3 eval.py.

Notes

  • In the FastSpeech paper, the authors use a pre-trained Transformer-TTS model to provide the alignment targets. I didn't have a well-trained Transformer-TTS model, so I use Tacotron2 instead.
  • I use the same hyperparameters as FastSpeech2.
  • Audio examples are in sample.
  • pretrained model.

Reference

Repository

Paper

fastspeech's People

Contributors

dependabot[bot], ppfliu, xcmyz


fastspeech's Issues

Implementing the length regulator with TensorFlow (discussion)

Hi, I am implementing FastSpeech with TensorFlow. While rewriting the length regulator module I ran into a problem: the TF graph is static, so I cannot get the real tensor shapes before feeding data, because I define the inputs with placeholders of shape=[None, None, None].

My code is:

    def len_regulator(self, phoneme_seqs, duration_seqs, alpha=1.0, max_mel_length=None):
        # scale the predicted durations by alpha and round to whole frames
        D = tf.keras.backend.round(tf.scalar_mul(alpha, duration_seqs))

        # grouping: num_or_size_splits needs a static shape, which is None for placeholders
        pho_splits = tf.split(phoneme_seqs, num_or_size_splits=phoneme_seqs.shape.as_list()[-1], axis=0)
        dur_splits = tf.split(D, num_or_size_splits=D.shape.as_list()[-1], axis=0)

        repeats = [tf.ones(tf.cast(r, tf.int32), dtype=tf.float32) for r in dur_splits]

        # expand each phoneme encoding according to its duration
        expanded = list()
        for i, j in zip(pho_splits, repeats):
            expanded.append(tf.multiply(i, j))
        expanded = tf.concat(expanded, axis=0)
        return expanded

Can anybody help? Thanks in advance!
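
For comparison, here is a minimal PyTorch sketch of the same expansion step, since torch.repeat_interleave sidesteps the static-shape problem entirely; the function and variable names are illustrative and not taken from this repository:

    import torch

    def length_regulate(hidden, durations, alpha=1.0):
        # hidden:    (num_phonemes, hidden_dim) encoder outputs for one utterance
        # durations: (num_phonemes,) predicted frames per phoneme
        repeats = torch.round(durations.float() * alpha).long()
        # repeat row i of `hidden` repeats[i] times along the time axis
        return torch.repeat_interleave(hidden, repeats, dim=0)

    hidden = torch.randn(5, 256)               # 5 phonemes, 256-dim encodings
    durations = torch.tensor([3, 1, 4, 2, 5])  # frames per phoneme
    print(length_regulate(hidden, durations).shape)  # torch.Size([15, 256])

Recent TensorFlow versions expose an analogous tf.repeat, which I believe avoids the need for static shapes as well.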

How to get alignment?

alignment.py returns the attention weights as a float matrix; should I convert them to int, like alignment_targets/0.npy?

Pretrained model

Hello Zhengxi,

Will you provide the pre-trained model?

  • Angelo

about alignment.py

Hi, thank you for the excellent source!

In README.md you mention "Run alignment.py, it will spend 7 hours training on NVIDIA RTX2080ti."
But this command yields only one alignment, from the text "I want to go to CMU to do research on deep learning." How do I fix alignment.py?

Why do we need a constant lr?

in train.py:

                if args.frozen_learning_rate:
                    scheduled_optim.step_and_update_lr_frozen(
                        args.learning_rate_frozen)
                else:
                    scheduled_optim.step_and_update_lr()

I am now trying to improve the quality of the generated waves by training for more steps, but:

  1. Why always use learning_rate = 1e-3? (A schedule sketch follows below.)
  2. Why is the batch_size so small? Should I increase it up to my GPU's memory limit?

Thank you~
Here is my walkthrough of your code's training process; it may be of a little help to us:
https://blog.csdn.net/u013625492/article/details/103076158
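
For context, here is a minimal sketch of the standard Transformer warmup schedule, assuming the repository's ScheduledOptim follows the usual formula when frozen_learning_rate is disabled (the parameter values are illustrative):

    def noam_lr(step, d_model=256, warmup_steps=4000):
        # Transformer schedule: linear warmup, then decay proportional to step**-0.5
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    # with d_model=256 the peak value is about 1e-3 near step 4000,
    # which may explain 1e-3 as the frozen (constant) learning rate
    print(noam_lr(4000))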

pre-trained checkpoint_12000xxx.tar cannot be decompressed?

Hi, I downloaded the pre-trained FastSpeech model, checkpoint_112000.pth.tar.

When I use the tar -xf checkpoint_112000.pth.tar command on Linux, I get this error message:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

So, how do I decompress it?
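
For what it's worth, .pth.tar is just a naming convention: checkpoints written with torch.save are not tar archives and are meant to be loaded with torch.load rather than extracted. A minimal sketch (the dictionary keys are an assumption, not confirmed against this repository):

    import torch

    # no extraction needed; load the checkpoint directly
    checkpoint = torch.load("checkpoint_112000.pth.tar", map_location="cpu")

    # inspect the keys before assuming their names
    print(checkpoint.keys())

    # typically there is a state dict under a key such as 'model' (assumption):
    # model.load_state_dict(checkpoint['model'])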

Result voice is bad compared to Tacotron2

Why is the WAV result in the results folder so bad? Is it because FastSpeech just can't produce results as good as Tacotron2's? Would the results from the paper's authors be better than yours?

Which License?

I noticed that this repository removed license.txt (MIT).
Do you plan to change the license?

Why the mask of the decoder?

class Decoder(nn.Module):
    """ Decoder """

    def forward(self, enc_seq, enc_pos, return_attns=False):
        # ....
        # -- Prepare masks
        slf_attn_mask = get_attn_key_pad_mask(seq_k=enc_pos, seq_q=enc_pos)
        non_pad_mask = get_non_pad_mask(enc_pos)

Why is enc_pos used to compute the mask? Shouldn't it be enc_seq instead?

Model does not converge

I used the LJSpeech data and alignments.zip, and set batch_size = 16 to train the model,
but it does not converge after 200k steps.
Has anyone encountered this problem?

(loss curve screenshot)

CUDA out of memory

Hello,
I tried to train FastSpeech with alignment_target and the dataset on a Tesla V100,
but I got an error like this.
Traceback (most recent call last):
File "train_accelerated.py", line 191, in
main(args)
File "train_accelerated.py", line 109, in main
length_target=alignment_target)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/FastSpeech.py", line 33, in forward
decoder_output = self.decoder(length_regulator_output, decoder_pos)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Models.py", line 141, in forward
slf_attn_mask=slf_attn_mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Layers.py", line 125, in forward
enc_input, enc_input, enc_input, mask=slf_attn_mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/SubLayers.py", line 60, in forward
output, attn = self.attention(q, k, v, mask=mask)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/fastspeech/FastSpeech/transformer/Modules.py", line 21, in forward
attn = attn.masked_fill(mask, -np.inf)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 337, in masked_fill
return self.clone().masked_fill_(mask, value)
RuntimeError: CUDA out of memory. Tried to allocate 316.50 MiB (GPU 0; 15.75 GiB total capacity; 14.15 GiB already allocated; 146.88 MiB free; 317.24 MiB cached)
Could you explain the reason to me?

about alignment.py

Hello, could you explain what the role of alignment.py is, and how to use it?

Running preprocess.py produces NaN

print(mel_outputs)
None
tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0')

Output of synthesis_waveglow()

In the results folder, there are no WaveGlow output samples. I saw there is a function synthesis_waveglow. Could you please also output the WaveGlow-synthesized result?

Error in alignment.py

@xcmyz Hello, when I run alignment.py I hit this error:

sequence size (1, 52)
alignment size (276, 52)
[[8.4814453e-01 4.4107437e-04 2.3097992e-03 ... 6.3121319e-05
6.3180923e-04 6.0485840e-02]
[5.4394531e-01 3.4637451e-02 4.3609619e-02 ... 8.9359283e-04
1.2283325e-03 7.8308105e-02]
[7.6904297e-01 5.2764893e-02 1.9561768e-02 ... 9.4366074e-04
2.1877289e-03 2.1026611e-02]
...
[9.5081329e-04 7.7486038e-07 4.0590763e-05 ... 2.9706955e-04
1.7822266e-02 8.7695312e-01]
[8.5830688e-04 5.9604645e-07 3.7193298e-05 ... 1.8274784e-04
8.3541870e-03 8.7597656e-01]
[7.3289871e-04 6.5565109e-07 3.7550926e-05 ... 1.7142296e-04
5.7220459e-03 8.7060547e-01]]
How can I solve this? Thank you.

Training slower with multiple GPUs

Hi, the training time for each step was slower when using multiple GPUs.

I'm running:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train_accelerated.py

And for single GPU:
CUDA_VISIBLE_DEVICES=0 python3 train_accelerated.py

I can fit batch_size=16 on a single GPU, so I tried batch_size=16 for both the single-GPU and multi-GPU cases.

I also tried batch_size=64 for the multi-gpu case.

default learning_rate_frozen is too big

Thanks for the author's work.
But I find the default learning_rate_frozen (1e-3) is too big, because the loss can't converge.
What is the learning_rate when you train the model?

The Mel

@xcmyz
There has been a great improvement since I tried this branch in June! Great work.

There is one thing I am completely confused about. You use the log of the mel as the feature, while Tacotron2 uses the mel in dB normalized to [-4, 4]. What are the differences? Does a Tacotron-like feature work for FastSpeech? Is the log of the mel better for FastSpeech?
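
For reference, a rough sketch of the two feature conventions being compared; the constants (min level, the [-4, 4] range, the omitted reference-level offset) follow common Tacotron-2 implementations and are assumptions rather than values read from this repository:

    import numpy as np

    def log_mel(mel, eps=1e-5):
        # "log of mel": simple dynamic-range compression, as in NVIDIA's Tacotron2
        return np.log(np.clip(mel, eps, None))

    def db_normalized_mel(mel, min_level_db=-100.0, max_abs_value=4.0, eps=1e-5):
        # "dB of mel normalized to [-4, 4]", as in several other Tacotron-2 repos
        # (a reference-level offset is often subtracted as well)
        mel_db = 20.0 * np.log10(np.clip(mel, eps, None))
        scaled = (mel_db - min_level_db) / -min_level_db        # roughly [0, 1]
        return np.clip(2.0 * max_abs_value * scaled - max_abs_value,
                       -max_abs_value, max_abs_value)

Both are monotonic compressions of the same mel magnitudes; what matters is that training and vocoding use the same convention.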

How long will the training process take?

I ran train_accelerated.py and started training.
However, the estimated time is more than 9,400,000 s (about 100 days).

Could you describe the training process and the expected training time?
How many steps do we need to get a good result?

I run the total training on 2 GPUs (Tesla V100).

UPDATE!

  1. Fix bugs in alignment;
  2. Fix bugs in transformer;
  3. Fix bugs in LengthRegulator;
  4. Change the way to process audio;
  5. Use waveglow to synthesize.

real time on CPU

Can FastSpeech synthesize speech in real time on a CPU?

Solved the slow training problem

After analyzing, I found the reason for the slow training speed: the code is not optimized for data-parallel training. I have optimized the distributed data-parallel loader and will create a pull request later.

BTW, the authors of the FastSpeech paper should publish their code soon; they say they will publish it once the paper is accepted. This repo is a very good resource, but it needs more maintenance and optimization.

I hope the author can take part more in the discussions on GitHub and Zhihu. Many thanks.

Breaks between words

In the original paper, the authors propose adding breaks between words and verify that FastSpeech can add breaks between adjacent words by lengthening the duration of the space characters in the sentence, which improves the prosody of the voice.

So, how do you add breaks between words in your code?
I'd appreciate it if you could help me.
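
One common way to do this (an illustrative sketch, not code from this repository) is to scale up the predicted durations at the positions of the space/pause symbol before running the length regulator:

    import torch

    def lengthen_pauses(durations, phoneme_ids, space_id, pause_scale=2.0):
        # durations:   (num_phonemes,) durations from the duration predictor
        # phoneme_ids: (num_phonemes,) input symbol ids
        # space_id:    id of the space symbol in the symbol table (assumed known)
        durations = durations.clone().float()
        is_space = phoneme_ids == space_id
        durations[is_space] = durations[is_space] * pause_scale
        return torch.round(durations).long()

    # the lengthened durations are then passed to the length regulator as usual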

test.wav is inaudible?

I ran alignment.py, and the test.wav returned by the get_tacotron2_alignment_test() function is inaudible.

About alignments

Hi, thanks for the good implementation, such nice work!
I have some questions about the alignments (getting the durations from the pretrained model).

I understood that the proposed d_i (duration) can be applied only when the attention index increases monotonically. Since the alignment is not a perfect diagonal, the index value does not increase monotonically when the argmax is taken. How did you solve this case?

Thank you.
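
For context, the FastSpeech paper extracts d_i by counting, for each mel frame, the phoneme the teacher attention attends to most. A minimal sketch of that counting step is below; simple argmax counting tolerates a non-monotonic alignment by construction, though whether this repository does anything more sophisticated is not confirmed here:

    import numpy as np

    def durations_from_attention(attn):
        # attn: (num_mel_frames, num_phonemes) teacher attention weights
        num_frames, num_phonemes = attn.shape
        best_phoneme = attn.argmax(axis=1)                    # phoneme index per frame
        # d_i = number of mel frames whose argmax lands on phoneme i
        durations = np.bincount(best_phoneme, minlength=num_phonemes)
        assert durations.sum() == num_frames
        return durations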

Still takes a long time to synthesize new speech

I've trained a new model and wrote a script to synthesize new speech using

mel_output, mel_output_postnet = model(src_seq, src_pos)

But it takes about 4-6 seconds; is this an expected result? Thanks!

learning rate

Hello, as far as I can tell from the lr update algorithm, the learning rate keeps getting larger. What is the reason for this design? Usually the learning rate is designed to decrease as training progresses.

Should we change sample_rate=20000 to 22050 in FastSpeech/hparams.py? Other related params may also need to change correspondingly, thank you.

First of all, thanks for your quick and great implementation.
The default sample rate is 22050 for the LJSpeech-1.1 wavs, for Tacotron2/hparams.py, and presumably for the pre-trained Tacotron2 model published by NVIDIA.
So, should we change sample_rate=20000 to 22050 in FastSpeech/hparams.py? Other related params may also need to change correspondingly.
Please check and help with this issue, thank you.
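
As a sanity check when changing the sample rate, STFT parameters are usually kept consistent in milliseconds rather than in samples. A generic illustration follows; the actual hparams names in this repository may differ:

    def frame_params(sample_rate, frame_shift_ms=12.5, frame_length_ms=50.0):
        # convert frame shift/length from milliseconds to samples
        hop_length = int(sample_rate * frame_shift_ms / 1000)
        win_length = int(sample_rate * frame_length_ms / 1000)
        return hop_length, win_length

    print(frame_params(22050))  # (275, 1102)
    print(frame_params(20000))  # (250, 1000)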

WaveGlow synthesis result

I tried TTS with WaveGlow as follows, but I got a noisy result.
Could you explain the reason to me?
def synthesis_waveglow(text_seq, model, waveglow, alpha=1.0, mode=""):
    denoiser = Denoiser(waveglow)
    text = text_to_sequence(text_seq, hp.text_cleaners)
    text = text + [0]
    text = np.stack([np.array(text)])
    text = torch.from_numpy(text).long().to(device)

    pos = torch.stack([torch.Tensor([i + 1 for i in range(text.size(1))])])
    pos = pos.long().to(device)

    model.eval()
    with torch.no_grad():
        _, mel_postnet = model(text, pos, alpha=alpha)
    with torch.no_grad():
        # wav = waveglow.infer(mel_postnet, sigma=0.666)
        wav = waveglow.infer(torch.transpose(mel_postnet, 1, 2).type(torch.cuda.HalfTensor), sigma=0.666)
    print("Wav Have Been Synthesized.")

    if not os.path.exists("results"):
        os.mkdir("results")

    wav_denoised = denoiser(wav, strength=0.01)[:, 0]
    # audio.save_wav(wav[0].data.cpu().numpy(), os.path.join(
    #     "results", text_seq + mode + ".wav"))
    audio.save_wav(wav_denoised[0].cpu().numpy(), os.path.join(
        "results", text_seq + mode + ".wav"))

Thank you

TypeError when running synthesis.py

Hi xcmyz,

I have trained a model to step 172000 using train.py, and I want to use this model for synthesis,
but when I run synthesis.py I get a TypeError during synthesis:
"TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'"
Did I miss something during training or synthesis?

thanks

Here is the full error log:

(Tacotron) [wann31828@glogin1 FastSpeech]$ python synthesis.py
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Model Have Been Loaded.
Traceback (most recent call last):
File "synthesis.py", line 91, in
synthesis_griffin_lim(words, model, alpha=1.0, mode="normal")
File "synthesis.py", line 45, in synthesis_griffin_lim
mel, mel_postnet = model(text, pos, alpha=alpha)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'

minor issue that might prevent some trouble later on

Hello,
Thanks a lot for publishing this amazing repo.
I just have a little concern that I would like to let you know about.
I see that you have uploaded some papers under the 'paper' directory.
I don't think that is a safe thing to do with regard to copyright.
The copyright terms for each paper might differ, but just to be safe, how about replacing the papers with links to them, or just their titles?
