
FastSpeech 2 - Pytorch Implementation

This is a Pytorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This project is based on xcmyz's implementation of FastSpeech. Feel free to use/modify the code. Any suggestion for improvement is appreciated.

This repository contains only FastSpeech 2, not FastSpeech 2s, so far. I will update it once I successfully reproduce FastSpeech 2s, the end-to-end version of FastSpeech 2.

Audio Samples

Audio samples generated by this implementation can be found here.

  • The model used to generate these samples is trained for 300k steps on the LJSpeech dataset.
  • Audio samples are converted from mel-spectrogram to raw waveform via NVIDIA's pretrained WaveGlow.

Quickstart

Dependencies

You can install the python dependencies with

pip3 install -r requirements.txt

Note that because I use the new torch.bucketize function, which is only available in PyTorch 1.6, you have to install the nightly build with

pip3 install --pre torch==1.6.0.dev20200428 -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html

Since PyTorch 1.6 is still unstable, I suggest using a Python virtual environment.
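
For reference, torch.bucketize is used in the variance adaptor to map continuous pitch and energy values to quantization-bin indices before embedding them. Below is a minimal sketch of that idea; the f0 range, bin count, and hidden size are made-up values for illustration, not the repository's settings.

import math
import torch

# Hypothetical f0 range; in practice these come from stat.txt (see Preprocessing).
f0_min, f0_max, n_bins = 71.0, 795.8, 256

# Bin boundaries spaced on a log scale (a common choice for pitch).
boundaries = torch.exp(torch.linspace(math.log(f0_min), math.log(f0_max), n_bins - 1))

f0 = torch.tensor([0.0, 120.5, 230.0, 640.2])   # per-frame pitch values in Hz
indices = torch.bucketize(f0, boundaries)        # one bin index per frame
embedding = torch.nn.Embedding(n_bins, 256)      # hidden size 256 is illustrative
pitch_embedding = embedding(indices)             # shape (4, 256)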

Synthesis

You have to download NVIDIA's pretrained WaveGlow and put the checkpoint in the waveglow/pretrained_model/ directory, then download our pretrained FastSpeech 2 model and put it in the ckpt/LJSpeech/ directory.

You can run

python3 synthesis.py --step 300000

to generate any utterance you wish. The generated utterances will be saved in the results/ directory.

Here is a generated spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition."
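
The generated mel-spectrograms are converted to waveforms with the pretrained WaveGlow. synthesis.py already handles this for you, but for context, the vocoding step looks roughly like the sketch below, which mirrors NVIDIA's published WaveGlow inference example; the checkpoint file name and mel shape are assumptions.

import torch

# Assumed checkpoint name; unpickling it requires the WaveGlow model code
# (the waveglow/ module in this repository) to be importable.
waveglow = torch.load("waveglow/pretrained_model/waveglow_256channels.pt")["model"]
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

with torch.no_grad():
    mel = torch.randn(1, 80, 200).cuda()         # (1, n_mels, frames) mel-spectrogram
    audio = waveglow.infer(mel, sigma=0.666)     # raw waveform samples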

Training

Datasets

This project supports two datasets:

  • LJSpeech: consisting of 13,100 short audio clips of a single female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • Blizzard2013: a male speaker reading 10 audiobooks. The prosody variation is greater than in the LJSpeech dataset. Only the 9,741 segmented utterances are used in this project.

After downloading the dataset and extracting the compressed files, you have to modify hp.data_path and some other parameters in hparams.py. The default parameters are for the LJSpeech dataset.
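
As a rough illustration, the fields you typically need to touch look something like the excerpt below; apart from hp.data_path and hp.preprocessed_path, which are named in this README, the exact attribute names and values are assumptions, so check your copy of hparams.py.

# hparams.py (excerpt) -- illustrative only; verify the actual attribute names in the file.
dataset = "LJSpeech"                           # or "Blizzard2013"
data_path = "./LJSpeech-1.1"                   # hp.data_path: location of the raw corpus
preprocessed_path = "./preprocessed/LJSpeech"  # hp.preprocessed_path: output of preprocess.py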

Preprocessing

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech dataset are provided here. You have to put the TextGrid.zip file in your hp.preprocessed_path/ and extract the files before you continue.

Then run the preprocessing script by

python3 preprocess.py

Alternatively, you can align the corpus by yourself. First, download the MFA package and the pretrained lexicon file. (We use the LibriSpeech lexicon instead of the G2p_en Python package proposed in the paper.)

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then prepare some necessary files required by the MFA.

python3 prepare_align.py

Run MFA and put the .TextGrid files in your hp.preprocessed_path.

# Replace $YOUR_DATA_PATH and $YOUR_PREPROCESSED_PATH with, for example, ./LJSpeech-1.1 and ./preprocessed/LJSpeech/TextGrid
./montreal-forced-aligner/bin/mfa_align $YOUR_DATA_PATH montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8

Remember to run the preprocessing script afterwards.

python3 preprocess.py

After preprocessing, you will get a stat.txt file in your hp.preprocessed_path/, recording the maximum and minimum values of the fundamental frequency and energy over the entire corpus. You have to modify the f0 and energy parameters in hparams.py according to your stat.txt file.
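
For context, the F0 statistics are the kind of values you could compute yourself with the PyWorld vocoder, roughly as in the sketch below (energy is computed analogously from the STFT magnitudes). The input file name and the 256-sample hop size are assumptions for illustration, not the repository's exact preprocessing code.

import numpy as np
import pyworld as pw
from scipy.io import wavfile

# Hypothetical clip; preprocess.py iterates over every utterance in the corpus.
sampling_rate, wav = wavfile.read("LJ001-0001.wav")
wav = wav.astype(np.float64)

# Frame-level F0 (Hz) via DIO, refined with StoneMask; the hop size is an assumption.
f0, t = pw.dio(wav, sampling_rate, frame_period=256 / sampling_rate * 1000)
f0 = pw.stonemask(wav, f0, t, sampling_rate)

voiced = f0[f0 > 0]
print("f0 min/max:", voiced.min(), voiced.max())   # candidates for the f0 range in hparams.py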

Training

Train your model with

python3 train.py

The model takes less than 10k training steps (less than 1 hour on my GTX 1080 GPU) to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.

There might be some room for improvement in this repository. For example, I simply add up the duration loss, f0 loss, energy loss, and mel loss without any weighting. Please let me know if you find any useful tips for training the FastSpeech 2 model.
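
If you want to experiment with loss weighting, the change is essentially a one-liner, as in the sketch below; the weights are hypothetical placeholders, not tuned values, and the loss variables stand in for the ones computed in train.py.

import torch

# Placeholder loss values; in train.py these come from the model outputs and targets.
mel_loss = torch.tensor(1.0)
duration_loss = torch.tensor(0.2)
f0_loss = torch.tensor(0.3)
energy_loss = torch.tensor(0.3)

# Unweighted sum, as currently done in this repository:
total_loss = mel_loss + duration_loss + f0_loss + energy_loss

# A possible weighted variant; the weights are hypothetical, not tuned values:
w_mel, w_dur, w_f0, w_energy = 1.0, 0.1, 0.1, 0.1
total_loss = w_mel * mel_loss + w_dur * duration_loss + w_f0 * f0_loss + w_energy * energy_loss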

Notes

Implementation Issues

There are several differences between my implementation and the paper.

  • The paper includes punctuation in the transcripts. However, MFA discards punctuation by default, and I haven't found a way around it.
  • Following xcmyz's implementation, I use an additional Tacotron-2-style postnet after the FastSpeech decoder, which is not used in the original paper.
  • The Transformer paper suggests using dropout after the input and positional embeddings. I haven't tried it yet.
  • The paper suggests using L1 loss for the mel loss and L2 loss for the variance predictor losses, but I find it easier to train the model with an L2 mel loss and L1 variance adaptor losses, for unknown reasons.
  • I use gradient clipping and weight decay during training (see the sketch after this list).
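
A minimal sketch of the last two points using standard PyTorch building blocks; the model, tensor shapes, clipping threshold, and weight-decay coefficient are placeholders, not the repository's exact values.

import torch
import torch.nn as nn

mse_loss = nn.MSELoss()   # L2, used here for the mel-spectrogram loss
mae_loss = nn.L1Loss()    # L1, used here for the variance adaptor losses

# Stand-in model and data; in train.py these come from FastSpeech 2 and the dataset.
model = nn.Linear(16, 16)
prediction = model(torch.randn(8, 16))
mel_target = torch.randn(8, 16)
f0_target = torch.randn(8, 16)

loss = mse_loss(prediction, mel_target) + mae_loss(prediction, f0_target)

# Weight decay is set on the optimizer; 1e-6 here is a placeholder coefficient.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)

optimizer.zero_grad()
loss.backward()
# Clip gradients before the update; the max_norm of 1.0 is a placeholder threshold.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()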

TODO

  • Try different weights for the loss terms.
  • My loss computation does not mask out the paddings.
  • Evaluate the quality of the synthesized audio over the validation set.
  • Find the difference between the F0 & energy predicted by the variance predictors and the F0 & energy of the synthesized utterance measured by PyWorld Vocoder.
  • Implement FastSpeech 2s.

References

  • FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren et al.
  • xcmyz's FastSpeech implementation
  • NVIDIA's WaveGlow implementation
