
Tacotron-2:

Tensorflow implementation of Tacotron-2, a deep neural network architecture described in this paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

This repository contains additional improvements and experiments beyond the paper; we therefore provide a paper_hparams.py file which holds the exact hyperparameters needed to reproduce the paper's results without any extras.

The suggested hparams.py file, used by default, contains the hyperparameters with extras that have proven to give better results in most cases. Feel free to adjust the parameters as needed.

DIFFERENCES WILL BE HIGHLIGHTED IN DOCUMENTATION SHORTLY.

Repository Structure:

Tacotron-2
├── datasets
├── en_UK		(0)
│   └── by_book
│       └── female
├── en_US		(0)
│   └── by_book
│       ├── female
│       └── male
├── LJSpeech-1.1	(0)
│   └── wavs
├── logs-Tacotron	(2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── taco_pretrained
│   ├── metas
│   └── wavs
├── logs-Wavenet	(4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── logs-Tacotron-2	( * )
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── taco_pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output	(3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output	(5)
│   ├── plots
│   └── wavs
├── training_data	(1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
    └── models

The previous tree shows the current state of the repository (separate training, one step at a time).

  • Step (0): Get your dataset; here I use LJSpeech, en_US and en_UK (from M-AILABS) as examples.

  • Step (1): Preprocess your data. This will give you the training_data folder.

  • Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.

  • Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

  • Step (4): Train your Wavenet model. Yields the logs-Wavenet folder.

  • Step (5): Synthesize audio using the Wavenet model. Gives the wavenet_output folder.

  • Note: Steps 2, 3, and 4 can be done in a single run for both Tacotron and WaveNet (Tacotron-2, step ( * )).

Note:

  • Our preprocessing only supports LJSpeech and LJSpeech-like datasets (M-AILABS speech data)! If running on datasets stored differently, you will probably need to write your own preprocessing script.
  • In the previous tree, files were not represented and max depth was set to 3 for simplicity.
  • If you run training of both models at the same time, the repository structure will be slightly different.

Pretrained model and Samples:

Pre-trained models and audio samples will be added at a later date. You can however check some preliminary insights into the model's performance (at early stages of training) here. THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON

Model Architecture:

The model described by the authors can be divided into two parts:

  • Spectrogram prediction network
  • Wavenet vocoder

For an in-depth exploration of the model architecture, training procedure and preprocessing logic, refer to our wiki.

Current state:

For an overview of our progress on this project, please refer to this discussion.

Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during WaveNet training.

How to start

  • Machine Setup:

First, you need to have python 3 installed along with Tensorflow.

Next, you need to install some Linux dependencies to ensure audio libraries work properly:

apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools

Finally, you can install the requirements. If you are an Anaconda user (otherwise replace pip with pip3 and python with python3):

pip install -r requirements.txt

  • Docker:

Alternatively, you can build the docker image to ensure everything is set up automatically and use the project inside docker containers. The Dockerfile is inside the "docker" folder.

The docker image can be built with:

docker build -t tacotron-2_image docker/

Then containers are runnable with:

docker run -i --name new_container tacotron-2_image

Please report any issues you encounter when using our models with Docker; I'll get to them. Thanks!

Dataset:

We tested the code above on the LJSpeech dataset, which has almost 24 hours of labeled recordings of a single female speaker. (Further info on the dataset is available in the README file included when you download it.)

We are also running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) in more than 10 languages.

After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.

Hparams setting:

Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all in the hparams.py file directly.
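
For example, a one-off override from the command line typically looks like the following (tacotron_batch_size and outputs_per_step are hyperparameter names that appear in the issues below; the values here are purely illustrative):

python train.py --model='Tacotron' --hparams='tacotron_batch_size=32,outputs_per_step=2'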

To pick optimal fft parameters, I have made a griffin_lim_synthesis_tool notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in hparams.py and have meaningful names so that you can try multiple things with them.
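
For reference, here is a rough sketch of the kind of inversion the notebook performs (my own example, not the notebook's code): it assumes librosa >= 0.7 and soundfile are installed, a plain (denormalized) magnitude linear spectrogram, a hypothetical file name, and fft/hop/sample-rate values that match your hparams.

import numpy as np
import librosa
import soundfile as sf

# Load a saved linear spectrogram (shape [frames, 1 + fft_size // 2]) and invert it
# with Griffin-Lim to hear how much the chosen fft parameters degrade the audio.
lin = np.load('training_data/linear/linear-LJ001-0001.npy')  # hypothetical file name
wav = librosa.griffinlim(lin.T, n_iter=60, hop_length=256, win_length=2048)
sf.write('griffin_lim_check.wav', wav, 22050)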

AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!

Preprocessing

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

The dataset can be chosen using the --dataset argument. If using the M-AILABS dataset, you need to provide the language, voice, reader, merge_books and book arguments for your custom needs. The default is LJSpeech.

Example M-AILABS:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'

or if you want to use all books for a single speaker:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True

This should take no longer than a few minutes.

Training:

To train both models sequentially (one after the other):

python train.py --model='Tacotron-2'

The feature prediction model can be trained separately using:

python train.py --model='Tacotron'

Checkpoints will be made every 5000 steps and stored under the logs-Tacotron folder.

Naturally, training the wavenet separately is done by:

python train.py --model='WaveNet'

Logs will be stored inside the logs-Wavenet folder.

Note:

  • If the model argument is not provided, training will default to Tacotron-2 model training (both models).
  • Please refer to the train arguments in train.py for the set of options you can use.
  • It is now possible to run wavenet preprocessing alone using wavenet_preprocess.py.

Synthesis

To synthesize audio in an End-to-End (text to audio) manner (both models at work):

python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are three types of mel spectrogram synthesis:

  • Evaluation (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

python synthesize.py --model='Tacotron'

  • Natural synthesis (let the model make predictions alone by feeding last decoder output to the next time step).

python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

  • Ground Truth Aligned synthesis (DEFAULT: the model is assisted by true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms that will train the wavenet vocoder (it yields better results, as stated in the paper).

python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True

Synthesizing waveforms conditioned on previously synthesized mel spectrograms (separately) can be done with:

python synthesize.py --model='WaveNet'

Note:

  • If the model argument is not provided, synthesis will default to Tacotron-2 model synthesis (end-to-end TTS).
  • Please refer to the synthesis arguments in synthesize.py for the set of options you can use.

References and Resources:

tacotron-2's People

Contributors

alex73, arloz, begeekmyfriend, dsmiller, h-meru, jyegerlehner, m-toman, metaln37, neverjoe, nikitos9000, r9y9, rayhane-mamah, scribblemaniac, yeongtae


tacotron-2's Issues

How to fix an error output?

Hi, thank you for this project. And the wiki helped me a lot.

I used the pre-trained model. As you might already know, it still fails to read some words; e.g., it reads "obey [əˈbā]" in multiple wrong ways like [əˈbī] and [əˈbē], probably because the word appears only 3 times in the dataset (LJ Speech) and the model had no luck.

So, how can we effectively re-train the model to remove/mitigate a specific observed error output? Adding extra audio clips containing the "obey" pronunciation sounds impractical, but reusing the three existing clips risks overfitting. I think this is one of the drawbacks of seq2seq models in general: they replace the traditional pipeline with a single neural network, but we lose the controllability and customizability of the model.

This issue might be out of the interest of the project, but I really want to hear some ideas about how to gain more controllability of a seq2seq model.

The takeaway is: does anyone know

  • how to re-train the trained seq2seq model to fix an error output, hopefully not in a resource-intensive way
  • any idea or an ongoing research about a more controllable seq2seq model

Any help will be appreciated.

Strange alignment when training model

First of all, thank you for this great implementation of Tacotron 2. I'm trying to train on Vietnamese with my own dataset. The dataset is 22.05 kHz, about 6 hours, with all silence already trimmed. But I can't see any alignment from the model. Is that a problem?

step-24000-align

step-24000-pred-mel-spectrogram

step-24000-real-mel-spectrogram

step-25500-align

step-25500-pred-mel-spectrogram

step-25500-real-mel-spectrogram

sound.zip

Thanks for helping me!

tabs vs spaces?

Is there a strong preference for tabs over spaces? PEP 8 prefers spaces, and I guess most people's editors default to spaces. Am I alone in preferring a switch to spaces?

Evaluate a string directly

Hi, thank you for building this implementation. I have found it to be the easiest to work with of any Tacotron model. I'm currently at 25,000 steps and the checkpoint examples sound good so I would like to run some custom evaluations. Is there a safe way to pass in a string directly and output a .wav? From initial inspection it looks like I might be able to do this by overwriting hparams.sentences, but I would prefer to bypass the plotting and logging that happens during normal operation for enhanced speed. Thanks for all the hard work!
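
A minimal sketch of one way to do this (assuming a stripped-down synthesizer interface with load(checkpoint_path) and synthesize(text) returning wav bytes, like the demo_synthesizer shown in the Tornado server issue further down this page; the checkpoint directory follows the logs-Tacotron layout above):

import tensorflow as tf
from tacotron.demo_synthesizer import Synthesizer  # assumed simplified copy of tacotron/synthesizer.py

# Point at the latest Tacotron checkpoint and synthesize a single custom string to a wav file.
checkpoint_state = tf.train.get_checkpoint_state('logs-Tacotron/taco_pretrained/')
synth = Synthesizer()
synth.load(checkpoint_state.model_checkpoint_path)
wav_bytes = synth.synthesize('Hello world, this is a custom evaluation sentence.')
with open('custom_eval.wav', 'wb') as f:
	f.write(wav_bytes)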

unexpected keyword argument 'previous_alignments'

Traceback (most recent call last):
  File "/home/zeng/work/pycharm/Tacotron-2/train.py", line 32, in <module>
    main()
  File "/home/zeng/work/pycharm/Tacotron-2/train.py", line 26, in main
    tacotron_train(args)
  File "/home/zeng/work/pycharm/Tacotron-2/tacotron/train.py", line 183, in tacotron_train
    train(log_dir, args)
  File "/home/zeng/work/pycharm/Tacotron-2/tacotron/train.py", line 73, in train
    model.initialize(feeder.inputs, feeder.input_lengths, feeder.mel_targets, feeder.token_targets)
  File "/home/zeng/work/pycharm/Tacotron-2/tacotron/models/tacotron.py", line 106, in initialize
    maximum_iterations=max_iters)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 286, in dynamic_decode
    swap_memory=swap_memory)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 234, in body
    decoder_finished) = decoder.step(time, inputs, state)
  File "/home/zeng/work/pycharm/Tacotron-2/tacotron/models/custom_decoder.py", line 123, in step
    (cell_outputs, stop_token), cell_state = self._cell(inputs, state)
  File "/home/zeng/work/pycharm/Tacotron-2/tacotron/models/Architecture_wrappers.py", line 194, in __call__
    attention_layer=None)
  File "/home/zeng/tf_gpu_py3/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py", line 973, in _compute_attention
    cell_output, previous_alignments=previous_alignments)
TypeError: __call__() got an unexpected keyword argument 'previous_alignments'

Hi Rayhane-mamah, great job! I tried to train the model but got this error. It seems that previous_alignments is not a parameter of LocationSensitiveAttention? Anything wrong?

Multi-language Support?

Is there any plan for multi-language support, such as Chinese? Or any instructions for preparing a dataset for training?

No model to load at logs-Tacotron/taco_pretrained/

hello @Rayhane-mamah

I'm trying to train the model on Google Cloud App Engine using the LJSpeech dataset.

I followed the steps in the README.

The preprocessing step works fine (python3 preprocess.py),

but when I run python3 train.py --model='Both' (or 'Tacotron' / 'WaveNet'),
the program always says "No model to load at logs-Tacotron/taco_pretrained/" and is then killed.

Am I missing something? Or do I need a pre-trained model?

thank you for your attention
cheers

Update:
Found the problem.
The problem is I don't have enough memory and CPU to run it... silly me.
Solved.

Implementation Status and planned TODOs

This umbrella issue tracks my current progress and discusses the priority of planned TODOs. It has been closed since all objectives were hit.

Goal

  • achieve a high quality human-like text to speech synthesizer based on DeepMind's paper
  • provide a pre-trained Tacotron-2 model (Training.. checking this still)

Model

Feature Prediction Model (Done)

  • Convolutional-RNN encoder block
  • Autoregressive decoder
  • Location Sensitive Attention (+ smoothing option)
  • Dynamic stop token prediction
  • LSTM + Zoneout
  • reduction factor (not used in the T2 paper)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

  • 1D dilated convolution
  • Local conditioning
  • Global conditioning
  • Upsampling network (by transposed convolutions)
  • Mixture of logistic distributions
  • Gaussian distribution for waveforms modeling
  • Exponential Moving Average (train + synthesis)

Scripts

  • Feature prediction model: training
  • Feature prediction model: natural synthesis
  • Feature prediction model: ground-truth aligned synthesis
  • Wavenet vocoder model: training (ground truth Mel-Spectrograms)
  • Wavenet vocoder model: training (ground truth aligned Mel-Spectrograms)
  • Wavenet vocoder model: waveforms synthesis
  • Global model: synthesis (from text to waveforms)

Extra (optional):

  • Griffin-Lim (as an alternative vocoder)
  • Reduction factor (speed up training, reduce model complexity + better alignment)
  • Curriculum-Learning for RNN Natural synthesis. paper
  • Post processing network for Linear Spectrogram mapping
  • Wavenet with Gaussian distribution (reference)

Notes:

All models in this repository will be implemented in Tensorflow in a first stage, so if you want to use a Wavenet vocoder implemented in Pytorch, you can refer to this repository, which shows very promising results.

Where is the wavenet input "tacotron_output/gta/map.txt"?

I have trained a Tacotron model and the directory structure is as follows:

tacotron_output
├── eval
│   ├── map.txt
│   └── *.npy
└── logs-eval
    ├── plots
    └── wavs

And then I wanted to train the WaveNet model, but it confused me that I need the input tacotron_output/gta/map.txt. Where can I find it? Is it tacotron_output/eval/map.txt? But the metadata fetched in the feeder doesn't seem to be the proper one.
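
For reference, the README's synthesis section above states that Ground Truth Aligned synthesis is what produces the mel spectrograms used to train the wavenet vocoder, so the gta folder (and, by analogy with the eval output shown above, its map.txt) should be generated by running:

python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True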

Fast evaluation question with no GPU Currently on hand

I'm using the default training dataset (LJSpeech-1.1) and have no GPU on my machine. The CPU training speed is about 7.182 sec/step. For a fast evaluation, how many steps do you think would be enough to see a difference (a big enough performance difference) compared to other TTS systems?

And since training is running, is it OK to pause (Ctrl+C) when I want to stop and evaluate?

error loading checkpoint in synthesis time

Hi, and thanks for publishing Tacotron-2.
I trained the network using python train.py --model='Both', then tried to synthesize from the checkpoints using python synthesize.py --model='Tacotron-2' and ran into this error:

NotFoundError (see above for traceback): Key model_1/model/ResidualConv1dGLU_0/residual_block_cin_conv/bias_residual_block_cin_conv/ExponentialMovingAverage not found in checkpoint

Problems on Chinese mandarin using latest model

Hi, Rayhane-mamah,
Thanks for sharing the code.
I have some problems with it; can you give some advice?
I found that when I used tacotron_batch_size = 32 and outputs_per_step = 1 (with "Both"), my training rate is about 7 sec/step. Is that normal? It means it would take about 622 hours to reach 320k steps, which is the default step limit. I found in issue 18 that @begeekmyfriend used the old model at a rate of about 1 sec/step, which I think is comfortable. I didn't change your parameters except for predict_linear = True, sample_rate and cleaners. I use a professional P40 card and it used about 16 GB of memory, so that shouldn't be a problem. May I ask about the speed of your model during training?
Looking forward to your feedback.

Adding padding to inputs and targets in wavenet_vocoder/feeder.py seems wrong

wavenet_vocoder/feeder.py adds padding to the inputs and targets with this code:

def _pad_inputs(x, maxlen):
	return np.pad(x, [(0, maxlen - len(x)), (0, 0)], mode='constant', constant_values=_pad)

def _pad_targets(x, maxlen):
	return np.pad(x, (0, maxlen - len(x)), mode='constant', constant_values=_pad)

I think the valid padding value should be 'silence', which is 0.0 for the raw input mode, 0.0 = mulaw(0, 256) for mulaw, and 128 = mulaw_quantize(0, 256) for mulaw quantization.

When input_type="mulaw-quantize", zeros([256]) is padded to the inputs instead of one_hot(128, 256), and 0 is padded to the targets instead of 128.
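
A small sketch of the suggested fix for the one-hot case (my own illustration, under the assumption above that quantized silence is mulaw_quantize(0, 256) == 128; not the repository's actual code):

import numpy as np

def _pad_onehot_inputs(x, maxlen, silence_idx=128):
	# Pad one-hot mulaw-quantized inputs with one_hot(silence_idx, 256) rows
	# (i.e. quantized silence) instead of all-zero rows.
	pad_len = maxlen - len(x)
	pad_block = np.zeros((pad_len, x.shape[1]), dtype=x.dtype)
	pad_block[:, silence_idx] = 1.0
	return np.concatenate([x, pad_block], axis=0)

def _pad_quantized_targets(x, maxlen, silence_value=128):
	# Pad scalar mulaw-quantized targets with 128 (quantized silence) instead of 0.
	return np.pad(x, (0, maxlen - len(x)), mode='constant', constant_values=silence_value)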

Need some information

Hello Rayhane,

I would like to try your program; before that, I would like to know a few things:

  • Do you have any samples generated using your program?
  • How many training steps are required to get noticeable output?
  • And after how many steps does it get stable?

preprocess.py running error, any package dependence needed?

I tried adding a path, to fix the package-not-found error for the modules under models, on my Ubuntu 16.04 machine:

import argparse
++ import sys
++ sys.path.append("./models")
from tacotron.preprocess import tacotron_preprocess
from multiprocessing import cpu_count

and removing the tacotron prefix from the "from tacotron.xxx import xxx" imports to avoid errors like the one below

Traceback (most recent call last):
  File "train.py", line 10, in <module>
    from tacotron.datasets.feeder import Feeder
ImportError: No module named 'tacotron'

but still get error like below:

tacotron$ python3 preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 5, in <module>
    from tacotron.preprocess import tacotron_preprocess
ImportError: No module named 'tacotron.preprocess'; 'tacotron' is not a package

I grepped the project and only found tacotron_preprocess used in preprocess.py but not defined. Is there any other package I need to install besides those in requirements.txt?

Thanks

Pre-trained model and audio samples.

Hello everyone, just wanted to drop some naturally synthesized audio samples (griffin-lim inversion of mels) along with the model that generated them. Enjoy!

samples.tar.gz

Sentences:

  • hello everyone and thank you for hearing me out!
  • these are probably my first words as a pre-trained model.
  • finally, today, after about 4 months of development I am able to speak naturally
  • my voice is still robotic indeed, that's why my developer is working on Wavenet vocoder at this particular moment.
  • I still have some small problems with caps, words context and typos. but I read stuff differently each time.
  • Let me show you what I mean!
  • even if it doesn't sound like it, I do have feelings!
  • even if it doesn't sound like it, I do have feelings! (repetition to show slight difference in generation)
  • this is due to the pre-net drop-out even at inference time.
  • another thing I want to mention is that I am currently able to read long sentences with ease. here's an example!
  • Sequence to sequence models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization. This tutorial gives readers a full understanding of sequence to sequence models and shows how to build a competitive sequence to sequence model from scratch. We focus on the task of neural machine translation which was the very first wild success for this architecture.
  • the tutorial should be in the git-hub comment. ( this one )
  • you also probably noticed how well I can read complex words.
  • so to wrap things up, I am slowly progressing, and hopefully I will speak like humans very soon.
  • Thank you all so much for your support!

Incompatible Shapes Error when Training Wavenet

Exiting due to Exception: Incompatible shapes: [2,12799,10] vs. [2,12799]
[[Node: model/loss/sub_7 = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/loss/strided_slice_7, model/loss/Max)]]
[[Node: model/loss/truediv/_975 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20598_model/loss/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'model/loss/sub_7', defined at:
File "train.py", line 127, in
main()
File "train.py", line 121, in main
train(args, log_dir, hparams)
File "train.py", line 75, in train
checkpoint = wavenet_train(args, log_dir, hparams, input_path)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/train.py", line 244, in wavenet_train
return train(log_dir, args, hparams, input_path)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/train.py", line 167, in train
model, stats = model_train_mode(args, feeder, hparams, global_step)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/train.py", line 118, in model_train_mode
model.add_loss()
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 347, in add_loss
self.loss = DiscretizedMixtureLogisticLoss(self.y_hat[:, :, :-1], self.y[:, 1:, :], hparams=self._hparams, mask=self.mask)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/models/modules.py", line 370, in DiscretizedMixtureLogisticLoss
log_scale_min=hparams.log_scale_min, reduce=False)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/models/mixture.py", line 70, in discretized_mix_logistic_loss
log_probs = log_probs + log_prob_from_logits(logit_probs)
File "/home/seaton/repos/PyPPA/Tacotron-2/wavenet_vocoder/models/mixture.py", line 17, in log_prob_from_logits
return x - m - tf.log(tf.reduce_sum(tf.exp(x-m), axis))
File "/home/seaton/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
return func(x, y, name=name)
File "/home/seaton/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4636, in _sub
"Sub", x=x, y=y, name=name)
File "/home/seaton/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seaton/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/seaton/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

This occurred during the training of the Tacotron-2 model just after the Tacotron portion had been completed. Is there any way to get around this?

About the attention

Hi, Rayhane-mamah,
Thanks for sharing the code. I previously got an acceptable result with Tacotron, but sometimes the Tacotron model produces repeated words or misses some words. In the Tacotron 2 paper, the authors say they use a different attention model to handle these problems. So I replaced the attention model in Tacotron with the corresponding code used in Tacotron 2. However, I can't get a promising result anymore; the model can't even produce correct voices for the given text. Can you give me some suggestions about that?

Out of memory error

Hello,

I'm trying to train your model with my custom Russian-language data on a GeForce GTX 1080 with 8 GB of memory. When batch_size is set to anything greater than 8, I constantly get an out-of-memory error somewhere between steps 8-19. Do you have any idea why this could happen?

2018-04-10 16:52:27.494495: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 591.41MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-10 16:52:37.495788: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 118.28MiB.  Current allocation summary follows.
2018-04-10 16:52:37.540543: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-04-10 16:52:37.540674: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at random_op.cc:202 : Resource exhausted: OOM when allocating tensor with shape[16,3785,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Exiting due to exception: OOM when allocating tensor with shape[16,3785,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/dropout_conv_layer_2_postnet_convolutions/dropout/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/optimizer/gradients/model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/dropout_conv_layer_2_postnet_convolutions/dropout/div_grad/Shape)]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  154.23MiB from model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/dropout_conv_layer_2_postnet_convolutions/dropout/div
  118.28MiB from model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/conv1d/conv1d/Conv2D
  118.28MiB from model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/batch_normalization/batchnorm/mul_1
  118.28MiB from model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/dropout_conv_layer_1_postnet_convolutions/dropout/div
  118.28MiB from model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/dropout_conv_layer_1_postnet_convolutions/dropout/random_uniform/RandomUniform
  118.28MiB from model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/conv1d/conv1d/Conv2D
  118.28MiB from model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/batch_normalization/batchnorm/mul_1
  32.00MiB from model/optimizer/gradients/model/loss/L2Loss_16_grad/mul
  28.00MiB from model/optimizer/gradients/model/loss/L2Loss_15_grad/mul
  19.99MiB from model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/conv1d/conv1d/Conv2D
  18.48MiB from model/inference/decoder/transpose
  18.48MiB from model/optimizer/gradients/model/loss/mean_squared_error/Sum_grad/Tile
  17.06MiB from model/inference/embedding_lookup
  17.06MiB from model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/batch_normalization/batchnorm/mul_1
  17.06MiB from model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/dropout_conv_layer_1_encoder_convolutions/dropout/div
  17.06MiB from model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/dropout_conv_layer_1_encoder_convolutions/dropout/random_uniform/RandomUniform
  17.06MiB from model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/conv1d/conv1d/Conv2D
  17.06MiB from model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/batch_normalization/batchnorm/mul_1
  17.06MiB from model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/dropout_conv_layer_2_encoder_convolutions/dropout/div
  17.06MiB from model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/dropout_conv_layer_2_encoder_convolutions/dropout/random_uniform/RandomUniform
  17.06MiB from model/inference/encoder_convolutions/conv_layer_3_encoder_convolutions/conv1d/conv1d/Conv2D
  17.06MiB from model/inference/encoder_convolutions/conv_layer_3_encoder_convolutions/batch_normalization/batchnorm/mul_1
  17.06MiB from model/inference/encoder_convolutions/conv_layer_3_encoder_convolutions/dropout_conv_layer_3_encoder_convolutions/dropout/random_uniform/RandomUniform
  17.06MiB from model/inference/encoder_LSTM/concat
  17.06MiB from model/optimizer/gradients/model/inference/decoder/while/CustomDecoderStep/MatMul/Enter_grad/zeros
  16.06MiB from model/inference/encoder_LSTM/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3
  16.06MiB from model/inference/encoder_LSTM/bidirectional_rnn/bw/bw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3
  8.34MiB from model/inference/decoder/while/CustomDecoderStep/Location_Sensitive_Attention/add
  7.32MiB from model/inference/decoder/while/CustomDecoderStep/Location_Sensitive_Attention/add
  5.36MiB from model/inference/decoder/while/CustomDecoderStep/Location_Sensitive_Attention/add
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_1_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_4_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_7_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_26_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_29_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_32_grad/mul
  5.00MiB from model/optimizer/gradients/model/loss/L2Loss_35_grad/mul
  4.27MiB from model/optimizer/gradients/model/inference/decoder/while/CustomDecoderStep/Location_Sensitive_Attention/add/Enter_grad/zeros
  4.27MiB from model/inference/decoder/while/CustomDecoderStep/Location_Sensitive_Attention/add
  (the 4.27MiB Location_Sensitive_Attention/add line above repeats dozens of times in this allocation summary)
  Remaining 41046 nodes with 5.32GiB

Effects on WaveNet predicted wavs

It seems the loss decrease during WaveNet training is unsteady. Is this all right, or should I wait for more steps? The predicted wavs under logs-Wavenet/wavs sound OK, but the ones under logs-Wavenet/eval-dir/wavs sound like a mess...
image

About Training Helper Problem

I have rewritten feeder.py to use the latest tf.data API, but I found a problem in your code:

next_inputs = tf.cond(
	tf.less(tf.random_uniform([], minval=0, maxval=1, dtype=tf.float32), self._ratio),
	lambda: self._targets[:, time, :],  # Teacher-forcing: return true frame
	lambda: outputs[:, -self._output_dim:])

The next_inputs in helper.py depends on the teacher-forcing ratio, but if time is the final step, then (as far as I know) len(stop_tokens) = len(targets) + 1, so if the ratio < tf.random_uniform, the targets will not have the index len(stop_tokens) and it will cause an index error.
A possible solution is something like this:
https://github.com/tensorflow/tensorflow/blob/cfd0ea3bfb85d92cdb32760ea024f1e38618d717/tensorflow/contrib/seq2seq/python/ops/helper.py#L248-L250
This returns a zero tensor if finished is true.
Thanks; maybe it is a bug in my own code.
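
A minimal sketch of that idea applied here (my own illustration in the spirit of the linked TensorFlow helper, not the repository's actual fix): only index the targets while decoding is unfinished, otherwise feed zeros.

import tensorflow as tf

def _next_inputs_sketch(self, time, outputs, finished):
	def read_next_frame():
		# Teacher forcing with probability self._ratio, otherwise feed back the prediction.
		teacher_forced = tf.less(
			tf.random_uniform([], minval=0., maxval=1., dtype=tf.float32), self._ratio)
		return tf.cond(teacher_forced,
			lambda: self._targets[:, time, :],         # true frame
			lambda: outputs[:, -self._output_dim:])    # model prediction

	# Once every sequence in the batch is finished, stop reading targets[:, time, :]
	# (which would be out of range on the last step) and feed zeros instead.
	all_finished = tf.reduce_all(finished)
	return tf.cond(all_finished,
		lambda: tf.zeros_like(outputs[:, -self._output_dim:]),
		read_next_frame)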

import the spectrogram into r9y9 wavenet vocoder and some samples

I downloaded the recently updated code and trained for 280,000 steps. The result is OK, but when I feed the generated mel spectrograms into r9y9's model trained for 50k steps, the result is not good. The attached audio is my generation; can you tell me why this is? And when will wavenet_vocoder training be available? Thank you very much.
wavs.zip

Time steps different in dataset preprocessed between Tacotron and WaveNet repo

@Rayhane-mamah I tried using Tacotron-2 GTA output to feed @r9y9's Wavenet as training data
(using the tacotron-preprocessed speech-audio-XXX.npy and GTA speech-mel-XXX.npy, to compare against the output of a wavenet model trained on real wavs)
and I found an assert in

def assert_ready_for_upsampling(x, c):
    #print("len x = %d, len(c) = %d", len(x),len(c))
    assert len(x) % len(c) == 0 and len(x) // len(c) == audio.get_hop_size()

I tried removing the assert check, but the wavenet training eval became pure sharp noise after 10k steps, whereas it should give a voice-like waveform plot at 10k with wavenet-generated training data.

Then I checked the train.txt generated by the two repos; the wav time steps are not the same for the same audio. The code routines in audio.py of the two repos are a little different.
Tacotron-2 :

	mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
	mel_frames = mel_spectrogram.shape[1]  #<<===========N using shape[1] here
.......
	#Ensure time resolution adjustment between audio and mel-spectrogram
	l, r = audio.pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams))

	#Zero pad for quantized signal
	out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
	time_steps = len(out)                                 #<<=========== init timesteps here
	assert time_steps >= mel_frames * audio.get_hop_size(hparams)

	#time resolution adjustement
	#ensure length of raw audio is multiple of hop size so that we can use
	#transposed convolution to upsample
	out = out[:mel_frames * audio.get_hop_size(hparams)]
	assert time_steps % audio.get_hop_size(hparams) == 0

Wavenet repo :

    # Compute a mel-scale spectrogram from the trimmed wav:
    # (N, D)
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32).T
    # lws pads zeros internally before performing stft
    # this is needed to adjust time resolution between audio and mel-spectrogram
    l, r = audio.lws_pad_lr(wav, hparams.fft_size, audio.get_hop_size())

    # zero pad for quantized signal
    out = np.pad(out, (l, r), mode="constant", constant_values=constant_values)
    N = mel_spectrogram.shape[0]              #<<===========N using shape[0] here
    assert len(out) >= N * audio.get_hop_size()

    # time resolution adjustment
    # ensure length of raw audio is multiple of hop_size so that we can use
    # transposed convolution to upsample
    out = out[:N * audio.get_hop_size()]
    assert len(out) % audio.get_hop_size() == 0

    timesteps = len(out)   #<<=========== init timesteps here

Maybe this is the reason why the time steps are different, but I'm not sure whether this is a design difference or just a mistake.
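
For what it's worth, a small consistency check (a sketch based only on the asserts quoted above, not a fix from either repo) that makes the intended invariant explicit: after padding and trimming, the audio length should equal the number of mel frames times the hop size.

def check_time_resolution(out, mel_frames, hop_size):
	# `out` is the padded/trimmed audio; `mel_frames` is the number of spectrogram frames.
	assert len(out) == mel_frames * hop_size, (len(out), mel_frames, hop_size)
	assert len(out) % hop_size == 0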

Taking T2 output to r9y9 wavenet vocoder

Anyone had experience with the above?
I guess the audio hparams need to be the same for both. My intuition for LJSpeech:

  • num_mels=80
  • num_freq=1025; in the wavenet code fft_size=1024, while in T2 fft_size=(1025-1)*2=2048. As far as I understand I can keep this as is, since it all gets reduced to mel bands anyway
  • sample_rate=22050 (as in the LJSpeech dataset)
  • frame_length_ms=46.44 (corresponds to wavenet's fft_size/22050)
  • frame_shift_ms=11.61 (corresponds to wavenet's hop_size=256; 256/22050 = 11.61 ms)
  • preemphasis, not available in r9y9's wavenet implementation

Others: in T2 I don't have fmin (125 in wavenet) and fmax (7600 in wavenet). Looking into the T2 code,
the spectrogram fmin is set to 0 and fmax is set to fsample/2 = 22050/2 = 11025 Hz. Since I'm using a pre-trained wavenet model, I guess I'll need to change these params in the T2 code.
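
To summarize the correspondence being discussed (a sketch of the poster's numbers for LJSpeech; the dict name and values are illustrative, not verified against either repo's hparams):

ljspeech_audio_hparams = {
	'num_mels': 80,
	'sample_rate': 22050,        # LJSpeech native rate
	'num_freq': 1025,            # T2; implies fft_size = (1025 - 1) * 2 = 2048
	'frame_length_ms': 46.44,    # ~ r9y9 wavenet fft_size 1024 / 22050
	'frame_shift_ms': 11.61,     # ~ r9y9 wavenet hop_size 256 / 22050
	'fmin': 125,                 # r9y9 wavenet value; T2 defaults to 0
	'fmax': 7600,                # r9y9 wavenet value; T2 defaults to sample_rate / 2 = 11025
}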

Any remarks, suggestions?

Are there metrics on training speed per iteration in Tacotron

Hi,

Thank you for the great effort put into this project.

In an older issue (#11) I saw it mentioned that @butterl was getting about 7 s/iteration in training, which is 10x faster than what I am experiencing. I am curious whether you or @butterl have some recent metrics on training speed with typical hyperparameters (defaults from some recent commit), or whether you could confirm that what I am experiencing is in the normal range for the current project state.

Tacotron (batch_size=16, all else default):

  • CPU only (core i7 960): 70-75s per iteration
  • GPU (K80): 8-9s per iteration

I understand this is highly dependent on hyperparameters, I just want to make sure there isn't something horribly wrong with my hardware or something else.

Evaluation on Chinese mandarin

step-30000-align
step-30000-pred-mel-spectrogram
step-30000-real-mel-spectrogram
eval-30000.zip
Here are the evaluation results of my training on a 12-hour Chinese mandarin corpus. The voice sounds natural but still somewhat rough.
The modification is available on the mandarin branch of my own repo.
Thanks a lot for this work!

Need help about testing the quality of model by Tornado server.

Thank you so much for the greatest contribution, @Rayhane-mamah :)

I have been working with your project for a month, more or less, applying your code to Chinese pinyin text-to-speech synthesis (thanks to @begeekmyfriend for his earlier Tacotron 2 projects, from his repo based on @keithito's to yours).

I followed the approach of @keithito in his demo_server.py, but re-coded the server with the Tornado web server instead of Falcon as keithito did. However, I have some questions I cannot handle. :(

Here is my demo_server.py:

# tornado
import tornado.httpserver
import tornado.web
from tornado.ioloop import IOLoop
import tornado.web
import tornado.httpclient

# For tornado asynchronous handling
# I still have not figured out how to do asynchronous synthesizing :(
# (in order to speed up the synthesis)
import tornado.netutil
import tornado.process
from tornado import gen

# From demo_server.py
# Importing synthesizer
import argparse, os, io
from hparams import hparams, hparams_debug_string
from tacotron.demo_synthesizer import Synthesizer

# Chinese character Converting & splitting_sentences
# As I am working on Chinese text-to-speech
# I have to convert character into pinyin
# like 你好 -> ni2 hao3 by this pypinyin module
from pypinyin import pinyin, lazy_pinyin, Style
from splitting_sent import splitting_para

# Automatically loading the lastest model
import tensorflow as tf

# Combining the several sentence wavs
# into one single paragraph wav
# with the sox command
from datasets import audio
from subprocess import call

from tornado.options import define, options
define("port", default=8000, help="run on the given port", type=int)

Saving_wav_temporarily = "/home/user/Tacotron-2-mandarin/Saving"
if not os.path.exists(Saving_wav_temporarily):
        os.makedirs(Saving_wav_temporarily)


class MainHandler(tornado.web.RequestHandler):    
    def get(self):    
        self.render("demo.html")    


class SynHandler(tornado.web.RequestHandler):
    def get(self):
        self.set_header("Content-Type", "audio/wav")
        py = pinyin(str(self.get_argument("text")), style=Style.TONE3)    
        py = " ".join([i[0] for i in py if i[0].isalnum()])    
        data = synthesizer.synthesize(py)    
        out = io.BytesIO()    
        audio.save_wav(data, out)    
        self.write(out.getvalue())    
        self.finish("Done")
        

synthesizer = Synthesizer()
handlers = [
    (r"/", MainHandler),
    (r"/synthesize", SynHandler),
]


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint', \
                        default='logs-Tacotron/pretrained/', \
                        help='Full path to model checkpoint')
    parser.add_argument('--hparams', default='')
    args = parser.parse_args()
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    hparams.parse(args.hparams)

    # Automatically loading the lastest model
    checkpoint_path = tf.train.get_checkpoint_state(args.checkpoint).model_checkpoint_path
    synthesizer.load(checkpoint_path)
    print("Serving on port %d" % options.port)

    tornado.options.parse_command_line()
    app = tornado.web.Application(handlers)
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.instance().start()

Also I modify the Synthesizer class in tacotron.synthesizer.py as following:

# demo_synthesizer.py

import io
import os
import numpy as np
import tensorflow as tf
from hparams import hparams
from librosa import effects
from tacotron.models import create_model
from tacotron.utils.text import text_to_sequence
from tacotron.utils import plot
from datasets import audio
from datetime import datetime


class Synthesizer:
	def load(self, checkpoint_path, gta=False, model_name='Tacotron'):
		print('Constructing model: %s' % model_name)
		inputs = tf.placeholder(tf.int32, [1, None], 'inputs')
		input_lengths = tf.placeholder(tf.int32, [1], 'input_lengths')
		targets = tf.placeholder(tf.float32, [1, None, hparams.num_mels], 'mel_targets')
		with tf.variable_scope('model') as scope:
			self.model = create_model(model_name, hparams)
			if gta:
				self.model.initialize(inputs, input_lengths, targets, gta=gta)
			else:
				self.model.initialize(inputs, input_lengths)
			self.mel_outputs = self.model.mel_outputs
			self.alignment = self.model.alignments[0]

		self.gta = gta
		print('Loading checkpoint: %s' % checkpoint_path)
		self.session = tf.Session()
		self.session.run(tf.global_variables_initializer())
		saver = tf.train.Saver()
		saver.restore(self.session, checkpoint_path)


	def synthesize(self, text):
		cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]
		print("Processing:", text)
		seq = text_to_sequence(text, cleaner_names)
		feed_dict = {
			self.model.inputs: [np.asarray(seq, dtype=np.int32)],
			self.model.input_lengths: np.asarray([len(seq)], dtype=np.int32),
		}

		if self.gta:
			feed_dict[self.model.mel_targets] = np.load(mel_filename).reshape(1, -1, 80)

		if self.gta or not hparams.predict_linear:
			mels, alignment = self.session.run([self.mel_outputs, self.alignment], feed_dict=feed_dict)

		else:
			linear, mels, alignment = self.session.run([self.linear_outputs, self.mel_outputs, self.alignment], feed_dict=feed_dict)
			linear = linear.reshape(-1, hparams.num_freq)

		mels = mels.reshape(-1, hparams.num_mels) #Thanks to @imdatsolak for pointing this out

		wav = audio.inv_mel_spectrogram(mels.T)
		out = io.BytesIO()
		audio.save_wav(wav, out)
		return out.getvalue()

demo.html:

<html><title>Demo</title>
<style>
button {background: #28d; padding: 9px 14px; margin-left: 8px; border: none; outline: none;
        color: #fff; font-size: 14px; border-radius: 4px; cursor: pointer}
button:hover {box-shadow: 0 1px 2px rgba(0,0,0,.15); opacity: 0.9;}
button:active {background: #29f}
button[disabled] {opacity: 0.4; cursor: default}
</style>
<body>
<form>
  <textarea id="text" name="text" type="text" rows="50" cols="150">Enter Text</textarea>
  <br/>
  <button id="button" name="synthesize">Speak</button>
</form>
<p id="message"></p>
<audio id="audio" controls autoplay hidden></audio>
<script>
function q(selector) {return document.querySelector(selector)}
q('#text').focus()
q('#button').addEventListener('click', function(e) {
  text = q('#text').value.trim()
  if (text) {
    q('#message').textContent = 'Synthesizing...' 
    q('#button').disabled = true
    q('#audio').hidden = true
    synthesize(text)
  }
  e.preventDefault()
  return false
})
function synthesize(text) {
  fetch('/synthesize?text=' + encodeURIComponent(text), {cache: 'no-cache'})
    .then(function(res) {
      if (!res.ok) throw Error(res.statusText)
      return res.blob()
    }).then(function(blob) {
      q('#message').textContent = ''
      q('#button').disabled = false
      q('#audio').src = URL.createObjectURL(blob)
      q('#audio').hidden = false
    }).catch(function(err) {
      q('#message').textContent = 'Error: ' + err.message
      q('#button').disabled = false
    })
}
</script></body></html>

As a result of the above, I can synthesize one sentence at a time.


However, if I want to synthesize a long paragraph while keeping good synthesis quality in every single sentence, I split the paragraph into several separate sentences with the following splitting_sent.py:

def splitting_para(words):
    """Split a paragraph into sentences at Chinese/ASCII punctuation marks."""
    start = 0
    i = 0
    sents = []
    token = ''  # last look-ahead character; initialized to avoid a NameError on the first iteration

    punt_list = '?+——!,。?、~@#¥%……&*()!'
    for word in words:
        if word in punt_list and token not in punt_list:
            sents.append(words[start:i + 1])  # close the current sentence at the punctuation mark
            start = i + 1
            i += 1
        else:
            i += 1
            token = list(words[start:i + 2]).pop()  # look ahead to the next character
    if start < len(words):
        sents.append(words[start:])  # keep any trailing text without final punctuation
    return sents

And replace the SynHandler within the demo_server.py:

...
class SynHandler(tornado.web.RequestHandler):
    def get(self):
        self.set_header("Content-Type", "audio/wav")
        # Splitting paragraph into sepreated sentences
        sents = []
        sents.append(splitting_para(self.get_argument("text")))
        print("all splited sents: ", sents[0])

        # Converting character into pinyin
        # Saving into Saving_wav_temporarily
        for index, sent in enumerate(sents[0]):
            py = pinyin(str(sent), style=Style.TONE3)
            py = " ".join([i[0] for i in py_testing if i[0].isalnum()])
            wav = synthesizer.synthesize(py)
            audio.save_wav(wav, os.path.join(Saving_wav_temporarily, '{:05d}.wav'.format(index)))

        file_names = []
        for root, dirs, files in os.walk(Saving_wav_temporarily):
            for file in files:
                if os.path.splitext(file)[1] == '.wav':
                    file_names.append(os.path.join(root, file))

        # Combining separated wavs (sentences) into one wav (paragraph)
        os.chdir(Saving_wav_temporarily)
        call(["sox", "*.wav", "output.wav"])
        call(["rm", "!(output.wav)"])

        self.finish("Done")
...

And the Synthesizer class in tacotron/synthesizer.py has to be changed:

...
if self.gta or not hparams.predict_linear:
	mels, alignment = self.session.run([self.mel_outputs, self.alignment], feed_dict=feed_dict)
			
else:
	linear, mels, alignment = self.session.run([self.linear_outputs, self.mel_outputs, self.alignment], feed_dict=feed_dict)
	linear = linear.reshape(-1, hparams.num_freq)

mels = mels.reshape(-1, hparams.num_mels) #Thanks to @imdatsolak for pointing this out

wav = audio.inv_mel_spectrogram(mels.T)
return wav

The result of combining sentences with the method above is that I can only get the wav file of the whole paragraph "offline", but I am not able to play it in the Tornado server.


I think the key point is:

wav = audio.inv_mel_spectrogram(mels.T)
out = io.BytesIO()
audio.save_wav(wav, out)
return out.getvalue()

However, I do not know how to combine every sentence in this step. :(
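
One way to do that step (a sketch under the assumption, as in the modified synthesizer above, that synthesize() returns a raw numpy waveform): concatenate the per-sentence waveforms in memory and write a single wav into a BytesIO buffer that the handler can return directly.

import io
import numpy as np
from datasets import audio

def synthesize_paragraph(synthesizer, sentences):
	# Synthesize each sentence, keep the raw waveforms, and join them into one array.
	wavs = [synthesizer.synthesize(sent) for sent in sentences]
	full_wav = np.concatenate(wavs)
	out = io.BytesIO()
	audio.save_wav(full_wav, out)  # same save_wav call as in the snippets above
	return out.getvalue()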

Stop token can be removed for simplification

I do not think we have to train extra stop token targets to learn when to stop decoding. We can replace this line with the following:

finished = tf.reduce_all(tf.equal(self._targets[:, time], [_target_pad]), axis=1)

It treats _target_pad as the terminating symbol to set the finished token and stop decoding. Then we can remove all stop tokens from the code.

pyaudio errors during installation of requirements

When I run
pip install -r requirements.txt

I get this error:

----------------------------------------
  Failed building wheel for pyaudio
  Running setup.py clean for pyaudio
Successfully built librosa
Failed to build pyaudio
Installing collected packages: python-mimeparse, falcon, numpy, scipy, librosa, matplotlib, tqdm, Unidecode, pyaudio, sounddevice
  Found existing installation: numpy 1.14.3
    Uninstalling numpy-1.14.3:
      Successfully uninstalled numpy-1.14.3
  Found existing installation: scipy 0.18.1
    Uninstalling scipy-0.18.1:
      Successfully uninstalled scipy-0.18.1
  Found existing installation: librosa 0.6.0
    Uninstalling librosa-0.6.0:
      Successfully uninstalled librosa-0.6.0
  Found existing installation: matplotlib 2.0.0
    Uninstalling matplotlib-2.0.0:
      Successfully uninstalled matplotlib-2.0.0
  Found existing installation: tqdm 4.23.3
    Uninstalling tqdm-4.23.3:
      Successfully uninstalled tqdm-4.23.3
  Found existing installation: Unidecode 1.0.22
    Uninstalling Unidecode-1.0.22:
      Successfully uninstalled Unidecode-1.0.22
  Running setup.py install for pyaudio ... error
    Complete output from command /home/soul/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-fk5mr2vd/pyaudio/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-wn2v_iyg/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.6
    copying src/pyaudio.py -> build/lib.linux-x86_64-3.6
    running build_ext
    building '_portaudio' extension
    creating build/temp.linux-x86_64-3.6
    creating build/temp.linux-x86_64-3.6/src
    gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/soul/anaconda3/include/python3.6m -c src/_portaudiomodule.c -o build/temp.linux-x86_64-3.6/src/_portaudiomodule.o
    src/_portaudiomodule.c:29:23: fatal error: portaudio.h: No such file or directory
    compilation terminated.
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
Command "/home/soul/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-fk5mr2vd/pyaudio/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-wn2v_iyg/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-fk5mr2vd/pyaudio/

This command fixed it for me:
sudo apt install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools

Hope this will save someone some time in the future

Missing files

I am having difficulties running this code. I am on Linux.

The preprocess step seemed to have completed, but with some issues:

$ python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'
WARNING:tensorflow:From /home/al/.local/lib64/python3.5/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
initializing preprocessing..
Selecting data folders..
Traceback (most recent call last):
  File "preprocess.py", line 99, in <module>
    main()
  File "preprocess.py", line 95, in main
    run_preprocess(args)
  File "preprocess.py", line 78, in run_preprocess
    preprocess(args, input_folders, output_folder)
  File "preprocess.py", line 14, in preprocess
    metadata = preprocessor.build_from_path(input_folders, mel_dir, wav_dir, args.n_jobs, tqdm=tqdm)
  File "/home/al/Downloads/tacotron/Rayhane-mamah/Tacotron-2/datasets/preprocessor.py", line 31, in build_from_path
    with open(os.path.join(input_dir, 'metadata.csv'), encoding='utf-8') as f:
NotADirectoryError: [Errno 20] Not a directory: 'en_US/by_book/female/mary_ann/._.DS_Store/metadata.csv'
file en_US/by_book/female/mary_ann/midnight_passenger/wavs/midnight_passenger_05_f000269.wav present in csv metadata is not present in wav folder. skipping!
file en_US/by_book/female/mary_ann/northandsouth/wavs/northandsouth_40_f000069.wav present in csv metadata is not present in wav folder. skipping!

._.DS_Store is a file (macOS metadata), not a directory, so I am not sure what's going on here.
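
The error suggests that the preprocessor treats every entry under the reader folder as a book directory, including the macOS metadata files (._* and .DS_Store) copied along with the dataset. A hypothetical workaround (not the repository's code) is to filter hidden entries out before looking for metadata.csv:

import os

def list_book_dirs(reader_dir):
    # Keep only real book directories; skip macOS metadata such as ._* and .DS_Store.
    return [os.path.join(reader_dir, d)
            for d in os.listdir(reader_dir)
            if not d.startswith('.') and os.path.isdir(os.path.join(reader_dir, d))]

Simply deleting the metadata files from the dataset (for example with find en_US -name '._*' -delete) should also get past this error.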

The train step then failed to find training_data/train.txt, which indeed does not exist:

$ python train.py --model='Tacotron'
WARNING:tensorflow:From /home/al/.local/lib64/python3.5/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
Checkpoint path: logs-Tacotron/pretrained/model.ckpt
Loading training data from: training_data/train.txt
Using model: Tacotron
Hyperparameters:
  allow_clipping_in_normalization: True
  attention_dim: 128
  attention_filters: 32
  attention_kernel: (31,)
  cleaners: english_cleaners
  decoder_layers: 2
  decoder_lstm_units: 1024
  embedding_dim: 512
  enc_conv_channels: 512
  enc_conv_kernel_size: (5,)
  enc_conv_num_layers: 3
  encoder_lstm_units: 256
  fft_size: 1024
  fmax: 7600
  fmin: 125
  frame_shift_ms: None
  griffin_lim_iters: 60
  hop_size: 256
  impute_finished: False
  input_type: mulaw-quantize
  log_scale_min: -32.23619130191664
  mask_encoder: False
  mask_finished: False
  max_abs_value: 4.0
  max_iters: 1000
  mel_normalization: True
  min_level_db: -100
  num_mels: 80
  outputs_per_step: 5
  postnet_channels: 512
  postnet_kernel_size: (5,)
  postnet_num_layers: 5
  power: 1.55
  prenet_layers: [256, 256]
  quantize_channels: 256
  ref_level_db: 20
  rescale: True
  rescaling_max: 0.999
  sample_rate: 22050
  silence_threshold: 2
  smoothing: False
  stop_at_any: True
  symmetric_mels: True
  tacotron_adam_beta1: 0.9
  tacotron_adam_beta2: 0.999
  tacotron_adam_epsilon: 1e-06
  tacotron_batch_size: 32
  tacotron_decay_learning_rate: True
  tacotron_decay_rate: 0.4
  tacotron_decay_steps: 50000
  tacotron_dropout_rate: 0.5
  tacotron_final_learning_rate: 1e-05
  tacotron_initial_learning_rate: 0.001
  tacotron_reg_weight: 1e-06
  tacotron_teacher_forcing_ratio: 1.0
  tacotron_zoneout_rate: 0.1
  trim_silence: True
Traceback (most recent call last):
  File "train.py", line 32, in <module>
    main()
  File "train.py", line 26, in main
    tacotron_train(args)
  File "/home/al/Downloads/tacotron/Rayhane-mamah/Tacotron-2/tacotron/train.py", line 174, in tacotron_train
    train(log_dir, args)
  File "/home/al/Downloads/tacotron/Rayhane-mamah/Tacotron-2/tacotron/train.py", line 55, in train
    feeder = Feeder(coord, input_path, hparams)
  File "/home/al/Downloads/tacotron/Rayhane-mamah/Tacotron-2/tacotron/feeder.py", line 38, in __init__
    with open(metadata_filename, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'training_data/train.txt'

It looks like the preprocess script never created it, presumably because of the crash above.

What is the effect of batch size?

Thanks for your great work.

I used your repo to train Tacotron-2 on the LJSpeech database with default settings. It learned good alignments after 6000 steps with batch size 32, but with batch size 24 (because of GPU memory limits) it cannot get a good alignment even after 23000 steps.

So I wonder: how does the batch size affect performance?

WaveNet Vocoder Integration

@Rayhane-mamah, I'm currently starting work on connecting a WaveNet-based vocoder to Tacotron-2. Any pointers on where to hook into your Tacotron-2 implementation? I have a wavenet_vocoder implementation that can use mel- or WAV-conditioning to generate audio. I would like to feed a mel spectrogram into WaveNet and get the final audio out. Any recommendation would be appreciated. In the end, I want to turn it into a parallel WaveNet and put it back on GitHub. The main point is that you have not implemented prediction code so far, and I was wondering where the hook ...
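
As a rough illustration of the kind of hook being asked about, the idea is to load the mel .npy files that Tacotron writes at synthesis time and feed them to the vocoder as local conditioning. Everything below is a hypothetical sketch: load_vocoder and generate_from_mel are placeholder names, not functions from either repository, and the paths are only assumed:

import glob
import numpy as np

vocoder = load_vocoder('wavenet_checkpoint.pth')  # placeholder for the wavenet_vocoder inference API

for mel_path in sorted(glob.glob('tacotron_output/eval/*.npy')):  # assumed location of synthesized mels
    mel = np.load(mel_path)                # [frames, num_mels] mel spectrogram predicted by Tacotron
    wav = vocoder.generate_from_mel(mel)   # WaveNet conditioned on the mel frames
    np.save(mel_path.replace('.npy', '_wav.npy'), wav)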

A Doubt In LocationSensitiveAttention

Hi Rayhane, I'm currently looking into your LocationSensitiveAttention class and don't understand the reason for using cumulate_weights when calculating the next state. I can't find any reference to it in the original paper.

By the way, your work is fantastic :D Appreciate it a lot.
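
For anyone with the same doubt: the point of accumulating alignments is that the location features then encode how much attention each encoder position has already received, which discourages the decoder from repeatedly attending to the same positions. A minimal sketch of the mechanism (not the repository's exact code):

def next_attention_state(alignments, previous_state, cumulate_weights=True):
    # alignments: [batch, encoder_steps] attention weights for the current decoder step.
    # previous_state: running sum of the alignments of all previous steps, same shape.
    if cumulate_weights:
        # Carry the cumulative alignments forward; the location-sensitive convolution
        # filters operate on this sum at the next step.
        return alignments + previous_state
    # Otherwise only the latest alignments are used as location features.
    return alignments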

GPU inference time?

Hi There!

What is the inference time of running Tacotron 2? Is it real time?

Thanks,
Michael


There may be errors in the WaveNet code

y_hat_log = tf.reduce_max(tf.nn.softmax(y_hat_log, axis=1), 1)
y_hat_log = util.inv_mulaw_quantize(y_hat_log, hparams.quantize_channels)
y_log = util.inv_mulaw_quantize(y_log, hparams.quantize_channels)

In line 194, tf.reduce_max should be tf.argmax.

In lines 196 and 197, y_hat_log and y_log are Tensors, so the numpy-based util functions cannot be applied to them.

There are similar errors in:

if is_mulaw(hparams.input_type):
    y_hat_log = util.inv_mulaw(y_hat_log, hparams.quantize_channels)
    y_log = util.inv_mulaw(y_log, hparams.quantize_channels)

and in the eval process.
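
One possible fix for the Tensor issue (a sketch under the assumption that quantize_channels behaves like the usual mu-law channel count, not a patch taken from the repository) is to re-implement the inverse mu-law transforms with TensorFlow ops, so they can be applied to y_hat_log and y_log inside the graph:

import tensorflow as tf

def tf_inv_mulaw(y, quantize_channels=256):
    # Inverse mu-law companding for y in [-1, 1].
    mu = tf.cast(quantize_channels - 1, tf.float32)
    return tf.sign(y) * ((1.0 + mu) ** tf.abs(y) - 1.0) / mu

def tf_inv_mulaw_quantize(q, quantize_channels=256):
    # Map integer class indices in [0, quantize_channels - 1] back to a waveform in [-1, 1].
    mu = tf.cast(quantize_channels - 1, tf.float32)
    y = 2.0 * tf.cast(q, tf.float32) / mu - 1.0
    return tf_inv_mulaw(y, quantize_channels)

The class indices q would come from tf.argmax(y_hat_log, axis=1) rather than tf.reduce_max, matching the first point above.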

Loss exploded to nan after ctrl+C to stop and start again

Hi, I used Ctrl+C to stop Tacotron training in order to free the GPU for other work. When I start Tacotron again, the loss explodes to nan for the existing checkpoint, but everything is fine when I remove all checkpoints and start from the beginning. Any advice on dealing with this?

Loading checkpoint logs-Tacotron-2/taco_pretrained/tacotron_model.ckpt-67500

Generated 32 test batches of size 48 in 4.530 sec

Generated 32 train batches of size 48 in 4.558 sec
Loss exploded to nan at step 67501=nan, avg_loss=nan]
Exiting due to exception: Loss exploded
Traceback (most recent call last):
  File "/home/butter/Audio/CHS/Tacotron-2-master-CHS/tacotron/train.py", line 181, in train
    raise Exception('Loss exploded')
Exception: Loss exploded
Exception in thread background:
Traceback (most recent call last):
  File "/home/butter/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/butter/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/butter/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
         [[Node: datafeeder/eval_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_5, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_4)]]
