coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Home Page: http://coqui.ai

License: Mozilla Public License 2.0

Languages: Python 91.95%, Jupyter Notebook 7.52%, HTML 0.26%, Shell 0.13%, Makefile 0.07%, Cython 0.04%, Dockerfile 0.03%
Topics: python, text-to-speech, deep-learning, speech, pytorch, tts, vocoder, tacotron, glow-tts, melgan, speaker-encoder, hifigan, speaker-encodings, multi-speaker-tts, tts-model, speech-synthesis, voice-cloning, voice-synthesis, voice-conversion

tts's Introduction

🐸Coqui.ai News

  • 📣 ⓍTTSv2 is here with 16 languages and better performance across the board.
  • 📣 ⓍTTS fine-tuning code is out. Check the example recipes.
  • 📣 ⓍTTS can now stream with <200ms latency.
  • 📣 ⓍTTS, our production TTS model that can speak 13 languages, is released: Blog Post, Demo, Docs
  • 📣 🐶Bark is now available for inference with unconstrained voice cloning. Docs
  • 📣 You can use ~1100 Fairseq models with 🐸TTS.
  • 📣 🐸TTS now supports 🐢Tortoise with faster inference. Docs

🐸TTS is a library for advanced Text-to-Speech generation.

🚀 Pretrained models in +1100 languages.

🛠️ Tools for training new models and fine-tuning existing models in any language.

📚 Utilities for dataset analysis and curation.




💬 Where to ask questions

Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly so that more people can benefit from it.

Type Platforms
🚨 Bug Reports GitHub Issue Tracker
🎁 Feature Requests & Ideas GitHub Issue Tracker
👩‍💻 Usage Questions GitHub Discussions
🗯 General Discussion GitHub Discussions or Discord

🔗 Links and Resources

Type Links
💼 Documentation ReadTheDocs
💾 Installation TTS/README.md
👩‍💻 Contributing CONTRIBUTING.md
📌 Road Map Main Development Plans
🚀 Released Models TTS Releases and Experimental Models
📰 Papers TTS Papers

🥇 TTS Performance

Underlined "TTS*" and "Judy*" are internal 🐸TTS models that are not released open-source. They are here to show the potential. Models prefixed with a dot (.Jofish .Abe and .Janice) are real human voices.

Features

  • High-performance Deep Learning models for Text2Speech tasks.
    • Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
    • Speaker Encoder to compute speaker embeddings efficiently.
    • Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN)
  • Fast and efficient model training.
  • Detailed training logs on the terminal and Tensorboard.
  • Support for Multi-speaker TTS.
  • Efficient, flexible, lightweight but feature complete Trainer API.
  • Released and ready-to-use models.
  • Tools to curate Text2Speech datasets under dataset_analysis.
  • Utilities to use and test your models.
  • Modular (but not too much) code base enabling easy implementation of new ideas.

Model Implementations

Spectrogram models

End-to-End Models

Attention Methods

  • Guided Attention: paper
  • Forward Backward Decoding: paper
  • Graves Attention: paper
  • Double Decoder Consistency: blog
  • Dynamic Convolutional Attention: paper
  • Alignment Network: paper

Speaker Encoder

Vocoders

Voice Conversion

You can also help us implement more models.

Installation

🐸TTS is tested on Ubuntu 18.04 with Python >= 3.9, < 3.12.

If you are only interested in synthesizing speech with the released 🐸TTS models, installing from PyPI is the easiest option.

pip install TTS

If you plan to code or train models, clone 🐸TTS and install it locally.

git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all,dev,notebooks]  # Select the relevant extras

If you are on Ubuntu (Debian), you can also run the following commands for installation.

$ make system-deps  # intended to be used on Ubuntu (Debian). Let us know if you have a different OS.
$ make install

If you are on Windows, 👑@GuyPaddock wrote installation instructions here.

Docker Image

You can also try TTS without installing it by using the Docker image. Simply run the following commands:

docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
python3 TTS/server/server.py --list_models #To get the list of available models
python3 TTS/server/server.py --model_name tts_models/en/vctk/vits # To start a server

You can then enjoy the TTS server here. More details about the Docker images (like GPU support) can be found here.

Synthesizing speech by 🐸TTS

🐍 Python API

Running a multi-speaker and multi-lingual model

import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# List available 🐸TTS models
print(TTS().list_models())

# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Run TTS
# ❗ Since this is a multi-lingual voice cloning model, we must set the target speaker_wav and language
# Text to speech; returns a list of amplitude values as output
wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en")
# Text to speech to a file
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")

Running a single speaker model

# Init TTS with the target model name
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False).to(device)

# Run TTS
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)

# Example voice cloning with YourTTS in English, French and Portuguese
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to(device)
tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr-fr", file_path="output.wav")
tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt-br", file_path="output.wav")

Example voice conversion

Converting the voice in source_wav to the voice of target_wav

tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda")
tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav")

Example voice cloning together with the voice conversion model.

This way, you can clone voices by using any model in 🐸TTS.

tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    speaker_wav="target/speaker.wav",
    file_path="output.wav"
)

Example text to speech using Fairseq models in ~1100 languages 🤯.

For Fairseq models, use the following name format: tts_models/<lang-iso_code>/fairseq/vits. You can find the language ISO codes here and learn about the Fairseq models here.

# TTS with on the fly voice conversion
api = TTS("tts_models/deu/fairseq/vits")
api.tts_with_vc_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    speaker_wav="target/speaker.wav",
    file_path="output.wav"
)

Command-line tts

Synthesize speech on the command line.

You can either use your trained model or choose a model from the provided list.

If you don't specify any models, it uses the LJSpeech-based English model.

Single Speaker Models

  • List provided models:

    $ tts --list_models
    
  • Get model info (for both tts_models and vocoder_models):

    • Query by type/name: model_info_by_name uses the name as it appears in the output of --list_models.

      $ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
      

      For example:

      $ tts --model_info_by_name tts_models/tr/common-voice/glow-tts
      $ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
      
    • Query by type/idx: The model_query_idx uses the corresponding idx from --list_models.

      $ tts --model_info_by_idx "<model_type>/<model_query_idx>"
      

      For example:

      $ tts --model_info_by_idx tts_models/3
      
    • Query model info by full name:

      $ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
      
  • Run TTS with default models:

    $ tts --text "Text for TTS" --out_path output/path/speech.wav
    
  • Run TTS and pipe out the generated TTS wav file data:

    $ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
    
  • Run a TTS model with its default vocoder model:

    $ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
    

    For example:

    $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav
    
  • Run with specific TTS and vocoder models from the list:

    $ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
    

    For example:

    $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav
    
  • Run your own TTS model (Using Griffin-Lim Vocoder):

    $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
    
  • Run your own TTS and Vocoder models:

    $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
        --vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json
    

Multi-speaker Models

  • List the available speakers and choose a <speaker_id> among them:

    $ tts --model_name "<language>/<dataset>/<model_name>"  --list_speaker_idxs
    
  • Run the multi-speaker TTS model with the target speaker ID:

    $ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>"  --speaker_idx <speaker_id>
    
  • Run your own multi-speaker TTS model:

    $ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
    

Voice Conversion Models

$ tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>

Directory Structure

|- notebooks/       (Jupyter Notebooks for model evaluation, parameter selection and data analysis.)
|- utils/           (common utilities.)
|- TTS
    |- bin/             (folder for all the executables.)
      |- train*.py                  (train your target model.)
      |- ...
    |- tts/             (text to speech models)
        |- layers/          (model layer definitions)
        |- models/          (model definitions)
        |- utils/           (model specific utilities.)
    |- speaker_encoder/ (Speaker Encoder models.)
        |- (same)
    |- vocoder/         (Vocoder models.)
        |- (same)

tts's People

Contributors

adonispujols, agrinh, akx, aya-aljafari, ayushexel, bgerazov, edresson, eginhard, erogol, freds0, gerazov, gorkemgoknar, guypaddock, kaiidams, kirianguiller, lexkoro, manmay-nakhashi, mic92, mittimithai, nmstoker, omahs, p0p4k, reuben, rishikksh20, sanjaesc, synesthesiam, thllwg, thorstenmueller, twerkmeister, weberjulian


tts's Issues

[Feature request] Pass config values to Tensorboard

Is your feature request related to a problem? Please describe.
It is hard to compare models with different configurations by just looking at Tensorboard.

Describe the solution you'd like
We can pass the configuration fields to Tensorboard.
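
One possible approach, sketched below with torch.utils.tensorboard (this is an illustration, not an existing 🐸TTS feature; the config dict and log directory are made-up examples):

from torch.utils.tensorboard import SummaryWriter

# Example config fields (illustrative values only)
config = {"model": "Tacotron2", "batch_size": 32, "lr": 1e-4, "r": 7}

writer = SummaryWriter(log_dir="runs/example")
# Show the fields in TensorBoard's HParams tab, keyed to a metric
writer.add_hparams(config, {"eval/loss": 0.0})
# Or simply dump the whole config as text for side-by-side comparison
writer.add_text("config", str(config))
writer.close()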

[Help] Share your TTS models

Please consider sharing your pre-trained models in any language (if the licences allow that).

We can include them in our model catalogue for public use by attributing your name (website, company etc.).

That would enable more people to experiment together and coordinate, instead of individual efforts to achieve similar goals.

That is also a chance to make your work more visible.

You can share in two ways:

  1. Share the model files with us and we serve them with the next 🐸 TTS release.
  2. Upload your models on GDrive and share the link.

Models are served through the .models.json file, and any model is available via the tts CLI or server endpoints. More details...

(previously mozilla/TTS#395)

[Bug] AttributeError: 'AttrDict' object has no attribute 'generator_model'

Welcome to the 🐸TTS project! We are excited to see your interest, and appreciate your support!

This repository is governed by the Contributor Covenant Code of Conduct. For more details, see the CODE_OF_CONDUCT.md file.

If you've found a bug, please provide the following information:

Describe the bug
A clear and concise description of what the bug is.

To Reproduce

  1. pip install TTS
  2. tts --text "Это голос дикой планеты" --model_name "tts_models/ru/ruslan/tacotron2-DDC" --vocoder_name "tts_models/ru/ruslan/tacotron2-DDC" --out_path example.wav

Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
Expected behavior
Should not crash but generate audio

Environment (please complete the following information):

$ python --version
Python 3.8.5
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic

  • Exact command to reproduce:
    $ tts --text "Это голос дикой планеты" --model_name "tts_models/ru/ruslan/tacotron2-DDC" --vocoder_name "tts_models/ru/ruslan/tacotron2-DDC" --out_path example.wav

Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
tts_models/ru/ruslan/tacotron2-DDC is already downloaded.
Using model: Tacotron2
Traceback (most recent call last):
File "/opt/anaconda3/envs/py38/bin/tts", line 8, in
sys.exit(main())
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/bin/synthesize.py", line 188, in main
synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 49, in init
self.load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 102, in load_vocoder
self.vocoder_model = setup_generator(self.vocoder_config)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/vocoder/utils/generic_utils.py", line 70, in setup_generator
print(" > Generator Model: {}".format(c.generator_model))
AttributeError: 'AttrDict' object has no attribute 'generator_model'

[Discussion] Ideas for better model config management

(I keep it in the issues to refer back to the initial discussion)

Hi All!!

I guess one of the biggest issues in TTS is the way we handle the configs for models and training. Putting example config files under the config folder is hard to maintain and looks complicated for people to start using TTS.

So I want to discuss here some better alternatives and ask for the wisdom of the crowd 🧑‍🤝‍🧑.

Couple of constraints we need to consider from the top of my head.

  • configs should not be python specific, and they should be in a generic form to be serialized and loaded by other systems and programming languages. So if someone likes to export the model and use it in an embedded system config file should not be a problem.
  • configs should allow easy experimentation, collaboration, and reproduction.
  • Each model should explain its config fields. Right now I do this in config.json by violating the JSON format with comments. It is not optimal ☹️.

If you have an idea please share it below and let's discuss it.

Edit:

I should also add one more constraint.

  • We should solve this with no dependencies if possible.

NOTE: This is a continuation of a previously started conversation: mozilla/TTS#660

Originally posted by @erogol in #20

Version conflicts with numba when installing locally

When I run pip install -e . or pip install -r requirements.txt, I get the following errors:

ERROR: umap-learn 0.5.1 has requirement numba>=0.49, but you'll have numba 0.48.0 which is incompatible.
ERROR: pynndescent 0.5.2 has requirement numba>=0.51.2, but you'll have numba 0.48.0 which is incompatible.

Which versions of these two packages should they be downgraded to?

[Bug] DDC-TTS_Universal-Fullband-MelGAN_MAI-karen_savage_ES.ipynb

Hi, while trying to execute the Colab tutorial for synthetizing spanish speech, I got an error when executing the following line:

align, spec, stop_tokens, wav = tts(vocoder_model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

This is the error:

in tts(model, text, CONFIG, use_cuda, ap, use_gl, figures)
12 t_1 = time.time()
13 waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
---> 14 truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
15 print(mel_postnet_spec.shape)
16 mel_postnet_spec = ap._denormalize(mel_postnet_spec.T).T

/content/TTS_repo/TTS/tts/utils/synthesis.py in synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav, truncated, enable_eos_bos_chars, use_griffin_lim, do_trim_silence, speaker_embedding, backend)
239 if backend == 'torch':
240 decoder_output, postnet_output, alignments, stop_tokens = run_model_torch(
--> 241 model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings=speaker_embedding)
242 postnet_output, decoder_output, alignment, stop_tokens = parse_outputs_torch(
243 postnet_output, decoder_output, alignments, stop_tokens)

/content/TTS_repo/TTS/tts/utils/synthesis.py in run_model_torch(model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings)
57 else:
58 decoder_output, postnet_output, alignments, stop_tokens = model.inference(
---> 59 inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
60 elif 'glow' in CONFIG.model.lower():
61 inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device) # pylint: disable=not-callable

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.class():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29

TypeError: inference() got an unexpected keyword argument 'speaker_ids'

Thanks!

"num_samples should be a positive integer value" error if `eval_split_size` is >= size of dataset

We're training a WaveGrad Vocoder on a fairly small dataset right now (~250 samples), and ran into the following error recently:

Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
    _, global_step = train(model, criterion, optimizer, scheduler, scaler,
  File "./TTS/bin/train_vocoder_wavegrad.py", line 82, in train
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
  File "./TTS/bin/train_vocoder_wavegrad.py", line 46, in setup_loader
    loader = DataLoader(dataset,
  File "coqui-tts\lib\site-packages\torch\utils\data\dataloader.py", line 266, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "coqui-tts\lib\site-packages\torch\utils\data\sampler.py", line 103, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

This appears to be related to the undocumented eval_split_size setting in the config.json file. The default config for WaveGrad specifies this as 256. After debugging for a bit, it appears that this setting controls how many files are used for the evaluation set. So, if there are 500 WAV files and eval_split_size is set to 256, the first 256 audio files encountered are used for the evaluation set and the remaining 244 are used for training.

Since it can take a fair bit of debugging for an end-user to understand what's going on, I propose two things:

  1. There should be a sanity/validation check that raises a more appropriate error if the number of WAV files is smaller than the eval_split_size (a sketch follows below).
  2. The eval_split_size parameter in the config should be documented so users understand what it does and can tune it appropriately.
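
A minimal sketch of the proposed sanity check (function and variable names are assumed for illustration, not the actual TTS loader code):

def split_dataset(wav_files, eval_split_size):
    # Fail early with an actionable message instead of an obscure DataLoader error.
    if eval_split_size >= len(wav_files):
        raise ValueError(
            f"eval_split_size ({eval_split_size}) must be smaller than the "
            f"number of samples in the dataset ({len(wav_files)})."
        )
    # First eval_split_size files for evaluation, the rest for training.
    return wav_files[eval_split_size:], wav_files[:eval_split_size]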

[Feature request] accessing __version__ variable

Is your feature request related to a problem? Please describe.
It would be nice to see which version of TTS I am currently using via a TTS.__version__ attribute.

Describe the solution you'd like
Add __version__ information in a _version.py file
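
A sketch of what this could look like (the file layout is assumed from the issue, and the version string is just a placeholder):

# TTS/_version.py
__version__ = "0.0.14"  # placeholder value, bumped on each release

# TTS/__init__.py
from TTS._version import __version__  # noqa: F401

# Afterwards users could do:
#   >>> import TTS
#   >>> TTS.__version__
#   '0.0.14'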

[Bug] MelGAN based vocoders do not use feature matching loss even if it is enabled.

Describe the bug
The condition checking for enabling feature_matching loss in

if self.use_feat_match_loss and not feats_fake:
is always False.

Probably all our trained models are affected by this bug, which caused suboptimal results.

Specifically, I observed that it caused the metallic noise in the model outputs.

Expected behavior
The models should use feat_matching loss

Additional context
For anyone who needs an instant fix, the line indicated above needs to be updated as follows.

        if self.use_feat_match_loss and feats_fake is not None:

Bug: Dynamic Convolution Attention fails in `mixed_precision` training.

Describe the bug
Dynamic Convolutional Attention fails in mixed_precision training and ultimately causes NaN error.

To Reproduce
Steps to reproduce the behavior:

  1. set mixed_precision=True in config.json.
  2. set dynamic_convolution=True in config.json.
  3. start training a tacotron or tacotron2 model.
  4. On TB initially you observe broken attention alignment.
  5. Ultimately loss becomes NaN.

Expected behavior
The model should learn the alignment after 10K iterations with no NaN loss as it does in full precision training.

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • PyTorch or TensorFlow version (use command below): Torch 1.8.0
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU model and memory: 1080Ti
  • Exact command to reproduce:

Additional context
Add any other context about the problem here.

Google Lyra as the vocoder

Google recently open-sourced Lyra (https://github.com/google/lyra), a WaveRNN-style vocoder. I wonder if any of you have thought of using Lyra as the vocoder for TTS.

The benefit of this approach is that we would have a well-engineered, real-time vocoder on mobile devices (and hopefully a high-quality one).

The unknown here is whether Lyra is a good fit for TTS. From reading Google's papers, they use a quantized 160-dimensional mel-spectrogram as the conditioning features, with only one frame of look-ahead.

The source code of this real-time WaveGRU vocoder can be really helpful anyway!

SpeedySpeech model causes error for input text shorter than 13 characters.

Due to the architecture of the model and the total receptive field, it causes errors for input text shorter than 13 characters.

This can be fixed by padding the input text with empty characters (see the sketch after the log below).

(venv) $ tts --model_name tts_models/en/ljspeech/speedy-speech-wn --text "Hey Bruce, what's good in the neighborhood?"
 > tts_models/en/ljspeech/speedy-speech-wn is already downloaded.
 > vocoder_models/en/ljspeech/multiband-melgan is already downloaded.
 > Using model: speedy_speech
Traceback (most recent call last):
  File "/home/josh/venv/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/bin/synthesize.py", line 190, in main
    synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 47, in __init__
    use_cuda)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 96, in load_tts
    self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/tts/models/speedy_speech.py", line 196, in load_checkpoint
    self.load_state_dict(state['model'])
  File "/home/josh/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SpeedySpeech:
	size mismatch for emb.weight: copying a param with shape torch.Size([129, 128]) from checkpoint, the shape in current model is torch.Size([130, 128]).

Thanks, @JRMeyer, for pointing this out 👑
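
A possible user-side workaround until a proper fix lands, sketched below; the 13-character minimum is taken from this issue and the helper name is made up:

MIN_INPUT_LEN = 13  # minimum length implied by the model's receptive field

def pad_text(text: str, min_len: int = MIN_INPUT_LEN) -> str:
    # Right-pad short inputs with spaces so SpeedySpeech gets enough characters.
    return text if len(text) >= min_len else text.ljust(min_len)

print(repr(pad_text("Hi.")))  # 'Hi.          '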

Noam LR Scheduling - AttributeError: 'NoneType' object has no attribute 'step'

I'm working on a step-wise learning rate scheduling method and I wanted to take inspiration from the NoamLR() class found in training.py. When I set noam_schedule: true in the config, the following error is shown.

File "TTS/bin/train_tacotron.py", line 674, in
main(args)
File "TTS/bin/train_tacotron.py", line 640, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 154, in train
scheduler.step()
AttributeError: 'NoneType' object has no attribute 'step'

[Feature request] add stopnet delay argument to synthesis function (tacotron)

Sometimes the synthesis for some sentences is cut short at the last word. I know (think) that this is indicative of something being amiss in the model or the dataset: either it is not trained long enough, the audio parameters could be tuned further (trim_db?), or it is just dataset quality. But taking the time to fix that issue, debugging and training many models, is a luxury that some people can't afford (maybe even more so if it's a low-resource language).

I would gladly do a PR to propose the feature but I'm not sure how to go about the implementation.
Would adding a stopnet delay (delaying the stop signal by n steps) solve this issue?
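
To make the idea concrete, here is a rough, framework-agnostic sketch (the names and threshold are illustrative, not the 🐸TTS decoder API): the stop decision is simply postponed by a few frames after the stopnet first fires.

import numpy as np

def find_stop_frame(stop_probs, threshold=0.5, delay=5):
    # Index of the first frame where the stopnet fires, pushed back by `delay` frames.
    probs = np.asarray(stop_probs)
    triggered = np.where(probs > threshold)[0]
    if len(triggered) == 0:
        return len(probs)  # never fired: decode everything
    return min(triggered[0] + delay, len(probs))

print(find_stop_frame([0.1, 0.2, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9], delay=3))  # 5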

Please provide a German voice

The quality of the English examples is already very good. What's missing to be useful for me is a model for a German voice.

You can use this issue for related discussions and documenting progress creating it.

Support offline mobile

This repository is really great! Decent samples too.

I see a huge opportunity for this to be extended to support mobile.

There are a number of obstacles to this of course, including running on TF-Lite.

If you ported it to Dart you could transpile it to iOS and Android.

[Bug] TTS does not detect new model versions

Describe the bug
A clear and concise description of what the bug is.

After upgrading from tts-0.0.9 to tts-0.0.11 the model was updated, but TTS still tries to load the cached version.
A fix could be to hash the models and check whether the cached model matches the expected hash. This would also fix
cases where models were corrupted in any way.
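
A minimal sketch of that idea, assuming .models.json gained a (hypothetical) sha256 field per model entry:

import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cached_model_is_valid(model_path: Path, expected_sha256: str) -> bool:
    # Trigger a re-download when the cached file is missing or its hash differs.
    return model_path.exists() and file_sha256(model_path) == expected_sha256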

To Reproduce

$  /nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server --model_name tts_models/en/ljspeech/glow-tts --vocoder_name vocoder_models/universal/libri-tts/fullband-melgan
 > tts_models/en/ljspeech/glow-tts is already downloaded.
 > vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
 > Using model: glow_tts
Traceback (most recent call last):
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/.tts-server-wrapped", line 6, in <module>
    from TTS.server.server import main
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/server/server.py", line 62, in <module>
    synthesizer = Synthesizer(args.tts_checkpoint, args.tts_config, args.vocoder_checkpoint, args.vocoder_config, args.use_cuda)
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 45, in __init__
    self.load_tts(tts_checkpoint, tts_config,
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 95, in load_tts
    self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/tts/models/glow_tts.py", line 229, in load_checkpoint
    self.load_state_dict(state['model'])
  File "/nix/store/1vv0fsvdv9j4gmqjgjwb3c5v8x906qgd-python3.8-pytorch-1.8.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GlowTTS:
        size mismatch for encoder.emb.weight: copying a param with shape torch.Size([129, 192]) from checkpoint, the shape in current model is torch.Size([130, 192]).
/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server      4,64s user 0,71s system 117% cpu 4,541 total

Expected behavior

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): NixOS
  • PyTorch or TensorFlow version (use command below): pytorch 1.8.1
  • Python version: 3.7
  • CUDA/cuDNN version: no cuda
  • GPU model and memory: no gpu
  • Exact command to reproduce: see above

[Feature request] Test gpu memory capacity right from the start (training)

It would be cool to test whether the GPU memory is enough for the model/config/dataset combo right from the start, because it wastes time and money to begin training only to discover that it failed because of an OOM error.

I would suggest, for the first batch of the first epoch, using only the longest samples by duration/seq_length.
Or do a warm-up batch the same way, plus a loaded test batch.
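
Something like the following sketch could implement the warm-up batch; everything here (the "seq_length" key, the model returning a dict with "loss") is an assumption for illustration, not the TTS trainer API.

import torch

def gpu_memory_smoke_test(model, samples, collate_fn, batch_size, device="cuda"):
    # Build one worst-case batch from the longest samples and try a full step.
    longest = sorted(samples, key=lambda s: s["seq_length"], reverse=True)[:batch_size]
    batch = {k: v.to(device) for k, v in collate_fn(longest).items()}
    model = model.to(device)
    try:
        loss = model(**batch)["loss"]
        loss.backward()
    except RuntimeError as err:
        if "out of memory" in str(err).lower():
            raise RuntimeError(
                "GPU memory is not enough for the longest batch; "
                "lower batch_size or max_seq_len before training."
            ) from err
        raise
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()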

Configuring MelSpectrogram parameters for custom dataset

I've been trying to figure out a good configuration for training the Tacotron2 model. I'm not sure how to set the MelSpectrogram parameters accurately.

Specifically, how would I calculate the right values for mel_fmin and mel_fmax for my dataset?

Thanks!
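
There is no built-in utility for this as far as I know, but a rough heuristic is to look at where the average spectral energy of your recordings lives. The sketch below (librosa-based, thresholds arbitrary) prints candidate values, which you would still sanity-check against the speaker's pitch range:

import glob
import librosa
import numpy as np

def estimate_freq_band(wav_glob, sr=22050, n_fft=1024, energy_keep=0.99):
    # Average magnitude spectrum over the dataset.
    spectra = []
    for path in glob.glob(wav_glob):
        y, _ = librosa.load(path, sr=sr)
        spectra.append(np.abs(librosa.stft(y, n_fft=n_fft)).mean(axis=1))
    mean_spec = np.mean(spectra, axis=0)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    # Keep the band that contains `energy_keep` of the total energy.
    cum = np.cumsum(mean_spec) / mean_spec.sum()
    fmin = freqs[np.searchsorted(cum, 1.0 - energy_keep)]
    fmax = freqs[np.searchsorted(cum, energy_keep)]
    return fmin, fmax  # candidate mel_fmin / mel_fmax

print(estimate_freq_band("my_dataset/wavs/*.wav"))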

[Bug] Dependencies in 0.0.13.1

Describe the bug
With TTS==0.0.13.1

from TTS.utils.synthesizer import Synthesizer

  • ModuleNotFoundError: No module named 'numba.decorators'
    The error seems to come from librosa. Solution: pin dependency numba to 0.48. See librosa/librosa#1160
  • ModuleNotFoundError: No module named 'packaging'
    Solution: add a dependency on packaging

To Reproduce
Steps to reproduce the behavior:

  1. pip install TTS==0.0.13.1
  2. from TTS.utils.synthesizer import Synthesizer
  3. See error

Expected behavior
No Exceptions :)

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker python:3.8
  • PyTorch or TensorFlow version (use command below):
  • Python version: Python 3.8.9

Feature Request: Loss coincidence report for sample data

In https://discourse.mozilla.org/t/custom-voice-tts-not-learning/40897/5, @erogol mentioned that a way to weed out bad samples in the data is to run the training network on the data to see which have the highest loss. Is there any easy way to see this? I am taking the comment to mean that we'd need to narrow the training list to just a few files at a time, run training, and check the loss value; then repeat for each handful of sample files to see a pattern. If so, that could take quite some time. Unless there is a report or something that I'm not aware of?

As we all know, training data set quality is the biggest factor influencing training. So, anything we can do to flag sub-optimal training samples that the CheckDataset notebook otherwise doesn't flag would be ideal.

To that end, is there any opportunity for the model to track and spit out a coincidence report of files to the average loss with those files? In other words, what if the training process tracked the average loss value observed each time each file is in a batch. Over time, that could be used to drive a heatmap of which files happen to be coincident with higher loss. That way, users would quickly identify the outliers in the data set that are contributing most to the loss.
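
This isn't available out of the box as far as I know, but the bookkeeping itself is small; here is a sketch of what the trainer would need to collect per batch (names are made up for illustration):

from collections import defaultdict

class FileLossTracker:
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, file_names, batch_loss):
        # Attribute the same batch loss to every file that was in the batch.
        for name in file_names:
            self.totals[name] += float(batch_loss)
            self.counts[name] += 1

    def worst(self, top_k=20):
        # Files that are, on average, coincident with the highest loss.
        avg = {n: self.totals[n] / self.counts[n] for n in self.totals}
        return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)[:top_k]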

[Bug] KeyError: 'default_vocoder' for all hifigan_v2 models

Describe the bug
Running any of the HiFiGAN models fails with a KeyError for the default_vocoder.

To Reproduce
Steps to reproduce the behavior:

  1. Install from PyPI (or dev branch using pip)
  2. Run tts-server --list_models | grep hifigan to get the list of HiFiGAN models
  3. Attempt to run any of them, e.g. tts-server --use_cuda=true --model_name vocoder_models/en/sam/hifigan_v2
  4. See error:
Traceback (most recent call last):
  File "/home/gez/Projects/coqui-tts/.venv/bin/tts-server", line 6, in <module>
    from TTS.server.server import main
  File "/home/gez/Projects/coqui-tts/.venv/lib/python3.8/site-packages/TTS/server/server.py", line 86, in <module>
    args.vocoder_name = model_item["default_vocoder"] if args.vocoder_name is None else args.vocoder_name
KeyError: 'default_vocoder'

Expected behavior
Presumably model_item should always have a default_vocoder, or it should be checked and handled gracefully.

Environment (please complete the following information):

  • Python version: 3.8.3

RuntimeError: The expanded size of the tensor (12) must match the existing size (84) at non-singleton dimension 2. Target sizes: [64, 80, 12]. Tensor sizes: [64, 1, 84]

I have the same problem with v0.0.12:

 CUDA_VISIBLE_DEVICES="0" python ../../TTS/bin/train_tacotron.py --config_path model_config.json
 > Using CUDA:  True
 > Number of GPUs:  1
 > Git Hash: 59ab268
 > Experiment folder: /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > log_func:<ufunc 'log10'>
 | > exp_func:<function AudioProcessor.__init__.<locals>.<lambda> at 0x7f1ef7ac6c10>
 | > hop_length:256
 | > win_length:1024
 | > /tmp/tts/by_book/female/ezwa/monsieur_lecoq/metadata.csv
 | > Found 14211 files in /tmp/tts
 > Using model: Tacotron2

 > Model has 28183506 parameters
 > Starting with inf best loss.

 > DataLoader initialization
 | > Use phonemes: True
   | > phoneme language: fr-fr
 | > Number of instances : 14069
 | > Max length sequence: 281
 | > Min length sequence: 3
 | > Avg length sequence: 105.0826640130784
 | > Num. instances discarded by max-min (max=153, min=6) seq limits: 2420
 | > Batch group size: 128.

 > EPOCH: 0/1000

 > Number of output frames: 7

 > TRAINING (2021-04-23 15:38:18)
 ! Run is removed from /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
Traceback (most recent call last):
  File "../../TTS/bin/train_tacotron.py", line 744, in <module>
    main(args)
  File "../../TTS/bin/train_tacotron.py", line 704, in main
    train_avg_loss_dict, global_step = train(
  File "../../TTS/bin/train_tacotron.py", line 198, in train
    decoder_output, postnet_output, alignments, stop_tokens = model(
  File "/home/lpierron/miniconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lpierron/Mozilla_TTS/COQUI-TTS/TTS/TTS/tts/models/tacotron2.py", line 226, in forward
    decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (12) must match the existing size (84) at non-singleton dimension 2.  Target sizes: [64, 80, 12].  Tensor sizes: [64, 1, 84]

I downgraded to librosa==0.6.3 but it doesn't work.

See my configuration next:

model_config.json.txt

Originally posted by @lpierron in #370 (comment)

GravesAttention with Tacotron 1 yields empty alignment plots during training and throws an AttributeError during inference

I've trained a model using T1 with GST and GravesAttention. During training, all training and eval alignment plots have been empty (trained 80k+ steps). The model produced audio in the tensorboard; however, using the logic from one of the notebooks to evaluate a model and synthesize speech, it threw me the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx', referring to layers/tacotron.py --> 478 self.attention.init_win_idx(). I suspect that the Tacotron 1 model is not configured to use GravesAttention, because some of the methods used in layers/tacotron.py do not exist in the GravesAttention class.

Config:

{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {

"fft_size": 1024, 
"win_length": 1024, 
"hop_length": 256, 
"frame_length_ms": null, 
"frame_shift_ms": null, 


"sample_rate": 24000, 
"preemphasis": 0.0, 
"ref_level_db": 20, 


"do_trim_silence": true, 
"trim_db": 60, 


"power": 1.5, 
"griffin_lim_iters": 60, 


"num_mels": 80, 
"mel_fmin": 95.0, 
"mel_fmax": 12000.0, 
"spec_gain": 20,


"signal_norm": true, 
"min_level_db": -100, 
"symmetric_norm": true, 
"max_norm": 4.0, 
"clip_norm": true, 
"stats_path": null 

},

"distributed": {
"backend": "nccl",
"url": "tcp://localhost:54321"
},

"reinit_layers": [],

"batch_size": 128,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,

"loss_masking": false,
"decoder_loss_alpha": 0.5,
"postnet_loss_alpha": 0.25,
"postnet_diff_spec_alpha": 0.25,
"decoder_diff_spec_alpha": 0.25,
"decoder_ssim_alpha": 0.5,
"postnet_ssim_alpha": 0.25,
"ga_alpha": 5.0,
"stopnet_pos_weight": 15.0,

"run_eval": true,
"test_delay_epochs": 10,
"test_sentences_file": null,

"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 300000,
"lr": 0.0001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,

"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,

"attention_type": "graves",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": true,
"bidirectional_decoder": false,
"double_decoder_consistency": false,
"ddc_r": 7,

"stopnet": true,
"separate_stopnet": true,

"print_step": 25,
"tb_plot_step": 100,
"print_eval": false,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,

"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 8,
"num_val_loader_workers": 8,
"batch_group_size": 4,
"min_seq_len": 6,
"max_seq_len": 153,
"compute_input_seq_cache": false,
"use_noise_augment": true,

"output_path": "/home/big-boy/Models/Blizzard/",

"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",

"use_speaker_embedding": false,
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": {
"gst_style_input": null,
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},

"datasets":
[{
"name": "ljspeech",
"path": "/Data/blizzard2013/segmented/",
"meta_file_train": "metadata.csv",
"meta_file_val": null
}]
}

Alignment plots: [empty alignment plot images were attached to the original issue]

[Bug] fix windows support (audio lambda function)

I think we've established that Windows support has been broken since commit e0b3008.
I suspect that it's due to the exp/log function stored in the class.
I would suggest replacing the log_func in the constructor with an exp_log_base, and only storing that number in the class. Then I propose using math.log(x, exp_log_base) and exp_log_base**x, since np.log doesn't allow passing a base argument. But if we need np for speed, we can define a function:

def np_log_base(x, base):
    return np.log(x) / np.log(base)

Would that fix be OK?
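
For what it's worth, a quick check of that helper against the stdlib (assuming the np_log_base definition above):

import math
import numpy as np

x, base = 2.5, 10.0
assert np.isclose(np_log_base(x, base), math.log(x, base))  # log with a base, like np.log10 here
assert np.isclose(base ** np_log_base(x, base), x)          # base ** log(x) round-trips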

[Bug] Not able to install

pip install TTS

Failed to build TTS ERROR: Could not build wheels for TTS which use PEP 517 and cannot be installed directly

Adding more details about the state of my machine. I am on Windows 10.
I am using python 3.8.0 installed via pyenv. Here are my current package versions under python 3.8.0 environment:

pip list

Package Version


absl-py 0.12.0
appdirs 1.4.4
argon2-cffi 20.1.0
astor 0.8.1
astunparse 1.6.3
async-generator 1.10
attrs 20.3.0
audioread 2.1.9
backcall 0.2.0
bar-chart-race 0.1.0
black 20.8b1
bleach 1.5.0
cachetools 4.2.1
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 7.1.2
colorama 0.4.4
cycler 0.10.0
decorator 5.0.5
defusedxml 0.7.1
dill 0.3.3
entrypoints 0.3
flatbuffers 1.12
gast 0.3.3
google-auth 1.28.0
google-auth-oauthlib 0.4.4
google-pasta 0.2.0
grpcio 1.32.0
gTTS 2.2.2
h5py 2.10.0
html5lib 0.9999999
idna 2.10
inflect 5.3.0
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
jedi 0.18.0
Jinja2 2.11.3
joblib 1.0.1
jsonpatch 1.32
jsonpointer 2.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.13
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
librosa 0.8.0
llvmlite 0.36.0
Markdown 3.3.4
MarkupSafe 1.1.1
matplotlib 3.3.3
mistune 0.8.4
multiprocess 0.70.11.1
mypy-extensions 0.4.3
nbclient 0.5.3
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
notebook 6.3.0
numba 0.53.1
numpy 1.19.3
oauthlib 3.1.0
opt-einsum 3.3.0
packaging 20.9
pandas 1.2.3
pandocfilters 1.4.3
parso 0.8.2
pathspec 0.8.1
pep517 0.10.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.0.1
pooch 1.3.0
prometheus-client 0.10.0
prompt-toolkit 3.0.18
protobuf 3.15.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
Pygments 2.8.1
pynndescent 0.5.2
pyparsing 2.4.7
PyQt5 5.15.4
PyQt5-Qt5 5.15.2
PyQt5-sip 12.8.1
pyrsistent 0.17.3
python-dateutil 2.8.1
pytz 2021.1
pywin32 300
pywinpty 0.5.7
pyzmq 22.0.3
qtconsole 5.0.3
QtPy 1.9.0
regex 2021.4.4
requests 2.25.1
requests-oauthlib 1.3.0
resampy 0.2.2
rsa 4.7.2
scikit-learn 0.24.1
scipy 1.6.2
Send2Trash 1.5.0
setuptools 54.2.0
six 1.15.0
sounddevice 0.4.1
SoundFile 0.10.3.post1
tensorboard 1.14.0
tensorboard-plugin-wit 1.8.0
tensorflow 1.14.0
tensorflow-estimator 1.14.0
tensorflow-hub 0.11.0
tensorflow-tensorboard 1.5.1
termcolor 1.1.0
terminado 0.9.4
testpath 0.4.4
threadpoolctl 2.1.0
toml 0.10.2
torch 1.8.1+cpu
torchaudio 0.8.1
torchfile 0.1.0
torchvision 0.9.1+cpu
tornado 6.1
tqdm 4.60.0
traitlets 5.0.5
typed-ast 1.4.2
typing-extensions 3.7.4.3
umap-learn 0.5.1
Unidecode 1.2.0
urllib3 1.26.4
visdom 0.1.8.9
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
Werkzeug 1.0.1
wheel 0.36.2
widgetsnbextension 3.5.1
wrapt 1.12.1

Weird Training Plots and No Audio After 70K Steps - Taco1 GST Blizzard

Hey, I'm getting weird plots and no audio produced in the tensorboard examples after 70k Steps. I'm using a custom Blizzard dataset that I've already trained other models with that produced intelligible speech after 20k steps. The training has also stopped after 70K steps because of the NaN decoder_loss error. I'm using the #373 patched dev branch with the following config file:
{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {

    "fft_size": 1024, 
    "win_length": 1024, 
    "hop_length": 256, 
    "frame_length_ms": null, 
    "frame_shift_ms": null, 

    
    "sample_rate": 24000, 
    "preemphasis": 0.0, 
    "ref_level_db": 20, 

    
    "do_trim_silence": true, 
    "trim_db": 60, 

    
    "power": 1.5, 
    "griffin_lim_iters": 60, 

    
    "num_mels": 80, 
    "mel_fmin": 95.0, 
    "mel_fmax": 12000.0, 
    "spec_gain": 1,

    
    "signal_norm": true, 
    "min_level_db": -100, 
    "symmetric_norm": true, 
    "max_norm": 4.0, 
    "clip_norm": true, 
    "stats_path": null 
},


"distributed": {
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [], 


"batch_size": 128, 
"eval_batch_size": 16,
"r": 7, 
"gradual_training": [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32]
], 
"mixed_precision": true, 


"loss_masking": false, 
"decoder_loss_alpha": 0.5, 
"postnet_loss_alpha": 0.25, 
"postnet_diff_spec_alpha": 0.25, 
"decoder_diff_spec_alpha": 0.25, 
"decoder_ssim_alpha": 0.5, 
"postnet_ssim_alpha": 0.25, 
"ga_alpha": 5.0, 
"stopnet_pos_weight": 15.0, 



"run_eval": true,
"test_delay_epochs": 10, 
"test_sentences_file": null, 


"noam_schedule": false, 
"grad_clip": 1.0, 
"epochs": 300000, 
"lr": 0.0001, 
"wd": 0.000001, 
"warmup_steps": 4000, 
"seq_len_norm": false, 


"memory_size": -1, 
"prenet_type": "original", 
"prenet_dropout": true, 


"attention_type": "graves", 
"attention_heads": 4, 
"attention_norm": "sigmoid", 
"windowing": false, 
"use_forward_attn": false, 
"forward_attn_mask": false, 
"transition_agent": false, 
"location_attn": true, 
"bidirectional_decoder": false, 
"double_decoder_consistency": false, 
"ddc_r": 7, 


"stopnet": true, 
"separate_stopnet": true, 


"print_step": 25, 
"tb_plot_step": 100, 
"print_eval": false, 
"save_step": 5000, 
"checkpoint": true, 
"tb_model_param_stats": false, 


"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false, 
"num_loader_workers": 8, 
"num_val_loader_workers": 8, 
"batch_group_size": 4, 
"min_seq_len": 6, 
"max_seq_len": 153, 
"compute_input_seq_cache": false, 
"use_noise_augment": true,


"output_path": "/home/big-boy/Models/Blizzard/",


"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/", 
"use_phonemes": true, 
"phoneme_language": "en-us", 


"use_speaker_embedding": false, 
"use_gst": true, 
"use_external_speaker_embedding_file": false, 
"external_speaker_embedding_file": "../../speakers-vctk-en.json", 
"gst": { 
    "gst_style_input": null, 
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
},


"datasets": 
    [{
        "name": "ljspeech",
        "path": "/home/big-boy/Data/blizzard2013/segmented/",
        "meta_file_train": "metadata.csv", 
        "meta_file_val": null
    }]

}

[Screenshots of the training plots (from 2021-03-13 and 2021-03-04) were attached to the original issue]

French Tacotron2 DDC TTS model with HifiGan2 very noisy

I trained a French TTS model with Tacotron2 DDC from MAI-Labs. I'm using Coqui-TTS v.0.0.12.

I tried TTS with the vocoder vocoder_models--en--ljspeech--hifigan_v2, as downloaded from Coqui-TTS.

The resulting audio file is very noisy as you can hear: https://sndup.net/2t9d

You can find as a gist my config.json and vocoder_config.json: https://gist.github.com/lpierron/6c56302eb628ee6a86363daa08e5fa63

Any idea how to solve the noise problem?

I tried using another vocoder (the MelGAN one) and there is no noise, but the voice is hoarse, as you can hear: https://sndup.net/4jgn

[Bug] save_spectogram seems not implemented

Hello there!

Thanks for the project. I think the saving of raw spectrograms through --save_spectogram is not implemented, right? If so, maybe we can use this issue to track its development.

Tacotron model uses Tacotron2 losses

/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py:94: UserWarning: Using a target size (torch.Size([64, 90, 80])) that is different to the input size (torch.Size([64, 90, 513])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.l1_loss(input, target, reduction=self.reduction)
! Run is removed from /home/big-boy/Models/Blizzard/blizzard-gts-March-11-2021_05+38PM-45068a9
Traceback (most recent call last):
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 721, in
main(args)
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 180, in train
loss_dict = criterion(postnet_output, decoder_output, mel_input,
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 377, in forward
postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 203, in forward
return self.loss_func(x_diff, target_diff)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 94, in forward
return F.l1_loss(input, target, reduction=self.reduction)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/functional.py", line 2633, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/functional.py", line 71, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore
RuntimeError: The size of tensor a (513) must match the size of tensor b (80) at non-singleton dimension 2

Originally posted by @a-froghyar in #370 (comment)

Representative Samples

The README.md links to English Voice Samples, which claim to use an English DDC model; however, it is not identifiable in the model list:

tts --list_models
 Name format: type/language/dataset/model
 >: tts_models/en/ek1/tacotron2
 >: tts_models/en/ljspeech/glow-tts
 >: tts_models/en/ljspeech/tacotron2-DCA
 >: tts_models/en/ljspeech/speedy-speech-wn
 >: tts_models/es/mai/tacotron2-DDC
 >: tts_models/fr/mai/tacotron2-DDC
 >: tts_models/zh-CN/baker/tacotron2-DDC-GST
 >: tts_models/nl/mai/tacotron2-DDC
 >: tts_models/ru/ruslan/tacotron2-DDC
 >: vocoder_models/universal/libri-tts/wavegrad
 >: vocoder_models/universal/libri-tts/fullband-melgan
 >: vocoder_models/en/ek1/wavegrad
 >: vocoder_models/en/ljspeech/multiband-melgan
 >: vocoder_models/nl/mai/parallel-wavegan

Can the samples page be changed to one in this project that uses an available model?

[Bug] Training using --restore_path fails with a "param 'initial_lr' is not specified" message.

Describe the bug
Training using --restore_path fails with a "param 'initial_lr' is not specified" message.

To Reproduce
Steps to reproduce the behavior:

  1. python TTS/bin/train_tacotron.py --config_path config.json --restore_path best_model.pth.tar
  2. See error:
    File "TTS/bin/train_tacotron.py", line 664, in
    main(args)
    File "TTS/bin/train_tacotron.py", line 580, in main
    scheduler = NoamLR(optimizer,
    File "/root/TTS/TTS/utils/training.py", line 94, in init
    super(NoamLR, self).init(optimizer, last_epoch)
    File "/root/anaconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 39, in init
    raise KeyError("param 'initial_lr' is not specified "
    KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu Docker 18.04.4 LTS
  • PyTorch or TensorFlow version (use command below): torch=1.8.1+cu111
  • Python version: 3.8.3
  • CUDA/cuDNN version: 11
  • GPU model and memory: V100

🐸 TTS roadmap

These are the main dev plans for 🐸 TTS.

If you want to contribute to 🐸 TTS and don't know where to start you can pick one here and start with our Contribution Guideline. We're also always here to help.

Feel free to pick one or suggest a new one.

Contributions are always welcome 💪 .

v0.1.0 Milestones

  • Better model config handling #21
  • TTS recipes for public datasets.
  • TTS trainer API to unify all the model training scripts.
  • TTS, Vocoder and SpeakerEncoder model abstractions and APIs.
  • Documentation for
    • Implementing a new model using 🐸 TTS.
    • Training a model on a new dataset from gecko.
    • Using Synthesizer interface on CLI or Server.
    • Extracting Spectrograms for Vocoder training.
    • Contributing a new pre-trained 🐸 TTS model.
    • Explanation of model config parameters.

v0.2.0 Milestones

  • Grapheme 2 Phoneme in-house conversion. (Thx to gruut 👍 )
  • Implement VITS model.

v0.3.0 Milestones

  • Implement generic ForwardTTS API.
  • Implement Fast Speech model.
  • Implement Fast Pitch model.

v0.4.0 Milestones

v0.5.0 Milestones

  • Support for multi-lingual models
  • YourTTS release 🚀

v0.6.0 Milestones

  • Add ESpeak support
  • New Tokenizer and Phonemizer APIs #937
  • New Model API #1078
  • Splitting the trainer as a separate repo 👟Trainer
  • Update VITS model API
  • Gradient accumulation. #560 (in 👟)

v0.7.0 Milestones

v0.8.0 Milestones

  • Separate numpy transforms
  • Better data sampling for VITS
  • New Thorsten DE models 👑 @thorstenMueller

🏃‍♀️ Milestones along the way

  • Implement an end-to-end training API for ForwardTTS models + a vocoder. #1510
  • Implement a Python voice synthesis API.
  • Inject phonemes to the input text at inference. #1452
  • AdaSpeech1/2 https://arxiv.org/pdf/2104.09715 and https://arxiv.org/abs/2103.00993
  • Let the user pass a custom text cleaner function.
  • Refactor the text cleaners for a more flexible and transparent API.
  • Implement HifiGAN2 (not the vocoder)
  • Implement emotion and style adaptation.
  • Implement FastSpeech2 (https://arxiv.org/abs/2006.04558).
  • AutoTTS 🤖 (👑 @loganhart420)
  • Watermarking TTS outputs to sign against DeepFakes.
  • Implement SSML v0.0.1
  • ONNX and TorchScript model exports.
  • TensorFlow run-time for training models.

🤖 New TTS models

TypeError: expected str, bytes or os.PathLike object, not bool "Coqui-TTS-torchhub-example.ipynb"

When I try to run Coqui-TTS-torchhub-example.ipynb on Colab I get this error:
"""
Downloading: "https://github.com/coqui-ai/TTS/archive/dev.zip" to /root/.cache/torch/hub/dev.zip

Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--tacotron2-DCA
Downloading model to /root/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan
Using model: Tacotron2


TypeError Traceback (most recent call last)
in ()
3 synthesizer = torch.hub.load('coqui-ai/TTS:dev',
4 'tts',
----> 5 source='github')
6 wav = synthesizer.tts("TTS is an open-source library that generates synthethic speech!")

6 frames
/usr/local/lib/python3.7/dist-packages/torch/hub.py in load(repo_or_dir, model, *args, **kwargs)
337 repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose)
338
--> 339 model = _load_local(repo_or_dir, model, *args, **kwargs)
340 return model
341

/usr/local/lib/python3.7/dist-packages/torch/hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
366
367 entry = _load_entry_from_hubconf(hub_module, model)
--> 368 model = entry(*args, **kwargs)
369
370 sys.path.remove(hubconf_dir)

/root/.cache/torch/hub/coqui-ai_TTS_dev/hubconf.py in tts(model_name, vocoder_name, use_cuda)
29
30 # create synthesizer
---> 31 synt = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, use_cuda)
32 return synt
33

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in init(self, tts_checkpoint, tts_config_path, tts_speakers_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, use_cuda)
73 self.output_sample_rate = self.tts_config.audio["sample_rate"]
74 if vocoder_checkpoint:
---> 75 self._load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
76 self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
77

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in _load_vocoder(self, model_file, model_config, use_cuda)
151 use_cuda (bool): enable/disable CUDA use.
152 """
--> 153 self.vocoder_config = load_config(model_config)
154 self.vocoder_ap = AudioProcessor(verbose=False, **self.vocoder_config["audio"])
155 self.vocoder_model = setup_generator(self.vocoder_config)

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/io.py in load_config(config_path)
43 config = AttrDict()
44
---> 45 ext = os.path.splitext(config_path)[1]
46 if ext in (".yml", ".yaml"):
47 with open(config_path, "r", encoding="utf-8") as f:

/usr/lib/python3.7/posixpath.py in splitext(p)
120
121 def splitext(p):
--> 122 p = os.fspath(p)
123 if isinstance(p, bytes):
124 sep = b'/'

TypeError: expected str, bytes or os.PathLike object, not bool
"""
Can anyone tell me what is happening?
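The TypeError shows that `load_config` receives a boolean instead of a vocoder config path, so the failure happens while the hub entry point assembles the `Synthesizer`, not in the notebook code itself. As a hedged alternative (not a fix for the torch.hub path), recent pip-installed 🐸TTS releases expose a high-level Python API that loads the same model and its default vocoder without going through torch.hub:

```python
# Hedged alternative sketch, assuming a recent `pip install TTS` where the
# high-level Python API is available; it loads the same Tacotron2-DCA model
# and its default vocoder without the torch.hub entry point.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DCA")
tts.tts_to_file(
    text="TTS is an open-source library that generates synthetic speech!",
    file_path="output.wav",
)
```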

Training WaveGrad Immediately Fails with "ValueError: The histogram is empty, please file a bug report."

With dc2954e checked out, when I try to train WaveGrad with the following config:

{
    "run_name": "wavegrad-my-project",
    "run_description": "wavegrad test",

    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectogram frame.
        "win_length": 1024,      // stft window length in ms.
        "hop_length": 256,       // stft window hop-lengh in ms.
        "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": false,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0,         // scaler value appplied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "stats_path": "/path/to/my/project/scale_stats.npy"    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": true,     // enable torch mixed precision training (true, false)
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss",  // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "use_weight_norm": true,
        "y_conv_channels":32,
        "x_conv_channels":768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]]
    },

    // DATASET
    "data_path": "/path/to/my/project/wavs/22.05k_edited_normalized",  // root data path. It finds all wav files recursively from there.
    "feature_path": null,   // if you use precomputed features
    "seq_len": 6144,        // 24 * hop_length
    "pad_short": 0,      // additional padding for short wavs
    "conv_pad": 0,          // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false,     // add noise to the audio signal for augmentation
    "use_cache": false,      // use in memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 96,      // Batch size for training.

    // NOISE SCHEDULE PARAMS - Only effective at training time.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true,       // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 10000,                // total number of epochs to train.
    "clip_grad": 1.0,                 // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR",  // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4,                  // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 50,       // Number of steps to log traning on console.
    "print_eval": false,     // If True, it prints loss values for each step in eval run.
    "save_step": 5000,      // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "keep_all_best": false,  // If true, keeps all best_models after keep_after steps
    "keep_after": 10000,    // Global step after which to keep best models if keep_all_best is true
    "tb_model_param_stats": true,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "eval_split_size": 256,

    // PATHS
    "output_path": "/path/to/my/project/Models"
}

... it fails immediately with this error:

 > TRAINING (2021-03-28 19:47:57)

   --> TRAIN PERFORMACE -- EPOCH TIME: 8.09 sec -- GLOBAL_STEP: 1
     | > avg_wavegrad_loss: 1.47542
     | > avg_loader_time: 16.76300
     | > avg_step_time: 8.08550

coqui-tts\lib\site-packages\torch\optim\lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[WARNING] NaN or Inf found in input tensor.
 ! Run is removed from D:/path/to/my/project
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
    _, global_step = train(model, criterion, optimizer, scheduler, scaler,
  File "./TTS/bin/train_vocoder_wavegrad.py", line 223, in train
    tb_logger.tb_model_weights(model, global_step)
  File "coqui-tts\TTS\utils\tensorboard_logger.py", line 34, in tb_model_weights
    self.writer.add_histogram(
  File "coqui-tts\lib\site-packages\tensorboardX\writer.py", line 503, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 210, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 248, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
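The `[WARNING] NaN or Inf found in input tensor.` line right before the crash suggests the model parameters already went non-finite after the first step, so tensorboardX is asked to histogram a tensor with no finite values. A quick, hedged workaround is to set `"tb_model_param_stats": false` in the config above so the weight-logging path is skipped; a defensive version of that logging would look roughly like the sketch below (an illustration, not the project's actual fix):

```python
# Hedged sketch: skip parameter tensors with no finite values before handing
# them to tensorboardX, which otherwise raises "The histogram is empty".
import numpy as np

def log_model_weights_safely(writer, model, global_step):
    for name, param in model.named_parameters():
        values = param.detach().cpu().numpy().ravel()
        finite = values[np.isfinite(values)]
        if finite.size == 0:
            continue  # all NaN/Inf; histogramming would crash
        writer.add_histogram(f"model_weights/{name}", finite, global_step)
```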

[Bug] Skipping part of a sentence

Description
Using tts, it skipped part of a sentence. For example, for 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' it skipped the last part "and the distribution of the costs and benefits across different segments of society".
However, if a period is inserted before the part that was skipped, the complete text is synthesized.

To Reproduce
Single sentence:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' --out_path output.wav

Split by a period:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits. And the distribution of the costs and benefits across different segments of society.' --out_path output.wav

Expected behavior
The long sentences should be fully synthesized.
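Until the underlying attention failure is fixed, a hedged workaround is to split the text into sentences yourself and concatenate the per-sentence audio, so a single long sentence cannot silently drop its tail. A minimal sketch, assuming a recent pip-installed 🐸TTS with the Python API, the LJSpeech 22050 Hz sample rate, and `soundfile` for writing the result (the model name is only an example):

```python
# Hedged workaround sketch: synthesize sentence by sentence and concatenate.
import re
import numpy as np
import soundfile as sf
from TTS.api import TTS

text = (
    "But despite well-established practices for rigorous estimation of the pros and cons "
    "of policies, there is room to improve, particularly in characterizing "
    "difficult-to-measure benefits and the distribution of the costs and benefits across "
    "different segments of society."
)

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
wav = np.concatenate([np.asarray(tts.tts(s)) for s in sentences])
sf.write("output.wav", wav, 22050)  # LJSpeech models output 22050 Hz audio
```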

Crashing while saving checkpoint

Hi,

I am trying to train a Tacotron2 model in Hindi. I have my own cleaned, 25-hour, single-speaker dataset. I'm using the following configuration.

{
"model": "Tacotron2",
"run_name": "hindi-ddc",
"run_description": "tacotron2 with DDC and differential spectral loss.",

// AUDIO PARAMETERS
"audio":{
    // stft parameters
    "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectogram frame.
    "win_length": 1024,      // stft window length in ms.
    "hop_length": 256,       // stft window hop-lengh in ms.
    "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate.
    "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (true), TWEB (false), Nancy (true)
    "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

    // Griffin-Lim
    "power": 1.5,           // value to sharpen wav signals after GL algorithm.
    "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 1,

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},

// VOCABULARY PARAMETERS
// if custom character set is not defined,
// default set in symbols.py is used
"characters":{
    "pad": "_",
    "eos": "~",
    "bos": "^",
    "characters": "अआइईउऊऋएऐऑओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहह़ा",
    "punctuations":"!'\",.:?। ",
    "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
},


// DISTRIBUTED TRAINING
"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

// TRAINING
"batch_size": 32,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"eval_batch_size":16,
"r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
"mixed_precision": true,     // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.

// LOSS SETTINGS
"loss_masking": true,       // enable / disable loss masking against the sequence padding.
"decoder_loss_alpha": 0.5,  // original decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
"postnet_diff_spec_alpha": 0.25,     // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.25,     // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.5,     // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.25,     // postnet ssim loss weight. If > 0, it is enabled
"ga_alpha": 5.0,           // weight for guided attention loss. If > 0, guided attention is enabled.
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.


// VALIDATION
"run_eval": true,
"test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

// OPTIMIZER
"noam_schedule": false,        // use noam warmup and lr schedule.
"grad_clip": 1.0,              // upper limit for gradients for clipping.
"epochs": 1000,                // total number of epochs to train.
"lr": 0.0001,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
"wd": 0.000001,                // Weight decay weight.
"warmup_steps": 4000,          // Noam decay steps to increase the learning rate from 0 to "lr"
"seq_len_norm": false,         // Normalize eash sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.

// TACOTRON PRENET
"memory_size": -1,             // ONLY TACOTRON - size of the memory queue used fro storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
"prenet_type": "original",     // "original" or "bn".
"prenet_dropout": false,       // enable/disable dropout at prenet.

// TACOTRON ATTENTION
"attention_type": "original",  // 'original' , 'graves', 'dynamic_convolution'
"attention_heads": 4,          // number of attention heads (only for 'graves')
"attention_norm": "sigmoid",   // softmax or sigmoid.
"windowing": false,            // Enables attention windowing. Used only in eval mode.
"use_forward_attn": false,     // if it uses forward attention. In general, it aligns faster.
"forward_attn_mask": false,    // Additional masking forcing monotonicity only in eval mode.
"transition_agent": false,     // enable/disable transition agent of forward attention.
"location_attn": true,         // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false,  // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.
"double_decoder_consistency": true,  // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
"ddc_r": 7,                           // reduction rate for coarse decoder.

// STOPNET
"stopnet": true,               // Train stopnet predicting the end of synthesis.
"separate_stopnet": true,      // Train stopnet seperately if 'stopnet==true'. It prevents stopnet loss to influence the rest of the model. It causes a better model, but it trains SLOWER.

// TENSORBOARD and LOGGING
"print_step": 25,       // Number of steps to log training on console.
"tb_plot_step": 100,    // Number of steps to plot TB training figures.
"print_eval": false,     // If True, it prints intermediate loss values in evalulation.
"save_step": 200,      // Number of training steps expected to save traninpg stats and checkpoints.
"checkpoint": true,     // If true, it saves checkpoints per "save_step"
"keep_all_best": false,  // If true, keeps all best_models after keep_after steps
"keep_after": 10000,    // Global step after which to keep best models if keep_all_best is true
"tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

// DATA LOADING
"text_cleaner": "basic_cleaners",
"enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
"num_loader_workers": 2,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 2,    // number of evaluation data loader processes.
"batch_group_size": 4,  //Number of batches to shuffle after bucketing.
"min_seq_len": 81,       // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 186,     // DATASET-RELATED: maximum text length
"compute_input_seq_cache": false,  // if true, text sequences are computed before starting training. If phonemes are enabled, they are also computed at this stage.
"use_noise_augment": true,

// PATHS
"output_path": "/home/ubuntu/output/",

// PHONEMES
"phoneme_cache_path": "/home/ubuntu/phoneme_cache/",  // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": false,           // use phonemes instead of raw characters. It is suggested for better pronounciation.
"phoneme_language": "hi",     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages

// MULTI-SPEAKER and GST
"use_speaker_embedding": false,      // use speaker embedding to enable multi-speaker learning.
"use_gst": false,       			    // use global style tokens
"use_external_speaker_embedding_file": false, // if true, forces the model to use external embedding per sample instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs /1806.04558
"external_speaker_embedding_file": "../../speakers-vctk-en.json", // if not null and use_external_speaker_embedding_file is true, it is used to load a specific embedding file and thus uses these embeddings instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs /1806.04558
"gst":	{			                // gst parameter if gst is enabled
    "gst_style_input": null,        // Condition the style input either on a
                                    // -> wave file [path to wave] or
                                    // -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
                                    // with the dictionary being len(dict) <= len(gst_style_tokens).
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
},

// DATASETS
"datasets":   // List of datasets. They all merged and they get different speaker_ids.
    [
        {
            "name": "hindi",
            "path": "/dev/data/hindidataset/",
            "meta_file_train": "metadata.csv", // for vtck if list, ignore speakers id in list for train, its useful for test cloning with new speakers
            "meta_file_val": null
        }
    ]

}

--

The stacktrace I'm hitting is below.

CHECKPOINT : /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69/checkpoint_200.pth.tar
/home/ubuntu/TTS/TTS/utils/audio.py:234: RuntimeWarning: overflow encountered in power
return np.power(10.0, x / self.spec_gain)
! Run is kept in /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere

--

I've been trying to debug this for two days but haven't been able to make progress. I'd really appreciate any help or suggestions.

Crash while saving checkpoint : "Audio buffer is not finite everywhere"

Hi,

First of all, thanks for all this great code!

Now, I'm training a new Tacotron2 using a Hindi dataset - 25 hours, 12,000 audio files, single speaker, not noisy, silences trimmed.

At 10000 global steps, when the model tries to save the checkpoint, it crashes with the message "Audio buffer is not finite everywhere". I've been trying to tweak the config parameters, but to no avail.

Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere

I'd really appreciate any hints to what might be causing this.
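The `RuntimeWarning: overflow encountered in power` in the earlier report points at the db-to-amplitude conversion blowing up to inf, after which librosa's `valid_audio` check rejects the buffer during the checkpoint-time Griffin-Lim call. Below is a hedged diagnostic wrapper (an assumption about the failure mode, not the project's code) that lets the checkpoint audio logging survive while the underlying instability is tracked down:

```python
# Hedged sketch: clamp the spectrogram and scrub non-finite samples around
# Griffin-Lim so checkpoint-time audio logging does not crash the run.
# `ap` is the TTS AudioProcessor instance; the clamp value is a placeholder.
import numpy as np

def safe_inv_melspectrogram(ap, mel_spec, max_value=100.0):
    mel_spec = np.nan_to_num(mel_spec, nan=0.0, posinf=max_value, neginf=-max_value)
    wav = ap.inv_melspectrogram(np.clip(mel_spec, -max_value, max_value))
    return np.nan_to_num(wav, nan=0.0, posinf=0.0, neginf=0.0)
```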

RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]

Hey,

I'm trying to train Tacotron 1 with GST, and I get the error already on the first batch.

Pytorch version: 1.8 and 1.7.1 (both yielded the same error)
Python version: 3.8.0

Traceback (most recent call last):
  File "TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "TTS/bin/train_tacotron.py", line 619, in main
    train_avg_loss_dict, global_step = train(train_loader, model,
  File "TTS/bin/train_tacotron.py", line 168, in train
    decoder_output, postnet_output, alignments, stop_tokens = model(
  File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/big-boy/projects/TTS/TTS/tts/models/tacotron.py", line 173, in forward
    decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]

My hyperparams:
// TRAINING
"batch_size": 64,
"eval_batch_size": 16,
"r": 4,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,

// MULTI-SPEAKER and GST
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": { // gst parameter if gst is enabled
"gst_style_input": null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},

'avg_align_error' does not change on Tensorboard.

'avg_align_error' does not change at validation for non-Tacotron models on Tensorboard.

This is probably because the attention maps are binary for these models and the alignment error does not work correctly with them.


Vocoder training fails on Windows versions of Python

Trying to train a WaveGrad Vocoder on Python 3.8.8 for Windows 10 yields this error:

Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 417, in main
    best_loss = save_best_model(
  File "TTS\vocoder\utils\io.py", line 97, in save_best_model
    os.symlink(best_model_name, os.path.join(out_path, link_name))
OSError: [WinError 1314] A required privilege is not held by the client: 'best_model_1.pth.tar' -> 'c:/path/to/my/project/best_model.pth.tar'

This is because, by default, Windows only lets administrators create symlinks.
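A common cross-platform pattern (a hedged sketch of what a fix could look like, not necessarily how the project resolved it) is to fall back to copying the checkpoint when symlink creation is not permitted, as it isn't for unprivileged Windows users by default:

```python
# Hedged sketch: try a symlink first, copy the file if the OS refuses.
import os
import shutil

def link_or_copy_best_model(best_model_name, out_path, link_name="best_model.pth.tar"):
    src = os.path.join(out_path, best_model_name)
    dst = os.path.join(out_path, link_name)
    if os.path.lexists(dst):
        os.remove(dst)
    try:
        os.symlink(best_model_name, dst)   # same call that fails on Windows
    except (OSError, NotImplementedError):
        shutil.copyfile(src, dst)          # fallback for unprivileged users
```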

--speaker_wav leads to AttributeError: 'NoneType' object has no attribute 'load_wav'

input:

tts --text 'Hello world!'  --out_path out/out_1.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_wav 28.wav

output/error

 > tts_models/en/vctk/sc-glow-tts is already downloaded.
 > vocoder_models/en/vctk/hifigan_v2 is already downloaded.
Loading speakers ...
 > Using model: glow_tts
 > Generator Model: hifigan_generator
Removing weight norm...
 > Text: Hello world!
 > Text splitted to sentences.
['Hello world!']
Traceback (most recent call last):
  File "/home/user/.local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 257, in main
    wav = synthesizer.tts(args.text, args.speaker_idx, args.speaker_wav)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 220, in tts
    speaker_embedding = self.speaker_manager.compute_x_vector_from_clip(speaker_wav)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 241, in compute_x_vector_from_clip
    x_vector = _compute(wf)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 228, in _compute
    waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)
AttributeError: 'NoneType' object has no attribute 'load_wav'

when reaching this line:

waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)

self.speaker_encoder_ap is None for me, so it seems the speaker encoder was never initialized.

The wav file I'm supplying is a 22050 Hz mono file, and its path is correct.

I'm running version 0.13.

This works without a problem:

tts --text 'Hello world!'  --out_path out/out21.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_idx p245
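For reference, `self.speaker_encoder_ap` is only set when a speaker encoder is actually loaded, which is what `--speaker_wav` needs in order to compute an x-vector from the clip. A hedged sketch of driving the same pipeline from Python with the encoder supplied explicitly; the keyword names mirror the `Synthesizer.__init__` signature visible in the traceback above, and every path below is a placeholder pointing at downloaded model files:

```python
# Hedged sketch, all paths are placeholders; keyword names follow the
# Synthesizer.__init__ signature shown in the traceback above.
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/sc-glow-tts/model_file.pth.tar",
    tts_config_path="path/to/sc-glow-tts/config.json",
    tts_speakers_file="path/to/sc-glow-tts/speakers.json",
    vocoder_checkpoint="path/to/hifigan_v2/model_file.pth.tar",
    vocoder_config="path/to/hifigan_v2/config.json",
    encoder_checkpoint="path/to/speaker_encoder/model_file.pth.tar",  # required for --speaker_wav
    encoder_config="path/to/speaker_encoder/config.json",
    use_cuda=False,
)
# Same call order as TTS/bin/synthesize.py: (text, speaker_idx, speaker_wav)
wav = synthesizer.tts("Hello world!", None, "28.wav")
```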

In Random Window Discriminator, feats is an empty list

In the Random Window Discriminator, the feats list is defined but never updated. Is this by design?

    def forward(self, x, c):
        scores = []
        feats = []
        # unconditional pass
        for (window_size, layer) in zip(self.window_sizes,
                                        self.unconditional_discriminators):
            index = np.random.randint(x.shape[-1] - window_size)

            score = layer(x[:, :, index:index + window_size])
            scores.append(score)

        # conditional pass
        for (window_size, layer) in zip(self.window_sizes,
                                        self.conditional_discriminators):
            frame_size = window_size // self.hop_length
            lc_index = np.random.randint(c.shape[-1] - frame_size)
            sample_index = lc_index * self.hop_length
            x_sub = x[:, :,
                      sample_index:(lc_index + frame_size) * self.hop_length]
            c_sub = c[:, :, lc_index:lc_index + frame_size]

            score = layer(x_sub, c_sub)
            scores.append(score)
        return scores, feats

Also, thank you for this project. It's awesome :)
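For comparison, here is a hypothetical, self-contained illustration (not the repository's code) of how a discriminator usually fills such a `feats` list when a feature-matching loss is intended: every intermediate activation is collected layer by layer and returned alongside the score.

```python
# Hypothetical illustration of feature collection for a feature-matching loss.
import torch
import torch.nn as nn

class TinyDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Conv1d(1, 16, kernel_size=3, padding=1),
            nn.Conv1d(16, 16, kernel_size=3, padding=1),
            nn.Conv1d(16, 1, kernel_size=3, padding=1),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = torch.relu(block(x))
            feats.append(x)            # kept for the feature-matching loss
        score = x.mean(dim=(1, 2))
        return score, feats

# usage: score, feats = TinyDiscriminator()(torch.randn(4, 1, 256))
```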
