
styletts2's People

Contributors

ameerazam08, astricks, awas666, danielmsu, devidw, eltociear, fakerybakery, haoqi, kmn1024, phields, yl4579


styletts2's Issues

text_encoder and text_aligner are not optimized in train_second.py

I'm trying to reconcile a difference I'm seeing between the paper and the code. Figure 1a says, "joint training then follows to optimize all components except the pitch extractor". When I look at train_second.py, though, I see two components that are not being optimized: text_encoder and text_aligner. These two components are used only in no_grad contexts in train_second.py, which makes sense to me, but I wanted to check that this is correct. Thanks!

(Question) Max Length and datasets

A lot of useful information is found in the other (closed) issues, but these questions come to mind.

  • How does max_len impact the training/finetuning process exactly?

In the LJSpeech dataset, there are audio files with durations far longer than the max_len: 400 (about 5 seconds) specified in the example config file. Many files are 10 seconds long, and a great majority are longer than 5 seconds. They are also included in train_list.txt. Was this intentional? (See the frame-to-seconds sketch after this list.)

  • Are audio files truncated once the maximum number of frames is reached?

Should the datasets be carefully edited so that audio files do not exceed the maximum duration set in the config file? Is there a detrimental effect on adherence to punctuation or spelling when the model only sees short or clipped speech?

  • Is there a maximum permissible length / does the architecture impose restrictions? Could max_len be set to something like 1200 and thus make full use of long audio files? (Ignoring the VRAM requirements in the current DP implementation)
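
For reference, the arithmetic behind "400 frames ≈ 5 seconds" can be made explicit. A minimal sketch, assuming the 24 kHz sample rate and 300-sample hop length used for the LJSpeech mel spectrograms (check meldataset.py and the config for the actual values):

def frames_to_seconds(n_frames: int, sr: int = 24000, hop_length: int = 300) -> float:
    """Convert a mel-frame count (as used by max_len) into seconds of audio."""
    return n_frames * hop_length / sr

print(frames_to_seconds(400))   # 5.0  -> max_len: 400 is about 5 seconds
print(frames_to_seconds(1200))  # 15.0 -> a hypothetical max_len of 1200 is about 15 seconds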

Licensing issue

Hi,
Hi,
This package uses the phonemizer library, which is GPL licensed (because it depends on espeak-ng by Jonathan Duddington, and nothing has been heard from him for years). That means all software that uses it must also be GPL licensed. Might it be possible to switch to an alternate library (preferably deep_phonemizer or g2p_en)? Thanks!
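
For illustration, here is roughly what calling g2p_en looks like; note that it outputs ARPAbet tokens rather than the IPA symbols StyleTTS 2's text cleaner expects, so an extra mapping step would still be needed (this is a sketch of the alternative library, not of an integration):

from g2p_en import G2p

g2p = G2p()
phonemes = g2p("StyleTTS 2 is a text-to-speech model.")
print(phonemes)  # a list of ARPAbet tokens such as 'S', 'T', 'AY1', 'L', ...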

Laugh, chuckle, giggling, 'Uh-huh' (nodding sound), etc.

I'm wondering how to extend this excellent work to synthesize more types of expressions, like those in the title.

For laughs, just using the text "Haha" works, but it often sounds like a fake laugh, and I have no idea what text makes a giggle sound; it might not even have an IPA representation.
I guess a better approach would be to add some VQS symbols, add some appropriate training data that includes the extra VQS symbols, and hack the phonemizer to allow something like "*giggle* That's funny!" to be phonemized by using a custom dictionary for the giggle part and espeak for the rest. Finally, retrain the ASR with an expanded vocabulary. Is this a good approach?

If the vocabulary expands, then would I need to retrain the ASR from scratch or can it be finetuned?

Alternatively, sampling a laughing, giggling, etc. style might work, but I think there would be less fine-grained speech control than with expanding the vocabulary, especially if laughing and giggling appear in the same sentence, wouldn't there?
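
To make the pre-phonemization idea above concrete, here is a rough sketch; the *giggle*/*laugh* tokens and their phoneme strings are made-up placeholders, and only phonemizer.phonemize is the library's real API:

import re
from phonemizer import phonemize

# Hypothetical VQS-style tokens mapped to placeholder phoneme strings.
CUSTOM_DICT = {
    "*giggle*": "ɡɪɡəl",
    "*laugh*": "hɑhɑ",
}

def phonemize_with_tokens(text: str) -> str:
    # Handle custom tokens with the custom dictionary; phonemize the rest with espeak.
    parts = re.split(r"(\*\w+\*)", text)
    out = []
    for part in parts:
        if part in CUSTOM_DICT:
            out.append(CUSTOM_DICT[part])
        elif part.strip():
            out.append(phonemize(part, language="en-us", backend="espeak", strip=True))
    return " ".join(out)

print(phonemize_with_tokens("*giggle* That's funny!"))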

Very high memory usage when training Stage 1

Hello,
Thanks for the great work.
I'm trying to train a model on my dataset using an A5000 (24 GB VRAM). I kept getting OOM errors at the beginning of Stage 1. After repeatedly reducing the batch size, training finally ran with a batch size of 4.
Is this normal? What hardware were you using?
Thanks!

Continuing after an interruption

Hi and thank you for the excellent work you're providing in this repository! It's much appreciated.
I have a question about running this project on Google Colab with a single GPU on the free plan: is there a way to continue from where the first or second stage stopped when I (or Google) interrupt the notebook run?

Audio streaming

Hi,
Can we stream audio as it is being generated for longer texts?
Thank you!
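
The repo has no built-in streaming mode, but a common workaround is to split long text into sentences and synthesize them one at a time; a rough sketch, where synthesize() is a placeholder for whatever inference function you use (e.g. the one in the demo notebooks):

import nltk

def stream_tts(text, synthesize):
    """Yield audio chunks sentence by sentence instead of waiting for the full text."""
    nltk.download("punkt", quiet=True)
    for sentence in nltk.sent_tokenize(text):
        yield synthesize(sentence)  # placeholder: returns an array of audio samples

# for chunk in stream_tts(long_text, synthesize):
#     play_or_send(chunk)  # hypothetical audio sink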

Possibly misleading license info

This repo claims to use an MIT license, but there are additional license requirements buried in the readme file:

Before using these models, you agree to inform the listeners that the speech samples are synthesized by StyleTTS 2 models, unless you have the permission to use the voice you synthesize. That is, when using it for voice cloning, you also agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public.

This is very misleading because someone simply checking the license file before using the repo would make the assumption that only the MIT license requirements apply.

I understand that the intention is probably to license the code in the repo as MIT, while having additional license requirements for the pre-trained models. However, because the only apparent way to get the models is a Google Drive link contained in the repo, it still seems misleading. Additionally, the wording above suggests that this license requirement applies not only to the pre-trained models but also to any StyleTTS 2 models created using the code in the repo.

IMO, any additional license requirements for using the code in the repo as intended should be mentioned in the repo license file.

If I am misunderstanding something, I take no offense to you simply closing this issue without comment.

Epoch-based loss_param values for LibriTTS

The paper describes reducing the total training epochs for stage 1 (100 -> 30) and for stage 2 (60 -> 25) when moving from LJSpeech to LibriTTS. I'm wondering about the epoch-based loss_params like TMA_epoch, diff_epoch, and joint_epoch. How were those changed for LibriTTS?

Finetuning and dataset preparation

First of all, is it already possible to fine-tune a single-speaker model?
If so, what should one pay attention to?

Second:
How do you prepare a dataset?
train and val are pretty clear, but the OOD_texts confuse me a little; how do I get those?

inference

I used the LJSpeech pre-trained model you provided for inference and found that directly using the Inference_LJSpeech.ipynb file under the /Demo directory works well. However, if I first use the compute_style function to compute the style of an audio file (from the LJSpeech dataset) and then combine it, the result is slightly worse. I want to ask why?

Why can styletts2 get emotions?

Hello, I want to know why styletts2 can produce emotions. How does it know what emotion a sentence should have? I see no difference between neutral speech synthesis and multi-emotional speech synthesis in the inference code.

questions on training

a) When loading a checkpoint for train_second.py, apart from the nets for mpd, msd, and wd, I saw the need to add the prefix "module." to make the key names compatible. Is this expected?

diff --git a/models.py b/models.py
index 99b4f3d..fc03a8b 100644
--- a/models.py
+++ b/models.py

+from collections import OrderedDict
 
 class LearnedDownSample(nn.Module):
     def __init__(self, layer_type, dim_in):
@@ -697,9 +700,19 @@ def load_checkpoint(model, optimizer, path, load_only_params=True, ignore_module
     state = torch.load(path, map_location='cpu')
     params = state['net']
     for key in model:
+        new_state_dict = OrderedDict()
+        for k, v in params[key].items():
+            name = 'module.' + k # add `module.`
+            new_state_dict[name] = v
+
+        if key in ['mpd', 'msd', 'wd'] : 
+            new_state_dict = params[key]    
+
         if key in params and key not in ignore_modules:
             print('%s loaded' % key)
-            model[key].load_state_dict(params[key])
+            #model[key].load_state_dict(params[key])
+            model[key].load_state_dict(new_state_dict)
+
     _ = [model[key].eval() for key in model]
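
For reference, a generic helper for this kind of "module." prefix mismatch (plain PyTorch, independent of the repo's load_checkpoint; the function name is made up) could look like this:

from collections import OrderedDict

def match_module_prefix(state_dict, module):
    """Add or strip the 'module.' prefix so checkpoint keys match the target module."""
    target_keys = set(module.state_dict().keys())
    if set(state_dict.keys()) == target_keys:
        return state_dict
    fixed = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module.") and k[len("module."):] in target_keys:
            fixed[k[len("module."):]] = v
        elif "module." + k in target_keys:
            fixed["module." + k] = v
        else:
            fixed[k] = v
    return fixed

# usage inside load_checkpoint:
# model[key].load_state_dict(match_module_prefix(params[key], model[key]))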

b) Passing a pre-trained model to train_second.py:
after training the first stage for around 50 epochs, I defined the params first_stage_path and pretrained_model.
Is anything more needed?

diff --git a/Configs/config.yml b/Configs/config.yml
index b74b8ee..2a9f93c 100644
--- a/Configs/config.yml
+++ b/Configs/config.yml
@@ -1,13 +1,13 @@
 log_dir: "Models/LJSpeech"
-first_stage_path: "first_stage.pth"
+first_stage_path: "epoch_1st_00048.pth"
 save_freq: 2
-batch_size: 16
-max_len: 400 # maximum number of frames
-pretrained_model: ""
+batch_size: 2
+max_len: 100 # maximum number of frames
+pretrained_model: "Models/LJSpeech/epoch_1st_00048.pth"
 second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
 load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

c) Resuming training in the first stage:
does defining the param first_stage_path suffice?
Further, the modifications from a) do not seem to be needed.

Contextual learning

Sorry if this question is more generic but I'm fairly new to the TTS field and I'm not sure if what I want to ask is a project-specific thing or a TTS-specific thing.

I'd like to ask how important the context of sentences in the WAV files is for training.

I'm working on my own set of WAV files using the LJSpeech structure. While testing this project's demo output with the model generously provided by you (thank you!), I noticed that the output sometimes sounds "wrong" at the end of a sentence.

In some sentences, the voice goes up, as if there were a comma or a question mark at the end instead of a full stop. In other sentences this does not occur.

When I played back some of the LJSpeech sentences used for training, I found that this is exactly the same.

What I'm not sure about is whether the model learns to make the same mistake from the context of the sentence itself, or whether it is just repeating the same thing because the same or similar words appear towards the end of the sentence.

I'm trying to understand how to best create my WAV files, so the model is trained well with regards to the emotional context of the sentence.

Example: "Oh my god! How did that happen?!", exclaimed Anna with a tone of irritation in her voice.

If I use this sentence as a whole, would the TTS learn to use an irritated surprise emotion where the context of irritation is present? Or does this not matter and the model would only learn the irritated tone from the sound in that quote, regardless of the context following it?

Thanks for reading and sorry again for a super-long question!

PIP package

Hi, are you planning to allow us to install this via pip by creating a setup.py file?
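
For reference, a minimal setup.py would look roughly like the sketch below; the package name, version, and dependency list are placeholders rather than anything the repo currently defines:

from setuptools import setup, find_packages

setup(
    name="styletts2",              # placeholder name
    version="0.0.1",               # placeholder version
    packages=find_packages(),
    install_requires=[             # illustrative subset of the repo's requirements
        "torch",
        "torchaudio",
        "phonemizer",
        "transformers",
    ],
)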

iSTFTNet and LibriTTS

I'm curious if you can share any observations about using iSTFTNet with LibriTTS. The paper implies that the performance of iSTFTNet was insufficient for LibriTTS and so HiFiGAN was adopted, but I was wondering if you did any experiments with iSTFTNet and LibriTTS and what you saw.

Multispeaker Config

Hi @yl4579 thanks so much for your work.

To train multispeaker, do we just need to generate train|val_list.txt and change multispeaker to True in the config?

Any chance you could share (for example) your VCTK training script?
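
As a sketch of what generating the lists might look like, assuming the same path|phonemized text|speaker_id layout as the provided Data/train_list.txt (please double-check the exact format against that file; the metadata entries here are made up):

from phonemizer import phonemize

# Hypothetical metadata: (wav filename, raw transcript, speaker id).
entries = [
    ("p225_001.wav", "Please call Stella.", 0),
    ("p226_001.wav", "Please call Stella.", 1),
]

with open("Data/train_list.txt", "w", encoding="utf-8") as f:
    for path, text, speaker in entries:
        ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
        f.write(f"{path}|{ipa}|{speaker}\n")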

Portability? (iOS, etc.)

Sorry for the naive question: is there any suggested direction for porting this to iOS, given that it's all Python, or would that require a completely original project?

Awesome in English but no support for other languages: please add an example for another language (German, Italian, French, etc.)

The readme makes it sound very simple: "Replace bert with xphonebert"
Looking a bit closer, it seems like quite a feat to make StyleTTS2 talk in non-English languages (#28).

StyleTTS2 looks like the best approach we have right now, but English-only support is a killer for many, as it means any app will be limited to English with no prospect for other users in sight.

Some help to get this going in foreign languages would be awesome.

It appears we need to change the inference code and re-train the text and phonetic components. Any demo/guide would be great.

Alternatively, re-training the current PL-BERT for other languages, though that needs a corpus and I have no idea of the cost?
(https://github.com/yl4579/PL-BERT)

I get an error message when trying to finetune a model

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    diffusion loaded
    Traceback (most recent call last):
    File "/home/user/StyleTTS2/train_finetune.py", line 714, in
    main()
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
    return self.main(*args, **kwargs)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/home/user/StyleTTS2/train_finetune.py", line 211, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
    File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for MyDataParallel:
    Missing key(s) in state_dict: "module.diffusion.net.blocks.0.attention.norm.weight", "module.diffusion.net.blocks.0.attention.norm.bias", "module.diffusion.net.blocks.0.attention.norm_context.weight", "module.diffusion.net.blocks.0.attention.norm_context.bias", "module.diffusion.net.blocks.1.attention.norm.weight", "module.diffusion.net.blocks.1.attention.norm.bias", "module.diffusion.net.blocks.1.attention.norm_context.weight", "module.diffusion.net.blocks.1.attention.norm_context.bias", "module.diffusion.net.blocks.2.attention.norm.weight", "module.diffusion.net.blocks.2.attention.norm.bias", "module.diffusion.net.blocks.2.attention.norm_context.weight", "module.diffusion.net.blocks.2.attention.norm_context.bias", "module.unet.blocks.0.attention.norm.weight", "module.unet.blocks.0.attention.norm.bias", "module.unet.blocks.0.attention.norm_context.weight", "module.unet.blocks.0.attention.norm_context.bias", "module.unet.blocks.1.attention.norm.weight", "module.unet.blocks.1.attention.norm.bias", "module.unet.blocks.1.attention.norm_context.weight", "module.unet.blocks.1.attention.norm_context.bias", "module.unet.blocks.2.attention.norm.weight", "module.unet.blocks.2.attention.norm.bias", "module.unet.blocks.2.attention.norm_context.weight", "module.unet.blocks.2.attention.norm_context.bias".
    Unexpected key(s) in state_dict: "module.diffusion.net.to_features.0.weight", "module.diffusion.net.to_features.0.bias", "module.diffusion.net.blocks.0.attention.norm.fc.weight", "module.diffusion.net.blocks.0.attention.norm.fc.bias", "module.diffusion.net.blocks.0.attention.norm_context.fc.weight", "module.diffusion.net.blocks.0.attention.norm_context.fc.bias", "module.diffusion.net.blocks.1.attention.norm.fc.weight", "module.diffusion.net.blocks.1.attention.norm.fc.bias", "module.diffusion.net.blocks.1.attention.norm_context.fc.weight", "module.diffusion.net.blocks.1.attention.norm_context.fc.bias", "module.diffusion.net.blocks.2.attention.norm.fc.weight", "module.diffusion.net.blocks.2.attention.norm.fc.bias", "module.diffusion.net.blocks.2.attention.norm_context.fc.weight", "module.diffusion.net.blocks.2.attention.norm_context.fc.bias", "module.unet.to_features.0.weight", "module.unet.to_features.0.bias", "module.unet.blocks.0.attention.norm.fc.weight", "module.unet.blocks.0.attention.norm.fc.bias", "module.unet.blocks.0.attention.norm_context.fc.weight", "module.unet.blocks.0.attention.norm_context.fc.bias", "module.unet.blocks.1.attention.norm.fc.weight", "module.unet.blocks.1.attention.norm.fc.bias", "module.unet.blocks.1.attention.norm_context.fc.weight", "module.unet.blocks.1.attention.norm_context.fc.bias", "module.unet.blocks.2.attention.norm.fc.weight", "module.unet.blocks.2.attention.norm.fc.bias", "module.unet.blocks.2.attention.norm_context.fc.weight", "module.unet.blocks.2.attention.norm_context.fc.bias".
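
Whatever the underlying cause, a quick way to see exactly which keys disagree before calling load_state_dict is to diff the key sets; a minimal, repo-independent sketch (the state['net'][key] checkpoint layout follows the load_checkpoint code quoted elsewhere in these issues):

import torch

def diff_state_dicts(ckpt_sd, model_sd):
    """Print keys that are missing from or unexpected in the checkpoint."""
    ckpt_keys, model_keys = set(ckpt_sd), set(model_sd)
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys)[:10])
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys)[:10])

# params = torch.load("path/to/checkpoint.pth", map_location="cpu")["net"]
# diff_state_dicts(params["diffusion"], model["diffusion"].state_dict())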

About the code of the decoder part

Thank you for your great work!🙂
The idea of combining an encoder and a vocoder gave me great inspiration, and I am trying to implement it now. Could you provide this part of the code for reference?
Thank you again; I look forward to your reply.

about model and code

Thank you for your outstanding work. I am also very interested in this paper and the diffusion model. I wonder when the source code and the trained model can be made public. Thanks!

Finetune Error Message

I get this error message when I try to finetune. I set batch_size to 12 and max_len to 14. I'm using torch 2.1.1, torchaudio 2.1.1, and torchvision 0.16.1, if that matters.

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    text_aligner loaded
    pitch_extractor loaded
    mpd loaded
    msd loaded
    wd loaded
    BERT AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
    )
    decoder AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001

Traceback (most recent call last):
File "/home/user/StyleTTS2/train_finetune.py", line 714, in
main()
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/user/StyleTTS2/train_finetune.py", line 302, in main
s = model.predictor_encoder(mel.unsqueeze(0).unsqueeze(1))
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
return self.module(*inputs[0], **module_kwargs[0])
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/models.py", line 160, in forward
h = self.shared(x)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

Python 3.10 Colab

RuntimeError: Error(s) in loading state_dict for CustomAlbert:
Unexpected key(s) in state_dict: "embeddings.position_ids".

about text aligner

Maybe the ASR module (text aligner) could be replaced with the approach from VITS?

Link to pretrained weights broken

Hello,

Thank you for sharing your wonderful text to speech model.

Unfortunately, when trying to download the pretrained weights, the Google Drive link gives an error. I get the error for both the LJSpeech and LibriTTS models.

Perhaps it would be an idea to place your model, including the pretrained weights, on a site such as huggingface.co? That way more people will be able to find and use your TTS, and it has more reliable file storage.

Poor audio quality after fine-tuning

I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor results. Could you please give me some directions or help to spot the issue?

How I fine-tuned:

  1. Pulled the latest changes from the repo
  2. Replaced Data/train_list.txt with a copy that only has the first 1000 lines (~1 hour for training)
  3. Changed batch_size to 4 and max_len to 100, otherwise it doesn't fit into the memory of my 4090 (24GB).
  4. After training it for 50-100 epochs, I tested new checkpoints with both Inference_LibriTTS.ipynb and Inference_LJSpeech.ipynb notebooks by changing the multispeaker parameter in the config to true/false.
  5. Inference_LJSpeech.ipynb produces very noisy results with a poor pronunciation.
  6. Inference_LibriTTS.ipynb with reference audio from LJSpeech has a good pronunciation, but there are noticeable noises (example - https://voca.ro/1nQ8Ltjhsh9y)

Thank you again for the awesome project!

Extremely weird DDP issue for train_second.py

So far train_second.py only works with DataParallel (DP) but not DistributedDataParallel (DDP). One major problem is that if we simply translate DP to DDP (code in the comment section), we encounter the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 6; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
It is insanely difficult to debug. The tensor has no batch dimension, indicating it might be a parameter in the neural network. I found the tensor to be the bias term of the last Conv1D layer of predictor_encoder (prosodic style encoder): https://github.com/yl4579/StyleTTS2/blob/main/models.py#L152. This is extremely weird because the problem does not trigger for any Conv1D layer before this.

More mysteriously, the issue surprisingly disappears if we add model.predictor_encoder.train() near line 250 of train_second.py. However, this causes the F0 loss to be much higher than without this line. This is true for both DP and DDP, so the higher F0 loss value is caused by model.predictor_encoder.train(), not DDP. Unfortunately, the predictor_encoder, which is a StyleEncoder, has no module that changes behavior depending on whether it is in train or eval mode; the output is exactly the same either way.

TLDR: There are three issues with train_second.py:

  1. DDP does not work because of the in-place operation error. The error disappears if model.predictor_encoder.train() is called before training.
  2. However, model.predictor_encoder.train() causes F0 loss to be much higher after convergence. This issue is independent of using DP or DDP.
  3. model.predictor_encoder is an instantiation of StyleEncoder, which has no components that change the output depending on its train or eval mode.

This problem has bugged me for more than a month, but I can't find a solution to it. It would be greatly appreciated if anyone has any insight into how to fix this problem. I have pasted the broken DDP code with accelerator below.
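
As an aside, the generic autograd diagnostic for this class of in-place error is anomaly detection, which makes the error point at the forward-pass operation that produced the offending tensor; a minimal sketch of the diagnostic only, not a fix (and not the DDP code referred to above):

import torch

# Enable before training; it slows things down, so use it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Or scope it to the failing backward pass:
# with torch.autograd.detect_anomaly():
#     loss.backward()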

Additional requirements for README

Saw the code released and just got a chance to poke around. Great results testing out the default inference model!

From my install, I did have a couple of install notes. As mentioned in #5, I also ran into the PLBERT error, but the fix worked, and that was the only code problem for inference.

For dependencies from a fresh mamba env, nltk and matplotlib need to be installed as well from the ipynb (although matplotlib isn't used in my Python code). I also used soundfile for WAV output.

The only other gotcha getting things up and running is that PyTorch currently doesn't work with Python 3.12 (which was just released and is what gets installed if you just install python or pip); 3.11 is fine with PyTorch nightly, although maybe 3.10 is still required for stable releases.

Oh, also, in case anyone doesn't know, pip install gdown is great for grabbing GDrive links onto a server.

Happy to submit a PR adding these to the docs if you'd like, otherwise, just leaving a note here for anyone else getting the code up and running.

SLM adversarial training: 3 - 6 seconds in duration?

I'm having trouble reconciling the paper and the code when it comes to the min_len and max_len for the slmadv_params. They are set to 400 and 500 respectively here, but the paper states:
"For SLM adversarial training, both the ground truth and generated samples were ensured to be 3 to 6 seconds in duration".
I'm not sure how exactly to interpret the units on min_len and max_len, but that ratio definitely doesn't line up with 3-6 seconds.
Those parameters get used here, where they're halved and compared against the number of mel frames. If that's the correct interpretation, then with the mel transform here giving us ~80 frames per second of 24k audio, I think min_len would be set to 480 and max_len would be set to 960 to match the paper. Is that correct? Can you help clear this up for me?
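
To make the unit conversion above explicit, a small sketch (assuming ~80 mel frames per second of 24 kHz audio, i.e. a 300-sample hop, and that the slmadv_params are halved before being compared to the frame count):

SR = 24000
HOP = 300                        # assumed hop length -> SR / HOP = 80 frames per second
frames_per_second = SR / HOP

def slmadv_len(seconds: float) -> int:
    # The config value is halved in the code, so it is twice the target frame count.
    return int(2 * seconds * frames_per_second)

print(slmadv_len(3))  # 480 -> suggested min_len for 3 s
print(slmadv_len(6))  # 960 -> suggested max_len for 6 s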

Weird error when finetuning using colab in repo

Weird error when finetuning. I tried to put 'embeddings' in ignore_modules but it didn't change anything.

bert
bert loaded
Traceback (most recent call last):
  File "train_finetune.py", line 714, in <module>
    main()
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 212, in main
    load_only_params=config.get('load_only_params', True))
  File "/data/Repos/Forsen2/StyleTTS2/models.py", line 703, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MyDataParallel:
	Missing key(s) in state_dict: "module.embeddings.position_ids". 

How to replace PL-BERT with XPhoneBERT?

Hi,

I'm looking to generate Hindi audio but it was mentioned that PL-BERT doesn't work well with other languages and I either need to train a different PL-BERT or replace the module with XPhoneBERT.

I'm having trouble understanding how I could go about replacing the module with XPhoneBERT. The XPhoneBERT repository describes using the model through transformers, but I'm unsure how to apply that here, and this issue thread suggests that the pre-trained model is not public. So how do I go about replacing PL-BERT with XPhoneBERT here?

Thanks!
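
For what it's worth, loading XPhoneBERT itself through transformers follows the usual pattern (model name as published by its authors); the hard part, which this sketch does not cover, is rewiring StyleTTS 2's bert/bert_encoder modules and the phoneme tokenization to match:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")

# XPhoneBERT expects space-separated phonemes (its repo produces them with
# text2phonemesequence); the string below is purely illustrative.
inputs = tokenizer("p l i z k ɔ l s t ɛ l ə", return_tensors="pt")
hidden_states = xphonebert(**inputs).last_hidden_state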

speaker selection on inference on finetuned libritts

Hello, thanks again for sharing this project. The output quality is very impressive.
I was able to fine-tune the LibriTTS model you shared with another voice to 199 steps.
Is there a way to select the speaker from the model? I'm getting different speaker outputs each time I run inference. Also, is a reference clip required? I would like to just run inference with the fine-tuned model, without using a reference clip, to see how it performs.

High-pitched noise in the background when using old GPUs

Previously discussed here: #1 (comment)

The model produces some high-pitched noise in the background when I use my old GPU for inference (NVIDIA Quadro P5000, Driver Version: 515.105.01, CUDA Version: 11.7)

Audio examples:

I solved this problem by switching to the CPU device, so this issue is just for reference, as requested by the author.

Thank you for your work!
