
styletts2's People

Contributors

ameerazam08, astricks, awas666, danielmsu, devidw, eltociear, fakerybakery, haoqi, kmn1024, phields, yl4579


styletts2's Issues

text_encoder and text_aligner are not optimized in train_second.py

I'm trying to reconcile a difference I'm seeing between the paper and the code. Figure 1a says, "joint training then follows to optimize all components except the pitch extractor". When I look at train_second.py, though, I see two components that are not being optimized: text_encoder and text_aligner. These two components are used only in no_grad contexts in train_second.py, which makes sense to me, but I wanted to check that this is correct. Thanks!

(Question) Max Length and datasets

A lot of useful information is found in the other (closed) issues, but these questions come to mind.

  • How does max_len impact the training/finetuning process exactly?

In the LJSpeech dataset, there are audio files with durations far longer than the max_len: 400 (about 5 seconds) specified in the example config file. Many files are 10 seconds long, and a great majority are longer than 5 seconds. They are also included in train_list.txt. Was this intentional? (See the frame-to-seconds sketch after this list.)

  • Are audio files truncated once the maximum number of frames is reached?

Should the datasets be carefully edited so that audio files do not exceed the maximum duration set in the config file? Is there a detrimental effect on adherence to punctuation or spelling when the model only sees short or clipped speech?

  • Is there a maximum permissible length / does the architecture impose restrictions? Could max_len be set to something like 1200 and thus make full use of long audio files? (Ignoring the VRAM requirements in the current DP implementation)
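
For reference, the arithmetic behind "400 frames ≈ 5 seconds" can be made explicit. A minimal sketch, assuming the 24 kHz sample rate and 300-sample hop length used for the LJSpeech mel spectrograms (check meldataset.py and the config for the actual values):

def frames_to_seconds(n_frames: int, sr: int = 24000, hop_length: int = 300) -> float:
    """Convert a mel-frame count (as used by max_len) into seconds of audio."""
    return n_frames * hop_length / sr

print(frames_to_seconds(400))   # 5.0  -> max_len: 400 is about 5 seconds
print(frames_to_seconds(1200))  # 15.0 -> a hypothetical max_len of 1200 is about 15 seconds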

Licensing issue

Hi,
Hi,
This package uses the phonemizer library, which is GPL licensed (because it depends on espeak-ng by Jonathan Duddington, and nothing has been heard from him for years). That means all software that uses it must also be GPL licensed. Might it be possible to switch to an alternate library (preferably deep_phonemizer or g2p_en)? Thanks!
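
For illustration, here is roughly what calling g2p_en looks like; note that it outputs ARPAbet tokens rather than the IPA symbols StyleTTS 2's text cleaner expects, so an extra mapping step would still be needed (this is a sketch of the alternative library, not of an integration):

from g2p_en import G2p

g2p = G2p()
phonemes = g2p("StyleTTS 2 is a text-to-speech model.")
print(phonemes)  # a list of ARPAbet tokens such as 'S', 'T', 'AY1', 'L', ...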

Laugh, chuckle, giggling, 'Uh-huh' (nodding sound), etc.

I'm wondering how to extend this excellent work to synthesize more types of expressions, like those in the title.

For laughs, just using the text "Haha" works, but it often sounds like a fake laugh, and I have no idea what text makes a giggle sound; it might not even have an IPA representation.
I guess a better approach would be to add some VQS symbols, add some appropriate training data that includes the extra VQS symbols, and hack the phonemizer to allow something like "*giggle* That's funny!" to be phonemized by using a custom dictionary for the giggle part and espeak for the rest. Finally, retrain the ASR with an expanded vocabulary. Is this a good approach?

If the vocabulary expands, then would I need to retrain the ASR from scratch or can it be finetuned?

Alternatively, sampling a laughing, giggling, etc. style might work, but I think there would be less fine-grained speech control than with expanding the vocabulary, especially if laughing and giggling appear in the same sentence, wouldn't there?
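
To make the pre-phonemization idea above concrete, here is a rough sketch; the *giggle*/*laugh* tokens and their phoneme strings are made-up placeholders, and only phonemizer.phonemize is the library's real API:

import re
from phonemizer import phonemize

# Hypothetical VQS-style tokens mapped to placeholder phoneme strings.
CUSTOM_DICT = {
    "*giggle*": "ɡɪɡəl",
    "*laugh*": "hɑhɑ",
}

def phonemize_with_tokens(text: str) -> str:
    # Handle custom tokens with the custom dictionary; phonemize the rest with espeak.
    parts = re.split(r"(\*\w+\*)", text)
    out = []
    for part in parts:
        if part in CUSTOM_DICT:
            out.append(CUSTOM_DICT[part])
        elif part.strip():
            out.append(phonemize(part, language="en-us", backend="espeak", strip=True))
    return " ".join(out)

print(phonemize_with_tokens("*giggle* That's funny!"))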

Very high memory usage when training Stage 1

Hello,
Thanks for the great work.
I'm trying to train a model on my dataset using an A5000 (24 GB VRAM). I kept getting OOM errors at the beginning of Stage 1. After repeatedly reducing the batch size, training finally ran with a batch size of 4.
Is this normal? What hardware were you using?
Thanks!

Continuing after an interruption

Hi and thank you for the excellent work you're providing in this repository! It's much appreciated.
I have a question about running this project on Google Colab with a single GPU on the free plan: is there a way to continue from where the first or second stage stopped when I (or Google) interrupt the notebook run?

Audio streaming

Hi,
Can we stream audio as it is being generated for longer texts?
Thank you!
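
The repo has no built-in streaming mode, but a common workaround is to split long text into sentences and synthesize them one at a time; a rough sketch, where synthesize() is a placeholder for whatever inference function you use (e.g. the one in the demo notebooks):

import nltk

def stream_tts(text, synthesize):
    """Yield audio chunks sentence by sentence instead of waiting for the full text."""
    nltk.download("punkt", quiet=True)
    for sentence in nltk.sent_tokenize(text):
        yield synthesize(sentence)  # placeholder: returns an array of audio samples

# for chunk in stream_tts(long_text, synthesize):
#     play_or_send(chunk)  # hypothetical audio sink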

Possibly misleading license info

This repo claims to use an MIT license, but there are additional license requirements buried in the readme file:

Before using these models, you agree to inform the listeners that the speech samples are synthesized by StyleTTS 2 models, unless you have the permission to use the voice you synthesize. That is, when using it for voice cloning, you also agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public.

This is very misleading because someone simply checking the license file before using the repo would make the assumption that only the MIT license requirements apply.

I understand that the intention is probably to license the code in the repo as MIT, while having additional license requirements for the pre-trained models. However, because the only apparent way to get the models is a Google Drive link contained in the repo, it still seems misleading. Additionally, the wording above suggests that this license requirement applies not only to the pre-trained models but also to any StyleTTS 2 models created using the code in the repo.

IMO, any additional license requirements for using the code in the repo as intended should be mentioned in the repo license file.

If I am misunderstanding something, I take no offense to you simply closing this issue without comment.

Epoch-based loss_param values for LibriTTS

The paper describes reducing the total training epochs for stage 1 (100 -> 30) and for stage 2 (60 -> 25) when moving from LJSpeech to LibriTTS. I'm wondering about the epoch-based loss_params like TMA_epoch, diff_epoch, and joint_epoch. How were those changed for LibriTTS?

Finetuning and dataset preparation

First of all, is it already possible to fine-tune a single-speaker model?
If so, what should one pay attention to?

Second:
How do you prepare a dataset?
train and val are pretty clear, but the OOD_texts confuse me a little; how do I get those?

inference

I used the LJSpeech pre-trained model you provided for inference and found that directly using the Inference_LJSpeech.ipynb file under the /Demo directory works well. However, if I first use the compute_style function to compute the style of an audio file (from the LJSpeech dataset) and then combine it, the result is slightly worse. I want to ask why?

Why can styletts2 get emotions?

Hello, I want to know why styletts2 can produce emotions. How does it know what emotion a sentence should have? I see no difference between neutral speech synthesis and multi-emotional speech synthesis in the inference code.

questions on training

a) When loading a checkpoint for train_second.py, apart from the nets for mpd, msd, and wd, I saw the need to add the prefix "module." to make the key names compatible. Is this expected?

diff --git a/models.py b/models.py
index 99b4f3d..fc03a8b 100644
--- a/models.py
+++ b/models.py

+from collections import OrderedDict
 
 class LearnedDownSample(nn.Module):
     def __init__(self, layer_type, dim_in):
@@ -697,9 +700,19 @@ def load_checkpoint(model, optimizer, path, load_only_params=True, ignore_module
     state = torch.load(path, map_location='cpu')
     params = state['net']
     for key in model:
+        new_state_dict = OrderedDict()
+        for k, v in params[key].items():
+            name = 'module.' + k # add `module.`
+            new_state_dict[name] = v
+
+        if key in ['mpd', 'msd', 'wd'] : 
+            new_state_dict = params[key]    
+
         if key in params and key not in ignore_modules:
             print('%s loaded' % key)
-            model[key].load_state_dict(params[key])
+            #model[key].load_state_dict(params[key])
+            model[key].load_state_dict(new_state_dict)
+
     _ = [model[key].eval() for key in model]
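
For reference, a generic helper for this kind of "module." prefix mismatch (plain PyTorch, independent of the repo's load_checkpoint; the function name is made up) could look like this:

from collections import OrderedDict

def match_module_prefix(state_dict, module):
    """Add or strip the 'module.' prefix so checkpoint keys match the target module."""
    target_keys = set(module.state_dict().keys())
    if set(state_dict.keys()) == target_keys:
        return state_dict
    fixed = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module.") and k[len("module."):] in target_keys:
            fixed[k[len("module."):]] = v
        elif "module." + k in target_keys:
            fixed["module." + k] = v
        else:
            fixed[k] = v
    return fixed

# usage inside load_checkpoint:
# model[key].load_state_dict(match_module_prefix(params[key], model[key]))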

b) Passing a pre-trained model to train_second.py:
after training the first stage for around 50 epochs, I defined the params first_stage_path and pretrained_model.
Is anything more needed?

diff --git a/Configs/config.yml b/Configs/config.yml
index b74b8ee..2a9f93c 100644
--- a/Configs/config.yml
+++ b/Configs/config.yml
@@ -1,13 +1,13 @@
 log_dir: "Models/LJSpeech"
-first_stage_path: "first_stage.pth"
+first_stage_path: "epoch_1st_00048.pth"
 save_freq: 2
-batch_size: 16
-max_len: 400 # maximum number of frames
-pretrained_model: ""
+batch_size: 2
+max_len: 100 # maximum number of frames
+pretrained_model: "Models/LJSpeech/epoch_1st_00048.pth"
 second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
 load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

c) Resuming training in the first stage:
does defining the param first_stage_path suffice?
Further, the modifications from a) do not seem to be needed.

Contextual learning

Sorry if this question is more generic but I'm fairly new to the TTS field and I'm not sure if what I want to ask is a project-specific thing or a TTS-specific thing.

I'd like to ask how important the context of sentences in the WAV files is for training.

I'm working on my own set of WAV files using the LJSpeech structure. While testing this project's demo output with the model generously provided by you (thank you!), I noticed that the output sometimes sounds "wrong" at the end of a sentence.

In some sentences, the voice goes up, as if there were a comma or a question mark at the end instead of a full stop. In other sentences this does not occur.

When I played back some of the LJSpeech sentences used for training, I found that this is exactly the same.

What I'm not sure about is whether the model learns to make the same mistake from the context of the sentence itself, or whether it is just repeating the same thing because the same or similar words appear towards the end of the sentence.

I'm trying to understand how to best create my WAV files, so the model is trained well with regards to the emotional context of the sentence.

Example: "Oh my god! How did that happen?!", exclaimed Anna with a tone of irritation in her voice.

If I use this sentence as a whole, would the TTS learn to use an irritated surprise emotion where the context of irritation is present? Or does this not matter and the model would only learn the irritated tone from the sound in that quote, regardless of the context following it?

Thanks for reading and sorry again for a super-long question!

PIP package

Hi, are you planning to allow us to install this via pip by creating a setup.py file?
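
For reference, a minimal setup.py would look roughly like the sketch below; the package name, version, and dependency list are placeholders rather than anything the repo currently defines:

from setuptools import setup, find_packages

setup(
    name="styletts2",              # placeholder name
    version="0.0.1",               # placeholder version
    packages=find_packages(),
    install_requires=[             # illustrative subset of the repo's requirements
        "torch",
        "torchaudio",
        "phonemizer",
        "transformers",
    ],
)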

iSTFTNet and LibriTTS

I'm curious if you can share any observations about using iSTFTNet with LibriTTS. The paper implies that the performance of iSTFTNet was insufficient for LibriTTS and so HiFiGAN was adopted, but I was wondering if you did any experiments with iSTFTNet and LibriTTS and what you saw.

Multispeaker Config

Hi @yl4579 thanks so much for your work.

To train multispeaker, do we just need to generate train|val_list.txt and change multispeaker to True in the config?

Any chance you could share (for example) your VCTK training script?
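
As a sketch of what generating the lists might look like, assuming the same path|phonemized text|speaker_id layout as the provided Data/train_list.txt (please double-check the exact format against that file; the metadata entries here are made up):

from phonemizer import phonemize

# Hypothetical metadata: (wav filename, raw transcript, speaker id).
entries = [
    ("p225_001.wav", "Please call Stella.", 0),
    ("p226_001.wav", "Please call Stella.", 1),
]

with open("Data/train_list.txt", "w", encoding="utf-8") as f:
    for path, text, speaker in entries:
        ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
        f.write(f"{path}|{ipa}|{speaker}\n")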

Portability? (iOS, etc.)

Sorry for the naive question: is there any suggested direction for porting this to iOS, given that it's all Python, or would that require a completely original project?

Awesome in English but no support for other languages: please add an example for another language (German, Italian, French, etc.)

The readme makes it sound very simple: "Replace bert with xphonebert"
Looking a bit closer, it seems like quite a feat to make StyleTTS2 talk in non-English languages (#28).

StyleTTS2 looks like the best approach we have right now, but English-only support is a killer for many, as it means any app will be limited to English with no prospect for other users in sight.

Some help to get this going in foreign languages would be awesome.

It appears we need to change the inference code and re-train the text and phonetic components. Any demo/guide would be great.

Alternatively, re-training the current PL-BERT for other languages, though that needs a corpus and I have no idea of the cost?
(https://github.com/yl4579/PL-BERT)

I get an error message when trying to finetune a model

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    diffusion loaded
    Traceback (most recent call last):
    File "/home/user/StyleTTS2/train_finetune.py", line 714, in
    main()
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
    return self.main(*args, **kwargs)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/home/user/StyleTTS2/train_finetune.py", line 211, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
    File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for MyDataParallel:
    Missing key(s) in state_dict: "module.diffusion.net.blocks.0.attention.norm.weight", "module.diffusion.net.blocks.0.attention.norm.bias", "module.diffusion.net.blocks.0.attention.norm_context.weight", "module.diffusion.net.blocks.0.attention.norm_context.bias", "module.diffusion.net.blocks.1.attention.norm.weight", "module.diffusion.net.blocks.1.attention.norm.bias", "module.diffusion.net.blocks.1.attention.norm_context.weight", "module.diffusion.net.blocks.1.attention.norm_context.bias", "module.diffusion.net.blocks.2.attention.norm.weight", "module.diffusion.net.blocks.2.attention.norm.bias", "module.diffusion.net.blocks.2.attention.norm_context.weight", "module.diffusion.net.blocks.2.attention.norm_context.bias", "module.unet.blocks.0.attention.norm.weight", "module.unet.blocks.0.attention.norm.bias", "module.unet.blocks.0.attention.norm_context.weight", "module.unet.blocks.0.attention.norm_context.bias", "module.unet.blocks.1.attention.norm.weight", "module.unet.blocks.1.attention.norm.bias", "module.unet.blocks.1.attention.norm_context.weight", "module.unet.blocks.1.attention.norm_context.bias", "module.unet.blocks.2.attention.norm.weight", "module.unet.blocks.2.attention.norm.bias", "module.unet.blocks.2.attention.norm_context.weight", "module.unet.blocks.2.attention.norm_context.bias".
    Unexpected key(s) in state_dict: "module.diffusion.net.to_features.0.weight", "module.diffusion.net.to_features.0.bias", "module.diffusion.net.blocks.0.attention.norm.fc.weight", "module.diffusion.net.blocks.0.attention.norm.fc.bias", "module.diffusion.net.blocks.0.attention.norm_context.fc.weight", "module.diffusion.net.blocks.0.attention.norm_context.fc.bias", "module.diffusion.net.blocks.1.attention.norm.fc.weight", "module.diffusion.net.blocks.1.attention.norm.fc.bias", "module.diffusion.net.blocks.1.attention.norm_context.fc.weight", "module.diffusion.net.blocks.1.attention.norm_context.fc.bias", "module.diffusion.net.blocks.2.attention.norm.fc.weight", "module.diffusion.net.blocks.2.attention.norm.fc.bias", "module.diffusion.net.blocks.2.attention.norm_context.fc.weight", "module.diffusion.net.blocks.2.attention.norm_context.fc.bias", "module.unet.to_features.0.weight", "module.unet.to_features.0.bias", "module.unet.blocks.0.attention.norm.fc.weight", "module.unet.blocks.0.attention.norm.fc.bias", "module.unet.blocks.0.attention.norm_context.fc.weight", "module.unet.blocks.0.attention.norm_context.fc.bias", "module.unet.blocks.1.attention.norm.fc.weight", "module.unet.blocks.1.attention.norm.fc.bias", "module.unet.blocks.1.attention.norm_context.fc.weight", "module.unet.blocks.1.attention.norm_context.fc.bias", "module.unet.blocks.2.attention.norm.fc.weight", "module.unet.blocks.2.attention.norm.fc.bias", "module.unet.blocks.2.attention.norm_context.fc.weight", "module.unet.blocks.2.attention.norm_context.fc.bias".
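
Whatever the underlying cause, a quick way to see exactly which keys disagree before calling load_state_dict is to diff the key sets; a minimal, repo-independent sketch (the state['net'][key] checkpoint layout follows the load_checkpoint code quoted elsewhere in these issues):

import torch

def diff_state_dicts(ckpt_sd, model_sd):
    """Print keys that are missing from or unexpected in the checkpoint."""
    ckpt_keys, model_keys = set(ckpt_sd), set(model_sd)
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys)[:10])
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys)[:10])

# params = torch.load("path/to/checkpoint.pth", map_location="cpu")["net"]
# diff_state_dicts(params["diffusion"], model["diffusion"].state_dict())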

About the code of the decoder part

Thank you for your great work!🙂
The idea of combining an encoder and a vocoder gave me great inspiration, and I am trying to implement it now. Could you provide this part of the code for reference?
Thank you again; I look forward to your reply.

about model and code

Thank you for your outstanding work. I am also very interested in this paper and the diffusion model. I wonder when the source code and the trained model can be made public. Thanks!

Finetune Error Message

I get this error message when I try to finetune. I set batch_size to 12 and max_len to 14. I'm using torch 2.1.1, torchaudio 2.1.1, and torchvision 0.16.1, if that matters.

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    text_aligner loaded
    pitch_extractor loaded
    mpd loaded
    msd loaded
    wd loaded
    BERT AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
    )
    decoder AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001

Traceback (most recent call last):
File "/home/user/StyleTTS2/train_finetune.py", line 714, in
main()
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/user/StyleTTS2/train_finetune.py", line 302, in main
s = model.predictor_encoder(mel.unsqueeze(0).unsqueeze(1))
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
return self.module(*inputs[0], **module_kwargs[0])
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/models.py", line 160, in forward
h = self.shared(x)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

Python 3.10 Colab

RuntimeError: Error(s) in loading state_dict for CustomAlbert:
Unexpected key(s) in state_dict: "embeddings.position_ids".

about text aligner

Maybe the ASR module (text aligner) could be replaced with the approach from VITS?

Link to pretrained weights broken

Hello,

Thank you for sharing your wonderful text to speech model.

Unfortunately, when trying to download the pretrained weights, the Google Drive link gives an error. I get the error for both the LJSpeech and LibriTTS models.

Perhaps it would be an idea to place your model, including the pretrained weights, on a site such as huggingface.co? That way more people will be able to find and use your TTS, and it has more reliable file storage.

Poor audio quality after fine-tuning

I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor results. Could you please give me some directions or help to spot the issue?

How I fine-tuned:

  1. Pulled the latest changes from the repo
  2. Replaced Data/train_list.txt with a copy that only has the first 1000 lines (~1 hour for training)
  3. Changed batch_size to 4 and max_len to 100, otherwise it doesn't fit into the memory of my 4090 (24GB).
  4. After training it for 50-100 epochs, I tested new checkpoints with both Inference_LibriTTS.ipynb and Inference_LJSpeech.ipynb notebooks by changing the multispeaker parameter in the config to true/false.
  5. Inference_LJSpeech.ipynb produces very noisy results with a poor pronunciation.
  6. Inference_LibriTTS.ipynb with reference audio from LJSpeech has a good pronunciation, but there are noticeable noises (example - https://voca.ro/1nQ8Ltjhsh9y)

Thank you again for the awesome project!

Extremely weird DDP issue for train_second.py

So far train_second.py only works with DataParallel (DP) but not DistributedDataParallel (DDP). One major problem is that if we simply translate DP to DDP (code in the comment section), we encounter the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 6; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
It is insanely difficult to debug. The tensor has no batch dimension, indicating it might be a parameter in the neural network. I found the tensor to be the bias term of the last Conv1D layer of predictor_encoder (prosodic style encoder): https://github.com/yl4579/StyleTTS2/blob/main/models.py#L152. This is extremely weird because the problem does not trigger for any Conv1D layer before this.

More mysteriously, the issue surprisingly disappears if we add model.predictor_encoder.train() near line 250 of train_second.py. However, this causes the F0 loss to be much higher than without this line. This is true for both DP and DDP, so the higher F0 loss value is caused by model.predictor_encoder.train(), not DDP. Unfortunately, the predictor_encoder, which is a StyleEncoder, has no module that changes behavior depending on whether it is in train or eval mode; the output is exactly the same either way.

TLDR: There are three issues with train_second.py:

  1. DDP does not work because of the in-place operation error. The error disappears if model.predictor_encoder.train() is called before training.
  2. However, model.predictor_encoder.train() causes F0 loss to be much higher after convergence. This issue is independent of using DP or DDP.
  3. model.predictor_encoder is an instantiation of StyleEncoder, which has no components that change the output depending on its train or eval mode.

This problem has bugged me for more than a month, but I can't find a solution to it. It would be greatly appreciated if anyone has any insight into how to fix this problem. I have pasted the broken DDP code with accelerator below.
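
As an aside, the generic autograd diagnostic for this class of in-place error is anomaly detection, which makes the error point at the forward-pass operation that produced the offending tensor; a minimal sketch of the diagnostic only, not a fix (and not the DDP code referred to above):

import torch

# Enable before training; it slows things down, so use it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Or scope it to the failing backward pass:
# with torch.autograd.detect_anomaly():
#     loss.backward()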

Additional requirements for README

Saw the code released and just got a chance to poke around. Great results testing out the default inference model!

From my install, I did have a couple of install notes. As mentioned in #5, I also ran into the PLBERT error, but the fix worked, and that was the only code problem for inference.

For dependencies from a fresh mamba env, nltk and matplotlib need to be installed as well from the ipynb (although matplotlib isn't used in my Python code). I also used soundfile for WAV output.

The only other gotcha getting things up and running is that PyTorch currently doesn't work with Python 3.12 (which was just released and is what gets installed if you just install python or pip); 3.11 is fine with PyTorch nightly, although maybe 3.10 is still required for stable releases.

Oh, also, in case anyone doesn't know, pip install gdown is great for grabbing GDrive links onto a server.

Happy to submit a PR adding these to the docs if you'd like, otherwise, just leaving a note here for anyone else getting the code up and running.

SLM adversarial training: 3 - 6 seconds in duration?

I'm having trouble reconciling the paper and the code when it comes to the min_len and max_len for the slmadv_params. They are set to 400 and 500 respectively here, but the paper states:
"For SLM adversarial training, both the ground truth and generated samples were ensured to be 3 to 6 seconds in duration".
I'm not sure how exactly to interpret the units on min_len and max_len, but that ratio definitely doesn't line up with 3-6 seconds.
Those parameters get used here, where they're halved and compared against the number of mel frames. If that's the correct interpretation, then with the mel transform here giving us ~80 frames per second of 24k audio, I think min_len would be set to 480 and max_len would be set to 960 to match the paper. Is that correct? Can you help clear this up for me?
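
To make the unit conversion above explicit, a small sketch (assuming ~80 mel frames per second of 24 kHz audio, i.e. a 300-sample hop, and that the slmadv_params are halved before being compared to the frame count):

SR = 24000
HOP = 300                        # assumed hop length -> SR / HOP = 80 frames per second
frames_per_second = SR / HOP

def slmadv_len(seconds: float) -> int:
    # The config value is halved in the code, so it is twice the target frame count.
    return int(2 * seconds * frames_per_second)

print(slmadv_len(3))  # 480 -> suggested min_len for 3 s
print(slmadv_len(6))  # 960 -> suggested max_len for 6 s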

Weird error when finetuning using colab in repo

Weird error when finetuning. I tried to put 'embeddings' in ignore_modules but it didn't change anything.

bert
bert loaded
Traceback (most recent call last):
  File "train_finetune.py", line 714, in <module>
    main()
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 212, in main
    load_only_params=config.get('load_only_params', True))
  File "/data/Repos/Forsen2/StyleTTS2/models.py", line 703, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MyDataParallel:
	Missing key(s) in state_dict: "module.embeddings.position_ids". 

How to replace PL-BERT with XPhoneBERT?

Hi,

I'm looking to generate Hindi audio but it was mentioned that PL-BERT doesn't work well with other languages and I either need to train a different PL-BERT or replace the module with XPhoneBERT.

I'm having trouble understanding how I could go about replacing the module with XPhoneBERT. The XPhoneBERT repository describes using the model through transformers, but I'm unsure how to apply that here, and this issue thread suggests that the pre-trained model is not public. So how do I go about replacing PL-BERT with XPhoneBERT here?

Thanks!
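
For what it's worth, loading XPhoneBERT itself through transformers follows the usual pattern (model name as published by its authors); the hard part, which this sketch does not cover, is rewiring StyleTTS 2's bert/bert_encoder modules and the phoneme tokenization to match:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")

# XPhoneBERT expects space-separated phonemes (its repo produces them with
# text2phonemesequence); the string below is purely illustrative.
inputs = tokenizer("p l i z k ɔ l s t ɛ l ə", return_tensors="pt")
hidden_states = xphonebert(**inputs).last_hidden_state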

speaker selection on inference on finetuned libritts

Hello, thanks again for sharing this project. The output quality is very impressive.
I was able to fine-tune the LibriTTS model you shared with another voice to 199 steps.
Is there a way to select the speaker from the model? I'm getting different speaker outputs each time I run inference. Also, is a reference clip required? I would like to just run inference with the fine-tuned model, without using a reference clip, to see how it performs.

High-pitched noise in the background when using old GPUs

Previously discussed here: #1 (comment)

The model produces some high-pitched noise in the background when I use my old GPU for inference (NVIDIA Quadro P5000, Driver Version: 515.105.01, CUDA Version: 11.7)

Audio examples:

I solved this problem by switching to the CPU device, so this issue is just for reference, as requested by the author.

Thank you for your work!
