ming024 / fastspeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

License: MIT License

Python 81.29% HTML 18.71%

fastspeech2's Issues

run preprocess.py failed with the latest version

The TextGrid files are at the path \preprocessed_data\LJSpeech\TextGrid, and I ran preprocess.py. The output log looks like this:

(pytorch_fs02) D:\workspace_tts\FastSpeech2-master>python preprocess.py config/LJSpeech/preprocess.yaml
Processing Data ...
100%| | 1/1 [00:00<00:00, 1.47it/s]
Computing statistic quantities ...
Traceback (most recent call last):
File "preprocess.py", line 15, in <module>
preprocessor.build_from_path()
File "D:\workspace_tts\FastSpeech2-master\preprocessor\preprocessor.py", line 100, in build_from_path
pitch_mean = pitch_scaler.mean_[0]
AttributeError: 'StandardScaler' object has no attribute 'mean_'

I need help……
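
One plausible cause, judging from the log: the progress bar shows a single iteration and the scaler was never fitted, which is what happens when no utterance is actually processed (scikit-learn only sets `mean_` after `fit` or `partial_fit` has seen data). A minimal sketch for checking how many wav files have a matching TextGrid, assuming the layout `<raw_dir>/<speaker>/<name>.wav` and `<textgrid_dir>/<speaker>/<name>.TextGrid`. The function name and the layout are assumptions for illustration, not taken from the repository:

```python
import os


def count_matched_utterances(raw_dir, textgrid_dir):
    """Count wav files under raw_dir/<speaker>/ that have a TextGrid at
    textgrid_dir/<speaker>/<basename>.TextGrid (assumed layout)."""
    matched, missing = 0, []
    for speaker in sorted(os.listdir(raw_dir)):
        speaker_dir = os.path.join(raw_dir, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for wav_name in sorted(os.listdir(speaker_dir)):
            if not wav_name.endswith(".wav"):
                continue
            basename = os.path.splitext(wav_name)[0]
            tg_path = os.path.join(textgrid_dir, speaker, basename + ".TextGrid")
            if os.path.exists(tg_path):
                matched += 1
            else:
                missing.append(tg_path)
    return matched, missing
```

If `matched` comes back as 0, the TextGrid directory is probably one level off from where the preprocessor looks, and the `missing` list shows the paths that were expected.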

Energy and F0 for other dataset

How do I find the parameters below for other datasets?

Quantization for F0 and energy

for LJSpeech

f0_min = 71.0
f0_max = 795.8
energy_min = 0.0
energy_max = 315.0

for Blizzard2013

#f0_min = 71.0
#f0_max = 786.7
#energy_min = 21.23
#energy_max = 101.02
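
One straightforward way to obtain such bounds is to scan the extracted F0 and energy values of the whole dataset and take the global minimum and maximum. A minimal sketch, assuming each utterance is a plain list of per-frame values and that an F0 of 0 marks an unvoiced frame (an assumption that holds for many pitch extractors but should be checked for yours). The helper name and the toy numbers are made up for illustration:

```python
def value_range(utterances, ignore_zeros=False):
    """Return (min, max) over all frames of all utterances."""
    values = [
        v
        for utt in utterances
        for v in utt
        if not (ignore_zeros and v == 0.0)
    ]
    if not values:
        raise ValueError("no values to compute a range from")
    return min(values), max(values)


# Toy per-utterance contours (made-up numbers, not from any dataset).
f0_contours = [[0.0, 110.5, 120.0, 0.0], [98.2, 0.0, 240.7]]
energy_contours = [[0.0, 12.5, 30.1], [4.2, 55.0]]

# Unvoiced frames (F0 == 0) are skipped so they do not drag f0_min to zero.
f0_min, f0_max = value_range(f0_contours, ignore_zeros=True)
energy_min, energy_max = value_range(energy_contours)
print(f0_min, f0_max, energy_min, energy_max)
```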

Question about duration predictor

Hello @ming024 ,

First of all, thank you for sharing your awesome work.

I am asking this question out of curiosity about the duration predictor improvements.
A few months ago, the evaluation performance of the duration predictor seemed poor due to overfitting.
(At that time, the training error was below 0.01, but the eval error was 0.5~0.6.)

But now it has improved drastically (the eval error was 0.5~0.6 in the past, but is now 0.08~0.1).
If you don't mind, could you tell me what the problem was with the previous version of the duration predictor?

Always appreciate,

TextGrid.zip

How can I create the TextGrid files if I want to try this with other datasets?

Support to HiFiGan

HiFi-GAN achieves state-of-the-art results in waveform generation from mel spectrograms.

Is it possible to add support for the HiFi-GAN model after the mel generation, in order to create the wave file?

    mel, mel_postnet, log_duration_output, f0_output, energy_output, _, _, mel_len = model(text, src_len)
    
    mel_torch = mel.transpose(1, 2)
    mel_postnet_torch = mel_postnet.transpose(1, 2)
    mel = mel[0].cpu().transpose(0, 1)
    mel_postnet = mel_postnet[0].cpu().transpose(0, 1)
    f0_output = f0_output[0].cpu().numpy()
    energy_output = energy_output[0].cpu().numpy()

    if not os.path.exists(hp.test_path):
        os.makedirs(hp.test_path)

    if melgan is not None:
        with torch.no_grad():
            wav = melgan.inference(mel_torch).cpu().numpy()  # use HiFi-GAN here?
            wav = wav.astype('int16')
            #ipd.display(ipd.Audio(wav, rate=hp.sampling_rate))
            # save audio file
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, wav)

Or would some additional adaptation be needed?

In the case of end-to-end inference with HiFi-GAN, the generation code would look like this:

def inference(a):
    generator = Generator(h).to(device)

    state_dict_g = load_checkpoint(a.checkpoint_file, device)
    generator.load_state_dict(state_dict_g['generator'])
    generator.eval()
    generator.remove_weight_norm()
    with torch.no_grad():
        x = torch.FloatTensor( mel_torch ).to(device)
        y_g_hat = generator(x)
        audio = y_g_hat.squeeze()
        audio = audio * MAX_WAV_VALUE
        audio = audio.cpu().numpy().astype('int16')
        write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, audio)

where mel_torch is our mel spectrogram.
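
Whichever vocoder produces the waveform, the last step is the same: scale the float samples to 16-bit integers and write them out. A minimal standard-library sketch of that step, assuming the samples are floats nominally in [-1, 1]. The helper names and the 32767 scale factor are choices made for this illustration, and clipping is added so out-of-range samples cannot overflow the 16-bit range:

```python
import struct
import wave


def float_to_int16(samples, scale=32767):
    """Clip float samples to [-1, 1] and scale them to 16-bit integers."""
    return [int(round(max(-1.0, min(1.0, s)) * scale)) for s in samples]


def write_wav(path, samples, sample_rate):
    """Write mono 16-bit PCM audio with the standard-library wave module."""
    pcm = float_to_int16(samples)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16 bits
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```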

Why perform interpolation on pitch?

FastSpeech2/preprocessor/preprocessor.py

Line 207 in 7011fa1

pitch = interp_fn(np.arange(0, len(pitch)))

As far as I know, pitch(F0) is the bottom yellow line on the spectrogram, and the pitch before interpolation looks similar to it. What's the reason to perform interpolation on pitch?
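
For context, F0 extractors commonly report 0 for unvoiced frames, so the raw contour has gaps. Interpolating across those gaps yields a continuous curve, which is usually easier for a predictor to regress. A minimal pure-Python sketch of the idea, assuming zeros mark unvoiced frames. It is not the repository's code, which calls an interpolation function built elsewhere in `preprocessor.py`:

```python
def interpolate_unvoiced(pitch):
    """Linearly interpolate over zero (unvoiced) frames.

    Frames before the first or after the last voiced frame take the
    nearest voiced value.
    """
    voiced = [i for i, p in enumerate(pitch) if p != 0]
    if not voiced:
        return list(pitch)  # nothing to interpolate from
    out = list(pitch)
    for i in range(len(pitch)):
        if pitch[i] != 0:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = pitch[right]
        elif right is None:
            out[i] = pitch[left]
        else:
            w = (i - left) / (right - left)
            out[i] = pitch[left] + w * (pitch[right] - pitch[left])
    return out
```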

RuntimeError: Error(s) in loading state_dict for FastSpeech2:

When I tried to load the pretrained model output/LJSpeech/ckpt/900000.pth.tar, I got the following error:

size mismatch for encoder.src_word_emb.weight: copying a param with shape torch.Size([361, 256]) from checkpoint, the shape in current model is torch.Size([151, 256]).

The code that loads the model from the repo:

import torch
import yaml

from model import FastSpeech2  # assumes the script runs from the repository root

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

base_config_path = "config/LJSpeech"
prepr_path = f"{base_config_path}/preprocess.yaml"
model_path = f"{base_config_path}/model.yaml"
train_path = f"{base_config_path}/train.yaml"

prepr_config = yaml.load(open(prepr_path, "r"), Loader=yaml.FullLoader)
model_config = yaml.load(open(model_path, "r"), Loader=yaml.FullLoader)
train_config = yaml.load(open(train_path, "r"), Loader=yaml.FullLoader)
configs = (prepr_config, model_config, train_config)
ckpt_path = "output/LJSpeech/ckpt/900000.pth.tar"

def get_model(ckpt_path, configs):
    (preprocess_config, model_config, train_config) = configs
    model = FastSpeech2(preprocess_config, model_config).to(device)
    # map_location lets a GPU-saved checkpoint load on a CPU-only machine
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["model"])
    model.eval()
    # requires_grad_ is a method; assigning False to it only overwrites the attribute
    model.requires_grad_(False)
    return model
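
The mismatch is in the first dimension of `encoder.src_word_emb.weight` (361 rows in the checkpoint, 151 in the current model), which suggests the symbol table used to build the model differs from the one used when the checkpoint was trained. That reading is an assumption, not something the error message confirms. To see every mismatched parameter at once before calling `load_state_dict`, the shapes can be compared directly. A sketch using plain dictionaries of shapes, which in practice would be built from the two state dicts; the helper name and the `other.weight` entry are hypothetical:

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for entries that differ."""
    return {
        name: (ckpt_shapes[name], model_shapes[name])
        for name in ckpt_shapes
        if name in model_shapes and ckpt_shapes[name] != model_shapes[name]
    }


# Shapes from the error message above, plus one made-up matching entry.
ckpt_shapes = {
    "encoder.src_word_emb.weight": (361, 256),
    "other.weight": (256, 256),
}
model_shapes = {
    "encoder.src_word_emb.weight": (151, 256),
    "other.weight": (256, 256),
}
mismatches = find_shape_mismatches(ckpt_shapes, model_shapes)
print(mismatches)
```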

Unable to train with custom dataset

Hello! Thank you for your great work on this implementation. I am trying to train a model on a custom dataset (in a different language).
I didn't use MFA; I prepared the data in LJSpeech format and edited dataset.py and the preprocessing scripts. I didn't change symbols.py, but I apply transliteration_cleaners beforehand.

Original Traceback (most recent call last):
  File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/george/work/FastSpeech2/model/fastspeech2.py", line 81, in forward
    ) = self.variance_adaptor(
  File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/george/work/FastSpeech2/model/modules.py", line 121, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (79) must match the size of tensor b (620) at non-singleton dimension 1

It seems that something is wrong with the pitch embedding and the durations. As I didn't use MFA, I set the durations to a hardcoded value and didn't trim the pitch array at the preprocessing stage. How do I properly set up the data for training?

Thank you
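
The two sizes in the error (79 and 620) look like a phoneme count and a frame count, which would mean the pitch values are per frame while the model expects one value per phoneme, or the other way round. This is an inference from the numbers, not something the traceback states. A minimal sketch of reducing frame-level values to phoneme level with per-phoneme durations, which also checks the invariant that the durations sum to the number of frames. The function name is hypothetical:

```python
def average_by_duration(frame_values, durations):
    """Average frame-level values over each phoneme's span of frames."""
    if sum(durations) != len(frame_values):
        raise ValueError(
            "sum(durations)=%d but got %d frames"
            % (sum(durations), len(frame_values))
        )
    out, pos = [], 0
    for d in durations:
        span = frame_values[pos:pos + d]
        # A zero-length phoneme has no frames to average; use 0.0 as a placeholder.
        out.append(sum(span) / d if d > 0 else 0.0)
        pos += d
    return out
```

With hardcoded durations the length check above will normally fail, which is a useful early signal that the durations do not describe the audio.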

Question of Training Time

The README says:

The model takes less than 10k steps (less than 1 hour on my GTX1080 GPU) of training to generate audio samples with acceptable quality

I cloned this project and used the same dataset and parameters on two GTX 1080 Ti GPUs. The log is:

It seems to need 4.2 days to complete. Did I make a mistake somewhere?

pinyin phoneme modeling on aishell3 dataset

I have some questions about pinyin phoneme modeling on the AISHELL3 dataset:

1. When using the MFA forced-alignment tool to get the alignment data, is the acoustic model in MFA trained on the AISHELL3 dataset, or is it a pretrained acoustic model?

2. When using MFA to get the alignments, is the G2P model retrained on the AISHELL3 dataset scripts? Is the lexicon limited to the AISHELL3 dataset?

3. If there are prosody tags (#1, #2, ...) in the texts, how should the alignment be done with MFA?

Thanks in advance!

No speakers.json

Hello,
Thank you for your update.

Could you share your speakers.json?

problems on chinese dataset

Hi, I trained FastSpeech2 on the Biaobei dataset. The results suggest that the pitch predictor and the duration predictor don't work well, but if I input the ground truth, I get a good result. So does your demo use the ground-truth pitch and energy, or does it use the predictors to predict them? I found other FastSpeech2 implementations in TensorFlowTTS and ESPnet2 that use a structure different from the original paper: they apply the length regulator after all the predictors have finished, and their pitch and energy predictors work well.

Need citation information

Hi,

I'm working on writing a paper related to neural vocoder.
In that paper, I wanna use multi-speaker TTS for testing vocoders on multi-speaker dataset(VCTK).
So, I will add experimental sources for training and validating yours on VCTK. And I will add reference information about it.

Can you add bibtex on README.md?

How can one control the F0 at synthesis time?

Hi there,

I am wondering, since the FastSpeech 2 paper is about being able to control factors such as F0, how this can be done at synthesis time. I would like to add a scaling factor for F0 or duration, but cannot seem to do it in synthesize.py.

Thanks!
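
The general idea behind such a control is to multiply the predicted pitch by a factor before it is quantized into bins for the embedding lookup. A pure-Python sketch of that idea, assuming sorted bin boundaries and a multiplicative control factor. The names `control_pitch`, `pitch_bins` and `p_control` are illustrative, and this is not the repository's implementation:

```python
import bisect


def control_pitch(predicted_pitch, pitch_bins, p_control=1.0):
    """Scale predicted pitch values, then map each one to a bin index."""
    scaled = [p * p_control for p in predicted_pitch]
    # bisect_left returns the index of the first boundary >= the value.
    indices = [bisect.bisect_left(pitch_bins, p) for p in scaled]
    return scaled, indices


bins = [100.0, 200.0, 300.0]
pitch = [90.0, 150.0, 250.0]
print(control_pitch(pitch, bins))                 # unchanged pitch
print(control_pitch(pitch, bins, p_control=1.5))  # raised pitch moves to higher bins
```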

TypeError when python3 synthesize.py --step 300000

Traceback (most recent call last):
File "synthesize.py", line 113, in <module>
args.step), args.duration_control, args.pitch_control, args.energy_control)
File "synthesize.py", line 54, in synthesize
text, src_len, d_control=duration_control, p_control=pitch_control, e_control=energy_control)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_len'

Can you share the MST-Originbeat dataset? Thank you very much!

AttributeError: module 'torch' has no attribute 'bucketize'

I get the following error:

root@75adae8f35d1:/app# python3 synthesize.py --step 300000
|{DH AH0 N EY1 SH AH0 N Z T UH1 R IH2 Z AH0 M M IH1 N AH0 S T ER0 HH AE1 Z AO1 L S OW0 EH0 N K ER1 IH0 JH D AO2 S T R EY1 L Y AH0 N Z T UW1 T EY1 K DH EH1 R HH AA1 L AH0 D EY2 Z W IH0 DH IH1 N DH AH0 K AH1 N T R IY0 DH IH1 S Y IH1 R} |
Traceback (most recent call last):
  File "synthesize.py", line 94, in <module>
    synthesize(model, text, sentence, prefix='step_{}'.format(args.step))
  File "synthesize.py", line 48, in synthesize
    mel, mel_postnet, duration_output, f0_output, energy_output = model(text, src_pos)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/app/fastspeech2.py", line 33, in forward
    encoder_output, d_target, p_target, e_target, max_length)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/app/modules.py", line 47, in forward
    pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))
AttributeError: module 'torch' has no attribute 'bucketize'

I'm running on CPU, and I had to modify get_FastSpeech2 like this:

def get_FastSpeech2(num):
    checkpoint_path = os.path.join(hp.checkpoint_path, "checkpoint_{}.pth.tar".format(num))
    model = nn.DataParallel(FastSpeech2())
    # map_location remaps GPU-saved tensors so the checkpoint also loads on CPU
    map_location = None if torch.cuda.is_available() else torch.device('cpu')
    model.load_state_dict(torch.load(checkpoint_path, map_location=map_location)['model'])
    # requires_grad_ is the method that disables gradients; a plain attribute has no effect
    model.requires_grad_(False)
    model.eval()
    return model

to set map_location to cpu device.

The loss of variance_adaptor(Mandarin dataset)

Hi, I used a Mandarin dataset (BIAOBEI) to train FastSpeech2.
The mel loss and the PostNet mel loss seem fine.
But I found that the losses of the variance adaptor (Duration Loss, F0 Loss and Energy Loss) are really high.

The following is a part of my log:
Epoch [191/1000], Step [115650/608000]:
Total Loss: 68.7120, Mel Loss: 0.2892, Mel PostNet Loss: 0.2889, Duration Loss: 2.2572, F0 Loss: 59.7572, Energy Loss: 6.1195;
Time Used: 28493.331s, Estimated Time Remaining: 97800.871s.

How could I solve this?

Thank you.

train.py stops at step 3321

How can I resume training from step 3000? What is the command? Thanks.

Error when aligning the AISHELL3 corpus

Hello,

When I align the AISHELL3 corpus myself, I get an error, as follows:

Maybe the lexicon.txt file is wrong?

No unaligned.txt when training the alignment with the MFA tool

When I use the MFA tool to train on AISHELL3, there is no unaligned.txt file in the result folder. Should I enable some arguments when training?

Blizzard 2013 is a female voice data set

Hello, please update the README.md. I believe Blizzard2013 is a female voice dataset and the speaker is Catherine Byers.

Curious about 'max_mel_len' of 'mel_predictions' and 'mel_targets'

FastSpeech2/model/modules.py

Line 176 in 7011fa1

output = pad(output, max_len)

FastSpeech2/utils/tools.py

Line 303 in 7011fa1

max_len = max([input_ele[i].size(0) for i in range(len(input_ele))])

In my understanding, 'max_mel_len' of 'mel_predictions' is >= 'max_mel_len' of 'mel_targets', thus shape of 'mel_predictions' and 'mel_targets' is not always the same. How to calculate loss of two matrices with different shapes?

FastSpeech2/model/loss.py

Line 43 in 7011fa1

mel_targets = mel_targets[:, : mel_masks.shape[1], :]

Here, isn't 'mel_masks.shape[1]' == 'max_mel_len' of 'mel_predictions', and 'mel_masks.shape[1]' >= 'max_mel_len' of 'mel_targets'? I think this operation won't change shape of 'mel_targets'?

Share OS version problem with MFA

Previous OS: Ubuntu 16.04
OS that solved it: Ubuntu 18.04

I ran into a problem where the compiled MFA binary depends on library versions that are only available on Ubuntu 18.04 or later.

Judging from the log files, I guess the latest MFA shared library was compiled on Ubuntu 18.04 or later.

So I upgraded to Ubuntu 18.04, and then MFA ran successfully.

(* But be careful when upgrading!)

Feel free to close this issue after you see it.

Masking

Hello, thank you for your work. I have a question: what is the purpose of masking in your work? I assume it is not for hiding future steps, because FastSpeech2 is non-autoregressive.
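
In non-autoregressive models, masks are typically padding masks: sequences in a batch are padded to a common length, and the mask keeps the padded positions out of attention and out of the loss. A pure-Python sketch of that idea. `get_mask_from_lengths` mirrors the name of the helper quoted elsewhere in this thread (True marks padding), while `masked_mean_abs_error` is a hypothetical name for illustration:

```python
def get_mask_from_lengths(lengths, max_len=None):
    """Return one boolean row per sequence; True marks padded positions."""
    if max_len is None:
        max_len = max(lengths)
    return [[i >= n for i in range(max_len)] for n in lengths]


def masked_mean_abs_error(predictions, targets, mask):
    """Mean absolute error over non-padded positions only."""
    total, count = 0.0, 0
    for pred_row, tgt_row, mask_row in zip(predictions, targets, mask):
        for p, t, is_pad in zip(pred_row, tgt_row, mask_row):
            if not is_pad:
                total += abs(p - t)
                count += 1
    return total / count


mask = get_mask_from_lengths([3, 1])
# The second sequence has one real frame; its padded frames (9 vs 0) are ignored.
loss = masked_mean_abs_error([[1, 2, 3], [4, 9, 9]], [[1, 2, 5], [5, 0, 0]], mask)
print(mask, loss)
```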

textgrid

We are implementing the model using a different dataset. Are we supposed to produce new TextGrids, or is there a way to customize the LJSpeech TextGrids?
If we need to produce new TextGrids, which method is most appropriate for all the wav files?

Problem occurred when increasing batch size

Hi, thanks for your great work!
I was training this repo on my own dataset.
Everything was OK when I set the batch size to 64. I got the training log and the model checkpoints.
But when I tried to increase the batch size above 64 (there was enough free memory on my GPU), the training script crashed before printing a single line of log output or any error message.

GPU usage when the batch size was 64:

Any idea why this happened?
Thanks!

Is it possible to use something else instead of MFA

Hi, is it possible to use something else instead of MFA?

MFA is buggy as hell and such dependency is scary.

Modify model to allow JIT tracing

Hi, thanks for the repo! I am wondering if you have plans to convert the model to be JIT-traceable for exporting to C++? I tried to JIT trace and it generated some critical warnings:

FastSpeech2/env/lib/python3.7/site-packages/torch/tensor.py:593: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
FastSpeech2/utils/tools.py:97: TracerWarning: Converting a tensor to a NumPy array might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  max_len = max_len.detach().cpu().numpy()[0]
FastSpeech2/transformer/Models.py:82: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not self.training and src_seq.shape[1] > self.max_seq_len:
FastSpeech2/transformer/Models.py:90: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  :, :max_len, :
FastSpeech2/model/modules.py:186: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  expand_size = predicted[i].item()
FastSpeech2/model/modules.py:180: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  return output, torch.LongTensor(mel_len).to(device)
FastSpeech2/utils/tools.py:94: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  max_len = torch.max(lengths).item()
FastSpeech2/transformer/Models.py:145: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not self.training and enc_seq.shape[1] > self.max_seq_len:
FastSpeech2/transformer/Models.py:154: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  max_len = min(max_len, self.max_seq_len)
FastSpeech2/transformer/Models.py:158: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  dec_output = enc_seq[:, :max_len, :] + self.position_enc[
FastSpeech2/transformer/Models.py:159: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  :, :max_len, :
FastSpeech2/transformer/Models.py:161: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  mask = mask[:, :max_len]
FastSpeech2/transformer/Models.py:162: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  slf_attn_mask = slf_attn_mask[:, :, :max_len]

I made the following changes:

tools.py:91

def get_mask_from_lengths(lengths, max_len=None):
    batch_size = lengths.shape[0]
    if max_len is None:
        max_len = torch.max(lengths).item()
    else:
        print(max_len)
        max_len = max_len.detach().cpu().numpy()[0]
        print(max_len)
    ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
    mask = ids >= lengths.unsqueeze(1).expand(-1, max_len)

    return mask

and

synthesize:87

def synthesize(model, step, configs, vocoder, batchs, control_values):
    preprocess_config, model_config, train_config = configs
    pitch_control, energy_control, duration_control = control_values

    for batch in batchs:
        batch = to_device(batch, device)
        with torch.no_grad():
            traced_script_module = torch.jit.trace(
                model, (batch[2], batch[3], batch[4], torch.tensor([batch[5]]))
            )
            traced_script_module.save("traced_fastspeech_model.pt")

It seems like most of the issues are with max_len being used in conditionals and array slices. I will look into this more but wanted to see if you had tried this before

FastSpeech2 training error

After generating the TextGrid files with MFA and placing them in the preprocessed folder, I ran the scripts preprocess.py, prepare_align.py and preprocess.py separately and no error occurred. They created these files: alignment, energy, f0, mel, stat.txt, train.txt, val.txt.
Then I ran python train.py to train the model, but got the following error:

5 15] (50,) 00001943
[] (0,) [26 15 12 6 4 4 10 11 5 3 2 3 2 5 7 7 8 8 3 3 7 9 5 4
5 8 15 6 7 5 8 9 3 4 5 10 12 4 6 4 6 9 4 9 5 12] (46,) 00000539
Traceback (most recent call last):
File "train.py", line 238, in <module>
main(args)
File "train.py", line 108, in main
text, src_len, mel_len, D, f0, energy, max_src_len, max_mel_len)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/temp/fs2p/fastspeech2.py", line 36, in forward
encoder_output, src_mask, mel_mask, d_target, p_target, e_target, max_mel_len)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/temp/fs2p/modules.py", line 46, in forward
pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_target, self.pitch_bins))
RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1595629403081/work/torch/csrc/autograd/functions/utils.h":59, please report a bug to PyTorch.

In this case, I installed the stable PyTorch 1.6 release as described on the official PyTorch site, rather than the torch_nightly build below:
pip3 install --pre torch==1.6.0.dev20200428 -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

How do I fix this problem? Thanks in advance!

M2VoC training stuck?

Why does the M2VoC training get stuck with no output at all?

Huggingface support

Hi!

Thank you for the wonderful library. I've ported some of your code over to the huggingface framework. Please check out my repo and let me know what you think.

https://github.com/ontocord/fastspeech2_hf

raw data path

The preprocess.yaml file for LJSpeech contains the attribute raw_path: "./raw_data/LJSpeech". How do I obtain the raw data from LJSpeech-1.1?

Implementation of Pitch predictor

Hi,

Thanks for your contribution. As mentioned in the paper, for pitch they predict a pitch spectrogram along with the mean and variance; however, this is not implemented in the current repo. Could you kindly tell me whether I am wrong, or whether that part still needs to be implemented?
Regards
Harsh

Does mel_fmax matter?

I saw "please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder" about mel_fmax in preprocess.yaml.
But I didn't change it when using MelGAN, and the result doesn't seem bad.
Why is mel_fmax different for HiFi-GAN and MelGAN?

Problem with VarianceAdaptor implementation

Here :

FastSpeech2/modules.py

Line 48 in bd4c341

x = x + pitch_embedding

and then

FastSpeech2/modules.py

Line 50 in bd4c341

energy_prediction = self.energy_predictor(x)

But according to the paper, the input of the energy predictor should be the output of the length regulator, not the output of the pitch predictor.
See the FastSpeech 2 diagram: the input of the energy predictor is clearly x, the output of the length regulator, without the pitch component.
The actual code should look like this:

def forward(self, x, duration_target=None, pitch_target=None, energy_target=None, max_length=None):
    duration_prediction = self.duration_predictor(x)
    if duration_target is not None:
        x, mel_pos = self.length_regulator(x, duration_target, max_length)
    else:
        duration_rounded = torch.round(duration_prediction)
        x, mel_pos = self.length_regulator(x, duration_rounded)

    pitch_prediction = self.pitch_predictor(x)
    if pitch_target is not None:
        pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_target, self.pitch_bins))
    else:
        pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))

    energy_prediction = self.energy_predictor(x)
    if energy_target is not None:
        energy_embedding = self.energy_embedding(torch.bucketize(energy_target, self.energy_bins))
    else:
        energy_embedding = self.energy_embedding(torch.bucketize(energy_prediction, self.energy_bins))

    x = x + pitch_embedding
    x = x + energy_embedding

    return x, duration_prediction, pitch_prediction, energy_prediction, mel_pos

Speed control?

Thanks for the amazing work.
Does this implementation allow speed control (adjusting the speaking rate to be faster or slower)?
If not, where should I look to implement this feature?
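
In FastSpeech-style models, speaking rate follows from the per-phoneme durations: each phoneme's hidden vector is repeated for as many frames as its duration, so scaling the durations before that expansion changes the speed. A pure-Python sketch of the mechanism, assuming durations are given in frames. The names `scale_durations`, `length_regulate` and `d_control` are illustrative and not the repository's code:

```python
def scale_durations(durations, d_control=1.0):
    """Scale durations by d_control, round to whole frames, clamp at zero."""
    return [max(0, int(round(d * d_control))) for d in durations]


def length_regulate(phoneme_features, durations):
    """Repeat each phoneme's feature once per frame of its duration."""
    frames = []
    for feature, d in zip(phoneme_features, durations):
        frames.extend([feature] * d)
    return frames


phonemes = ["HH", "AH", "L"]
durations = [2, 4, 6]
slow = length_regulate(phonemes, scale_durations(durations, d_control=2.0))
fast = length_regulate(phonemes, scale_durations(durations, d_control=0.5))
print(len(slow), len(fast))
```

A factor above 1 lengthens every phoneme and slows the speech down; a factor below 1 speeds it up.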

AdaIN-VC speaker encoder support

Hi, thank you for your work, it's awesome.

I participated in your lecture about M2VoC. You shared the approach of using AdaIN-VC as speaker encoder, it was impressive.
But the code here seems to only support lookup tables, any plan for it?

FastSpeech2/model/fastspeech2.py

Line 38 in 76b2b65

self.speaker_emb = nn.Embedding(

Blizzard output

The output for the Blizzard dataset, which is sampled at 16 kHz, is not good. Have you tested the model on audio sampled at 16 kHz?

Some confusion in your visualization TensorBoard

When I train the FastPitch model from NVIDIA's source code, I get images similar to yours.
My training set has 11239 samples and my validation set has 1000.
I have noticed that the training and validation curves diverge more and more, which seems unusual.
Do your models really output speech? I am quite confused.
Thank you for sharing the code <3

MFA alignment is not accurate

Hi, I am using MFA to do forced alignment on my own dataset. I found that the alignment results are not accurate.

My dataset is 15 hours, which may not be big enough for training an acoustic model from scratch. After increasing my dataset to 30 hours, the alignments are still not accurate. Do you have any tricks for improving the alignment?

Error in data preprocessing

This line of code, wav = wav / max(abs(wav)) * max_wav_value,
introduces noise into the audio, like the following:

Multispeaker

First I wanna say that I love this repo, it performs excellently with many of my small (down to 80 seconds) and noisy datasets, having good transfer learning.
Multispeaker support when? Or is it technically implemented but not used yet?

Training LJSpeech as described in the README.md

Hi,
I'm trying to run the LJSpeech example described in the README.md but it fails during preprocessing.
Here's what I've done so far.

git clone https://github.com/ming024/FastSpeech2.git
cd FastSpeech2

I then downloaded LJSpeech.zip from https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing

Looking at what is in LJSpeech.zip

unzip -l LJSpeech.zip | head
Archive:  LJSpeech.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2021-02-13 20:13   TextGrid/
        0  2021-02-13 20:13   TextGrid/LJSpeech/
     7164  2021-02-13 20:06   TextGrid/LJSpeech/LJ001-0019.TextGrid
     4589  2021-02-13 20:06   TextGrid/LJSpeech/LJ001-0020.TextGrid
     3830  2021-02-13 20:06   TextGrid/LJSpeech/LJ001-0061.TextGrid
     8199  2021-02-13 20:06   TextGrid/LJSpeech/LJ001-0088.TextGrid
     8678  2021-02-13 20:06   TextGrid/LJSpeech/LJ001-0089.TextGrid
	 ....
     9461  2021-02-13 20:06   TextGrid/LJSpeech/LJ049-0158.TextGrid
     3698  2021-02-13 20:06   TextGrid/LJSpeech/LJ050-0230.TextGrid
     8067  2021-02-13 20:06   TextGrid/LJSpeech/LJ050-0261.TextGrid
     7719  2021-02-13 20:06   TextGrid/LJSpeech/LJ050-0272.TextGrid
---------                     -------
 87949622                     13102 files

Then the instructions say to unzip it `unzip -d preprocessed_data/LJSpeech/TextGrid/ LJSpeech.zip`.

Looking at the result from the previous command.

```bash
find preprocessed_data/LJSpeech/TextGrid
preprocessed_data/LJSpeech/TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ003-0286.TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ008-0301.TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ002-0235.TextGrid
....
```

I then symlink my LJSpeech wav files under raw_data

( cd raw_data && ln -fs  <base_dir>/LJ.Speech.Dataset/LJSpeech-1.1/wavs   LJSpeech; )

I then modified preprocess.yaml to match my paths.

cp config/LJSpeech/preprocess.yaml preprocess.yaml

4c4
<   corpus_path: "/home/ming/Data/LJSpeech-1.1"
---
>   corpus_path: "<base_dir>/LJ.Speech.Dataset/LJSpeech-1.1"
6c6,7
<   raw_path: "./raw_data/LJSpeech"
---
>   #raw_path: "./raw_data/LJSpeech"
>   raw_path: "./raw_data"

When I then run `python3 preprocess.py preprocess.yaml`, I get the following error about missing .lab files:

Processing Data ...
  0%|                                                                                                                                                  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "preprocess.py", line 15, in <module>
    preprocessor.build_from_path()
  File "/gpfs/fs1/nrc/ict/others/u/sam037/tts/FastSpeech2.git/preprocessor/preprocessor.py", line 79, in build_from_path
    ret = self.process_utterance(speaker, basename)
  File "/gpfs/fs1/nrc/ict/others/u/sam037/tts/FastSpeech2.git/preprocessor/preprocessor.py", line 179, in process_utterance
    with open(text_path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './raw_data/LJSpeech/LJ026-0087.lab'

What steps am I missing?
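
The traceback shows preprocessing trying to open `./raw_data/LJSpeech/LJ026-0087.lab`, which suggests it expects one transcript file next to each wav. The repository may already provide a preparation step for this, so checking the README first is worthwhile. As a rough illustration only, here is a minimal sketch that assumes a `.lab` file is just the plain-text transcript and that the corpus uses the LJSpeech `metadata.csv` layout (`id|raw text|normalized text`). The function name `write_lab_files` is hypothetical and not part of the repository.

```python
import csv
from pathlib import Path


def write_lab_files(metadata_csv, out_dir):
    """Write one <basename>.lab transcript per metadata row; return the count.

    Assumption: each row looks like `id|raw text|normalized text`, and the
    last column is the text that should go into the .lab file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(metadata_csv, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps double quotes that appear inside the transcripts.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            basename, text = row[0], row[-1]
            lab_path = out_dir / f"{basename}.lab"
            lab_path.write_text(text.strip() + "\n", encoding="utf-8")
            count += 1
    return count
```

With the paths from this report, a call might look like `write_lab_files("<base_dir>/LJ.Speech.Dataset/LJSpeech-1.1/metadata.csv", "./raw_data/LJSpeech")`. Any text cleaning the project applies before alignment is not reproduced here.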

Colab Notebook for Inference

Hi, thanks for sharing this repo!

I made a Colab notebook for inference only, which you can find HERE.

However, I get an error when executing the synthesize step:

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
|{F AO1 R AO2 L DH OW1 DH AH0 CH AY0 N IY1 Z T UH1 K IH0 M P R EH1 SH AH0 N Z F R AH1 M W UH1 D B L AA1 K S IH0 N G R EY1 V D IH0 N R IH0 L IY1 F F AO1 R S EH1 N CH ER0 IY0 Z B IH0 F AO1 R DH AH0 W UH1 D K AH2 T ER0 Z AH1 V DH AH0 N EH1 DH ER0 L AH0 N D Z} {B AY1 AH0 S IH1 M AH0 L ER0 P R AA1 S EH2 S}|
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.WaveGlow' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.conv.ConvTranspose1d' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.WN' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv1d' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.Invertible1x1Conv' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
Traceback (most recent call last):
  File "synthesize.py", line 86, in <module>
    synthesize(model, text, sentence, prefix='step_{}'.format(args.step))
  File "synthesize.py", line 60, in synthesize
    wave_glow = utils.get_WaveGlow()
  File "/content/FastSpeech2/utils.py", line 107, in get_WaveGlow
    wave_glow = wave_glow.remove_weightnorm(wave_glow)
  File "/content/FastSpeech2/glow.py", line 307, in remove_weightnorm
    WN.cond_layers = remove(WN.cond_layers)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 606, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'WN' object has no attribute 'cond_layers'

I am still able to generate speech, but it doesn't sound at all like it should.
HERE is the generated speech.

Any idea what's happening?
Thanks!

Male voice

First, thank you: I have solved the issue I opened, thanks to your support. As I understand it, both MelGAN (which I have tried) and WaveGlow (which I have not run yet) produce a female voice. To get a male voice, is it necessary to train the model from scratch, or to add support for a specific vocoder?

Thank you.

Solution for RuntimeError: isDifferentiableType(variable.scalar_type())

In modules.py, change

self.pitch_bins = nn.Parameter(torch.exp(torch.linspace(np.log(hp.f0_min), np.log(hp.f0_max), hp.n_bins-1)))
self.energy_bins = nn.Parameter(torch.linspace(hp.energy_min, hp.energy_max, hp.n_bins-1))

to

self.pitch_bins = torch.exp(torch.linspace(np.log(hp.f0_min), np.log(hp.f0_max), hp.n_bins-1)).cuda()
self.energy_bins = torch.linspace(hp.energy_min, hp.energy_max, hp.n_bins-1).cuda()

Originally posted by @Mao-JianGuo in #13 (comment)
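
For context, both versions of the pitch line compute the same values: `n_bins - 1` boundaries spaced evenly on a log scale between `f0_min` and `f0_max`. The change only turns them from `nn.Parameter` objects into plain tensors. The pure-Python sketch below shows what such boundaries look like and how a value could be mapped to a bin, assuming the boundaries are used for bucketing. The function names and the numeric values are hypothetical, not taken from the repository.

```python
import math
from bisect import bisect_left


def log_spaced_bins(f0_min, f0_max, n_bins):
    """Return n_bins - 1 boundaries spaced evenly on a log scale.

    Mirrors exp(linspace(log(f0_min), log(f0_max), n_bins - 1)).
    Requires n_bins >= 3 so that at least two boundaries exist.
    """
    count = n_bins - 1
    lo, hi = math.log(f0_min), math.log(f0_max)
    step = (hi - lo) / (count - 1)
    return [math.exp(lo + i * step) for i in range(count)]


def bucket_index(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted in ascending order; values below the first
    boundary map to 0 and values above the last map to len(boundaries).
    """
    return bisect_left(boundaries, value)


# Hypothetical values; the real ones come from hp.f0_min, hp.f0_max, hp.n_bins.
bins = log_spaced_bins(f0_min=80.0, f0_max=800.0, n_bins=5)
```

Because the boundaries are fixed constants in this sketch, nothing about them needs a gradient.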

Error while training Russian model

Hi, thanks for your work! I have a problem. I am trying to train the model on Russian. The only thing I changed in your code is the cleaner: I specified transliteration_cleaner, since Cyrillic can be transliterated. I completed the TextGrid creation step and the preprocess.py step, then ran train.py. An error occurs:

How can I solve this problem?

MFA alignment

I tried to use the Mandarin acoustic model provided by MFA, but it failed. How should I train an acoustic model, or could you share the acoustic model used in this project?

multi-language support?

I have tried your project for Mandarin ("普通话"). When the input text mixes Chinese characters with numbers, the generated audio skips the numbers. For example, with "台风奥马尔是继1976年的台风帕梅拉以来吹袭关岛的最强台风" ("Typhoon Omar was the strongest typhoon to hit Guam since Typhoon Pamela in 1976"), "1976" is missing from the output .wav. How can I get "1976" pronounced in the result?
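
One possible workaround is to spell out digits as Chinese characters before the text reaches the synthesizer, so the front end never sees Arabic numerals. The sketch below is a minimal illustration and not part of the repository; `spell_digits` is a hypothetical helper. It reads numbers digit by digit, which matches how years such as 1976 are usually read, but it would be wrong for quantities (for example, 10 read as "ten"), which need a proper number-to-words conversion.

```python
# Chinese numerals indexed by digit value, 0 through 9.
CHINESE_DIGITS = "零一二三四五六七八九"


def spell_digits(text):
    """Replace each ASCII digit with its Chinese numeral, digit by digit."""
    return "".join(
        CHINESE_DIGITS[int(ch)] if ch in "0123456789" else ch
        for ch in text
    )
```

For the sentence in this report, `spell_digits` would turn "1976年" into "一九七六年" and leave every other character untouched.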