ming024 / FastSpeech2
An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
License: MIT License
The TextGrid files are at \preprocessed_data\LJSpeech\TextGrid, and I ran preprocess.py. The output log looks like this:
(pytorch_fs02) D:\workspace_tts\FastSpeech2-master>python preprocess.py config/LJSpeech/preprocess.yaml
Processing Data ...
100%| | 1/1 [00:00<00:00, 1.47it/s]
Computing statistic quantities ...
Traceback (most recent call last):
File "preprocess.py", line 15, in
preprocessor.build_from_path()
File "D:\workspace_tts\FastSpeech2-master\preprocessor\preprocessor.py", line 100, in build_from_path
pitch_mean = pitch_scaler.mean_[0]
AttributeError: 'StandardScaler' object has no attribute 'mean_'
I need help.
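One plausible reading of this traceback (an assumption, the log does not confirm it) is that no utterance ever reached the scaler: scikit-learn's StandardScaler only gains its mean_ attribute once fit or partial_fit has been called at least once. The sketch below is a plain-Python stand-in for that incremental behaviour, not scikit-learn's code; the class name RunningStats is hypothetical. It fails with an explicit message when nothing was collected, which is the check that would have made the cause visible.

```python
import math


class RunningStats:
    """Accumulates mean and standard deviation incrementally (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, values):
        # Comparable in spirit to calling partial_fit on one utterance's values.
        for v in values:
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (v - self.mean)

    def finalize(self):
        if self.count == 0:
            # Hypothetical message: with zero samples there is no mean to report.
            raise ValueError(
                "no values were collected; check that wav and TextGrid names match"
            )
        return self.mean, math.sqrt(self._m2 / self.count)


stats = RunningStats()
stats.update([1.0, 2.0, 3.0])
print(stats.finalize())
```

If finalize raises on real data, the place to look is the directory layout that the preprocessing loop walks, since every utterance was skipped before its pitch was measured.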
How do I find the parameters below for other datasets?
f0_min = 71.0
f0_max = 795.8
energy_min = 0.0
energy_max = 315.0
#f0_min = 71.0
#f0_max = 786.7
#energy_min = 21.23
#energy_max = 101.02
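A minimal sketch of one way to obtain such bounds, assuming the per-utterance F0 and energy contours are already available as lists of numbers and that an F0 of 0 marks an unvoiced frame. The function name value_range and the toy data are hypothetical, not part of the repository.

```python
def value_range(utterances, skip_zeros=False):
    """Return (min, max) over all values in a list of per-utterance contours."""
    lo, hi = float("inf"), float("-inf")
    for values in utterances:
        for v in values:
            if skip_zeros and v == 0:
                continue  # assumption: 0 encodes an unvoiced frame, not a real pitch
            lo = min(lo, v)
            hi = max(hi, v)
    if lo > hi:
        raise ValueError("no usable values found")
    return lo, hi


# Toy contours standing in for real extracted features.
f0_contours = [[0.0, 100.0, 200.0], [0.0, 71.0]]
energy_contours = [[0.0, 12.5], [3.0, 315.0]]

f0_min, f0_max = value_range(f0_contours, skip_zeros=True)
energy_min, energy_max = value_range(energy_contours)
print(f0_min, f0_max, energy_min, energy_max)
```

Zeros are skipped for F0 because the bins shown elsewhere on this page are log-spaced, and a lower bound of 0 has no logarithm.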
Hello @ming024 ,
First of all, thank you for sharing your awesome work.
I leave this question due to my curiosity about duration predictor improvements.
A few months ago, evaluation performance of duration predictor seemed to be not good due to overfitting.
(At that time, train error was less than 0.01, but eval error was 0.5~0.6.)
But, now it has been drastically improved (in the past, eval error was 0.5~0.6, but now 0.08 ~ 0.1).
If you don't mind, could you tell me what was the problem of your previous version of duration predictor?
Always appreciate,
How can I create the TextGrid files if I want to try other datasets?
HiFi-GAN achieves state-of-the-art results in waveform generation from mel spectrograms.
Is it possible to add support for a HiFi-GAN model, after the mel generation, in order to create the wave file?
mel, mel_postnet, log_duration_output, f0_output, energy_output, _, _, mel_len = model(text, src_len)
mel_torch = mel.transpose(1, 2)
mel_postnet_torch = mel_postnet.transpose(1, 2)
mel = mel[0].cpu().transpose(0, 1)
mel_postnet = mel_postnet[0].cpu().transpose(0, 1)
f0_output = f0_output[0].cpu().numpy()
energy_output = energy_output[0].cpu().numpy()
if not os.path.exists(hp.test_path):
    os.makedirs(hp.test_path)
if melgan is not None:
    with torch.no_grad():
        wav = melgan.inference(mel_torch).cpu().numpy()  # use HiFi-GAN here?
        wav = wav.astype('int16')
    # ipd.display(ipd.Audio(wav, rate=hp.sampling_rate))
    # save audio file
    write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, wav)
Or would some additional adaptation be needed?
In the case of end-to-end inference with HiFi-GAN, the generation code would look like:
def inference(a):
    generator = Generator(h).to(device)
    state_dict_g = load_checkpoint(a.checkpoint_file, device)
    generator.load_state_dict(state_dict_g['generator'])
    generator.eval()
    generator.remove_weight_norm()
    with torch.no_grad():
        x = torch.FloatTensor(mel_torch).to(device)
        y_g_hat = generator(x)
        audio = y_g_hat.squeeze()
        audio = audio * MAX_WAV_VALUE
        audio = audio.cpu().numpy().astype('int16')
        write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, audio)
where mel_torch is our mel spectrogram.
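The scaling step in the snippet above is easy to get wrong, so here is a small sketch of it in isolation. It assumes the vocoder returns a float waveform roughly in [-1, 1], as the multiplication by MAX_WAV_VALUE suggests, and that MAX_WAV_VALUE is 32768.0. The helper names float_to_int16 and write_wav are hypothetical; the clipping is an addition that the snippet above does not perform.

```python
import struct
import wave

MAX_WAV_VALUE = 32768.0  # assumed value of the constant used in the snippet above


def float_to_int16(samples):
    """Scale float samples in [-1, 1] to 16-bit PCM, clipping out-of-range values."""
    pcm = []
    for s in samples:
        v = int(round(s * MAX_WAV_VALUE))
        pcm.append(max(-32768, min(32767, v)))  # avoid integer wrap-around at +1.0
    return pcm


def write_wav(path, samples, sampling_rate):
    """Write a mono 16-bit wav file using only the standard library."""
    pcm = float_to_int16(samples)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = int16
        f.setframerate(sampling_rate)
        f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


print(float_to_int16([0.0, 1.0, -1.0, 0.5]))
```

Without the clip, a sample of exactly 1.0 scales to 32768, which does not fit in int16 and wraps to a loud negative click when cast.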
FastSpeech2/preprocessor/preprocessor.py
Line 207 in 7011fa1
As far as I know, pitch(F0) is the bottom yellow line on the spectrogram, and the pitch before interpolation looks similar to it. What's the reason to perform interpolation on pitch?
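A common explanation, offered here as an assumption rather than the author's stated reason: pitch extractors report 0 for unvoiced frames, so the raw contour jumps between real F0 values and 0. Filling the gaps gives a continuous contour that is easier to average per phoneme and to regress. The sketch below shows linear interpolation across zero-valued frames in plain Python; it is not the repository's code, and the function name is hypothetical.

```python
def interpolate_unvoiced(pitch):
    """Fill zero (unvoiced) frames by linear interpolation between voiced neighbours."""
    voiced = [i for i, p in enumerate(pitch) if p != 0]
    if not voiced:
        return list(pitch)  # nothing to interpolate from
    out = list(pitch)
    # Before the first and after the last voiced frame, hold the nearest voiced value.
    for i in range(voiced[0]):
        out[i] = pitch[voiced[0]]
    for i in range(voiced[-1] + 1, len(pitch)):
        out[i] = pitch[voiced[-1]]
    # Between two voiced frames, walk a straight line from one to the other.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            out[i] = pitch[left] + t * (pitch[right] - pitch[left])
    return out


print(interpolate_unvoiced([0, 100, 0, 120, 0]))
```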
When loading output/LJSpeech/ckpt/900000.pth.tar, I get an error: size mismatch for encoder.src_word_emb.weight: copying a param with shape torch.Size([361, 256]) from checkpoint, the shape in current model is torch.Size([151, 256]).
import torch
import yaml

base_config_path = "config/LJSpeech"
prepr_path = f"{base_config_path}/preprocess.yaml"
model_path = f"{base_config_path}/model.yaml"
train_path = f"{base_config_path}/train.yaml"
prepr_config = yaml.load(open(prepr_path, "r"), Loader=yaml.FullLoader)
model_config = yaml.load(open(model_path, "r"), Loader=yaml.FullLoader)
train_config = yaml.load(open(train_path, "r"), Loader=yaml.FullLoader)
configs = (prepr_config, model_config, train_config)
ckpt_path = "output/LJSpeech/ckpt/900000.pth.tar"

def get_model(ckpt_path, configs):
    (preprocess_config, model_config, train_config) = configs
    model = FastSpeech2(preprocess_config, model_config).to(device)
    # map_location lets a checkpoint saved on another device load onto this one
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["model"])
    model.eval()
    # requires_grad_ is a method; assigning False to it would not freeze anything
    model.requires_grad_(False)
    return model
Hello! Thank you for your great work on this implementation. I am trying to train a model on a custom dataset (a different language).
I didn't use MFA; I prepared the data in LJSpeech format and edited dataset.py and the preprocessing scripts. I didn't change symbols.py, but I apply transliteration_cleaners beforehand.
Original Traceback (most recent call last):
File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/george/work/FastSpeech2/model/fastspeech2.py", line 81, in forward
) = self.variance_adaptor(
File "/home/george/anaconda3/envs/reface_tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/george/work/FastSpeech2/model/modules.py", line 121, in forward
x = x + pitch_embedding
RuntimeError: The size of tensor a (79) must match the size of tensor b (620) at non-singleton dimension 1
It seems that something is wrong with the pitch embedding and the durations. Since I didn't use MFA, I set the duration to a hardcoded value and didn't trim the pitch array at the preprocessing stage. How do I properly set up the data for training?
Thank you
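One way to read the sizes in the error (an assumption, the traceback alone does not prove it): 79 is the number of phonemes and 620 the number of mel frames, so a frame-level pitch contour is being added to a phoneme-level sequence. The sketch below shows the two constraints such data would have to satisfy: durations must sum to the number of frames, and frame-level values are reduced to one value per phoneme by averaging over each phoneme's frames. The function name is hypothetical.

```python
def average_by_duration(frame_values, durations):
    """Reduce a frame-level contour to one value per phoneme using frame durations."""
    if sum(durations) != len(frame_values):
        raise ValueError(
            "durations sum to %d but there are %d frames"
            % (sum(durations), len(frame_values))
        )
    averaged = []
    pos = 0
    for d in durations:
        segment = frame_values[pos:pos + d]
        # A phoneme with zero frames gets 0.0 rather than a division by zero.
        averaged.append(sum(segment) / d if d > 0 else 0.0)
        pos += d
    return averaged


# 5 frames spread over 3 phonemes lasting 2, 2 and 1 frames.
print(average_by_duration([1.0, 3.0, 5.0, 5.0, 2.0], [2, 2, 1]))
```

Hardcoded durations will generally fail the first check, which is why an aligner (or some other source of real per-phoneme durations) is needed before training.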
The README says:
The model takes less than 10k steps (less than 1 hour on my GTX1080 GPU) of training to generate audio samples with acceptable quality
I cloned this project and used the same dataset and parameters on two GTX 1080 Ti GPUs. The log is:
It seems to need 4.2 days to complete. Did I make a mistake somewhere?
I have some questions about pinyin phoneme modeling on the AISHELL-3 dataset:
1. When using the MFA forced-alignment tool to get alignment data, is the acoustic model in MFA trained on AISHELL-3, or is it a pretrained acoustic model?
2. When using MFA to get alignments, is the G2P model retrained on the AISHELL-3 scripts? Is the lexicon limited to AISHELL-3?
3. If there are prosody tags (#1, #2, ...) in the text, how do you do alignment with MFA?
Thanks in advance!
Hello,
Thank you for your update.
Could you share your speaker.json?
Hi, I trained FastSpeech 2 on the Biaobei dataset. The results suggest that the pitch predictor and the duration predictor don't work well, but if I feed in the ground truth I get a good result. So does your demo use the ground-truth pitch and energy, or does it use the predictors to predict pitch and energy? I looked at other FastSpeech 2 code in TensorFlowTTS and ESPnet2; they use a structure different from the original paper. They apply the length regulator after all the predictors have finished, and their pitch and energy predictors work well.
Hi,
I'm working on writing a paper related to neural vocoder.
In that paper, I wanna use multi-speaker TTS for testing vocoders on multi-speaker dataset(VCTK).
So, I will add experimental sources for training and validating yours on VCTK. And I will add reference information about it.
Can you add bibtex on README.md?
Hi there,
I am wondering, since the FastSpeech 2 paper is about being able to control factors such as F0, how this can be done at synthesis time. I would like to add a scaling factor for F0 or duration, but cannot seem to do it in synthesize.py.
Thanks!
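A minimal sketch of what such scaling factors could look like, using the names duration_control and pitch_control that appear in the synthesize.py traceback on this page. It assumes the duration predictor outputs log(duration + 1) in frames; that convention, and the function name apply_controls, are assumptions for illustration rather than the repository's confirmed behaviour.

```python
import math


def apply_controls(log_durations, pitch, duration_control=1.0, pitch_control=1.0):
    """Scale predicted durations and pitch values before length regulation."""
    durations = []
    for d in log_durations:
        frames = (math.exp(d) - 1.0) * duration_control  # undo the assumed log(d + 1)
        durations.append(max(0, round(frames)))  # never expand by a negative count
    scaled_pitch = [p * pitch_control for p in pitch]
    return durations, scaled_pitch


# duration_control > 1 slows speech down; pitch_control < 1 lowers the pitch.
log_d = [math.log(4), math.log(3)]  # i.e. 3 and 2 frames
print(apply_controls(log_d, [100.0, 200.0], duration_control=2.0, pitch_control=0.5))
```

The same duration factor also answers the speaking-rate question raised elsewhere on this page: values below 1.0 shorten every phoneme and so speed the utterance up.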
Traceback (most recent call last):
File "synthesize.py", line 113, in
args.step), args.duration_control, args.pitch_control, args.energy_control)
File "synthesize.py", line 54, in synthesize
text, src_len, d_control=duration_control, p_control=pitch_control, e_control=energy_control)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/ant/.conda/envs/TTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_len'
I get the following error:
root@75adae8f35d1:/app# python3 synthesize.py --step 300000
|{DH AH0 N EY1 SH AH0 N Z T UH1 R IH2 Z AH0 M M IH1 N AH0 S T ER0 HH AE1 Z AO1 L S OW0 EH0 N K ER1 IH0 JH D AO2 S T R EY1 L Y AH0 N Z T UW1 T EY1 K DH EH1 R HH AA1 L AH0 D EY2 Z W IH0 DH IH1 N DH AH0 K AH1 N T R IY0 DH IH1 S Y IH1 R} |
Traceback (most recent call last):
File "synthesize.py", line 94, in <module>
synthesize(model, text, sentence, prefix='step_{}'.format(args.step))
File "synthesize.py", line 48, in synthesize
mel, mel_postnet, duration_output, f0_output, energy_output = model(text, src_pos)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/app/fastspeech2.py", line 33, in forward
encoder_output, d_target, p_target, e_target, max_length)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/app/modules.py", line 47, in forward
pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))
AttributeError: module 'torch' has no attribute 'bucketize'
I'm running on CPU, and I had to modify get_FastSpeech2 like this:
def get_FastSpeech2(num):
    checkpoint_path = os.path.join(hp.checkpoint_path, "checkpoint_{}.pth.tar".format(num))
    model = nn.DataParallel(FastSpeech2())
    if torch.cuda.is_available():
        model.load_state_dict(torch.load(checkpoint_path)['model'])
    else:
        model.load_state_dict(torch.load(checkpoint_path, map_location=torch.device('cpu'))['model'])
    # requires_grad_() freezes the parameters; setting a requires_grad attribute on the module has no effect
    model.requires_grad_(False)
    model.eval()
    return model
to set map_location to the CPU device.
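The AttributeError above means the installed PyTorch predates torch.bucketize; the install command quoted in another issue on this page pins a 1.6.0 nightly, which suggests the function arrived around that version. Upgrading is the direct fix. For reference, the sketch below reproduces the default behaviour (right=False) on plain Python lists with the standard library, which makes clear what the call computes: for each value, the index of the first boundary that is greater than or equal to it.

```python
from bisect import bisect_left


def bucketize(values, boundaries):
    """Map each value to the index of the bin it falls in.

    boundaries must be sorted ascending. A value equal to a boundary lands in
    the bin that ends at that boundary, matching torch.bucketize(right=False).
    """
    return [bisect_left(boundaries, v) for v in values]


print(bucketize([0.5, 1.0, 2.5, 9.0], [1.0, 2.0, 3.0]))
```

With n boundaries the result ranges over 0..n, so an embedding table indexed by it needs n + 1 rows; that is consistent with the n_bins-1 boundaries used in the modules.py snippet quoted further down.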
Hi, I used Mandarin dataset(BIAOBEI) to train FastSpeech2.
The loss of mel and PostNet mel seems no problem.
But I find out that the loss of variance_adaptor (Duration Loss, F0 Loss and Energy Loss) is really high.
The following is a part of my log:
Epoch [191/1000], Step [115650/608000]:
Total Loss: 68.7120, Mel Loss: 0.2892, Mel PostNet Loss: 0.2889, Duration Loss: 2.2572, F0 Loss: 59.7572, Energy Loss: 6.1195;
Time Used: 28493.331s, Estimated Time Remaining: 97800.871s.
How could I solve this?
Thank you.
How can I resume training from step 3000? What is the command? Thanks.
When I use the MFA tool to train on AISHELL-3, I do not get an unaligned.txt file in the result folder. Should I enable some arguments when training?
Hello, please update the README.md. I believe Blizzard2013 is a female-voice dataset and the speaker is Catherine Byers.
Line 176 in 7011fa1
Line 303 in 7011fa1
In my understanding, 'max_mel_len' of 'mel_predictions' is >= 'max_mel_len' of 'mel_targets', thus shape of 'mel_predictions' and 'mel_targets' is not always the same. How to calculate loss of two matrices with different shapes?
Line 43 in 7011fa1
Here, isn't 'mel_masks.shape[1]' == 'max_mel_len' of 'mel_predictions', and 'mel_masks.shape[1]' >= 'max_mel_len' of 'mel_targets'? I think this operation won't change shape of 'mel_targets'?
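The general technique for comparing padded sequences of unequal length can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the repository's loss code: both tensors are cut to their common number of frames, and only frames inside each utterance's true length contribute to the mean absolute error. The function name masked_l1 and the nested-list layout (batch, frames, mel bins) are hypothetical.

```python
def masked_l1(predictions, targets, lengths):
    """Mean absolute error over unpadded frames only.

    predictions, targets: [batch][frame][mel_bin]; lengths: true frame counts.
    """
    common = min(len(predictions[0]), len(targets[0]))  # truncate to the shorter padding
    total, count = 0.0, 0
    for pred, tgt, n in zip(predictions, targets, lengths):
        for t in range(min(n, common)):  # frames past the true length are padding
            for p, q in zip(pred[t], tgt[t]):
                total += abs(p - q)
                count += 1
    return total / count


preds = [[[1.0], [2.0], [0.0]], [[1.0], [1.0], [1.0]]]                # padded to 3 frames
tgts = [[[1.0], [4.0], [9.0], [9.0]], [[0.0], [1.0], [3.0], [9.0]]]   # padded to 4 frames
print(masked_l1(preds, tgts, lengths=[2, 3]))
```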
I ran into a problem: the compiled MFA binary depends on shared-library versions that in turn depend on the Ubuntu version (18.04 or newer).
Judging from the log files, the latest MFA shared library was probably compiled on Ubuntu 18.04 or newer.
So I upgraded to Ubuntu 18.04, after which MFA ran successfully.
(But be careful when doing this!)
Feel free to close this issue after you have seen it.
Hello, thank you for your work. I have a question: what is the purpose of masking in your work? I assume it is not for hiding future steps, since FastSpeech 2 is non-autoregressive.
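In non-autoregressive models, masks usually exist for padding rather than causality: utterances in a batch have different lengths, and the padded positions must not receive attention weight or contribute to the loss. The sketch below illustrates that idea in plain Python; it mirrors the shape of the get_mask_from_lengths helper quoted elsewhere on this page but is not the repository's code, and masked_softmax is a hypothetical name.

```python
import math


def padding_mask(lengths, max_len=None):
    """True marks a padded position, False a real one."""
    if max_len is None:
        max_len = max(lengths)
    return [[i >= n for i in range(max_len)] for n in lengths]


def masked_softmax(scores, mask):
    """Softmax over one row of attention scores, ignoring padded positions.

    Assumes at least one position in the row is unmasked.
    """
    masked = [float("-inf") if m else s for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = padding_mask([2, 3])
print(mask)
print(masked_softmax([0.0, 0.0, 5.0], mask[0]))
```

In the second call the padded third position has the largest raw score, yet it ends up with zero weight, which is the whole point of the mask.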
We are implementing the model using a different dataset, are we supposed to produce new text grids or is there a way we can customize ljspeechtextgrids?
If we're to produce new text grids, which method is the most appropriate to use for all the wav files?
Hi, thanks for your great work!
I was training this repo on my own dataset.
Everything was OK when I set the batch size to 64. I got the training log and model checkpoints.
But when I tried to increase the batch size above 64 (there was enough free memory on my GPU), training.py crashed before it printed a single line of log output or any error message.
GPU usage when the batch size was 64:
Any idea how this happened?
Thanks!
Hi, is it possible to use something else instead of MFA?
MFA is buggy as hell and such dependency is scary.
Hi, thanks for the repo! I am wondering if you have plans to convert the model to be JIT-traceable for exporting to C++? I tried to JIT trace and it generated some critical warnings:
FastSpeech2/env/lib/python3.7/site-packages/torch/tensor.py:593: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
'incorrect results).', category=RuntimeWarning)
FastSpeech2/utils/tools.py:97: TracerWarning: Converting a tensor to a NumPy array might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
max_len = max_len.detach().cpu().numpy()[0]
FastSpeech2/transformer/Models.py:82: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not self.training and src_seq.shape[1] > self.max_seq_len:
FastSpeech2/transformer/Models.py:90: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
:, :max_len, :
FastSpeech2/model/modules.py:186: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
expand_size = predicted[i].item()
FastSpeech2/model/modules.py:180: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return output, torch.LongTensor(mel_len).to(device)
FastSpeech2/utils/tools.py:94: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
max_len = torch.max(lengths).item()
FastSpeech2/transformer/Models.py:145: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not self.training and enc_seq.shape[1] > self.max_seq_len:
FastSpeech2/transformer/Models.py:154: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
max_len = min(max_len, self.max_seq_len)
FastSpeech2/transformer/Models.py:158: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
dec_output = enc_seq[:, :max_len, :] + self.position_enc[
FastSpeech2/transformer/Models.py:159: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
:, :max_len, :
FastSpeech2/transformer/Models.py:161: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
mask = mask[:, :max_len]
FastSpeech2/transformer/Models.py:162: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
slf_attn_mask = slf_attn_mask[:, :, :max_len]
I made the following changes:
tools.py:91
def get_mask_from_lengths(lengths, max_len=None):
    batch_size = lengths.shape[0]
    if max_len is None:
        max_len = torch.max(lengths).item()
    else:
        print(max_len)
        max_len = max_len.detach().cpu().numpy()[0]
        print(max_len)
    ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
    mask = ids >= lengths.unsqueeze(1).expand(-1, max_len)
    return mask
and
synthesize:87
def synthesize(model, step, configs, vocoder, batchs, control_values):
    preprocess_config, model_config, train_config = configs
    pitch_control, energy_control, duration_control = control_values
    for batch in batchs:
        batch = to_device(batch, device)
        with torch.no_grad():
            traced_script_module = torch.jit.trace(
                model, (batch[2], batch[3], batch[4], torch.tensor([batch[5]]))
            )
            traced_script_module.save("traced_fastspeech_model.pt")
It seems like most of the issues are with max_len being used in conditionals and array slices. I will look into this more, but wanted to see if you had tried this before.
In this case, after generating the TextGrid files with MFA and placing them in the preprocessed folder, I ran the scripts preprocess.py, prepare_align.py and preprocess.py separately with no errors, which created these files: alignment energy f0 mel stat.txt train.txt val.txt.
Then I ran python train.py to train the model, but got the following error:
5 15] (50,) 00001943
[] (0,) [26 15 12 6 4 4 10 11 5 3 2 3 2 5 7 7 8 8 3 3 7 9 5 4
5 8 15 6 7 5 8 9 3 4 5 10 12 4 6 4 6 9 4 9 5 12] (46,) 00000539
Traceback (most recent call last):
File "train.py", line 238, in
main(args)
File "train.py", line 108, in main
text, src_len, mel_len, D, f0, energy, max_src_len, max_mel_len)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/temp/fs2p/fastspeech2.py", line 36, in forward
encoder_output, src_mask, mel_mask, d_target, p_target, e_target, max_mel_len)
File "/home/speechlab/anaconda3/envs/fs2p/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/speechlab/temp/fs2p/modules.py", line 46, in forward
pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_target, self.pitch_bins))
RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1595629403081/work/torch/csrc/autograd/functions/utils.h":59, please report a bug to PyTorch.
(fs2p) [speechlab@localhost fs2p]$ RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1595629403081/work/torch/csrc/autograd/functions/utils.h":59, please report a bug to PyTorch.
In this case, I installed the PyTorch 1.6 stable release from the official PyTorch site, not the torch_nightly build:
pip3 install --pre torch==1.6.0.dev20200428 -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
How do I fix this problem? Thanks in advance!
Why does the training of M2voc keep stuck and there is no information output?
Hi!
Thank you for the wonderful library. I've ported some of your code over to the huggingface framework. Please check out my repo and let me know what you think.
The preprocess.yaml file for LJSpeech contains the attribute raw_path: "./raw_data/LJSpeech". How do I obtain the raw data from LJSpeech-1.1?
Hi,
Thanks for your contribution. As mentioned in the paper, for pitch they predict the pitch spectrogram, mean and variance; however, this is not implemented in the current repo. Could you kindly tell me if I am wrong, or whether that part still needs to be implemented?
Regards
Harsh
I saw "please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder" about mel_fmax in preprocess.yaml.
But I didn't change it when using MelGAN, and the result seems fine.
Why is mel_fmax different for HiFi-GAN and MelGAN?
Here :
Line 48 in bd4c341
and then
Line 50 in bd4c341
But according to the paper, the input of the energy predictor should be the output of the length regulator, not the output of the pitch predictor.
See the FastSpeech 2 diagram: the input of the energy predictor is clearly x, the output of the length regulator without the pitch component.
The actual code should look like:
def forward(self, x, duration_target=None, pitch_target=None, energy_target=None, max_length=None):
    duration_prediction = self.duration_predictor(x)
    if duration_target is not None:
        x, mel_pos = self.length_regulator(x, duration_target, max_length)
    else:
        duration_rounded = torch.round(duration_prediction)
        x, mel_pos = self.length_regulator(x, duration_rounded)
    pitch_prediction = self.pitch_predictor(x)
    if pitch_target is not None:
        pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_target, self.pitch_bins))
    else:
        pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))
    energy_prediction = self.energy_predictor(x)
    if energy_target is not None:
        energy_embedding = self.energy_embedding(torch.bucketize(energy_target, self.energy_bins))
    else:
        energy_embedding = self.energy_embedding(torch.bucketize(energy_prediction, self.energy_bins))
    x = x + pitch_embedding
    x = x + energy_embedding
    return x, duration_prediction, pitch_prediction, energy_prediction, mel_pos
Thanks for the amazing work.
Does this implementation allow speed control (making the speaking rate faster or slower)?
If not, where should I look to implement this feature?
Hi, thank you for your work, it's awesome.
I participated in your lecture about M2VoC. You shared the approach of using AdaIN-VC as speaker encoder, it was impressive.
But the code here seems to only support lookup tables, any plan for it?
FastSpeech2/model/fastspeech2.py
Line 38 in 76b2b65
The output for the Blizzard dataset, which is sampled at 16 kHz, is not good. Have you tested the model on 16 kHz audio?
When I train the FastPitch model from NVIDIA's source code, I get plots like yours.
My training set has 11239 samples and my validation set has 1000.
I have seen the training and validation curves drift further and further apart, which does not seem normal.
Do your models really output speech or not? I am quite confused.
Thank you for sharing the code <3
Hi, I am using MFA to do forced alignment on my own dataset. I found that the alignment results are not accurate.
My dataset is 15 hours, which may not be big enough for training an acoustic model from scratch. After increasing my dataset to 30 hours, the alignments are still not accurate. Do you have any tricks to improve the alignment?
First I wanna say that I love this repo, it performs excellently with many of my small (down to 80 seconds) and noisy datasets, having good transfer learning.
Multispeaker support when? Or is it technically implemented but not used yet?
Hi,
I'm trying to run the LJSpeech example described in the README.md but it fails during preprocessing.
Here's what I've done so far.
git clone https://github.com/ming024/FastSpeech2.git
cd FastSpeech2
I then downloaded LJSpeech.zip from https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing
Looking at what is in LJSpeech.zip:
unzip -l LJSpeech.zip | head
Archive: LJSpeech.zip
Length Date Time Name
--------- ---------- ----- ----
0 2021-02-13 20:13 TextGrid/
0 2021-02-13 20:13 TextGrid/LJSpeech/
7164 2021-02-13 20:06 TextGrid/LJSpeech/LJ001-0019.TextGrid
4589 2021-02-13 20:06 TextGrid/LJSpeech/LJ001-0020.TextGrid
3830 2021-02-13 20:06 TextGrid/LJSpeech/LJ001-0061.TextGrid
8199 2021-02-13 20:06 TextGrid/LJSpeech/LJ001-0088.TextGrid
8678 2021-02-13 20:06 TextGrid/LJSpeech/LJ001-0089.TextGrid
....
9461 2021-02-13 20:06 TextGrid/LJSpeech/LJ049-0158.TextGrid
3698 2021-02-13 20:06 TextGrid/LJSpeech/LJ050-0230.TextGrid
8067 2021-02-13 20:06 TextGrid/LJSpeech/LJ050-0261.TextGrid
7719 2021-02-13 20:06 TextGrid/LJSpeech/LJ050-0272.TextGrid
--------- -------
87949622 13102 files
Then the instructions say to unzip it `unzip -d preprocessed_data/LJSpeech/TextGrid/ LJSpeech.zip`.
Looking at the result from the previous command.
```bash
find preprocessed_data/LJSpeech/TextGrid
preprocessed_data/LJSpeech/TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ003-0286.TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ008-0301.TextGrid
preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ002-0235.TextGrid
....
```
I then symlink my LJSpeech wav files under raw_data
( cd raw_data && ln -fs <base_dir>/LJ.Speech.Dataset/LJSpeech-1.1/wavs LJSpeech; )
I then modified preprocess.yaml to match my paths.
cp config/LJSpeech/preprocess.yaml preprocess.yaml
4c4
< corpus_path: "/home/ming/Data/LJSpeech-1.1"
---
> corpus_path: "<base_dir>/LJ.Speech.Dataset/LJSpeech-1.1"
6c6,7
< raw_path: "./raw_data/LJSpeech"
---
> #raw_path: "./raw_data/LJSpeech"
> raw_path: "./raw_data"
And when I try to run python3 preprocess.py preprocess.yaml, I get the following error about missing .lab files:
Processing Data ...
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "preprocess.py", line 15, in <module>
preprocessor.build_from_path()
File "/gpfs/fs1/nrc/ict/others/u/sam037/tts/FastSpeech2.git/preprocessor/preprocessor.py", line 79, in build_from_path
ret = self.process_utterance(speaker, basename)
File "/gpfs/fs1/nrc/ict/others/u/sam037/tts/FastSpeech2.git/preprocessor/preprocessor.py", line 179, in process_utterance
with open(text_path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './raw_data/LJSpeech/LJ026-0087.lab'
What steps am I missing?
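The error shows that preprocessing expects one .lab transcript per utterance under raw_data/LJSpeech. Another issue on this page mentions a prepare_align.py script, which is presumably the step that produces them. As an illustration of what that step has to yield, the sketch below writes .lab files from an LJSpeech-style metadata.csv, where each line is id|raw text|normalized text. It is a hypothetical stand-in, not the repository's script, and the function name and paths are assumptions.

```python
import os


def write_lab_files(metadata_path, out_dir):
    """Write one <id>.lab file per metadata line; return how many were written."""
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            basename, text = parts[0], parts[-1]  # last field: normalized transcript
            lab_path = os.path.join(out_dir, basename + ".lab")
            with open(lab_path, "w", encoding="utf-8") as lab:
                lab.write(text)
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical paths; adjust to where the corpus actually lives.
    meta = "LJSpeech-1.1/metadata.csv"
    if os.path.exists(meta):
        print(write_lab_files(meta, "raw_data/LJSpeech"))
```

Note that the wav files are expected next to the .lab files, so a symlinked wavs directory that is read-only would also need to be replaced by a writable copy.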
Hi, thanks for sharing this repo!
I made a Colab notebook for inference only, which you can find here.
However, I get an error when executing the synthesize step:
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data] Package cmudict is already up-to-date!
|{F AO1 R AO2 L DH OW1 DH AH0 CH AY0 N IY1 Z T UH1 K IH0 M P R EH1 SH AH0 N Z F R AH1 M W UH1 D B L AA1 K S IH0 N G R EY1 V D IH0 N R IH0 L IY1 F F AO1 R S EH1 N CH ER0 IY0 Z B IH0 F AO1 R DH AH0 W UH1 D K AH2 T ER0 Z AH1 V DH AH0 N EH1 DH ER0 L AH0 N D Z} {B AY1 AH0 S IH1 M AH0 L ER0 P R AA1 S EH2 S}|
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.WaveGlow' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.conv.ConvTranspose1d' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.WN' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv1d' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py:644: SourceChangeWarning: source code of class 'glow.Invertible1x1Conv' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
Traceback (most recent call last):
File "synthesize.py", line 86, in <module>
synthesize(model, text, sentence, prefix='step_{}'.format(args.step))
File "synthesize.py", line 60, in synthesize
wave_glow = utils.get_WaveGlow()
File "/content/FastSpeech2/utils.py", line 107, in get_WaveGlow
wave_glow = wave_glow.remove_weightnorm(wave_glow)
File "/content/FastSpeech2/glow.py", line 307, in remove_weightnorm
WN.cond_layers = remove(WN.cond_layers)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 606, in __getattr__
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'WN' object has no attribute 'cond_layers'
I am still able to generate the speech, but it doesn't sound as it should at all.
Here is the generated speech.
Any idea on what's happening?
Thanks!
First, thank you; I have solved the issue I opened thanks to your support. As I understand it, both MelGAN (which I have tried) and WaveGlow (not run yet) have a female voice. To get a male voice, is it necessary to train the model from scratch? Or to add support for a specific vocoder?
Thank you.
In modules.py change
self.pitch_bins = nn.Parameter(torch.exp(torch.linspace(np.log(hp.f0_min), np.log(hp.f0_max), hp.n_bins-1)))
self.energy_bins = nn.Parameter(torch.linspace(hp.energy_min, hp.energy_max, hp.n_bins-1))
to
self.pitch_bins = torch.exp(torch.linspace(np.log(hp.f0_min), np.log(hp.f0_max), hp.n_bins-1)).cuda()
self.energy_bins = torch.linspace(hp.energy_min, hp.energy_max, hp.n_bins-1).cuda()
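The pitch expression above builds n_bins-1 boundaries that are evenly spaced in log-frequency between f0_min and f0_max. Since these are fixed constants rather than weights to be learned, it is plausible (an assumption on my part) that dropping nn.Parameter is what resolves the autograd error reported earlier. The sketch below recomputes the same boundaries with the standard library only, using the f0 values quoted near the top of this page and a hypothetical n_bins of 256.

```python
import math


def log_spaced_bins(f0_min, f0_max, n_bins):
    """Return n_bins - 1 boundaries evenly spaced in log-frequency."""
    n = n_bins - 1
    lo, hi = math.log(f0_min), math.log(f0_max)
    # Equivalent to exp(linspace(log(f0_min), log(f0_max), n)); needs n >= 2.
    return [math.exp(lo + (hi - lo) * i / (n - 1)) for i in range(n)]


bins = log_spaced_bins(71.0, 795.8, 256)
print(len(bins), bins[0], bins[-1])
```

Log spacing gives every bin the same width in musical terms (a constant frequency ratio), so low pitches are resolved as finely as high ones.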
Originally posted by @Mao-JianGuo in #13 (comment)
Hi, thanks for the work! I have a problem. I am trying to train the model on Russian. The only thing I changed in your code is the cleaner, by specifying transliteration_cleaner, since Cyrillic can be transliterated. I went through the TextGrid creation step as well as the preprocess.py step, then ran train.py. An error occurs:
How can I solve this problem?
I wanted to use the Mandarin acoustic model provided by MFA, but I failed. How should I train an acoustic model, or can you share the acoustic model used in this project?
I have tried your project for Mandarin ("普通话"). The generator skips the number when the input text mixes Chinese characters and digits. For example, with "台风奥马尔是继1976年的台风帕梅拉以来吹袭关岛的最强台风", it skips "1976" in the .wav output. How can I get "1976" pronounced in the result?
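A likely cause (an assumption, not verified against the repository) is that Arabic digits have no entry in the pinyin lexicon, so they are dropped before synthesis. One workaround is to normalize the text first. The sketch below spells digits out one by one, which suits years such as 1976 but not quantities, where place-value words would be needed; the function name is hypothetical.

```python
DIGITS = "零一二三四五六七八九"


def spell_out_digits(text):
    """Replace each ASCII digit with the corresponding Chinese numeral character."""
    return "".join(DIGITS[int(ch)] if ch in "0123456789" else ch for ch in text)


print(spell_out_digits("台风奥马尔是继1976年的台风帕梅拉以来吹袭关岛的最强台风"))
```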