Still need a long time to synthesize a new speech (fastspeech issue, closed)

xcmyz commented on July 18, 2024

Still need a long time to synthesize a new speech

from fastspeech.

Comments (9)

liangshuang1993 commented on July 18, 2024

@Liujingxiu23 Thanks. I just found that I was measuring the time incorrectly: generating mel and mel_postnet takes about 0.05 sec on a P100.

Liujingxiu23 commented on July 18, 2024

How do your synthesized wavs sound? I synthesized wavs using the model checkpoint "checkpoint_40000.pth.tar", and the quality is bad.

xcmyz commented on July 18, 2024

Total loss is as follows (I cut top 30000):

xcmyz commented on July 18, 2024

I will post the resulting wav here soon.

liangshuang1993 commented on July 18, 2024

@xcmyz Can you tell me how long it takes you to synthesize speech? Thanks!

Liujingxiu23 commented on July 18, 2024

@xcmyz Thank you for your reply. How many iterations does the model need to converge? I use the LJSpeech dataset.

liangshuang1993 commented on July 18, 2024

@Liujingxiu23 My loss is about 0.2, which is not good either. How long does it take you to synthesize new speech?

Liujingxiu23 commented on July 18, 2024

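# Forward pass: predict mel spectrograms before and after the postnet
mel_output, mel_output_postnet = model(src_seq, src_pos)
# Take item k of the batch and copy it to the CPU as a NumPy array
mel = mel_output_postnet[k].detach().cpu().numpy()
# Convert the (transposed) mel spectrogram back to a waveform
wav = audio.inv_mel_spectrogram(mel.T)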

For batch size 10 and seq_len=75 on one GPU, the time spent is:

to generate mel and mel_postnet: 0.159 sec
to generate mel, mel_postnet and wavs: 37.625 sec
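
Since the numbers above depend on how each stage is timed, here is a minimal sketch of timing the two stages separately. It is only an illustration under stated assumptions: `stage_timer`, `fake_acoustic_model` and `fake_vocoder` are hypothetical names, and the two fake functions merely sleep in place of `model(src_seq, src_pos)` and `audio.inv_mel_spectrogram(mel.T)`. On a CUDA device, GPU kernels run asynchronously, so one would probably pass `torch.cuda.synchronize` as `sync`; otherwise the clock may stop before the GPU work has finished.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(results, label, sync=None):
    """Store the wall-clock seconds spent inside the block in results[label].

    sync is an optional zero-argument callable that runs before the clock
    starts and again before it stops. Assumption: on a CUDA device this
    would be torch.cuda.synchronize, so pending GPU work is included.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = time.perf_counter() - start


def fake_acoustic_model():
    # Hypothetical stand-in for model(src_seq, src_pos)
    time.sleep(0.01)
    return "mel"


def fake_vocoder(mel):
    # Hypothetical stand-in for audio.inv_mel_spectrogram(mel.T)
    time.sleep(0.05)
    return "wav"


timings = {}
with stage_timer(timings, "mel"):
    mel = fake_acoustic_model()
with stage_timer(timings, "wav"):
    wav = fake_vocoder(mel)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} sec")
```

Timing the stages separately like this makes it easier to see whether the acoustic model or the mel-to-waveform inversion dominates the total.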

Liujingxiu23 commented on July 18, 2024

@xcmyz I tried your latest code. The acoustic quality improved a lot; I think it is nearly the same as tacotron2.
The TTS corpus I use is Chinese, and I kept the default hparams settings.
My loss does not seem as good as yours: the postnet mel loss converges to about 0.5 and the duration loss to about 0.8. I do not know why.
By the way, neither the pronunciation nor the tone is that good. For example, in some wavs "zhang" is read like "zhan" and "tao3" is read like "tao2". Why does this happen? Do you have any suggestions for solving this problem?
