yl4579 / auxiliaryasr Goto Github PK


Joint CTC-S2S Phoneme-level ASR for Voice Conversion and TTS (Text-Mel Alignment)

License: MIT License

Python 100.00%

asr text-to-speech voice-conversion

auxiliaryasr's Issues

Has anyone used the phonemizer? Any advice on how to change the code correctly?

How much data did you use to train the model?

Hi. Thank you for your great work!
I was wondering how much data you used to train the model, and whether you augmented the data.
I notice that you include the LJSpeech dataset here as an example, but the sample rate of LJSpeech is 22050 Hz, so I think it is not the data you actually used when training the model?

how to make word_index_dict.txt

I have a basic question: how do I make word_index_dict.txt for Mandarin?
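One way to build such a file, sketched under the assumption that word_index_dict.txt is a two-column CSV of symbol and index (check this against the file shipped in the repo). The helper names `build_symbol_index` and `write_symbol_index` and the sample pinyin tokens are illustrative and not part of AuxiliaryASR. Whether the repo indexes whole tokens or single characters should also be checked in its text cleaner; this sketch indexes whitespace-separated tokens.

```python
import csv
import io


def build_symbol_index(transcripts, reserved=(" ",)):
    """Map every distinct token to an integer index.

    `reserved` symbols get the lowest indices; placing " " first is an
    assumption made for this sketch, not a documented requirement.
    """
    symbols = list(reserved)
    seen = set(reserved)
    for line in transcripts:
        for token in line.split():
            if token not in seen:
                seen.add(token)
                symbols.append(token)
    return {symbol: index for index, symbol in enumerate(symbols)}


def write_symbol_index(mapping, handle):
    """Write the mapping as rows of quoted symbol, unquoted index."""
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for symbol, index in mapping.items():
        writer.writerow([symbol, index])


# Hypothetical pinyin transcripts, one utterance per string.
transcripts = ["wu3 shu4 shi3 zhong1", "wo3 guo2 de5 guo2 cui4"]
mapping = build_symbol_index(transcripts)

# An in-memory buffer stands in for open("word_index_dict.txt", "w").
buffer = io.StringIO()
write_symbol_index(mapping, buffer)
```

The number of rows written is the vocabulary size, so the model's token count in the config would need to match it.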

Error Message: RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input [1, 65621, 2]

@yl4579
I added an extra line to the train_list.txt file and got this error message:

python train.py --config_path ./Configs/config.yml
{'max_lr': 0.0005, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 72}
ctc_linear.2.linear_layer.weight does not have same shape
torch.Size([178, 256]) torch.Size([80, 256])
ctc_linear.2.linear_layer.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.embedding.weight does not have same shape
torch.Size([178, 512]) torch.Size([80, 256])
asr_s2s.project_to_n_symbols.weight does not have same shape
torch.Size([178, 128]) torch.Size([80, 128])
asr_s2s.project_to_n_symbols.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.decoder_rnn.weight_ih does not have same shape
torch.Size([512, 640]) torch.Size([512, 384])

Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/train.py", line 116, in <module>
main()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/bud/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/bud/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
for obj in iterable:
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/meldataset.py", line 65, in __getitem__
mel_tensor = self.to_melspec(wave_tensor)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 619, in forward
specgram = self.spectrogram(waveform)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 110, in forward
return F.spectrogram(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 126, in spectrogram
spec_f = torch.stft(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/functional.py", line 648, in stft
input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input [1, 65621, 2]
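The reported input shape `[1, 65621, 2]` has a last dimension of 2, which suggests the newly added file is stereo and is being read as (samples, channels): reflect padding of 1024 cannot be applied along a dimension of size 2. A minimal sketch of downmixing to mono before the mel transform, assuming the waveform arrives as a NumPy array; `to_mono` is a hypothetical helper and not a function from the repo.

```python
import numpy as np


def to_mono(wave):
    """Return a 1-D waveform, averaging channels if the input is 2-D.

    Assumes a 2-D input is laid out as (samples, channels), which matches
    the [65621, 2] shape in the error message.
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 1:
        return wave
    if wave.ndim == 2:
        return wave.mean(axis=1)
    raise ValueError(f"unexpected waveform shape: {wave.shape}")


# A silent stereo clip with the same shape as in the traceback.
stereo = np.zeros((65621, 2), dtype=np.float32)
mono = to_mono(stereo)
```

Converting the offending file to mono on disk would have the same effect without touching the loader.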

how to train for mandarin asr?

If I want to train a Mandarin ASR model, my dict looks like this: dict.txt. I use g2pM as the phonemizer, and train.txt looks like this:
SSB06930002.wav | 武 wu3 术 shu4 始 shi3 终 zhong1 被 bei4 看 kan4 作 zuo4 我 wo3 国 guo2 的 de5 国 guo2 粹 cui4 | 0
I don't know how to change my format (dict.txt, train.txt) to suit this project, or how to change meldataset. Can you help me? Thank you very much.
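A sketch of one possible conversion, assuming the target layout is the `path|text|speaker` form seen in other train lists quoted in these issues, and assuming the hanzi should be dropped so that only the pinyin tokens remain. `to_train_line` is a hypothetical helper; whether to keep the tone digits is a separate choice that this sketch leaves untouched.

```python
import re

# Matches CJK Unified Ideographs; used to drop the hanzi and keep pinyin.
_HANZI = re.compile(r"[\u4e00-\u9fff]")


def to_train_line(raw_line, default_speaker="0"):
    """Convert 'wav | 武 wu3 术 shu4 | 0' into 'wav|wu3 shu4|0'."""
    fields = [field.strip() for field in raw_line.strip().split("|")]
    if len(fields) == 2:
        fields.append(default_speaker)
    if len(fields) != 3:
        raise ValueError(f"expected 2 or 3 '|'-separated fields: {raw_line!r}")
    path, text, speaker = fields
    tokens = [tok for tok in text.split() if not _HANZI.search(tok)]
    return f"{path}|{' '.join(tokens)}|{speaker}"


raw = "SSB06930002.wav | 武 wu3 术 shu4 始 shi3 终 zhong1 | 0"
converted = to_train_line(raw)
```

Every distinct pinyin token produced this way would also need an entry in the symbol dictionary.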

How to train a ZH-EN dual-language aligner?

Hi there.
I saw that the repo's code only supports English aligner training

About the loss

Hi. Could you kindly share your training loss of the model (maybe a tensorboard picture)? Thank you very much.

Why is " " used as the blank in the CTCLoss?

Hey @yl4579 thank you for your great work on this (and StyleTTS).

I was wondering if there was a reason for using " " as the blank token in the CTCLoss instead of something distinct from what can be returned from G2p as is suggested here? I was thinking of using something like id 80 if appending onto the vocab defined here.

Was wondering if this would affect the downstream training of StyleTTS much or if the aligner just has to be a "good enough" starting point?

Thanks!
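To make the question concrete, here is a small sketch of greedy CTC decoding (collapse repeated ids, then drop the blank). The ids are assumptions for illustration: 0 for " " and 80 for an appended blank, as proposed in the question. It only shows the decoding rule, not how the repo computes its loss.

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse consecutive repeats, then remove the blank id."""
    decoded = []
    previous = None
    for current in frame_ids:
        if current != previous and current != blank_id:
            decoded.append(current)
        previous = current
    return decoded


SPACE_ID = 0   # assumed index of " " in the vocabulary
BLANK_ID = 80  # hypothetical dedicated blank appended after the vocabulary

# With a dedicated blank, a predicted space survives decoding and the
# blank can separate two genuine repeats of symbol 5.
dedicated = ctc_greedy_decode([5, 5, BLANK_ID, 5, SPACE_ID, SPACE_ID, BLANK_ID, 7],
                              blank_id=BLANK_ID)

# With " " doubling as the blank, the space between 5 and 7 is removed.
space_as_blank = ctc_greedy_decode([5, 5, SPACE_ID, SPACE_ID, 7],
                                   blank_id=SPACE_ID)
```

Adding a dedicated blank would also mean growing the CTC output layer by one class, so a checkpoint trained with the old vocabulary size would no longer match in shape.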

Why does mel spectrogram feature extraction use only MEL_PARAMS here?

Hi, why does mel spectrogram feature extraction use only MEL_PARAMS? Why is SPECT_PARAMS not used?

AuxiliaryASR/meldataset.py

Line 50 in 7bca68a

self.to_melspec = torchaudio.transforms.MelSpectrogram(**MEL_PARAMS)

Multiple GPU training and changing to librosa mel spec?

Hello again. Is there multi-GPU training for this repo? Also, do you have any training logs I can compare with? Thanks!

I'll also post this question here, as it concerns this repo.

Should I convert from torchaudio to librosa in AuxiliaryASR and PitchExtractor, or just leave it with torchaudio?
Something like this? (Converted by ChatGPT:)

```python
import librosa
import numpy as np

DEFAULT_DICT_PATH = osp.join(osp.dirname(__file__), 'word_index_dict.txt')
SPECT_PARAMS = {
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}
MEL_PARAMS = {
    "n_mels": 80,
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}

class MelDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 dict_path=DEFAULT_DICT_PATH,
                 sr=24000
                 ):

        spect_params = SPECT_PARAMS
        mel_params = MEL_PARAMS

        _data_list = [l[:-1].split('|') for l in data_list]
        self.data_list = [data if len(data) == 3 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner(dict_path)
        self.sr = sr

        self.to_melspec = librosa.feature.melspectrogram(**MEL_PARAMS)
        self.mean, self.std = -4, 4
```

get error

[train]: 24%|██▍ | 16/66 [00:04<00:15, 3.20it/s]
Traceback (most recent call last):
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 116, in <module>
main()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/mike/PycharmProjects/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
return self._process_data(data)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 60, in __getitem__
wave, text_tensor, speaker_id = self._load_tensor(data)
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 78, in _load_tensor
speaker_id = int(speaker_id)
ValueError: invalid literal for int() with base 10: ''

My train_list:
/media/mike/yys/data_asr/SSB00800056.wav|wo men can jia guo xu duo zhong da huo dong de biao yan|0
/media/mike/yys/data_asr/SSB00050001.wav|guang zhou nv da xue sheng deng shan shi lian si tian jing fang zhao dao yi si nv shi|0
/media/mike/yys/data_asr/SSB00050002.wav|zhun zhong ke xue gui lv de yao qiu|0
/media/mike/yys/data_asr/SSB00050003.wav|qi lu wu ren shou piao|0
..
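The `ValueError` means that some line yields an empty speaker field. One possible cause, given that the dataset code quoted earlier splits `l[:-1]`: a final line without a trailing newline loses its last character, so `...|0` becomes `...|`. The sketch below reproduces that and shows a more tolerant parser; `parse_train_list` is a hypothetical helper, and the sample paths are made up.

```python
def parse_train_list(lines):
    """Parse 'path|text|speaker' lines and collect malformed ones.

    Strips only line-ending characters instead of slicing off the last
    character, so a final line without a newline is left intact.
    """
    entries, problems = [], []
    for number, line in enumerate(lines, 1):
        fields = line.rstrip("\r\n").split("|")
        if len(fields) == 2:
            fields.append("0")  # default speaker when the field is absent
        if len(fields) != 3 or not fields[2].strip().isdigit():
            problems.append((number, line))
            continue
        entries.append((fields[0], fields[1], int(fields[2])))
    return entries, problems


# The last line has no trailing newline, as often happens at end of file.
lines = ["a.wav|wo men|0\n", "b.wav|qi lu|0"]

# Slicing with [:-1] removes the speaker id instead of a newline.
naive_fields = lines[-1][:-1].split("|")

# A stray blank line is reported rather than crashing the loader.
entries, problems = parse_train_list(lines + ["\n"])
```

Running a check like this over the list before training points at the exact line number that needs fixing.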
