yl4579 / auxiliaryasr
Joint CTC-S2S Phoneme-level ASR for Voice Conversion and TTS (Text-Mel Alignment)
License: MIT License
Hi. Thank you for your great work!
I was wondering how much data you used to train the model, and whether you augmented the data.
I noticed that you include the LJSpeech dataset here as an example, but the sample rate of LJSpeech is 22050 Hz, so I assume it is not the data you actually used when training the model?
I have a rather basic question: how do I create word_index_dict.txt for Mandarin?
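One possible approach, sketched under assumptions: the dictionary is assumed to be a two-column CSV with one quoted symbol and its integer index per row (worth checking against the repo's own word_index_dict.txt), and the transcripts are assumed to be space-separated pinyin syllables with tone digits. The helper names and the reserved symbols `<pad>` and `" "` are hypothetical, chosen only for illustration.

```python
import csv
import io


def build_symbol_index(transcripts, reserved=("<pad>", " ")):
    """Map every distinct token to an integer id.

    Reserved symbols get the lowest ids; the remaining tokens are sorted
    so the mapping is reproducible across runs.
    """
    tokens = set()
    for line in transcripts:
        tokens.update(line.split())
    symbols = list(reserved) + sorted(tokens - set(reserved))
    return {symbol: index for index, symbol in enumerate(symbols)}


def write_symbol_index(mapping, handle):
    """Write one quoted-symbol,index row per entry."""
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC)
    for symbol, index in mapping.items():
        writer.writerow([symbol, index])


transcripts = ["wu3 shu4 shi3 zhong1", "wo3 guo2 de5 guo2 cui4"]
mapping = build_symbol_index(transcripts)
buffer = io.StringIO()
write_symbol_index(mapping, buffer)
print(buffer.getvalue())
```

Whatever symbol set is chosen, the number of entries has to agree with the number of output tokens the model is configured for; otherwise the checkpoint and the model disagree on layer shapes.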
@yl4579
I added an extra line to the train_list.txt file and got this error message:
python train.py --config_path ./Configs/config.yml
{'max_lr': 0.0005, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 72}
ctc_linear.2.linear_layer.weight does not have same shape
torch.Size([178, 256]) torch.Size([80, 256])
ctc_linear.2.linear_layer.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.embedding.weight does not have same shape
torch.Size([178, 512]) torch.Size([80, 256])
asr_s2s.project_to_n_symbols.weight does not have same shape
torch.Size([178, 128]) torch.Size([80, 128])
asr_s2s.project_to_n_symbols.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.decoder_rnn.weight_ih does not have same shape
torch.Size([512, 640]) torch.Size([512, 384])
Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/train.py", line 116, in <module>
main()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/bud/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/bud/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
for obj in iterable:
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/meldataset.py", line 65, in __getitem__
mel_tensor = self.to_melspec(wave_tensor)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 619, in forward
specgram = self.spectrogram(waveform)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 110, in forward
return F.spectrogram(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 126, in spectrogram
spec_f = torch.stft(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/functional.py", line 648, in stft
input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input [1, 65621, 2]
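The shape in the last line, `[1, 65621, 2]`, suggests that the waveform reaching the STFT has two channels in its last dimension, i.e. a stereo file, so the 1024-sample padding is being compared against a dimension of size 2. A minimal sketch for locating such files with the standard-library `wave` module, assuming PCM WAV input; `find_non_mono` is a hypothetical helper, not part of the repo:

```python
import wave


def find_non_mono(paths):
    """Return (path, channel_count) for every WAV file that is not mono."""
    offenders = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
        if channels != 1:
            offenders.append((path, channels))
    return offenders
```

Any file it reports could then be down-mixed to a single channel before it is added to the train list.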
I want to train a Mandarin ASR model. My dictionary looks like this: dict.txt. I use g2pM as the phonemizer, and my train.txt looks like this:
SSB06930002.wav | 武 wu3 术 shu4 始 shi3 终 zhong1 被 bei4 看 kan4 作 zuo4 我 wo3 国 guo2 的 de5 国 guo2 粹 cui4 | 0
I don't know how to change my formats (dict.txt and train.txt) to suit this project, or what to change in meldataset. Can you help me? Thank you very much.
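A minimal sketch of one way to bring such a line closer to a `path|text|speaker_id` layout: keep only the pinyin syllables and drop the Chinese characters and the spaces around the separators. It assumes every pinyin token is ASCII letters followed by one tone digit; `to_train_line` is a hypothetical helper, and how the resulting tokens are looked up in the dictionary still depends on the text cleaner in meldataset.

```python
import re

# Assumption: a pinyin token is ASCII letters followed by one tone digit (1-5).
PINYIN_TOKEN = re.compile(r"^[A-Za-z]+[1-5]$")


def to_train_line(raw_line):
    """Convert 'file | <hanzi> han4 <hanzi> zi4 | spk' into 'file|han4 zi4|spk'."""
    path, text, speaker = (field.strip() for field in raw_line.split("|"))
    syllables = [tok for tok in text.split() if PINYIN_TOKEN.match(tok)]
    return "|".join([path, " ".join(syllables), speaker])


raw = "SSB06930002.wav | 武 wu3 术 shu4 始 shi3 终 zhong1 | 0"
print(to_train_line(raw))
```

If the text cleaner indexes the transcript character by character, whole syllables such as `wu3` would not be found as single dictionary entries, so either the dictionary or the lookup would need to agree on the token unit.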
Hi there.
I saw that the repo's code only supports English aligner training.
Hi. Could you kindly share your training loss of the model (maybe a tensorboard picture)? Thank you very much.
Hey @yl4579 thank you for your great work on this (and StyleTTS).
I was wondering if there was a reason for using " " as the blank token in the CTCLoss instead of something distinct from what can be returned from G2p, as is suggested here? I was thinking of using something like id 80 if appending onto the vocab defined here.
Was wondering if this would affect the downstream training of StyleTTS much or if the aligner just has to be a "good enough" starting point?
Thanks!
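To make the concern concrete, here is a small sketch of greedy CTC decoding (merge repeats, then drop blanks) on a toy vocabulary. It is not the repo's decoder; the vocabulary, the ids, and the frame sequences are made up for illustration. When the blank shares an id with the space symbol, word boundaries vanish from the decoded output, whereas a dedicated blank id appended after the existing symbols keeps them.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Standard CTC collapse: merge consecutive repeats, then remove blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Toy vocabulary (hypothetical): id 0 is the space symbol.
vocab = {" ": 0, "h": 1, "i": 2, "y": 3, "o": 4}
id_to_symbol = {index: symbol for symbol, index in vocab.items()}

# Frame-level predictions for "hi yo", with the space predicted as id 0.
frames = [1, 1, 2, 0, 0, 3, 4, 4]

# Blank shares the id of the space symbol: the space is removed as a blank.
shared = ctc_greedy_collapse(frames, blank_id=vocab[" "])
print("".join(id_to_symbol[i] for i in shared))

# Dedicated blank appended after the existing symbols: the space survives.
blank_id = len(vocab)
frames_with_blank = [1, blank_id, 2, 0, blank_id, 3, 4, 4]
dedicated = ctc_greedy_collapse(frames_with_blank, blank_id=blank_id)
print("".join(id_to_symbol[i] for i in dedicated))
```

In PyTorch, the blank index is passed to `torch.nn.CTCLoss` through its `blank` argument, so moving to a dedicated id would also mean enlarging the output layer by one class.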
Hi, why does the mel-spectrogram feature extraction use only MEL_PARAMS? Why is SPECT_PARAMS not used?
Line 50 in 7bca68a
Hello again. Is multi-GPU training supported for this repo? Also, do you have any training logs I can compare with? Thanks!
I'll also post this question here, as it's for this repo...
Should I convert from torchaudio to librosa in AuxiliaryASR and PitchExtractor, or just leave it with torchaudio?
Something like this?
ChatGPT-converted:
import os.path as osp

import librosa
import numpy as np
import torch

# TextCleaner is the helper class defined alongside MelDataset in the repo's meldataset.py.
DEFAULT_DICT_PATH = osp.join(osp.dirname(__file__), 'word_index_dict.txt')
SPECT_PARAMS = {
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}
MEL_PARAMS = {
    "n_mels": 80,
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}

class MelDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 dict_path=DEFAULT_DICT_PATH,
                 sr=24000
                 ):
        spect_params = SPECT_PARAMS
        mel_params = MEL_PARAMS
        _data_list = [l[:-1].split('|') for l in data_list]
        self.data_list = [data if len(data) == 3 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner(dict_path)
        self.sr = sr
        # librosa.feature.melspectrogram is a function, not a transform object,
        # so wrap it and pass the waveform (a NumPy array) as y= at call time.
        self.to_melspec = lambda wave: librosa.feature.melspectrogram(y=wave, sr=sr, **MEL_PARAMS)
        self.mean, self.std = -4, 4
[train]: 24%|██▍ | 16/66 [00:04<00:15, 3.20it/s]
Traceback (most recent call last):
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 116, in <module>
main()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/mike/PycharmProjects/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
return self._process_data(data)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 60, in __getitem__
wave, text_tensor, speaker_id = self._load_tensor(data)
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 78, in _load_tensor
speaker_id = int(speaker_id)
ValueError: invalid literal for int() with base 10: ''
My train_list:
/media/mike/yys/data_asr/SSB00800056.wav|wo men can jia guo xu duo zhong da huo dong de biao yan|0
/media/mike/yys/data_asr/SSB00050001.wav|guang zhou nv da xue sheng deng shan shi lian si tian jing fang zhao dao yi si nv shi|0
/media/mike/yys/data_asr/SSB00050002.wav|zhun zhong ke xue gui lv de yao qiu|0
/media/mike/yys/data_asr/SSB00050003.wav|qi lu wu ren shou piao|0
..
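One thing worth checking, offered as a guess rather than a diagnosis: the dataset code quoted earlier on this page splits each line with `l[:-1].split('|')`, which drops the last character unconditionally. If the final line of the file has no trailing newline, that character is the speaker id itself, leaving an empty string for `int()`. A minimal sketch of a more forgiving parser; `parse_train_list` is a hypothetical helper, not part of the repo:

```python
def parse_train_list(lines):
    """Parse 'path|text|speaker_id' lines, tolerating a missing final newline."""
    entries = []
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) == 2:
            fields.append("0")  # default speaker id, as the original loader does
        if len(fields) != 3 or not fields[2].strip().isdigit():
            raise ValueError(
                f"line {number}: expected path|text|speaker_id, got {line!r}"
            )
        path, text, speaker = fields
        entries.append((path, text, int(speaker)))
    return entries


lines = ["a.wav|qi lu wu ren shou piao|0\n", "b.wav|zhun zhong ke xue|0"]
print(parse_train_list(lines))

# The unconditional slice turns the last line into 'b.wav|zhun zhong ke xue|',
# so the speaker field becomes an empty string.
print(lines[-1][:-1].split("|"))
```

Stripping only line-ending characters, instead of slicing off one character, keeps the speaker id intact whether or not the file ends with a newline, and the explicit check reports the offending line number instead of failing inside a DataLoader worker.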