microsoft / speecht5
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
License: MIT License
Hi there,
Great repo and paper. I follow the exact installation and data prep steps for Speech2C training and get this error when I run the pre_training command:
AssertionError: number of labels does not match (5567 != 5566)
Any help would be appreciated!
Additionally, I also had to change `dir: ./` in the config and set `common.user_dir=speech2c` to make the code work.
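For anyone hitting the same assertion, a quick sanity check that the label file lines up with the manifest (a minimal sketch; the file names are assumed):

```python
# A sketch (file names assumed): in HuBERT-style data prep, the .km label file
# must have exactly one line per audio entry in the .tsv manifest, whose first
# line is the root directory rather than an utterance.
n_audio = sum(1 for _ in open("train.tsv")) - 1  # subtract the root-dir header
n_label = sum(1 for _ in open("train.km"))
assert n_label == n_audio, f"number of labels does not match ({n_label} != {n_audio})"
```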
Hello!
Thank you very much for adding a code snippet to outline how to load pre-trained SpeechT5 weights, super helpful for understanding how to process the data and load the task!
I've been attempting to load the 'base' pre-trained weights according to the code snippet provided here:
import torch
from speecht5.tasks.speecht5 import SpeechT5Task
from speecht5.models.speecht5 import T5TransformerModel
checkpoint = torch.load('/path/to/speecht5_checkpoint')
checkpoint['cfg']['task'].t5_task = 'pretrain'
checkpoint['cfg']['task'].hubert_label_dir = "/path/to/hubert_label"
checkpoint['cfg']['task'].data = "/path/to/tsv_file"
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
Steps performed:

- Set n_clusters=500 and created a dummy `dict.km.txt`:

```sh
n_clusters=500
for x in $(seq 0 $((n_clusters - 1))); do
  echo "$x 1"
done >> $lab_dir/dict.km.txt
```

- Put the tsv file under `data` and the HuBERT labels under `hubert_label_dir`.
- Ran `task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])`, which failed with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:242, in Dictionary.add_from_file(self, f)
241 try:
--> 242 line, field = line.rstrip().rsplit(" ", 1)
243 if field == "#fairseq:overwrite":
ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
File ~/SpeechT5/SpeechT5/speecht5/tasks/speecht5.py:301, in SpeechT5Task.setup_task(cls, args, **kwargs)
299 if args.t5_task == "pretrain":
300 dicts["hubert"] = [Dictionary.load(f"{args.hubert_label_dir}/dict.{label}.txt") for label in args.hubert_labels]
--> 301 dicts["text"] = Dictionary.load(op.join(args.data, "dict.txt"))
302 else:
303 if config is None:
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:216, in Dictionary.load(cls, f)
207 """Loads the dictionary from a text file with the format:
208
209 ```
(...)
213 ```
214 """
215 d = cls()
--> 216 d.add_from_file(f)
217 return d
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:227, in Dictionary.add_from_file(self, f)
225 try:
226 with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
--> 227 self.add_from_file(fd)
228 except FileNotFoundError as fnfe:
229 raise fnfe
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:260, in Dictionary.add_from_file(self, f)
258 self.add_symbol(word, n=count, overwrite=overwrite)
259 except ValueError:
--> 260 raise ValueError(
261 "Incorrect dictionary format, expected '<token> <cnt> [flags]'"
262 )
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
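For reference, fairseq dictionaries are plain text with one `<token> <count>` pair per line; a minimal sketch (file name and tokens assumed) that writes and loads a valid one:

```python
from fairseq.data import Dictionary

# A sketch (file name and tokens assumed): each line must be "<token> <count>";
# a line missing the count column triggers exactly the error above.
with open("dict.txt", "w") as f:
    for token in ["a", "b", "c"]:
        f.write(f"{token} 1\n")

d = Dictionary.load("dict.txt")
print(len(d))  # 7: the 3 tokens plus the 4 built-in specials (bos/pad/eos/unk)
```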
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [17], in <cell line: 2>()
1 model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
----> 2 model.load_state_dict(checkpoint['model'])
File ~/SpeechT5/SpeechT5/speecht5/models/speecht5.py:1040, in T5TransformerModel.load_state_dict(self, state_dict, strict, model_cfg, args)
1036 m_state_dict = {
1037 key.replace(f"{m}.", ""): value for key, value in state_dict.items() if key.startswith(f"{m}.")
1038 }
1039 if hasattr(self, m):
-> 1040 self._modules[m].load_state_dict(m_state_dict, False)
1041 return self
File ~/venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1497, in Module.load_state_dict(self, state_dict, strict)
1492 error_msgs.insert(
1493 0, 'Missing key(s) in state_dict: {}. '.format(
1494 ', '.join('"{}"'.format(k) for k in missing_keys)))
1496 if len(error_msgs) > 0:
-> 1497 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
1498 self.__class__.__name__, "\n\t".join(error_msgs)))
1499 return _IncompatibleKeys(missing_keys, unexpected_keys)
RuntimeError: Error(s) in loading state_dict for TransformerEncoder:
size mismatch for proj.weight: copying a param with shape torch.Size([81, 768]) from checkpoint, the shape in current model is torch.Size([7, 768]).
size mismatch for proj.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([7]).
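The [81, 768] vs. [7, 768] mismatch suggests the text dictionary built by the task is smaller than the one used at pre-training time. A diagnostic sketch (checkpoint path assumed; it just scans the state dict, since exact key names vary):

```python
import torch

# A diagnostic sketch (path assumed): print every projection-like parameter
# and its shape to see which vocabulary size the checkpoint expects.
ckpt = torch.load("/path/to/speecht5_checkpoint", map_location="cpu")
for key, value in ckpt["model"].items():
    if "proj" in key:
        print(key, tuple(value.shape))
```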
I would be very grateful for some insight on these two errors.
Many thanks for your help!
At 95400 num_updates during pre-training, I got:

```
File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/multitask_dataset.py", line 58, in __getitem__
  sample = self.datasets[dataset_idx][sample_idx]
File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/text_dataset.py", line 218, in __getitem__
  assert (source[1:-1] >= 1).all()
IndexError: slice() cannot be applied to a 0-dim tensor
```

Could the reason be the text data preparation?
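For context, a 0-dim tensor reproduces this exactly; it would arise if a text example collapses to a scalar (e.g. an empty or single-token line). A minimal illustration:

```python
import torch

source = torch.tensor(2)  # 0-dim: what a collapsed/empty text example looks like
try:
    source[1:-1]
except IndexError as e:
    print(e)  # slice() cannot be applied to a 0-dim tensor
```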
Hey SpeechT5 team, I've been seeing some SpeechT5 activity on Hugging Face, but didn't really see a Hugging Face version of SpeechT5. Could you please tell me if you are working on porting SpeechT5 to Hugging Face, or perhaps it's already done?
I am planning to pretrain SpeechT5 and then use it in Hugging Face for other downstream tasks.
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open-source guidance available internally.
May I ask how many epochs are set during pre-training and fine-tuning?
Hi,
I am trying to load the fine-tuned models provided for VATLM. However, I encounter an error where loading tries to access local storage on the machine where the model was trained. This occurs with all the models you have shared.
The error is:
`File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 93, in main
return _main(cfg, h)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 115, in _main
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task
model = task.build_model(cfg.model)
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
model = models.build_model(cfg, self)
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/models/init.py", line 96, in build_model
return model.build_model(cfg, task)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/models/vathubert_asr.py", line 400, in build_model
state = checkpoint_utils.load_checkpoint_to_cpu(
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 303, in load_checkpoint_to_cpu
with open(local_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/default/v-qiushizhu/vatlm_related/results/fbank_large_vox_pretrain_iter5_ext_audio_1_32ngpu_2updatefreq/checkpoints/checkpoint_388_600000.pt'
`
I noticed that the shared VATLM models don't have cfg.model.w2v_args in the state dict; it is None during loading.
Would be great if you could help resolve this.
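For what it's worth, fairseq can rewrite stored config values at load time via arg_overrides; a sketch (the paths are assumed, and whether `w2v_path` is the key vathubert_asr consults is also an assumption):

```python
from fairseq import checkpoint_utils

# A sketch (paths and the overridden key are assumptions): arg_overrides
# patches the stored config before the model is built, which can redirect a
# stale absolute checkpoint path to a local copy.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/finetuned_vatlm.pt"],
    arg_overrides={"w2v_path": "/path/to/local_pretrained_vatlm.pt"},
)
```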
Hello!
Thank you for the great work again! I tried to train Speech2C and got this error after 49 epochs:
[2022-11-07 00:33:16,340][fairseq.nan_detector][WARNING] - Inf detected in output of , shape: torch.Size([1464, 505]), forward
Some training details:
dataset: libri 360
k-means trained on: libri 100
config: https://drive.google.com/file/d/1Ms5m-cuTrv43xsntHBdM_PEWaXtGGMOR/view?usp=sharing
hydra_log: https://drive.google.com/file/d/1HWvXqUGhNU-LnKNRj52HAbXPR-GqOVBU/view?usp=sharing
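For reference, fairseq's nan_detector flags this via forward hooks; a minimal sketch of the same kind of check (the module name and shapes below are stand-ins taken from the log above):

```python
import torch

# A minimal sketch of the kind of check fairseq's NanDetector performs:
# register a forward hook and flag non-finite values in a module's output.
def make_hook(name):
    def hook(module, inputs, output):
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            print(f"Inf detected in output of {name}, shape: {tuple(output.shape)}")
    return hook

layer = torch.nn.Linear(768, 505)                    # stand-in for the flagged module
layer.register_forward_hook(make_hook("final_proj"))
layer(torch.full((1464, 768), float("inf")))         # non-finite input fires the hook
```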
Can you please let me know if this has ever happened in your training setup, or if you know where I am going wrong?
Thank you!
Hi,
In the SpeechT5 paper, LibriSpeech-960h (16 kHz) is used as the speech pre-training dataset, while LibriTTS (24 kHz) is used as the TTS dataset. How do you deal with this sample-rate mismatch? Thank you!
Hi,
I tried running `fairseq-generate` according to the instructions in the README. It crashed at tasks/speecht5.py line 173 with the following exception:

argparse.ArgumentError: argument --mask-length: conflicting option string: --mask-length

I used the debugger and found that there is already an existing `--mask-length` argument configured in the parser; it was added from fairseq/fairseq/options.py line 149, where arguments from wav2vec2 are added. Apparently fairseq's generate.py sets wav2vec2 as the default for `--arch`.
I tried manually specifying the `--arch` argument as `t5_transformer_base` or `t5_transformer_base_asr`, but then the argument parser complains that `--path` is not a supported argument.
My versions are SpeechT5 commit f9b059b and fairseq commit e35c593c84bd84d5c7777ef7ace98dab508ff88e.
Any idea how to fix it, preferably without modifying fairseq code? Thanks.
Hello.
First of all, thank you for your great work.
Unfortunately, I have some issues with the preparation of the text data for pre-training and ASR fine-tuning.
I have followed the instructions provided, but I cannot figure out how I should preprocess the text data using SPM and fairseq.
How can I create text_train.tsv/text_valid.tsv?
I also have some difficulties creating the label data for the text; what format should I use?
Can you provide more details or examples of the manifest for the text?
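For anyone else stuck here, a minimal sketch of one typical SPM-to-fairseq text pipeline (the file names, vocab size, and model type are assumptions, not SpeechT5's actual recipe):

```python
import sentencepiece as spm

# A sketch (settings assumed): train an SPM model on raw text (one sentence per
# line) and re-encode the corpus so fairseq-preprocess can build a dictionary.
spm.SentencePieceTrainer.train(
    input="train_text.txt", model_prefix="spm_unigram",
    vocab_size=10000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
with open("train_text.txt") as fin, open("train.spm.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```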
I understand that it's a placeholder, but it should at least link to an empty repo/folder.
Hello,
I would like to do a grid search over hyper-parameters using ./SpeechT5/speecht5. I keep getting:
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: model
Any recommendation?
Thanks!
In `from speecht5.data.text_to_speech_dataset` used here, it seems a text_to_speech_dataset.py file is missing from the speecht5/data folder, unlike the other similar files present there.
Hi,
Do you think it would be possible to combine both speech and text as input to the encoder? I'm looking to then decode text based on this multimodal input. Should I be looking at the MultitaskDataset? Would the s2t task work for this?
Thanks.
Hi there. Recently I have been working on extracting features using the model "VATLM Large VoxCeleb2 + LRS3 + paired audio+text+audio LRS-433h visual" from the config file.
However, when I load the pretrained model, the log says fp16 is disabled (see the bottom line of the full config dump below).
2022-12-04 11:26:57 | INFO | vathubert.tasks.vathubert_pretraining | VATHubertPretrainingTask Config {'_name': 'vat_hubert_pretraining', 'data': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'labels': ['km'], 'label_dir': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'label_rate': 25, 'sample_rate': 25, 'normalize': True, 'enable_padding': False, 'max_sample_size': 500, 'min_sample_size': 5, 'max_trim_sample_size': '${task.max_sample_size}', 'single_target': False, 'random_crop': False, 'pad_audio': True, 'pdb': False, 'stack_order_audio': 4, 'skip_verify': False, 'text_sampling_alpha': 0.2, 'split_modality_batch': False, 'image_aug': True, 'image_crop_size': 88, 'image_mean': 0.421, 'image_std': 0.165, 'modalities': ['audio', 'video'], 'is_s2s': False, 'tokenizer_bpe_name': None, 'tokenizer_bpe_model': None, 'noise_wav': None, 'noise_prob': 0.0, 'noise_snr': '0', 'noise_num': 1, 'fine_tuning': False, 'use_supervised_data': True, 'sup_data_path': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sup_manifest': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sample_distributions': '0.13,0.15,0.32,0.3', 'use_extra_textdata': True, 'onlytext_manifest': '/LocalData/vatlm_related/fbankdata/cantab2_vox_tsv', 'use_extra_audiodata': True, 'onlyaudio_manifest': '/LocalData/vatlm_related/fbankdata/fbank_giga_vox_tsv_km'} 2022-12-04 11:26:57 | INFO | vathubert.models.vathubert | HubertModel Config: {'_name': 'vat_hubert', 'label_rate': 25, 'modalities': '${task.modalities}', 'extractor_mode': default, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length_audio': 10, 'mask_prob_audio': 0.8, 'mask_length_image': 5, 'mask_prob_image': 0.3, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'resnet_relu_type': 'prelu', 'resnet_weights': None, 'sim_type': 'cosine', 'sub_encoder_layers': 0, 'audio_feat_dim': 104, 'modality_dropout': 0.5, 'audio_dropout': 0.5, 'modality_fuse': 'concat', 'selection_type': 'same_seq', 'masking_type': 'input', 'decoder_embed_dim': 768, 'decoder_ffn_embed_dim': 3072, 'decoder_layers': 6, 'decoder_layerdrop': 0.0, 'decoder_attention_heads': 4, 'decoder_learned_pos': False, 'decoder_normalize_before': False, 'no_token_positional_embeddings': False, 'decoder_dropout': 0.1, 'decoder_attention_dropout': 0.1, 'decoder_activation_dropout': 0.0, 'max_target_positions': 2048, 'share_decoder_input_output_embed': False, 'no_scale_embedding': True, 'layer_type': transformer, 'pos_conv_depth': 1, 'max_positions': 100000, 'checkpoint_activations': False, 'required_seq_len_multiple': 1, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
From my personal experience, inconsistent fp16 settings between linear probing and pre-training often lead to degraded performance, so I want to know whether fp16 was enabled during pre-training. Thanks!
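A quick way to check (a sketch; the path is assumed, and the cfg layout can vary between fairseq versions) is to read the flag straight from the checkpoint:

```python
import torch

# A sketch (path assumed; cfg layout varies across fairseq versions): the fp16
# flag used at training time is stored in the checkpoint's common config.
ckpt = torch.load("vatlm_large_checkpoint.pt", map_location="cpu")
print(ckpt["cfg"]["common"]["fp16"])
```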
Hi there. I'm trying to extract features with the released VATLM pre-trained models. Here's what I did.
Firstly, I tried loading the pre-trained model:

```python
# cwd: ..../av_hubert
import fairseq.checkpoint_utils
import vathubert

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(["./finetune_large_vox2_v_433h.pt"])
```
but an error occurs: `AssertionError: Could not infer task type from {'_name': 'vat_hubert_pretraining'...`

This is because the task vat_hubert_pretraining was not successfully registered with fairseq. The reason is that the sub-package vathubert.tasks is not properly initialized, since there is no __init__.py in it (as far as I know, without an __init__.py it is only a namespace package, not a regular package). To fix it, I added an __init__.py to ./vathubert/tasks/ with the content:

```python
from .vathubert_pretraining import VATHubertPretrainingTask
```

This registers the task with fairseq. However, another error occurs:

ModuleNotFoundError: No module named 'fairseq.data.audio.multi_corpus_dataset_audio'

I read the fairseq source code and did not find a multi_corpus_dataset_audio.py file in the fairseq/data/audio directory.
Did I miss anything, or is there a bug in the code? Any help is appreciated, thanks!
Hi, congratulations on your achievement in this great work!
It is my first time using fairseq, so could you please give the exact values, or an example, of the parameters in the "ASR finetune" training and inference part? These are the values:
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
Thanks a lot!
(And the steps for how to get these values would be great!)
Hi! I'm trying to use the script in the README to extract features using pre-trained models. I used the model speechlmp_base_asr_checkpoint_best.pt, but I encountered an error while initializing the SpeechLMConfig:
```
Traceback (most recent call last):
  File "/remote-home/jzhan/SpeechT5/SpeechLM/test.py", line 7, in <module>
    cfg = SpeechLMConfig(checkpoint['cfg']['model'])
  File "/SpeechT5/SpeechLM/SpeechLM.py", line 128, in __init__
    self.update(cfg)
  File "/SpeechT5/SpeechLM/SpeechLM.py", line 132, in update
    self.text_transformer = TransformerConfig(model_cfg['text_transformer'])
KeyError: 'text_transformer'
```
Am I missing any model files?
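A diagnostic sketch (path assumed) to see which keys this particular checkpoint's model config actually carries:

```python
import torch

# A sketch (path assumed): list the top-level model-config keys to check
# whether 'text_transformer' is present in this checkpoint at all.
checkpoint = torch.load("speechlmp_base_asr_checkpoint_best.pt", map_location="cpu")
print(sorted(checkpoint["cfg"]["model"].keys()))
```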
Are these all the required preprocessing steps, or are there any missing parts?
I want to run the SID pre-trained model, but I got an error like this:
generate_class.py: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
If I must fine-tune SID, I run the SID fine-tuning and get this error:
fairseq-train: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt') SID finetuning finished
So, how do I run the model correctly? Thanks!
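For context, fairseq only lists built-in tasks until a user dir is imported; a sketch (the repo path is assumed) of what passing `--user-dir SpeechT5/speecht5` does under the hood:

```python
from argparse import Namespace
from fairseq.utils import import_user_module

# A sketch (path assumed): importing the repo's user dir is what registers
# 'speecht5' as a valid --task choice with fairseq.
import_user_module(Namespace(user_dir="/path/to/SpeechT5/speecht5"))
```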
First of all, thank you for your great work and code.
I am studying SpeechLM and am curious about some things regarding training and inference.
Can you tell me which stage you used for learning? The one below #L155, as I expected?
[https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh#L155]
Can you tell me which decoder is used for pseudo-label generation, and share your command?
steps/decode_fmllr.sh, or online2-wav-gmm-latgen-faster directly?
Best regards
Hi, I am trying to reproduce the T2U generator but am having issues converting ASR transcripts to phoneme sequences. I think the phoneme sequences in dataset/LibriSpeech/fast_phone2units/genset_examples.tsv were not produced by speechlm/data_process/prepare_phn2ltr_librilm.sh: the phonemes in the former are not up-sampled, and the probability of inserting silence is less than 0.25. Is there an example of how to prepare the phoneme sequence for the T2U generator?
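For concreteness, a sketch of the preprocessing as described above (the silence probability and repeat range are assumptions based on my reading, not the repo's actual recipe):

```python
import random

# A sketch (parameters assumed): up-sample each phoneme by a random repeat
# count and insert a silence token before it with probability 0.25.
def expand(phones, sil="sil", p_sil=0.25, min_rep=1, max_rep=3):
    out = []
    for ph in phones:
        if random.random() < p_sil:
            out.append(sil)
        out.extend([ph] * random.randint(min_rep, max_rep))
    return out

print(expand(["HH", "AH", "L", "OW"]))
```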
Thanks.
Hi, I want to pretrain a model using the SpeechT5 architecture. I followed the scripts given here: https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#data-preparation. But I wonder if there is a restriction in fairseq-preprocess when preparing data, because I ran into this error.
I found that the error is raised while batching samples from the .index and .bin data produced by fairseq-preprocess. Here is what my batch_sampler looks like: there are 455 items in batch_sampler, and each item contains 6 items except the last one.
So, in order to run successfully, I tried to drop the last row:
batch_sampler = batch_sampler[:-2]
But then I got this:
I would be really grateful if you could explain this. Thank you!
I have use "< break time="3s" / > " in my text file, but it's seems not work.
How can I make short pause between two words ?
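For reference, if the text front end turns out not to parse SSML tags, one workaround (a sketch with stand-in arrays) is to synthesize the phrases separately and splice in explicit silence:

```python
import numpy as np

# A sketch with stand-in arrays: synthesize the two phrases separately, then
# concatenate explicit silence instead of relying on an SSML <break> tag.
sr = 16000
wav_a = np.zeros(sr, dtype=np.float32)             # stand-in for the first clip
wav_b = np.zeros(sr, dtype=np.float32)             # stand-in for the second clip
pause = np.zeros(int(3.0 * sr), dtype=np.float32)  # 3 s of silence
audio = np.concatenate([wav_a, pause, wav_b])
```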
Hi, I have the same question as #16 (comment). My training dataset is Chinese, so can I use speechbrain/spkrec-xvect-voxceleb to extract speaker embeddings for pre-training?
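For reference, a minimal sketch of extracting an x-vector with that model (the audio file name is assumed; whether VoxCeleb-trained embeddings transfer well to Chinese speech is exactly the open question here):

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# A sketch (file name assumed): the model expects 16 kHz mono audio and
# returns a 512-dimensional speaker embedding.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load("sample.wav")
embedding = classifier.encode_batch(signal)
print(embedding.shape)
```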
Hi, thank you for your great work.
According to the appendix of the paper, a Kaldi model is used to convert audio into phonemes. I have trained a Kaldi model with a frame rate of 30 ms.
To generate the SpeechLM Base labels (10 ms), I just repeat each phoneme 3 times, and it works fine.
But the SpeechLM Large labels (20 ms) cannot be generated simply by repeating phonemes. Could you provide some details about this conversion?
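For illustration, one plausible mapping is nearest-neighbor resampling of the label sequence (an assumption on my part, not necessarily the authors' method); at a 30 ms to 10 ms conversion it reduces to the 3x repeat above:

```python
# A sketch (an assumed mapping): resample a phoneme sequence from a source
# frame rate to a target frame rate by nearest-neighbor indexing.
def resample_labels(labels, src_ms=30, tgt_ms=20):
    n_tgt = (len(labels) * src_ms) // tgt_ms
    return [labels[min(int(i * tgt_ms / src_ms), len(labels) - 1)] for i in range(n_tgt)]

print(resample_labels(["a", "b", "c"], 30, 10))  # ['a','a','a','b','b','b','c','c','c']
print(resample_labels(["a", "b", "c"], 30, 20))  # ['a', 'a', 'b', 'c']
```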
Hi,
Are you planning to open source the configuration of the baselines and downstream tasks?
Thanks a lot!
Hi,
Could you tell me where I can find the fine-tuned SpeechT5 for the speech enhancement task? Also, a link to how I can load and use it would be very useful.
Thank you,
Andrei
Hello,
The following inference instructions seem outdated:
https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#inference-1
The script SpeechT5/scripts/generate_speech.py triggers this error when using --task speecht5:
generate_speech.py: error: argument --task: invalid choice: 'speecht5' (choose from 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'sentence_prediction', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
Thanks
In the README for SpeechT5, the SPM model download link (here) redirects to https://github.com/microsoft/SpeechT5#language-model-and-vocabulary, but this section is missing from the README. Can you share the download links for the SPM model and vocabulary files?
Thanks
Hey there, I am looking forward to pre-training SpeechT5 on a custom dataset, preferably multilingual datasets. Could I please get some references, documentation, etc. as a starting point? Thanks.
Code for fine-tuning speech synthesis with the predicted log Mel-filterbank features, as described in the SpeechT5 paper, is not available.
Is it possible to provide this?
Many thanks
The quantizer and mixup method in joint pre-training is impressive. My question is whether the quantizer is used when fine-tuning the pretrained backbone for the downstream task. While reading the paper, I did not find a related statement. Thanks for answering.
Hello, thank you so much for providing these models and code along with all the documentation. The HuggingFace integration is very helpful for people like me whose specialty is not ML :) I tried out the TTS model available on HuggingFace and the results are very good, but I'm curious what the difference would be like using the larger SpeechT5 model.
My goal is to prepare the SpeechT5 Large model (60k hrs Libri-Light + LibriSpeech LM Dataset) for TTS in the same way that the smaller model on HuggingFace is tuned for TTS. I'm a little confused though on how the training was done for the smaller model in order to prepare it for TTS. I looked at the manifest and it says: "speecht5_tts.pt are reimplemented Text-to-Speech fine-tuning on the released manifest but with a smaller batch size or max updates (Ensure the manifest is ok)." Does this mean that the SpeechT5 for TTS model was completely retrained from scratch with different batch size/max updates, or was it fine-tuned from the SpeechT5 base model (960 hrs LibriSpeech + LibriSpeech LM Dataset)?
The manifest also says: "This manifest is an attempt to recreate the Text-to-Speech recipe used for training SpeechT5. This manifest was constructed using LibriTTS clean datasets, including train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation." Does this mean that it was trained from scratch using 100 + 360 = 460 hours of LibriTTS data, or was it fine-tuned on those 460 hours of data?
Thank you!
I want to use the released pretrained SpeechUT model for fine-tuning, and also want to use the released model fine-tuned on MuST-C En-De for inference. However, when I reload the checkpoints, there are some extra non-local paths which cause a FileNotFoundError. How can I solve this problem?
I used a Colab notebook to fine-tune this model. When I run trainer.train(), it raises an error:
```
in <cell line: 2>:2

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train

  1659       inner_training_loop = find_executable_batch_size(
  1660           self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
  1661       )
> 1662       return inner_training_loop(
  1663           args=args,
  1664           resume_from_checkpoint=resume_from_checkpoint,
  1665           trial=trial,

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1839 in _inner_training_loop

  1836       self.state.is_world_process_zero = self.is_world_process_zero()
  1837
  1838       # tr_loss is a tensor to avoid synchronization of TPUs through .item()
> 1839       tr_loss = torch.tensor(0.0).to(args.device)
  1840       # _total_loss_scalar is updated everytime .item() has to be called on tr_loss an
  1841       self._total_loss_scalar = 0.0
  1842       self._globalstep_last_logged = self.state.global_step

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be
incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
I do use a GPU, so why did this error happen?
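As the message itself suggests, a first debugging step (a sketch; the variable must be set before CUDA is initialized) is to make kernel launches synchronous so the trace points at the real failing op; device-side asserts during training are often caused by out-of-range label ids:

```python
import os

# A debugging sketch: set this before any CUDA work so kernel launches are
# synchronous and the stack trace points at the actual failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```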
Hello, congratulations on the success of this paper!
I want to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a Kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".
I'm new to speech processing, especially to the traditional HMM models used by Kaldi, so I would be very thankful if you could answer.
Thanks a lot!
If I want to pretrain SpeechT5 on my own dataset, how can I get the x-vector?
Hi, I found that there must be 3 columns in the audio manifest tsv file. Is there a tutorial or example on how to get the speaker embedding for my own dataset? Is it possible to pretrain a model on a dataset without speaker labels?
Thanks!
Hello, thanks for your great work. However, I want to ask you a question. I notice that there is a model named Fast Text2Unit Model in the SpeechLM project, but I didn't find any usage of it. Is the model used to transform text (transcribed from speech) into units?
Hi, congratulations on your achievement in this great work!
I did pre-training according to the given configuration, but the loss converges quickly (in about 20k updates) and then rises. I don't know if this is normal; could you share your pre-training curve? Thanks.
Fig. 1 is from SpeechLM (https://arxiv.org/pdf/2209.15329.pdf), and Fig. 2 is from SpeechUT (https://arxiv.org/pdf/2210.03730.pdf).
I notice that the WER is the same in the Base-size group but different in the Large-size group.
Why?
First, thank you for your amazing achievements.
I tried ASR fine-tuning, and it works well!
So I'd like to do other things too, such as a voice conversion task!
Can you provide a voice-conversion fine-tuning and conversion recipe?
Can one fine-tune SpeechT5 on Hugging Face? Until now, I've only seen already fine-tuned models.
Thanks