microsoft / speecht5
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
License: MIT License
Hi there,
Great repo and paper. I follow the exact installation and data prep steps for Speech2C training and get this error when I run the pre_training command:
AssertionError: number of labels does not match (5567 != 5566)
Any help would be appreciated!
Additionally, I also had to change `dir: ./` in the config and set `common.user_dir=speech2c` to make the code work.
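For anyone hitting the same assertion, a quick sanity check that the label file lines up with the manifest (a minimal sketch; the file names are assumed):

```python
# A sketch (file names assumed): in HuBERT-style data prep, the .km label file
# must have exactly one line per audio entry in the .tsv manifest, whose first
# line is the root directory rather than an utterance.
n_audio = sum(1 for _ in open("train.tsv")) - 1  # subtract the root-dir header
n_label = sum(1 for _ in open("train.km"))
assert n_label == n_audio, f"number of labels does not match ({n_label} != {n_audio})"
```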
Hello!
Thank you very much for adding a code snippet to outline how to load pre-trained SpeechT5 weights, super helpful for understanding how to process the data and load the task!
I've been attempting to load the 'base' pre-trained weights according to the code snippet provided here:
import torch
from speecht5.tasks.speecht5 import SpeechT5Task
from speecht5.models.speecht5 import T5TransformerModel
checkpoint = torch.load('/path/to/speecht5_checkpoint')
checkpoint['cfg']['task'].t5_task = 'pretrain'
checkpoint['cfg']['task'].hubert_label_dir = "/path/to/hubert_label"
checkpoint['cfg']['task'].data = "/path/to/tsv_file"
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
Steps performed:

- Set n_clusters=500 and created a dummy `dict.km.txt`:

```sh
n_clusters=500
for x in $(seq 0 $((n_clusters - 1))); do
  echo "$x 1"
done >> $lab_dir/dict.km.txt
```

- Put the tsv file under `data` and the HuBERT labels under `hubert_label_dir`.
- Ran `task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])`, which failed with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:242, in Dictionary.add_from_file(self, f)
241 try:
--> 242 line, field = line.rstrip().rsplit(" ", 1)
243 if field == "#fairseq:overwrite":
ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
File ~/SpeechT5/SpeechT5/speecht5/tasks/speecht5.py:301, in SpeechT5Task.setup_task(cls, args, **kwargs)
299 if args.t5_task == "pretrain":
300 dicts["hubert"] = [Dictionary.load(f"{args.hubert_label_dir}/dict.{label}.txt") for label in args.hubert_labels]
--> 301 dicts["text"] = Dictionary.load(op.join(args.data, "dict.txt"))
302 else:
303 if config is None:
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:216, in Dictionary.load(cls, f)
207 """Loads the dictionary from a text file with the format:
208
209 ```
(...)
213 ```
214 """
215 d = cls()
--> 216 d.add_from_file(f)
217 return d
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:227, in Dictionary.add_from_file(self, f)
225 try:
226 with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
--> 227 self.add_from_file(fd)
228 except FileNotFoundError as fnfe:
229 raise fnfe
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:260, in Dictionary.add_from_file(self, f)
258 self.add_symbol(word, n=count, overwrite=overwrite)
259 except ValueError:
--> 260 raise ValueError(
261 "Incorrect dictionary format, expected '<token> <cnt> [flags]'"
262 )
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
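For reference, fairseq dictionaries are plain text with one `<token> <count>` pair per line; a minimal sketch (file name and tokens assumed) that writes and loads a valid one:

```python
from fairseq.data import Dictionary

# A sketch (file name and tokens assumed): each line must be "<token> <count>";
# a line missing the count column triggers exactly the error above.
with open("dict.txt", "w") as f:
    for token in ["a", "b", "c"]:
        f.write(f"{token} 1\n")

d = Dictionary.load("dict.txt")
print(len(d))  # 7: the 3 tokens plus the 4 built-in specials (bos/pad/eos/unk)
```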
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [17], in <cell line: 2>()
1 model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
----> 2 model.load_state_dict(checkpoint['model'])
File ~/SpeechT5/SpeechT5/speecht5/models/speecht5.py:1040, in T5TransformerModel.load_state_dict(self, state_dict, strict, model_cfg, args)
1036 m_state_dict = {
1037 key.replace(f"{m}.", ""): value for key, value in state_dict.items() if key.startswith(f"{m}.")
1038 }
1039 if hasattr(self, m):
-> 1040 self._modules[m].load_state_dict(m_state_dict, False)
1041 return self
File ~/venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1497, in Module.load_state_dict(self, state_dict, strict)
1492 error_msgs.insert(
1493 0, 'Missing key(s) in state_dict: {}. '.format(
1494 ', '.join('"{}"'.format(k) for k in missing_keys)))
1496 if len(error_msgs) > 0:
-> 1497 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
1498 self.__class__.__name__, "\n\t".join(error_msgs)))
1499 return _IncompatibleKeys(missing_keys, unexpected_keys)
RuntimeError: Error(s) in loading state_dict for TransformerEncoder:
size mismatch for proj.weight: copying a param with shape torch.Size([81, 768]) from checkpoint, the shape in current model is torch.Size([7, 768]).
size mismatch for proj.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([7]).
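The [81, 768] vs. [7, 768] mismatch suggests the text dictionary built by the task is smaller than the one used at pre-training time. A diagnostic sketch (checkpoint path assumed; it just scans the state dict, since exact key names vary):

```python
import torch

# A diagnostic sketch (path assumed): print every projection-like parameter
# and its shape to see which vocabulary size the checkpoint expects.
ckpt = torch.load("/path/to/speecht5_checkpoint", map_location="cpu")
for key, value in ckpt["model"].items():
    if "proj" in key:
        print(key, tuple(value.shape))
```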
I would be very grateful for some insight on these two errors.
Many thanks for your help!
At 95400 num_updates during pre-training, I got:

```
File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/multitask_dataset.py", line 58, in __getitem__
  sample = self.datasets[dataset_idx][sample_idx]
File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/text_dataset.py", line 218, in __getitem__
  assert (source[1:-1] >= 1).all()
IndexError: slice() cannot be applied to a 0-dim tensor
```

Could the reason be the text data preparation?
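For context, a 0-dim tensor reproduces this exactly; it would arise if a text example collapses to a scalar (e.g. an empty or single-token line). A minimal illustration:

```python
import torch

source = torch.tensor(2)  # 0-dim: what a collapsed/empty text example looks like
try:
    source[1:-1]
except IndexError as e:
    print(e)  # slice() cannot be applied to a 0-dim tensor
```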
Hey SpeechT5 team, I've been seeing some SpeechT5 activity on Hugging Face, but didn't really see a Hugging Face version of SpeechT5. Could you please tell me if you are working on porting SpeechT5 to Hugging Face, or perhaps it's already done?
I am planning to pretrain SpeechT5 and then use it in Hugging Face for other downstream tasks.
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open-source guidance available internally.
May I ask how many epochs are set during pre-training and fine-tuning?
Hi,
I am trying to load the fine-tuned models provided for VATLM. However, I encounter an error where loading tries to access local storage on the machine where the model was trained. This occurs with all the models you have shared.
The error is:
`File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 93, in main
return _main(cfg, h)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 115, in _main
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task
model = task.build_model(cfg.model)
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
model = models.build_model(cfg, self)
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/models/init.py", line 96, in build_model
return model.build_model(cfg, task)
File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/models/vathubert_asr.py", line 400, in build_model
state = checkpoint_utils.load_checkpoint_to_cpu(
File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 303, in load_checkpoint_to_cpu
with open(local_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/default/v-qiushizhu/vatlm_related/results/fbank_large_vox_pretrain_iter5_ext_audio_1_32ngpu_2updatefreq/checkpoints/checkpoint_388_600000.pt'
`
I noticed that the shared VATLM models don't have cfg.model.w2v_args in the state dict; it is None during loading.
Would be great if you could help resolve this.
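For what it's worth, fairseq can rewrite stored config values at load time via arg_overrides; a sketch (the paths are assumed, and whether `w2v_path` is the key vathubert_asr consults is also an assumption):

```python
from fairseq import checkpoint_utils

# A sketch (paths and the overridden key are assumptions): arg_overrides
# patches the stored config before the model is built, which can redirect a
# stale absolute checkpoint path to a local copy.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/finetuned_vatlm.pt"],
    arg_overrides={"w2v_path": "/path/to/local_pretrained_vatlm.pt"},
)
```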
Hello!
Thank you for the great work again! I tried to train Speech2C and got this error after 49 epochs:
[2022-11-07 00:33:16,340][fairseq.nan_detector][WARNING] - Inf detected in output of , shape: torch.Size([1464, 505]), forward
Some training details:
dataset: libri 360
k-means trained on: libri 100
config: https://drive.google.com/file/d/1Ms5m-cuTrv43xsntHBdM_PEWaXtGGMOR/view?usp=sharing
hydra_log: https://drive.google.com/file/d/1HWvXqUGhNU-LnKNRj52HAbXPR-GqOVBU/view?usp=sharing
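For reference, fairseq's nan_detector flags this via forward hooks; a minimal sketch of the same kind of check (the module name and shapes below are stand-ins taken from the log above):

```python
import torch

# A minimal sketch of the kind of check fairseq's NanDetector performs:
# register a forward hook and flag non-finite values in a module's output.
def make_hook(name):
    def hook(module, inputs, output):
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            print(f"Inf detected in output of {name}, shape: {tuple(output.shape)}")
    return hook

layer = torch.nn.Linear(768, 505)                    # stand-in for the flagged module
layer.register_forward_hook(make_hook("final_proj"))
layer(torch.full((1464, 768), float("inf")))         # non-finite input fires the hook
```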
Can you please let me know if this has ever happened in your training setup, or if you know where I am going wrong?
Thank you!
Hi,
In the SpeechT5 paper, LibriSpeech-960h (16 kHz) is used as the speech pre-training dataset, while LibriTTS (24 kHz) is used as the TTS dataset. How do you deal with this sample-rate mismatch? Thank you!
Hi,
I tried running `fairseq-generate` according to the instructions in the README. It crashed at tasks/speecht5.py line 173 with the following exception:

argparse.ArgumentError: argument --mask-length: conflicting option string: --mask-length

I used the debugger and found that there is already an existing `--mask-length` argument configured in the parser; it was added from fairseq/fairseq/options.py line 149, where arguments from wav2vec2 are added. Apparently fairseq's generate.py sets wav2vec2 as the default for `--arch`.
I tried manually specifying the `--arch` argument as `t5_transformer_base` or `t5_transformer_base_asr`, but then the argument parser complains that `--path` is not a supported argument.
My versions are SpeechT5 commit f9b059b and fairseq commit e35c593c84bd84d5c7777ef7ace98dab508ff88e.
Any idea how to fix it, preferably without modifying fairseq code? Thanks.
Hello.
First of all, thank you for your great work.
Unfortunately, I have some issues with the preparation of the text data for pre-training and ASR fine-tuning.
I have followed the instructions provided, but I cannot figure out how I should preprocess the text data using SPM and fairseq.
How can I create text_train.tsv/text_valid.tsv?
I also have some difficulties creating the label data for the text; what format should I use?
Can you provide more details or examples of the manifest for the text?
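For anyone else stuck here, a minimal sketch of one typical SPM-to-fairseq text pipeline (the file names, vocab size, and model type are assumptions, not SpeechT5's actual recipe):

```python
import sentencepiece as spm

# A sketch (settings assumed): train an SPM model on raw text (one sentence per
# line) and re-encode the corpus so fairseq-preprocess can build a dictionary.
spm.SentencePieceTrainer.train(
    input="train_text.txt", model_prefix="spm_unigram",
    vocab_size=10000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
with open("train_text.txt") as fin, open("train.spm.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```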
I understand that it's a placeholder, but it should at least link to an empty repo/folder.
Hello,
I would like to do a grid search over hyper-parameters using ./SpeechT5/speecht5. I keep getting:
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: model
Any recommendation?
Thanks!
In `from speecht5.data.text_to_speech_dataset` used here, it seems a text_to_speech_dataset.py file is missing from the speecht5/data folder, unlike the other similar files present there.
Hi,
Do you think it would be possible to combine both speech and text as input to the encoder? I'm looking to then decode text based on this multimodal input. Should I be looking at the MultitaskDataset? Would the s2t task work for this?
Thanks.
Hi there. Recently I have been working on extracting features using the model "VATLM Large VoxCeleb2 + LRS3 + paired audio+text+audio LRS-433h visual" from the config file.
However, when I load the pretrained model, the log says fp16 is disabled (see the bottom line of the full config dump below).
2022-12-04 11:26:57 | INFO | vathubert.tasks.vathubert_pretraining | VATHubertPretrainingTask Config {'_name': 'vat_hubert_pretraining', 'data': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'labels': ['km'], 'label_dir': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'label_rate': 25, 'sample_rate': 25, 'normalize': True, 'enable_padding': False, 'max_sample_size': 500, 'min_sample_size': 5, 'max_trim_sample_size': '${task.max_sample_size}', 'single_target': False, 'random_crop': False, 'pad_audio': True, 'pdb': False, 'stack_order_audio': 4, 'skip_verify': False, 'text_sampling_alpha': 0.2, 'split_modality_batch': False, 'image_aug': True, 'image_crop_size': 88, 'image_mean': 0.421, 'image_std': 0.165, 'modalities': ['audio', 'video'], 'is_s2s': False, 'tokenizer_bpe_name': None, 'tokenizer_bpe_model': None, 'noise_wav': None, 'noise_prob': 0.0, 'noise_snr': '0', 'noise_num': 1, 'fine_tuning': False, 'use_supervised_data': True, 'sup_data_path': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sup_manifest': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sample_distributions': '0.13,0.15,0.32,0.3', 'use_extra_textdata': True, 'onlytext_manifest': '/LocalData/vatlm_related/fbankdata/cantab2_vox_tsv', 'use_extra_audiodata': True, 'onlyaudio_manifest': '/LocalData/vatlm_related/fbankdata/fbank_giga_vox_tsv_km'} 2022-12-04 11:26:57 | INFO | vathubert.models.vathubert | HubertModel Config: {'_name': 'vat_hubert', 'label_rate': 25, 'modalities': '${task.modalities}', 'extractor_mode': default, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length_audio': 10, 'mask_prob_audio': 0.8, 'mask_length_image': 5, 'mask_prob_image': 0.3, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'resnet_relu_type': 'prelu', 'resnet_weights': None, 'sim_type': 'cosine', 'sub_encoder_layers': 0, 'audio_feat_dim': 104, 'modality_dropout': 0.5, 'audio_dropout': 0.5, 'modality_fuse': 'concat', 'selection_type': 'same_seq', 'masking_type': 'input', 'decoder_embed_dim': 768, 'decoder_ffn_embed_dim': 3072, 'decoder_layers': 6, 'decoder_layerdrop': 0.0, 'decoder_attention_heads': 4, 'decoder_learned_pos': False, 'decoder_normalize_before': False, 'no_token_positional_embeddings': False, 'decoder_dropout': 0.1, 'decoder_attention_dropout': 0.1, 'decoder_activation_dropout': 0.0, 'max_target_positions': 2048, 'share_decoder_input_output_embed': False, 'no_scale_embedding': True, 'layer_type': transformer, 'pos_conv_depth': 1, 'max_positions': 100000, 'checkpoint_activations': False, 'required_seq_len_multiple': 1, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
From my personal experience, inconsistent fp16 settings between linear probing and pre-training often lead to degraded performance, so I want to know whether fp16 was enabled during pre-training. Thanks!
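A quick way to check (a sketch; the path is assumed, and the cfg layout can vary between fairseq versions) is to read the flag straight from the checkpoint:

```python
import torch

# A sketch (path assumed; cfg layout varies across fairseq versions): the fp16
# flag used at training time is stored in the checkpoint's common config.
ckpt = torch.load("vatlm_large_checkpoint.pt", map_location="cpu")
print(ckpt["cfg"]["common"]["fp16"])
```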
Hi there. I'm trying to extract features with the released VATLM pre-trained models. Here's what I did.
Firstly, I tried loading the pre-trained model:

```python
# cwd: ..../av_hubert
import fairseq.checkpoint_utils
import vathubert

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(["./finetune_large_vox2_v_433h.pt"])
```
but an error occurs: `AssertionError: Could not infer task type from {'_name': 'vat_hubert_pretraining'...`

This is because the task vat_hubert_pretraining was not successfully registered with fairseq. The reason is that the sub-package vathubert.tasks is not properly initialized, since there is no __init__.py in it (as far as I know, without an __init__.py it is only a namespace package, not a regular package). To fix it, I added an __init__.py to ./vathubert/tasks/ with the content:

```python
from .vathubert_pretraining import VATHubertPretrainingTask
```

This registers the task with fairseq. However, another error occurs:

ModuleNotFoundError: No module named 'fairseq.data.audio.multi_corpus_dataset_audio'

I read the fairseq source code and did not find a multi_corpus_dataset_audio.py file in the fairseq/data/audio directory.
Did I miss anything, or is there a bug in the code? Any help is appreciated, thanks!
Hi, congratulations on your achievement in this great work!
It is my first time using fairseq, so could you please give the exact values, or an example, of the parameters in the "ASR finetune" training and inference part? These are the values:
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
Thanks a lot!
(And the steps for how to get these values would be great!)
Hi! I'm trying to use the script in the README to extract features using pre-trained models. I used the model speechlmp_base_asr_checkpoint_best.pt, but I encountered an error while initializing the SpeechLMConfig:
```
Traceback (most recent call last):
  File "/remote-home/jzhan/SpeechT5/SpeechLM/test.py", line 7, in <module>
    cfg = SpeechLMConfig(checkpoint['cfg']['model'])
  File "/SpeechT5/SpeechLM/SpeechLM.py", line 128, in __init__
    self.update(cfg)
  File "/SpeechT5/SpeechLM/SpeechLM.py", line 132, in update
    self.text_transformer = TransformerConfig(model_cfg['text_transformer'])
KeyError: 'text_transformer'
```
Am I missing any model files?
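A diagnostic sketch (path assumed) to see which keys this particular checkpoint's model config actually carries:

```python
import torch

# A sketch (path assumed): list the top-level model-config keys to check
# whether 'text_transformer' is present in this checkpoint at all.
checkpoint = torch.load("speechlmp_base_asr_checkpoint_best.pt", map_location="cpu")
print(sorted(checkpoint["cfg"]["model"].keys()))
```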
Are these all the required preprocessing steps, or are there any missing parts?
I want to run the SID pre-trained model, but I got an error like this:
generate_class.py: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
If I must fine-tune SID, I run the SID fine-tuning and get this error:
fairseq-train: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt') SID finetuning finished
So, how do I run the model correctly? Thanks!
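For context, fairseq only lists built-in tasks until a user dir is imported; a sketch (the repo path is assumed) of what passing `--user-dir SpeechT5/speecht5` does under the hood:

```python
from argparse import Namespace
from fairseq.utils import import_user_module

# A sketch (path assumed): importing the repo's user dir is what registers
# 'speecht5' as a valid --task choice with fairseq.
import_user_module(Namespace(user_dir="/path/to/SpeechT5/speecht5"))
```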
First of all, thank you for your great work and code.
I am studying SpeechLM and am curious about some things regarding training and inference.
Can you tell me which stage you used for learning? The one below #L155, as I expected?
[https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh#L155]
Can you tell me which decoder is used for pseudo-label generation, and share your command?
steps/decode_fmllr.sh, or online2-wav-gmm-latgen-faster directly?
Best regards
Hi, I am trying to reproduce the T2U generator but am having issues converting ASR transcripts to phoneme sequences. I think the phoneme sequences in dataset/LibriSpeech/fast_phone2units/genset_examples.tsv were not produced by speechlm/data_process/prepare_phn2ltr_librilm.sh: the phonemes in the former are not up-sampled, and the probability of inserting silence is less than 0.25. Is there an example of how to prepare the phoneme sequence for the T2U generator?
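For concreteness, a sketch of the preprocessing as described above (the silence probability and repeat range are assumptions based on my reading, not the repo's actual recipe):

```python
import random

# A sketch (parameters assumed): up-sample each phoneme by a random repeat
# count and insert a silence token before it with probability 0.25.
def expand(phones, sil="sil", p_sil=0.25, min_rep=1, max_rep=3):
    out = []
    for ph in phones:
        if random.random() < p_sil:
            out.append(sil)
        out.extend([ph] * random.randint(min_rep, max_rep))
    return out

print(expand(["HH", "AH", "L", "OW"]))
```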
Thanks.
Hi, I want to pretrain a model using the SpeechT5 architecture. I followed the scripts given here: https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#data-preparation. But I wonder if there is a restriction in fairseq-preprocess when preparing data, because I ran into this error.
I found that the error is raised while batching samples from the .index and .bin data produced by fairseq-preprocess. Here is what my batch_sampler looks like: there are 455 items in batch_sampler, and each item contains 6 items except the last one.
So, in order to run successfully, I tried to drop the last row:
batch_sampler = batch_sampler[:-2]
But then I got this:
I would be really grateful if you could explain this. Thank you!
I have use "< break time="3s" / > " in my text file, but it's seems not work.
How can I make short pause between two words ?
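For reference, if the text front end turns out not to parse SSML tags, one workaround (a sketch with stand-in arrays) is to synthesize the phrases separately and splice in explicit silence:

```python
import numpy as np

# A sketch with stand-in arrays: synthesize the two phrases separately, then
# concatenate explicit silence instead of relying on an SSML <break> tag.
sr = 16000
wav_a = np.zeros(sr, dtype=np.float32)             # stand-in for the first clip
wav_b = np.zeros(sr, dtype=np.float32)             # stand-in for the second clip
pause = np.zeros(int(3.0 * sr), dtype=np.float32)  # 3 s of silence
audio = np.concatenate([wav_a, pause, wav_b])
```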
Hi, I have the same question as #16 (comment). My training dataset is Chinese, so can I use speechbrain/spkrec-xvect-voxceleb to extract speaker embeddings for pre-training?
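For reference, a minimal sketch of extracting an x-vector with that model (the audio file name is assumed; whether VoxCeleb-trained embeddings transfer well to Chinese speech is exactly the open question here):

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# A sketch (file name assumed): the model expects 16 kHz mono audio and
# returns a 512-dimensional speaker embedding.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load("sample.wav")
embedding = classifier.encode_batch(signal)
print(embedding.shape)
```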
Hi, thank you for your great work.
According to the appendix of the paper, a Kaldi model is used to convert audio into phonemes. I have trained a Kaldi model with a frame rate of 30 ms.
To generate the SpeechLM Base labels (10 ms), I just repeat each phoneme 3 times, and it works fine.
But the SpeechLM Large labels (20 ms) cannot be generated simply by repeating phonemes. Could you provide some details about this conversion?
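For illustration, one plausible mapping is nearest-neighbor resampling of the label sequence (an assumption on my part, not necessarily the authors' method); at a 30 ms to 10 ms conversion it reduces to the 3x repeat above:

```python
# A sketch (an assumed mapping): resample a phoneme sequence from a source
# frame rate to a target frame rate by nearest-neighbor indexing.
def resample_labels(labels, src_ms=30, tgt_ms=20):
    n_tgt = (len(labels) * src_ms) // tgt_ms
    return [labels[min(int(i * tgt_ms / src_ms), len(labels) - 1)] for i in range(n_tgt)]

print(resample_labels(["a", "b", "c"], 30, 10))  # ['a','a','a','b','b','b','c','c','c']
print(resample_labels(["a", "b", "c"], 30, 20))  # ['a', 'a', 'b', 'c']
```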
Hi,
Are you planning to open source the configuration of the baselines and downstream tasks?
Thanks a lot!
Hi,
Could you tell me where I can find the fine-tuned SpeechT5 for the speech enhancement task? Also, a link to how I can load and use it would be very useful.
Thank you,
Andrei
Hello,
The following inference instructions seem outdated:
https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#inference-1
The script SpeechT5/scripts/generate_speech.py triggers this error when using --task speecht5:
generate_speech.py: error: argument --task: invalid choice: 'speecht5' (choose from 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'sentence_prediction', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
Thanks
In the README for SpeechT5, the SPM model download link (here) redirects to https://github.com/microsoft/SpeechT5#language-model-and-vocabulary, but this section is missing from the README. Can you share the download links for the SPM model and vocabulary files?
Thanks
Hey there, I am looking forward to pre-training SpeechT5 on a custom dataset, preferably multilingual datasets. Could I please get some references, documentation, etc. as a starting point? Thanks.
Code for fine-tuning speech synthesis with the predicted log Mel-filterbank features, as described in the SpeechT5 paper, is not available.
Is it possible to provide this?
Many thanks
The quantizer and mixup method in joint pre-training is impressive. My question is whether the quantizer is used when fine-tuning the pretrained backbone for the downstream task. While reading the paper, I did not find a related statement. Thanks for answering.
Hello, thank you so much for providing these models and code along with all the documentation. The HuggingFace integration is very helpful for people like me whose specialty is not ML :) I tried out the TTS model available on HuggingFace and the results are very good, but I'm curious what the difference would be like using the larger SpeechT5 model.
My goal is to prepare the SpeechT5 Large model (60k hrs Libri-Light + LibriSpeech LM Dataset) for TTS in the same way that the smaller model on HuggingFace is tuned for TTS. I'm a little confused though on how the training was done for the smaller model in order to prepare it for TTS. I looked at the manifest and it says: "speecht5_tts.pt are reimplemented Text-to-Speech fine-tuning on the released manifest but with a smaller batch size or max updates (Ensure the manifest is ok)." Does this mean that the SpeechT5 for TTS model was completely retrained from scratch with different batch size/max updates, or was it fine-tuned from the SpeechT5 base model (960 hrs LibriSpeech + LibriSpeech LM Dataset)?
The manifest also says: "This manifest is an attempt to recreate the Text-to-Speech recipe used for training SpeechT5. This manifest was constructed using LibriTTS clean datasets, including train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation." Does this mean that it was trained from scratch using 100 + 360 = 460 hours of LibriTTS data, or was it fine-tuned on those 460 hours of data?
Thank you!
I want to use the released pretrained SpeechUT model for fine-tuning, and also want to use the released model fine-tuned on MuST-C En-De for inference. However, when I reload the checkpoints, there are some extra non-local paths which cause a FileNotFoundError. How can I solve this problem?
I used a Colab notebook to fine-tune this model. When I run trainer.train(), it raises an error:
```
in <cell line: 2>:2

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train

  1659       inner_training_loop = find_executable_batch_size(
  1660           self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
  1661       )
> 1662       return inner_training_loop(
  1663           args=args,
  1664           resume_from_checkpoint=resume_from_checkpoint,
  1665           trial=trial,

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1839 in _inner_training_loop

  1836       self.state.is_world_process_zero = self.is_world_process_zero()
  1837
  1838       # tr_loss is a tensor to avoid synchronization of TPUs through .item()
> 1839       tr_loss = torch.tensor(0.0).to(args.device)
  1840       # _total_loss_scalar is updated everytime .item() has to be called on tr_loss an
  1841       self._total_loss_scalar = 0.0
  1842       self._globalstep_last_logged = self.state.global_step

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be
incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
I do use a GPU, so why did this error happen?
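As the message itself suggests, a first debugging step (a sketch; the variable must be set before CUDA is initialized) is to make kernel launches synchronous so the trace points at the real failing op; device-side asserts during training are often caused by out-of-range label ids:

```python
import os

# A debugging sketch: set this before any CUDA work so kernel launches are
# synchronous and the stack trace points at the actual failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```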
Hello, congratulations on the success of this paper!
I want to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a Kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".
I'm new to speech processing, especially to the traditional HMM models used by Kaldi, so I would be very thankful if you could answer.
Thanks a lot!
If I want to pretrain SpeechT5 on my own dataset, how can I get the x-vector?
Hi, I found that there must be 3 columns in the audio manifest tsv file. Is there a tutorial or example on how to get the speaker embedding for my own dataset? Is it possible to pretrain a model on a dataset without speaker labels?
Thanks!
Hello, thanks for your great work. However, I want to ask you a question. I notice that there is a model named Fast Text2Unit Model in the SpeechLM project, but I didn't find any usage of it. Is the model used to transform text (transcribed from speech) into units?
Hi, congratulations on your achievement in this great work!
I did pre-training according to the given configuration, but the loss converges quickly (in about 20k updates) and then rises. I don't know if this is normal; could you share your pre-training curve? Thanks.
Fig. 1 is from SpeechLM (https://arxiv.org/pdf/2209.15329.pdf), and Fig. 2 is from SpeechUT (https://arxiv.org/pdf/2210.03730.pdf).
I notice that the WER is the same in the Base-size group but different in the Large-size group.
Why?
First, thank you for your amazing achievements.
I tried ASR fine-tuning, and it works well!
So I'd like to do other things too, such as a voice conversion task!
Can you provide a voice-conversion fine-tuning and conversion recipe?
Can one fine-tune SpeechT5 on Hugging Face? Until now, I've only seen already fine-tuned models.
Thanks