mailong25 / self-supervised-speech-recognition
speech to text with self-supervised learning based on wav2vec 2.0 framework
Hello, thank you for this. Everything works fine for me until the last step, transcribing a file.
I get the message: malloc(): mismatching next->prev_size (unsorted), aborted.
I also sometimes get a segmentation fault, but without any traceback.
Do you know what the cause of this might be?
Thanks :-)
What Vietnamese dataset did you use?
Hi @mailong25,
I tried testing with a few of the audio files provided in this repo. According to what you published, the WER is only around 15%, but in my test the WER goes up to 450 even though the transcription is correct, without a single wrong word. For one file only 1 of 11 words was wrong after transcription, yet the WER reached 470. Could there be a mistake somewhere?
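For reference, word error rate is usually defined as the word-level edit distance divided by the number of reference words, so one wrong word out of 11 should give roughly 9%, not 470. A minimal sketch of that definition follows; the function name `wer` is illustrative and is not taken from this repo.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


reference = "one two three four five six seven eight nine ten eleven"
hypothesis = "one two three four five six seven eight nine ten twelve"
print(round(wer(reference, hypothesis), 2))
```

A value far above 100 for a nearly perfect transcript may point to a mismatch in how the reference and hypothesis are tokenized, though that is only a guess for this case.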
Hi, I want to run multiple models, but after starting one instance I am unable to run another instance.
I am getting this error:
File "/media/administrator/hdd/self-supervised-speech-recognition/stt.py", line 187, in optimize_models
model.cuda()
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please help me out.
GPU memory consumption for one instance is 2246 MB / 8 GB.
Thanks,
@mailong25
Hey
Once again, thanks for making a nice wrapper. I understand that your code ultimately calls the wav2vec commands given in the original repository. I have some questions which didn't get resolved there and I would love your suggestions:
Thanks for helping out.
Hi, your work is really awesome. I currently get a ValueError when running prediction with your public pre-trained models.
Here is the issue code:
I tried to debug the error via facebookresearch/fairseq#2106, but the problem still occurred.
Please tell me how to fix it. My OS is Windows 10, by the way. Many thanks.
Hi, the log says the checkpoint was saved, but I checked the folder and it is empty.
2021-01-29 08:29:52 | INFO | fairseq.trainer | Preparing to save checkpoint to checkpoints/checkpoint_last.pt after 21 updates
2021-01-29 08:29:56 | INFO | fairseq.trainer | Finished saving checkpoint to checkpoints/checkpoint_last.pt
2021-01-29 08:29:56 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/checkpoint_last.pt (epoch 33 @ 21 updates, score 3.125) (writing took 4.057792565999989 seconds)
2021-01-29 08:29:56 | INFO | fairseq_cli.train | end of epoch 33 (average epoch stats below)
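One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the log prints a relative path (checkpoints/checkpoint_last.pt), so the file lands under whatever working directory the training process used, which may differ from the folder being inspected. A small sketch that searches a directory tree for checkpoint files; `find_checkpoints` is a hypothetical helper.

```python
from pathlib import Path


def find_checkpoints(root="."):
    """Return all checkpoint_*.pt files below root, most recently modified first."""
    paths = [p for p in Path(root).rglob("checkpoint_*.pt") if p.is_file()]
    return sorted(paths, key=lambda p: p.stat().st_mtime, reverse=True)


# Example: list checkpoints below the directory the training was started from.
# for path in find_checkpoints("."):
#     print(path.resolve())
```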
Given that I'm training on 5 GeForce GTX 1080 Ti GPUs with 10.917 GB of memory each, how can I calculate max_tokens so that no memory error occurs?
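There does not seem to be a closed-form answer; a common approach is to start from a value that is known to work and lower it until out-of-memory errors disappear. It can help to reason in seconds of audio. The sketch below assumes that max_tokens counts raw audio samples per GPU batch and that the audio is sampled at 16 kHz; both are assumptions to verify against your setup, and the function names are made up for this example.

```python
SAMPLE_RATE = 16_000  # assumed sampling rate in Hz


def batch_seconds(max_tokens: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Seconds of audio per GPU batch if max_tokens counts raw samples."""
    return max_tokens / sample_rate


def max_tokens_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Inverse of batch_seconds: samples needed for a given batch duration."""
    return int(seconds * sample_rate)


print(batch_seconds(1_200_000))  # 75.0
print(max_tokens_for(40))        # 640000
```

Under these assumptions, max_tokens must be at least as large as the longest utterance in samples, otherwise that utterance cannot fit into any batch.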
Hi Mr @mailong25,
I'm having an issue generating the train and test files. I see in your code that you use 2 files with the .lst extension. With my own dataset (for example VLSP2020), I created my own train and test files, and the training process seemed to work fine until this happened
Can you explain this? Thank you so much. Below are my 2 files:
test.txt
train.txt
When creating an instance of the Transcriber class, I get the following error:
usage: data [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[etc]
data
data: error: the following arguments are required: data
An exception has occurred, use %tb to see the full traceback.
SystemExit Traceback (most recent call last)
in ()
----> 1 parser.parse_args()
4 frames
/usr/lib/python3.7/argparse.py in parse_args(self, args, namespace)
1762 # =====================================
1763 def parse_args(self, args=None, namespace=None):
-> 1764 args, argv = self.parse_known_args(args, namespace)
1765 if argv:
1766 msg = _('unrecognized arguments: %s')
/usr/lib/python3.7/argparse.py in parse_known_args(self, args, namespace)
1794 # parse the arguments and exit if there are any errors
1795 try:
-> 1796 namespace, args = self._parse_known_args(args, namespace)
1797 if hasattr(namespace, _UNRECOGNIZED_ARGS_ATTR):
1798 args.extend(getattr(namespace, _UNRECOGNIZED_ARGS_ATTR))
/usr/lib/python3.7/argparse.py in parse_known_args(self, arg_strings, namespace)
2029 if required_actions:
2030 self.error(('the following arguments are required: %s') %
-> 2031 ', '.join(required_actions))
2032
2033 # make sure all required groups had one option present
/usr/lib/python3.7/argparse.py in error(self, message)
2515 self.print_usage(_sys.stderr)
2516 args = {'prog': self.prog, 'message': message}
-> 2517 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)
/usr/lib/python3.7/argparse.py in exit(self, status, message)
2502 if message:
2503 self._print_message(message, _sys.stderr)
-> 2504 _sys.exit(status)
2505
2506 def error(self, message):
SystemExit: 2
Hey
I was browsing through this code and wanted to know a few things:
Hi all
After following the install instructions and downloading your pre-trained models, I executed this code in Colab:
from stt import Transcriber
transcriber = Transcriber(pretrain_model = '/content/vietnamese_wav2vec/pretrain.pt', finetune_model = '/content/vietnamese_wav2vec/finetune.pt',
dictionary = '/content/vietnamese_wav2vec/dict.ltr.txt',
lm_type = 'kenlm',
lm_lexicon = '/content/vietnamese_wav2vec/lexicon.txt', lm_model = '/content/vietnamese_wav2vec/lm.bin',
lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['/content/1000.wav'])
print(hypos)
But it gives me this error:
usage: ipykernel_launcher.py [-h] [--no-progress-bar]
[--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT]
[--azureml-logging] [--seed SEED] [--cpu] [--tpu]
[--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16]
[--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging]
[--criterion {cross_entropy,label_smoothed_cross_entropy,composite_loss,sentence_ranking,label_smoothed_cross_entropy_with_alignment,legacy_masked_lm_loss,wav2vec,masked_lm,ctc,sentence_prediction,nat_loss,model,adaptive_loss,vocab_parallel_cross_entropy}]
[--tokenizer {space,moses,nltk}]
[--bpe {bert,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,sentencepiece,bytes,characters}]
[--optimizer {adamax,adadelta,nag,adagrad,adafactor,adam,sgd,composite,lamb}]
[--lr-scheduler {inverse_sqrt,triangular,cosine,fixed,tri_stage,reduce_lr_on_plateau,polynomial_decay,manual,pass_through}]
[--scoring {sacrebleu,bleu,chrf,wer}]
[--task TASK] [--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS]
[--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM]
[--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB]
[--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--path PATH]
[--post-process [POST_PROCESS]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam BEAM]
[--nbest NBEST] [--max-len-a MAX_LEN_A]
[--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
[--match-source-len] [--unnormalized]
[--no-early-stop] [--no-beamable-mm]
[--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--sacrebleu]
[--score-reference] [--prefix-size PREFIX_SIZE]
[--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
[--sampling] [--sampling-topk SAMPLING_TOPK]
[--sampling-topp SAMPLING_TOPP]
[--constraints [{ordered,unordered}]]
[--temperature TEMPERATURE]
[--diverse-beam-groups DIVERSE_BEAM_GROUPS]
[--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
[--diversity-rate DIVERSITY_RATE]
[--print-alignment [{hard,soft}]] [--print-step]
[--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
[--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
[--iter-decode-max-iter ITER_DECODE_MAX_ITER]
[--iter-decode-force-max-iter]
[--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
[--no-seed-provided] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints]
[--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
[--patience PATIENCE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks]
[--kspmodel KSPMODEL] [--wfstlm WFSTLM]
[--rnnt_decoding_type RNNT_DECODING_TYPE]
[--rnnt_len_penalty RNNT_LEN_PENALTY]
[--w2l-decoder W2L_DECODER] [--lexicon LEXICON]
[--unit-lm] [--kenlm-model KENLM_MODEL]
[--beam-threshold BEAM_THRESHOLD]
[--beam-size-token BEAM_SIZE_TOKEN]
[--word-score WORD_SCORE]
[--unk-weight UNK_WEIGHT]
[--sil-weight SIL_WEIGHT]
[--dump-emissions DUMP_EMISSIONS]
[--dump-features DUMP_FEATURES]
[--load-emissions LOAD_EMISSIONS] [-s SRC]
[-t TARGET] [--load-alignments]
[--left-pad-source BOOL] [--left-pad-target BOOL]
[--max-source-positions N]
[--max-target-positions N]
[--upsample-primary UPSAMPLE_PRIMARY]
[--truncate-source] [--num-batch-buckets N]
[--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
[--eval-bleu-detok-args JSON]
[--eval-tokenized-bleu]
[--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
[--eval-bleu-args JSON]
[--eval-bleu-print-samples]
[--force-anneal FORCE_ANNEAL]
[--lr-shrink LR_SHRINK]
[--warmup-updates WARMUP_UPDATES] [--pad PAD]
[--eos EOS] [--unk UNK]
data
ipykernel_launcher.py: error: unrecognized arguments: -f /mnt/disks2/data /mnt/disks2/data
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
and %tb
gives this:
---------------------------------------------------------------------------
SystemExit Traceback (most recent call last)
<ipython-input-46-24d37fe9d36e> in <module>()
4 lm_type = 'kenlm',
5 lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
----> 6 lm_weight = 1.5, word_score = -1, beam_size = 50)
7 hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
8 print(hypos)
4 frames
/usr/lib/python3.7/argparse.py in exit(self, status, message)
2502 if message:
2503 self._print_message(message, _sys.stderr)
-> 2504 _sys.exit(status)
2505
2506 def error(self, message):
SystemExit: 2
Why is this happening?
Could you please help me with that?
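The `-f ...` in the error message looks like the argument a Jupyter/Colab kernel receives on its command line, which any parser that reads sys.argv will then reject. A minimal sketch of one possible workaround, assuming the parsing happens while the Transcriber is constructed: temporarily replace sys.argv. The context manager name `clean_argv` is made up for this example, and the parser below is a stand-in, not fairseq's.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def clean_argv(*args):
    """Temporarily replace sys.argv with [program name, *args]."""
    saved = sys.argv
    sys.argv = [saved[0], *args]
    try:
        yield
    finally:
        sys.argv = saved


def parse_data_dir():
    # Stand-in for a library that parses sys.argv internally.
    parser = argparse.ArgumentParser()
    parser.add_argument("data")
    return parser.parse_args().data


# Simulate the extra arguments a notebook kernel adds.
sys.argv = ["ipykernel_launcher.py", "-f", "/tmp/kernel.json"]
with clean_argv("/path/to/data"):
    print(parse_data_dir())  # prints /path/to/data
```

Whether this alone is enough depends on how stt.py builds its arguments, so treat it as a starting point.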
Hi @mailong25,
Thanks for your repo, it's very nice. I want to pre-train a model (base/large) on my Japanese audio, but every time I try I get the error "dataset is empty". How can I resolve it, and how do I pre-train a model for a language other than English? Thanks for your help!
In the examples folder there is a file called hydra_train_finetune.log. In that file after each epoch or before each epoch I can see that there's a validation step.
[2020-12-19 17:41:16,018][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2020-12-19 17:41:22,153][valid][INFO] - {"epoch": 1, "valid_loss": "729.494", "valid_ntokens": "2521.93", "valid_nsentences": "49.1429", "valid_nll_loss": "14.215", "valid_uer": "123.012", "valid_wer": "100.393", "valid_raw_wer": "100.393", "valid_wps": "6759", "valid_wpb": "2521.9", "valid_bsz": "49.1", "valid_num_updates": "85"}
[2020-12-19 17:41:22,155][fairseq_cli.train][INFO] - begin save checkpoint
[2020-12-19 17:41:22,156][fairseq.trainer][INFO] - Preparing to save checkpoint to checkpoints/checkpoint_best.pt after 85 updates
[2020-12-19 17:41:24,604][fairseq.trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2020-12-19 17:41:25,543][fairseq.checkpoint_utils][INFO] - saved checkpoint checkpoints/checkpoint_best.pt (epoch 1 @ 85 updates, score 100.393) (writing took 3.387641379999991 seconds)
[2020-12-19 17:41:25,543][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2020-12-19 17:41:25,566][train][INFO] - {"epoch": 1, "train_loss": "887.379", "train_ntokens": "49633.1",
What configs should I take into consideration in order to have that validate step run for my dataset?
I'm running the finetune.py script with the config base_100h.yaml.
The Transcriber fails to initialize,
File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 250, in __init__
self.transcribe([sample_audio_path])
File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 297, in transcribe
state['cfg']['model']['w2v_path'] = self.pretrain_model
TypeError: 'Namespace' object does not support item assignment
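The traceback suggests that state['cfg']['model'] is an argparse.Namespace here rather than a dict-like config, which may happen when the checkpoint and the installed fairseq version store their configuration differently. A small reproduction with a possible workaround, assuming only that the object is a Namespace; `set_option` is a hypothetical helper.

```python
from argparse import Namespace


def set_option(cfg, key, value):
    """Set key on cfg whether it is dict-like or an argparse.Namespace."""
    if isinstance(cfg, Namespace):
        setattr(cfg, key, value)   # Namespace only supports attribute access
    else:
        cfg[key] = value           # dicts and dict-like configs
    return cfg


model_cfg = Namespace(w2v_path=None)
try:
    model_cfg["w2v_path"] = "pretrain.pt"
except TypeError as err:
    print(err)  # same kind of error as in the traceback

set_option(model_cfg, "w2v_path", "pretrain.pt")
print(model_cfg.w2v_path)
```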
Cloning into 'wav2letter'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 6587 (delta 34), reused 72 (delta 19), pack-reused 6468
Receiving objects: 100% (6587/6587), 6.13 MiB | 22.26 MiB/s, done.
Resolving deltas: 100% (4207/4207), done.
/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
env: KENLM_ROOT_DIR=/content/self-supervised-speech-recognition/libs/kenlm
Obtaining file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
Installing collected packages: wav2letter
Running setup.py develop for wav2letter
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
/content/self-supervised-speech-recognition/libs
When I run pretrain.py or finetune.py I get the following error.
python pretrain.py --fairseq_path libs/fairseq --audio_path wav --init_model wav2vec_small.pt
fairseq-hydra-train task.data=/home/bcode/speech-recognition/temp distributed_training.distributed_world_size=1 +optimization.update_freq='[64]' checkpoint.finetune_from_model=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=1200000 --config-dir config/pretraining --config-name wav2vec2_base_librispeech
sh: 1: fairseq-hydra-train: not found
python finetune.py --transcript_file transcript.txt --pretrain_model wav2vec_small.pt --dict_file dict.ltr.txt
100%|████████████████████████████████████| 1831/1831 [00:00<00:00, 15427.24it/s]
fairseq-hydra-train task.data=/home/bcode/speech-recognition/manifest distributed_training.distributed_world_size=1 +optimization.update_freq='[24]' model.w2v_path=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=2800000 --config-dir config/finetuning --config-name base_1h
sh: 1: fairseq-hydra-train: not found
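The message `sh: 1: fairseq-hydra-train: not found` means the shell could not find that command on PATH. A quick check, sketched with the standard library only; `require_command` is a hypothetical helper, and the assumption is that the command becomes available once fairseq is installed into the environment that is currently active.

```python
import shutil


def require_command(name: str) -> str:
    """Return the full path of an executable or raise a readable error."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name!r} is not on PATH; is the right environment active?"
        )
    return path


if __name__ == "__main__":
    try:
        print(require_command("fairseq-hydra-train"))
    except FileNotFoundError as err:
        print(err)
```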
When I run the following script after pretraining the XLSR model on my own dataset, I get the error below:
FileNotFoundError: [Errno 2] No such file or directory: '/home/edo/self-supervised-speech-recognition/manifest/dev_other.tsv'
I generated the dictionary file as well as the labeled data according to the tutorial on the page, with the following command:
python3 gen_dict.py --transcript_file /home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --save_dir dictionary
Then I run this command:
python3 finetune.py --transcript_file home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --pretrain_model ../models/07-46-31/checkpoints/checkpoint_best.pt --dict_file dictionary/dict.ltr.txt
and get the error above.
OS: Ubuntu 18.04 LTS
Cuda Version: 11.2
However, when I use wav2vec_small.pt for fine-tuning instead of models/07-46-31/checkpoints/checkpoint_best.pt, it works, and I don't know why.
Am I missing something? I'm pretty sure I've followed the instructions. Or is it a bug in fairseq itself?
Hey
How do you recommend we tune the parameters in transcribe function: lm_weight, word_score, and beam_size? Normally with things like Deepspeech 2, we use its logits to tune this, but how about wav2vec?
Thanks
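One generic way to tune these, not something this repo is known to provide, is a small grid search on a held-out set: decode with each combination and keep the one with the lowest WER. The sketch below assumes a user-supplied `evaluate(lm_weight, word_score, beam_size)` callable that returns the dev-set WER; that callable and the candidate values are hypothetical.

```python
from itertools import product


def grid_search(evaluate, lm_weights, word_scores, beam_sizes):
    """Return (best_score, best_params) over all parameter combinations."""
    best = None
    for lm_weight, word_score, beam_size in product(lm_weights, word_scores, beam_sizes):
        score = evaluate(lm_weight, word_score, beam_size)
        if best is None or score < best[0]:
            best = (score, {"lm_weight": lm_weight,
                            "word_score": word_score,
                            "beam_size": beam_size})
    return best


# Toy objective standing in for "decode the dev set and compute WER".
def fake_evaluate(lm_weight, word_score, beam_size):
    return abs(lm_weight - 1.5) + abs(word_score + 1) + 1.0 / beam_size


print(grid_search(fake_evaluate, [0.5, 1.5, 2.5], [-2, -1, 0], [50, 100]))
```

Because each evaluation means decoding the whole dev set, a coarse grid followed by a finer one around the best point keeps the cost manageable.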
Hello,
My fine-tuning goes all the way up to epoch 49 (and counting) but no checkpoint is saved. Is this normal? It seems very different from the pre-training where it's saving a checkpoint after each epoch.
Many thanks.
Kind Regards,
Hi there
I am following the instructions in 'Install instruction' section but when I run this
!pip install -e .
I get the following error
file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python Installing collected packages: wav2letter Running setup.py develop for wav2letter ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output
Any idea how I can fix this issue?
I also tried removing the build file in the wav2letter repo, but it still does not work.
I have little data and I want to ask whether it is possible to fine-tune your wav2vec pretrained model directly on another language, such as Chinese. Thanks.
Hi Mai Long, I see that with the inference part done this way, the model gets loaded again each time. Have you already optimized the inference part?
Hi @mailong25, I am sorry if this is a basic question. I would like some pointers on how you generated lexicon.txt. I have an ASR dataset (audio + transcriptions): https://github.com/csikasote/bemba-language-corpus. My goal is to develop an ASR system using wav2vec and wav2letter, possibly adapting your code to my dataset. Did you generate lexicon.txt from the transcriptions? I would appreciate your help.
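One common approach, offered as an assumption about the format rather than a description of how the author built his file, is to derive the lexicon from the transcripts: each unique word is mapped to its letters followed by the word-boundary symbol `|`, matching a letter dictionary such as dict.ltr.txt. `build_lexicon` is a hypothetical helper.

```python
def build_lexicon(transcripts):
    """Map every unique word to 'word<TAB>l e t t e r s |'."""
    words = sorted({w for line in transcripts for w in line.split()})
    return [f"{w}\t{' '.join(w)} |" for w in words]


lines = build_lexicon(["hello world", "hello there"])
print("\n".join(lines))
```

Every letter used in the lexicon should also appear in the dictionary file, so it is worth comparing the two before decoding.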
I was wondering why in the finetune.py file you've set update_freq to be 24/NUM_GPU.
cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")
In the wav2vec Readme https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md they say that the base model was trained using 64 V100 GPUs and as I understood if we want to do more training on the base model we should simulate the number of the GPUs they've used.
Note: you can simulate 64 GPUs by using k GPUs and adding command line parameters (before --config-dir) distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 64/k
Have you found that setting update_freq to be 24/NUM_GPU is better for training or is it a bug?
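The arithmetic from the quoted README note can be written out as follows. This only illustrates the formula x = 64/k next to what the expression in finetune.py evaluates to; it does not settle which target the author intended. `update_freq` is a name chosen for this sketch, and the lower bound of 1 is an added guard that the quoted line does not have.

```python
def update_freq(simulated_gpus: int, available_gpus: int) -> int:
    """Gradient-accumulation steps needed to mimic simulated_gpus on available_gpus."""
    if available_gpus < 1:
        raise ValueError("available_gpus must be at least 1")
    # int() truncates like the expression quoted from finetune.py
    return max(1, int(simulated_gpus / available_gpus))


for k in (1, 2, 4, 8):
    # README rule (64/k) next to the value finetune.py computes (24/k)
    print(k, update_freq(64, k), update_freq(24, k))
```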
Hi,
To do lexicon-free decoding I set args.lexicon to False in wsl_decoder.py, but I got an empty string. Please explain how to do lexicon-free decoding. In wsl_decoder.py I saw this line: "lexicon free decoding can only be done with a unit language model". If so, how can I create a unit language model?
Right now I have the problem below with the current lexicon LM model.
For example, lexicon.txt contains two words, tamil and ama. If I send audio of the spoken word tamil to the ASR, it predicts tamah (no LM). After language model processing it always gives ama.
Another example:
Now lexicon.txt contains the words tamil and am. If the ASR predicts tamah, the language model gives am. I just want to solve this problem. I tried many different values of lm_weight and the word group parameter, but to no avail. I know that lm_weight and word group will not have much influence in these scenarios.
Thanks,
Please, anyone, help me out.
@SenriYoshikawa
@mailong25
Hi Mailong,
We are doing a group project with wav2vec on Colab, training the model on the VLSP dataset. Since Colab is limited, we use your pre-trained model and fine-tune it on our dataset. Our dataset is around 60-70 hours and we use the 100h config. The process seems fine at the beginning, but when the train loss reaches 400, the valid loss suddenly increases significantly and so does the WER; only the UER seems to decrease. The train loss also decreases very slowly, about 1 point per epoch. Can you help us with this problem?
Thank you very much.
Hey
I was wondering if you have any suggestions on how to get around argparse for wav2vec? Argparse is great for training, but once you have a set of parameters fixed, argparse must be avoided since it also interferes with other things once the model is part of an application.
I know wav2vec itself uses argparse everywhere, but I was wondering if you might have thought of this problem and a workaround. Thanks.
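One general pattern, independent of this repo: argparse only reads sys.argv when parse_args() is called without arguments, so an application can pass an explicit list, or build a Namespace directly, and leave the process command line alone. The parser and option names below are invented for illustration; whether fairseq's own entry points can be driven this way would need checking.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="decoder")
    parser.add_argument("data")
    parser.add_argument("--beam", type=int, default=50)
    return parser


# Option 1: pass an explicit argument list; sys.argv is never consulted.
args = build_parser().parse_args(["/path/to/data", "--beam", "100"])

# Option 2: skip parsing and construct the namespace from fixed settings.
fixed = argparse.Namespace(data="/path/to/data", beam=100)

print(args == fixed)  # True
```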
What do I need to do for real-time translation with microphone? Is there a sample written in Python? I trained my model. I am getting successful output with wav files.
Hi Mailong,
Thank you for helping us last time. After we removed files shorter than 2 s, we were able to fine-tune the model to WER = 16, but when using STT on Colab we ran into a problem:
ipykernel_launcher.py: error: unrecognized arguments: -f / mnt/disk2/data
Can you help us modify the code to solve this problem? We really need your help.
Hi mailong25. While training the model on the VIVOS set, I noticed that as the model converges, the text it predicts for files I test at random gradually starts to contain strange characters interleaved between the words (I think this may come from predicting noise).
Have you ever run into this? If possible, please suggest a way to fix it. Thank you.
Hi @mailong25, how do I skip the LM? I want to see the direct model prediction after CTC decoding. Also, I'm getting a valid_wer of 100 for every epoch even though the training loss is decreasing. Why is that?
Note: apart from your config, I only added memory_efficient_fp16 = True to reduce GPU memory consumption. Does this affect the WER?
Please help me out.
Thanks
hydra_train.log
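Without a language model, the usual baseline is greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The sketch below works on a plain list of per-frame token ids; the blank id of 0, the toy vocabulary, and the `|` word separator are assumptions for illustration and must match your dictionary.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated ids, remove blanks, and join tokens into text."""
    tokens = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            tokens.append(id_to_token[idx])
        prev = idx
    # '|' marks word boundaries in letter-based dictionaries
    return "".join(tokens).replace("|", " ").strip()


vocab = {1: "h", 2: "i", 3: "|"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 3, 0, 1, 2], vocab))  # hi hi
```

A blank between two identical ids keeps both letters, which is how CTC represents doubled characters.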
Hi @mailong25 @SenriYoshikawa ,
I am able to train the acoustic model and KenLM, but when I try to import stt for transcribing, I get the following error in the self-supervised-speech-recognition directory.
Error
ImportError Traceback (most recent call last)
in ()
1 get_ipython().magic('cd /content/self-supervised-speech-recognition')
----> 2 import stt
/content/self-supervised-speech-recognition/stt.py in ()
7 import numpy as np
8 import torch
----> 9 from fairseq import checkpoint_utils, options, progress_bar, tasks, utils
10 from fairseq.data.data_utils import post_process
11 from fairseq.logging.meters import StopwatchMeter, TimeMeter
ImportError: cannot import name 'checkpoint_utils' from 'fairseq' (unknown location)
I tried importing fairseq from within the directory /content/self-supervised-speech-recognition/libs/fairseq and it works fine. Kindly help me with a workaround for this conflict between the directories. Regards
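`(unknown location)` usually indicates that Python resolved `fairseq` to something other than the installed package, for instance a bare directory named fairseq found on sys.path. That is a guess about this case; the sketch below just reports what a module name resolves to, using only the standard library. `describe_module` is a hypothetical helper.

```python
import importlib.util


def describe_module(name: str) -> str:
    """Report where an importable top-level name would be loaded from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not importable from this environment"
    if spec.origin is None:
        # No origin file: typically a directory without __init__.py
        locations = list(spec.submodule_search_locations or [])
        return f"{name}: namespace package at {locations}"
    return f"{name}: {spec.origin}"


print(describe_module("fairseq"))
```

Running this once from the repo root and once from libs/fairseq shows whether the two locations resolve the name differently.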
Hi Mr @mailong25, I'm wondering how you calculate WER on test audio files. If you do, where can I find it in your code?
Thank you
Hi,
In the install instruction, I saw that this repo uses a specific commit c8a0659be5cdc15caa102d5bbf72b872567c4859 of Fairseq. This commit is 4 months old. I have two questions regarding this -
Thanks in advance.
Hey
Thank you for releasing this wrapper. It will be very helpful if you can add a Licence for your code.
Hi mailong25,
I built the source on Google Colab and everything runs OK, but when I test with data the response returned is ['']. I cannot find the reason: is it the model I trained, or something else? I hope you can help. Great lib!
Hi @mailong25, thanks for your great work, especially combining the language model into this ASR system. I have one question: will adding noise to the audio affect performance? I searched for this kind of augmentation in your GitHub repo and in the fairseq wav2vec repo, but I was unable to find any reference to noise augmentation or any other augmentation. Why is that?
From what I have read, wav2vec, put simply, works like clustering with a contrastive loss. So I feel that even if we add noise to the original audio, it will still cluster the sampled part of the audio with the nearby neighbor embeddings.
Thanks
Hey
Right now, the transcribe function expects audio paths. What if we want to supply a NumPy array of audios preloaded? Can we modify this in stt.py or does fairseq/wav2vec demand audio paths?
Once again, thanks for your quick responses.
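If transcribe() only accepts paths, one workaround that needs no change to stt.py is to write the in-memory samples to a temporary WAV file and pass that path. A minimal sketch, assuming mono float samples in [-1, 1] at 16 kHz; `samples_to_wav` is a hypothetical helper and accepts any sequence of floats, including a 1-D NumPy array.

```python
import array
import os
import tempfile
import wave


def samples_to_wav(samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] to a temporary 16-bit WAV file."""
    # Clip to [-1, 1] and convert to signed 16-bit integers
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return path


path = samples_to_wav([0.0, 0.5, -0.5, 1.0])
# hypos = transcriber.transcribe([path])  # call from this thread, not run here
os.remove(path)
```

The caller is responsible for deleting the temporary file once transcription is done.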
Hi Maillong,
I am at the pretraining stage (using config/pretraining/wav2vec2_base_librispeech.yaml).
The total duration of unlabeled data is ~100 hours, but ~60% of the clips are shorter than 5 s.
The only difference in my environment settings is that I changed distributed_world_size
from 64 to 1 (I only have one GPU). Everything else is the same.
At the beginning of training, there are some warnings about overflow:
...
2021-01-26 03:01:31 | INFO | fairseq.trainer | begin training epoch 1
2021-01-26 03:02:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-01-26 03:03:23 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-01-26 03:04:15 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-01-26 03:10:26 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-01-26 03:56:38 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.
...
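The halving visible in the log (64, 32, 16, ...) is the typical behaviour of dynamic loss scaling in FP16 training: the scale is reduced whenever gradients overflow and raised again after a run of clean steps. A toy sketch of that idea follows; it is not fairseq's implementation, and the class name, the factor of 2, and the window size are assumptions.

```python
class DynamicLossScaler:
    """Toy loss scaler: halve on overflow, double after `window` clean steps."""

    def __init__(self, init_scale=128.0, factor=2.0, window=3, min_scale=1e-4):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, overflow: bool) -> float:
        if overflow:
            # Gradients overflowed: shrink the scale and restart the count
            self.scale = max(self.scale / self.factor, self.min_scale)
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.window:
                self.scale *= self.factor
                self._clean_steps = 0
        return self.scale


scaler = DynamicLossScaler()
print([scaler.update(o) for o in (True, True, False, False, False)])
```

A handful of overflow notes at the start of training is commonly harmless; a scale that keeps shrinking toward its minimum is the case that tends to need attention.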
Two questions:
1. Should I adjust lr because the gradient is out of range in FP16, or ignore this warning?
2. train_ups is really small (= 0.02). Is that normal?
As I understand it, the code from https://github.com/mailong25/wav2letter.git does not yet support running/training on GPU. Long, could you update the code to the current stable wav2letter v0.2 (https://github.com/facebookresearch/wav2letter/tree/v0.2)?
I have built both flashlight and wav2letter v0.2. The wav2vec model loads fine and the temp files are all written correctly, but the program stops at the wav2letter Decoder step (there was a small bug about renaming the variable silweight to silscore, which I have fixed).
Thanks,
Describe your question
seeking guidelines from the community for improving models for production:
different accents, for example, US/UK/AU English
filler words, for example, um; ah
domain words, for example, company names, finance terms
"Please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Can I pretrain the different versions of the wav2vec with the same code?
Hi mailong
I follow your install instruction and now stuck at stage 4 (Make prediction on multiple audios programmatically)
It only shows Segmentation fault (core dumped)
Does that mean my wav2letter installation is broken?
Hi,
I just modified the code to do batch inference. It now runs, but the predicted outputs come back shuffled. I tried turning off all shuffle flags, but the results are still shuffled. Please help me out.
Thanks, @mailong25
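Batch samplers often group utterances by length, so results can come back in a different order even when shuffling is off. Whether that is what happens here is an assumption. If each result carries the index of its input, the original order can be restored as sketched below; `restore_order` is hypothetical.

```python
def restore_order(indexed_results, total):
    """Place (input_index, text) pairs back into input order."""
    ordered = [None] * total
    for index, text in indexed_results:
        ordered[index] = text
    return ordered


# Results as they might arrive from length-sorted batches.
batches = [(2, "third"), (0, "first"), (1, "second")]
print(restore_order(batches, 3))  # ['first', 'second', 'third']
```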
Hello Long,
First of all, thank you for your contribution to the Vietnamese NLP community.
I have one question: when this repo is deployed to small chips such as ARM Cortex, is the inference speed good compared with traditional methods such as MFCC+HMM? I am in the process of choosing a model but do not yet have the means to deploy it and test.
Thank you!