speech to text with self-supervised learning based on wav2vec 2.0 framework

speech-recognition self-supervised-learning wav2vec vietnamese-speech-recognition speech-to-text semi-supervised-learning unsupervised-learning

self-supervised-speech-recognition's Issues

malloc(): mismatching next->prev_size (unsorted)

Hello, thank you for this. Everything works fine for me until the last step, when I try to transcribe a file.
I get the message: malloc(): mismatching next->prev_size (unsorted), aborted.
I also sometimes get a segmentation fault, but without any traceback.
Do you know what the cause of this might be?
Thanks :-)

Dataset

What Vietnamese dataset did you use?

How to calculate WER in testing

Hi @mailong25,
I tested a few of the audio files included in this repo. According to your published results the WER is only about 15%, but in my tests the WER comes out as high as 450, even though the transcription is correct with not a single wrong word. For one file the transcription had only 1 wrong word out of 11, yet the WER came out as 470. Could there be some mistake here?
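
For reference, WER is conventionally the word-level edit distance divided by the number of reference words, so one wrong word out of 11 should come out near 9%. The sketch below is plain Python and independent of this repo's scoring code; the function name `wer` is illustrative only.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(wer("a b c d e f g h i j k", "a b c d e f g h i j x"))
```

A value in the hundreds for a correct transcript would suggest that the reference and hypothesis are not being compared word for word (for example because of different tokenization), though that is only a guess.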

Unable to load multiple model

Hi, I want to run multiple models, but after starting one instance I am unable to start another.
I am getting this error:
File "/media/administrator/hdd/self-supervised-speech-recognition/stt.py", line 187, in optimize_models
model.cuda()
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please help me out.
GPU memory consumption for one instance is 2246 MB / 8 GB.
Thanks,
@mailong25

Validation in finetune.py and Tensorboarding

Hey

Once again, thanks for making a nice wrapper. I understand that your code ultimately calls the wav2vec commands given in the original repository. I have some questions which didn't get resolved there and I would love your suggestions:

Like in pretrain.py, because you call manifest.py, you are able to set some part for validation. However, how do you think one can supply a validation split in finetune.py?
How do I create a nice tensorboard log on the existing hydra logs? Any existing utilities that can help me do it?
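
On the validation-split question, one possible approach is to split the fine-tuning manifest yourself before training. The sketch below assumes the fairseq manifest layout (first line is the audio root directory, each later line is `relative_path<TAB>num_samples`) and a label file with one line per utterance in the same order. The helper name `split_manifest` is hypothetical and not part of this repo.

```python
import random


def split_manifest(tsv_lines, label_lines, valid_fraction=0.05, seed=0):
    """Split a manifest and its line-aligned labels into train and valid parts.

    tsv_lines[0] is assumed to be the audio root directory; every later
    line describes one utterance and pairs with label_lines[i - 1].
    """
    root, entries = tsv_lines[0], tsv_lines[1:]
    if len(entries) != len(label_lines):
        raise ValueError("manifest entries and labels must be line-aligned")
    indices = list(range(len(entries)))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, int(len(indices) * valid_fraction))
    valid_idx = sorted(indices[:n_valid])
    train_idx = sorted(indices[n_valid:])

    def pick(idx):
        # Keep the root line on top so each part is a valid manifest on its own.
        return [root] + [entries[i] for i in idx], [label_lines[i] for i in idx]

    return pick(train_idx), pick(valid_idx)
```

The two parts would then be written out as `<subset>.tsv` plus the matching label file, using whatever subset names the fine-tuning config expects; that naming is an assumption here.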

Thanks for helping out.

ValueError in making prediction

Hi, your work is really awesome. I currently get a ValueError when running prediction with your public pre-trained models.
Here is the issue code:

I tried to debug the error using facebookresearch/fairseq#2106, but the problem still occurred.
Please tell me how to fix it. My OS is Windows 10, by the way. Many thanks.

It says Saved but I checked the folder and it is empty.

Hi, it says Saved, but I checked the folder and it is empty.

[Question] How do I calculate max_tokens max value?

Given that I'm training on 5 GeForce GTX 1080 Ti GPUs with 10.917 GB of memory each, how can I calculate max_tokens so that no out-of-memory error occurs?
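
As far as I understand fairseq's raw-audio pipeline, `max_tokens` counts audio samples per GPU batch, so it can at least be read as seconds of audio. No exact memory formula is given in this thread; the sketch below only does that conversion and lists values to try by halving. It assumes 16 kHz audio, and the starting value of 1,200,000 is the one that appears in the `pretrain.py` command quoted in another issue on this page.

```python
SAMPLE_RATE = 16_000  # assumption: audio is sampled at 16 kHz


def tokens_to_seconds(max_tokens: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Seconds of audio that fit in one per-GPU batch."""
    return max_tokens / sample_rate


def candidate_max_tokens(start: int, floor: int = 100_000):
    """Yield start, start/2, start/4, ... as values to try until OOM errors stop."""
    value = start
    while value >= floor:
        yield value
        value //= 2


for tokens in candidate_max_tokens(1_200_000):
    print(tokens, "samples =", tokens_to_seconds(tokens), "s of audio per GPU batch")
```

Note that `max_tokens` also has to be at least as large as the longest utterance you want to keep, otherwise that utterance cannot fit in any batch.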

How did you create lst file for train and test

Hi Mr @mailong25,
I'm having an issue generating the train and test files. I see in your code that you use two files with the .lst extension. With my own dataset (for example, VLSP2020), I created my own train and test files, and the training process seemed to work fine until this happened:

Can you explain this? Thank you so much. My two files are below.
test.txt
train.txt

STT problem

When creating an instance of the Transcriber class, I get the following error:

usage: data [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[etc]
data
data: error: the following arguments are required: data
An exception has occurred, use %tb to see the full traceback.

%tb output:

SystemExit Traceback (most recent call last)
in ()
----> 1 parser.parse_args()

4 frames
/usr/lib/python3.7/argparse.py in parse_args(self, args, namespace)
1762 # =====================================
1763 def parse_args(self, args=None, namespace=None):
-> 1764 args, argv = self.parse_known_args(args, namespace)
1765 if argv:
1766 msg = _('unrecognized arguments: %s')

/usr/lib/python3.7/argparse.py in parse_known_args(self, args, namespace)
1794 # parse the arguments and exit if there are any errors
1795 try:
-> 1796 namespace, args = self._parse_known_args(args, namespace)
1797 if hasattr(namespace, _UNRECOGNIZED_ARGS_ATTR):
1798 args.extend(getattr(namespace, _UNRECOGNIZED_ARGS_ATTR))

/usr/lib/python3.7/argparse.py in parse_known_args(self, arg_strings, namespace)
2029 if required_actions:
2030 self.error(('the following arguments are required: %s') %
-> 2031 ', '.join(required_actions))
2032
2033 # make sure all required groups had one option present

/usr/lib/python3.7/argparse.py in error(self, message)
2515 self.print_usage(_sys.stderr)
2516 args = {'prog': self.prog, 'message': message}
-> 2517 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)

/usr/lib/python3.7/argparse.py in exit(self, status, message)
2502 if message:
2503 self._print_message(message, _sys.stderr)
-> 2504 _sys.exit(status)
2505
2506 def error(self, message):

SystemExit: 2

Batch size, resuming from checkpoints and other utilities

Hey

I was browsing through this code and wanted to know a few things:

How can I control the batch size? (Large models will require me to train on very small batch sizes due to limited compute)
How can I resume from a checkpoint? Currently there is no utility or function for this in your codebase.

Error in Making prediction

Hi all

After following Install Instruction and downloading your Pre-trained models I executed this code in colab:

from stt import Transcriber
transcriber = Transcriber(pretrain_model = '/content/vietnamese_wav2vec/pretrain.pt', finetune_model = '/content/vietnamese_wav2vec/finetune.pt', 
                          dictionary = '/content/vietnamese_wav2vec/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = '/content/vietnamese_wav2vec/lexicon.txt', lm_model = '/content/vietnamese_wav2vec/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['/content/1000.wav'])
print(hypos)

But it gives me this error:

usage: ipykernel_launcher.py [-h] [--no-progress-bar]
                             [--log-interval LOG_INTERVAL]
                             [--log-format {json,none,simple,tqdm}]
                             [--tensorboard-logdir TENSORBOARD_LOGDIR]
                             [--wandb-project WANDB_PROJECT]
                             [--azureml-logging] [--seed SEED] [--cpu] [--tpu]
                             [--bf16] [--memory-efficient-bf16] [--fp16]
                             [--memory-efficient-fp16]
                             [--fp16-no-flatten-grads]
                             [--fp16-init-scale FP16_INIT_SCALE]
                             [--fp16-scale-window FP16_SCALE_WINDOW]
                             [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                             [--min-loss-scale MIN_LOSS_SCALE]
                             [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                             [--user-dir USER_DIR]
                             [--empty-cache-freq EMPTY_CACHE_FREQ]
                             [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                             [--model-parallel-size MODEL_PARALLEL_SIZE]
                             [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                             [--profile] [--reset-logging]
                             [--criterion {cross_entropy,label_smoothed_cross_entropy,composite_loss,sentence_ranking,label_smoothed_cross_entropy_with_alignment,legacy_masked_lm_loss,wav2vec,masked_lm,ctc,sentence_prediction,nat_loss,model,adaptive_loss,vocab_parallel_cross_entropy}]
                             [--tokenizer {space,moses,nltk}]
                             [--bpe {bert,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,sentencepiece,bytes,characters}]
                             [--optimizer {adamax,adadelta,nag,adagrad,adafactor,adam,sgd,composite,lamb}]
                             [--lr-scheduler {inverse_sqrt,triangular,cosine,fixed,tri_stage,reduce_lr_on_plateau,polynomial_decay,manual,pass_through}]
                             [--scoring {sacrebleu,bleu,chrf,wer}]
                             [--task TASK] [--num-workers NUM_WORKERS]
                             [--skip-invalid-size-inputs-valid-test]
                             [--max-tokens MAX_TOKENS]
                             [--batch-size BATCH_SIZE]
                             [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                             [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                             [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                             [--data-buffer-size DATA_BUFFER_SIZE]
                             [--train-subset TRAIN_SUBSET]
                             [--valid-subset VALID_SUBSET]
                             [--validate-interval VALIDATE_INTERVAL]
                             [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                             [--validate-after-updates VALIDATE_AFTER_UPDATES]
                             [--fixed-validation-seed FIXED_VALIDATION_SEED]
                             [--disable-validation]
                             [--max-tokens-valid MAX_TOKENS_VALID]
                             [--batch-size-valid BATCH_SIZE_VALID]
                             [--curriculum CURRICULUM]
                             [--gen-subset GEN_SUBSET]
                             [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                             [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                             [--distributed-rank DISTRIBUTED_RANK]
                             [--distributed-backend DISTRIBUTED_BACKEND]
                             [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                             [--distributed-port DISTRIBUTED_PORT]
                             [--device-id DEVICE_ID] [--distributed-no-spawn]
                             [--ddp-backend {c10d,no_c10d}]
                             [--bucket-cap-mb BUCKET_CAP_MB]
                             [--fix-batches-to-gpus]
                             [--find-unused-parameters] [--fast-stat-sync]
                             [--broadcast-buffers]
                             [--distributed-wrapper {DDP,SlowMo}]
                             [--slowmo-momentum SLOWMO_MOMENTUM]
                             [--slowmo-algorithm SLOWMO_ALGORITHM]
                             [--localsgd-frequency LOCALSGD_FREQUENCY]
                             [--nprocs-per-node NPROCS_PER_NODE]
                             [--pipeline-model-parallel]
                             [--pipeline-balance PIPELINE_BALANCE]
                             [--pipeline-devices PIPELINE_DEVICES]
                             [--pipeline-chunks PIPELINE_CHUNKS]
                             [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                             [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                             [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                             [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                             [--pipeline-checkpoint {always,never,except_last}]
                             [--zero-sharding {none,os}] [--path PATH]
                             [--post-process [POST_PROCESS]] [--quiet]
                             [--model-overrides MODEL_OVERRIDES]
                             [--results-path RESULTS_PATH] [--beam BEAM]
                             [--nbest NBEST] [--max-len-a MAX_LEN_A]
                             [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                             [--match-source-len] [--unnormalized]
                             [--no-early-stop] [--no-beamable-mm]
                             [--lenpen LENPEN] [--unkpen UNKPEN]
                             [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                             [--score-reference] [--prefix-size PREFIX_SIZE]
                             [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                             [--sampling] [--sampling-topk SAMPLING_TOPK]
                             [--sampling-topp SAMPLING_TOPP]
                             [--constraints [{ordered,unordered}]]
                             [--temperature TEMPERATURE]
                             [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                             [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                             [--diversity-rate DIVERSITY_RATE]
                             [--print-alignment [{hard,soft}]] [--print-step]
                             [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                             [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                             [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                             [--iter-decode-force-max-iter]
                             [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                             [--iter-decode-with-external-reranker]
                             [--retain-iter-history] [--retain-dropout]
                             [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                             [--decoding-format {unigram,ensemble,vote,dp,bs}]
                             [--no-seed-provided] [--save-dir SAVE_DIR]
                             [--restore-file RESTORE_FILE]
                             [--finetune-from-model FINETUNE_FROM_MODEL]
                             [--reset-dataloader] [--reset-lr-scheduler]
                             [--reset-meters] [--reset-optimizer]
                             [--optimizer-overrides OPTIMIZER_OVERRIDES]
                             [--save-interval SAVE_INTERVAL]
                             [--save-interval-updates SAVE_INTERVAL_UPDATES]
                             [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                             [--keep-last-epochs KEEP_LAST_EPOCHS]
                             [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                             [--no-save] [--no-epoch-checkpoints]
                             [--no-last-checkpoints]
                             [--no-save-optimizer-state]
                             [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                             [--maximize-best-checkpoint-metric]
                             [--patience PATIENCE]
                             [--checkpoint-suffix CHECKPOINT_SUFFIX]
                             [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                             [--load-checkpoint-on-all-dp-ranks]
                             [--kspmodel KSPMODEL] [--wfstlm WFSTLM]
                             [--rnnt_decoding_type RNNT_DECODING_TYPE]
                             [--rnnt_len_penalty RNNT_LEN_PENALTY]
                             [--w2l-decoder W2L_DECODER] [--lexicon LEXICON]
                             [--unit-lm] [--kenlm-model KENLM_MODEL]
                             [--beam-threshold BEAM_THRESHOLD]
                             [--beam-size-token BEAM_SIZE_TOKEN]
                             [--word-score WORD_SCORE]
                             [--unk-weight UNK_WEIGHT]
                             [--sil-weight SIL_WEIGHT]
                             [--dump-emissions DUMP_EMISSIONS]
                             [--dump-features DUMP_FEATURES]
                             [--load-emissions LOAD_EMISSIONS] [-s SRC]
                             [-t TARGET] [--load-alignments]
                             [--left-pad-source BOOL] [--left-pad-target BOOL]
                             [--max-source-positions N]
                             [--max-target-positions N]
                             [--upsample-primary UPSAMPLE_PRIMARY]
                             [--truncate-source] [--num-batch-buckets N]
                             [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
                             [--eval-bleu-detok-args JSON]
                             [--eval-tokenized-bleu]
                             [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
                             [--eval-bleu-args JSON]
                             [--eval-bleu-print-samples]
                             [--force-anneal FORCE_ANNEAL]
                             [--lr-shrink LR_SHRINK]
                             [--warmup-updates WARMUP_UPDATES] [--pad PAD]
                             [--eos EOS] [--unk UNK]
                             data
ipykernel_launcher.py: error: unrecognized arguments: -f /mnt/disks2/data /mnt/disks2/data

An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

and %tb gives this:


---------------------------------------------------------------------------

SystemExit                                Traceback (most recent call last)

<ipython-input-46-24d37fe9d36e> in <module>()
      4                           lm_type = 'kenlm',
      5                           lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
----> 6                           lm_weight = 1.5, word_score = -1, beam_size = 50)
      7 hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
      8 print(hypos)

4 frames

/usr/lib/python3.7/argparse.py in exit(self, status, message)
   2502         if message:
   2503             self._print_message(message, _sys.stderr)
-> 2504         _sys.exit(status)
   2505 
   2506     def error(self, message):

SystemExit: 2

Why is this happening?
Could you please help me with that?
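
A common cause of `unrecognized arguments` errors inside notebooks is that a parser calls `parse_args()` with no arguments and therefore reads `sys.argv`, which in a notebook holds the kernel's own arguments rather than yours. Whether that is exactly what happens inside `stt.py` is an assumption. The sketch below reproduces the situation with a stand-in `argparse` parser (not the real fairseq one) and shows that passing an explicit list avoids it.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser with one required positional, like `data` in the usage text.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("data")
    return parser


def parse_explicit(argv_tail):
    """Parse an explicit argument list instead of whatever is in sys.argv."""
    return build_parser().parse_args(list(argv_tail))


# Simulate a notebook kernel: sys.argv carries arguments we did not choose.
saved_argv = sys.argv
sys.argv = ["ipykernel_launcher.py", "-f", "/tmp/kernel.json"]
try:
    args = parse_explicit(["/path/to/data"])
finally:
    sys.argv = saved_argv  # restore so other code is unaffected

print(args.data)
```

If the parser call cannot be changed, temporarily setting `sys.argv = [sys.argv[0]]` before creating the `Transcriber` is another workaround people use in notebooks; it is untested against this repo.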

Pretrain with Japanese audio?

Hi @mailong25,
Thanks for your repo, it's very nice. I want to pre-train a model (base-large) on my own Japanese audio, but every time I try I get the error "dataset is empty". How can I resolve it, and how can I pre-train a model on a language other than English? Thanks for your help!

validate step in example log file but not when i'm running it

In the examples folder there is a file called hydra_train_finetune.log. In that file after each epoch or before each epoch I can see that there's a validation step.

[2020-12-19 17:41:16,018][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2020-12-19 17:41:22,153][valid][INFO] - {"epoch": 1, "valid_loss": "729.494", "valid_ntokens": "2521.93", "valid_nsentences": "49.1429", "valid_nll_loss": "14.215", "valid_uer": "123.012", "valid_wer": "100.393", "valid_raw_wer": "100.393", "valid_wps": "6759", "valid_wpb": "2521.9", "valid_bsz": "49.1", "valid_num_updates": "85"}
[2020-12-19 17:41:22,155][fairseq_cli.train][INFO] - begin save checkpoint
[2020-12-19 17:41:22,156][fairseq.trainer][INFO] - Preparing to save checkpoint to checkpoints/checkpoint_best.pt after 85 updates
[2020-12-19 17:41:24,604][fairseq.trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2020-12-19 17:41:25,543][fairseq.checkpoint_utils][INFO] - saved checkpoint checkpoints/checkpoint_best.pt (epoch 1 @ 85 updates, score 100.393) (writing took 3.387641379999991 seconds)
[2020-12-19 17:41:25,543][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2020-12-19 17:41:25,566][train][INFO] - {"epoch": 1, "train_loss": "887.379", "train_ntokens": "49633.1",

What configs should I take into consideration in order to have that validate step run for my dataset?
I'm running the finetune.py script with the config base_100h.yaml.

Error in using STT

The Transcriber fails to initialize,

File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 250, in __init__
    self.transcribe([sample_audio_path])
  File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 297, in transcribe
    state['cfg']['model']['w2v_path'] = self.pretrain_model
TypeError: 'Namespace' object does not support item assignment
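
The message suggests that the object returned by `state['cfg']['model']` is an `argparse.Namespace`, which accepts attribute assignment but not `[...]` assignment. Which form a checkpoint holds may depend on the fairseq version that wrote it; that is an assumption. A defensive sketch with two hypothetical helpers, `set_option` and `get_option`, that work for both dicts and namespaces:

```python
from argparse import Namespace


def set_option(container, key, value):
    """Set `key` on a Namespace (attribute) or on a dict-like object (item)."""
    if isinstance(container, Namespace):
        setattr(container, key, value)
    else:
        container[key] = value
    return container


def get_option(container, key):
    """Counterpart of set_option for reading a value back."""
    if isinstance(container, Namespace):
        return getattr(container, key)
    return container[key]


# Both container styles end up with the same value.
as_namespace = set_option(Namespace(w2v_path=None), "w2v_path", "/path/to/pretrain.pt")
as_dict = set_option({"w2v_path": None}, "w2v_path", "/path/to/pretrain.pt")
print(get_option(as_namespace, "w2v_path"), get_option(as_dict, "w2v_path"))
```

In `stt.py` the failing line would then become something like `set_option(state['cfg']['model'], 'w2v_path', self.pretrain_model)`, which is a suggestion rather than the author's fix.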

I am getting an error when installing to Colab

Cloning into 'wav2letter'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 6587 (delta 34), reused 72 (delta 19), pack-reused 6468
Receiving objects: 100% (6587/6587), 6.13 MiB | 22.26 MiB/s, done.
Resolving deltas: 100% (4207/4207), done.
/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
env: KENLM_ROOT_DIR=/content/self-supervised-speech-recognition/libs/kenlm
Obtaining file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
Installing collected packages: wav2letter
Running setup.py develop for wav2letter
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
/content/self-supervised-speech-recognition/libs

Stuck on decoding. Running using Wav2Letter Docker Container

Hi, I tried both CPU and GPU versions but it fails. Any idea what could be happening?

Best wishes :)

Can not load checkpoints in the finetuning step

When I try steps 2.3 (Run Fine-tuning on the pretrained model) and input the proper pretrained model from step 1.3, I got the message below. Is it correct?

sh: 1: fairseq-hydra-train: not found

When I run pretrain.py or finetune.py I get the following error.

python pretrain.py --fairseq_path libs/fairseq --audio_path wav --init_model wav2vec_small.pt
fairseq-hydra-train task.data=/home/bcode/speech-recognition/temp distributed_training.distributed_world_size=1 +optimization.update_freq='[64]' checkpoint.finetune_from_model=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=1200000 --config-dir config/pretraining --config-name wav2vec2_base_librispeech
sh: 1: fairseq-hydra-train: not found

python finetune.py --transcript_file transcript.txt --pretrain_model wav2vec_small.pt --dict_file dict.ltr.txt

100%|████████████████████████████████████| 1831/1831 [00:00<00:00, 15427.24it/s]
fairseq-hydra-train task.data=/home/bcode/speech-recognition/manifest distributed_training.distributed_world_size=1 +optimization.update_freq='[24]' model.w2v_path=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=2800000 --config-dir config/finetuning --config-name base_1h
sh: 1: fairseq-hydra-train: not found
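
`sh: 1: fairseq-hydra-train: not found` means the shell could not find that executable on `PATH`. That usually points to fairseq not being installed in the environment that runs `pretrain.py` and `finetune.py`, although this has not been confirmed for this setup. A quick check from the same Python interpreter:

```python
import shutil


def find_executable(name: str):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_executable("fairseq-hydra-train")
if path is None:
    print("fairseq-hydra-train is not on PATH; is fairseq installed in this environment?")
else:
    print("found:", path)
```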

Issue while fine-tuning the wav2vec model

When I run the following script after pretraining the XLSR model on my own dataset, I get the error below:

FileNotFoundError: [Errno 2] No such file or directory: '/home/edo/self-supervised-speech-recognition/manifest/dev_other.tsv'

I have generated the dictionary file as well as labeled data according to the tutorial in the page with the following command,
python3 gen_dict.py --transcript_file /home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --save_dir dictionary

However, when I run this command:
python3 finetune.py --transcript_file home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --pretrain_model ../models/07-46-31/checkpoints/checkpoint_best.pt --dict_file dictionary/dict.ltr.txt
I get the error above.

Environment

OS: Ubuntu 18.04 LTS
Cuda Version: 11.2

Additional Context

However, when I changed the model used for fine-tuning to wav2vec_small.pt instead of models/07-46-31/checkpoints/checkpoint_best.pt, it worked, and I don't know why.

Am I missing something? I'm pretty sure I followed the instructions. Or is it a bug in fairseq itself?

Tuning lm_weight, word_score and beam_size

Hey

How do you recommend we tune the parameters of the transcribe function: lm_weight, word_score, and beam_size? Normally, with things like Deepspeech 2, we use its logits to tune this, but how about wav2vec?
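
One generic option, not specific to this repo, is a plain grid search over the three parameters on a held-out set. In the sketch below, `evaluate` is a hypothetical callable that you would write yourself: it decodes the held-out audio with the given settings and returns the WER.

```python
import itertools


def grid_search(evaluate, lm_weights, word_scores, beam_sizes):
    """Return (best_wer, best_params) over all parameter combinations.

    `evaluate` is a hypothetical callable taking lm_weight, word_score and
    beam_size as keyword arguments and returning the WER on held-out data.
    """
    best = None
    for lm_weight, word_score, beam_size in itertools.product(
            lm_weights, word_scores, beam_sizes):
        score = evaluate(lm_weight=lm_weight, word_score=word_score,
                         beam_size=beam_size)
        # Strict comparison keeps the first combination on ties.
        if best is None or score < best[0]:
            best = (score, {"lm_weight": lm_weight,
                            "word_score": word_score,
                            "beam_size": beam_size})
    return best
```

Since a larger beam mainly trades speed for accuracy, it may be cheaper to fix `beam_size` first and search only `lm_weight` and `word_score`.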

Thanks

Fine-tuning checkpoints

Hello,

My fine-tuning goes all the way up to epoch 49 (and counting) but no checkpoint is saved. Is this normal? It seems very different from the pre-training where it's saving a checkpoint after each epoch.

Many thanks.

Kind Regards,

Wav2letter installation error

Hi there
I am following the instructions in 'Install instruction' section but when I run this

!pip install -e .

I get the following error

file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
Installing collected packages: wav2letter
Running setup.py develop for wav2letter
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output

Any idea how I can fix this issue?
I also tried removing the build directory in the wav2letter repo, but it still does not work.

can finetune on other language?

I have little data, and I want to ask whether it is possible to fine-tune your pretrained wav2vec model directly on another language, such as Chinese. Thanks.

Decoder using build/Decode will reload model?

Hi Mai Long, it looks like the inference step, as currently implemented, reloads the model each time. Have you optimized the inference part yet?

How to create the lexicon.txt

Hi @mailong25, I am sorry if this is a basic question. I would like some pointers on how you generated lexicon.txt. I have an ASR dataset (audio + transcriptions): https://github.com/csikasote/bemba-language-corpus. My goal is to develop an ASR system using wav2vec and wav2letter, possibly adapting your code to my dataset. Did you generate lexicon.txt from the transcriptions? I would appreciate your help.
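
In the fairseq wav2vec examples, a letter-based lexicon maps each word to its space-separated characters followed by the word-boundary token `|`. Assuming the same convention applies to this repo (the author has not confirmed it here), the lexicon could be generated from the transcriptions roughly as follows; `build_lexicon` is an illustrative name.

```python
def build_lexicon(transcripts):
    """Return sorted lexicon lines of the form 'word w o r d |'."""
    words = set()
    for line in transcripts:
        words.update(line.split())
    return [f"{word} {' '.join(word)} |" for word in sorted(words)]


lines = build_lexicon(["xin chao", "chao ban"])
print("\n".join(lines))
```

The characters used in the spellings should match the entries of the letter dictionary (`dict.ltr.txt`), so any text normalization applied to the transcripts should be applied before building the lexicon as well.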

'Wav2VecCtc' object has no attribute 'remove_pretraining_modules'

Hello, I've done all the steps. However I am getting the following error. What should I do?

finetune.py optimization.update_freq

I was wondering why in the finetune.py file you've set update_freq to be 24/NUM_GPU.

    cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")

In the wav2vec Readme https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md they say that the base model was trained using 64 V100 GPUs and as I understood if we want to do more training on the base model we should simulate the number of the GPUs they've used.

Note: you can simulate 64 GPUs by using k GPUs and adding command line parameters (before --config-dir) distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 64/k

Have you found that setting update_freq to be 24/NUM_GPU is better for training or is it a bug?
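
The arithmetic behind the README note is simply the simulated GPU count divided by the number of GPUs you have. As far as I recall, the fine-tuning section of the same README gives the analogous note for 24 GPUs, which would explain the 24 in `finetune.py`, but treat that as unverified. A small sketch with a hypothetical helper:

```python
def update_freq(simulated_gpus: int, available_gpus: int) -> int:
    """Gradient-accumulation steps so that k GPUs imitate `simulated_gpus` GPUs."""
    if available_gpus < 1:
        raise ValueError("need at least one GPU")
    # Integer division mirrors int(24 / NUM_GPU) in finetune.py; never go below 1.
    return max(1, simulated_gpus // available_gpus)


for k in (1, 2, 4, 8):
    print(f"k={k}: x=64/k -> {update_freq(64, k)}, x=24/k -> {update_freq(24, k)}")
```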

how to do lexicon free decoding

Hi,
To do lexicon-free decoding, I set args.lexicon to False in wsl_decoder.py, but I got an empty string. Please explain how to do lexicon-free decoding. In wsl_decoder.py I have seen this line: "lexicon free decoding can only be done with a unit language model". If so, how can I create a unit language model?

Right now I have the following problem with the current lexicon LM model.
For example, lexicon.txt contains two words, tamil and ama. If I send audio of the spoken word tamil to the ASR, it predicts tamah (without LM). After language model processing, it always gives ama.
Another example:
Now lexicon.txt contains the words tamil and am. If the ASR predicts tamah, the language model gives am. I just want to solve this problem. I tried many different values of lm_weight and the wordgroup parameter, but to no avail. I know that lm_weight and word group will not have much influence in these scenarios.
Thanks,
Please, can anyone help me out?
@SenriYoshikawa
@mailong25

Finetuning a Vietnamese model

Hi Mailong,
We are doing a group project with wav2vec and use Colab to train the model on the VLSP dataset. Since Colab is limited, we use your pre-trained model and fine-tune it on our dataset. Our dataset is around 60-70 hours and we use the 100h config. The process seemed fine at the beginning, but when the train loss reached 400, the valid loss suddenly increased significantly, as did the WER; only the UER seems to decrease. The train loss also decreases very slowly, about 1 point per epoch. Can you help us with this problem?

Thank you very much.

Hard coding parameters and avoiding argparse

Hey

I was wondering if you have any suggestions on how to get around argparse for wav2vec? Argparse is great for training, but once you have a set of parameters fixed, argparse must be avoided since it also interferes with other things once the model is part of an application.

I know wav2vec itself uses argparse everywhere, but I was wondering if you might have thought of this problem and a workaround. Thanks.
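
One possible way around this, sketched with a stand-in parser rather than the real fairseq one, is to keep the parser but feed it a list built from hard-coded values, so `sys.argv` is never consulted. The helper name `args_from_dict` and the option names below are illustrative, and the helper assumes flags are spelled with dashes.

```python
import argparse


def args_from_dict(parser, positional=(), **options):
    """Turn hard-coded options into an argv-style list and parse that list,
    so that sys.argv is never read."""
    argv = list(positional)
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")  # assumes dash-separated flag names
        if value is True:
            argv.append(flag)  # store_true style switch
        elif value is not False and value is not None:
            argv.extend([flag, str(value)])
    return parser.parse_args(argv)


# Stand-in parser; the real one has many more options.
parser = argparse.ArgumentParser()
parser.add_argument("data")
parser.add_argument("--beam", type=int, default=5)
parser.add_argument("--lm-weight", type=float, default=0.0)
parser.add_argument("--cpu", action="store_true")

args = args_from_dict(parser, ["/path/to/data"], beam=50, lm_weight=1.5, cpu=True)
print(args)
```

The resulting `Namespace` can be created once at application start-up and passed around, which keeps the application's own command-line handling separate from the model's.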

What do I need to do for real-time translation with microphone?

What do I need to do for real-time translation with microphone? Is there a sample written in Python? I trained my model. I am getting successful output with wav files.

Using STT

Hi Mailong,
Thank you for helping us last time. After we removed files shorter than 2 s, we were able to fine-tune the model to a WER of 16, but when we use STT on Colab, we run into a problem:
ipykernel_launcher.py: error: unrecognized arguments: -f / mnt/disk2/data
Can you help us modify the code to solve this problem? We really need your help.

Strange characters prediction

Hi mailong25. While training the model on the VIVOS dataset, I noticed that as the model converges, the text it predicts for the files I test at random gradually starts to contain strange characters interleaved between the words (I think the model may be predicting noise).

Have you ever run into this? If so, could you suggest a way to fix it? Thank you.

How to skip the lm model, i want to see the directly model prediction after ctc decoding.

Hi @mailong25, how can I skip the LM? I want to see the direct model prediction after CTC decoding. Also, I'm getting a valid_wer of 100 for every epoch, although the training loss is decreasing. Why is that?
Note: apart from your config, I only added memory_efficient_fp16 = True to reduce GPU memory consumption. Does this affect the WER?
Please help me out.
Thanks
hydra_train.log
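
For reference, "direct prediction after CTC decoding" usually means greedy decoding: take the most likely label per frame, collapse consecutive repeats, and drop blanks. The sketch below is plain Python and does not use this repo's decoder. The blank index, the `|` word separator, and the toy vocabulary are all assumptions made for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_symbol, blank_id=0, word_sep="|"):
    """Collapse repeated frame labels, drop blanks, and join into text."""
    symbols = []
    previous = None
    for idx in frame_ids:
        # Emit a symbol only when the label changes and is not the blank.
        if idx != previous and idx != blank_id:
            symbols.append(id_to_symbol[idx])
        previous = idx
    return "".join(symbols).replace(word_sep, " ").strip()


# Toy vocabulary (hypothetical): 0 is the CTC blank, 3 is the word separator.
vocab = {0: "<blank>", 1: "h", 2: "i", 3: "|"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 3, 3, 1, 0, 1], vocab))
```

Here `frame_ids` would be the per-frame argmax over the acoustic model's output; comparing this string with the LM-decoded one shows how much the language model changes the result.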

Import stt is not working!

Hi @mailong25 @SenriYoshikawa ,

I am able to train the acoustic model and KenLM, but when I try to import stt for transcribing, I get the following error in the self-supervised-speech-recognition directory.

Error

ImportError Traceback (most recent call last)
in ()
1 get_ipython().magic('cd /content/self-supervised-speech-recognition')
----> 2 import stt

/content/self-supervised-speech-recognition/stt.py in ()
7 import numpy as np
8 import torch
----> 9 from fairseq import checkpoint_utils, options, progress_bar, tasks, utils
10 from fairseq.data.data_utils import post_process
11 from fairseq.logging.meters import StopwatchMeter, TimeMeter

ImportError: cannot import name 'checkpoint_utils' from 'fairseq' (unknown location)

I tried importing fairseq from its own directory, i.e. /content/self-supervised-speech-recognition/libs/fairseq, and it works fine. Kindly help me with a workaround for this directory conflict. Regards

WER on testing phase

Hi Mr. @mailong25, I'm wondering how you calculate WER on test audio files. If you do, where can I find it in your code?
Thank you
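
For readers with the same question, WER is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation code used in this repo; the function name `wer` is an illustrative choice.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


print(wer("a b c d", "a x c d"))
```

The result is a fraction, so multiply by 100 for a percentage. Values far above 100 for a nearly correct transcript usually point to a mismatch in how reference and hypothesis are tokenized, for example comparing characters against words.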

[Question] Can we use updated version of Fairseq or train existing version with TPU?

Hi,
In the install instructions, I saw that this repo uses a specific commit c8a0659be5cdc15caa102d5bbf72b872567c4859 of Fairseq. This commit is 4 months old. I have two questions regarding this:

Can I use later commits of Fairseq with this repo?
Can I use TPU for training?

Thanks in advance.

Licensing Information

Hey

Thank you for releasing this wrapper. It would be very helpful if you could add a license for your code.

Response empty value?

Hi mailong25,
I built the source on Google Colab and everything runs OK, but when I test with my data the response is ['']. I can't find the reason: is it because of the model I trained, or something else? I hope you can help. Great library!

Is it OK to apply SpecAugment and add noise to the audio before training?

Hi @mailong25, thanks for your great work, especially combining the language model in this ASR system. I have one doubt: will adding noise to the audio affect performance? I searched for this kind of augmentation in your GitHub repo and in the fairseq wav2vec repo, but I was unable to find any reference to noise augmentation or other augmentation. Why is that?

From what I have read, wav2vec works, put simply, like clustering with a contrastive loss. So I feel that even if we add noise to the original audio, it will still cluster the sampled part of the audio to the nearest neighbor embedding.

Thanks

Loading Audio array directly into transcribe, instead of supplying Audio Path

Hey

Right now, the transcribe function expects audio paths. What if we want to supply a preloaded NumPy array of audio instead? Can we modify this in stt.py, or does fairseq/wav2vec require audio paths?

Once again, thanks for your quick responses.
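
Until the function accepts arrays, one possible workaround is to write the samples to a temporary WAV file and pass that path to the existing path-based call. The sketch below uses only the standard library and assumes mono float samples in [-1, 1] at 16 kHz (the rate wav2vec 2.0 models are normally trained on). The helper name `samples_to_wav` is hypothetical and not part of stt.py.

```python
import array
import os
import tempfile
import wave


def samples_to_wav(samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] to a temporary 16-bit PCM WAV file."""
    # Clip each sample to [-1, 1] and convert it to a signed 16-bit integer.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return path


wav_path = samples_to_wav([0.0, 0.5, -0.5, 1.5])
print(wav_path)
```

Any iterable of numbers works as input, including a 1-D NumPy array. The caller is responsible for deleting the temporary file after transcription.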

Some overflow in pretrain model

Hi Mailong,
I am at the pretraining stage (using config/pretraining/wav2vec2_base_librispeech.yaml).
The total duration of unlabeled data is ~100 hours, but ~60% of the files are shorter than 5 s.
The only difference in my setup is that I changed distributed_world_size from 64 to 1 (I only have one GPU). Everything else is the same.
At the beginning of training, there are some warnings about overflow:
...
2021-01-26 03:01:31 | INFO | fairseq.trainer | begin training epoch 1
2021-01-26 03:02:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-01-26 03:03:23 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-01-26 03:04:15 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-01-26 03:10:26 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-01-26 03:56:38 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.
...

Two questions:

Does that mean I should tune the lr because the gradient is out of range in FP16, or can I ignore this warning?
Our train_ups is really small (0.02). Is that normal?
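
As background for the first question, the log above is consistent with dynamic loss scaling in mixed-precision training: each detected overflow skips the update and halves the scale, and the scale grows again after a run of clean steps. The class below is a generic sketch of that idea, not fairseq's actual implementation; the initial scale of 128 and the growth interval are assumptions chosen for illustration.

```python
class LossScaler:
    """Toy dynamic loss scaler: shrink on overflow, grow after clean steps."""

    def __init__(self, init_scale=128.0, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, overflow):
        if overflow:
            # Gradients overflowed in FP16: shrink the scale and restart the count.
            self.scale /= self.factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            # After enough clean steps, try a larger scale again.
            if self.good_steps % self.growth_interval == 0:
                self.scale *= self.factor
        return self.scale


scaler = LossScaler()
print([scaler.update(True) for _ in range(5)])
```

Under this scheme a handful of overflow messages early in training is expected while the scale settles. A scale that keeps shrinking throughout training is the case that would point to a real numerical problem.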

Update to wav2letter v0.2?

As far as I understand, the code at https://github.com/mailong25/wav2letter.git does not yet support running or training on GPU. Long, could you update the code to the current stable wav2letter v0.2 at https://github.com/facebookresearch/wav2letter/tree/v0.2?

I have currently built both flashlight and wav2letter v0.2. The wav2vec model loads fine and the temp files are written correctly, but the program stops at the wav2letter Decoder step (there was a small bug about renaming the variable silweight to silscore, which I have already fixed).

Thanks,

[Question] How to improve ASR for production usage

Describe your question
Seeking guidelines from the community on improving models for production:

different accents, for example, US/UK/AU English
filler words, for example, um; ah
domain words, for example, company names, finance terms

Pretraining larger models?

"Please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can I pretrain the different versions of the wav2vec with the same code?

Segmentation fault in inference

Hi mailong,
I followed your install instructions and am now stuck at stage 4 (Make prediction on multiple audios programmatically).
It only shows Segmentation fault (core dumped).
Does that mean my wav2letter installation is broken?

How to disable shuffle in wav2vec during inference

Hi,
I modified the code to support batch inference. It now runs batch inference, but the predicted outputs come back in shuffled order. I tried turning off all shuffle flags, but the results are still shuffled. Please help me out.

Thanks, @mailong25
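
The reordering may not come from a shuffle flag at all: batching code often groups inputs by length, so outputs come back in batch order rather than input order. Assuming each output can be paired with the index of its input (the `sample_ids` list below is a hypothetical stand-in for whatever id the data loader provides), the original order can be restored afterward:

```python
def restore_order(sample_ids, predictions):
    """Return predictions reordered so position i holds the output for input i."""
    ordered = [None] * len(predictions)
    for sid, pred in zip(sample_ids, predictions):
        ordered[sid] = pred
    return ordered


# Outputs arrive in batch order; ids say which input each one belongs to.
print(restore_order([2, 0, 1], ["third", "first", "second"]))
```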

Model Performance

Hello Long,
First, thank you for your contribution to the Vietnamese NLP community.
I have a question: when this repo is deployed on small chips such as ARM Cortex, is the inference speed good compared with traditional methods such as MFCC+HMM? I am in the process of choosing a model but have not yet had the chance to deploy it for testing.
Thank you!
