self-supervised-speech-recognition's Introduction

Self-supervised speech recognition with limited amount of labeled data

Note: I no longer maintain this repo. If you encounter any problems, please look for similar reported issues in the fairseq repo.

This is a wrapper around the wav2vec 2.0 framework that aims to build accurate speech recognition models with a small amount of transcribed data (e.g. 1 hour).

Transfer learning is still the main technique:

  • Transfer from self-supervised models (pretrain on unlabeled data)
  • Transfer from multilingual models (pretrain on multilingual data)

Required resources

1. Labeled data, i.e. pairs of (audio, transcript)

The more you have, the better the model will be. Prepare at least 1 hour if you have a large amount of unlabeled data; otherwise, at least 50 hours is recommended.

2. Text data for building language models.

This should include both well-written text and conversational text, which can easily be collected from news and forum websites. At least 1 GB of text is recommended.

3. Unlabeled data (audio without transcriptions) in your own language.

This is optional but highly valuable. A good amount of unlabeled audio (e.g. 500 hours) will significantly reduce the amount of labeled data needed and also boost model performance. YouTube and podcasts are great places to collect data for your own language.

Install instruction

Please follow this instruction

Steps to build an accurate speech recognition model for your language

1. Train a self-supervised model on unlabeled data (Pretrain)

1.1 Prepare unlabeled audios

Collect unlabeled audio files and put them all in a single directory. Audio format requirements:
Format: wav, PCM 16 bit, single channel
Sampling_rate: 16000
Length: 5 to 30 seconds
Content: silence should be removed from the audio. Also, each audio file should contain only one person speaking.
Please look at the examples/unlabel_audio directory for reference.
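
For reference, below is a minimal conversion sketch, assuming librosa and soundfile are installed; the file names and silence threshold are illustrative, and the trimming here only removes leading/trailing silence:

import librosa
import soundfile as sf

def to_wav2vec_format(in_path, out_path, sr=16000, top_db=30):
    # load as mono and resample to 16 kHz
    audio, _ = librosa.load(in_path, sr=sr, mono=True)
    # trim leading/trailing silence (top_db is a rough guess; tune per dataset)
    audio, _ = librosa.effects.trim(audio, top_db=top_db)
    # write 16-bit PCM wav, single channel, as required above
    sf.write(out_path, audio, sr, subtype='PCM_16')

to_wav2vec_format('raw/podcast_clip.mp3', 'unlabel_audio/podcast_clip.wav')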

1.2 Download an initial model

Instead of training from scratch, we download and use the English wav2vec model for weight initialization. This practice can be applied to all languages.

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt

1.3 Run Pre-training

python3 pretrain.py --fairseq_path path/to/libs/fairseq --audio_path path/to/audio_directory --init_model path/to/wav2vec_small.pt

Where:

  • fairseq_path: path to the installed fairseq library (see the install instruction)
  • audio_path: path to the unlabeled audio directory
  • init_model: the model downloaded in step 1.2

Logs and checkpoints will be stored in the outputs directory.
Log_file path: outputs/date_time/exp_id/hydra_train.log. You should check the loss value to decide when to stop the training process.
Best_checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~4 days for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.
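
A small sketch for tracking the training loss from the hydra log, assuming it contains fairseq's JSON-formatted train stats (as in the log excerpts quoted in the issues below); the log path is illustrative, and the exact line format and stats key may differ between pretraining and fine-tuning runs or fairseq versions:

import json
import re

def train_losses(log_path, key='train_loss'):
    losses = []
    with open(log_path) as f:
        for line in f:
            # lines look like: [date][train][INFO] - {"epoch": 1, "train_loss": "887.379", ...}
            m = re.search(r'\[train\]\[INFO\] - (\{.*\})', line)
            if m:
                stats = json.loads(m.group(1))
                if key in stats:
                    losses.append(float(stats[key]))
    return losses

print(train_losses('outputs/2021-01-26/12-34-56/hydra_train.log')[-5:])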

2. Finetune the self-supervised model on the labeled data

2.1 Prepare labeled data

--- Transcript file ---
One training sample per line in the format "audio_absolute_path \tab transcript".
Example of a transcript file:

/path/to/1.wav AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
/path/to/2.wav AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
/path/to/3.wav JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
/path/to/4.wav BEING THIN TOUGH AND OPAQUE

Some notes on transcript file:

  • One sample per line
  • Upper case
  • All numbers should be transformed into verbal form.
  • All special characters (e.g. punctuation) should be removed. The final text should contain words only.
  • Words in a sentence must be separated by a whitespace character.

--- Labeled audio file ---
Format: wav, PCM 16 bit, single channel, Sampling_rate: 16000.
Silence should be removed from the audio.
Also, each audio file should contain only one person speaking.
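
A quick sanity-check sketch for the labeled data, assuming soundfile is installed; the character check assumes an English-style alphabet and should be adapted for your language:

import os
import re
import soundfile as sf

def check_transcript(transcript_path):
    with open(transcript_path) as f:
        for n, line in enumerate(f, 1):
            path, text = line.rstrip('\n').split('\t', 1)
            assert os.path.isfile(path), f'line {n}: missing audio {path}'
            info = sf.info(path)
            assert info.samplerate == 16000 and info.channels == 1, f'line {n}: audio must be 16 kHz mono'
            # uppercase letters, apostrophes and spaces only (adapt for your alphabet)
            assert re.fullmatch(r"[A-Z' ]+", text), f'line {n}: unexpected characters in transcript'

check_transcript('path/to/transcript.txt')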

2.2 Generate dictionary file

python3 gen_dict.py --transcript_file path/to/transcript.txt --save_dir path/to/save_dir

The dictionary file will be stored at save_dir/dict.ltr.txt. Use the file for fine-tuning and inference.
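
If you are curious what the dictionary contains: in fairseq-style wav2vec setups it is typically a plain-text list of characters with their frequencies, using | as the word-boundary symbol. The sketch below only approximates such a file for illustration; it is not the repo's gen_dict.py:

from collections import Counter

def build_letter_dict(transcript_path, out_path):
    counts = Counter()
    with open(transcript_path) as f:
        for line in f:
            _, text = line.rstrip('\n').split('\t', 1)
            # letter targets replace spaces with the word-boundary symbol '|'
            counts.update(text.replace(' ', '|') + '|')
    with open(out_path, 'w') as out:
        for ch, c in counts.most_common():
            out.write(f'{ch} {c}\n')

build_letter_dict('path/to/transcript.txt', 'save_dir/dict.ltr.txt')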

2.3 Run Fine-tuning on the pretrain model

python3 finetune.py --transcript_file path/to/transcript.txt --pretrain_model path/to/pretrain_checkpoint_best.pt --dict_file path/to/dict.ltr.txt

Where:

  • transcript_file: path to transcript file from step 2.1
  • pretrain_model: path to best model checkpoint from step 1.3
  • dict_file: dictionary file generated from step 2.2

Logs and checkpoints will be stored in the outputs directory.
Log_file path: outputs/date_time/exp_id/hydra_train.log. You should check the loss value to decide when to stop the training process.
Best_checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~12 hours for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.

3. Train a language model

3.1 Prepare text corpus

Collect all your texts and put them together in a single file (a minimal normalization sketch follows the examples below).
Text file format:

  • One sentence per line
  • Upper case
  • All numbers should be transformed into verbal form.
  • All special characters (e.g. punctuation) should be removed. The final text should contain words only.
  • Words in a sentence must be separated by a whitespace character.

Example of a text corpus file for English:

AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
BEING THIN TOUGH AND OPAQUE
...

Example of a text corpus file for Chinese:

每 个 人 都 有 他 的 作 战 策 略 直 到 脸 上 中 了 一 拳
这 是 我 年 轻 时 候 住 的 房 子 。
这 首 歌 使 我 想 起 了 我 年 轻 的 时 候 。
...
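
A minimal normalization sketch for turning raw text into the corpus format above; the character set assumes English, and number expansion (commented out) would need a language-specific library such as num2words:

import re

def normalize(line):
    line = line.upper()
    # keep letters, digits, apostrophes and spaces; everything else becomes a space
    line = re.sub(r"[^A-Z0-9' ]+", ' ', line)
    # numbers should be spelled out, e.g. with num2words for English:
    #   line = re.sub(r'\d+', lambda m: num2words(int(m.group())).upper(), line)
    return re.sub(r'\s+', ' ', line).strip()

with open('raw_text.txt') as f, open('text_corpus.txt', 'w') as out:
    for line in f:
        line = normalize(line)
        if line:
            out.write(line + '\n')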

3.2 Train the language model

python3 train_lm.py --kenlm_path path/to/libs/kenlm --transcript_file path/to/transcript.txt --additional_file path/to/text_corpus.txt --ngram 3 --output_path ./lm

Where:

  • kenlm_path: path to the installed kenlm library (see the install instruction)
  • transcript_file: path to transcript file from step 2.1
  • additional_file: path to text corpus file from step 3.1

The LM model and the lexicon file will be stored in output_path.
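
The lexicon maps each word to its letter sequence, ending with the word-boundary symbol |, which is what the wav2letter decoder expects. A rough sketch of building one from the corpus (an approximation for illustration, not the repo's train_lm.py):

def build_lexicon(corpus_path, out_path):
    words = set()
    with open(corpus_path) as f:
        for line in f:
            words.update(line.split())
    with open(out_path, 'w') as out:
        for w in sorted(words):
            # one entry per line, e.g. "HELLO  H E L L O |"
            out.write(w + '\t' + ' '.join(w) + ' |\n')

build_lexicon('path/to/text_corpus.txt', 'lm/lexicon.txt')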

4. Make prediction on multiple audios programmatically

from stt import Transcriber
transcriber = Transcriber(pretrain_model = 'path/to/pretrain.pt', finetune_model = 'path/to/finetune.pt', 
                          dictionary = 'path/to/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
print(hypos)

Where:

  • pretrain_model: path to best pretrain checkpoint from step 1.3
  • finetune_model: path to best fine-tuned checkpoint from step 2.3
  • dictionary: dictionary file generated from step 2.2
  • lm_lexicon and lm_model: generated from step 3.2

Note: If you are running inference in a Jupyter notebook, please add these lines above the inference script:

import sys
sys.argv = ['']

Pre-trained models (Pretrain + Fine-tune + LM)

Older version for Vietnamese speech recognition:

https://github.com/mailong25/self-supervised-speech-recognition/tree/vietnamese

Reference:

Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations: https://arxiv.org/abs/2006.11477
Source code: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

License

MIT

self-supervised-speech-recognition's Issues

Error in using STT

The Transcriber fails to initialize:

File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 250, in __init__
    self.transcribe([sample_audio_path])
  File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 297, in transcribe
    state['cfg']['model']['w2v_path'] = self.pretrain_model
TypeError: 'Namespace' object does not support item assignment

Validation in finetune.py and Tensorboarding

Hey

Once again, thanks for making a nice wrapper. I understand that your code ultimately calls the wav2vec commands given in the original repository. I have some questions which didn't get resolved there and I would love your suggestions:

  1. In pretrain.py, because you call manifest.py, one can set aside part of the data for validation. However, how do you think one can supply a validation split in finetune.py?
  2. How do I create a nice TensorBoard log from the existing hydra logs? Are there any existing utilities that can help me do it?

Thanks for helping out.

STT problem

When creating an instance of Transcriber class, getting the following error:

usage: data [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[etc]
data
data: error: the following arguments are required: data
An exception has occurred, use %tb to see the full traceback.

%tb output:

SystemExit Traceback (most recent call last)
in ()
----> 1 parser.parse_args()

4 frames
/usr/lib/python3.7/argparse.py in parse_args(self, args, namespace)
1762 # =====================================
1763 def parse_args(self, args=None, namespace=None):
-> 1764 args, argv = self.parse_known_args(args, namespace)
1765 if argv:
1766 msg = _('unrecognized arguments: %s')

/usr/lib/python3.7/argparse.py in parse_known_args(self, args, namespace)
1794 # parse the arguments and exit if there are any errors
1795 try:
-> 1796 namespace, args = self._parse_known_args(args, namespace)
1797 if hasattr(namespace, _UNRECOGNIZED_ARGS_ATTR):
1798 args.extend(getattr(namespace, _UNRECOGNIZED_ARGS_ATTR))

/usr/lib/python3.7/argparse.py in parse_known_args(self, arg_strings, namespace)
2029 if required_actions:
2030 self.error(
('the following arguments are required: %s') %
-> 2031 ', '.join(required_actions))
2032
2033 # make sure all required groups had one option present

/usr/lib/python3.7/argparse.py in error(self, message)
2515 self.print_usage(_sys.stderr)
2516 args = {'prog': self.prog, 'message': message}
-> 2517 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)

/usr/lib/python3.7/argparse.py in exit(self, status, message)
2502 if message:
2503 self._print_message(message, _sys.stderr)
-> 2504 _sys.exit(status)
2505
2506 def error(self, message):

SystemExit: 2

Licensing Information

Hey

Thank you for releasing this wrapper. It would be very helpful if you could add a license for your code.

How to calculate WER in testing

Hello @mailong25,
I tried testing with a few of the audio files available in this repo. According to the numbers you published, the WER should only be around 15%, but in my tests the WER came out as 450 even though the transcription was correct without a single wrong word. Another file was transcribed with only 1 out of 11 words wrong, yet the WER reached 470. Is there some mistake here?
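
For context, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so it can exceed 100% when the hypothesis contains many insertions; a minimal reference implementation:

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer('xin chao cac ban', 'xin chao cac ban'))  # 0.0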

Update for wav2letter v0.2?

As I understand it, the code from https://github.com/mailong25/wav2letter.git does not yet support running/training on GPU. Long, could you update the code to the current stable wav2letter v0.2 at https://github.com/facebookresearch/wav2letter/tree/v0.2?

Currently I have built both flashlight and wav2letter v0.2; the wav2vec model loads fine and all the exported temp files are correct, but at the wav2letter Decoder step the program stops (there was a small bug about renaming the variable silweight to silscore, which I already fixed).

Thanks,

I am getting an error when installing to Colab

Cloning into 'wav2letter'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 6587 (delta 34), reused 72 (delta 19), pack-reused 6468
Receiving objects: 100% (6587/6587), 6.13 MiB | 22.26 MiB/s, done.
Resolving deltas: 100% (4207/4207), done.
/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
env: KENLM_ROOT_DIR=/content/self-supervised-speech-recognition/libs/kenlm
Obtaining file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
Installing collected packages: wav2letter
Running setup.py develop for wav2letter
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; file='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
/content/self-supervised-speech-recognition/libs

Batch size, resuming from checkpoints and other utilities

Hey

I was browsing through this code and wanted to know a few things:

  1. How can I control the batch size? (Large models will require me to train with very small batch sizes due to limited compute.)
  2. How can I resume from a checkpoint? Currently there is no utility or function for this in your codebase.

Does the decoder using build/Decode reload the model?

Hello Mai Long, I noticed that the inference step, as currently implemented, reloads the model every time. Have you optimized the inference part yet?

Some overflow in pretrain model

Hi Mailong,
I am at the pretraining stage (using config/pretraining/wav2vec2_base_librispeech.yaml).
The total duration of my unlabeled data is ~100 hours, but ~60% of the clips are shorter than 5 seconds.
The only environment setting I changed is distributed_world_size, from 64 to 1 (I only have one GPU). Everything else is the same.
At the beginning of training, there are some warnings about overflow:
...
2021-01-26 03:01:31 | INFO | fairseq.trainer | begin training epoch 1
2021-01-26 03:02:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-01-26 03:03:23 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-01-26 03:04:15 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-01-26 03:10:26 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-01-26 03:56:38 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.
...

Two questions:

  1. Does that mean I should tune the lr because the gradient goes out of range in FP16, or can I ignore this warning?
  2. Our train_ups is really small (0.02). Is that normal?

Model Performance

Hello Long,
First of all, thank you for your contribution to the Vietnamese NLP community.
I have a question: when deploying this repo to small chips such as an ARM Cortex, is the inference speed good compared with traditional approaches such as MFCC+HMM? I am in the process of choosing a model but have not yet had the means to deploy and test it.
Thank you!

How did you create lst file for train and test

Hi Mr @mailong25,
I'm having an issue generating the train and test files. I see in your code that you're using two files with the .lst extension. But with my own dataset (for example, VLSP2020), I created my own train and test files, and the training process seemed to work fine until it failed with an error.
Can you explain this? Thank you so much. Below are my two files:
test.txt
train.txt

validate step in example log file but not when i'm running it

In the examples folder there is a file called hydra_train_finetune.log. In that file, before or after each epoch, I can see that there is a validation step.

[2020-12-19 17:41:16,018][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2020-12-19 17:41:22,153][valid][INFO] - {"epoch": 1, "valid_loss": "729.494", "valid_ntokens": "2521.93", "valid_nsentences": "49.1429", "valid_nll_loss": "14.215", "valid_uer": "123.012", "valid_wer": "100.393", "valid_raw_wer": "100.393", "valid_wps": "6759", "valid_wpb": "2521.9", "valid_bsz": "49.1", "valid_num_updates": "85"}
[2020-12-19 17:41:22,155][fairseq_cli.train][INFO] - begin save checkpoint
[2020-12-19 17:41:22,156][fairseq.trainer][INFO] - Preparing to save checkpoint to checkpoints/checkpoint_best.pt after 85 updates
[2020-12-19 17:41:24,604][fairseq.trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2020-12-19 17:41:25,543][fairseq.checkpoint_utils][INFO] - saved checkpoint checkpoints/checkpoint_best.pt (epoch 1 @ 85 updates, score 100.393) (writing took 3.387641379999991 seconds)
[2020-12-19 17:41:25,543][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2020-12-19 17:41:25,566][train][INFO] - {"epoch": 1, "train_loss": "887.379", "train_ntokens": "49633.1",

What configs should I take into consideration in order to have that validation step run for my dataset?
I'm running the finetune.py script with the config base_100h.yaml.

Using STT

Hi Mailong,
Thank you for helping us last time. After we removed files shorter than 2s, we were able to fine-tune the model with WER = 16, but when we use STT on Colab, we run into a problem:
ipykernel_launcher.py: error: unrecognized arguments: -f / mnt/disk2/data
Can you help us modify the code to solve this problem? We really need your help.

Finetuning a Vietnamese model

Hi Mailong,
We are doing a group project with wav2vec on Colab, and we use Colab to train the model on the VLSP dataset. Since Colab is limited, we use your pre-trained model for fine-tuning on our dataset. Our dataset is around 60-70 hours, using the 100h config. The process seemed fine at the beginning, but once the train loss reached 400, the valid loss suddenly increased significantly, and so did the WER; only the UER seems to decrease. Also, the train loss decreases very slowly, about 1 point per epoch. Can you help us with this problem?

Thank you very much.

Fine-tuning checkpoints

Hello,

My fine-tuning goes all the way up to epoch 49 (and counting), but no checkpoint is saved. Is this normal? It seems very different from the pre-training, where a checkpoint is saved after each epoch.

Many thanks.

Kind Regards,

WER on testing phase

Hi Mr @mailong25, I'm wondering how you calculate WER on the test audio files? If you do, where can I find it in your code?
Thank you

malloc(): mismatching next->prev_size (unsorted)

Hello, thank you for this. Everything works fine for me until the last step, when trying to transcribe a file.
I get the message: malloc(): mismatching next->prev_size (unsorted), aborted.
I also sometimes get a segmentation fault, but without any traceback.
Do you know what the cause of this might be?
Thanks :-)

Pretrain with Japanese audio?

Hi @mailong25,
Thanks for your repo, it's very nice. I want to pre-train a model (base-large) with my own Japanese audio, but whenever I do, I get the error: "dataset is empty". How can I resolve this, and how can I pretrain a model on another language (not English)? Thanks for your help!

Unable to load multiple models

Hi, I want to run multiple models, but after running one instance I am unable to run another instance.
I am getting this error:
File "/media/administrator/hdd/self-supervised-speech-recognition/stt.py", line 187, in optimize_models
model.cuda()
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please help me out.
GPU memory consumption for one instance is 2246 MB out of the 8 GB available.
Thanks,
@mailong25

sh: 1: fairseq-hydra-train: not found

When I run pretrain.py or finetune.py I get the following error.

python pretrain.py --fairseq_path libs/fairseq --audio_path wav --init_model wav2vec_small.pt
fairseq-hydra-train task.data=/home/bcode/speech-recognition/temp distributed_training.distributed_world_size=1 +optimization.update_freq='[64]' checkpoint.finetune_from_model=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=1200000 --config-dir config/pretraining --config-name wav2vec2_base_librispeech
sh: 1: fairseq-hydra-train: not found

python finetune.py --transcript_file transcript.txt --pretrain_model wav2vec_small.pt --dict_file dict.ltr.txt

100%|████████████████████████████████████| 1831/1831 [00:00<00:00, 15427.24it/s]
fairseq-hydra-train task.data=/home/bcode/speech-recognition/manifest distributed_training.distributed_world_size=1 +optimization.update_freq='[24]' model.w2v_path=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=2800000 --config-dir config/finetuning --config-name base_1h
sh: 1: fairseq-hydra-train: not found

ValueError in making prediction

Hi, your work is really awesome. I currently get a ValueError when running prediction with your public pre-trained models.
While I was trying to debug the error via facebookresearch/fairseq#2106, the problem still occurred.
Please tell me how to fix it. My computer's OS is Windows 10, by the way. Many thanks

Wav2letter installation error

Hi there
I am following the instructions in 'Install instruction' section but when I run this

!pip install -e .

I get the following error

file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python Installing collected packages: wav2letter Running setup.py develop for wav2letter ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output

Any idea how I can fix this issue?
I also tried removing the build folder in the wav2letter repo, but it still doesn't work.

Dataset

What Vietnamese dataset did you use?

[Question] How to improve ASR for production usage

Describe your question
I am seeking guidelines from the community for improving models for production:

  • different accents, for example, US/UK/AU English
  • filler words, for example, um, ah
  • domain words, for example, company names, finance terms

Can I finetune on another language?

I have little data, and I want to ask whether it is possible to fine-tune your wav2vec pretrained model directly on another language, such as Chinese. Thanks.

Strange characters prediction

Hi mailong25. While training the model on the vivos dataset, I noticed that as the model converges, the transcriptions it predicts on the random files I test increasingly contain strange characters interleaved between the words (I think it may be predicting background noise).

Have you ever run into this case? If so, please suggest a way to fix this problem. Thank you.

How to create the lexicon.txt

Hi @mailong25, I am sorry if this is a basic question. I would like some pointers on how you generated the lexicon.txt. I have an ASR dataset (audio + transcriptions): https://github.com/csikasote/bemba-language-corpus. My goal is to develop an ASR system using wav2vec and wav2letter, possibly adapting your code to my dataset. Did you generate the lexicon.txt from the transcriptions? I would appreciate your help.

How to do lexicon-free decoding

Hi,
To do lexicon-free decoding I set args.lexicon to False in w2l_decoder.py, but I got an empty string. Please explain how to do lexicon-free decoding. In w2l_decoder.py I have seen the line "lexicon free decoding can only be done with a unit language model". If that is the case, how can I create a unit language model?

Right now I am having the problem below with the current lexicon LM setup.
For example, in lexicon.txt I have two words, tamil and ama. If I send audio of the spoken word tamil to the ASR, it predicts tamah (no LM). After language-model processing it always comes out as ama.
Another example: now in lexicon.txt I have the words tamil and am. If the ASR predicts tamah, the language model outputs the string am. I just want to solve this problem. I tried many different values of the lm_weight and word_score parameters, but with no success. I know that lm_weight and word_score will not influence these scenarios much.
Thanks,
Please, anyone, help me out.
@SenriYoshikawa
@mailong25

Response empty value?

Hi mailong25,
I built the source on Google Colab and everything runs fine, but when I test with my data, the response returned is ['']. I cannot find the reason. Is it because of the model I trained, or is there some other cause? Hope you can clarify. Great lib!

Import stt is not working!

Hi @mailong25 @SenriYoshikawa ,

I am able to train the acoustic model and KenLM, but when I try to import stt for transcribing, I get the following error in the self-supervised-speech-recognition directory.

Error


ImportError Traceback (most recent call last)
in ()
1 get_ipython().magic('cd /content/self-supervised-speech-recognition')
----> 2 import stt

/content/self-supervised-speech-recognition/stt.py in ()
7 import numpy as np
8 import torch
----> 9 from fairseq import checkpoint_utils, options, progress_bar, tasks, utils
10 from fairseq.data.data_utils import post_process
11 from fairseq.logging.meters import StopwatchMeter, TimeMeter

ImportError: cannot import name 'checkpoint_utils' from 'fairseq' (unknown location)

I tried importing fairseq inside the directory /content/self-supervised-speech-recognition/libs/fairseq and it works fine there. Kindly help me with a workaround for this conflict between the directories. Regards

Error in Making prediction

Hi all

After following the Install instruction section and downloading your pre-trained models, I executed this code in Colab:

from stt import Transcriber
transcriber = Transcriber(pretrain_model = '/content/vietnamese_wav2vec/pretrain.pt', finetune_model = '/content/vietnamese_wav2vec/finetune.pt', 
                          dictionary = '/content/vietnamese_wav2vec/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = '/content/vietnamese_wav2vec/lexicon.txt', lm_model = '/content/vietnamese_wav2vec/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['/content/1000.wav'])
print(hypos)

But it gives me this error:

usage: ipykernel_launcher.py [-h] [--no-progress-bar]
                             [--log-interval LOG_INTERVAL]
                             [--log-format {json,none,simple,tqdm}]
                             [--tensorboard-logdir TENSORBOARD_LOGDIR]
                             [--wandb-project WANDB_PROJECT]
                             [--azureml-logging] [--seed SEED] [--cpu] [--tpu]
                             [--bf16] [--memory-efficient-bf16] [--fp16]
                             [--memory-efficient-fp16]
                             [--fp16-no-flatten-grads]
                             [--fp16-init-scale FP16_INIT_SCALE]
                             [--fp16-scale-window FP16_SCALE_WINDOW]
                             [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                             [--min-loss-scale MIN_LOSS_SCALE]
                             [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                             [--user-dir USER_DIR]
                             [--empty-cache-freq EMPTY_CACHE_FREQ]
                             [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                             [--model-parallel-size MODEL_PARALLEL_SIZE]
                             [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                             [--profile] [--reset-logging]
                             [--criterion {cross_entropy,label_smoothed_cross_entropy,composite_loss,sentence_ranking,label_smoothed_cross_entropy_with_alignment,legacy_masked_lm_loss,wav2vec,masked_lm,ctc,sentence_prediction,nat_loss,model,adaptive_loss,vocab_parallel_cross_entropy}]
                             [--tokenizer {space,moses,nltk}]
                             [--bpe {bert,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,sentencepiece,bytes,characters}]
                             [--optimizer {adamax,adadelta,nag,adagrad,adafactor,adam,sgd,composite,lamb}]
                             [--lr-scheduler {inverse_sqrt,triangular,cosine,fixed,tri_stage,reduce_lr_on_plateau,polynomial_decay,manual,pass_through}]
                             [--scoring {sacrebleu,bleu,chrf,wer}]
                             [--task TASK] [--num-workers NUM_WORKERS]
                             [--skip-invalid-size-inputs-valid-test]
                             [--max-tokens MAX_TOKENS]
                             [--batch-size BATCH_SIZE]
                             [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                             [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                             [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                             [--data-buffer-size DATA_BUFFER_SIZE]
                             [--train-subset TRAIN_SUBSET]
                             [--valid-subset VALID_SUBSET]
                             [--validate-interval VALIDATE_INTERVAL]
                             [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                             [--validate-after-updates VALIDATE_AFTER_UPDATES]
                             [--fixed-validation-seed FIXED_VALIDATION_SEED]
                             [--disable-validation]
                             [--max-tokens-valid MAX_TOKENS_VALID]
                             [--batch-size-valid BATCH_SIZE_VALID]
                             [--curriculum CURRICULUM]
                             [--gen-subset GEN_SUBSET]
                             [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                             [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                             [--distributed-rank DISTRIBUTED_RANK]
                             [--distributed-backend DISTRIBUTED_BACKEND]
                             [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                             [--distributed-port DISTRIBUTED_PORT]
                             [--device-id DEVICE_ID] [--distributed-no-spawn]
                             [--ddp-backend {c10d,no_c10d}]
                             [--bucket-cap-mb BUCKET_CAP_MB]
                             [--fix-batches-to-gpus]
                             [--find-unused-parameters] [--fast-stat-sync]
                             [--broadcast-buffers]
                             [--distributed-wrapper {DDP,SlowMo}]
                             [--slowmo-momentum SLOWMO_MOMENTUM]
                             [--slowmo-algorithm SLOWMO_ALGORITHM]
                             [--localsgd-frequency LOCALSGD_FREQUENCY]
                             [--nprocs-per-node NPROCS_PER_NODE]
                             [--pipeline-model-parallel]
                             [--pipeline-balance PIPELINE_BALANCE]
                             [--pipeline-devices PIPELINE_DEVICES]
                             [--pipeline-chunks PIPELINE_CHUNKS]
                             [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                             [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                             [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                             [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                             [--pipeline-checkpoint {always,never,except_last}]
                             [--zero-sharding {none,os}] [--path PATH]
                             [--post-process [POST_PROCESS]] [--quiet]
                             [--model-overrides MODEL_OVERRIDES]
                             [--results-path RESULTS_PATH] [--beam BEAM]
                             [--nbest NBEST] [--max-len-a MAX_LEN_A]
                             [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                             [--match-source-len] [--unnormalized]
                             [--no-early-stop] [--no-beamable-mm]
                             [--lenpen LENPEN] [--unkpen UNKPEN]
                             [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                             [--score-reference] [--prefix-size PREFIX_SIZE]
                             [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                             [--sampling] [--sampling-topk SAMPLING_TOPK]
                             [--sampling-topp SAMPLING_TOPP]
                             [--constraints [{ordered,unordered}]]
                             [--temperature TEMPERATURE]
                             [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                             [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                             [--diversity-rate DIVERSITY_RATE]
                             [--print-alignment [{hard,soft}]] [--print-step]
                             [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                             [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                             [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                             [--iter-decode-force-max-iter]
                             [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                             [--iter-decode-with-external-reranker]
                             [--retain-iter-history] [--retain-dropout]
                             [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                             [--decoding-format {unigram,ensemble,vote,dp,bs}]
                             [--no-seed-provided] [--save-dir SAVE_DIR]
                             [--restore-file RESTORE_FILE]
                             [--finetune-from-model FINETUNE_FROM_MODEL]
                             [--reset-dataloader] [--reset-lr-scheduler]
                             [--reset-meters] [--reset-optimizer]
                             [--optimizer-overrides OPTIMIZER_OVERRIDES]
                             [--save-interval SAVE_INTERVAL]
                             [--save-interval-updates SAVE_INTERVAL_UPDATES]
                             [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                             [--keep-last-epochs KEEP_LAST_EPOCHS]
                             [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                             [--no-save] [--no-epoch-checkpoints]
                             [--no-last-checkpoints]
                             [--no-save-optimizer-state]
                             [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                             [--maximize-best-checkpoint-metric]
                             [--patience PATIENCE]
                             [--checkpoint-suffix CHECKPOINT_SUFFIX]
                             [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                             [--load-checkpoint-on-all-dp-ranks]
                             [--kspmodel KSPMODEL] [--wfstlm WFSTLM]
                             [--rnnt_decoding_type RNNT_DECODING_TYPE]
                             [--rnnt_len_penalty RNNT_LEN_PENALTY]
                             [--w2l-decoder W2L_DECODER] [--lexicon LEXICON]
                             [--unit-lm] [--kenlm-model KENLM_MODEL]
                             [--beam-threshold BEAM_THRESHOLD]
                             [--beam-size-token BEAM_SIZE_TOKEN]
                             [--word-score WORD_SCORE]
                             [--unk-weight UNK_WEIGHT]
                             [--sil-weight SIL_WEIGHT]
                             [--dump-emissions DUMP_EMISSIONS]
                             [--dump-features DUMP_FEATURES]
                             [--load-emissions LOAD_EMISSIONS] [-s SRC]
                             [-t TARGET] [--load-alignments]
                             [--left-pad-source BOOL] [--left-pad-target BOOL]
                             [--max-source-positions N]
                             [--max-target-positions N]
                             [--upsample-primary UPSAMPLE_PRIMARY]
                             [--truncate-source] [--num-batch-buckets N]
                             [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
                             [--eval-bleu-detok-args JSON]
                             [--eval-tokenized-bleu]
                             [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
                             [--eval-bleu-args JSON]
                             [--eval-bleu-print-samples]
                             [--force-anneal FORCE_ANNEAL]
                             [--lr-shrink LR_SHRINK]
                             [--warmup-updates WARMUP_UPDATES] [--pad PAD]
                             [--eos EOS] [--unk UNK]
                             data
ipykernel_launcher.py: error: unrecognized arguments: -f /mnt/disks2/data /mnt/disks2/data

An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


and %tb gives this:


---------------------------------------------------------------------------

SystemExit                                Traceback (most recent call last)

<ipython-input-46-24d37fe9d36e> in <module>()
      4                           lm_type = 'kenlm',
      5                           lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
----> 6                           lm_weight = 1.5, word_score = -1, beam_size = 50)
      7 hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
      8 print(hypos)

4 frames

/usr/lib/python3.7/argparse.py in exit(self, status, message)
   2502         if message:
   2503             self._print_message(message, _sys.stderr)
-> 2504         _sys.exit(status)
   2505 
   2506     def error(self, message):

SystemExit: 2



Why is this happening?
Could you please help me with that?

finetune.py optimization.update_freq

I was wondering why in the finetune.py file you've set update_freq to 24/NUM_GPU.

    cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")

In the wav2vec README https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md they say that the base model was trained using 64 V100 GPUs, and as I understand it, if we want to do more training on the base model we should simulate the number of GPUs they used.

Note: you can simulate 64 GPUs by using k GPUs and adding command line parameters (before --config-dir) distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 64/k

Have you found that setting update_freq to 24/NUM_GPU is better for training, or is it a bug?
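
For reference, the rule quoted from the fairseq README works out as follows (a small sanity check, using only the numbers quoted above):

# simulating 64 GPUs with k physical GPUs: update_freq = 64 / k
for k in (1, 2, 4, 8):
    print(k, 'GPUs -> update_freq =', 64 // k)
# with 2 GPUs this gives 32, while finetune.py's 24/NUM_GPU gives 12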

Pretraining larger models?

"Please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can I pretrain the different versions of wav2vec with the same code?

Is it OK to do SpecAugment and also add noise to the audio, then train on that?

Hi @mailong25, thanks for your great work, especially combining the language model into this ASR system. I have one doubt: will adding noise to the audio affect performance? I searched for this kind of augmentation in your GitHub and in the fairseq wav2vec GitHub, but I was unable to find any reference to noise augmentation or any other augmentation. Why is that?

I have read that wav2vec, put simply, works like a clustering scheme using a contrastive loss. So I feel that even if we add noise to the original audio, it is still going to cluster the sampled part of the audio to the nearby neighbor embedding.

Thanks

It says Saved but I checked the folder and it is empty.

Hi, It says Saved but I checked the folder and it is empty.

2021-01-29 08:29:52 | INFO | fairseq.trainer | Preparing to save checkpoint to checkpoints/checkpoint_last.pt after 21 updates
2021-01-29 08:29:56 | INFO | fairseq.trainer | Finished saving checkpoint to checkpoints/checkpoint_last.pt
2021-01-29 08:29:56 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/checkpoint_last.pt (epoch 33 @ 21 updates, score 3.125) (writing took 4.057792565999989 seconds)
2021-01-29 08:29:56 | INFO | fairseq_cli.train | end of epoch 33 (average epoch stats below)

Segmentation fault in inference

Hi mailong,
I followed your install instructions and am now stuck at stage 4 (Make prediction on multiple audios programmatically).
It only shows Segmentation fault (core dumped).
Does that mean my wav2letter installation is broken?

Hard coding parameters and avoiding argparse

Hey

I was wondering if you have any suggestions on how to get around argparse for wav2vec. Argparse is great for training, but once you have a fixed set of parameters, argparse must be avoided since it interferes with other things once the model is part of an application.

I know wav2vec itself uses argparse everywhere, but I was wondering if you might have thought about this problem and a workaround. Thanks.

Issue while fine-tuning the wav2vec model

When I run the following script after pretraining the xlsr model with my own dataset, I get the error below:

FileNotFoundError: [Errno 2] No such file or directory: '/home/edo/self-supervised-speech-recognition/manifest/dev_other.tsv'

I generated the dictionary file as well as the labeled data according to the tutorial on this page, with the following command:
python3 gen_dict.py --transcript_file /home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --save_dir dictionary

Then I run this command:
python3 finetune.py --transcript_file home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --pretrain_model ../models/07-46-31/checkpoints/checkpoint_best.pt --dict_file dictionary/dict.ltr.txt
and get the error above.

Environment

OS: Ubuntu 18.04 LTS
Cuda Version: 11.2

Additional Context

However, when I change the model used in the fine-tuning process to wav2vec_small.pt instead of models/07-46-31/checkpoints/checkpoint_best.pt, it works, and I don't know why.

Am I missing something? I'm pretty sure I've followed the instructions. Or is it a bug in fairseq itself?

Tuning lm_weight, word_score and beam_size

Hey

How do you recommend we tune the parameters in the transcribe function: lm_weight, word_score, and beam_size? Normally with things like DeepSpeech 2 we would use its logits to tune these, but what about wav2vec?

Thanks
