self-supervised-speech-recognition's Introduction

Self-supervised speech recognition with limited amount of labeled data

Note: I no longer maintain this repo. If you encounter any problems, please look for similar reported issues in the fairseq repo.

This is a wrapper around the wav2vec 2.0 framework that aims to build accurate speech recognition models with a small amount of transcribed data (e.g. 1 hour).

Transfer learning is still the main technique:

  • Transfer from self-supervised models (pretrain on unlabeled data)
  • Transfer from multilingual models (pretrain on multilingual data)

Required resources

1. Labeled data, i.e. pairs of (audio, transcript)

The more you have, the better the model will be. Prepare at least 1 hour if you have a large amount of unlabeled data; otherwise, at least 50 hours is recommended.

2. Text data for building language models.

This should include both well-written text and conversational text, which can easily be collected from news and forum websites. At least 1 GB of text is recommended.

3. Unlabeled data (audio without transcriptions) in your own language.

This is optional but highly valuable. A good amount of unlabeled audio (e.g. 500 hours) will significantly reduce the amount of labeled data needed and also boost model performance. YouTube and podcasts are great places to collect data for your own language.

Install instruction

Please follow this instruction

Steps to build an accurate speech recognition model for your language

1. Train a self-supervised model on unlabeled data (Pretrain)

1.1 Prepare unlabeled audios

Collect unlabeled audio files and put them all in a single directory. Audio format requirements:
Format: wav, PCM 16 bit, single channel
Sampling_rate: 16000
Length: 5 to 30 seconds
Content: silence should be removed from the audio. Also, each audio file should contain only one person speaking.
Please look at the examples/unlabel_audio directory for reference.
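
For reference, below is a minimal conversion sketch, assuming librosa and soundfile are installed; the file names and silence threshold are illustrative, and the trimming here only removes leading/trailing silence:

import librosa
import soundfile as sf

def to_wav2vec_format(in_path, out_path, sr=16000, top_db=30):
    # load as mono and resample to 16 kHz
    audio, _ = librosa.load(in_path, sr=sr, mono=True)
    # trim leading/trailing silence (top_db is a rough guess; tune per dataset)
    audio, _ = librosa.effects.trim(audio, top_db=top_db)
    # write 16-bit PCM wav, single channel, as required above
    sf.write(out_path, audio, sr, subtype='PCM_16')

to_wav2vec_format('raw/podcast_clip.mp3', 'unlabel_audio/podcast_clip.wav')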

1.2 Download an initial model

Instead of training from scratch, we download and use the English wav2vec model for weight initialization. This practice can be applied to all languages.

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt

1.3 Run Pre-training

python3 pretrain.py --fairseq_path path/to/libs/fairseq --audio_path path/to/audio_directory --init_model path/to/wav2vec_small.pt

Where:

  • fairseq_path: path to the installed fairseq library (see the install instruction)
  • audio_path: path to the unlabeled audio directory
  • init_model: the model downloaded in step 1.2

Logs and checkpoints will be stored in the outputs directory.
Log_file path: outputs/date_time/exp_id/hydra_train.log. You should check the loss value to decide when to stop the training process.
Best_checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~4 days for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.
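
A small sketch for tracking the training loss from the hydra log, assuming it contains fairseq's JSON-formatted train stats (as in the log excerpts quoted in the issues below); the log path is illustrative, and the exact line format and stats key may differ between pretraining and fine-tuning runs or fairseq versions:

import json
import re

def train_losses(log_path, key='train_loss'):
    losses = []
    with open(log_path) as f:
        for line in f:
            # lines look like: [date][train][INFO] - {"epoch": 1, "train_loss": "887.379", ...}
            m = re.search(r'\[train\]\[INFO\] - (\{.*\})', line)
            if m:
                stats = json.loads(m.group(1))
                if key in stats:
                    losses.append(float(stats[key]))
    return losses

print(train_losses('outputs/2021-01-26/12-34-56/hydra_train.log')[-5:])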

2. Finetune the self-supervised model on the labeled data

2.1 Prepare labeled data

--- Transcript file ---
One training sample per line in the format "audio_absolute_path \tab transcript".
Example of a transcript file:

/path/to/1.wav AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
/path/to/2.wav AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
/path/to/3.wav JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
/path/to/4.wav BEING THIN TOUGH AND OPAQUE

Some notes on transcript file:

  • One sample per line
  • Upper case
  • All numbers should be transformed into verbal form.
  • All special characters (e.g. punctuation) should be removed. The final text should contain words only.
  • Words in a sentence must be separated by a whitespace character.

--- Labeled audio file ---
Format: wav, PCM 16 bit, single channel, Sampling_rate: 16000.
Silence should be removed from the audio.
Also, each audio file should contain only one person speaking.
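
A quick sanity-check sketch for the labeled data, assuming soundfile is installed; the character check assumes an English-style alphabet and should be adapted for your language:

import os
import re
import soundfile as sf

def check_transcript(transcript_path):
    with open(transcript_path) as f:
        for n, line in enumerate(f, 1):
            path, text = line.rstrip('\n').split('\t', 1)
            assert os.path.isfile(path), f'line {n}: missing audio {path}'
            info = sf.info(path)
            assert info.samplerate == 16000 and info.channels == 1, f'line {n}: audio must be 16 kHz mono'
            # uppercase letters, apostrophes and spaces only (adapt for your alphabet)
            assert re.fullmatch(r"[A-Z' ]+", text), f'line {n}: unexpected characters in transcript'

check_transcript('path/to/transcript.txt')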

2.2 Generate dictionary file

python3 gen_dict.py --transcript_file path/to/transcript.txt --save_dir path/to/save_dir

The dictionary file will be stored at save_dir/dict.ltr.txt. Use the file for fine-tuning and inference.
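
If you are curious what the dictionary contains: in fairseq-style wav2vec setups it is typically a plain-text list of characters with their frequencies, using | as the word-boundary symbol. The sketch below only approximates such a file for illustration; it is not the repo's gen_dict.py:

from collections import Counter

def build_letter_dict(transcript_path, out_path):
    counts = Counter()
    with open(transcript_path) as f:
        for line in f:
            _, text = line.rstrip('\n').split('\t', 1)
            # letter targets replace spaces with the word-boundary symbol '|'
            counts.update(text.replace(' ', '|') + '|')
    with open(out_path, 'w') as out:
        for ch, c in counts.most_common():
            out.write(f'{ch} {c}\n')

build_letter_dict('path/to/transcript.txt', 'save_dir/dict.ltr.txt')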

2.3 Run Fine-tuning on the pretrain model

python3 finetune.py --transcript_file path/to/transcript.txt --pretrain_model path/to/pretrain_checkpoint_best.pt --dict_file path/to/dict.ltr.txt

Where:

  • transcript_file: path to transcript file from step 2.1
  • pretrain_model: path to best model checkpoint from step 1.3
  • dict_file: dictionary file generated from step 2.2

Logs and checkpoints will be stored in the outputs directory.
Log_file path: outputs/date_time/exp_id/hydra_train.log. You should check the loss value to decide when to stop the training process.
Best_checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~12 hours for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.

3. Train a language model

3.1 Prepare text corpus

Collect all your texts and put them together in a single file (a minimal normalization sketch follows the examples below).
Text file format:

  • One sentence per line
  • Upper case
  • All numbers should be transformed into verbal form.
  • All special characters (e.g. punctuation) should be removed. The final text should contain words only.
  • Words in a sentence must be separated by a whitespace character.

Example of a text corpus file for English:

AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
BEING THIN TOUGH AND OPAQUE
...

Example of a text corpus file for Chinese:

每 个 人 都 有 他 的 作 战 策 略 直 到 脸 上 中 了 一 拳
这 是 我 年 轻 时 候 住 的 房 子 。
这 首 歌 使 我 想 起 了 我 年 轻 的 时 候 。
...
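
A minimal normalization sketch for turning raw text into the corpus format above; the character set assumes English, and number expansion (commented out) would need a language-specific library such as num2words:

import re

def normalize(line):
    line = line.upper()
    # keep letters, digits, apostrophes and spaces; everything else becomes a space
    line = re.sub(r"[^A-Z0-9' ]+", ' ', line)
    # numbers should be spelled out, e.g. with num2words for English:
    #   line = re.sub(r'\d+', lambda m: num2words(int(m.group())).upper(), line)
    return re.sub(r'\s+', ' ', line).strip()

with open('raw_text.txt') as f, open('text_corpus.txt', 'w') as out:
    for line in f:
        line = normalize(line)
        if line:
            out.write(line + '\n')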

3.2 Train the language model

python3 train_lm.py --kenlm_path path/to/libs/kenlm --transcript_file path/to/transcript.txt --additional_file path/to/text_corpus.txt --ngram 3 --output_path ./lm

Where:

  • kenlm_path: path to the installed kenlm library (see the install instruction)
  • transcript_file: path to transcript file from step 2.1
  • additional_file: path to text corpus file from step 3.1

The LM model and the lexicon file will be stored in output_path.
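
The lexicon maps each word to its letter sequence, ending with the word-boundary symbol |, which is what the wav2letter decoder expects. A rough sketch of building one from the corpus (an approximation for illustration, not the repo's train_lm.py):

def build_lexicon(corpus_path, out_path):
    words = set()
    with open(corpus_path) as f:
        for line in f:
            words.update(line.split())
    with open(out_path, 'w') as out:
        for w in sorted(words):
            # one entry per line, e.g. "HELLO  H E L L O |"
            out.write(w + '\t' + ' '.join(w) + ' |\n')

build_lexicon('path/to/text_corpus.txt', 'lm/lexicon.txt')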

4. Make prediction on multiple audios programmatically

from stt import Transcriber
transcriber = Transcriber(pretrain_model = 'path/to/pretrain.pt', finetune_model = 'path/to/finetune.pt', 
                          dictionary = 'path/to/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
print(hypos)

Where:

  • pretrain_model: path to best pretrain checkpoint from step 1.3
  • finetune_model: path to best fine-tuned checkpoint from step 2.3
  • dictionary: dictionary file generated from step 2.2
  • lm_lexicon and lm_model: generated from step 3.2

Note: If you are running inference in a Jupyter notebook, please add these lines above the inference script:

import sys
sys.argv = ['']

Pre-trained models (Pretrain + Fine-tune + LM)

Older version for Vietnamese speech recognition:

https://github.com/mailong25/self-supervised-speech-recognition/tree/vietnamese

Reference:

Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations: https://arxiv.org/abs/2006.11477
Source code: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

License

MIT

self-supervised-speech-recognition's Issues

Error in using STT

The Transcriber fails to initialize:

File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 250, in __init__
    self.transcribe([sample_audio_path])
  File "/home/sreyan/interspeech_hindi/self-supervised-speech-recognition/stt.py", line 297, in transcribe
    state['cfg']['model']['w2v_path'] = self.pretrain_model
TypeError: 'Namespace' object does not support item assignment

Validation in finetune.py and Tensorboarding

Hey

Once again, thanks for making a nice wrapper. I understand that your code ultimately calls the wav2vec commands given in the original repository. I have some questions which didn't get resolved there and I would love your suggestions:

  1. In pretrain.py, because you call manifest.py, one can set aside part of the data for validation. However, how do you think one can supply a validation split in finetune.py?
  2. How do I create a nice TensorBoard log from the existing hydra logs? Are there any existing utilities that can help me do it?

Thanks for helping out.

STT problem

When creating an instance of Transcriber class, getting the following error:

usage: data [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[etc]
data
data: error: the following arguments are required: data
An exception has occurred, use %tb to see the full traceback.

%tb output:

SystemExit Traceback (most recent call last)
in ()
----> 1 parser.parse_args()

4 frames
/usr/lib/python3.7/argparse.py in parse_args(self, args, namespace)
1762 # =====================================
1763 def parse_args(self, args=None, namespace=None):
-> 1764 args, argv = self.parse_known_args(args, namespace)
1765 if argv:
1766 msg = _('unrecognized arguments: %s')

/usr/lib/python3.7/argparse.py in parse_known_args(self, args, namespace)
1794 # parse the arguments and exit if there are any errors
1795 try:
-> 1796 namespace, args = self._parse_known_args(args, namespace)
1797 if hasattr(namespace, _UNRECOGNIZED_ARGS_ATTR):
1798 args.extend(getattr(namespace, _UNRECOGNIZED_ARGS_ATTR))

/usr/lib/python3.7/argparse.py in parse_known_args(self, arg_strings, namespace)
2029 if required_actions:
2030 self.error(
('the following arguments are required: %s') %
-> 2031 ', '.join(required_actions))
2032
2033 # make sure all required groups had one option present

/usr/lib/python3.7/argparse.py in error(self, message)
2515 self.print_usage(_sys.stderr)
2516 args = {'prog': self.prog, 'message': message}
-> 2517 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)

/usr/lib/python3.7/argparse.py in exit(self, status, message)
2502 if message:
2503 self._print_message(message, _sys.stderr)
-> 2504 _sys.exit(status)
2505
2506 def error(self, message):

SystemExit: 2

Licensing Information

Hey

Thank you for releasing this wrapper. It would be very helpful if you could add a license for your code.

How to calculate WER in testing

Hello @mailong25,
I tried testing with a few of the audio files available in this repo. According to the numbers you published, the WER should only be around 15%, but in my tests the WER came out as 450 even though the transcription was correct without a single wrong word. Another file was transcribed with only 1 out of 11 words wrong, yet the WER reached 470. Is there some mistake here?
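
For context, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so it can exceed 100% when the hypothesis contains many insertions; a minimal reference implementation:

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer('xin chao cac ban', 'xin chao cac ban'))  # 0.0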

Update for wav2letter v0.2?

As I understand it, the code from https://github.com/mailong25/wav2letter.git does not yet support running/training on GPU. Long, could you update the code to the current stable wav2letter v0.2 at https://github.com/facebookresearch/wav2letter/tree/v0.2?

Currently I have built both flashlight and wav2letter v0.2; the wav2vec model loads fine and all the exported temp files are correct, but at the wav2letter Decoder step the program stops (there was a small bug about renaming the variable silweight to silscore, which I already fixed).

Thanks,

I am getting an error when installing to Colab

Cloning into 'wav2letter'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 6587 (delta 34), reused 72 (delta 19), pack-reused 6468
Receiving objects: 100% (6587/6587), 6.13 MiB | 22.26 MiB/s, done.
Resolving deltas: 100% (4207/4207), done.
/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
env: KENLM_ROOT_DIR=/content/self-supervised-speech-recognition/libs/kenlm
Obtaining file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python
Installing collected packages: wav2letter
Running setup.py develop for wav2letter
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; file='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
/content/self-supervised-speech-recognition/libs

Batch size, resuming from checkpoints and other utilities

Hey

I was browsing through this code and wanted to know a few things:

  1. How can I control the batch size? (Large models will require me to train with very small batch sizes due to limited compute.)
  2. How can I resume from a checkpoint? Currently there is no utility or function for this in your codebase.

Does the decoder using build/Decode reload the model?

Hello Mai Long, I noticed that the inference step, as currently implemented, reloads the model every time. Have you optimized the inference part yet?

Some overflow in pretrain model

Hi Mailong,
I am at the pretraining stage (using config/pretraining/wav2vec2_base_librispeech.yaml).
The total duration of my unlabeled data is ~100 hours, but ~60% of the clips are shorter than 5 seconds.
The only environment setting I changed is distributed_world_size, from 64 to 1 (I only have one GPU). Everything else is the same.
At the beginning of training, there are some warnings about overflow:
...
2021-01-26 03:01:31 | INFO | fairseq.trainer | begin training epoch 1
2021-01-26 03:02:32 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-01-26 03:03:23 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-01-26 03:04:15 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-01-26 03:10:26 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-01-26 03:56:38 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.
...

Two questions:

  1. Does that mean I should tune the lr because the gradient goes out of range in FP16, or can I ignore this warning?
  2. Our train_ups is really small (0.02). Is that normal?

Model Performance

Hello Long,
First of all, thank you for your contribution to the Vietnamese NLP community.
I have a question: when deploying this repo to small chips such as an ARM Cortex, is the inference speed good compared with traditional approaches such as MFCC+HMM? I am in the process of choosing a model but have not yet had the means to deploy and test it.
Thank you!

How did you create lst file for train and test

Hi Mr @mailong25,
I'm having an issue generating the train and test files. I see in your code that you're using two files with the .lst extension. But with my own dataset (for example, VLSP2020), I created my own train and test files, and the training process seemed to work fine until it failed with an error.
Can you explain this? Thank you so much. Below are my two files:
test.txt
train.txt

validate step in example log file but not when i'm running it

In the examples folder there is a file called hydra_train_finetune.log. In that file, before or after each epoch, I can see that there is a validation step.

[2020-12-19 17:41:16,018][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2020-12-19 17:41:22,153][valid][INFO] - {"epoch": 1, "valid_loss": "729.494", "valid_ntokens": "2521.93", "valid_nsentences": "49.1429", "valid_nll_loss": "14.215", "valid_uer": "123.012", "valid_wer": "100.393", "valid_raw_wer": "100.393", "valid_wps": "6759", "valid_wpb": "2521.9", "valid_bsz": "49.1", "valid_num_updates": "85"}
[2020-12-19 17:41:22,155][fairseq_cli.train][INFO] - begin save checkpoint
[2020-12-19 17:41:22,156][fairseq.trainer][INFO] - Preparing to save checkpoint to checkpoints/checkpoint_best.pt after 85 updates
[2020-12-19 17:41:24,604][fairseq.trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2020-12-19 17:41:25,543][fairseq.checkpoint_utils][INFO] - saved checkpoint checkpoints/checkpoint_best.pt (epoch 1 @ 85 updates, score 100.393) (writing took 3.387641379999991 seconds)
[2020-12-19 17:41:25,543][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2020-12-19 17:41:25,566][train][INFO] - {"epoch": 1, "train_loss": "887.379", "train_ntokens": "49633.1",

What configs should I take into consideration in order to have that validation step run for my dataset?
I'm running the finetune.py script with the config base_100h.yaml.

Using STT

Hi Mailong,
Thank you for helping us last time. After we removed files shorter than 2s, we were able to fine-tune the model with WER = 16, but when we use STT on Colab, we run into a problem:
ipykernel_launcher.py: error: unrecognized arguments: -f / mnt/disk2/data
Can you help us modify the code to solve this problem? We really need your help.

Finetuning a Vietnamese model

Hi Mailong,
We are doing a group project with wav2vec on Colab, and we use Colab to train the model on the VLSP dataset. Since Colab is limited, we use your pre-trained model for fine-tuning on our dataset. Our dataset is around 60-70 hours, using the 100h config. The process seemed fine at the beginning, but once the train loss reached 400, the valid loss suddenly increased significantly, and so did the WER; only the UER seems to decrease. Also, the train loss decreases very slowly, about 1 point per epoch. Can you help us with this problem?

Thank you very much.

Fine-tuning checkpoints

Hello,

My fine-tuning goes all the way up to epoch 49 (and counting), but no checkpoint is saved. Is this normal? It seems very different from the pre-training, where a checkpoint is saved after each epoch.

Many thanks.

Kind Regards,

WER on testing phase

Hi Mr @mailong25, I'm wondering how you calculate WER on the test audio files? If you do, where can I find it in your code?
Thank you

malloc(): mismatching next->prev_size (unsorted)

Hello, thank you for this. Everything works fine for me until the last step, when trying to transcribe a file.
I get the message: malloc(): mismatching next->prev_size (unsorted), aborted.
I also sometimes get a segmentation fault, but without any traceback.
Do you know what the cause of this might be?
Thanks :-)

Pretrain with Japanese audio?

Hi @mailong25,
Thanks for your repo, it's very nice. I want to pre-train a model (base-large) with my own Japanese audio, but whenever I do, I get the error: "dataset is empty". How can I resolve this, and how can I pretrain a model on another language (not English)? Thanks for your help!

Unable to load multiple models

Hi, I want to run multiple models, but after running one instance I am unable to run another instance.
I am getting this error:
File "/media/administrator/hdd/self-supervised-speech-recognition/stt.py", line 187, in optimize_models
model.cuda()
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/administrator/wav2vec_stt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Please help me out.
GPU memory consumption for one instance is 2246 MB out of the 8 GB available.
Thanks,
@mailong25

sh: 1: fairseq-hydra-train: not found

When I run pretrain.py or finetune.py I get the following error.

python pretrain.py --fairseq_path libs/fairseq --audio_path wav --init_model wav2vec_small.pt
fairseq-hydra-train task.data=/home/bcode/speech-recognition/temp distributed_training.distributed_world_size=1 +optimization.update_freq='[64]' checkpoint.finetune_from_model=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=1200000 --config-dir config/pretraining --config-name wav2vec2_base_librispeech
sh: 1: fairseq-hydra-train: not found

python finetune.py --transcript_file transcript.txt --pretrain_model wav2vec_small.pt --dict_file dict.ltr.txt

100%|████████████████████████████████████| 1831/1831 [00:00<00:00, 15427.24it/s]
fairseq-hydra-train task.data=/home/bcode/speech-recognition/manifest distributed_training.distributed_world_size=1 +optimization.update_freq='[24]' model.w2v_path=/home/bcode/speech-recognition/wav2vec_small.pt dataset.num_workers=8 dataset.max_tokens=2800000 --config-dir config/finetuning --config-name base_1h
sh: 1: fairseq-hydra-train: not found

ValueError in making prediction

Hi, your work is really awesome. I currently get a ValueError when running prediction with your public pre-trained models.
While I was trying to debug the error via facebookresearch/fairseq#2106, the problem still occurred.
Please tell me how to fix it. My computer's OS is Windows 10, by the way. Many thanks

Wav2letter installation error

Hi there
I am following the instructions in 'Install instruction' section but when I run this

!pip install -e .

I get the following error

file:///content/self-supervised-speech-recognition/libs/wav2letter/bindings/python Installing collected packages: wav2letter Running setup.py develop for wav2letter ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"'; __file__='"'"'/content/self-supervised-speech-recognition/libs/wav2letter/bindings/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output

Any idea how I can fix this issue?
I also tried removing the build folder in the wav2letter repo, but it still doesn't work.

Dataset

What Vietnamese dataset did you use?

[Question] How to improve ASR for production usage

Describe your question
I am seeking guidelines from the community for improving models for production:

  • different accents, for example, US/UK/AU English
  • filler words, for example, um, ah
  • domain words, for example, company names, finance terms

Can I finetune on another language?

I have little data, and I want to ask whether it is possible to fine-tune your wav2vec pretrained model directly on another language, such as Chinese. Thanks.

Strange characters prediction

Hi mailong25. While training the model on the vivos dataset, I noticed that as the model converges, the transcriptions it predicts on the random files I test increasingly contain strange characters interleaved between the words (I think it may be predicting background noise).

Have you ever run into this case? If so, please suggest a way to fix this problem. Thank you.

How to create the lexicon.txt

Hi @mailong25, I am sorry if this is a basic question. I would like some pointers on how you generated the lexicon.txt. I have an ASR dataset (audio + transcriptions): https://github.com/csikasote/bemba-language-corpus. My goal is to develop an ASR system using wav2vec and wav2letter, possibly adapting your code to my dataset. Did you generate the lexicon.txt from the transcriptions? I would appreciate your help.

How to do lexicon-free decoding

Hi,
To do lexicon-free decoding I set args.lexicon to False in w2l_decoder.py, but I got an empty string. Please explain how to do lexicon-free decoding. In w2l_decoder.py I have seen the line "lexicon free decoding can only be done with a unit language model". If that is the case, how can I create a unit language model?

Right now I am having the problem below with the current lexicon LM setup.
For example, in lexicon.txt I have two words, tamil and ama. If I send audio of the spoken word tamil to the ASR, it predicts tamah (no LM). After language-model processing it always comes out as ama.
Another example: now in lexicon.txt I have the words tamil and am. If the ASR predicts tamah, the language model outputs the string am. I just want to solve this problem. I tried many different values of the lm_weight and word_score parameters, but with no success. I know that lm_weight and word_score will not influence these scenarios much.
Thanks,
Please, anyone, help me out.
@SenriYoshikawa
@mailong25

Response empty value?

Hi mailong25,
I built the source on Google Colab and everything runs fine, but when I test with my data, the response returned is ['']. I cannot find the reason. Is it because of the model I trained, or is there some other cause? Hope you can clarify. Great lib!

Import stt is not working!

Hi @mailong25 @SenriYoshikawa ,

I am able to train the acoustic model and KenLM, but when I try to import stt for transcribing, I get the following error in the self-supervised-speech-recognition directory.

Error


ImportError Traceback (most recent call last)
in ()
1 get_ipython().magic('cd /content/self-supervised-speech-recognition')
----> 2 import stt

/content/self-supervised-speech-recognition/stt.py in ()
7 import numpy as np
8 import torch
----> 9 from fairseq import checkpoint_utils, options, progress_bar, tasks, utils
10 from fairseq.data.data_utils import post_process
11 from fairseq.logging.meters import StopwatchMeter, TimeMeter

ImportError: cannot import name 'checkpoint_utils' from 'fairseq' (unknown location)

I tried importing fairseq inside the directory /content/self-supervised-speech-recognition/libs/fairseq and it works fine there. Kindly help me with a workaround for this conflict between the directories. Regards

Error in Making prediction

Hi all

After following the Install instruction section and downloading your pre-trained models, I executed this code in Colab:

from stt import Transcriber
transcriber = Transcriber(pretrain_model = '/content/vietnamese_wav2vec/pretrain.pt', finetune_model = '/content/vietnamese_wav2vec/finetune.pt', 
                          dictionary = '/content/vietnamese_wav2vec/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = '/content/vietnamese_wav2vec/lexicon.txt', lm_model = '/content/vietnamese_wav2vec/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['/content/1000.wav'])
print(hypos)

But it gives me this error:

usage: ipykernel_launcher.py [-h] [--no-progress-bar]
                             [--log-interval LOG_INTERVAL]
                             [--log-format {json,none,simple,tqdm}]
                             [--tensorboard-logdir TENSORBOARD_LOGDIR]
                             [--wandb-project WANDB_PROJECT]
                             [--azureml-logging] [--seed SEED] [--cpu] [--tpu]
                             [--bf16] [--memory-efficient-bf16] [--fp16]
                             [--memory-efficient-fp16]
                             [--fp16-no-flatten-grads]
                             [--fp16-init-scale FP16_INIT_SCALE]
                             [--fp16-scale-window FP16_SCALE_WINDOW]
                             [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                             [--min-loss-scale MIN_LOSS_SCALE]
                             [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                             [--user-dir USER_DIR]
                             [--empty-cache-freq EMPTY_CACHE_FREQ]
                             [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                             [--model-parallel-size MODEL_PARALLEL_SIZE]
                             [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                             [--profile] [--reset-logging]
                             [--criterion {cross_entropy,label_smoothed_cross_entropy,composite_loss,sentence_ranking,label_smoothed_cross_entropy_with_alignment,legacy_masked_lm_loss,wav2vec,masked_lm,ctc,sentence_prediction,nat_loss,model,adaptive_loss,vocab_parallel_cross_entropy}]
                             [--tokenizer {space,moses,nltk}]
                             [--bpe {bert,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,sentencepiece,bytes,characters}]
                             [--optimizer {adamax,adadelta,nag,adagrad,adafactor,adam,sgd,composite,lamb}]
                             [--lr-scheduler {inverse_sqrt,triangular,cosine,fixed,tri_stage,reduce_lr_on_plateau,polynomial_decay,manual,pass_through}]
                             [--scoring {sacrebleu,bleu,chrf,wer}]
                             [--task TASK] [--num-workers NUM_WORKERS]
                             [--skip-invalid-size-inputs-valid-test]
                             [--max-tokens MAX_TOKENS]
                             [--batch-size BATCH_SIZE]
                             [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                             [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                             [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                             [--data-buffer-size DATA_BUFFER_SIZE]
                             [--train-subset TRAIN_SUBSET]
                             [--valid-subset VALID_SUBSET]
                             [--validate-interval VALIDATE_INTERVAL]
                             [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                             [--validate-after-updates VALIDATE_AFTER_UPDATES]
                             [--fixed-validation-seed FIXED_VALIDATION_SEED]
                             [--disable-validation]
                             [--max-tokens-valid MAX_TOKENS_VALID]
                             [--batch-size-valid BATCH_SIZE_VALID]
                             [--curriculum CURRICULUM]
                             [--gen-subset GEN_SUBSET]
                             [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                             [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                             [--distributed-rank DISTRIBUTED_RANK]
                             [--distributed-backend DISTRIBUTED_BACKEND]
                             [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                             [--distributed-port DISTRIBUTED_PORT]
                             [--device-id DEVICE_ID] [--distributed-no-spawn]
                             [--ddp-backend {c10d,no_c10d}]
                             [--bucket-cap-mb BUCKET_CAP_MB]
                             [--fix-batches-to-gpus]
                             [--find-unused-parameters] [--fast-stat-sync]
                             [--broadcast-buffers]
                             [--distributed-wrapper {DDP,SlowMo}]
                             [--slowmo-momentum SLOWMO_MOMENTUM]
                             [--slowmo-algorithm SLOWMO_ALGORITHM]
                             [--localsgd-frequency LOCALSGD_FREQUENCY]
                             [--nprocs-per-node NPROCS_PER_NODE]
                             [--pipeline-model-parallel]
                             [--pipeline-balance PIPELINE_BALANCE]
                             [--pipeline-devices PIPELINE_DEVICES]
                             [--pipeline-chunks PIPELINE_CHUNKS]
                             [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                             [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                             [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                             [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                             [--pipeline-checkpoint {always,never,except_last}]
                             [--zero-sharding {none,os}] [--path PATH]
                             [--post-process [POST_PROCESS]] [--quiet]
                             [--model-overrides MODEL_OVERRIDES]
                             [--results-path RESULTS_PATH] [--beam BEAM]
                             [--nbest NBEST] [--max-len-a MAX_LEN_A]
                             [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                             [--match-source-len] [--unnormalized]
                             [--no-early-stop] [--no-beamable-mm]
                             [--lenpen LENPEN] [--unkpen UNKPEN]
                             [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                             [--score-reference] [--prefix-size PREFIX_SIZE]
                             [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                             [--sampling] [--sampling-topk SAMPLING_TOPK]
                             [--sampling-topp SAMPLING_TOPP]
                             [--constraints [{ordered,unordered}]]
                             [--temperature TEMPERATURE]
                             [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                             [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                             [--diversity-rate DIVERSITY_RATE]
                             [--print-alignment [{hard,soft}]] [--print-step]
                             [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                             [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                             [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                             [--iter-decode-force-max-iter]
                             [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                             [--iter-decode-with-external-reranker]
                             [--retain-iter-history] [--retain-dropout]
                             [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                             [--decoding-format {unigram,ensemble,vote,dp,bs}]
                             [--no-seed-provided] [--save-dir SAVE_DIR]
                             [--restore-file RESTORE_FILE]
                             [--finetune-from-model FINETUNE_FROM_MODEL]
                             [--reset-dataloader] [--reset-lr-scheduler]
                             [--reset-meters] [--reset-optimizer]
                             [--optimizer-overrides OPTIMIZER_OVERRIDES]
                             [--save-interval SAVE_INTERVAL]
                             [--save-interval-updates SAVE_INTERVAL_UPDATES]
                             [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                             [--keep-last-epochs KEEP_LAST_EPOCHS]
                             [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                             [--no-save] [--no-epoch-checkpoints]
                             [--no-last-checkpoints]
                             [--no-save-optimizer-state]
                             [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                             [--maximize-best-checkpoint-metric]
                             [--patience PATIENCE]
                             [--checkpoint-suffix CHECKPOINT_SUFFIX]
                             [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                             [--load-checkpoint-on-all-dp-ranks]
                             [--kspmodel KSPMODEL] [--wfstlm WFSTLM]
                             [--rnnt_decoding_type RNNT_DECODING_TYPE]
                             [--rnnt_len_penalty RNNT_LEN_PENALTY]
                             [--w2l-decoder W2L_DECODER] [--lexicon LEXICON]
                             [--unit-lm] [--kenlm-model KENLM_MODEL]
                             [--beam-threshold BEAM_THRESHOLD]
                             [--beam-size-token BEAM_SIZE_TOKEN]
                             [--word-score WORD_SCORE]
                             [--unk-weight UNK_WEIGHT]
                             [--sil-weight SIL_WEIGHT]
                             [--dump-emissions DUMP_EMISSIONS]
                             [--dump-features DUMP_FEATURES]
                             [--load-emissions LOAD_EMISSIONS] [-s SRC]
                             [-t TARGET] [--load-alignments]
                             [--left-pad-source BOOL] [--left-pad-target BOOL]
                             [--max-source-positions N]
                             [--max-target-positions N]
                             [--upsample-primary UPSAMPLE_PRIMARY]
                             [--truncate-source] [--num-batch-buckets N]
                             [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
                             [--eval-bleu-detok-args JSON]
                             [--eval-tokenized-bleu]
                             [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
                             [--eval-bleu-args JSON]
                             [--eval-bleu-print-samples]
                             [--force-anneal FORCE_ANNEAL]
                             [--lr-shrink LR_SHRINK]
                             [--warmup-updates WARMUP_UPDATES] [--pad PAD]
                             [--eos EOS] [--unk UNK]
                             data
ipykernel_launcher.py: error: unrecognized arguments: -f /mnt/disks2/data /mnt/disks2/data

An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


and %tb gives this:


---------------------------------------------------------------------------

SystemExit                                Traceback (most recent call last)

<ipython-input-46-24d37fe9d36e> in <module>()
      4                           lm_type = 'kenlm',
      5                           lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
----> 6                           lm_weight = 1.5, word_score = -1, beam_size = 50)
      7 hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
      8 print(hypos)

4 frames

/usr/lib/python3.7/argparse.py in exit(self, status, message)
   2502         if message:
   2503             self._print_message(message, _sys.stderr)
-> 2504         _sys.exit(status)
   2505 
   2506     def error(self, message):

SystemExit: 2



Why is this happening?
Could you please help me with that?

finetune.py optimization.update_freq

I was wondering why in the finetune.py file you've set update_freq to 24/NUM_GPU.

    cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")

In the wav2vec README https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md they say that the base model was trained using 64 V100 GPUs, and as I understand it, if we want to do more training on the base model we should simulate the number of GPUs they used.

Note: you can simulate 64 GPUs by using k GPUs and adding command line parameters (before --config-dir) distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 64/k

Have you found that setting update_freq to 24/NUM_GPU is better for training, or is it a bug?
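
For reference, the rule quoted from the fairseq README works out as follows (a small sanity check, using only the numbers quoted above):

# simulating 64 GPUs with k physical GPUs: update_freq = 64 / k
for k in (1, 2, 4, 8):
    print(k, 'GPUs -> update_freq =', 64 // k)
# with 2 GPUs this gives 32, while finetune.py's 24/NUM_GPU gives 12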

Pretraining larger models?

"Please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can I pretrain the different versions of wav2vec with the same code?

Is it OK to do SpecAugment and also add noise to the audio, then train on that?

Hi @mailong25, thanks for your great work, especially combining the language model into this ASR system. I have one doubt: will adding noise to the audio affect performance? I searched for this kind of augmentation in your GitHub and in the fairseq wav2vec GitHub, but I was unable to find any reference to noise augmentation or any other augmentation. Why is that?

I have read that wav2vec, put simply, works like a clustering scheme using a contrastive loss. So I feel that even if we add noise to the original audio, it is still going to cluster the sampled part of the audio to the nearby neighbor embedding.

Thanks

It says Saved but I checked the folder and it is empty.

Hi, It says Saved but I checked the folder and it is empty.

2021-01-29 08:29:52 | INFO | fairseq.trainer | Preparing to save checkpoint to checkpoints/checkpoint_last.pt after 21 updates
2021-01-29 08:29:56 | INFO | fairseq.trainer | Finished saving checkpoint to checkpoints/checkpoint_last.pt
2021-01-29 08:29:56 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/checkpoint_last.pt (epoch 33 @ 21 updates, score 3.125) (writing took 4.057792565999989 seconds)
2021-01-29 08:29:56 | INFO | fairseq_cli.train | end of epoch 33 (average epoch stats below)

Segmentation fault in inference

Hi mailong,
I followed your install instructions and am now stuck at stage 4 (Make prediction on multiple audios programmatically).
It only shows Segmentation fault (core dumped).
Does that mean my wav2letter installation is broken?

Hard coding parameters and avoiding argparse

Hey

I was wondering if you have any suggestions on how to get around argparse for wav2vec. Argparse is great for training, but once you have a fixed set of parameters, argparse must be avoided since it interferes with other things once the model is part of an application.

I know wav2vec itself uses argparse everywhere, but I was wondering if you might have thought about this problem and a workaround. Thanks.

Issue while fine-tuning the wav2vec model

When I run the following script after pretraining the xlsr model with my own dataset, I get the error below:

FileNotFoundError: [Errno 2] No such file or directory: '/home/edo/self-supervised-speech-recognition/manifest/dev_other.tsv'

I generated the dictionary file as well as the labeled data according to the tutorial on this page, with the following command:
python3 gen_dict.py --transcript_file /home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --save_dir dictionary

Then I run this command:
python3 finetune.py --transcript_file home/edo/self-supervised-speech-recognition/examples/label_audio/asr/transcript.txt --pretrain_model ../models/07-46-31/checkpoints/checkpoint_best.pt --dict_file dictionary/dict.ltr.txt
and get the error above.

Environment

OS: Ubuntu 18.04 LTS
Cuda Version: 11.2

Additional Context

However, when I change the model used in the fine-tuning process to wav2vec_small.pt instead of models/07-46-31/checkpoints/checkpoint_best.pt, it works, and I don't know why.

Am I missing something? I'm pretty sure I've followed the instructions. Or is it a bug in fairseq itself?

Tuning lm_weight, word_score and beam_size

Hey

How do you recommend we tune the parameters in the transcribe function: lm_weight, word_score, and beam_size? Normally with things like DeepSpeech 2 we would use its logits to tune these, but what about wav2vec?

Thanks
