
av_hubert's Introduction

AV-HuBERT (Audio-Visual Hidden Unit BERT)

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Robust Self-Supervised Audio-Visual Speech Recognition


Introduction

AV-HuBERT is a self-supervised representation learning framework for audio-visual speech. It achieves state-of-the-art results in lip reading, ASR and audio-visual speech recognition on the LRS3 audio-visual speech benchmark.

If you find AV-HuBERT useful in your research, please use the following BibTeX entry for citation.

@article{shi2022avhubert,
    author  = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
    title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
    journal = {arXiv preprint arXiv:2201.02184},
    year = {2022}
}

@article{shi2022avsr,
    author  = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
    title = {Robust Self-Supervised Audio-Visual Speech Recognition},
    journal = {arXiv preprint arXiv:2201.01763},
    year = {2022}
}

License

AV-HuBERT LICENSE AGREEMENT

This License Agreement (as may be amended in accordance with this License Agreement, “License”), between you (“Licensee” or “you”) and Meta Platforms, Inc. (“Meta” or “we”) applies to your use of any computer program, algorithm, source code, object code, or software that is made available by Meta under this License (“Software”) and any specifications, manuals, documentation, and other written information provided by Meta related to the Software (“Documentation”).

By using the Software, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the Software or Documentation (collectively, the “Software Products”), and you must immediately cease using the Software Products.

Pre-trained and fine-tuned models

Please find the checkpoints here.

Demo

Run our lip-reading demo using Colab.

Installation

First, create a conda virtual environment and activate it:

conda create -n avhubert python=3.8 -y
conda activate avhubert

Then, clone this directory:

git clone https://github.com/facebookresearch/av_hubert.git
cd avhubert
git submodule init
git submodule update

Lastly, install Fairseq and the other packages:

pip install -r requirements.txt
cd fairseq
pip install --editable ./

Load a pretrained model

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]
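Once loaded, the model is typically switched to evaluation mode before being used for feature extraction or inference. A minimal continuation of the session above, assuming the task config exposes the same image_crop_size/image_mean/image_std fields that the Colab feature-extraction snippet uses:

>>> model.eval()
>>> task.cfg.image_crop_size, task.cfg.image_mean, task.cfg.image_std  # preprocessing settings stored with the task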

Train a new model

Data preparation

Follow the steps in preparation to pre-process:

  • LRS3 and VoxCeleb2 datasets

Follow the steps in clustering (pre-train only) to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 100Hz for MFCC features and 25Hz for AV-HuBERT features by default.
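As a quick sanity check, the number of pseudo labels per utterance should roughly equal its duration multiplied by the label rate. A minimal sketch, assuming the standard HuBERT-style .km format (one utterance per line of space-separated integer cluster indices) and a hypothetical label path:

label_rate = 100  # Hz for MFCC-based clusters; use 25 for AV-HuBERT feature clusters

with open("/path/to/labels/valid.km") as f:
    for i, line in enumerate(f):
        n_labels = len(line.split())
        print(f"utterance {i}: {n_labels} labels, approx. {n_labels / label_rate:.2f} s")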

Pre-train an AV-HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name, and the label rate is 100Hz.

To train a model, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

Finetune an AV-HuBERT model with Seq2Seq

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.wrd are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name.

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

Decode an AV-HuBERT model

Suppose the test.tsv and test.wrd are the video list and transcripts of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint.

Seq2Seq decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/decode/s2s/test.

$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`

The command above uses the default decoding hyperparameters, which can be found in conf/s2s_decode.yaml. override.modalities can be set to ['video'] (for lip reading), ['audio'] (for ASR), or ['audio','video'] (for audio-visual speech recognition). These parameters can be configured from the command line. For example, to search with a beam size of 20, append generation.beam=20 to the command above. Important parameters include:

  • generation.beam
  • generation.lenpen

Different test set

If your test data are stored in a different directory from the training data, append the following to the above command:

+override.data=/path/to/test +override.label_dir=/path/to/test

where /path/to/test contains test.{tsv,wrd}. This is useful when you want to test with the fine-tuned checkpoints we provide.

Test under noisy environment

If you want to test your model in a noisy environment, append the following to the above command:

+override.noise_wav=/path/to/noise override.noise_prob=1 override.noise_snr={snr}

{snr} is the signal-to-noise ratio (SNR) and /path/to/noise is a folder containing noise manifest files (/path/to/noise/{valid,test}.tsv). See preparation for setting up this folder.
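For intuition about how the SNR value is used, the sketch below shows one common way of mixing a noise waveform into clean speech at a target SNR. It is only an illustration of the general technique, not the repository's implementation:

import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or crop the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise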

av_hubert's People

Contributors

bw-shi, chevaliernoir


av_hubert's Issues

How to download data from https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html link in Linux terminal?

Thanks for your great work. When I tried to download the required data, it gave me this:
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2022-03-23 00:24:26-- (try: 3) http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2022-03-23 00:24:39-- (try: 4) http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... ^Z
[9]+ Stopped wget http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa username=lrs3104 password=qD7TXt3x
(avhubert) root@047a8a154979:/workspace/av_hubert/avhubert/preparation/data/lrs3/pretrain# wget http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa username==lrs3104 password==qD7TXt3x
--2022-03-23 00:24:48-- http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa
Resolving mm.kaist.ac.kr (mm.kaist.ac.kr)... 143.248.231.15
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2022-03-23 00:24:59-- (try: 2) http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2022-03-23 00:25:11-- (try: 3) http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa
Connecting to mm.kaist.ac.kr (mm.kaist.ac.kr)|143.248.231.15|:80... connected.
HTTP request sent, awaiting response... ^Z
[10]+ Stopped wget http://mm.kaist.ac.kr/lip_reading/data3/lrs3_pretrain_partaa username==lrs3104 password==qD7TXt3x
(avhubert) root@047a8a154979:/workspace/av_hubert/avhubert/preparation/data/lrs3/pretrain#
Can you give me some tips for downloading the data?
It never asks for the username and password. Is there a solution?

pip install --editable ./

I was doing the installation step, and when I ran "pip install --editable ./" I got this error:

Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
  File "C:\Users\floro\.conda\envs\avhubert\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
    main()
  File "C:\Users\floro\.conda\envs\avhubert\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "C:\Users\floro\.conda\envs\avhubert\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 130, in get_requires_for_build_wheel
    return hook(config_settings)
  File "C:\Users\floro\AppData\Local\Temp\pip-build-env-k0mueac_\overlay\Lib\site-packages\setuptools\build_meta.py", line 177, in get_requires_for_build_wheel
    return self._get_build_requires(
  File "C:\Users\floro\AppData\Local\Temp\pip-build-env-k0mueac_\overlay\Lib\site-packages\setuptools\build_meta.py", line 159, in _get_build_requires
    self.run_setup()
  File "C:\Users\floro\AppData\Local\Temp\pip-build-env-k0mueac_\overlay\Lib\site-packages\setuptools\build_meta.py", line 174, in run_setup
    exec(compile(code, __file__, 'exec'), locals())
  File "setup.py", line 261, in <module>
    os.symlink(os.path.join("..", "examples"), fairseq_examples)
OSError: [WinError 1314] A required privilege is not held by the client: '..\examples' -> 'fairseq\examples'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

How to extract audio-visual features?

Hi, thank you for the work and the colab.
In the colab, the following code snippet shows how to extract visual features.

def extract_visual_feature(video_path, ckpt_path, user_dir, is_finetune_ckpt=False):
  utils.import_user_module(Namespace(user_dir=user_dir))
  models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  transform = avhubert_utils.Compose([
      avhubert_utils.Normalize(0.0, 255.0),
      avhubert_utils.CenterCrop((task.cfg.image_crop_size, task.cfg.image_crop_size)),
      avhubert_utils.Normalize(task.cfg.image_mean, task.cfg.image_std)])
  frames = avhubert_utils.load_video(video_path)
  print(f"Load video {video_path}: shape {frames.shape}")
  frames = transform(frames)
  print(f"Center crop video to: {frames.shape}")
  frames = torch.FloatTensor(frames).unsqueeze(dim=0).unsqueeze(dim=0).cuda()
  model = models[0]
  if hasattr(models[0], 'decoder'):
    print(f"Checkpoint: fine-tuned")
    model = models[0].encoder.w2v_model
  else:
    print(f"Checkpoint: pre-trained w/o fine-tuning")
  model.cuda()
  model.eval()
  with torch.no_grad():
    # Specify output_layer if you want to extract feature of an intermediate layer
    feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)
    feature = feature.squeeze(dim=0)
  print(f"Video feature shape: {feature.shape}")
  return feature
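For reference, a hypothetical call to the function above might look like the following; the paths are placeholders, the video is assumed to be a preprocessed mouth-ROI clip, and a CUDA device is required since the snippet moves the model and frames to the GPU:

feature = extract_visual_feature(
    video_path="/path/to/mouth_roi.mp4",       # hypothetical preprocessed mouth-ROI video
    ckpt_path="/path/to/checkpoint.pt",        # pre-trained or fine-tuned AV-HuBERT checkpoint
    user_dir="/path/to/av_hubert/avhubert",    # directory containing the avhubert user module
)
print(feature.shape)  # (num_frames, feature_dim)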

I wonder how I can extract audio-visual features.
Can you please give an example, or specifically, what should be fed into source['audio']? Is it a normalized [-1, 1] waveform, or other spectral features?
Thank you.

decode issue

Hi, when I run the decode command
"
python -B infer_s2s.py --config-dir ./conf/av-finetune --config-name base_noise_pt_noise_ft_433h.yaml dataset.gen_subset=test common_eval.path=/home/dhjj/checkpoints/base_noise_pt_noise_ft_433h.pt common_eval.results_path=/home/dhjj/results/test override.modalities=['audio','video'] common.user_dir=`pwd` override.data=/home/dhjj/lrs3/433h_data/ override.label_dir=/home/dhjj/lrs3/433h_data/
"
I get the following error:
"
omegaconf.errors.ConfigKeyError: Key 'criterion' not in 'InferConfig'
full_key: criterion
reference_type=Optional[Dict[Union[str, Enum], Any]]
object_type=InferConfig
"

But when using the default config
"
python -B infer_s2s.py --config-dir ./conf --config-name s2s_decode.yaml dataset.gen_subset=test common_eval.path=/home/dhjj/checkpoints/base_noise_pt_noise_ft_433h.pt common_eval.results_path=/home/dhjj/results/test override.modalities=['audio','video'] common.user_dir=`pwd` override.data=/home/dhjj/lrs3/433h_data/ override.label_dir=/home/dhjj/lrs3/433h_data/
"
I can get the decoding result.

CTC decoding script

Dear authors,

thanks a lot for this great work! After reading your paper I would like to ask two questions about the CTC decoding script.
In Section B.4 of your paper you also conducted experiments with CTC finetuning. In the current repo I didn't find the code for CTC finetuning; would it be possible to share your CTC finetuning script?
The second question is: have you also done experiments with cross-entropy loss + CTC loss? As it is widely used in acoustic speech recognition, it is also the default setting in the ESPNet toolkit.

Best regards,
Zhengyang Li

nshard and rank argument

First, thank you for your great work! One question w.r.t. the data preparation.
After reading the README of the LRS3 preparation, I still don't understand what the nshard and rank arguments are and what they are used for. What should their values be if I simply want to reproduce the results in the paper?

Different result from the paper when finetuning ASR

Hi, sorry to disturb you again.
I used the provided pre-trained model base_lrs3_iter5.pt, the 30h_data data split, and the base_lrs3_30h.yaml config to finetune for ASR, all following the README, and finally decoded with s2s_decode.yaml, only changing override.modalities=['audio']. However, I got a WER of 9.28, which is much worse than the 5.4 reported in the paper.
Could you please suggest what is probably going wrong? Thank you.
P.S. I used 4 GPUs instead of 8, so I changed update_freq to [2]; other configs are untouched.

WER: 9.282103134479271
err / num_ref_words = 918 / 9890

_name: null
beam: 50
nbest: 1
max_len_a: 1.0
max_len_b: 0
min_len: 1
match_source_len: false
unnormalized: false
no_early_stop: false
no_beamable_mm: false
lenpen: 1.0
unkpen: 0.0
replace_unk: null
sacrebleu: false
score_reference: false
prefix_size: 0
no_repeat_ngram_size: 0
sampling: false
sampling_topk: -1
sampling_topp: -1.0
constraints: null
temperature: 1.0
diverse_beam_groups: -1
diverse_beam_strength: 0.5
diversity_rate: -1.0
print_alignment: null
print_step: false
lm_path: null
lm_weight: 0.0
iter_decode_eos_penalty: 0.0
iter_decode_max_iter: 10
iter_decode_force_max_iter: false
iter_decode_with_beam: 1
iter_decode_with_external_reranker: false
retain_iter_history: false
retain_dropout: false
retain_dropout_modules: null
decoding_format: null
no_seed_provided: false

OOM when finetuning using multi-GPUs

What is your question?

Dear authors, thanks a lot for this great work! I'm getting OOM while finetuning avhubert on my own dataset using multiple GPUs, and this error usually happens on a non-initial epoch:
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml hydra.run.dir=../saved_model/20220311_1 common.user_dir=`pwd` distributed_training.ddp_backend=c10d distributed_training.distributed_world_size=4 distributed_training.nprocs_per_node=4

The OOM happens randomly on one GPU:

2022-03-18 21:04:26 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 2; 22.38 GiB total capacity; 21.16 GiB already allocated; 19.94 MiB free; 21.54 GiB reserved in total by PyTorch)
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 2                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 8         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| Active memory         |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   22062 MB |   22078 MB |   61016 MB |   38954 MB |
|       from large pool |   22040 MB |   22060 MB |   60488 MB |   38448 MB |
|       from small pool |      22 MB |     176 MB |     528 MB |     506 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  411642 KB |    7842 MB |  189965 GB |  189965 GB |
|       from large pool |  409546 KB |    7828 MB |  188976 GB |  188976 GB |
|       from small pool |    2096 KB |      14 MB |     989 GB |     989 GB |
|---------------------------------------------------------------------------|
| Allocations           |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     173    |     244    |     572    |     399    |
|       from large pool |     162    |     163    |     308    |     146    |
|       from small pool |      11    |      88    |     264    |     253    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     144    |     214    |    9561 K  |    9560 K  |
|       from large pool |     135    |     161    |    2333 K  |    2333 K  |
|       from small pool |       9    |      58    |    7227 K  |    7227 K  |
|===========================================================================|

2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 3                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

I have tried using no_c10d and pytorch_ddp as the ddp_backend, tried downgrading PyTorch to 1.9.1 or 1.8.0 according to this issue, and also checked my dataset (using max_tokens instead of batch_size to avoid long sentences), but none of these worked for me.

What's your environment?

  • fairseq Version : 1.0.0a0
  • PyTorch Version (e.g., 1.0) : 1.10.0
  • OS : Ubuntu 20.04.2 LTS
  • How you installed fairseq (pip, source): pip
  • Python version: 3.8.12
  • CUDA version: 10.1
  • GPU models and configuration: NVIDIA Tesla P40 / 22919MiB *4
  • Any other relevant information: NVIDIA Driver Version: 470.94

Thanks in advance for your comment!

All the best,
An Hsu

fairseq-hydra-train with single-node multiple-gpu training

Hi, there,

When I started training on my machine with eight GPUs, using the command provided in the README as follows:

fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

it did start, but it only used device 0, with the following message:

[2022-02-15 19:25:02,218][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2022-02-15 19:25:02,218][fairseq.utils][INFO] - rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB
[2022-02-15 19:25:02,218][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2022-02-15 19:25:02,219][fairseq_cli.train][INFO] - training on 1 devices (GPUs/TPUs)
[2022-02-15 19:25:02,219][fairseq_cli.train][INFO] - max tokens per device = 1000 and max sentences per device = None
[2022-02-15 19:25:02,220][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2022-02-15 19:25:02,221][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
[2022-02-15 19:25:02,221][fairseq.trainer][INFO] - loading train data for epoch 1

Even if I add CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 before the command, the remaining GPUs stay idle.

Any suggestions to solve this?

Thanks in advance.

Cannot make inferences from videos over 30 seconds with Colab example

Thank you for the great work! I have enjoyed experimenting with the visual lipreading example.

Using the "Inference with the model" code example in Colab, I am able to get text predictions fine with my own preprocessed ROI videos that are under 500 total frames at 30 fps; anything longer than that and I receive an error (screenshot of the error omitted).

It seems to happen with any video over roughly 500 frames or over 30 seconds long. Is there a way to configure this example to run inference on longer videos?

Cheers,

Zepp

About the video feature extractor ResNet

Thanks for sharing this great piece of research work!

A question about the paper: regarding the video feature extractor (the modified ResNet-18), is it trained together with the transformer during pre-training, or are its weights fixed (frozen) during pre-training? Thanks!

[Easy Question] Sequence length

Hi,
I was wondering what the largest sequence the model can handle is, and whether the fairseq pipeline can do automatic batching when a sequence is longer than the model's maximum sequence length. Namely, can I directly drop in, for example, a 3-minute video, or should I chunk it myself?

Using an untrained model, equivalent to the pre-trained model

TLDR: How do I load an untrained AV-HuBERT model?

Hi,

Congratulations on this very successful project and thanks for sharing the code. I experimented now with loading the pre-trained model, as explained in the README, and it works well. I would like to load an equivalent, untrained model, to check the difference. There are instructions on running commands to train a model from scratch, but I simply want to load the model itself, as everything else will be done in my code. So, is there an equivalent to this line

models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])

to load this same model configuration, but untrained (so without loading the checkpoint)? Apologies in advance for my unfamiliarity with your codebase and fairseq as a whole.

Thanks,
Rodrigo

Pretrain data processing problems

Hi, authors
When I was processing the pretrain data, I encountered some .mp4 and .txt files that were empty. How do you deal with these files?
Also, during the data preparation stage, ${nshard} is used to shard all the videos, but only the ${rank}-th shard is processed later; I'm confused about the purpose of the sharding.

Convert AV-HuBERT Model into ONNX Format

Hi, thanks for your great work! I'm trying to export AV-HuBERT model (.pt format) to .onnx format like this:

models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
export_model = models[0].eval()
dummy_input = {'source': {'audio': torch.zeros(1, 104, 500),
                          'video': torch.zeros(1, 1, 500, 112, 112)},
                          'padding_mask': torch.full((1, 500), False)}
torch.onnx.export(export_model, 
                  dummy_input, 
                  './test.onnx',
                  opset_version=10,
                  do_constant_folding=True,
                  input_names=['input'],
                  output_names=['output'])

But received this error:

/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py:363: UserWarning: Skipping _decide_input_format
 -1
  warnings.warn("Skipping _decide_input_format\n {}".format(e.args[0]))
Traceback (most recent call last):
  File "/home/offset/Downloads/pycharm-community-2021.3.1/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/__init__.py", line 316, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py", line 107, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py", line 724, in _export
    _model_to_graph(model, args, verbose, input_names,
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py", line 493, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py", line 437, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/onnx/utils.py", line 388, in _trace_and_get_graph_from_model
    torch.jit._get_trace_graph(model, args, strict=False, _force_outplace=False, _return_inputs_states=True)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/jit/_trace.py", line 1166, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/jit/_trace.py", line 127, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/jit/_trace.py", line 118, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/offset/conda/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1090, in _slow_forward
    result = self.forward(*input, **kwargs)
TypeError: forward() takes 1 positional argument but 2 were given

Is it right to use torch.onnx.export() to export the model to ONNX format? I have noticed prepare_for_onnx_export_() in the fairseq source code, but I have no idea how to use it.

Thanks in advance for your comment!

All the best,
An Hsu

Pseudo-labels of self-training

Dear authors,

After reading your paper and looking into your public code, I would like to ask a question about the pseudo-labels of your self-training experiments. In Table 1 of your paper, the best result (26.9% WER) is achieved by AV-HuBERT + Self-Training. My understanding is that the pseudo-labels are those of VoxCeleb2, not of LRS3; is that correct? Then VoxCeleb2 with pseudo-labels and LRS3 with ground-truth labels are used to finetune AV-HuBERT to reach 26.9% WER.

Thanks to your nicely organized code we can successfully run the finetuning experiments without self-training and reach a similar result. In order to reproduce the experiment with self-training, would it be possible to share your pseudo-labels?

Thanks a lot in advance!

Best regards,
Zhengyang

Decode an AV-HuBERT model

Hello, when I run the command

python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`

I get the following error:
No such file or directory: '/checkpoint/bshi/data/lrs3//exp/ls-hubert/tune-modality/finetune_bpe/unigram1000//test.wrd'
Because my computer is not powerful enough, I skipped the earlier data preprocessing steps. Could you provide me with the files test.tsv, test.wrd, train.tsv, valid.tsv, train.wrd and valid.wrd?
Thank you so much!

Error while training a new model

Hi,
After preparing the dataset to pretrain a new model, I am using the given command:

  fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd` 

The /path/to/label directory contains {train,valid}.km along with {train,valid}.npy, {train,valid}.len, dict.km.txt and dict.mfcc.txt.
But while running the command I get an error saying

FileNotFoundError: [Errno 2] No such file or directory: '/path/to/label/valid.mfcc'

Am I missing a step to generate valid.mfcc, or is it the same as valid.npy, which contains the MFCC values?

PS: I am using the conf/pretrain/base_lrs3_iter1.yaml configuration file.

Finetuning parameter mismatch between paper and configs

Hi, thanks for providing such extensive code and models for avhubert, setting up the finetuning worked like a charm! 🙏

However, I have a few questions about some of the hyperparameters in the provided configs, as I am in the process of reproducing a lipreading baseline for the VOX pretrained BASE S2S transformer. More specifically, I am using the base_vox_30h.yaml config file:

In the following sections I found some parameters that do not match the values from the paper.

1. Issue

distributed_training:
ddp_backend: c10d
find_unused_parameters: true
distributed_world_size: 8
distributed_port: 29671
nprocs_per_node: 8

Here the distributed_world_size is set to 8, meaning that finetuning is done using 8 GPUs. In the paper it is said that the BASE setup is trained on 32 GPUs. Is this only true for pretraining, and can I assume that finetuning was done on 8 GPUs? Does update_freq always stay at [1]?

2. Issue
So in the paper in Section A.4 we find this paragraph on finetuning the S2S model:

In S2S, the pre-trained model (i.e., encoder) is frozen for the first N% updates . N is 100 and 50 for 30h and 433h labeled setting
respectively. The entire model is trained for 18K/45K steps in the 30h/433h setting. Both models are trained with Adam, with the learning rate being warmed up for the first P % of updates to a peak of 0.001 and linearly decayed. P is tuned among {10, 30, 50}. All hyperparamters are tuned on the validation set.

For the 30h finetuning setup I would expect max_update: 18000, freeze_finetune_updates: 18000, and warmup_steps: [1800, 5400, 9000]. In the config we find the following values:

optimization:
max_update: 30000
lr: [0.001]
sentence_avg: true
update_freq: [1]

freeze_finetune_updates: 24000

lr_scheduler:
_name: tri_stage
warmup_steps: 10000

So, in order to reproduce your results as closely as possible, should I stick to the paper or to the provided config files?

Thanks a lot in advance!

sbatch error Invalid job array specification during "MUSAN data preparation"

Hi authors, thank you for sharing such nice work!

I have downloaded the MUSAN dataset and am now trying to do the data preparation. First I got the slurm_partition argument using the command scontrol show partition, then I directly ran the python command in the terminal and got this error:

(avhubert) [huyuchen@ntu-sce-headnode preparation]$ python musan_prepare.py --musan /home3/huyuchen/raw_data/musan --nshard 4 --slurm_partition defq
Split raw audio
sbatch: error: Batch job submission failed: Invalid job array specification
subprocess.CalledProcessError: Command '['sbatch', '/home3/huyuchen/pytorch_workplace/av_hubert/avhubert/preparation/tmphs_2jdk9/submission_file_147b1d1bf46b441aa7f60bca08a3f447.sh']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "musan_prepare.py", line 122, in <module>
    jobs = executor.map_array(split_musan, [args.musan for _ in ranks], ranks, [args.nshard for _ in ranks])
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/submitit/core/core.py", line 701, in map_array
    return self._internal_process_submissions(submissions)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
    first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/submitit/core/core.py", line 864, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/submitit/core/utils.py", line 350, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification

Alternatively, I tried to run it with slurm, using the command sbatch -o 1.log submit_musan.sh, where submit_musan.sh looks like:

#!/usr/bin/env bash
cmd="slurm.pl --quiet --nodelist=node03 --gpu 1 --num-threads 8"
source activate avhubert
$cmd log/1.log
python musan_prepare.py --musan /home3/huyuchen/raw_data/musan --nshard 4 --slurm_partition defq

But the same issue appears as above. Can you help me figure out what happened? Thank you!

`trim_video_frame` generates only empty folders

Hi, when I run step 2 in lrs3_prepare.py, the progress bar proceeds normally (no errors or warnings), albeit slowly, but the outputs are nowhere to be found. There are only empty folders generated under short-pretrain.

So far the progress bar looks like this 44%|████▍ | 52029/118516 [5:20:19<8:22:04, 2.21it/s] and what I got were 2778 empty folders.

Hope you can give me some help, thanks!

On starting the second iteration ...

Hi, there,

I've successfully finished the first iteration of pre-training AV-HuBERT BASE on LRS3 without MUSAN.

After clustering based on features extracted with the first-iteration checkpoint, training was launched with the following command:

fairseq-hydra-train --config-dir ./conf/pretrain --config-name base_lrs3_iter2.yaml task.data=/lrs3/30h_data task.label_dir=/lrs3/km model.label_rate=100 hydra.run.dir=/lrs3/experiments common.user_dir=`pwd`

then, an error emerged:

[2022-02-28 17:03:28,515][fairseq_cli.train][INFO] - task: AVHubertPretrainingTask
[2022-02-28 17:03:28,515][fairseq_cli.train][INFO] - model: AVHubertModel
[2022-02-28 17:03:28,515][fairseq_cli.train][INFO] - criterion: AVHubertCriterion
[2022-02-28 17:03:28,517][fairseq_cli.train][INFO] - num. shared model params: 102,844,288 (num. trained: 102,844,288)
[2022-02-28 17:03:28,519][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-hydra-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 76, in cli_main
    hydra_main()
  File "/usr/local/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/train.py", line 124, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=1)
  File "/home/Code/av_hubert/avhubert/hubert_pretraining.py", line 244, in load_dataset
    self.datasets[split] = AVHubertDataset(
  File "/home/Code/av_hubert/avhubert/hubert_dataset.py", line 178, in __init__
    self.audio_root, self.names, inds, tot, self.sizes = load_audio_visual(manifest_path, max_keep_sample_size, min_keep_sample_size, frame_rate=sample_rate, label_paths=label_paths, label_rates=self.label_rates)
  File "/home/Code/av_hubert/avhubert/hubert_dataset.py", line 73, in load_audio_visual
    f"max_keep={max_keep}, min_keep={min_keep}, "
ValueError: max() arg is an empty sequence

The dumped labels were saved at ${km_path} as explained here:

├── test_0_1.km
├── test.mfcc -> test_0_1.km
├── train_0_1.km
├── train.mfcc -> train_0_1.km
├── valid_0_1.km
└── valid.mfcc -> valid_0_1.km

Any suggestions would be appreciated.

LRS3 data

Hi, I am trying to run the preprocessing part. May I ask whether the LRS3 data you use is the original YouTube videos or the roughly cropped face data made by the VGG team (the one that requires a request form)? Or does it not matter which one is used? Thanks.

Audio/Video data augmentation

Hi Authors,

Thank you for sharing. Since you add noise to the input speech with 0.25 probability, I wonder if you have applied data augmentation to the input images (like the image flips the CV community usually uses)? If yes, how can I use it with avhubert?

Thank you~

How to properly overwrite the args/configs from command line

I want to override distributed_training.nprocs_per_node from the command line, I did something like

fairseq-hydra-train --config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml\
  task.data=${data_dir} task.label_dir=${data_dir} \
  task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
  model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
  hydra.run.dir=./exps/finetune/ common.user_dir=`pwd` \
  +override.distributed_training.distributed_world_size=${nnodes} \
  +override.distributed_training.distributed_rank=${node_rank}
  +override.distributed_training.nprocs_per_node=${nproc_per_node} \
  +override.distributed_training.distributed_init_method='env://' \
  +override.distributed_training.distributed_port=12356 \
  +override.optimization.update_freq=[${update_freq}]

and I got a +override.distributed_training.nprocs_per_node=4: command not found error.
Other values printed in the Namespace also seemed unchanged (distributed_port is still 29671, as specified in the config file).

How can I correctly override these args from the command line? Many thanks!

configuration files for the "no pre-training" setups

Hi authors,

Thank you for sharing such nice research work! Thanks to your previous help, I have successfully completed the LRS3 and MUSAN data preparation.

I am now interested in directly finetuning the AVSR system (without pretraining, because of computing resource limits), and hope to reproduce the following highlighted systems in the paper "Robust Self-Supervised Audio-Visual Speech Recognition":

[screenshot of the highlighted table from the paper omitted]

I wonder if you can share the config files for these four systems?

If that is inconvenient, can you give me some guidance on how to modify the existing config files in the conf/av-finetune/ directory? I am new to this field and don't know how to do this; I hope you can help me :)

Thank you very much!!

Questions about data loading

Hi, excellent work!

I'm trying to understand a couple of details regarding the data loading for the self-supervised pre-training stage. From my understanding (please correct me if I'm wrong), in the default setting each mini-batch contains sequences of variable length, potentially ranging from min_sample_size=5 (corresponding to 0.2 secs at 25 fps) to max_sample_size=500 (20 secs). Is any "uniform batching" performed such that each batch has sequences of approximately equal length and thus the number of padded zeros is minimised, to speed up training? I couldn't spot it in the code.

Also, in lrs3_prepare.py, it seems that utterances longer than 15 secs are trimmed, whereas in vox_prepare, no utterances are trimmed. Does this mean that when both LRS3 and VoxCeleb2 are used to pre-train the model, the longest possible LRS3 sequence is 15 secs whereas the longest possible VoxCeleb2 sequence is 20 secs (determined by max_sample_size)?

fairseq-hydra-train with multi-nodes distributed training

Hi, is there any instruction on multiple nodes multiple GPUs distributed training with hydra train?
https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training
The fairseq documentation seems to be out of date: hydra does not expect the local_rank argument passed by torch.distributed.launch.
I tried replacing torch.distributed.launch with torchrun, which solved the local_rank issue but still didn't make everything work.
Here is the command I tried; it fails with RuntimeError: Socket Timeout:

export MASTER_ADDR=${master_addr}
export MASTER_PORT=12356
export WORLD_SIZE=${world_size}
export RANK=${node_rank}
torchrun --nnodes ${nnodes} --nproc_per_node ${nproc_per_node} $(which fairseq-hydra-train) --config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
  task.data=${data_dir} task.label_dir=${data_dir} \
  task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
  model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
  hydra.run.dir=./exps/finetune/ common.user_dir=`pwd` \
  +override.distributed_training.distributed_world_size=${world_size} \
  +override.distributed_training.nprocs_per_node=${nproc_per_node} \
  +override.distributed_training.distributed_port=12356 \
  +override.optimization.update_freq=[${update_freq}] \

Problem in finetuning

Thanks for sharing this research work!
I tried to finetune the lrs3 checkpoint but I got this error:
File "~/av_hubert/avhubert/hubert_asr.py", line 474, in build_model
del state['model']['mask_emb']
KeyError: 'mask_emb'

and after commenting out this line, this error occurred:
File "~/av_hubert/avhubert/hubert_asr.py", line 386, in forward
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'extract_finetune'

Count number of frames per clip Problem

for rank in $(seq 0 $((nshard - 1)));do cat ${lrs3}/nframes.audio.${rank}; done > ${lrs3}/nframes.audio
for rank in $(seq 0 $((nshard - 1)));do cat ${lrs3}/nframes.video.${rank}; done > ${lrs3}/nframes.video

Hello, will these two lines of code actually work in the terminal?

Cython error during pre-training

Hi,
I have been trying to train the model using the following command,

fairseq-hydra-train --config-dir /home/jupyter/aaryan/av_hubert/avhubert/conf/pretrain --config-name base_lrs3_iter1.yaml \
task.data=/home/jupyter/aaryan/av_hubert/avhubert/lrs3/30h_data \
task.label_dir=/home/jupyter/aaryan/av_hubert/avhubert/features model.label_rate=100 \
hydra.run.dir=/home/jupyter/aaryan/av_hubert/avhubert/test_run common.user_dir=`pwd`

While running this command I am facing this error.

Traceback (most recent call last):
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/data/data_utils.py", line 312, in batch_by_size
    from fairseq.data.data_utils_fast import (
ImportError: /home/jupyter/aaryan/av_hubert/fairseq/fairseq/data/data_utils_fast.cpython-38-x86_64-linux-gnu.so: failed to map segment from shared object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq_cli/train.py", line 155, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 261, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/trainer.py", line 596, in get_train_iterator
    batch_iterator = self.task.get_batch_iterator(
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/tasks/fairseq_task.py", line 286, in get_batch_iterator
    batch_sampler = dataset.batch_by_size(
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
    return data_utils.batch_by_size(
  File "/home/jupyter/aaryan/av_hubert/fairseq/fairseq/data/data_utils.py", line 318, in batch_by_size
    raise ImportError(
ImportError: Please build Cython components with: `python setup.py build_ext --inplace`

The error suggests rebuilding the Cython components.
When I go to the fairseq directory and rebuild them with the python setup.py build_ext --inplace command given above, I get this:

running build_ext
/opt/conda/envs/avh/lib/python3.8/site-packages/torch/utils/cpp_extension.py:370: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-cpython-38/fairseq/libbleu.cpython-38-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-cpython-38/fairseq/data/data_utils_fast.cpython-38-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-cpython-38/fairseq/data/token_block_utils_fast.cpython-38-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-cpython-38/fairseq/libbase.cpython-38-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-cpython-38/fairseq/libnat.cpython-38-x86_64-linux-gnu.so -> fairseq

Even after rebuilding the Cython components, the training command still returns the same error.
Any idea what is going wrong and how to resolve it?
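
Not an official answer, but "failed to map segment from shared object" is usually an OS-level loading problem (for example a noexec or full temporary/overlay filesystem, or memory pressure) rather than a missing build, which would explain why the rebuild reports everything as up to date. A quick hedged check is to import the compiled extension outside of training and see whether the .so can be loaded at all:

# Quick check (not from the issue): load the compiled extension directly,
# outside of fairseq-hydra-train, to see whether the .so maps in this environment.
import fairseq.data.data_utils_fast as data_utils_fast

print("loaded extension from:", data_utils_fast.__file__)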

ValueError: need at least one array to stack

Hi author, I am fine-tuning AV-HuBERT from scratch (no pre-training, no pretrained model loaded) and I hit this bug randomly: sometimes training runs fine until it finishes, sometimes it fails at the 7th/8th epoch; here it failed in the 1st epoch:

[2022-03-05 11:27:22,366][fairseq.trainer][INFO] - begin training epoch 1
[2022-03-05 11:27:22,366][fairseq_cli.train][INFO] - Start iterating over samples
[2022-03-05 11:31:42,858][train_inner][INFO] - {"epoch": 1, "update": 0.126, "loss": "164.307", "nll_loss": "8.41", "total": "362.38", "n_correct": "23.52", "ppl": "340.12", "accuracy": "6.49", "wps": "407.1", "ups": "1.12", "wpb": "362.4", "bsz": "19.1", "num_updates": "200", "lr": "2.98e-05", "gnorm": "63.672", "loss_scale": "128", "train_wall": "175", "gb_free": "38", "wall": "262"}
Traceback (most recent call last):
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/distributed/utils.py", line 351, in call_main
    join=True,
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq_cli/train.py", line 180, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq_cli/train.py", line 287, in train
    for i, samples in enumerate(progress):
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/logging/progress_bar.py", line 191, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/data/iterators.py", line 509, in _chunk_iterator
    for x in itr:
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/data/iterators.py", line 637, in __next__
    raise item
  File "/home3/huyuchen/pytorch_workplace/av_hubert/fairseq/fairseq/data/iterators.py", line 567, in run
    for item in self._source:
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home3/huyuchen/pytorch_workplace/av_hubert/avhubert/hubert_dataset.py", line 349, in __getitem__
    video_feats, audio_feats = self.load_feature(self.names[index])
  File "/home3/huyuchen/pytorch_workplace/av_hubert/avhubert/hubert_dataset.py", line 277, in load_feature
    video_feats = self.load_video(video_fn)  # [T, H, W, 1]
  File "/home3/huyuchen/pytorch_workplace/av_hubert/avhubert/hubert_dataset.py", line 299, in load_video
    feats = custom_utils.load_video(os.path.join(self.audio_root, audio_name))
  File "/home3/huyuchen/pytorch_workplace/av_hubert/avhubert/utils.py", line 23, in load_video
    frames = np.stack(frames)
  File "<__array_function__ internals>", line 6, in stack
  File "/home3/huyuchen/anaconda3/envs/avhubert/lib/python3.7/site-packages/numpy/core/shape_base.py", line 423, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Ended (code 256) at Sat Mar 5 11:37:54 +08 2022, elapsed time 799 seconds

It seems the bug occurs during video data loading, so I wonder whether it is a problem with some package version (like numpy)?

My conda env is: torch-1.9.1, python-3.7.10, numpy-1.20.2.

Hope you can help me, thank you very much!
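
Not from the repo, but np.stack raises exactly this error when load_video receives an empty frame list, i.e. when OpenCV reads zero frames from a clip, so a corrupt or truncated video file is a more likely culprit than the numpy version. A hedged diagnostic sketch (paths are placeholders; fill them from your train/valid .tsv):

# Diagnostic sketch: report video files for which OpenCV returns zero
# frames -- np.stack([]) on an empty list is what raises
# "need at least one array to stack" inside load_video.
import cv2

video_paths = ["/path/to/clip1.mp4", "/path/to/clip2.mp4"]  # fill from your .tsv

for path in video_paths:
    cap = cv2.VideoCapture(path)
    num_frames = 0
    while True:
        ret, _ = cap.read()
        if not ret:
            break
        num_frames += 1
    cap.release()
    if num_frames == 0:
        print("no readable frames:", path)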

Reproducibility of the first iteration

Hi,

Thanks for your great work!
I'm trying to pre-train iteration 1 using targets derived from MFCC, running on 433h of LRS3, and then fine-tune a seq2seq model on 30h of LRS3.
However, my first-iteration result seems too bad: I only get a WER of 80%, which is far from the 71.5% WER of your CTC result (Table 2, row 1, column 1).
Do you have any suggestions about what I may have done wrong when running the pre-training?
I only used 10% of the data for running k-means; could this be the cause of such a poor result?
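
For reference, subsampling frames before fitting k-means is common in HuBERT-style recipes, so 10% is not unusual by itself, although how much it matters also depends on the number of clusters and the amount of data. A rough sketch of the idea (not the repo's clustering script), assuming the MFCC features have already been dumped as a [num_frames, feat_dim] NumPy array:

# Sketch: fit k-means on a random ~10% subset of frames, then assign a
# cluster label to every frame as the frame-aligned pseudo label.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feats = np.load("/path/to/mfcc_feats.npy")     # hypothetical dump, [T, D]
rng = np.random.default_rng(0)
subset = feats[rng.random(len(feats)) < 0.10]  # ~10% of frames for fitting

km = MiniBatchKMeans(n_clusters=100, batch_size=10000, n_init=20)  # n_clusters per your config
km.fit(subset)
labels = km.predict(feats)                     # one cluster id per frame
np.save("/path/to/frame_labels.npy", labels)   # regroup per utterance into {train,valid}.km afterwards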

WER on base_noise_pt_noise_ft_30h.pt

I'm trying to get decoding results from your AVSR fine-tuned model (avhubert_pretrained/model/lrs3_vox/avsr/base_noise_pt_noise_ft_30h.pt).

I believe the configuration is correct, but I couldn't reproduce the same result on my system.

[screenshot comparing against the reported WER]

The C-WER of the downloaded AVSR fine-tuned model (PT type = Noisy, FT type = Noisy) is 4.29% on my system.

Did I miss something?

Inference command:
python -B infer_s2s.py --config-dir conf --config-name s2s_decode.yaml dataset.gen_subset=test common_eval.path=./multimodal/avhubert_pretrained/model/lrs3_vox/avsr/base_noise_pt_noise_ft_30h.pt common_eval.results_path=fb_base_noise_pt_noise_ft_30h override.modalities=['video','audio'] common.user_dir=`pwd` override.data=./multimodal/lrs3/30h_data/ override.label_dir=./multimodal/lrs3/30h_data

s2s_decode.yaml: same as on GitHub


Thank you for your consideration. ;)

Job not requeued because: timed-out and not checkpointable?

When I prepare the data and run the python count_frames.py command, the following error occurs:

(avhubert) dragon@dragon-System-Product-Name:~/Project/av_hubert/avhubert/preparation$ python count_frames.py --root /data/dataset/LRS3/ --manifest /data/dataset/LRS3/test_list.txt --nshard 1 \
    --slurm_partition cpu
Traceback (most recent call last):
  File "count_frames.py", line 43, in <module>
    fids = [ln.strip() for ln in open(args.manifest).readlines()]
FileNotFoundError: [Errno 2] No such file or directory: '/data/dataset/LRS3/test_list.txt'
^C
(avhubert) dragon@dragon-System-Product-Name:~/Project/av_hubert/avhubert/preparation$ python count_frames.py --root /data/dataset/LRS3/ --manifest /data/dataset/LRS3/file_finetune.list --nshard 1 --slurm_partition cpu
33303 files
Traceback (most recent call last):
  File "count_frames.py", line 68, in <module>
    num_frames = [job.result() for job in jobs]
  File "count_frames.py", line 68, in <listcomp>
    num_frames = [job.result() for job in jobs]
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/submitit/core/core.py", line 264, in result
    r = self.results()
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/submitit/core/core.py", line 292, in results
    raise job_exception  # pylint: disable=raising-bad-type
submitit.core.utils.FailedJobError: Job (task=0) failed during processing with trace:

Traceback (most recent call last):
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/submitit/core/utils.py", line 122, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "count_frames.py", line 19, in count_frames
    num_frames_audio = len(wavfile.read(wav_fn)[1])
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/scipy/io/wavfile.py", line 547, in read
    file_size, is_big_endian = _read_riff_chunk(fid)
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/scipy/io/wavfile.py", line 437, in _read_riff_chunk
    str1 = fid.read(4)  # File signature
  File "/home/dragon/anaconda3/envs/avhubert/lib/python3.8/site-packages/submitit/core/job_environment.py", line 211, in checkpoint_and_try_requeue
    raise utils.UncompletedJobError(message)
submitit.core.utils.UncompletedJobError: Job not requeued because: timed-out and not checkpointable.
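
Not an official answer: the trace shows submitit killing a job it submitted to a SLURM partition, so on a single workstation without SLURM the submission itself is the problem rather than the WAV reading. If the script cannot be run locally in your setup, the same per-clip counting can be reproduced with a short local loop. A rough sketch (the audio/ and video/ layout and file extensions are assumptions, not taken from count_frames.py):

# Local sketch: count audio samples and video frames per clip without
# submitit/SLURM. The wavfile.read() call mirrors the one in the traceback;
# the directory layout below is an assumption.
import os
import cv2
from scipy.io import wavfile

root = "/data/dataset/LRS3"  # same --root as above
fids = [ln.strip() for ln in open(os.path.join(root, "file_finetune.list"))]

with open(os.path.join(root, "nframes.audio.0"), "w") as fa, \
     open(os.path.join(root, "nframes.video.0"), "w") as fv:
    for fid in fids:
        _, wav = wavfile.read(os.path.join(root, "audio", fid + ".wav"))
        cap = cv2.VideoCapture(os.path.join(root, "video", fid + ".mp4"))
        num_video_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        fa.write(f"{len(wav)}\n")
        fv.write(f"{num_video_frames}\n")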

How to fine-tune with my own dataset

Hello,

Congratulations on this project, it's amazing!!!

I want to test the single-modal Visual HuBERT because my project focuses on lip reading without audio, and I want to use the checkpoints from "Finetuned Models for Visual Speech Recognition" with my own dataset, which isn't in English. For this reason, I would like to ask a few questions about the "Finetune an AV-HuBERT model with Seq2Seq" section.

[screenshot of the "Finetune an AV-HuBERT model with Seq2Seq" section of the README]

  • What is the configuration file and how do I generate it?
  • What is tokenizer_bpe_model? Could you give me an example of how to generate it myself from my dataset? I can't use step 4 of the LRS3 preprocessing to generate it, as my dataset is made up of frames rather than videos.
  • Are there two .tsv files in /path/to/data, one for training and another for validation, or only one .tsv file?

Also, the .tsv file stores the paths that contain the video frames, and the .wrd file stores the paths that contain the sentences, right?

For example, if I have this organization for training data.

[screenshot of the training-data directory layout]

The .tsv file should look like this,
train_dataset/data_speaker/speaker0/speaker0_01
train_dataset/data_speaker/speaker0/speaker0_02
train_dataset/data_speaker/speaker0/....
train_dataset/data_speaker/speaker1/speaker1_01
train_dataset/data_speaker/speaker1/speaker1_02
train_dataset/data_speaker/speaker1/...
.
.
.
train_dataset/data_speaker/speakerN/...

and the .wrd file:
train_dataset/transcriptions_speaker/speaker0/transcription_speaker0_01.txt
train_dataset/transcriptions_speaker/speaker0/transcription_speaker0_02.txt
train_dataset/transcriptions_speaker/speaker0/....
train_dataset/transcriptions_speaker/speaker1/transcription_speaker1_01.txt
train_dataset/transcriptions_speaker/speaker1/transcription_speaker1_02.txt
train_dataset/transcriptions_speaker/speaker1/...

right?

Thanks a lot in advance!
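
Not an official answer, but in HuBERT-style fairseq recipes the {train,valid}.wrd file normally contains the transcription text itself, one utterance per line aligned with the rows of the matching .tsv, rather than paths to transcription files, and there is one .tsv/.wrd pair per split (train and valid) in the same directory. A hedged sketch of building such a pair; the exact .tsv columns used by AV-HuBERT (a root line followed by id, video path, audio path and frame counts) are an assumption here and should be checked against the files the LRS3 preparation scripts produce.

# Hedged sketch: write train.tsv / train.wrd in a HuBERT-style layout.
# The .tsv column layout is an assumption -- compare with the manifests
# generated by the LRS3 preparation scripts before using it for real.
utterances = [
    # (id, video path, audio path, video frames, audio samples, transcription)
    ("speaker0_01", "video/speaker0_01.mp4", "audio/speaker0_01.wav", 75, 48000, "hello world"),
    ("speaker0_02", "video/speaker0_02.mp4", "audio/speaker0_02.wav", 90, 57600, "good morning"),
]

with open("train.tsv", "w") as tsv, open("train.wrd", "w") as wrd:
    tsv.write("/path/to/train_dataset\n")  # first line: dataset root (assumed convention)
    for uid, video, audio, n_video, n_audio, text in utterances:
        tsv.write(f"{uid}\t{video}\t{audio}\t{n_video}\t{n_audio}\n")
        wrd.write(text + "\n")             # .wrd holds the sentence, aligned by line number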

No-dictionary error when running inference with the provided fine-tuned model

Hi, I want to run inference with the provided fine-tuned model, but I get the following error :-)
Traceback (most recent call last):
  File "infer_model.py", line 49, in <module>
    hypo = predict(mouth_roi_path, ckpt_path, user_dir)
  File "infer_model.py", line 23, in predict
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  File "/home/fairseq/fairseq/checkpoint_utils.py", line 469, in load_model_ensemble_and_task
    model = task.build_model(cfg.model, from_checkpoint=True)
  File "/home/fairseq/fairseq/tasks/fairseq_task.py", line 335, in build_model
    model = models.build_model(cfg, self, from_checkpoint)
  File "/home/fairseq/fairseq/models/__init__.py", line 106, in build_model
    return model.build_model(cfg, task)
  File "/home/av_hubert/avhubert/hubert_asr.py", line 469, in build_model
    encoder_ = task_pretrain.build_model(w2v_args.model)
  File "/home/fairseq/fairseq/tasks/fairseq_task.py", line 335, in build_model
    model = models.build_model(cfg, self, from_checkpoint)
  File "/home/fairseq/fairseq/models/__init__.py", line 106, in build_model
    return model.build_model(cfg, task)
  File "/home/av_hubert/avhubert/hubert.py", line 439, in build_model
    model = AVHubertModel(cfg, task.cfg, task.dictionaries, **kwargs)
  File "/home/av_hubert/avhubert/hubert_pretraining.py", line 196, in dictionaries
    return self.state.dictionaries
  File "/home/fairseq/fairseq/tasks/fairseq_task.py", line 41, in __getattr__
    self._state[name] = self._factories[name]()
  File "/home/av_hubert/avhubert/hubert_pretraining.py", line 200, in load_dictionaries
    dictionaries = [
  File "/home/av_hubert/avhubert/hubert_pretraining.py", line 201, in <listcomp>
    Dictionary.load(f"{label_dir}/dict.{label}.txt")
  File "/home/fairseq/fairseq/data/dictionary.py", line 226, in load
    d.add_from_file(f)
  File "/home/fairseq/fairseq/data/dictionary.py", line 239, in add_from_file
    raise fnfe
  File "/home/fairseq/fairseq/data/dictionary.py", line 236, in add_from_file
    with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
FileNotFoundError: [Errno 2] No such file or directory: '/checkpoint/bshi/data/lrs3//video/hubert/stitch-iters/envox-iter4-l12c2000//dict.km.txt'

So, how can I get the dictionary, or how can I run inference with the fine-tuned model you released?
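
Not a confirmed fix, but the path in the error is the label directory baked into the checkpoint's task config (a path on the authors' cluster). One workaround commonly used with fairseq checkpoints is to override those stale paths when loading, so that Dictionary.load() looks in a local directory that actually contains a dict.*.txt. A sketch under the assumption that the relevant keys are named data and label_dir:

# Hedged workaround sketch: override the dataset/label paths stored in the
# checkpoint so the dictionary is loaded from a local directory instead of
# '/checkpoint/bshi/...'. The override keys are an assumption about how the
# saved task config is named.
from fairseq import checkpoint_utils

ckpt_path = "/path/to/finetuned.pt"          # placeholder
overrides = {
    "data": "/path/to/local/data",           # directory with your .tsv files
    "label_dir": "/path/to/local/labels",    # directory containing dict.*.txt
}
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    [ckpt_path], arg_overrides=overrides
)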

Model registration problem with multiple pre-trained models

Hi,

Thanks for your wonderful work.

I want to load multiple pre-trained models in the same task. But when I try to load another pre-trained model, a model registration error is raised, as shown below:

Traceback (most recent call last):
  File "hubert_asr.py", line 489, in load_vision_model
    model = task.build_model(cfg.model)
  File "/home3/chenchen/research/hubert/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
    model = models.build_model(cfg, self)
  File "/home3/chenchen/research/hubert/fairseq/fairseq/models/__init__.py", line 89, in build_model
    assert model is not None, (
AssertionError: Could not infer model type from {'_name': 'av_hubert_seq2seq', .......}. Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'hubert', 'hubert_ctc', 'transformer_lm', 'av_hubert', 'av_hubert_ctc']). Requested model type: av_hubert_seq2seq.

I think it is because 'av_hubert_seq2seq' is not registered in the model list; how can I add it?

Thanks!
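
For context (not a confirmed fix): fairseq only knows about av_hubert_seq2seq after the module carrying its @register_model decorator has been imported, which is what common.user_dir normally triggers. A hedged sketch of forcing that import before the second task.build_model() call; the user_dir path is a placeholder:

# Hedged sketch: import the avhubert user modules so their @register_model
# decorators run before building a model with _name 'av_hubert_seq2seq'.
from argparse import Namespace

from fairseq import utils
from fairseq.models import MODEL_REGISTRY

utils.import_user_module(Namespace(user_dir="/path/to/av_hubert/avhubert"))
print("av_hubert_seq2seq registered:", "av_hubert_seq2seq" in MODEL_REGISTRY)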

Questions about pre-training an AV-HuBERT model.

Hi, thanks for the work.
I was trying to understand your code to pre-train an AV-HuBERT model based on your readme instructions,
and I have a few simple questions.

(1) Is positional embedding not used during pre-training?
In hubert.py, neither conv_pos nor conv_pos_groups is used,
so is the positional embedding not used? If so, is there a reason?
(2) Is the ResNet not initialized from a pre-trained model as a feature extractor during pre-training?
In hubert.py:

        resnet = ResEncoder(relu_type=cfg.resnet_relu_type, weights=cfg.resnet_weights)
        self.feature_extractor_audio = SubModel(resnet=None, input_dim=cfg.audio_feat_dim, cfg=sub_cfg)
        self.feature_extractor_video = SubModel(resnet=resnet, input_dim=resnet.backend_out, cfg=sub_cfg)

It seems resnet_weights is not set in any config file, so the ResNet is trained from scratch during pre-training, right?
I thought a ResNet pretrained on ImageNet would be frozen and used as a fixed feature extractor in this context, though I am not familiar with the literature.

Thanks in advance for your comment!

LRS3 433h pretrain configuration

Hi,

I'm following your work on LRS3 433h, starting from MFCC features.
However, I found that the fine-tuned (LRS3 30h) lip-reading WER is high, even after finishing the 4th iteration (WER 81%).

I suppose the reason is that modality dropout is 0 in the first 4 pre-training iterations, causing a mismatch between pre-training (always using audio + video) and fine-tuning (using video only).
Could you please check whether the configuration is right?

Thx!

How can I use your model to extract features?

Great work! Your project provides a good representation of the lips. How can I use your model to extract features? For example, I want to extract features from a sequence of lip images. Could you provide some example code?
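
There is no official snippet in this thread, but a rough sketch of the usual pattern follows: load the checkpoint, then call extract_finetune on the pre-trained model with a dict of 'video'/'audio' inputs. The tensor shape below (batch, 1 grayscale channel, time, 88x88 mouth crops) and the ability to pass audio=None are assumptions; check hubert.py and hubert_dataset.py for the exact contract, and note that for a fine-tuned seq2seq checkpoint the pre-trained encoder appears to sit under model.encoder.w2v_model (see the tracebacks elsewhere on this page).

# Rough, unofficial sketch: extract frame-level features from a pre-trained
# AV-HuBERT checkpoint. Run from the avhubert directory (or put it on
# sys.path) so the custom task/model classes get registered on import.
import torch
import fairseq
import hubert_pretraining, hubert  # noqa: F401

ckpt_path = "/path/to/the/checkpoint.pt"  # placeholder
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()

# Hypothetical input: 1 clip, 1 grayscale channel, 50 frames of 88x88 mouth crops.
video = torch.randn(1, 1, 50, 88, 88)

with torch.no_grad():
    feats, padding_mask = model.extract_finetune(
        source={"video": video, "audio": None}, padding_mask=None
    )
print(feats.shape)  # expected: [1, T, feature_dim]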

Load pretrained model error!

When I load the pretrained model "finetune-model.pt" following your demo:
checkpoint_utils.load_model_ensemble_and_task([ckpt_path])

it raises an error:

  File "/home/amax/av_hubert/avhubert/demo.py", line 47, in predict
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  File "/home/amax/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "/home/amax/av_hubert/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
    model = models.build_model(cfg, self)
  File "/home/amax/av_hubert/fairseq/fairseq/models/__init__.py", line 88, in build_model
    assert model is not None, (
AssertionError: Could not infer model type from {'_name': 'av_hubert_seq2seq', 'w2v_path': '/checkpoint/bshi/data/lrs3//model-ckpt/base-vox/pretrain/av.pt', 'apply_mask': False, 'mask_selection': 'static', 'mask_length': 10, 'mask_other': 0, 'mask_prob': 0.75, 'mask_channel_selection': 'static', 'mask_channel_length': 64, 'mask_channel_other': 0, 'mask_channel_prob': 0.5, 'layerdrop': 0.1, 'dropout': 0.0, 'activation_dropout': 0.1, 'attention_dropout': 0.0, 'feature_grad_mult': 1.0, 'decoder_layers': 6, 'decoder_dropout': 0.1, 'decoder_attention_dropout': 0.0, 'decoder_activation_dropout': 0.1, 'freeze_finetune_updates': 22500, 'share_decoder_input_output_embed': True, 'decoder_normalize_before': True, 'w2v_args': {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 200, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1337, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '/private/home/bshi/code/fairseq-py/examples/av_hubert/model', 'empty_cache_freq': 10000, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 32, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://learnfair1212:29671', 'distributed_port': 29671, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'no_c10d', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 6, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 1000, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 5, 'validate_interval_updates': 10000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 1000, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 
'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 800000, 'stop_time_hours': 0.0, 'clip_norm': 10.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.002], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 25000, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 32}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'av_hubert', 'label_rate': 25, 'skip_masked': False, 'skip_nomask': False, 'mask_prob_image': 0.3, 'mask_length_image': 5, 'mask_prob_audio': 0.8, 'mask_length_audio': 10, 'extractor_mode': 'default', 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'final_dim': 256, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'dropout': 0.1, 'attention_dropout': 0.1, 'feature_grad_mult': 0.1, 'untie_final_proj': True, 'activation_dropout': 0.0, 'layer_norm_first': True, 'audio_feat_dim': 104, 'modality_dropout': 0.5, 'audio_dropout': 0.5, 'modality_fuse': 'concat', 'selection_type': 'same_seq', 'masking_type': 'input'}, 'task': {'_name': 'av_hubert_pretraining', 'data': '/checkpoint/bshi/data/lrs3//avsvox/en-vox-multimodal-tsv/', 'label_dir': '/checkpoint/bshi/data/lrs3//video/hubert/stitch-iters/envox-iter4-l12c2000/', 'labels': ['km'], 'label_rate': 25, 'sample_rate': 25, 'max_sample_size': 2000, 'min_sample_size': 5, 'pad_audio': False, 'random_crop': True, 'normalize': True, 'input_modality': 'image', 'image_aug': True, 'stack_order_audio': 4, 'max_trim_sample_size': 400}, 'criterion': {'_name': 'av_hubert', 'pred_masked_weight': 1.0, 
'pred_nomask_weight': 1.0, 'loss_weights': [10]}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'use_old_adam': False, 'tpu': False, 'lr': [0.002]}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 64000, 'force_anneal': None, 'end_learning_rate': 0.0, 'power': 1.0, 'total_num_update': 800000, 'lr': [0.002]}, 'scoring': None, 'bpe': None, 'tokenizer': None, 'job_logging_cfg': {'version': 1, 'formatters': {'simple': {'format': '[%(asctime)s][%(name)s][%(levelname)s] - %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'stream': 'ext://sys.stdout'}, 'file': {'class': 'logging.FileHandler', 'formatter': 'simple', 'filename': 'hydra_train.log'}}, 'root': {'level': 'INFO', 'handlers': ['console', 'file']}, 'disable_existing_loggers': False}}}. Available models: dict_keys(['transformer_lm', 'wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'hubert', 'hubert_ctc', 'av_hubert']) Requested model type: av_hubert_seq2seq

Process finished with exit code 1

How can I fix it? Thanks!

Are fine-tuned models for ASR provided?

I downloaded the fine-tuned model from the "Finetuned Models for Visual Speech Recognition" section with the name "base_lrs3_30h.pt" and ran the decoding script for it.

While I was able to decode a model I trained myself, when decoding the downloaded checkpoint mentioned above I got the following error. Is this because the checkpoint I downloaded is for lip reading rather than ASR, or is there some other reason?
Thanks!

Traceback (most recent call last):                                                                                            
  File "infer_s2s.py", line 285, in hydra_main
    distributed_utils.call_main(cfg, main)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "infer_s2s.py", line 90, in main
    return _main(cfg, h)
  File "infer_s2s.py", line 221, in _main
    hypos = task.inference_step(
  File "/gs/hs0/tga-tslab/bowen/av_hubert/fairseq/fairseq/tasks/fairseq_task.py", line 527, in inference_step
    return generator.generate(
  File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/sequence_generator.py", line 188, in generate
    return self._generate(sample, **kwargs)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/sequence_generator.py", line 258, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/sequence_generator.py", line 772, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/sequence_generator.py", line 772, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "/gs/hs0/tga-tslab/bowen/av_hubert/fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/hubert_asr.py", line 386, in forward
    x, padding_mask = self.w2v_model.extract_finetune(**w2v_args)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/hubert.py", line 704, in extract_finetune
    features_audio = self.forward_features(src_audio, modality='audio') # features: [B, F, T]
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/hubert.py", line 541, in forward_features
    features = extractor(source)
  File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gs/hs0/tga-tslab/bowen/av_hubert/avhubert/hubert.py", line 327, in forward
    x = self.proj(x.transpose(1, 2))
  File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (3000x26 and 104x768)
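
For what it's worth (not part of the original question), the mismatched shapes line up with the audio front-end: the checkpoint expects 104-dim acoustic inputs (audio_feat_dim: 104 with stack_order_audio: 4 in the configs quoted earlier on this page, i.e. 4 stacked 26-dim frames), while the input being fed in is only 26-dim. A small illustrative sketch of that stacking relationship, not the repo's exact code:

# Illustrative sketch: stack every 4 consecutive 26-dim acoustic frames into
# one 104-dim vector, the relationship behind the (3000x26) vs (104x768)
# shape mismatch above.
import numpy as np

def stack_frames(feats: np.ndarray, stack_order: int = 4) -> np.ndarray:
    """[T, F] -> [T // stack_order, F * stack_order], zero-padding the tail."""
    t, f = feats.shape
    pad = (-t) % stack_order
    feats = np.concatenate([feats, np.zeros((pad, f), dtype=feats.dtype)], axis=0)
    return feats.reshape(-1, stack_order * f)

fbank = np.random.randn(3000, 26).astype(np.float32)  # e.g. 26-dim log filterbanks
print(stack_frames(fbank).shape)                      # (750, 104)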
