Comments (22)

chevalierNoir commented on September 10, 2024

@sif99 In pre-training, we assume paired text is not available. Our current code does not support having text as a third modality in pre-training.

chevalierNoir commented on September 10, 2024

Hi,

  1. The configuration file specifies the hyperparameters as well as paths to data. You can find examples of config files for fine-tuning here.
  2. tokenizer_bpe_model is the path to the bpe tokenizer. You only need your training text (i.e., a list of sentences) to generate that with gen_subword.py. Here is an example.
  3. Yes. train.tsv and valid.tsv are for train and validation.
  4. The format of the tsv file is as follows, where fields are separated by tabs:

/  ## first line: the root directory
id1 /path/to/video1 /path/to/audio1 number-frames-video1 number-frames-audio1
id2 /path/to/video2 /path/to/audio2 number-frames-video2 number-frames-audio2
...

Each line of the wrd file contains the text transcription corresponding to a line of the tsv file (excluding the first line, which holds the root directory):

sentence-1
sentence-2
...
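
For illustration, a minimal sketch of writing a matching tsv/wrd pair in this layout; all ids, paths, and frame counts below are hypothetical placeholders:

# Hypothetical example of the tsv/wrd layout described above.
samples = [
    # (id, video path, audio path, video frames, audio frames, transcript)
    ("id1", "/data/video1.mp4", "/data/audio1.wav", 100, 64000, "hello world"),
    ("id2", "/data/video2.mp4", "/data/audio2.wav", 150, 96000, "good morning"),
]
with open("train.tsv", "w") as tsv, open("train.wrd", "w") as wrd:
    tsv.write("/\n")  # first line of the tsv: the root directory
    for sid, video, audio, n_video, n_audio, text in samples:
        # one tab-separated line per utterance
        tsv.write("\t".join([sid, video, audio, str(n_video), str(n_audio)]) + "\n")
        # the wrd file has one transcription per tsv entry, in the same order
        wrd.write(text + "\n")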

As our current data loader is based on videos, you will have to convert your jpgs to videos; a conversion sketch follows below. The video fps should be 25 by default. If you don't have the audio track, you can put dummy strings for the audio path and the number of audio frames.
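
A minimal conversion sketch with OpenCV, assuming a folder of sequentially numbered jpg frames; the paths and the mp4v codec are illustrative, not requirements of the repo:

import glob
import cv2  # opencv-python

frame_paths = sorted(glob.glob("frames/*.jpg"))  # hypothetical frame folder
height, width = cv2.imread(frame_paths[0]).shape[:2]
# 25 fps to match the data loader's default frame rate
writer = cv2.VideoWriter("clip.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 25, (width, height))
for path in frame_paths:
    writer.write(cv2.imread(path))
writer.release()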

YadiraRoCa commented on September 10, 2024

Thank you for your answer!

Could you share the code of Single-modal Visual HuBERT?

chevalierNoir commented on September 10, 2024

Single-modal visual HuBERT is for pre-training. Did you want to use your own dataset for fine-tuning or pre-training?

YadiraRoCa commented on September 10, 2024

I would like to use single-modal visual HuBERT for fine-tuning. Is it possible?
I ask because the checkpoints of the fine-tuned models for Visual Speech Recognition were obtained from the AV-HuBERT model, but I'm interested in V-HuBERT.

chevalierNoir commented on September 10, 2024

Just to clarify, the difference between AV-HuBERT and single-modal visual HuBERT lies only in the pre-training stage (both without text labels), while both can be used for downstream tasks with a single modality (e.g., lip reading with images only). If I understand correctly, what you probably want to do is train a lip-reading model on your own dataset starting from some pre-trained model. For that purpose, you can just download a pre-trained checkpoint (without fine-tuning) from here and fine-tune it on your own data following the fine-tuning instructions above.

However, if you do want to repeat the entire process (i.e., pre-training + fine-tuning + decoding) on your own data with single-modal visual HuBERT, you will have to prepare the cluster labels for pre-training. The feature we use for clustering in the paper is HoG, which we extract with scikit-image (see the sketch below). The clustering process is the same as for AV-HuBERT, which you can find here. The pre-training command is the same as for AV-HuBERT except that task.modalities=['video']. Fine-tuning and decoding are exactly the same as for AV-HuBERT.
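
For reference, a minimal sketch of per-frame HoG extraction with scikit-image; the frame size and HoG parameters below are illustrative defaults, not necessarily the paper's exact settings:

import numpy as np
from skimage.feature import hog  # scikit-image

frame = np.random.rand(88, 88)  # stand-in for one grayscale lip-ROI frame
features = hog(
    frame,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
)
# 'features' is one flat descriptor per frame; stacking these over a video
# gives the per-frame features that are then clustered (e.g., with k-means)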

YadiraRoCa commented on September 10, 2024

Okay, thank you for the clarification.

sif99 commented on September 10, 2024

@chevalierNoir First of all, I'm impressed by this project. I'm also interested in fine-tuning for AVSR and lip reading.
Also, is it possible to add text as a third modality in the pre-training part?

sif99 commented on September 10, 2024

@chevalierNoir Thank you for your response. Actually, my idea is to add text and pre-train the model with 3 modalities in order to get better results. For example, in data preparation, when we trim the audio and the video, we can also trim the text. For the clustering part (AV-HuBERT label preparation), I'm not sure about the required modifications for the different steps. Thank you in advance.

chevalierNoir commented on September 10, 2024

@chevalierNoir Thank you for your response. Actually, my idea is to add text and pre-train the model with 3 modalities in order to get better results. For example, in data preparation, when we trim the audio and the video, we can also trim the text. For the clustering part (AV-HuBERT label preparation), I'm not sure about the required modifications for the different steps. Thank you in advance.

Yes, that's possible for LRS3, which has ground-truth transcriptions and their alignment. It depends on how you utilize the text. If you want to predict the text from video, then you can just add an additional loss during training (sketched below) and the clustering part remains unchanged.
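
To make "an additional loss" concrete, here is a hedged sketch of one way to attach a text-prediction head with a CTC loss on top of the encoder features; all shapes and names are hypothetical, and av_hubert itself does not implement this:

import torch
import torch.nn as nn

T, B, V = 50, 4, 1000              # frames, batch size, text vocabulary size
features = torch.randn(T, B, 768)  # stand-in for encoder output features

text_head = nn.Linear(768, V)      # extra projection predicting text units
ctc = nn.CTCLoss(blank=0)

log_probs = text_head(features).log_softmax(dim=-1)  # (T, B, V)
targets = torch.randint(1, V, (B, 20))               # dummy target token ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

text_loss = ctc(log_probs, targets, input_lengths, target_lengths)
# total_loss = masked_prediction_loss + lambda_text * text_loss  (the weight is a choice)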

sif99 commented on September 10, 2024

@chevalierNoir Actually, for the clustering part, I thought that after the first iteration a modification is required here to extract the features for the 3 modalities. Also, to pre-train the model with text, I thought it has to be masked in the same way as the audio and video?

chevalierNoir commented on September 10, 2024

@chevalierNoir Actually, for the clustering part, I thought that after the first iteration a modification is required here to extract the features for the 3 modalities. Also, to pre-train the model with text, I thought it has to be masked in the same way as the audio and video?

If you use text as an additional input modality, then yes, you have to modify the code by adding a text encoder. During clustering, you would probably need the text as well (change here). Masking text should be necessary, though we have not tried it before; a rough sketch is below. You would also need to figure out what sub-word units to use and how to obtain their alignment, as only word-level alignment is provided in LRS3.
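
As a rough illustration of masking a token stream the way audio/video frames are masked, here is a hedged span-masking sketch; the mask probability, span length, and mask id are arbitrary choices, not values from the repo:

import torch

def mask_spans(tokens, mask_id, mask_prob=0.15, span=3):
    """Replace random spans of a 1-D token sequence with mask_id."""
    masked = tokens.clone()
    # choose span starts so that roughly mask_prob of tokens end up masked
    starts = torch.rand(tokens.size(0)) < mask_prob / span
    for i in torch.nonzero(starts).flatten().tolist():
        masked[i : i + span] = mask_id
    return masked

tokens = torch.randint(4, 1000, (32,))  # dummy sub-word ids (low ids reserved)
print(mask_spans(tokens, mask_id=3))    # 3 stands in for a reserved mask index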

YadiraRoCa commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error:
File "~ /av_hubert/avhubert/hubert_asr.py", line 465, in build_model
task_pretrain = tasks.setup_task(w2v_args.task)
File "~ /fairseq/fairseq/tasks/init.py", line 39, in setup_task
cfg = merge_with_parent(dc(), cfg)
File "~ /fairseq/fairseq/dataclass/utils.py", line 490 , in merge_with_parent
merged_cfg = OmegaConf.merge(dc, cfg)
File "~ /.local/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 321, in merge
target.merge_with(*others[1:])
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 331, in merge_with
self._format_and_raise(key=None, value=None, cause=e)
File "~ /.local/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
_raise(ex, cause)
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 329, in merge_with
self._merge_with(*others)
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 347, in _merge_with
BaseContainer._map_merge(self, other)
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 314, in _map_merge
dest[key] = src._get_node(key)
File "~ /.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 258, in setitem
self._format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
_raise(ex, cause)
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
full_key: input_modality
reference_type=Optional[AVHubertPretrainingConfig]
object_type=AVHubertPretrainingConfig

I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).

Should I add this key or is this key the same as 'task.modalities'?

chevalierNoir commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error: [...] omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'

I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).

Should I add this key or is this key the same as 'task.modalities'?

The input_modality in AVHubertConfig is a deprecated argument and can be removed. But keeping it there won't affect the current model training. If you follow the fine-tuning command here, it should work.
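
For illustration, a hedged sketch of stripping the deprecated key out of a saved checkpoint before loading; it assumes the config is stored under ckpt["cfg"] as an OmegaConf container (as in recent fairseq checkpoints), and the paths are placeholders:

import torch
from omegaconf import DictConfig, open_dict

ckpt = torch.load("/path/to/the/checkpoint.pt", map_location="cpu")
cfg = ckpt.get("cfg")
if isinstance(cfg, DictConfig):
    with open_dict(cfg):  # temporarily allow key deletion on a struct config
        cfg.model.pop("input_modality", None)
        cfg.task.pop("input_modality", None)
torch.save(ckpt, "/path/to/the/checkpoint_fixed.pt")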

javadpeyman commented on September 10, 2024

Hello,
I want to use your code for another language. I created a tokenizer for my textual data with gen_subword.py (I checked it and it's correct), but in LabelEncoderS2SToken in hubert_pretraining.py all the tokens map to 3 (unknown). The characters of my dataset are similar to the Arabic language. Do I have to change the "s2s_tokenizer", or does it support all languages? @chevalierNoir

YadiraRoCa commented on September 10, 2024

@chevalierNoir I removed the input_modality in AVHubertConfig, but the error remains the same.
I followed the fine-tuning command:
fairseq-hydra-train --config-dir ~/av_hubert/avhubert/conf/finetune/ --config-name large_lrs3_433h.yaml \
  task.data=~/data_hubert/433h_data/tsv_path task.label_dir=~/data_hubert/433h_data/wrd_path \
  task.tokenizer_bpe_model=~/data_hubert/spm1000/spm_unigram1000.model \
  model.w2v_path=~/data_hubert/checkpoints/large_lrs3_iter5.pt \
  hydra.run.dir=~/data_hubert/experiment/finetune/ common.user_dir=`pwd`

chevalierNoir commented on September 10, 2024

Hello, I want to use your code for another language. I created a tokenizer for my textual data with gen_subword.py (I checked it and it's correct), but in LabelEncoderS2SToken in hubert_pretraining.py all the tokens map to 3 (unknown). The characters of my dataset are similar to the Arabic language. Do I have to change the "s2s_tokenizer", or does it support all languages? @chevalierNoir

Hi,

The tokenizer we use is a unigram sentencepiece model. It should also work for Arabic-like languages. It's probably an issue with the text label file (i.e., *.wrd). Please check whether the text in the label file can be read and encoded correctly here.
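
As a quick sanity check, sketched under the assumption of a standard sentencepiece model file and a local *.wrd file (paths are placeholders), you can verify that lines are not encoded entirely into the unknown id:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm_unigram1000.model")   # path to your tokenizer model
unk = sp.unk_id()                  # unrepresentable tokens map to this id

with open("train.wrd", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        ids = sp.EncodeAsIds(line.strip())
        if ids and all(i == unk for i in ids):
            print(f"line {line_no} encodes entirely to <unk>: {line.strip()!r}")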

chevalierNoir commented on September 10, 2024

@chevalierNoir I removed the input_modality in AVHubertConfig, but the error remains the same. I followed the fine-tuning command: fairseq-hydra-train --config-dir ~/av_hubert/avhubert/conf/finetune/ --config-name large_lrs3_433h.yaml task.data=~/data_hubert/433h_data/tsv_path task.label_dir=~/data_hubert/433h_data/wrd_path task.tokenizer_bpe_model=~/data_hubert/spm1000/spm_unigram1000.model model.w2v_path=~/data_hubert/checkpoints/large_lrs3_iter5.pt hydra.run.dir=~/data_hubert/experiment/finetune/ common.user_dir=`pwd`

Did you use the same fairseq version as provided in the repo? There will be an error if a more recent version of fairseq is used.

YadiraRoCa commented on September 10, 2024

The problem was related to the version of fairseq. It works now, thank you so much @chevalierNoir

Rongjiehuang commented on September 10, 2024

Hi @chevalierNoir, might it be possible to use a recent fairseq to fine-tune AV-HuBERT? I ask because some of my implementations are based on a more recent version.

chevalierNoir commented on September 10, 2024

Hi @chevalierNoir, might it be possible to use a recent fairseq to fine-tune AV-HuBERT? I ask because some of my implementations are based on a more recent version.

It should be OK. But you may need to update some arguments depending on the version you use.

CCTN-BCI commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error: [...] omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).
Should I add this key or is this key the same as 'task.modalities'?

The input_modality in AVHubertConfig is a deprecated argument and can be removed. But keeping it there won't affect the current model training. If you follow the fine-tuning command here, it should work.

I still have an error now!
I followed the instructions in the README for loading a pre-trained model:

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

The error is as follows:
omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
    full_key: input_modality
    reference_type=Optional[AVHubertPretrainingConfig]
    object_type=AVHubertPretrainingConfig

How should I remove the argument 'input_modality'? Thank you very much!
