Comments (22)

chevalierNoir commented on September 10, 2024

@sif99 In pre-training, we assume paired text is not available. Our current code does not support having text as a third modality in pre-training.

chevalierNoir commented on September 10, 2024

Hi,

  1. The configuration file specifies the hyperparameters as well as paths to data. You can find examples of config files for fine-tuning here.
  2. tokenizer_bpe_model is the path to the bpe tokenizer. You only need your training text (i.e., a list of sentences) to generate that with gen_subword.py. Here is an example.
  3. Yes. train.tsv and valid.tsv are for train and validation.
  4. The format of the tsv file is as follows, where fields are separated by tabs:

/  ## first line: the root directory
id1 /path/to/video1 /path/to/audio1 number-frames-video1 number-frames-audio1
id2 /path/to/video2 /path/to/audio2 number-frames-video2 number-frames-audio2
...

Each line of the wrd file contains the text transcription corresponding to a line of the tsv file (excluding the first line, which holds the root directory):

sentence-1
sentence-2
...
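
For illustration, a minimal sketch of writing a matching tsv/wrd pair in this layout; all ids, paths, and frame counts below are hypothetical placeholders:

# Hypothetical example of the tsv/wrd layout described above.
samples = [
    # (id, video path, audio path, video frames, audio frames, transcript)
    ("id1", "/data/video1.mp4", "/data/audio1.wav", 100, 64000, "hello world"),
    ("id2", "/data/video2.mp4", "/data/audio2.wav", 150, 96000, "good morning"),
]
with open("train.tsv", "w") as tsv, open("train.wrd", "w") as wrd:
    tsv.write("/\n")  # first line of the tsv: the root directory
    for sid, video, audio, n_video, n_audio, text in samples:
        # one tab-separated line per utterance
        tsv.write("\t".join([sid, video, audio, str(n_video), str(n_audio)]) + "\n")
        # the wrd file has one transcription per tsv entry, in the same order
        wrd.write(text + "\n")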

As our current data loader is based on videos, you will have to convert your jpgs to videos; a conversion sketch follows below. The video fps should be 25 by default. If you don't have the audio track, you can put dummy strings for the audio path and the number of audio frames.
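
A minimal conversion sketch with OpenCV, assuming a folder of sequentially numbered jpg frames; the paths and the mp4v codec are illustrative, not requirements of the repo:

import glob
import cv2  # opencv-python

frame_paths = sorted(glob.glob("frames/*.jpg"))  # hypothetical frame folder
height, width = cv2.imread(frame_paths[0]).shape[:2]
# 25 fps to match the data loader's default frame rate
writer = cv2.VideoWriter("clip.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 25, (width, height))
for path in frame_paths:
    writer.write(cv2.imread(path))
writer.release()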

YadiraRoCa commented on September 10, 2024

Thank you for your answer!

Could you share the code of Single-modal Visual HuBERT?

chevalierNoir commented on September 10, 2024

Single-modal visual HuBERT is for pre-training. Did you want to use your own dataset for fine-tuning or pre-training?

YadiraRoCa commented on September 10, 2024

I would like to use single-modal visual HuBERT for fine-tuning. Is it possible?
I ask because the checkpoints of the fine-tuned models for Visual Speech Recognition were obtained from the AV-HuBERT model, but I'm interested in V-HuBERT.

chevalierNoir commented on September 10, 2024

Just to clarify, the difference between AV-HuBERT and single-modal visual HuBERT lies only in the pre-training stage (both without text labels), while both can be used for downstream tasks with a single modality (e.g., lip reading with images only). If I understand correctly, what you probably want to do is train a lip-reading model on your own dataset starting from some pre-trained model. For that purpose, you can just download a pre-trained checkpoint (without fine-tuning) from here and fine-tune it on your own data following the fine-tuning instructions above.

However, if you do want to repeat the entire process (i.e., pre-training + fine-tuning + decoding) on your own data with single-modal visual HuBERT, you will have to prepare the cluster labels for pre-training. The feature we use for clustering in the paper is HoG, which we extract with scikit-image (see the sketch below). The clustering process is the same as for AV-HuBERT, which you can find here. The pre-training command is the same as for AV-HuBERT except that task.modalities=['video']. Fine-tuning and decoding are exactly the same as for AV-HuBERT.
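
For reference, a minimal sketch of per-frame HoG extraction with scikit-image; the frame size and HoG parameters below are illustrative defaults, not necessarily the paper's exact settings:

import numpy as np
from skimage.feature import hog  # scikit-image

frame = np.random.rand(88, 88)  # stand-in for one grayscale lip-ROI frame
features = hog(
    frame,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
)
# 'features' is one flat descriptor per frame; stacking these over a video
# gives the per-frame features that are then clustered (e.g., with k-means)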

YadiraRoCa commented on September 10, 2024

Okay, thank you for the clarification.

sif99 commented on September 10, 2024

@chevalierNoir First of all, I'm impressed by this project. I'm also interested in fine-tuning for AVSR and lip reading.
Also, is it possible to add text as a third modality in the pre-training part?

sif99 commented on September 10, 2024

@chevalierNoir Thank you for your response. Actually, my idea is to add text and pre-train the model with 3 modalities in order to get better results. For example, in data preparation, when we trim the audio and the video, we can also trim the text. For the clustering part (AV-HuBERT label preparation), I'm not sure about the required modifications for the different steps. Thank you in advance.

chevalierNoir commented on September 10, 2024

@chevalierNoir Thank you for your response. Actually, my idea is to add text and pre-train the model with 3 modalities in order to get better results. For example, in data preparation, when we trim the audio and the video, we can also trim the text. For the clustering part (AV-HuBERT label preparation), I'm not sure about the required modifications for the different steps. Thank you in advance.

Yes, that's possible for LRS3, which has ground-truth transcriptions and their alignment. It depends on how you utilize the text. If you want to predict the text from video, then you can just add an additional loss during training (sketched below) and the clustering part remains unchanged.
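
To make "an additional loss" concrete, here is a hedged sketch of one way to attach a text-prediction head with a CTC loss on top of the encoder features; all shapes and names are hypothetical, and av_hubert itself does not implement this:

import torch
import torch.nn as nn

T, B, V = 50, 4, 1000              # frames, batch size, text vocabulary size
features = torch.randn(T, B, 768)  # stand-in for encoder output features

text_head = nn.Linear(768, V)      # extra projection predicting text units
ctc = nn.CTCLoss(blank=0)

log_probs = text_head(features).log_softmax(dim=-1)  # (T, B, V)
targets = torch.randint(1, V, (B, 20))               # dummy target token ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

text_loss = ctc(log_probs, targets, input_lengths, target_lengths)
# total_loss = masked_prediction_loss + lambda_text * text_loss  (the weight is a choice)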

sif99 commented on September 10, 2024

@chevalierNoir Actually, for the clustering part, I thought that after the first iteration a modification is required here to extract the features for the 3 modalities. Also, to pre-train the model with text, I thought it has to be masked in the same way as the audio and video?

chevalierNoir commented on September 10, 2024

@chevalierNoir Actually, for the clustering part, I thought that after the first iteration a modification is required here to extract the features for the 3 modalities. Also, to pre-train the model with text, I thought it has to be masked in the same way as the audio and video?

If you use text as an additional input modality, then yes, you have to modify the code by adding a text encoder. During clustering, you would probably need the text as well (change here). Masking text should be necessary, though we have not tried it before; a rough sketch is below. You would also need to figure out what sub-word units to use and how to obtain their alignment, as only word-level alignment is provided in LRS3.
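
As a rough illustration of masking a token stream the way audio/video frames are masked, here is a hedged span-masking sketch; the mask probability, span length, and mask id are arbitrary choices, not values from the repo:

import torch

def mask_spans(tokens, mask_id, mask_prob=0.15, span=3):
    """Replace random spans of a 1-D token sequence with mask_id."""
    masked = tokens.clone()
    # choose span starts so that roughly mask_prob of tokens end up masked
    starts = torch.rand(tokens.size(0)) < mask_prob / span
    for i in torch.nonzero(starts).flatten().tolist():
        masked[i : i + span] = mask_id
    return masked

tokens = torch.randint(4, 1000, (32,))  # dummy sub-word ids (low ids reserved)
print(mask_spans(tokens, mask_id=3))    # 3 stands in for a reserved mask index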

YadiraRoCa commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error:
File "~ /av_hubert/avhubert/hubert_asr.py", line 465, in build_model
task_pretrain = tasks.setup_task(w2v_args.task)
File "~ /fairseq/fairseq/tasks/init.py", line 39, in setup_task
cfg = merge_with_parent(dc(), cfg)
File "~ /fairseq/fairseq/dataclass/utils.py", line 490 , in merge_with_parent
merged_cfg = OmegaConf.merge(dc, cfg)
File "~ /.local/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 321, in merge
target.merge_with(*others[1:])
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 331, in merge_with
self._format_and_raise(key=None, value=None, cause=e)
File "~ /.local/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
_raise(ex, cause)
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 329, in merge_with
self._merge_with(*others)
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 347, in _merge_with
BaseContainer._map_merge(self, other)
File "~ /.local/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 314, in _map_merge
dest[key] = src._get_node(key)
File "~ /.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 258, in setitem
self._format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
format_and_raise(
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
_raise(ex, cause)
File "~ /.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
full_key: input_modality
reference_type=Optional[AVHubertPretrainingConfig]
object_type=AVHubertPretrainingConfig

I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).

Should I add this key or is this key the same as 'task.modalities'?

chevalierNoir commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error: [...] omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'

I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).

Should I add this key or is this key the same as 'task.modalities'?

The input_modality in AVHubertConfig is a deprecated argument and can be removed. But keeping it there won't affect the current model training. If you follow the fine-tuning command here, it should work.
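
For illustration, a hedged sketch of stripping the deprecated key out of a saved checkpoint before loading; it assumes the config is stored under ckpt["cfg"] as an OmegaConf container (as in recent fairseq checkpoints), and the paths are placeholders:

import torch
from omegaconf import DictConfig, open_dict

ckpt = torch.load("/path/to/the/checkpoint.pt", map_location="cpu")
cfg = ckpt.get("cfg")
if isinstance(cfg, DictConfig):
    with open_dict(cfg):  # temporarily allow key deletion on a struct config
        cfg.model.pop("input_modality", None)
        cfg.task.pop("input_modality", None)
torch.save(ckpt, "/path/to/the/checkpoint_fixed.pt")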

javadpeyman commented on September 10, 2024

Hello,
I want to use your code for another language. I created a tokenizer for my textual data with gen_subword.py (I checked it and it's correct), but in LabelEncoderS2SToken in hubert_pretraining.py all the tokens map to 3 (unknown). The characters of my dataset are similar to the Arabic language. Do I have to change the "s2s_tokenizer", or does it support all languages? @chevalierNoir

YadiraRoCa commented on September 10, 2024

@chevalierNoir I removed the input_modality in AVHubertConfig, but the error remains the same.
I followed the fine-tuning command:
fairseq-hydra-train --config-dir ~/av_hubert/avhubert/conf/finetune/ --config-name large_lrs3_433h.yaml \
  task.data=~/data_hubert/433h_data/tsv_path task.label_dir=~/data_hubert/433h_data/wrd_path \
  task.tokenizer_bpe_model=~/data_hubert/spm1000/spm_unigram1000.model \
  model.w2v_path=~/data_hubert/checkpoints/large_lrs3_iter5.pt \
  hydra.run.dir=~/data_hubert/experiment/finetune/ common.user_dir=`pwd`

chevalierNoir commented on September 10, 2024

Hello, I want to use your code for another language. I created a tokenizer for my textual data with gen_subword.py (I checked it and it's correct), but in LabelEncoderS2SToken in hubert_pretraining.py all the tokens map to 3 (unknown). The characters of my dataset are similar to the Arabic language. Do I have to change the "s2s_tokenizer", or does it support all languages? @chevalierNoir

Hi,

The tokenizer we use is a unigram sentencepiece model. It should also work for Arabic-like languages. It's probably an issue with the text label file (i.e., *.wrd). Please check whether the text in the label file can be read and encoded correctly here.
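
As a quick sanity check, sketched under the assumption of a standard sentencepiece model file and a local *.wrd file (paths are placeholders), you can verify that lines are not encoded entirely into the unknown id:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm_unigram1000.model")   # path to your tokenizer model
unk = sp.unk_id()                  # unrepresentable tokens map to this id

with open("train.wrd", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        ids = sp.EncodeAsIds(line.strip())
        if ids and all(i == unk for i in ids):
            print(f"line {line_no} encodes entirely to <unk>: {line.strip()!r}")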

chevalierNoir commented on September 10, 2024

@chevalierNoir I removed the input_modality in AVHubertConfig, but the error remains the same. I followed the fine-tuning command: fairseq-hydra-train --config-dir ~/av_hubert/avhubert/conf/finetune/ --config-name large_lrs3_433h.yaml task.data=~/data_hubert/433h_data/tsv_path task.label_dir=~/data_hubert/433h_data/wrd_path task.tokenizer_bpe_model=~/data_hubert/spm1000/spm_unigram1000.model model.w2v_path=~/data_hubert/checkpoints/large_lrs3_iter5.pt hydra.run.dir=~/data_hubert/experiment/finetune/ common.user_dir=`pwd`

Did you use the same fairseq version as provided in the repo? There will be an error if a more recent version of fairseq is used.

YadiraRoCa commented on September 10, 2024

The problem was related to the version of fairseq. It works now, thank you so much @chevalierNoir

Rongjiehuang commented on September 10, 2024

Hi @chevalierNoir, might it be possible to use a recent fairseq to fine-tune AV-HuBERT? I ask because some of my implementations are based on a more recent version.

chevalierNoir commented on September 10, 2024

Hi @chevalierNoir, might it be possible to use a recent fairseq to fine-tune AV-HuBERT? I ask because some of my implementations are based on a more recent version.

It should be OK. But you may need to update some arguments depending on the version you use.

CCTN-BCI commented on September 10, 2024

@chevalierNoir Thank you for your previous answers. Currently, I'm trying to fine-tune the LRS3 checkpoint and I'm using AV-HuBERT Large LRS3 (in section: Pretrained Models). But I got this error: [...] omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
I checked the class AVHubertPretrainingConfig(FairseqDataclass) in hubert_pretraining.py and this key isn't there. Furthermore, I observed that this key is referenced in hubert.py (line 67: input_modality: str = II("task.input_modality")).
Should I add this key or is this key the same as 'task.modalities'?

The input_modality in AVHubertConfig is a deprecated argument and can be removed. But keeping it there won't affect the current model training. If you follow the fine-tuning command here, it should work.

I still have an error now!
I followed the instructions in the README for loading a pre-trained model:

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

The error is as follows:
omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
    full_key: input_modality
    reference_type=Optional[AVHubertPretrainingConfig]
    object_type=AVHubertPretrainingConfig

How should I remove the argument 'input_modality'? Thank you very much!
