
cav-mae's People

Contributors

yuangongnd


cav-mae's Issues

Question for contrastive loss weight in the paper

I have a question regarding the loss weights used in CAV-MAE. It seems that $\lambda_c$ could play an important role in the optimization. I understand it is due to the gradient scale, but it is surprising that the ablation study for CAV (contrastive loss only) still requires $\lambda_c$ to be $0.1$ or $0.01$. I am wondering what happens if $\lambda_c$ is set to $1$. Would that lead to an overfitting issue?
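
For reference, a minimal sketch of the weighted objective under discussion, assuming the standard CAV-MAE formulation in which the reconstruction and contrastive terms are combined with the contrastive weight $\lambda_c$:

$\mathcal{L} = \mathcal{L}_{r} + \lambda_c \cdot \mathcal{L}_{c}$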

Best,

Kun

Some confusion about this paper and its implementation

I have a few points of confusion about this paper.

   The first is that a contrastive loss usually consists of two parts (an audio-to-visual and a visual-to-audio similarity term). What is the reason for using only the single visual-to-audio term in this paper?

   The second is: how is "frame aggregation" implemented? In other words, how can I get the image frames from the whole video?

   The third is: what is the design purpose of the modality type embeddings Ea and Ev?

Multi-gpu pre-training

Dear Yuan, thank you for releasing the source code for your great project. It looks like the current code only supports pre-training on a single GPU (e.g., missing DDP data sampler) while your paper indicates that the released models were pre-trained on 4 GPUs.

If possible, could you please update the code to support multi-GPU pre-training?

Thank you.
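
For reference, a hedged sketch of the kind of change being requested, assuming a standard PyTorch DistributedDataParallel setup; train_dataset, audio_model, batch_size, and num_epochs are placeholders, not the repository's actual variables:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# assumes the script is launched with torchrun, which sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

sampler = DistributedSampler(train_dataset, shuffle=True)  # replaces shuffle=True in the DataLoader
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler,
                          num_workers=8, pin_memory=True)

audio_model = DDP(audio_model.cuda(local_rank), device_ids=[local_rank])

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # gives each epoch a different shuffle across ranks
    for batch in train_loader:
        ...  # forward / backward exactly as in the single-GPU training loop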

Just suggesting a small change to the "Loading model for finetuning" example

In the markdown for CAV-MAE it should be: audio_model = CAVMAEFT(label_dim=n_class, modality_specific_depth=11)
instead of: audio_model = models.CAVMAEFT(label_dim=n_class, modality_specific_depth=11)

Full code below:

import torch
from models import CAVMAEFT
model_path = 'the path to your model location'
n_class = 527 # 527 for audioset finetuned models, 309 for vggsound finetuned models

# CAV-MAE model without decoder
audio_model = models.CAVMAEFT(label_dim=n_class,
                              modality_specific_depth=11)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mdl_weight = torch.load(model_path, map_location=device)
audio_model = torch.nn.DataParallel(audio_model) # it is important to convert the model to dataparallel object as all weights are saved in dataparallel format (i.e., in module.xxx)
miss, unexpected = audio_model.load_state_dict(mdl_weight, strict=False)
print(miss, unexpected) # check if all weights are correctly loaded, if you are loading a model with decoder, you will see decoders are unexpected and missed are newly initialized classification heads

This returns an error: name 'models' is not defined.

Instead it should be:

import torch
from models import CAVMAEFT
model_path = 'the path to your model location'
n_class = 527 # 527 for audioset finetuned models, 309 for vggsound finetuned models

# CAV-MAE model without decoder
audio_model = CAVMAEFT(label_dim=n_class,
                       modality_specific_depth=11)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mdl_weight = torch.load(model_path, map_location=device)
audio_model = torch.nn.DataParallel(audio_model) # it is important to convert the model to dataparallel object as all weights are saved in dataparallel format (i.e., in module.xxx)
miss, unexpected = audio_model.load_state_dict(mdl_weight, strict=False)
print(miss, unexpected) # check if all weights are correctly loaded, if you are loading a model with decoder, you will see decoders are unexpected and missed are newly initialized classification heads

Acquiring checkpoints of VGGSound (audio), VGGSound (video)

Hi Yuan,

Could you please release the checkpoints of VGGSound (audio) and VGGSound (video), or send me a copy of them? The checkpoints would help me reproduce the results (59.5 and 47.0) in Table 1.
Besides, what is the setting of 'CAV-MAE-Scale++'? I cannot find the meaning of '++' in your paper. If it differs from the '+' version, could you please send me a copy of CAV-MAE-Scale+ so I can reproduce the result (19.8) in Table 1?

Best regards,

installation

  1. Create a virtual environment using conda:
conda create --name venv python=3.8 -y
conda activate venv
  2. Install compatible packages:
pip install timm==0.4.5
pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
pip install numpy==1.21.6
pip install scikit-learn
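
As a quick, hedged sanity check after installation (the version strings are simply the ones pinned above):

import torch, torchvision, torchaudio, timm, numpy, sklearn

# confirm the pinned versions were actually installed and that CUDA is visible
print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print('timm', timm.__version__, '| numpy', numpy.__version__, '| scikit-learn', sklearn.__version__)
print('CUDA available:', torch.cuda.is_available())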

Where is contrastive loss implemented? How are the positive and negative samples defined?

I have a question regarding the code and the paper. I can't identify where the contrastive loss code is, or how the positive and negative samples are defined. Reading your SSAST paper gave me a vague idea of how the contrastive loss might have been implemented (seemingly by matching masked patches), but I would like to look further into the code and understand your implementation. Could you give me some more explanation of where and how this is implemented?
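
For orientation, a minimal sketch of an in-batch audio-visual contrastive loss of the kind the paper describes; this is an illustration, not the repository's actual code. The positive for each clip is its own paired audio/visual embedding, and the negatives are all other clips in the same batch:

import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, visual_emb, tau=0.05):
    # audio_emb, visual_emb: [batch, dim] clip-level embeddings (e.g., mean-pooled tokens)
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    sim = v @ a.t() / tau                 # (i, j): similarity of clip i's video with clip j's audio
    targets = torch.arange(sim.size(0), device=sim.device)
    # diagonal entries are the positives (same clip); off-diagonal entries act as negatives
    return F.cross_entropy(sim, targets)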

Zero-shot Code

Thanks for the great work! Could you share the code for zero-shot transfer on ESC-50 and UrbanSound8K? Thanks.

Finetune CAVMAE on ESC50

Hi Yuan, did you finetune CAV-MAE on the ESC-50 dataset? Could you advise me on the training pipeline? Thank you very much.

Error when loading the CAV-MAE model

Hello,

I am trying to fine-tune CAV-MAE for an audio classification task, and I loaded the model according to the provided snippet. However, when I do so I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_27/285712233.py in <module>
      4 audio_model = CAVMAE(audio_length=1024,
      5                      modality_specific_depth=11,
----> 6                      norm_pix_loss=True, tr_pos=False)
      7
      8 mdl_weight = torch.load(model_path, map_location=CFG.device)

/kaggle/working/cav-mae/src/models/cav_mae.py in __init__(self, img_size, audio_length, patch_size, in_chans, embed_dim, modality_specific_depth, num_heads, decoder_embed_dim, decoder_depth, decoder_num_heads, mlp_ratio, norm_layer, norm_pix_loss, tr_pos)
     93
     94         # audio-branch
---> 95         self.blocks_a = nn.ModuleList([Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in range(modality_specific_depth)])
     96         # visual-branch
     97         self.blocks_v = nn.ModuleList([Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in range(modality_specific_depth)])

/kaggle/working/cav-mae/src/models/cav_mae.py in <listcomp>(.0)
     93
     94         # audio-branch
---> 95         self.blocks_a = nn.ModuleList([Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in range(modality_specific_depth)])
     96         # visual-branch
     97         self.blocks_v = nn.ModuleList([Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in range(modality_specific_depth)])

/kaggle/working/cav-mae/src/models/cav_mae.py in __init__(self, dim, num_heads, mlp_ratio, qkv_bias, qk_scale, drop, attn_drop, drop_path, act_layer, norm_layer)
     41         self.norm1_v = norm_layer(dim)
     42         self.attn = Attention(
---> 43             dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
     44         # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
     45         self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

TypeError: __init__() got an unexpected keyword argument 'qk_scale'

I get a similar error when trying to load the CAV-MAE-FT model for AudioSet.
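
A hedged observation rather than a confirmed fix: the Attention class being constructed appears to come from timm (an assumption), and newer timm releases removed the qk_scale argument, while the installation notes above pin timm==0.4.5. A quick check under that assumption:

import timm

# the install instructions in this thread pin timm==0.4.5; newer timm versions dropped
# the qk_scale argument from Attention, which would match this TypeError (an assumption)
print(timm.__version__)
# if the version differs, reinstalling the pinned one may help:
#   pip install timm==0.4.5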

Video Only results on AudioSet-20K

Hi,

I can reproduce the results of finetuning on AudioSet-20K in the multimodal setting.
However, when I try to get the visual-only baseline from the multimodal pre-trained weights, I can only get 13.11 mAP (missingvideoonly setting) and 12.7 mAP (videoonly).

A ViT-B/16 pretrained on ImageNet-21K with the multimodal script can get ~9.5 mAP.
A pretrained CAV-MAE with the audioonly script can get ~17.6 mAP.

Do you have the script for training the videoonly baseline?

Thank you

Which epoch of pre-trained models should I use?

Hi,

I just noticed that the script you provided uses audio_model.21. Does that mean you use the model from the 21st epoch?

Since the model is pre-trained for more epochs than that, this is a little bit confusing to me.

Thank you

Usage of audio-modality components for visual embeddings

Hey! In the following line of code from the decoder of CAV-MAE:

v_ = torch.cat([x[:, self.patch_embed_a.num_patches-int(mask_a[0].sum()):, :], mask_tokens_v], dim=1) # no cls token
you are using audio components to create the visual inputs to the decoder. Was this a deliberate choice? (To me, it looks like you copy-pasted the code from the audio modality and forgot to make some changes.) Thanks!
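
One hedged reading of that slice, as an indexing illustration only (not a confirmed explanation from the authors): if x holds the visible audio tokens followed by the visible visual tokens, then patch_embed_a.num_patches - mask_a[0].sum() equals the number of visible audio tokens, so slicing from that index onward selects exactly the visible visual tokens:

import torch

# toy shapes for illustration only (not the repository's real sizes)
B, D = 2, 768
num_patches_a, num_patches_v = 512, 196
num_visible_a = num_patches_a - 384   # e.g., 384 audio patches masked
num_visible_v = num_patches_v - 147   # e.g., 147 visual patches masked

# x is assumed to contain the visible audio tokens first, then the visible visual tokens
x = torch.randn(B, num_visible_a + num_visible_v, D)
mask_tokens_v = torch.zeros(B, num_patches_v - num_visible_v, D)

# the questioned slice: starting at num_visible_a skips all audio tokens, so only the
# visible *visual* tokens are concatenated with the visual mask tokens
v_ = torch.cat([x[:, num_visible_a:, :], mask_tokens_v], dim=1)
print(v_.shape)  # torch.Size([2, 196, 768]) == [B, num_patches_v, D]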

Question about some irregular videos in AudioSet-20k

Hi,

I tried to follow the finetuning protocol on AudioSet-20k, and I have downloaded ~18k training samples. However, I found that some videos are irregular according to the official '.csv' file, i.e., they are either shorter than 10 s or their start-end times exceed the total video length. Could you please tell me how to preprocess these irregular videos?

Some examples are attached below,
Tr7pmnO3eHo, 100.000, 110.000, "/m/03cl9h,/m/04rlf,/m/09x0r,/m/0ytgt" (the video is only 2 s long on the web)
d7vfbyFl5kc, 0.000, 3.000, "/m/0c1dj,/t/dd00121" (the specified time window is less than 10 s)

Thank you

retrieval evaluation

Hi, I just want to make sure I understand this correctly. Based on retrieval.py, only the middle frame (frame_use = 5) of the video is used. Do I understand that correctly?

Also, I am wondering what the reason is for using a for loop to compute the similarities one by one? It turns out to be very slow. Could we just do something similar to CLIP zero-shot retrieval? I tried both implementations and the results look the same to me, but the batched version is much faster.
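
For reference, a hedged sketch of the batched alternative being suggested (CLIP-style matrix similarity); the names are illustrative and this is not the repository's retrieval.py:

import torch
import torch.nn.functional as F

def batched_similarity(audio_emb, visual_emb):
    # audio_emb, visual_emb: [N, dim] clip-level embeddings for the whole evaluation set
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    # one matrix multiply yields all N x N cosine similarities, replacing the pairwise loop
    return v @ a.t()

sim = batched_similarity(torch.randn(100, 768), torch.randn(100, 768))
ranks = sim.argsort(dim=-1, descending=True)  # recall@k can be read off the ranked rows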

Question Regarding stat calculation of dataset

# dataset spectrogram mean and std, used to normalize the input
self.norm_mean = self.audio_conf.get('mean')
self.norm_std = self.audio_conf.get('std')
# skip_norm is a flag for skipping normalization so that the normalization stats can be
# computed with src/get_norm_stats.py; if True, input normalization is skipped so the
# stats are calculated correctly.
# Set it to True ONLY when you are computing the normalization stats.
self.skip_norm = self.audio_conf.get('skip_norm') if self.audio_conf.get('skip_norm') else False
if self.skip_norm:
    print('now skip normalization (use it ONLY when you are computing the normalization stats).')
else:
    print('use dataset mean {:.3f} and std {:.3f} to normalize the input.'.format(self.norm_mean, self.norm_std))

Hello, I would like some help with how to get the mean and std stats for AudioSet. I see that this code is related to getting the normalization stats, but it isn't clear to me from the comments how to actually obtain them.
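
A hedged sketch of what the comments seem to describe (an illustration, not the repository's actual src/get_norm_stats.py): build the dataset with skip_norm enabled so that un-normalized spectrograms come out, then accumulate per-batch mean and standard deviation and average them (a common approximation):

import torch
from torch.utils.data import DataLoader

# `dataset` is assumed to be the repo's audio dataset constructed with an audio_conf
# that sets 'skip_norm': True (and dummy 'mean'/'std'), so fbanks come out un-normalized
def compute_norm_stats(dataset, batch_size=64, num_batches=None):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=4)
    means, stds = [], []
    for i, batch in enumerate(loader):
        fbank = batch[0]  # assumed: the un-normalized log-mel spectrogram tensor
        means.append(fbank.mean())
        stds.append(fbank.std())
        if num_batches is not None and i + 1 >= num_batches:
            break
    mean, std = torch.stack(means).mean().item(), torch.stack(stds).mean().item()
    print('dataset mean {:.3f}, std {:.3f}'.format(mean, std))
    return mean, std

The resulting mean and std can then be written back into audio_conf with skip_norm turned off.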

How to download the MSR-VTT dataset?

Hi, I like your marvelous work, including the large number of experiments.
I'm working on a project based on your work, and I want to ask you a question.
You reported training and zero-shot retrieval results on the MSR-VTT dataset.
However, I cannot find any GitHub repository or link for downloading the dataset.
So may I ask how to download the MSR-VTT dataset, and how you preprocessed it? (Is there any GitHub code that you referred to?)
Thank you.

-Kyeongha

Could you release the checkpoints pretrained on Kinetics-400?

Hi Yuan, in the paper you mentioned that "Specifically, we train the model on Kinetics-400 (K400) dataset and report the top-1 accuracy on Kinetics-Sounds", I'm wondering if it is possible that you could release the checkpoints pretrained on K400 dataset for action recognition tasks. It appears that the repo currently provides checkpoints only for the event classification tasks. Thanks a lot!

BOM Considerations When Extracting Your Video frames & Audio

When I tried loading my own data using src/preprocess/extract_{audio,video_frame}.py, I encountered a problem:
(the path in my csv file): No such file or directory
was displayed, despite the fact that the path actually exists and is correctly written in the csv file.

This was due to a BOM, and it was solved by passing encoding="utf-8-sig" as an argument to np.loadtxt in those files.
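
A minimal illustration of the described fix (the csv filename and dtype below are placeholders, not the repository's exact call):

import numpy as np

# 'utf-8-sig' strips a leading byte-order mark, so the first path in the csv no longer
# carries an invisible '\ufeff' prefix that makes existing files look missing
video_list = np.loadtxt('my_video_list.csv', delimiter=',', dtype=str, encoding='utf-8-sig')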

I am writing this here because the error message does not directly show the cause of the problem, which may confuse other users.

Not found the sample_video_extract_list.csv

Hi, Dr. Gong!

Thanks for releasing the code of CAV-MAE. It's not difficult to see that it is great work!
My only question concerns something you mentioned in the README:

Both scripts are simple, you will need to prepare a csv file containing a list of video paths (see src/preprocess/sample_video_extract_list.csv for an example)

but I haven't found that file. For those who are new to this field, it may be difficult to follow.

Can you upload it? Thanks a lot!

Pretraining cav-mae on K400

Hi Dr. Gong, thank you very much for your amazing work. I would like to pretrain the CAV-MAE model on the K400 dataset. Could I ask whether anything needs to be modified or changed for that?

what is the validation set for finetuning?

In your paper, you mention using AudioSet-2M and AudioSet-20k for fine-tuning in your experiments. However, I am curious about how the training and validation data are split during the fine-tuning stage.

I know that the VGGSound dataset already provides a training/validation split.

Could you kindly elaborate on how this division was carried out for AudioSet? Since the fine-tuning stage always needs training and validation data, do you use AudioSet-2M or AudioSet-20k to finetune and the AudioSet eval set to validate?

Audio Event Classification resulting tensor has all negative values

Using the following pretrained model for audio tagging (based on the AudioSet ontology):

Pretrained Model: CAV-MAE-Scale+
Pretrain Data: AudioSet-2M (multi-modal)
Finetune Data: AudioSet-2M (audio)
Performance: 46.6 mAP

When I get the result from CAVMAEFT using mode='audioonly', the entire output tensor of length 527 has negative values, and it doesn't sum to 1. When I argsort it and look at the top few tags (despite the negative values), they seem somewhat correct (on unseen audio data).

Is getting an entire array of negative values (which doesn't sum to 1) expected?
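
A hedged note on interpreting the output, assuming the fine-tuned head returns raw logits for multi-label AudioSet classification (which would explain values that are negative and do not sum to 1): per-class probabilities are usually obtained with a per-class sigmoid rather than a softmax.

import torch

logits = torch.randn(527) - 2.0   # stand-in for the model output described above
probs = torch.sigmoid(logits)     # per-class probabilities in [0, 1]; they need not sum to 1
top5 = torch.argsort(probs, descending=True)[:5]
print(top5, probs[top5])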

Some problems with finetuning

Dear Dr Gong,

I tried to fine-tune the audio encoder on the ESC-50 dataset using the pretrained CAV-MAE model, but the performance is far from my expectation. I have listed all the details and tricks that I used below. I wonder if I missed anything during finetuning.

I trained your audio-MAE (one branch of CAV-MAE) model (using ViT-B and batch size = 256) on the K400 training set for 200 epochs, then loaded the model and fine-tuned it on ESC-50.

ESC-50 audio clips are 5 seconds long, so I use num_melbin = 128 and target_length = 512, instead of target_length = 1024 as used for K400 and AudioSet (where the audio is 10 s long). I do not load the positional embedding from the pretrained model since there is a mismatch in the positional embedding size.
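
For the frame-count arithmetic above, a hedged illustration assuming the usual 10 ms frame shift of torchaudio's Kaldi-style fbank, so a 5 s clip yields roughly 500 frames that get padded or cropped to target_length = 512 (the filename is hypothetical):

import torch
import torchaudio

waveform, sr = torchaudio.load('esc50_clip.wav')   # hypothetical 5 s mono clip
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr,
    num_mel_bins=128, frame_shift=10)               # 10 ms shift -> ~100 frames per second

target_length = 512                                 # vs. 1024 for 10 s AudioSet/K400 clips
pad = target_length - fbank.shape[0]
if pad > 0:
    fbank = torch.nn.functional.pad(fbank, (0, 0, 0, pad))  # zero-pad the time dimension
else:
    fbank = fbank[:target_length, :]
print(fbank.shape)  # torch.Size([512, 128])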

During finetuning, I followed the same data augmentation (frequency masking and time masking) with the hyperparameters from your "SSAST" folder for ESC-50. I also use scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=lr_patience, verbose=True) to tune the learning rate, and I tried several fixed learning rates and head learning-rate ratios with learning-rate decay. However, the validation accuracy stays around 60% and can hardly go up.

I found another paper, VICReg. In their appendix, they use supervised learning for ESC-50 with a ResNet-18 backbone, without any pretrained model, and get 72.7% accuracy (see the screenshot below). I just wonder if there is anything I missed for finetuning.

[Screenshot, 2024-02-07: the VICReg appendix table showing the 72.7% ESC-50 result]
