
Comments (18)

YuanGongND avatar YuanGongND commented on August 20, 2024

hi Kai,

Thanks for your kind words.

Have you checked Appendix C of the paper? I suggest starting from an AudioSet-pretrained checkpoint. You may need to tune the learning rate.

-Yuan

from cav-mae.

kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, thank you so much for your reply. Could I know whether you adopt both visual and audio features for action recognition? Also, could you please share a download link for K400 for reproduction?


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

CAV-MAE itself by default is a multi-modal audio-visual model. We didn't do special adaptation for K400.

The K400 dataset is at https://www.deepmind.com/open-source/kinetics.

You may find direct download links, e.g. https://github.com/cvdfoundation/kinetics-dataset.

-Yuan


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, I tried the dataset downloaded from cvdfoundation. When I extracted the audio waveform from a video, I met this issue: FileNotFoundError: No such file or directory: ./iEzpnawqqZ8_000036_000046_intermediate.wav. When I open this video (id: iEzpnawqqZ8), I don't hear anything. Could this be caused by the video being silent? And how did you solve this issue? Many thanks.


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

This should be a question for cvdfoundation.

I honestly cannot remember (and I should not speak for them); I have processed 10+ datasets in the past month and cannot recall all the details.

-Yuan


YuanGongND avatar YuanGongND commented on August 20, 2024

This repo contains a sample script to extract audio track from video: https://github.com/YuanGongND/cav-mae/blob/master/src/preprocess/extract_audio.py
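For readers without the repo at hand, the core of such a script can be sketched as building an ffmpeg command that strips the video stream and resamples the audio. This is a minimal sketch, not the repo's exact script: the 16 kHz mono settings and file names are illustrative assumptions.

```python
# Minimal sketch (not the repo's exact extract_audio.py): build an ffmpeg
# command that pulls a 16 kHz mono wav out of a video file. The sample rate
# and paths are illustrative assumptions.
def build_extract_cmd(video_path, wav_path, sample_rate=16000):
    """Return the ffmpeg argv list to extract the audio track of a video."""
    return [
        'ffmpeg', '-i', video_path,   # input video
        '-vn',                        # drop the video stream
        '-ac', '1',                   # downmix to mono
        '-ar', str(sample_rate),      # resample to the target rate
        wav_path,
    ]

cmd = build_extract_cmd('iEzpnawqqZ8_000036_000046.mp4',
                        'iEzpnawqqZ8_000036_000046.wav')
# To actually run it: subprocess.run(cmd, check=True)
print(' '.join(cmd))
```

ffmpeg will fail on files without any audio stream, which is why a batch script usually wraps this call in error handling.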


kaiw7 avatar kaiw7 commented on August 20, 2024


Yes, I used this script to extract the audio. I guess the error is caused by some video files not containing an audio stream. In your K400 experiments, did you use the train and val splits for finetuning and evaluation? I am wondering whether the K400 results in the paper are based on the val data?


YuanGongND avatar YuanGongND commented on August 20, 2024

I just checked: we used the validation set (not the test set) in our code, which is consistent with what is indicated in the paper.

In general, you can trust the paper; when I wrote it, I double-checked against the codebase to make sure it is accurate. But with time, my memory gets blurry on the details.

When I checked the codebase, I found I did replace the corrupted videos as indicated in the cvdfoundation repo:

News: users found ~1400 corrupted videos. A replacement for the vast majority can be found here.

Is this related to your issue?


YuanGongND avatar YuanGongND commented on August 20, 2024

Btw, our dataloader does tolerate missing audio / visual frames. So if it is just a single file problem, you can probably ignore it.

cav-mae/src/dataloader.py

Lines 236 to 240 in a7a658a

try:
    fbank = self._wav2fbank(datum['wav'], mix_datum['wav'], mix_lambda)
except:
    fbank = torch.zeros([self.target_length, 128]) + 0.01
    print('there is an error in loading audio')


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, thank you very much for your sharing. I met the error when I extracted the audio waveforms offline using 'extract_audio.py'. I am not sure if you met a similar issue when you preprocessed the K400 dataset, since it seems that some downloaded video files do not contain an audio stream, or contain only one audio channel, which causes errors. Just to confirm: does your 'extract_audio.py' expect videos with two-channel audio by default?
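One way to sidestep the missing-stream errors is to pre-filter the videos with ffprobe before extraction. A hedged sketch, assuming ffprobe's default `-show_streams` text output (one `codec_type=...` line per stream); the file names are illustrative:

```python
# Hedged sketch: skip videos that have no audio stream before running
# extraction. Assumes ffprobe's `-show_streams` text output, which prints
# one `codec_type=...` line per stream.
import subprocess

def has_audio_stream_text(ffprobe_output):
    """Return True if a `codec_type=audio` line appears in ffprobe output."""
    return any(line.strip() == 'codec_type=audio'
               for line in ffprobe_output.splitlines())

def has_audio_stream(video_path):
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_streams', video_path],
        capture_output=True, text=True).stdout
    return has_audio_stream_text(out)
```

Videos that fail this check can be logged and skipped, so the batch extraction run doesn't abort midway.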


kaiw7 avatar kaiw7 commented on August 20, 2024

Also, could you please provide the path list of the K400 video files you used for training and testing?


kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, could I know whether 'traintest_ft.py' and 'run_cavmae_ft.py' can be used for finetuning on K400 or other action datasets (with either the audio-visual branch or the visual-only branch)? Many thanks.


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

For your first question:

os.system('sox {:s} {:s} remix 1'.format(output_f_1, output_f_2))

It should be able to process both 1-channel and multi-channel audio; you can test it yourself.

whether the 'traintest_ft.py' and 'run_cavmae_ft.py' can be used for finetuning the k400 or other action dataset

Yes, in the research codebase we used the same code for K400 and KS as for AS/VGG; the research code was then polished and released in its current form. So I assume minimal or no changes are needed in these two files for K400/KS.

I am not sure if you met this similar issue when you proprocessed the K400 dataset since it seems that the some downloaded videos files do not contain audio stream or only contain one audio stream so as to occur errors.

A missing audio track will lead to an error, but as I said above, the dataloader will ignore it. A single-channel audio track should be fine; see my first point. I cannot recall if I got this error (I processed 10+ datasets in the past month, and a lot more before), but K400/KS are quite commonly used datasets, and I don't think there is a major problem with them.

Also, could you please provide the path list of k400 video files you used for training and testing ?

I tend to release only things that are "clean" and double-checked rather than researchy code/data, to avoid misleading people. With the limited time I have, I prioritize things in the main manuscript (AS/VGGSound). Another reason is that K400 is dominated by the visual modality, i.e., the visual branch is much better than the audio branch and is close to the combination of the audio and visual branches.

-Yuan


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, thank you very much for your kind reply. Could you please share the download links for AudioSet-20K, Kinetics-Sounds 32, and the datasets for retrieval and inpainting?


YuanGongND avatar YuanGongND commented on August 20, 2024
  1. AudioSet-20K: These are YouTube videos, so we cannot provide direct downloads; all video ids we used are released, see https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists

  2. Kinetics-Sounds 32: This is a subset of K400; check the 32 classes in the footnote on page 14 of the paper.

  3. Dataset for retrieval: see https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists

  4. Inpainting: By default, we use the VGGSound eval set (independent from the AudioSet training set) for the inpainting experiments.


kaiw7 avatar kaiw7 commented on August 20, 2024

Thank you very much for your sharing. Does that mean I can extract AudioSet-20K based on your provided video ids after I download the full AudioSet? Could I know how much storage AudioSet will occupy? (I would like to estimate whether it will exceed my drive.)


YuanGongND avatar YuanGongND commented on August 20, 2024

That means I can extract the AudioSet-20K based on your provided video-ids after I download full AudioSet?

Yes
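Carving the 20K subset out of a full download amounts to intersecting the downloaded files with the released id list. A hedged sketch; the file-naming convention (video id as the file stem) is an assumption for illustration, not necessarily how the released lists are keyed:

```python
# Hedged sketch: select AudioSet-20K out of a full AudioSet download by
# intersecting downloaded files with the released id list. The naming
# convention (video id as the file stem) is an illustrative assumption.
import os

def select_subset(downloaded_files, released_ids):
    """Keep only files whose stem (video id) is in the released id list."""
    released = set(released_ids)
    return [f for f in downloaded_files
            if os.path.splitext(os.path.basename(f))[0] in released]

files = ['/data/audioset/abc123.wav', '/data/audioset/def456.wav']
print(select_subset(files, ['abc123']))  # -> ['/data/audioset/abc123.wav']
```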

Could I know how much memory the AudioSet will occupy? (since I would like to estimate if it will exceed my drive)

It depends on the sampling rate / bit rate; you can get an estimate based on a few samples you download.
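That back-of-the-envelope estimate can be sketched as averaging the size of a few downloaded samples and scaling by the clip count (all numbers below are illustrative, not measured):

```python
# Sketch of the suggested estimate: average a few downloaded sample sizes
# and scale by the clip count. The sizes and counts below are illustrative.
def estimate_total_bytes(sample_sizes_bytes, num_clips):
    avg = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    return avg * num_clips

# e.g. three ~10 MB samples and the ~20,000 clips of AudioSet-20K
est = estimate_total_bytes([10_000_000, 11_000_000, 9_000_000], 20_000)
print(f'{est / 1e9:.0f} GB')  # -> 200 GB
```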


kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, I am trying to use the 'cav-mae-scale++.pth' checkpoint to finetune on the KS-32 training set and evaluate on the validation set. I use your shared 'run_cavmae_ft.py' and 'traintest_ft.py' with the required data format. I use the hyperparameters in 'run_cavmae_ft_full.sh', modify 'ftmodel=videoonly' and the batch size due to GPU memory, and set the loss type to CE and the metric type to acc. Could I know if this is correct for KS-32 finetuning? I would also appreciate it if you could share the training hyperparameters for K400 and KS-32. Thank you very much.

