
Comments (18)

YuanGongND avatar YuanGongND commented on August 20, 2024

hi Kai,

Thanks for your kind words.

Have you checked Appendix C of the paper? I suggest starting from an AudioSet-pretrained checkpoint. You may need to tune the learning rate.

-Yuan

from cav-mae.

kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, thank you so much for your reply. Could I know whether you adopt both visual and audio features for action recognition? Also, could you please share a download link for K400 for reproduction?


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

CAV-MAE itself by default is a multi-modal audio-visual model. We didn't do special adaptation for K400.

The K400 dataset is at https://www.deepmind.com/open-source/kinetics.

You may find direct download links, e.g. https://github.com/cvdfoundation/kinetics-dataset.

-Yuan


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, I tried the dataset downloaded from cvdfoundation. When I extracted the audio waveform from a video, I met this issue: FileNotFoundError: No such file or directory: ./iEzpnawqqZ8_000036_000046_intermediate.wav. When I open this video (id: iEzpnawqqZ8), I don't hear anything. Could this be caused by the video being silent? And how did you solve this issue? Many thanks.


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

This should be a question for cvdfoundation.

I honestly cannot remember (and I should not speak for them); I have processed 10+ datasets in the past month and cannot recall all the details.

-Yuan


YuanGongND avatar YuanGongND commented on August 20, 2024

This repo contains a sample script to extract audio track from video: https://github.com/YuanGongND/cav-mae/blob/master/src/preprocess/extract_audio.py
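For readers without the repo at hand, the core of such a script can be sketched as building an ffmpeg command that strips the video stream and resamples the audio. This is a minimal sketch, not the repo's exact script: the 16 kHz mono settings and file names are illustrative assumptions.

```python
# Minimal sketch (not the repo's exact extract_audio.py): build an ffmpeg
# command that pulls a 16 kHz mono wav out of a video file. The sample rate
# and paths are illustrative assumptions.
def build_extract_cmd(video_path, wav_path, sample_rate=16000):
    """Return the ffmpeg argv list to extract the audio track of a video."""
    return [
        'ffmpeg', '-i', video_path,   # input video
        '-vn',                        # drop the video stream
        '-ac', '1',                   # downmix to mono
        '-ar', str(sample_rate),      # resample to the target rate
        wav_path,
    ]

cmd = build_extract_cmd('iEzpnawqqZ8_000036_000046.mp4',
                        'iEzpnawqqZ8_000036_000046.wav')
# To actually run it: subprocess.run(cmd, check=True)
print(' '.join(cmd))
```

ffmpeg will fail on files without any audio stream, which is why a batch script usually wraps this call in error handling.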


kaiw7 avatar kaiw7 commented on August 20, 2024


Yes, I used this script to extract the audio. I guess the error is caused by some video files not containing an audio stream. In your K400 experiments, did you use the train and val splits for finetuning and evaluation? I am wondering whether the K400 results in the paper are based on the val data?


YuanGongND avatar YuanGongND commented on August 20, 2024

I just checked: we used the validation set (not the test set) in our code, which is consistent with what is indicated in the paper.

In general, you can trust the paper; when I wrote it, I double-checked against the codebase to make sure it is accurate. But with time, my memory gets blurry on the details.

When I checked the codebase, I found I did replace the corrupted videos as indicated in the cvdfoundation repo:

News: users found ~1400 corrupted videos. A replacement for the vast majority can be found here.

Is this related to your issue?


YuanGongND avatar YuanGongND commented on August 20, 2024

Btw, our dataloader does tolerate missing audio / visual frames. So if it is just a single file problem, you can probably ignore it.

cav-mae/src/dataloader.py

Lines 236 to 240 in a7a658a

try:
    fbank = self._wav2fbank(datum['wav'], mix_datum['wav'], mix_lambda)
except:
    fbank = torch.zeros([self.target_length, 128]) + 0.01
    print('there is an error in loading audio')


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, thank you very much for your sharing. I met the error when I extracted the audio waveforms offline using 'extract_audio.py'. I am not sure if you met a similar issue when you preprocessed the K400 dataset, since it seems that some downloaded video files do not contain an audio stream, or contain only one audio channel, which causes errors. Just to confirm: does your 'extract_audio.py' expect videos with two-channel audio by default?
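One way to sidestep the missing-stream errors is to pre-filter the videos with ffprobe before extraction. A hedged sketch, assuming ffprobe's default `-show_streams` text output (one `codec_type=...` line per stream); the file names are illustrative:

```python
# Hedged sketch: skip videos that have no audio stream before running
# extraction. Assumes ffprobe's `-show_streams` text output, which prints
# one `codec_type=...` line per stream.
import subprocess

def has_audio_stream_text(ffprobe_output):
    """Return True if a `codec_type=audio` line appears in ffprobe output."""
    return any(line.strip() == 'codec_type=audio'
               for line in ffprobe_output.splitlines())

def has_audio_stream(video_path):
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_streams', video_path],
        capture_output=True, text=True).stdout
    return has_audio_stream_text(out)
```

Videos that fail this check can be logged and skipped, so the batch extraction run doesn't abort midway.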


kaiw7 avatar kaiw7 commented on August 20, 2024

Also, could you please provide the path list of the K400 video files you used for training and testing?


kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, could I know whether 'traintest_ft.py' and 'run_cavmae_ft.py' can be used for finetuning on K400 or other action datasets (with either the audio-visual branch or the visual-only branch)? Many thanks.


YuanGongND avatar YuanGongND commented on August 20, 2024

hi,

For your first question:

os.system('sox {:s} {:s} remix 1'.format(output_f_1, output_f_2))

It should be able to process both 1-channel and multi-channel audio; you can test it yourself.

whether the 'traintest_ft.py' and 'run_cavmae_ft.py' can be used for finetuning the k400 or other action dataset

Yes, in the research codebase we used the same code for K400 and KS as for AS/VGG; the research code was then polished and released in its current form. So I assume minimal or no changes are needed in these two files for K400/KS.

I am not sure if you met this similar issue when you proprocessed the K400 dataset since it seems that the some downloaded videos files do not contain audio stream or only contain one audio stream so as to occur errors.

A missing audio track will lead to an error, but as I said above, the dataloader will ignore it. A single-channel audio track should be fine; see my first point. I cannot recall if I got this error (I processed 10+ datasets in the past month, and a lot more before), but K400/KS are quite commonly used datasets, and I don't think there is a major problem with them.

Also, could you please provide the path list of k400 video files you used for training and testing ?

I tend to release only things that are "clean" and double-checked rather than researchy code/data, to avoid misleading people. With the limited time I have, I prioritize things in the main manuscript (AS/VGGSound). Another reason is that K400 is dominated by the visual modality, i.e., the visual branch is much better than the audio branch and is close to the combination of the audio and visual branches.

-Yuan


kaiw7 avatar kaiw7 commented on August 20, 2024


Hi Yuan, thank you very much for your kind reply. Could you please share the download links for AudioSet-20K, Kinetics-Sounds 32, and the datasets for retrieval and inpainting?


YuanGongND avatar YuanGongND commented on August 20, 2024
  1. AudioSet-20K: These are YouTube videos, so we cannot provide direct downloads; all video ids we used are released, see https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists

  2. Kinetics-Sounds 32: This is a subset of K400; check the 32 classes in the footnote on page 14 of the paper.

  3. Dataset for retrieval: see https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists

  4. Inpainting: By default, we use the VGGSound eval set (independent from the AudioSet training set) for the inpainting experiments.


kaiw7 avatar kaiw7 commented on August 20, 2024

Thank you very much for your sharing. Does that mean I can extract AudioSet-20K based on your provided video ids after I download the full AudioSet? Could I know how much storage AudioSet will occupy? (I would like to estimate whether it will exceed my drive.)


YuanGongND avatar YuanGongND commented on August 20, 2024

That means I can extract the AudioSet-20K based on your provided video-ids after I download full AudioSet?

Yes
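Carving the 20K subset out of a full download amounts to intersecting the downloaded files with the released id list. A hedged sketch; the file-naming convention (video id as the file stem) is an assumption for illustration, not necessarily how the released lists are keyed:

```python
# Hedged sketch: select AudioSet-20K out of a full AudioSet download by
# intersecting downloaded files with the released id list. The naming
# convention (video id as the file stem) is an illustrative assumption.
import os

def select_subset(downloaded_files, released_ids):
    """Keep only files whose stem (video id) is in the released id list."""
    released = set(released_ids)
    return [f for f in downloaded_files
            if os.path.splitext(os.path.basename(f))[0] in released]

files = ['/data/audioset/abc123.wav', '/data/audioset/def456.wav']
print(select_subset(files, ['abc123']))  # -> ['/data/audioset/abc123.wav']
```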

Could I know how much memory the AudioSet will occupy? (since I would like to estimate if it will exceed my drive)

It depends on the sampling rate / bit rate; you can get an estimate based on a few samples you download.
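That back-of-the-envelope estimate can be sketched as averaging the size of a few downloaded samples and scaling by the clip count (all numbers below are illustrative, not measured):

```python
# Sketch of the suggested estimate: average a few downloaded sample sizes
# and scale by the clip count. The sizes and counts below are illustrative.
def estimate_total_bytes(sample_sizes_bytes, num_clips):
    avg = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    return avg * num_clips

# e.g. three ~10 MB samples and the ~20,000 clips of AudioSet-20K
est = estimate_total_bytes([10_000_000, 11_000_000, 9_000_000], 20_000)
print(f'{est / 1e9:.0f} GB')  # -> 200 GB
```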


kaiw7 avatar kaiw7 commented on August 20, 2024

Hi Yuan, I am trying to use the 'cav-mae-scale++.pth' checkpoint to finetune on the KS-32 training set and evaluate on the validation set. I use your shared 'run_cavmae_ft.py' and 'traintest_ft.py' with the required data format. I use the hyperparameters in 'run_cavmae_ft_full.sh', modify 'ftmodel=videoonly' and the batch size due to GPU memory, and set the loss type to CE and the metric type to acc. Could I know if this is correct for KS-32 finetuning? I would also appreciate it if you could share the training hyperparameters for K400 and KS-32. Thank you very much.

