
audioset_tagging_cnn's Introduction

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This repo contains code for our paper: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition [1]. A variety of CNNs are trained on the large-scale AudioSet dataset [2], which contains 5,000 hours of audio with 527 sound classes. A mean average precision (mAP) of 0.439 is achieved with our proposed Wavegram-Logmel-CNN system, outperforming the Google baseline of 0.317 [3]. PANNs can be used for audio tagging and sound event detection, and have been fine-tuned on several audio pattern recognition tasks, where they outperform several state-of-the-art systems.

Environments

The codebase is developed with Python 3.7. Install requirements as follows:

pip install -r requirements.txt

Audio tagging using pretrained models

Users can infer the tags of an audio recording using pretrained models without any training. Details can be viewed in scripts/0_inference.sh. First, download a pretrained model from https://zenodo.org/record/3987831, for example the model named "Cnn14_mAP=0.431.pth". Then, execute the following commands to run inference on this audio clip:

CHECKPOINT_PATH="Cnn14_mAP=0.431.pth"
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_mAP%3D0.431.pth?download=1
MODEL_TYPE="Cnn14"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path="resources/R9_ZSCveAHg_7s.wav" \
    --cuda

The result printed to the screen should look like:

Speech: 0.893
Telephone bell ringing: 0.754
Inside, small room: 0.235
Telephone: 0.183
Music: 0.092
Ringtone: 0.047
Inside, large room or hall: 0.028
Alarm: 0.014
Animal: 0.009
Vehicle: 0.008
embedding: (2048,)

If users would like to use the 16 kHz model for inference, run:

CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth"   # Trained by a later code version, achieves higher mAP than the paper.
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_16k_mAP%3D0.438.pth?download=1
MODEL_TYPE="Cnn14_16k"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging \
    --sample_rate=16000 \
    --window_size=512 \
    --hop_size=160 \
    --mel_bins=64 \
    --fmin=50 \
    --fmax=8000 \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path='resources/R9_ZSCveAHg_7s.wav' \
    --cuda

Sound event detection using pretrained models

Some PANNs, such as DecisionLevelMax (the best), DecisionLevelAvg, and DecisionLevelAtt, can be used for frame-wise sound event detection. For example, execute the following commands to run sound event detection on this audio clip:

CHECKPOINT_PATH="Cnn14_DecisionLevelMax_mAP=0.385.pth"
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_DecisionLevelMax_mAP%3D0.385.pth?download=1
MODEL_TYPE="Cnn14_DecisionLevelMax"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path="resources/R9_ZSCveAHg_7s.wav" \
    --cuda

The visualization of the sound event detection result looks like:

Please see https://www.youtube.com/watch?v=QyFNIhRxFrY for a sound event detection demo.

For users who only want to run inference with the pretrained models, we have prepared the panns_inference tool, which can be installed with:

pip install panns_inference

Please visit https://github.com/qiuqiangkong/panns_inference for details of panns_inference.
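
For reference, here is a minimal usage sketch of panns_inference. The class names and call signatures below follow the panns_inference README at the time of writing and may change between versions, so treat this as an illustration rather than a frozen API:

import numpy as np
import librosa
from panns_inference import AudioTagging, SoundEventDetection, labels

# Load an example clip at 32 kHz mono and add a batch dimension.
(audio, _) = librosa.core.load('resources/R9_ZSCveAHg_7s.wav', sr=32000, mono=True)
audio = audio[None, :]  # (batch_size, segment_samples)

# Audio tagging: clipwise probabilities over the 527 AudioSet classes plus a 2048-d embedding.
at = AudioTagging(checkpoint_path=None, device='cuda')  # None downloads the default checkpoint
(clipwise_output, embedding) = at.inference(audio)
print('Top class:', labels[np.argmax(clipwise_output[0])])

# Sound event detection: framewise probabilities.
sed = SoundEventDetection(checkpoint_path=None, device='cuda')
framewise_output = sed.inference(audio)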

Train PANNs from scratch

Users can train PANNs from scratch as follows.

1. Download dataset

The scripts/1_download_dataset.sh script downloads all audio and metadata from the internet. The total size of AudioSet is around 1.1 TB. Note that some files may be missing from YouTube, so the number of files downloaded by users can differ from time to time. Our downloaded version contains 20550 / 22160 files of the balanced training subset, 1913637 / 2041789 files of the unbalanced training subset, and 18887 / 20371 files of the evaluation subset.

For reproducibility, our downloaded dataset can be accessed at https://pan.baidu.com/s/13WnzI1XDSvqXZQTS-Kqujg (password: 0vc2).

The downloaded data looks like:

dataset_root
├── audios
│    ├── balanced_train_segments
│    |    └── ... (~20550 wavs, the number can be different from time to time)
│    ├── eval_segments
│    |    └── ... (~18887 wavs)
│    └── unbalanced_train_segments
│         ├── unbalanced_train_segments_part00
│         |    └── ... (~46940 wavs)
│         ...
│         └── unbalanced_train_segments_part40
│              └── ... (~39137 wavs)
└── metadata
     ├── balanced_train_segments.csv
     ├── class_labels_indices.csv
     ├── eval_segments.csv
     ├── qa_true_counts.csv
     └── unbalanced_train_segments.csv

2. Pack waveforms into hdf5 files

The scripts/2_pack_waveforms_to_hdf5s.sh script packs all raw waveforms into 43 large hdf5 files to speed up training: one for the balanced training subset, one for the evaluation subset, and 41 for the unbalanced training subset. The packed files look like the tree below (a small h5py reading sketch follows it):

workspace
└── hdf5s
     ├── targets (2.3 GB)
     |    ├── balanced_train.h5
     |    ├── eval.h5
     |    └── unbalanced_train
     |        ├── unbalanced_train_part00.h5
     |        ...
     |        └── unbalanced_train_part40.h5
     └── waveforms (1.1 TB)
          ├── balanced_train.h5
          ├── eval.h5
          └── unbalanced_train
              ├── unbalanced_train_part00.h5
              ...
              └── unbalanced_train_part40.h5
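
As a small sketch of how a packed waveform file can be inspected with h5py. The dataset names 'audio_name', 'waveform' and 'target' are assumptions based on the packing script and may differ in your version:

import h5py
import numpy as np

hdf5_path = 'workspace/hdf5s/waveforms/balanced_train.h5'  # adjust to your workspace layout

with h5py.File(hdf5_path, 'r') as hf:
    audio_name = hf['audio_name'][0].decode()   # name of the first clip
    waveform_int16 = hf['waveform'][0]          # int16 samples of the first clip
    target = hf['target'][0]                    # multi-hot vector over the 527 classes

# Waveforms are stored as int16; convert back to float32 in [-1, 1] before use.
waveform = (waveform_int16 / 32767.0).astype(np.float32)
print(audio_name, waveform.shape, int(target.sum()))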

3. Create training indexes

The scripts/3_create_training_indexes.sh script creates training indexes, which are used for sampling mini-batches.

4. Train

The scripts/4_train.sh script contains training, saving checkpoints, and evaluation.

WORKSPACE="your_workspace"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/main.py train \
  --workspace=$WORKSPACE \
  --data_type='full_train' \
  --window_size=1024 \
  --hop_size=320 \
  --mel_bins=64 \
  --fmin=50 \
  --fmax=14000 \
  --model_type='Cnn14' \
  --loss_type='clip_bce' \
  --balanced='balanced' \
  --augmentation='mixup' \
  --batch_size=32 \
  --learning_rate=1e-3 \
  --resume_iteration=0 \
  --early_stop=1000000 \
  --cuda

Results

The CNN models are trained on a single Tesla V100 PCIe 32 GB card (training also works on a GPU with 12 GB of memory) and take around 3-7 days to train. The training log looks like:

Validate bal mAP: 0.005
Validate test mAP: 0.005
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics.pkl
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics_2019-09-21_04-05-05.pickle
iteration: 0, train time: 8.261 s, validate time: 219.705 s
------------------------------------
...
------------------------------------
Validate bal mAP: 0.637
Validate test mAP: 0.431
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics.pkl
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics_2019-09-21_04-05-05.pickle
iteration: 600000, train time: 3253.091 s, validate time: 1110.805 s
------------------------------------
Model saved to /workspaces/pub_audioset_tagging_cnn_transfer/checkpoints/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/600000_iterations.pth
...

A mean average precision (mAP) of 0.431 is obtained. The training curve looks like:

Results of PANNs on AudioSet tagging. Dashed and solid lines are training mAP and evaluation mAP, respectively. The six plots show the results with different: (a) architectures; (b) data balancing and data augmentation; (c) embedding sizes; (d) amounts of training data; (e) sampling rates; (f) numbers of mel bins.

Performance of different systems

The top rows show previously proposed methods that use the embedding features provided by Google. The previous best system achieved an mAP of 0.369 using large feature-attention neural networks. We propose training neural networks directly on audio recordings. Our Cnn14 achieves an mAP of 0.431, and Wavegram-Logmel-CNN achieves an mAP of 0.439.

Plot figures of [1]

To reproduce all figures of [1], just do:

wget -O paper_statistics.zip https://zenodo.org/record/3987831/files/paper_statistics.zip?download=1
unzip paper_statistics.zip
python3 utils/plot_for_paper.py plot_classwise_iteration_map
python3 utils/plot_for_paper.py plot_six_figures
python3 utils/plot_for_paper.py plot_complexity_map
python3 utils/plot_for_paper.py plot_long_fig

Fine-tune on new tasks

After downloading the pretrained models, building fine-tuned systems for new tasks is simple!

MODEL_TYPE="Transfer_Cnn14"
CHECKPOINT_PATH="Cnn14_mAP=0.431.pth"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/finetune_template.py train \
    --sample_rate=32000 \
    --window_size=1024 \
    --hop_size=320 \
    --mel_bins=64 \
    --fmin=50 \
    --fmax=14000 \
    --model_type=$MODEL_TYPE \
    --pretrained_checkpoint_path=$CHECKPOINT_PATH \
    --cuda

Here is an example of fine-tuning PANNs to GTZAN music classification: https://github.com/qiuqiangkong/panns_transfer_to_gtzan
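
As a rough, simplified sketch of what such a transfer model can look like (not the exact Transfer_Cnn14 in pytorch/finetune_template.py, whose details may differ), one can wrap the pretrained Cnn14, load its checkpoint, and attach a new classification head on top of the 2048-dimensional embedding:

import torch
import torch.nn as nn
from models import Cnn14  # pytorch/models.py must be on the Python path

class TransferCnn14Sketch(nn.Module):
    def __init__(self, sample_rate, window_size, hop_size, mel_bins, fmin, fmax,
                 classes_num_new):
        super().__init__()
        # The base model is built with the original 527 AudioSet classes so that
        # the pretrained checkpoint loads cleanly.
        self.base = Cnn14(sample_rate, window_size, hop_size, mel_bins, fmin, fmax, 527)
        # New classification head for the target task.
        self.fc_transfer = nn.Linear(2048, classes_num_new)

    def load_from_pretrain(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        self.base.load_state_dict(checkpoint['model'])

    def forward(self, waveform):
        embedding = self.base(waveform)['embedding']  # (batch_size, 2048)
        # log-softmax here assumes a single-label target task; use sigmoid for multi-label.
        return torch.log_softmax(self.fc_transfer(embedding), dim=-1)

Whether to freeze the base model or fine-tune it end-to-end, and which output activation to use, depends on the target task.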

Demos

We apply the audio tagging system to build a sound event detection (SED) system. The SED prediction is obtained by applying the audio tagging system on consecutive 2-second segments. The demo video can be viewed at:
https://www.youtube.com/watch?v=7TEtDMzdLeY
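
A minimal sketch of this segment-wise idea, assuming a loaded PANN whose forward pass returns a dict containing 'clipwise_output' (as the models in pytorch/models.py do) and a 32 kHz mono waveform as a NumPy array. The helper below is hypothetical, not part of the repository:

import numpy as np
import torch

def tag_in_segments(model, waveform, sample_rate=32000, segment_seconds=2.0, device='cuda'):
    """Run the audio tagging model on consecutive 2-second segments.

    waveform: 1-D float32 numpy array (mono audio).
    Returns an array of shape (num_segments, classes_num).
    """
    model.eval()
    segment_samples = int(segment_seconds * sample_rate)
    outputs = []
    for start in range(0, len(waveform), segment_samples):
        segment = waveform[start:start + segment_samples]
        if len(segment) < segment_samples:
            # Zero-pad the last segment.
            segment = np.pad(segment, (0, segment_samples - len(segment)))
        x = torch.from_numpy(segment[None, :]).float().to(device)
        with torch.no_grad():
            clipwise_output = model(x)['clipwise_output']  # assumes a dict output
        outputs.append(clipwise_output.cpu().numpy()[0])
    return np.stack(outputs)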

FAQs

If users run into an out-of-memory error, try reducing the batch size.

Cite

[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2880-2894.

Reference

[2] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M. and Ritter, M. "Audio Set: An ontology and human-labeled dataset for audio events." In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, 2017.

[3] Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B. and Slaney, M. "CNN architectures for large-scale audio classification." In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, 2017.

External links

Other work on music transfer learning includes:
https://github.com/jordipons/sklearn-audio-transfer-learning
https://github.com/keunwoochoi/transfer_learning_music

audioset_tagging_cnn's People

Contributors

dopc, qiuqiangkong


audioset_tagging_cnn's Issues

Binarizing output values

Hi Qiuqiang,

I would like to know the best way to binarize the linearly predicted probabilities such that:

  • 0: the audio label is absent
  • 1: the audio label is present

If you have any suggestions for this binarization issue, it would be great to hear them.

One more question about clipwise_output: as I understood from the paper, the linear probability value for each label indicates the presence of that audio label in the input audio, and the probability does not depend on how long the label occurs, i.e. whether it happens over a very short or a long duration. Am I right?

It would be great to get your answers to the questions above.

Anar Sultani
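
For illustration only, one simple (and not necessarily optimal) way to binarize the clipwise probabilities is a fixed threshold; per-class thresholds tuned on a validation set usually work better. A minimal sketch:

import numpy as np

def binarize(clipwise_output, threshold=0.3):
    """Map clipwise probabilities to 1 (label present) / 0 (label absent).

    clipwise_output: array of shape (classes_num,) with values in [0, 1].
    threshold: free parameter; 0.3 here is just an example value.
    """
    return (clipwise_output >= threshold).astype(np.int32)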

Last dropout is disconnected from fc_audioset layer

It appears that, in all CNN models, the last dropout, i.e., embedding = F.dropout(x, p=0.5, training=self.training), is actually disconnected from the output linear layer, i.e., self.fc_audioset(x).
Indeed, the forward method of these models reads:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(x))

By reading the arXiv paper, it seems that the last dropout should have instead connected the 2048-embedding layer to the 527-output layer. Indeed, the paper reads:

"Dropout [38] is applied after each downsampling operation and fully connected layers to prevent systems from overfitting."

Therefore, I expected to see the following:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(embedding))

Am I missing something?

Thank you,
Alessandro

ERROR - code is too big

When I run panns_inference on the CPU, it shows "ERROR - code is too big".

Is panns_inference only available on GPU? Why does this error occur when using the CPU?

Convert the sound event detection prediction image into CSV (pandas format)

def plot_sound_event_detection_result(framewise_output):
    """Visualization of sound event detection result.

    Args:
      framewise_output: (time_steps, classes_num)
    """
    out_fig_path = 'results/sed_result.png'
    os.makedirs(os.path.dirname(out_fig_path), exist_ok=True)

    classwise_output = np.max(framewise_output, axis=0)  # (classes_num,)

    idxes = np.argsort(classwise_output)[::-1]
    idxes = idxes[0:5]

    ix_to_lb = {i: label for i, label in enumerate(labels)}
    lines = []
    for idx in idxes:
        line, = plt.plot(framewise_output[:, idx], label=ix_to_lb[idx])
        lines.append(line)

    plt.legend(handles=lines)
    plt.xlabel('Frames')
    plt.ylabel('Probability')
    plt.ylim(0, 1.)
    plt.savefig(out_fig_path)
    print('Save fig to {}'.format(out_fig_path))

How can I convert this into a pandas format (timestamp, class_name) showing which classes are predicted at each particular time?
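
A minimal sketch of such a conversion, assuming framewise_output has shape (time_steps, classes_num), labels is the class-name list from utils/config.py, and roughly 100 frames per second (sample_rate / hop_size = 32000 / 320); the helper below is hypothetical:

import numpy as np
import pandas as pd

def framewise_to_dataframe(framewise_output, labels, frames_per_second=100, threshold=0.3):
    """Convert framewise probabilities into a long-format DataFrame.

    framewise_output: (time_steps, classes_num) array of probabilities.
    labels: list of class names, index-aligned with the classes axis.
    frames_per_second: assumed frame rate (sample_rate / hop_size).
    """
    rows = []
    for frame_idx, class_idx in zip(*np.where(framewise_output >= threshold)):
        rows.append({
            'timestamp_s': frame_idx / frames_per_second,
            'class_name': labels[class_idx],
            'probability': float(framewise_output[frame_idx, class_idx]),
        })
    return pd.DataFrame(rows, columns=['timestamp_s', 'class_name', 'probability'])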

Is it fully compatible with mixed precision ?

Hello,

Thank you for sharing the weights and experiments from your paper; it is very good work and very helpful.

I am experimenting with your Wavegram_Logmel_Cnn14 model on a custom dataset and I have seen an issue when using mixed precision in PyTorch 1.6 with the LogmelFilterBank layer. In fact, I sometimes get NaN values in the forward output of this layer, which later produces NaN values in the loss function.
I was wondering if you have an idea why? I do not have this issue when I am not using mixed precision.

numba.decorators ModuleNotFoundError

Hi. I got this error when trying to run pytorch/inference.py .
I installed the required packages by running pip install -r requirements.txt

Below is the traceback:

Traceback (most recent call last):
  File "pytorch/inference.py", line 6, in <module>
    import librosa
  File "/usr/local/lib/python3.7/dist-packages/librosa/__init__.py", line 12, in <module>
    from . import core
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/__init__.py", line 109, in <module>
    from .time_frequency import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/time_frequency.py", line 10, in <module>
    from ..util.exceptions import ParameterError
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/__init__.py", line 71, in <module>
    from . import decorators
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/decorators.py", line 9, in <module>
    from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'

KeyError: 'framewise_output'

Hi,

When I run
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path="resources/7061-6-0-0.wav" --cuda
I got an error saying

Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
KeyError: 'framewise_output'

Then if I print batch_output_dict I see that the keys are: dict_keys(['clipwise_output', 'embedding']). Am I missing something ?

Thanks

Change License to Reflect Proper Authors

Hello, I am interested in leveraging the great work you folks have done here. However, the current MIT License appears to just be a copy of the one used for the AngularJS project and thus doesn't reflect that the copyright holders are the authors of the associated paper: Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Updating this would be greatly appreciated!

Can't download the AudioSet

When using youtube-dl to download AudioSet, it returns an exception:
OSError: ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

It seems that YouTube has already banned my IP. Do you have any suggestions?

Cannot reproduce the audio tagging result of MobileNetV1 in the PANNs paper; are there any tricks when training?

Hello, thanks for providing the source code and training data.
I downloaded the AudioSet dataset from the Baidu network disk you provided and trained the MobileNetV1 model from scratch following the steps described in "Train PANNs from scratch". The problem is that I cannot reproduce the training result you provided (MobileNetV1_mAP=0.389.pth).
When my training reaches iteration 234000, the loss is still 1.1358, and the validate bal mAP and validate test mAP are both 0.005. It seems that the two mAPs never change and the model does not converge.
Would you please give me some guidance? Are there any tricks when training the model?

Looking forward to your reply. Thank you!

SED is unavailable for some models

Although we can run sound_event_detection on the model "Cnn14_DecisionLevelMax_mAP=0.385.pth" with the command:
python pytorch/inference.py sound_event_detection --model_type="Cnn14_DecisionLevelMax" --checkpoint_path="models\Cnn14_DecisionLevelMax_mAP=0.385.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

the models "MobileNetV1_mAP=0.389.pth" and "Wavegram_Logmel_Cnn14_mAP=0.439.pth" do not work with the command:
python pytorch/inference.py sound_event_detection --model_type="Wavegram_Logmel_Cnn14" --checkpoint_path="models\Wavegram_Logmel_Cnn14_mAP=0.439.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

Indeed, 'framewise_output' is not produced by these models, which raises the error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
KeyError: 'framewise_output'

Assertion error and low MAP on bal/eval set

  1. I am getting an assertion error while running your script to create the hdf5 files. It occurs in the float32_to_int16() conversion. Here is a simplified version:
def float32_to_int16(x):
    assert np.max(np.abs(x)) <= 1.
    return (x * 32767.).astype(np.int16)

aud, sr = librosa.core.load(wav_files[0], sr=32000, mono=True)
aud = float32_to_int16(aud)

print (np.max(np.abs(aud)))
>>> 1.0048816

Some of my audio files are out of range. If I comment out the assertion then everything works. Would it be correct to remove the assertion? (See the clipping sketch at the end of this issue.)

  2. I am also getting low mAP scores on the balanced set and the evaluation set when using your trained models.
The ResNet38 
bal set :: 0.52
eval set :: 0.37

CNN10 
bal set :: 0.48
eval set :: 0.32 

Do you think the above issue has anything to do with this? I mean, I prepared the data with the assertion commented out.
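
Regarding question 1, a common workaround is to clip slightly out-of-range samples rather than silently dropping the assertion; a minimal sketch (not the repository's own utility function) is:

import numpy as np

def float32_to_int16_safe(x):
    """Convert float32 audio in (approximately) [-1, 1] to int16.

    Some decoded files slightly exceed [-1, 1]; clipping them keeps the
    conversion well defined instead of relying on overflow behaviour.
    """
    x = np.clip(x, -1.0, 1.0)
    return (x * 32767.0).astype(np.int16)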

What's the input size of the CNN?

Hello,

I tried to print the input size at each layer, taking the Cnn14 model code as an example:

  1. use function librosa.load to load audio wav. [1, 32000]
  2. spectrogram_extractor: [1, 1, 1001, 513]
  3. logmel_extractor: [1, 1, 1001, 64]

I have three questions:

  1. Different audio clips have different lengths; for example, some may be [1, 32000] while others may be [1, 294198], so they have different sizes after the spectrogram_extractor. How can tensors of different sizes be fed into the CNN? Or do you reshape them to the same size? (See the pooling sketch at the end of this issue.)
  2. How do you feed a (1001, 64)-sized tensor (unequal width and height) into the CNN?
  3. I tested your model and the accuracy is really high. I tried extracting audio features with MFCC and training AudioSet on a VGGNet, but the accuracy was only about 50%. How did you improve your model's accuracy?

Looking forward to your reply. Thank you.
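
Regarding the variable-length question, the convolutional feature maps are eventually pooled over the time and frequency axes, which collapses inputs of any length to a fixed-size clip-level vector. A minimal sketch of such pooling (illustrative, not the exact Cnn14 code):

import torch

def pool_variable_length(features):
    """Collapse a variable-length feature map to a fixed-size vector.

    features: (batch_size, channels, time_steps, mel_bins) after the conv blocks;
    time_steps varies with the input audio length.
    """
    x = torch.mean(features, dim=3)   # average over mel bins -> (B, C, T)
    x_max, _ = torch.max(x, dim=2)    # max over time         -> (B, C)
    x_mean = torch.mean(x, dim=2)     # mean over time        -> (B, C)
    return x_max + x_mean             # fixed-size clip-level representation

# Example: two different time lengths give the same output size.
print(pool_variable_length(torch.randn(1, 2048, 31, 2)).shape)   # torch.Size([1, 2048])
print(pool_variable_length(torch.randn(1, 2048, 280, 2)).shape)  # torch.Size([1, 2048])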

Why transpose(1, 3) before BatchNorm?

May I ask why you do transpose(1, 3) before BN? Is it intended to do batch normalization over each frequency bin, and what is the advantage of this? Thanks.

x = x.transpose(1, 3)
x = self.bn0(x)
x = x.transpose(1, 3)

Confusion about Finetune

Hi bro,
When I use the model for fine-tune training, the task is classifying types of guns. I tried changing the learning rate and the number of epochs, but the results were bad. Then I used a simple VGG16 structure, and it achieved good results. Could you please help clear up my confusion? Many thanks!

Variable Length Sequences

Hi,
How can I use your Cnn14 network with batches of input audio sequences of variable lengths? Also, is there a recommended input audio length for the pretrained Cnn14_16k_mAP=0.438.pth?

Could I change the input size of the wav files?

Hi there,
I have a question about the input size of the wav files.
I'm doing some work on a transfer learning task based on your pretrained model.
In config.py, you set:

sample_rate = 32000
clip_samples = sample_rate * 10     # Audio clips are 10-second

I'm wondering: could I change these two numbers?
If I changed them, does that mean I can't use your pretrained model for the next steps?

Other Pretrained Models

Really amazing stuff here!
Can you provide other pretrained models too, such as MobileNets, for audio tagging?
Thank you!

Pretrained Cnn14 16kHz wrong shape errors

After downloading Cnn14_16k_mAP=0.438.pth and following these instructions:

MODEL_TYPE="Cnn14_16k"
CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth"   # Trained by a later version of code, achieves higher mAP than the paper.
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path='resources/R9_ZSCveAHg_7s.wav' --cuda

I get the following error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/*user*/anaconda3/envs/onseilake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).

Thank you for open sourcing everything!

class_labels_indices.csv is missing

Hey guys!

Thanks for sharing the code, but running the inference, this error pops up:

Traceback (most recent call last):
  File "pytorch/inference_template.py", line 26, in <module>
    import config
  File "/Users/admin/Desktop/audioset_tagging_cnn/pytorch/../utils/config.py", line 8, in <module>
    with open('metadata/class_labels_indices.csv', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'metadata/class_labels_indices.csv'

Could you pls. add that file?

Thx a lot,
Max

How can I download the dataset?

Hi, when I run runme.sh, I get many errors like this: sh: 1: youtube-dl: not found

Could you tell me another way to download this large dataset?

Thank you!

code to plot "log spectrogram"+class probabilities

Hi, I wanted to ask if you could please provide the code for your visualization. I would really like to reproduce your plot with other audio clips.

In detail: the visualization of sound event detection with the log spectrogram at the top and the class probabilities at the bottom. The image can be found in resources/sed_R9_ZSCveAHg_7s.png.

That would be really great!
Thanks in advance.
Lydia

The procedure

Hi, I am confused about the procedure of the experiment even though I have looked through the README.md. Could you list the steps of the experiment? Thanks a lot.

Literature pointers for better understanding the `Cnn14_DecisionLevelAtt` model

Hello, Thanks for the awesome repo.

I am new to the audio and SED domain. I have been using your architecture for a recent Kaggle competition and getting decent results. Therefore, I would like to better understand the details of Cnn14_DecisionLevelAtt.

I have read the PANNs paper, but it mostly focuses on the CNN feature extractor. I am interested in understanding why things are done the way they are in the Cnn14_DecisionLevelAtt model (basically everything besides the CNN feature extractor). Can you point me to some write-ups that explain this?

Thanks

Feeding long audio data vs second-by-second or smaller chunks

Dear authors,

Thanks for the great work!

I would like to ask whether there is any potential difference between feeding in audio data that is typically 20-90 seconds long versus slicing it into chunks or running second-by-second predictions. I fed the Cnn14 model audio data that is typically 20-90 seconds long, and after getting the linearly predicted probabilities I checked the feature importance; it was almost 0 for all of the audio labels.
After binarizing them with threshold=0.3, it was clear that support was extremely low for 525/527 labels (all except Speech and Music).

Now I am thinking that feeding the model second-by-second audio data may increase the accuracy, because with second-by-second data each instance has a chance to be monophonic, which may lead to better results.

I would like to know your opinion about the above-mentioned thoughts if possible.

Best Regards

Get embedding not classification

Is there an implementation of this anywhere that can be used to output embeddings of audio using any of the pretrained models, rather than classifications, so that we could use these embeddings to train our own classifiers (e.g. random forests)? Similar to how you can easily get a 128-dimensional embedding using VGGish.
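
For what it is worth, the repository's inference already returns a 2048-dimensional embedding alongside the classification output (see the audio tagging example above, which prints embedding: (2048,)). Below is a hedged sketch of extracting it directly from a loaded Cnn14 checkpoint, using the hyper-parameters of the 32 kHz model; the names follow pytorch/models.py and may change between versions:

import librosa
import torch
from models import Cnn14  # pytorch/models.py must be on the Python path

# Assumed hyper-parameters for the 32 kHz Cnn14 checkpoint.
model = Cnn14(sample_rate=32000, window_size=1024, hop_size=320,
              mel_bins=64, fmin=50, fmax=14000, classes_num=527)
checkpoint = torch.load('Cnn14_mAP=0.431.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()

(waveform, _) = librosa.core.load('resources/R9_ZSCveAHg_7s.wav', sr=32000, mono=True)
with torch.no_grad():
    output_dict = model(torch.from_numpy(waveform[None, :]))
embedding = output_dict['embedding']  # shape (1, 2048), usable as a feature vector
print(embedding.shape)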

Shape doesn't match when inferencing Cnn14_16k model

Great work, and thanks for sharing!

When I run this command according to the README:

python pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type="Cnn14_16k" --checkpoint_path="Cnn14_16k_mAP=0.438.pth" --audio_path='resources/R9_ZSCveAHg_7s.mp3'

it raises the following error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/zongbowen/anaconda2/envs/tensorflow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).

Transfer learning for a few classes

Hey, thanks for the great work.

I want to fine-tune your pretrained models for fewer than 527 classes.
Can you please guide me?

I have run finetune_template, and the only output is:

GPU number: 1
Load pretrained model successfully!
Process finished with exit code 0

I also tried to train from scratch with just 2 classes, but I got several errors because of indexing. I just followed runme.sh for training from scratch.

Thx

batchnorm1d doesn't seem to be used in attention block

In the class AttBlock(nn.Module), the __init__ has:

self.bn_att = nn.BatchNorm1d(n_out)

but the forward doesn't seem to be using it.

Also, temperature variable does not seem to be used.

Can these be removed without affecting the learning?
