
Learning intra-modal and cross-modal embeddings

This repo is an implementation of the paper Objects that Sound. It learns embeddings from unlabelled audio and video pairs that can be used for both intra-modal (image-2-image or audio-2-audio) and cross-modal (image-2-audio or audio-2-image) query retrieval. This is done by training a tailored network (AVENet) on the audio-visual correspondence (AVC) task.
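For orientation, here is a minimal sketch of the AVE-Net-style scoring head described in the paper: each subnetwork emits a unit-norm embedding, and the Euclidean distance between the two is mapped to two correspondence logits. The `vision_net`/`audio_net` names and the 128-D embedding size are assumptions about this repo's implementation, not a verbatim copy of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCHead(nn.Module):
    """Scores whether an image frame and a 1 s audio clip correspond.

    `vision_net` and `audio_net` are hypothetical subnetworks that each
    map their input to a 128-D embedding, as in the AVE-Net design.
    """
    def __init__(self, vision_net, audio_net):
        super().__init__()
        self.vision_net = vision_net
        self.audio_net = audio_net
        # Scalar distance -> 2-way (correspond / mismatch) logits.
        self.fc = nn.Linear(1, 2)

    def forward(self, image, audio):
        v = F.normalize(self.vision_net(image), dim=1)  # unit-norm image embedding
        a = F.normalize(self.audio_net(audio), dim=1)   # unit-norm audio embedding
        dist = torch.norm(v - a, dim=1, keepdim=True)   # distance in the shared space
        return self.fc(dist)                            # logits for the AVC task
```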

Dataset and dataloader

We used the Audioset dataset to train and evaluate our model. Due to resource constraints, we chose a subset of 50 classes from all the possible classes. The classes are:

Accordion, Acoustic guitar, Bagpipes, Banjo, Bass drum, Bell, Church bell, Bicycle bell, Bugle, Cello, Chant, Child singing, Chime, Choir, Clarinet, Cowbell, Cymbal, Dental drill, Dentist's drill, Drill, Drum, Drum kit, Electric guitar, Electric piano, Electronic organ, Female singing, Filing (rasp), Flute, French horn, Gong, Guitar, Hammer, Harmonica, Harp, Jackhammer, Keyboard (musical), Male singing, Mandolin, Marimba, xylophone, Musical ensemble, Orchestra, Piano, Power tool, Sanding, Sawing, Saxophone, Sitar, Tabla, Trumpet, Tuning fork, Wind chime

There are 165865 videos available for download across these classes. We downloaded a subset of ~46k videos (40k train, 3k validation, 3k test). The dataset is highly skewed; here is the distribution of the 40k training videos across the classes.

[Figure: class distribution of the 40k training videos]

This is potentially problematic: the network could learn the dominant classes very well while effectively treating the rest as random. As it turns out, the training procedure is quite robust to classes with low representation in the training data. Some problems with the dataset are:

  • Many of the videos were shorter than 10 seconds. The dataloader handles this by sampling only from the frames that are available (see the sketch after this list).
  • Some videos had missing or poor-quality audio, or no relevant frames (e.g. a blank screen or a static album cover while a guitar plays in the audio).
  • Some videos had too many classes associated with them, which introduces a lot of noise. For example, a video with the sound of bells also had a human shouting in the background. In such cases we can't really separate the sounds, so these examples make training difficult.
  • The distribution over classes is highly skewed (see above).
  • Even within the same class, many videos had extremely different audio samples that even humans couldn't classify as the same or different.
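As a rough illustration of the first point, here is a hypothetical dataloader helper that samples only from material that actually exists; the frame rate, sample rate, and centring of the audio window are all assumptions, not the repo's exact logic.

```python
import random

def sample_frame_and_audio(num_frames, num_audio_samples, fps=25, sr=48000):
    """Pick a random *available* frame and a 1 s audio window around it,
    clamped so the window never runs past the clip (assumes >= 1 s of audio)."""
    f = random.randrange(num_frames)                       # only frames that exist
    center = int(f / fps * sr)                             # frame time in audio samples
    start = max(0, min(center - sr // 2, num_audio_samples - sr))
    return f, (start, start + sr)
```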

Spectrogram analysis

Looking at the log spectrograms of different classes, we do see some subtle differences between them. For example, the electric organ shows sharp bursts of high-frequency energy across time, the dental drill covers almost the entire frequency range, and chanting exhibits a periodic repetition in its frequency pattern.

[Figure: log spectrograms for several classes]
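A sketch of how such a log spectrogram can be computed; the FFT size and overlap here are illustrative choices, not necessarily the parameters that produced the figure above.

```python
import numpy as np
from scipy.io import wavfile
from scipy import signal

def log_spectrogram(wav_path, clip_seconds=1.0):
    """Log spectrogram of a short audio clip (parameters are illustrative)."""
    rate, samples = wavfile.read(wav_path)
    samples = samples[: int(rate * clip_seconds)]
    if samples.ndim > 1:                       # mix stereo down to mono
        samples = samples.mean(axis=1)
    # 512-point FFT with 50% overlap; assumed values, tune as needed.
    freqs, times, spec = signal.spectrogram(samples, fs=rate,
                                            nperseg=512, noverlap=256)
    return np.log(spec + 1e-7)                 # epsilon avoids log(0)
```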

Training

Training is performed using the parameters and implementation details mentioned in the paper. It took us ~3 days to train. Here is the training plot:

[Figure: training and validation curves]

The accuracy, however, is only a proxy for learning good representations of the audio and video frames. As we show later, we get some good results and some unexpected robustness. Note: we DO NOT use the labels of the videos in any way during training; they are used only for evaluation. This means the network learns to semantically correlate the audio with the instruments, which was the ultimate purpose of training under the given constraints. The representations are learnt without supervision, which makes the results all the more interesting, and the compact embeddings make query retrieval fast.

Results

Image to image retrieval

We select a random frame from a video as the query image, compute the Euclidean distance between its embedding and the embeddings of the other frames, and select the top 5 matches. Since the top match is always the query image itself, we show the top 4 excluding the query. Also, some videos have frames that are close to each other, so some results may look redundant.

[Figures: image-to-image retrieval examples (imim1–imim5)]
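All four retrieval modes reduce to nearest-neighbour search in the shared embedding space. A minimal sketch, with `database_embs` as a hypothetical array of precomputed AVENet embeddings:

```python
import numpy as np

def top_k(query_emb, database_embs, k=5, skip_first=False):
    """Indices of the k nearest embeddings by Euclidean distance.

    Because image and audio embeddings live in the same space, the same
    routine serves intra-modal and cross-modal retrieval; `skip_first`
    drops the trivial self-match for image-to-image queries.
    """
    dists = np.linalg.norm(database_embs - query_emb, axis=1)
    order = np.argsort(dists)
    return order[1 : k + 1] if skip_first else order[:k]
```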

Audio to image retrieval

We select a random 1-second audio clip from a video and compute the distances between its embedding and the image embeddings. A random frame from the query audio's video is also shown. Note that a video may contain multiple classes while the plot shows only one of them; hence we attach the audio as well for manual inspection.

[Figures: audio-to-image retrieval examples with query sounds (auim1–auim10)]
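The same hypothetical `top_k` helper from above serves these cross-modal queries; only the table being searched changes:

```python
# Cross-modal query: search image embeddings with an audio embedding.
# `audio_embs` and `image_embs` are hypothetical precomputed AVENet outputs.
q = 0                                         # index of the query audio clip
nearest_images = top_k(audio_embs[q], image_embs, k=5)
```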

To-Do

  • Include more results
  • Include the other two query types (image-2-audio and audio-2-audio)
  • Work on localization
  • Upload trained models
  • Improve documentation

Feel free to report bugs and improve the code by submitting a PR.

Contributors

rohitrango, samar97

Issues

Cross entropy

Hi,

I just randomly bumped into this - nice work.

I briefly looked at the code and have one quick suggestion: it is typically a good idea to use functions that merge the softmax and the cross-entropy loss into one operation during training, for numerical stability (softmax involves an 'exp', and cross entropy then a 'log'). I'm not familiar with PyTorch, but this seems relevant:
https://pytorch.org/docs/master/nn.html#torch.nn.BCEWithLogitsLoss

Regards,
Relja
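For context, a minimal PyTorch illustration of the fused formulation suggested above; the shapes are illustrative, and the repo's actual head may differ.

```python
import torch
import torch.nn as nn

logits = torch.randn(64, 2)               # raw 2-way AVC outputs, no softmax applied
targets = torch.randint(0, 2, (64,))      # 0 = mismatch, 1 = correspondence

# Fused and numerically stable: log-softmax + NLL in one op,
# computed with the log-sum-exp trick internally.
loss = nn.CrossEntropyLoss()(logits, targets)

# For a single-logit binary head, the analogous fused op is
# nn.BCEWithLogitsLoss (linked in the issue above).
```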

ffmpeg - a pre-requisite to download

Hi Rohit,

ffmpeg should be listed as a prerequisite for downloading and splitting the data with get_video_audio.py.

The exit status of the subprocess that runs ffmpeg to extract the audio and video is not captured, so it took some time to figure out that the problem was simply that ffmpeg was not installed on the system.

With Regards,
Shilpa
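A minimal sketch of the suggested fix, with an illustrative (not the repo's actual) ffmpeg invocation: fail loudly if ffmpeg is missing or exits with a non-zero status.

```python
import subprocess

# Hypothetical command; the flags stand in for the repo's real invocation.
cmd = ["ffmpeg", "-i", "video.mp4", "-vn", "audio.wav"]
try:
    subprocess.run(cmd, check=True, capture_output=True)
except FileNotFoundError:
    raise SystemExit("ffmpeg is not installed or not on PATH")
except subprocess.CalledProcessError as e:
    raise SystemExit(f"ffmpeg failed: {e.stderr.decode(errors='replace')}")
```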

Self-supervision but actually using labels

Hi,
The work should rely on self-supervision, where no labels are used for training, as stated in the README:

We DO NOT use the labels of the videos in any way during training, and only use them for evaluation.

However, labels are actually used when choosing the negative pairs for the contrastive loss. This is either a bug or an unfair evaluation. That said, I reckon choosing negative pairs uniformly at random should not make much difference, given the high number of classes (roughly a 1/50 chance of drawing a false negative pair).
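A sketch of the label-free sampling the issue advocates, assuming a hypothetical `videos` list whose items expose `frame()` and `audio()`; no class labels are consulted at any point.

```python
import random

def sample_pair(videos):
    """Label-free AVC pair sampling: positives pair a frame with audio
    from the same video; negatives pair it with audio from a random
    *other* video, without ever looking at class labels."""
    i = random.randrange(len(videos))
    if random.random() < 0.5:                         # positive pair
        return videos[i].frame(), videos[i].audio(), 1
    j = random.randrange(len(videos) - 1)
    j = j + 1 if j >= i else j                        # any index except i
    return videos[i].frame(), videos[j].audio(), 0    # negative pair
```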

What does the label in the csv file mean?

Hi, I have a question. A file such as 'unbalanced_train_segments_fitered.csv' contains a name, a time span, and a label. What does the label mean? For example, how do I decode '/m/0342h' into a readable class name?
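For reference, these labels are AudioSet machine IDs (MIDs); AudioSet ships a class_labels_indices.csv file that maps each MID to a display name. A minimal decoding sketch, assuming that csv is in the working directory:

```python
import csv

# class_labels_indices.csv (distributed with AudioSet) has columns
# index, mid, display_name; build a MID -> name lookup from it.
with open("class_labels_indices.csv") as f:
    mid_to_name = {row["mid"]: row["display_name"] for row in csv.DictReader(f)}

print(mid_to_name["/m/0342h"])   # e.g. 'Accordion'
```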

Getting zero accuracy on running in "main" mode

I am getting zero accuracy when I run

python mainNN.py --mode main

Do I need to run it on some other mode first to train the model?

I already had AudioSet downloaded, so could this problem be related to a difference in the data downloading procedure?

I am running the code on a small number of files, not the whole dataset.

I am using Google Colab.

Using batch size: 64
Using lr: 0.000025
Loaded dataloader and loss function. Optimizer loaded.
Epoch: 0, Subepoch: 0, Loss: 0.782957, Valloss: 0.804846, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 10, Loss: 0.670178, Valloss: 0.813841, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 20, Loss: 0.767996, Valloss: 0.848316, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 30, Loss: 0.774370, Valloss: 0.745560, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 40, Loss: 0.789229, Valloss: 0.763325, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 50, Loss: 0.721946, Valloss: 0.747408, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 0, Loss: 0.685583, Valloss: 0.750512, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 10, Loss: 0.820786, Valloss: 0.750623, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 20, Loss: 0.732163, Valloss: 0.698070, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 30, Loss: 0.753967, Valloss: 0.691314, batch_size: 64, acc: 0.000000, valacc: 0.000000

Please help me out. Thanks
