
Learning intra-modal and cross-modal embeddings

This repo is an implementation of the paper Objects that Sound. It learns embeddings from unlabelled audio and video pairs that can be used for both intra-modal (image-2-image or audio-2-audio) and cross-modal (image-2-audio or audio-2-image) query retrieval. This is done by training a tailored network (AVENet) on the audio-visual correspondence (AVC) task.
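For orientation, here is a minimal sketch of the AVE-Net-style scoring head described in the paper: each subnetwork emits a unit-norm embedding, and the Euclidean distance between the two is mapped to two correspondence logits. The `vision_net`/`audio_net` names and the 128-D embedding size are assumptions about this repo's implementation, not a verbatim copy of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCHead(nn.Module):
    """Scores whether an image frame and a 1 s audio clip correspond.

    `vision_net` and `audio_net` are hypothetical subnetworks that each
    map their input to a 128-D embedding, as in the AVE-Net design.
    """
    def __init__(self, vision_net, audio_net):
        super().__init__()
        self.vision_net = vision_net
        self.audio_net = audio_net
        # Scalar distance -> 2-way (correspond / mismatch) logits.
        self.fc = nn.Linear(1, 2)

    def forward(self, image, audio):
        v = F.normalize(self.vision_net(image), dim=1)  # unit-norm image embedding
        a = F.normalize(self.audio_net(audio), dim=1)   # unit-norm audio embedding
        dist = torch.norm(v - a, dim=1, keepdim=True)   # distance in the shared space
        return self.fc(dist)                            # logits for the AVC task
```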

Dataset and dataloader

We used the Audioset dataset to train and evaluate our model. Due to resource constraints, we chose a subset of 50 classes from all the possible classes. The classes are:

Accordion, Acoustic guitar, Bagpipes, Banjo, Bass drum, Bell, Church bell, Bicycle bell, Bugle, Cello, Chant, Child singing, Chime, Choir, Clarinet, Cowbell, Cymbal, Dental drill, Dentist's drill, Drill, Drum, Drum kit, Electric guitar, Electric piano, Electronic organ, Female singing, Filing (rasp), Flute, French horn, Gong, Guitar, Hammer, Harmonica, Harp, Jackhammer, Keyboard (musical), Male singing, Mandolin, Marimba, xylophone, Musical ensemble, Orchestra, Piano, Power tool, Sanding, Sawing, Saxophone, Sitar, Tabla, Trumpet, Tuning fork, Wind chime

There are 165865 videos available for download across these classes. We downloaded a subset of ~46k videos (40k train, 3k validation, 3k test). The dataset is highly skewed; here is the distribution of the 40k training videos across the classes.

[Figure: class distribution of the 40k training videos]

This is potentially problematic: the network could learn the dominant classes very well while effectively treating the rest as random. As it turns out, the training procedure is quite robust to classes with low representation in the training data. Some problems with the dataset are:

  • Many of the videos were shorter than 10 seconds. The dataloader handles this by sampling only from the frames that are available (see the sketch after this list).
  • Some videos had missing or poor-quality audio, or no relevant frames (e.g. a blank screen or a static album cover while a guitar plays in the audio).
  • Some videos had too many classes associated with them, which introduces a lot of noise. For example, a video with the sound of bells also had a human shouting in the background. In such cases we can't really separate the sounds, so these examples make training difficult.
  • The distribution over classes is highly skewed (see above).
  • Even within the same class, many videos had extremely different audio samples that even humans couldn't classify as the same or different.
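As a rough illustration of the first point, here is a hypothetical dataloader helper that samples only from material that actually exists; the frame rate, sample rate, and centring of the audio window are all assumptions, not the repo's exact logic.

```python
import random

def sample_frame_and_audio(num_frames, num_audio_samples, fps=25, sr=48000):
    """Pick a random *available* frame and a 1 s audio window around it,
    clamped so the window never runs past the clip (assumes >= 1 s of audio)."""
    f = random.randrange(num_frames)                       # only frames that exist
    center = int(f / fps * sr)                             # frame time in audio samples
    start = max(0, min(center - sr // 2, num_audio_samples - sr))
    return f, (start, start + sr)
```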

Spectrogram analysis

Looking at the log spectrograms of different classes, we do see some subtle differences between them. For example, the electric organ shows sharp bursts of high-frequency energy across time, the dental drill covers almost the entire frequency range, and chanting exhibits a periodic repetition in its frequency pattern.

[Figure: log spectrograms for several classes]
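A sketch of how such a log spectrogram can be computed; the FFT size and overlap here are illustrative choices, not necessarily the parameters that produced the figure above.

```python
import numpy as np
from scipy.io import wavfile
from scipy import signal

def log_spectrogram(wav_path, clip_seconds=1.0):
    """Log spectrogram of a short audio clip (parameters are illustrative)."""
    rate, samples = wavfile.read(wav_path)
    samples = samples[: int(rate * clip_seconds)]
    if samples.ndim > 1:                       # mix stereo down to mono
        samples = samples.mean(axis=1)
    # 512-point FFT with 50% overlap; assumed values, tune as needed.
    freqs, times, spec = signal.spectrogram(samples, fs=rate,
                                            nperseg=512, noverlap=256)
    return np.log(spec + 1e-7)                 # epsilon avoids log(0)
```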

Training

Training is performed using the parameters and implementation details mentioned in the paper. It took us ~3 days to train. Here is the training plot:

[Figure: training and validation curves]

The accuracy, however, is only a proxy for learning good representations of the audio and video frames. As we show later, we get some good results and some unexpected robustness. Note: we DO NOT use the labels of the videos in any way during training; they are used only for evaluation. This means the network learns to semantically correlate the audio with the instruments, which was the ultimate purpose of training under the given constraints. The representations are learnt without supervision, which makes the results all the more interesting, and the compact embeddings make query retrieval fast.

Results

Image to image retrieval

We select a random frame from a video as the query image, compute the Euclidean distance between its embedding and the embeddings of the other frames, and select the top 5 matches. Since the top match is always the query image itself, we show the top 4 excluding the query. Also, some videos have frames that are close to each other, so some results may look redundant.

[Figures: image-to-image retrieval examples (imim1–imim5)]
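All four retrieval modes reduce to nearest-neighbour search in the shared embedding space. A minimal sketch, with `database_embs` as a hypothetical array of precomputed AVENet embeddings:

```python
import numpy as np

def top_k(query_emb, database_embs, k=5, skip_first=False):
    """Indices of the k nearest embeddings by Euclidean distance.

    Because image and audio embeddings live in the same space, the same
    routine serves intra-modal and cross-modal retrieval; `skip_first`
    drops the trivial self-match for image-to-image queries.
    """
    dists = np.linalg.norm(database_embs - query_emb, axis=1)
    order = np.argsort(dists)
    return order[1 : k + 1] if skip_first else order[:k]
```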

Audio to image retrieval

We select a random 1-second audio clip from a video and compute the distances between its embedding and the image embeddings. A random frame from the query audio's video is also shown. Note that a video may contain multiple classes while the plot shows only one of them; hence we attach the audio as well for manual inspection.

[Figures: audio-to-image retrieval examples with query sounds (auim1–auim10)]
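The same hypothetical `top_k` helper from above serves these cross-modal queries; only the table being searched changes:

```python
# Cross-modal query: search image embeddings with an audio embedding.
# `audio_embs` and `image_embs` are hypothetical precomputed AVENet outputs.
q = 0                                         # index of the query audio clip
nearest_images = top_k(audio_embs[q], image_embs, k=5)
```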

To-Do

  • Include more results
  • Include the other two query types (image-2-audio and audio-2-audio)
  • Work on localization
  • Upload trained models
  • Improve documentation

Feel free to report bugs and improve the code by submitting a PR.

Contributors

rohitrango, samar97

Issues

Cross entropy

Hi,

I just randomly bumped into this - nice work.

I briefly looked at the code and have one quick suggestion: it is typically a good idea to use functions that merge the softmax and the cross-entropy loss into one operation during training, for numerical stability (softmax involves an 'exp', and cross entropy then a 'log'). I'm not familiar with PyTorch, but this seems relevant:
https://pytorch.org/docs/master/nn.html#torch.nn.BCEWithLogitsLoss

Regards,
Relja
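For context, a minimal PyTorch illustration of the fused formulation suggested above; the shapes are illustrative, and the repo's actual head may differ.

```python
import torch
import torch.nn as nn

logits = torch.randn(64, 2)               # raw 2-way AVC outputs, no softmax applied
targets = torch.randint(0, 2, (64,))      # 0 = mismatch, 1 = correspondence

# Fused and numerically stable: log-softmax + NLL in one op,
# computed with the log-sum-exp trick internally.
loss = nn.CrossEntropyLoss()(logits, targets)

# For a single-logit binary head, the analogous fused op is
# nn.BCEWithLogitsLoss (linked in the issue above).
```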

ffmpeg - a pre-requisite to download

Hi Rohit,

ffmpeg should be listed as a prerequisite for downloading and splitting the data with get_video_audio.py.

The exit status of the subprocess that runs ffmpeg to extract the audio and video is not captured, so it took some time to figure out that the problem was simply that ffmpeg was not installed on the system.

With Regards,
Shilpa
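A minimal sketch of the suggested fix, with an illustrative (not the repo's actual) ffmpeg invocation: fail loudly if ffmpeg is missing or exits with a non-zero status.

```python
import subprocess

# Hypothetical command; the flags stand in for the repo's real invocation.
cmd = ["ffmpeg", "-i", "video.mp4", "-vn", "audio.wav"]
try:
    subprocess.run(cmd, check=True, capture_output=True)
except FileNotFoundError:
    raise SystemExit("ffmpeg is not installed or not on PATH")
except subprocess.CalledProcessError as e:
    raise SystemExit(f"ffmpeg failed: {e.stderr.decode(errors='replace')}")
```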

Self-supervision but actually using labels

Hi,
The work should rely on self-supervision, where no labels are used for training, as stated in the README:

We DO NOT use the labels of the videos in any way during training, and only use them for evaluation.

However, labels are actually used when choosing the negative pairs for the contrastive loss. This is either a bug or an unfair evaluation. That said, I reckon choosing negative pairs uniformly at random should not make much difference, given the high number of classes (roughly a 1/50 chance of drawing a false negative pair).
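A sketch of the label-free sampling the issue advocates, assuming a hypothetical `videos` list whose items expose `frame()` and `audio()`; no class labels are consulted at any point.

```python
import random

def sample_pair(videos):
    """Label-free AVC pair sampling: positives pair a frame with audio
    from the same video; negatives pair it with audio from a random
    *other* video, without ever looking at class labels."""
    i = random.randrange(len(videos))
    if random.random() < 0.5:                         # positive pair
        return videos[i].frame(), videos[i].audio(), 1
    j = random.randrange(len(videos) - 1)
    j = j + 1 if j >= i else j                        # any index except i
    return videos[i].frame(), videos[j].audio(), 0    # negative pair
```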

What does the label in the csv file mean?

Hi, I have a question. A file such as 'unbalanced_train_segments_fitered.csv' contains a name, a time span, and a label. What does the label mean? For example, how do I decode '/m/0342h' into a readable class name?
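For reference, these labels are AudioSet machine IDs (MIDs); AudioSet ships a class_labels_indices.csv file that maps each MID to a display name. A minimal decoding sketch, assuming that csv is in the working directory:

```python
import csv

# class_labels_indices.csv (distributed with AudioSet) has columns
# index, mid, display_name; build a MID -> name lookup from it.
with open("class_labels_indices.csv") as f:
    mid_to_name = {row["mid"]: row["display_name"] for row in csv.DictReader(f)}

print(mid_to_name["/m/0342h"])   # e.g. 'Accordion'
```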

Getting zero accuracy on running in "main" mode

I am getting zero accuracy when I run

python mainNN.py --mode main

Do I need to run it on some other mode first to train the model?

I already had AudioSet downloaded, so could this problem be related to a difference in the data downloading procedure?

I am running the code on a small number of files, not the whole dataset.

I am using Google Colab.

Using batch size: 64
Using lr: 0.000025
Loaded dataloader and loss function. Optimizer loaded.
Epoch: 0, Subepoch: 0, Loss: 0.782957, Valloss: 0.804846, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 10, Loss: 0.670178, Valloss: 0.813841, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 20, Loss: 0.767996, Valloss: 0.848316, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 30, Loss: 0.774370, Valloss: 0.745560, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 40, Loss: 0.789229, Valloss: 0.763325, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 0, Subepoch: 50, Loss: 0.721946, Valloss: 0.747408, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 0, Loss: 0.685583, Valloss: 0.750512, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 10, Loss: 0.820786, Valloss: 0.750623, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 20, Loss: 0.732163, Valloss: 0.698070, batch_size: 64, acc: 0.000000, valacc: 0.000000
Epoch: 1, Subepoch: 30, Loss: 0.753967, Valloss: 0.691314, batch_size: 64, acc: 0.000000, valacc: 0.000000

Please help me out. Thanks
