johnmartinsson / bird-species-classification

Using convolutional neural networks to build and train a bird species classifier on bird song data with corresponding species labels.

License: MIT License

Python 99.60% Shell 0.40%

bird-species-classification bird-audio-detection bird-song-data master-thesis convolutional-neural-networks deep-residual-network bird-clef

bird-species-classification's Issues

Write tests

A test suite needs to be implemented to better assure that everything is working as expected.

I installed the bird module required for this program, but it doesn't seem to be working. Loader, preprocessing, and several other packages in bird seem to be missing or cannot be imported. The conflict seems to be that I installed the wrong bird package, one meant for algorithmic audio denoising, and I can't find the required bird package. How can I install the right one?

Thesis Introduction

Background
Goals
Challenges
Outline

Baseline Network Model

Implement a baseline convolutional neural network model.

"Figure 5 shows a visual representation of our neural network architecture. The network contains 5 convolutional layer, each followed by a max-pooling layer. We insert one dense layer before the final soft-max layer. The dense layer contains 1024 and the soft-max layer 1000 units, generating a probability for each class. We use batch normalization before every convolutional and before the dense layer. The convolutional layers use a rectify activation function. Drop-out is used on the input layer (probability 0.2), on the dense layer (probability 0.4) and on the soft-max layer (probability 0.4). As a cost function we use the single label categorical cross entropy function (in the log domain)."

Architecture

Four times:

BatchNormalization
Convolution Num. Filters = 64, 128, 256, 256
Convolution Kernel Sizes = 5x5, 5x5, 5x5, 3x3
Convolution Stride Size = 1x1
ReLU Activation
MaxPooling with 2x2 Kernels and Stride Size 2x2

Fully Connected

Dropout(40%)
Dense Layer with 1024 units
Dropout(40%)
SoftMax Layer with 19 units
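
As a rough sanity check on the layer list above, the sketch below traces tensor shapes through the four blocks. The 256 x 512 input size and the use of 'same' padding in the convolutions are assumptions made for illustration; the issue states neither.

```python
# Hypothetical input: a one-channel spectrogram chunk with
# 256 frequency bins and 512 time frames (assumed, not from the issue).
INPUT_SHAPE = (256, 512, 1)

# (number of filters, kernel size) for the four convolutional blocks listed above.
CONV_BLOCKS = [(64, 5), (128, 5), (256, 5), (256, 3)]


def trace_shapes(input_shape, conv_blocks, pool=2):
    """Return the output shape of each block, assuming 'same' padding and stride 1."""
    height, width, _ = input_shape
    shapes = []
    for filters, _kernel in conv_blocks:
        # With 'same' padding and stride 1 the convolution keeps height and width;
        # the 2x2 max-pooling with stride 2 then halves both.
        height, width = height // pool, width // pool
        shapes.append((height, width, filters))
    return shapes


shapes = trace_shapes(INPUT_SHAPE, CONV_BLOCKS)
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
for index, shape in enumerate(shapes, start=1):
    print(f"block {index}: {shape}")
print("features entering the 1024-unit dense layer:", flattened)
```

Under these assumptions the dense layer dominates the parameter count, which is one reason the input chunk size matters for memory use.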

Pitch Shift Data Augmentation

Implement methods that, given time-frequency data, shift the data in the frequency domain.

The augmented data will be used to encourage pitch-shift invariance when training the neural network.

Implementation details:

small shift of around 5%
wrap-around to preserve complete information
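
A minimal sketch of the two implementation details above, assuming the spectrogram is a NumPy array laid out as (frequency, time). The function name, the random shift direction, and the `max_fraction` parameter are illustrative choices, not taken from the repository.

```python
import numpy as np


def pitch_shift(spectrogram, max_fraction=0.05, rng=None):
    """Shift a (frequency, time) spectrogram along the frequency axis with wrap-around."""
    rng = np.random.default_rng() if rng is None else rng
    n_bins = spectrogram.shape[0]
    max_shift = int(round(max_fraction * n_bins))
    # Draw a shift in [-max_shift, max_shift]; np.roll wraps rows around,
    # so every value of the input is still present in the output.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spectrogram, shift, axis=0)


spec = np.arange(20 * 4, dtype=float).reshape(20, 4)
shifted = pitch_shift(spec, rng=np.random.default_rng(0))
print(shifted.shape)
```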

Sprengel et al, 2016

"In a review of different augmentation methods [12] showed that pitch shifts (vertical shifts) also helped reducing the classification error. We found that, while a small shift (about 5%) seemed to help, a larger shift was not beneficial. Again we used a wrap-around method to preserve the complete information."

Explain Weight Initialization

Explain how weight initialization works, and why it is important.

Default in Keras: Understanding the difficulty of training deep feedforward neural networks (Glorot and Bengio, 2010)

Alternative explanation: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
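
For the explanation, a small NumPy sketch of Glorot (Xavier) uniform initialization may help. Weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), which gives a variance of 2 / (fan_in + fan_out) and keeps activation and gradient scales roughly stable across layers. The layer sizes below (1024 inputs, 19 outputs) are only an example.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix with Glorot uniform initialization."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


weights = glorot_uniform(1024, 19, rng=np.random.default_rng(0))
target_variance = 2.0 / (1024 + 19)
# The empirical variance should be close to the target variance.
print(weights.var(), target_variance)
```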

Thesis Methods

Deep Residual Neural Networks
Multiple-Width Frequency-Delta Data

Tutorial Model

Implement, train, and evaluate a tutorial network model in keras which solves the MNIST problem.

Version of the pre-requisites

Can you mention the version of the pre-requisites on which the code works?

I understand it will take some time and effort to update the repo to reflect the changes in the API.

I found two issues when trying to run the code out of the box.

First, in signal_processing.py, librosa.stft requires a floating-point numpy array, which I resolved by adding the line 'wave = wave.astype(float)'.

Second, conf.ini did not specify the optimizer; I believe 'sgd' is used.

Residual Functions

Modify the convolutional neural network such that it learns residual functions, i.e., it will become a deep residual network.
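
To illustrate what "learning a residual function" means, here is a toy NumPy sketch. Dense layers stand in for the convolutions, and the weight shapes are hypothetical; the point is only that the block outputs x + F(x), so the layers learn the difference F(x) from the identity.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Compute relu(x + F(x)) with F(x) = relu(x @ w1) @ w2."""
    residual = relu(x @ w1) @ w2  # the learned residual function F(x)
    return relu(x + residual)     # the shortcut adds the unchanged input back


x = np.array([[1.0, -2.0, 3.0]])
zero = np.zeros((3, 3))
# With all-zero weights F(x) = 0, so the block reduces to relu(x).
out = residual_block(x, zero, zero)
print(out)
```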

Signal / Noise Separation

Add a method which separates structured sound from noise.

Input: time-frequency data
Output: Intervals with noise, intervals with structure, intervals with neither.

Implementation details:

Signal mask

compute the spectrogram of the whole wave file
- pass signal through STFT, using Hanning window function (size 512, 75% overlap).
- normalize: divide every element by the maximum value, s.t., all values in [0, 1].
select all pixels in the spectrogram that are three times bigger than the row median, and three times bigger than the column median. Set these pixels to 1, all others to 0.
apply a binary erosion and dilation filter, 4 by 4 filter produced best results.
create indicator vector with as many elements as there are columns in spectrogram
- set i:th element in indicator to 1 if i:th column contains at least one 1, otherwise set to 0
- smooth indicator vector by applying two more binary dilation filters (4 by 1).
scale indicator vector to the length of the original sound file.
use scaled indicator vector as mask to extract signal.

Noise mask
Same as signal, but select pixels larger than 2.5 times the row/column median, and invert the vector at the end.

Everything else is considered to contain no relevant information. The use of dilation filters ensures that the number of generated intervals is kept to a minimum.
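
The mask steps can be sketched as follows, starting from an already computed magnitude spectrogram laid out as (frequency, time). This is a sketch under assumptions: the function names are made up here, scipy.ndimage is used for the erosion and dilation filters, and the nearest-index scaling to the waveform length is one possible choice.

```python
import numpy as np
from scipy import ndimage


def indicator_from_spectrogram(spectrogram, threshold):
    """Return one boolean per spectrogram column: True where structure was found."""
    spec = spectrogram / spectrogram.max()  # normalize to [0, 1]
    row_median = np.median(spec, axis=1, keepdims=True)
    col_median = np.median(spec, axis=0, keepdims=True)
    mask = (spec > threshold * row_median) & (spec > threshold * col_median)
    structure = np.ones((4, 4), dtype=bool)
    mask = ndimage.binary_erosion(mask, structure=structure)
    mask = ndimage.binary_dilation(mask, structure=structure)
    indicator = mask.any(axis=0)  # column contains at least one 1
    # Two more dilations with a 4 by 1 filter to smooth the indicator vector.
    return ndimage.binary_dilation(indicator, structure=np.ones(4, dtype=bool), iterations=2)


def scale_to_length(indicator, n_samples):
    """Stretch the per-column indicator to one boolean per waveform sample."""
    positions = (np.arange(n_samples) * len(indicator)) // n_samples
    return indicator[positions]


# Synthetic example: a loud block on a quiet background.
spec = np.full((64, 100), 0.01)
spec[20:30, 40:60] = 1.0
signal_mask = scale_to_length(indicator_from_spectrogram(spec, 3.0), 1000)
noise_mask = ~scale_to_length(indicator_from_spectrogram(spec, 2.5), 1000)
print(signal_mask.sum(), noise_mask.sum())
```

Indexing a waveform of the same length with `signal_mask` or `noise_mask` then extracts the signal or noise part.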

Identity Mappings

Modify the neural network to include identity mappings.

Compute Spectrogram Correctly

It seems that the current use of scipy.signal.spectrogram is not sufficient, and that information is lost. Correct this by a manual implementation of spectrogram computations, or by understanding the scipy.signal.spectrogram method completely so that the desired results can be achieved.
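
One possible manual implementation is sketched below with NumPy only. The Hann window of size 512 with 75% overlap follows the Signal / Noise Separation issue; the function name, the magnitude (rather than power) output, and dropping the trailing partial frame are assumptions.

```python
import numpy as np


def spectrogram(wave, window_size=512, overlap=0.75):
    """Magnitude spectrogram from a manual short-time Fourier transform.

    Assumes len(wave) >= window_size; a trailing partial frame is dropped.
    """
    wave = np.asarray(wave, dtype=float)
    hop = int(window_size * (1.0 - overlap))
    window = np.hanning(window_size)
    n_frames = 1 + (len(wave) - window_size) // hop
    frames = np.stack([
        wave[i * hop:i * hop + window_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: window_size // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (frequency, time)


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
# Bin width is 16000 / 512 = 31.25 Hz, so a 1000 Hz tone peaks in bin 32.
print(spec.shape, spec[:, 0].argmax())
```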

References:

Residual Network Model

Implement a deep residual neural network model.

Thesis Previous Work

Missing import matplotlib.pyplot as plt

Missing import matplotlib.pyplot as plt:

The code for data analysis uses plt for plotting but does not import it explicitly.
Please make sure to include import matplotlib.pyplot as plt at the beginning of the code.

Same Class and Noise Addition

Implement additive data augmentation methods. These will be used to encourage the network to see more combined species samples, and to see more variations of noise/structure.

Methods:

load multiple random noise samples
load multiple random signal samples of the same class
additively combine multiple samples of time-frequency data
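
A hedged sketch of the additive methods, working on waveforms in the time domain as in the Sprengel et al. description quoted in this issue. The function names are hypothetical, and "preserve the original maximum amplitude" is interpreted here as rescaling the mix to the larger of the two input peaks.

```python
import numpy as np


def tile_to_length(wave, length):
    """Repeat a waveform as often as necessary and cut it to the given length."""
    repeats = int(np.ceil(length / len(wave)))
    return np.tile(wave, repeats)[:length]


def add_same_class(wave_a, wave_b):
    """Add two same-class waveforms and rescale to the original peak amplitude."""
    length = max(len(wave_a), len(wave_b))
    a = tile_to_length(wave_a, length)
    b = tile_to_length(wave_b, length)
    original_peak = max(np.abs(a).max(), np.abs(b).max())
    mixed = a + b
    peak = np.abs(mixed).max()
    return mixed * (original_peak / peak) if peak > 0 else mixed


def add_noise(signal, noises, dampening=0.4):
    """Add each noise sample on top of the signal, scaled by the dampening factor."""
    out = np.asarray(signal, dtype=float).copy()
    for noise in noises:
        out += dampening * tile_to_length(noise, len(out))
    return out


mixed = add_same_class(np.array([1.0, -1.0, 0.5, 0.5]), np.array([0.5, 0.5]))
noisy = add_noise(np.zeros(4), [np.ones(2)] * 3)  # three noise samples
print(mixed, noisy)
```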

Same Class

"We follow [14] and add sound files that correspond to the same class. Adding is a simple process because each sound file can be represented by a single vector. If one of the sound files is shorter than the other we repeat the shorter one as many times as it is necessary. After adding two sound files, we re-normalize the result to preserve the original maximum amplitude of the sound files. The operation describes the effect of multiple birds (of the same species) singing at the same time. Adding files improves convergence because the neural network sees more important patterns at once, we also found a slight increase in the accuracy of the system (see Table 1)." (Sprengel et al, 2016)

Adding Noise

"One of the most important augmentation steps is to add background noise. In Section 2.1 we described how we split each file into a signal and noise part. For every signal sample we can choose an arbitrary noise sample (since the background noise should be independent of the class label) and add it on top of the original training sample at hand. As for combining same class audio files, this operation should be done in the time domain by adding both sound files and repeating the smaller one as often as necessary. We can even add multiple noise samples. In our test we found that three noise samples added on top of the signal, each with a dampening factor of 0.4 produces the best results. This means that, given enough training time, for a single training sample we eventually add every possible background noise which decreases the generalization error." (Sprengel et al, 2016)

Wrong variable in preprocess_birdclef.py

xml_paths should be replaced by xml_roots in https://github.com/johnmartinsson/bird-species-classification/blob/master/preprocess_birdclef.py#L36.

Maybe you should run your code once to test it?

Explain arguments for scripts.

The naming of some of the arguments for the optparser can be confusing. Should go through and clean this up.

Divide Spectrograms into Chunks

Implement a method which divides the spectrograms into equal chunks.

Implementation details:

Split spectrogram into chunks of equal size (length 512).

Motivation:

Need fixed sized input for the neural network architecture.
- Allow to pad only the last part, and keep step size constant
Each chunk can be used as a unique sample for training (since "empty" parts have been removed)
Network can make multiple predictions per sound file, and average them to generate a final prediction.
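
The chunking can be sketched as below, assuming a (frequency, time) NumPy array. Zero padding of the last chunk and non-overlapping chunks (step size equal to chunk length) are assumptions; the function name is illustrative.

```python
import numpy as np


def split_into_chunks(spectrogram, chunk_length=512):
    """Split along the time axis into equal chunks, zero-padding only the last one."""
    n_bins, n_frames = spectrogram.shape
    n_chunks = max(1, int(np.ceil(n_frames / chunk_length)))
    padded = np.zeros((n_bins, n_chunks * chunk_length), dtype=spectrogram.dtype)
    padded[:, :n_frames] = spectrogram
    return [
        padded[:, i * chunk_length:(i + 1) * chunk_length]
        for i in range(n_chunks)
    ]


spec = np.ones((257, 1200))
chunks = split_into_chunks(spec)
print(len(chunks), chunks[0].shape)
```

Per-chunk predictions for one sound file could then be averaged, for example with `np.mean(predictions, axis=0)`, to obtain the file-level prediction.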

ValueError in create_dataset.py

python create_dataset.py --src_dir=<my_src_dir> --dst_dir=<my_dst_dir> --valid_percentage=20
copying sound classes...
splitting train/validation...
Traceback (most recent call last):
  File "create_dataset.py", line 99, in <module>
    main()
  File "create_dataset.py", line 91, in main
    replace=False)
  File "mtrand.pyx", line 1161, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:18155)
ValueError: Cannot take a larger sample than population when 'replace=False'
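
This error is raised whenever the requested sample size exceeds the number of items and replace=False. What create_dataset.py does at line 91 is not shown here, so the following is only a sketch of the failure and of one possible guard; `pick_validation_files` and the file names are hypothetical, and it uses the newer `numpy.random.default_rng` API rather than `RandomState`.

```python
import numpy as np


def pick_validation_files(files, valid_percentage, rng=None):
    """Pick a validation subset without replacement, never asking for more than exists."""
    rng = np.random.default_rng() if rng is None else rng
    n_valid = int(round(len(files) * valid_percentage / 100.0))
    # Sampling without replacement fails when size > len(files), so clamp it.
    n_valid = min(n_valid, len(files))
    return list(rng.choice(files, size=n_valid, replace=False))


files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]

# Reproduce the failure: six samples from five files without replacement.
try:
    np.random.default_rng(0).choice(files, size=len(files) + 1, replace=False)
except ValueError as error:
    print("ValueError:", error)

picked = pick_validation_files(files, 20, rng=np.random.default_rng(0))
clamped = pick_validation_files(files, 150, rng=np.random.default_rng(0))
print(picked, len(clamped))
```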

Baseline Evaluation

Evaluate the baseline classifier with respect to:

Area Under Curve (AUC)
Mean Average Precision (MAP)
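
Both metrics can be sketched in NumPy as follows. The sketch assumes binary labels and real-valued scores; for MAP it treats each row of the label and score matrices as one ranked list, and whether rows are recordings or classes depends on the evaluation protocol, which this issue does not specify. Each list is assumed to contain at least one positive.

```python
import numpy as np


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive item."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(label_matrix, score_matrix):
    """Average the per-row average precision."""
    return float(np.mean([
        average_precision(labels, scores)
        for labels, scores in zip(label_matrix, score_matrix)
    ]))


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    wins = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(positives) * len(negatives)))


labels = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
scores = np.array([[0.9, 0.8, 0.7, 0.1], [0.2, 0.9, 0.1, 0.3]])
ap = average_precision(labels[0], scores[0])
auc = roc_auc(labels[0], scores[0])
mean_ap = mean_average_precision(labels, scores)
print(ap, auc, mean_ap)
```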

Bird Classifier Evaluation

Evaluate the (improved) classifier with respect to:

Mean Average Precision
Area Under Curve

Rewrite to follow Keras 2.0 API

Multi-Width Frequency-Delta Data Augmentation

Implement the multi-width frequency-delta data augmentation.

Time Shift Data Augmentation

Implement a method which randomly shifts the time-frequency input data in the time domain.

The shifted time data will be used as additional samples when training the neural network in order to encourage time-shift invariance.

Implementation details:

split spectrogram in two parts (at random)
place the second part in front of the first
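
The two steps above amount to a wrap-around shift along the time axis. A minimal sketch, assuming a (frequency, time) NumPy array and a uniformly random split point (the function name is illustrative):

```python
import numpy as np


def time_shift(spectrogram, rng=None):
    """Cut the spectrogram at a random column and swap the two parts."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spectrogram.shape[1]
    split = int(rng.integers(0, n_frames))
    first, second = spectrogram[:, :split], spectrogram[:, split:]
    # Placing the second part in front of the first preserves all columns.
    return np.concatenate([second, first], axis=1)


spec = np.arange(12).reshape(3, 4)
shifted = time_shift(spec, rng=np.random.default_rng(0))
print(shifted)
```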

Sprengel et al, 2016

"Every time we present the neural network with a training example, we shift it in time by a random amount. In terms of the spectrogram this means that we cut it into two parts and place the second part in front of the first (wrap around shifts). This creates a sharp corner where the end of the second part meets the beginning of the first part but all the information is preserved. With this augmentation we force the network to deal with irregularities in the spectrogram and also, more importantly, teach the network that bird songs/calls appear at any time, independent of the bird species."

No argument parsing in preprocess_birdclef.py

$ python preprocess_birdclef.py --xml_dir=<path-to-xml-dir> \
                                --wav_dir=<path-to-wav-dir> \
                                --output_dir=<path-to-output-dir>

The script still uses a hardcoded path...
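
A sketch of what the argument parsing could look like for the three options in the command above. It uses argparse from the standard library, which is a substitution if the script relies on a different option parser; the help texts and the `build_parser` name are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the three directory options shown in the command above."""
    parser = argparse.ArgumentParser(description="Preprocess BirdCLEF data (sketch).")
    parser.add_argument("--xml_dir", required=True,
                        help="directory containing the XML metadata files")
    parser.add_argument("--wav_dir", required=True,
                        help="directory containing the wav files")
    parser.add_argument("--output_dir", required=True,
                        help="directory to write the preprocessed data to")
    return parser


# Passing a list instead of reading sys.argv keeps the example self-contained.
args = build_parser().parse_args(
    ["--xml_dir=xml", "--wav_dir=wav", "--output_dir=out"]
)
print(args.xml_dir, args.wav_dir, args.output_dir)
```

The parsed values would then replace the hardcoded paths inside the script.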

Thesis Abstract

Abstract

Normalize Input Data

Input data should have zero mean and unit variance.

Note: compute the mean and standard deviation from the training set ONLY, then subtract that mean from, and divide by that standard deviation, the training, validation, and test sets.

http://cs231n.github.io/neural-networks-2/#datapre
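
A small sketch of this normalization in NumPy. It uses a single global mean and standard deviation over the whole training array; per-frequency-bin statistics would be an equally plausible choice, and the function names and random data are illustrative only.

```python
import numpy as np


def fit_normalizer(train):
    """Compute the statistics from the training set only."""
    return train.mean(), train.std()


def normalize(data, mean, std, eps=1e-8):
    """Apply the training statistics; eps guards against division by zero."""
    return (data - mean) / (std + eps)


rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 16))
valid = rng.normal(loc=5.0, scale=2.0, size=(20, 16))

mean, std = fit_normalizer(train)
train_normalized = normalize(train, mean, std)
valid_normalized = normalize(valid, mean, std)  # same statistics, not refitted
print(train_normalized.mean(), train_normalized.std())
```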

Weight files seem to be missing

Possible issue may arise due to updates in "wave" file reading libraries
