What is Audify?
Audify is an audio classification project built around an Artificial Neural Network model that accurately categorizes audio samples based on their content.
Why is Audify Unique? How does it help the real world?
- Sound monitoring -> By accurately classifying urban sounds such as car horns, sirens, and jackhammers, the model can be used in real-time sound-monitoring applications. It can help city planners, environmental agencies, and policymakers understand noise patterns, identify areas with excessive noise levels, and implement measures to mitigate noise pollution.
- Public safety and security -> The ability to classify audio signals in real time can contribute to public safety and security. For example, the model can be integrated into surveillance systems to automatically detect and recognize critical sounds such as gunshots or alarms.
The dataset consists of 8732 labelled audio files in WAV format, drawn from 10 low-level classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
- Numpy
- Pandas
- Scikit-learn
- Keras
- TensorFlow
- Librosa
- Flask
- Pickle
- Audio preprocessing
- Audio classification
- Audio feature extraction
- Deep Learning - model building
In this project, the feature considered is the MFCC (Mel Frequency Cepstral Coefficients). MFCCs are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel frequency scale. The Mel scale is a perceptual scale of frequency based on the human ear's response to sound. The power spectrum of the sound is divided into a number of frequency bands that are equally spaced on the Mel scale, and the energy in each band is summed and logarithmically compressed. This log-compressed spectrum is then transformed with a Discrete Cosine Transform (DCT) to produce the MFCCs. MFCCs are among the most commonly used features in voice signal processing, with applications such as speaker recognition, speech recognition, and gender identification.
The following code is used to extract the MFCC features from each file:
The extracted MFCC values are assembled into a Pandas DataFrame.
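A sketch of building that DataFrame. Here `metadata_rows` and `extract_mfcc` are hypothetical stand-ins: in the project they would come from the UrbanSound8K metadata and the MFCC extraction step, respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the dataset metadata and the extractor
metadata_rows = [("fold1/101415-3-0-2.wav", "dog_bark"),
                 ("fold1/102106-3-0-0.wav", "dog_bark")]

def extract_mfcc(file_path):
    # Placeholder for the real librosa-based extractor (40-dim vector)
    return np.zeros(40)

# One row per clip: the 40-dim feature vector and its class label
features = [[extract_mfcc(path), label] for path, label in metadata_rows]
features_df = pd.DataFrame(features, columns=["feature", "class"])
print(features_df.shape)  # (2, 2)
```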
The created dataset is split 70-30 for training and testing of the model.
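The split can be sketched with Scikit-learn's `train_test_split`. The feature matrix and labels below are random stand-ins (the real ones come from the DataFrame above), and the labels are one-hot encoded for the softmax output layer:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins: one 40-dim MFCC vector per clip, integer class ids 0..9
X = np.random.rand(8732, 40)
y = np.random.randint(0, 10, size=8732)
y_onehot = np.eye(10)[y]  # one-hot encode the 10 classes

# 70-30 split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (6112, 40) (2620, 40)
```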
The shapes of the training and testing datasets:
In the project, we built a fully connected network with an input layer, three hidden layers, and an output layer using the Keras Sequential API. The first hidden layer consists of 100 neurons, takes an input of 40 features, uses the ReLU activation function, and is followed by a dropout of 50%. The second hidden layer consists of 200 neurons, also with a dropout of 50% and a ReLU activation, extracting more complex features than the first. The third hidden layer is a dense layer with 100 neurons and a ReLU activation, again extracting more complex features. The last layer is a dense layer with one neuron per output class (10) and a softmax activation function.
We trained the model for 100 epochs with a batch size of 50.
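The architecture above can be sketched with the Keras Sequential API. The loss and optimizer are assumptions (categorical cross-entropy with Adam is the usual choice for one-hot multi-class targets), and the `fit` call shows the documented epoch and batch settings against hypothetical `X_train`/`y_train` arrays:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

model = Sequential([
    Input(shape=(40,)),                 # 40 MFCC features per clip
    Dense(100, activation="relu"),
    Dropout(0.5),
    Dense(200, activation="relu"),
    Dropout(0.5),
    Dense(100, activation="relu"),
    Dense(10, activation="softmax"),    # one unit per class
])

# Assumed compile settings (not stated in the text)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Training as described: 100 epochs, batch size 50
# model.fit(X_train, y_train, epochs=100, batch_size=50,
#           validation_data=(X_test, y_test))
```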
From training and testing the model, we obtained the accuracies shown below.
The dataset and baseline results come from the following paper: Salamon, Justin, Christopher Jacoby, and Juan Pablo Bello. "A Dataset and Taxonomy for Urban Sound Research." Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
The paper reports that SVM and Random Forest models achieve a high accuracy of approximately 73%. With the neural network architecture used in this project, we obtained improved accuracies of 83.81% and 78.28% for training and testing, respectively.
We used Python's Flask web framework for the front end and the pickle module for loading the trained model.
When a sample audio file is uploaded, the model classifies it and the result page highlights the predicted class.
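A minimal sketch of how such a Flask endpoint might look. The route name, the upload field name, and the feature-extraction and prediction placeholders are assumptions; the pickled-model loading is shown as a comment. The class names are the 10 UrbanSound8K labels:

```python
import pickle
import numpy as np
from flask import Flask, request

app = Flask(__name__)

# Hypothetical: load the pickled model once at startup
# model = pickle.load(open("model.pkl", "rb"))

CLASSES = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
           "drilling", "engine_idling", "gun_shot", "jackhammer",
           "siren", "street_music"]

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded clip, extract its 40-dim MFCC vector, classify it
    uploaded = request.files["audio"]
    uploaded.save("upload.wav")
    features = np.zeros((1, 40))   # placeholder for extract_mfcc("upload.wav")
    probs = np.random.rand(10)     # placeholder for model.predict(features)[0]
    return {"predicted_class": CLASSES[int(np.argmax(probs))]}
```

The result template would then highlight `predicted_class` on the result page.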