- A project to detect the accent of an individual in spoken English.
- Machine learning models are trained on 4 accent classes and used to predict the accent of a given audio file.
- Motivation
- Tools Used
- Dataset
- Data Analysis and Pre-Processing
- Audio Processing
- Data Preparation
- Training Machine Learning Models
- Model Performance Comparison
- Training Neural Networks
- Techniques to handle imbalance in the dataset
- Future enhancements
The motivation behind this project: as an Indian student living in the US, I wanted to improve my American accent to learn more about the culture and blend in. But there was no tool online that could tell me what my accent was and how close it was to the target. That is when I decided to create a tool that could help people like me identify their accent and suggest improvements.
- Jupyter Notebook/Google Colab
- Librosa for audio processing (frequency-domain + time-domain features):
- MFCC (Mel-Frequency Cepstral Coefficients)
- Numpy, Pandas for Data Processing and Analysis.
- scikit-learn for Machine Learning models.
- TensorFlow and Keras for deep learning models.
The project uses this dataset from Kaggle.
- Since the dataset has few samples, I chose to classify only 4 accents (will expand when more samples are available):
- Indian
- American
- British
- Chinese
The dataset contains:
- 2172 speaker samples in total (audio in mp3 format).
- Samples from 177 different countries.
- Samples of 214 different languages.
- Each speaker reads the passage:
"Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."
- Merged samples from languages in and around India to create the "Indian" accent class (110 samples).
- Grouped samples to create the "American" (373), "British" (65), and "Chinese" (88) accent classes.
- Removed all other samples from the dataframe.
- Removed unnecessary columns (age, birthplace, speakerid, file_missing).
- Converted all the mp3 files to WAV format (uncompressed).
- Trimmed/padded all the audio files to a standard length of 30 seconds.
- Extracted 13 MFCCs (Mel-Frequency Cepstral Coefficients) from each audio file. In a sentence, an MFCC is a "representation" of the vocal tract that produces the sound; think of it like an X-ray of your mouth.
- The extracted MFCCs of each audio file have the shape (1, 2584, 13).
- Added Gaussian noise to each sample.
- Oversampled the minority classes to handle the imbalance in the dataset.
- Dumped the data in JSON format for later access.
- Load the MFCCs and targets from the JSON file.
- Flatten the data into a 2D matrix of shape (1445, 33592) -> (#samples, 2584*13).
- Create train and test sets using train_test_split.
- Use SMOTE (Synthetic Minority Over-Sampling Technique) to balance the classes.
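The preparation steps above look roughly like this sketch, with random arrays standing in for the real 1445 MFCC samples; the SMOTE call is shown commented out because it requires the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 40 toy samples stand in for the 1445 real ones; each is a (2584, 13) MFCC matrix
mfccs = rng.normal(size=(40, 2584, 13))
targets = np.repeat([0, 1, 2, 3], 10)      # 4 accent classes

# flatten each MFCC matrix into one feature vector: 2584 * 13 = 33592 features
X = mfccs.reshape(len(mfccs), -1)
print(X.shape)                             # (40, 33592)

X_train, X_test, y_train, y_test = train_test_split(
    X, targets, test_size=0.25, stratify=targets, random_state=42
)
print(X_train.shape, X_test.shape)         # (30, 33592) (10, 33592)

# SMOTE (from imbalanced-learn) would then balance the training set only:
# from imblearn.over_sampling import SMOTE
# X_train, y_train = SMOTE().fit_resample(X_train, y_train)
```

Applying SMOTE only after the split (and only to the training set) matters: resampling before splitting leaks synthetic copies of training points into the test set and inflates the reported accuracy.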
- Support Vector Machine (SVM):
- Random Forest Classifier (RFC):
  - Using an RFC with max_depth=16 and n_estimators=250.
  - Metrics: accuracy of 92%, slightly lower than the SVC, but still decent with nearly identical f1-scores.
- K-Nearest Neighbors (KNN):
  - Using KNN with n_neighbors=3.
  - Metrics: accuracy of 76.57% with imbalanced f1-scores; performs poorly compared to the SVC and RFC.
- Logistic Regression (LR):
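Training the four models looks roughly like the sketch below, run here on synthetic 4-class data rather than the real MFCC features. The RFC and KNN hyperparameters are the ones listed above; the SVM and LR settings are scikit-learn defaults, since the notebook's exact values for those two are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic 4-class data stands in for the flattened MFCC features
X, y = make_classification(n_samples=400, n_features=50, n_informative=20,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "SVM": SVC(),
    "RFC": RandomForestClassifier(max_depth=16, n_estimators=250, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "LR": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # test-set accuracy
    print(f"{name}: {scores[name]:.2f}")
```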
This table compares the models and their metrics.
| Model | Accuracy | American (f1) | British (f1) | Chinese (f1) | Indian (f1) |
|---|---|---|---|---|---|
| SVM | 93% | 85% | 98% | 99% | 89% |
| RFC | 92% | 85% | 100% | 96% | 87% |
| KNN | 76% | 55% | 83% | 89% | 75% |
| LR | 87% | 79% | 96% | 94% | 82% |
Note: We can see that SVM performs best here.
- Recurrent Neural Network (RNN):
- Using Keras's Sequential model with 1 input layer, 3 hidden layers (with dropout, ReLU activation), and 1 output layer (softmax activation).
- Using the Adam optimizer; trained for 50 epochs.
- Metrics: accuracy of 59%. (Disappointing results.)
- Convolutional Neural Network (CNN):
- Using Keras's Sequential model with 1 input layer, 3 convolution layers (with BatchNormalization), 1 dense layer (with dropout), and 1 output layer (softmax activation).
- Using the Adam optimizer; trained for 30 epochs.
- Metrics: accuracy of 68% on test data. (Performed really poorly on real data samples.)
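The CNN architecture described above can be sketched in Keras as follows. The filter counts, pooling sizes, and dense width here are illustrative assumptions, not the notebook's exact values; only the overall structure (3 conv layers with BatchNormalization, 1 dense layer with dropout, softmax output over 4 classes) follows the description.

```python
import numpy as np
from tensorflow import keras

N_FRAMES, N_MFCC = 2584, 13     # MFCC shape from the pre-processing step

# layer sizes are illustrative, not the notebook's exact values
model = keras.Sequential([
    keras.layers.Input(shape=(N_FRAMES, N_MFCC, 1)),
    keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(4, activation="softmax"),   # one unit per accent class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# a single all-zero "audio clip" confirms the output is a 4-way probability vector
probs = model.predict(np.zeros((1, N_FRAMES, N_MFCC, 1)), verbose=0)
print(probs.shape)   # (1, 4)
```

Training would then be `model.fit(X_train, y_train, epochs=30)` with the MFCC arrays reshaped to `(samples, 2584, 13, 1)`.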
Initially, the data was heavily imbalanced, and training an SVC gave an accuracy of only 59% with extremely poor f1-scores for each accent. To handle the imbalance, I used the following techniques:
- SMOTE (Synthetic Minority Over-Sampling Technique):
- Addresses imbalanced datasets by oversampling the minority class.
- The simplest approach duplicates examples in the minority class, though these duplicates add no new information to the model.
- Instead, SMOTE synthesizes new examples from the existing ones.
Results: SMOTE improved accuracy from 59% to around 65%. (Better, but not acceptable.)
- Undersampling:
- Reduced the American samples (373) to 110 and then trained an SVC.
Results: extremely poor accuracy of 41% with poor f1-scores.
- Oversampling:
- Duplicate samples of the minority classes to roughly match the American sample count.
- Compute a duplication ratio per class: 373/373 = 1, 373/110 ≈ 3, 373/88 ≈ 4, 373/65 ≈ 5.
- Duplicate each sample by its class's duplication ratio.
Results: gave a really good training accuracy of 90%, but performed subpar on test data.
- Oversampling with Noise:
- Duplicate samples of the minority classes as above, but
- extract random Gaussian noise from the signal and add it to the audio file before oversampling.
Results: performed best, with an overall accuracy of 93% and nicely balanced f1-scores.
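The duplication-ratio scheme with noise can be sketched in pure NumPy as below. Toy 100-sample signals mimic the class counts from the dataset, integer division reproduces the ratios above, and the `noise_factor` value is an assumption rather than the notebook's setting.

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample_with_noise(samples_by_class, noise_factor=0.005):
    """Duplicate each class's samples by its duplication ratio, adding
    Gaussian noise so the copies are not exact duplicates."""
    majority = max(len(s) for s in samples_by_class.values())
    balanced = {}
    for label, samples in samples_by_class.items():
        ratio = majority // len(samples)     # e.g. 373 // 110 = 3, 373 // 65 = 5
        copies = []
        for signal in samples:
            for _ in range(ratio):
                noise = rng.normal(0.0, signal.std() or 1.0, signal.shape)
                copies.append(signal + noise_factor * noise)
        balanced[label] = copies
    return balanced

# toy signals mimic the class counts from the dataset
data = {label: [rng.normal(size=100) for _ in range(n)]
        for label, n in [("American", 373), ("Indian", 110),
                         ("Chinese", 88), ("British", 65)]}

balanced = oversample_with_noise(data)
print({label: len(s) for label, s in balanced.items()})
# {'American': 373, 'Indian': 330, 'Chinese': 352, 'British': 325}
```

The classes end up close to, not exactly at, 373 samples each, because the integer duplication ratio only approximates the majority count.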
Right now, the predictor runs as a Jupyter Notebook application. It works well within the notebook and classifies most random samples correctly.
- To make the app accessible to more people, create a Flask application exposing a Web API, and design a website to support it.
- Build an Android application for mobile users.
Resources relevant to the project:
- https://www.youtube.com/playlist?list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0 : Audio Processing for Machine Learning.
- https://www.youtube.com/watch?v=fMqL5vckiU0&list=PL-wATfeyAMNrtbkCNsLcpoAyBBRJZVlnf&ab_channel=ValerioVelardo-TheSoundofAI: Audio Processing for Deep Learning with Python