Automatic Speaker Recognition using Transfer Learning

Overview

This project focuses on Automatic Speaker Recognition (ASR) using a custom-made dataset derived from Librivox audiobooks, supplemented with testing data scraped from YouTube or collected live. The dataset encompasses diverse audio samples, varying in speakers' gender, language, and accents. The work involves two datasets: one comprising spectrogram images of audio book segments, and the other with extracted spectral features stored in a table format (.csv) alongside speaker labels.

Methodology

Dataset Creation:
- An intermediate dataset was formed by sampling audio at a rate of 44100Hz in .wav mono format from Librivox audiobooks.
- Criteria for selecting speakers included uniqueness, a mix of genders and languages, a minimum of 1 hour of audio per speaker, and minimal background noise.
- The intermediate dataset, consisting of 5-second audio segments from 80 speakers, was then further processed.
CNN Model Architecture:
- The CNN model was trained using 19,000 spectrogram images for the training set and 8,200 validation samples for the validation set.
Fig. 2: CNN Model Architecture
ANN Model Training:
- An artificial neural network was trained on the spectral features, including Spectral Centroid, MFCCs, and Spectral Bandwidth.
- The spectral features were extracted for each audio segment using Librosa and stored in a .csv format with corresponding speaker labels.
Transfer Learning Approach:
- Spectrogram images were fed into a network with hidden convolutional layers acting as a feature extractor.
- The output from both the CNN and ANN models was fed into an SVM classifier for final classification, representing a transfer learning approach.
- This approach combines the small high performance of SVM with the feature learning capabilities of CNN.
Experiments and Comparison:
- After final experiments, the performance of the CNN and ANN models was compared, and the results were discussed.

Dataset Details

The intermediate audio dataset was created by breaking down 1-hour samples from 200 speakers into 5-second segments, resulting in approximately 500-700 segments per speaker.
For the CNN dataset, audio segments were converted into spectrogram images in PNG format, resulting in 19,000 training samples and 8,200 validation samples.
For the ANN dataset, spectral features were calculated using Librosa and stored in a .csv format, yielding a total of 28,427 samples for 80 speakers.
Transfer learning data for implementing the models was scraped from YouTube and collected live, featuring 5 speakers with 30s of data each (6 samples) in both spectrogram image and extracted feature formats with corresponding labels.

For a detailed explanation of the project, refer to the published paper: Link to Published Paper

Usage

Run Creating_Datasets.ipynb to download Librivox audiobooks and create the initial dataset.
(Optional) Run DataProcessing.ipynb to preprocess files into a zip format before creating datasets.
Run Training_CNN.ipynb to train the CNN model and store weights.
Run Training_ANN.ipynb to train the ANN model and store weights.
Run Transfer_Learning.ipynb to test pretrained models on new data and compare their performance.

Future Directions

Explore different model architectures.
Incorporate more diverse datasets.
Address any identified limitations.

License

This project is licensed under the Apache License - see the LICENSE.md file for details.

sonalgan / asr-tf Goto Github PK

asr-tf's Introduction

Automatic Speaker Recognition using Transfer Learning

Overview

Methodology

Dataset Details

Usage

Future Directions

License

asr-tf's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent