Giter Club home page Giter Club logo

free-spoken-digit-dataset's Introduction

🔢Audio MNIST Classification on the Free Spoken Digit Dataset🔊

This project is the MNIST equivalent for audio. The dataset contains recordings of speakers saying numbers 0 to 9. The data is converted to mel spectograms and classified using a modified ResNet18 model.

Dataset

Free Spoken Digit Dataset (FSDD) is a simple audio/speech dataset consisting of recordings of spoken digits in wav files. The dataset contains approximately 20 MB of 1,500 recordings of spoken digits from 0 to 9. Each digit was spoken by 50 different speakers, and each speaker spoke each digit five times. The recordings are trimmed so that they have near minimal silence at the beginnings and ends.

FSDD is an open dataset, which means it will grow over time as data is contributed. It is a useful dataset for speech recognition tasks and can be thought of as an audio version of the popular MNIST dataset which consists of hand-written digits.

alt text

Training

Notebook

Open In Colab
  • Imput data: Mel spectograms (2400 train, 600 test)
  • Model: Modified ResNet18 model
  • Total epochs: 300
  • LR: 0.001, step size 20, gamma 0.9
  • Loss: 1.4690
  • Val Loss: 1.5069
  • Accuracy: 0.9571
  • Val Acc: 0.9696

Best epoch (highest validation accuracy):

  • Epoch: 271
  • Precision: 0.9691420399131312
  • Recall: 0.9694164524957861
  • F1 score: 0.9690412348965143

alt text

Inference

You can use the trained model to run inference on a single mel spectogram image using:
python inference.py Data\Mel\0\0_george_0.png

Output:
Prediction: 0 (Confidence: 0.9999996423721313)

Future steps

  • Any empty signal at the beginning or end of each clip is removed, however, this does not mean noise before and after the section of the desired signal is removed. To fix, the data could be passed through a noise gate to remove noise before and after the desired signal while not affecting it.
  • The model should be modified to allow input of different lengths without having to resize them.

free-spoken-digit-dataset's People

Contributors

dilne avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.