
timitspeech

With this repo you can preprocess an audio dataset (modify phoneme classes, resample audio, etc.) and train LSTM networks for framewise phoneme classification. You can reach about 82% accuracy on the TIMIT dataset, similar to the results of Graves et al. (2013), although CTC is not used here. Instead, the network generates a prediction at the middle of each phoneme interval, as specified by the labels. This is done to simplify things, but adding CTC shouldn't be too much trouble.
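The midpoint-prediction scheme can be sketched as follows. This is an illustrative snippet, not the repo's code, and the 160-sample frame shift (10 ms at 16 kHz) is an assumed value:

```python
# Illustrative sketch: assign one target frame per phoneme, at the middle
# of its labeled interval. Sample indices are at 16 kHz; the 160-sample
# frame shift (10 ms) is an assumption, not taken from the repo.

def midpoint_frames(intervals, frame_shift=160):
    """intervals: list of (start_sample, end_sample, phoneme) triples.
    Returns one (frame_index, phoneme) pair per phoneme."""
    targets = []
    for start, end, phoneme in intervals:
        mid_sample = (start + end) // 2
        targets.append((mid_sample // frame_shift, phoneme))
    return targets

# Two phonemes from a toy .phn-style label file:
print(midpoint_frames([(0, 3200, "sh"), (3200, 6400, "iy")]))
# [(10, 'sh'), (30, 'iy')]
```

The network is then trained to predict the correct phoneme class only at these frames, which is what removes the need for CTC-style alignment.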

In order to create and train a network on a dataset, you need to:

  1. Install the software. I recommend using Anaconda and a virtual environment.

    • Create an environment from the provided file: conda env create -f environment.yml
  2. Generate a binary file from the source dataset. It's easiest if you structure your data like the TIMIT dataset, although that's not strictly required: just make sure that each wav file and its corresponding phn file share the same path except for the extension. Otherwise they won't get matched and your labels will be off.

    • WAV files at a 16 kHz sampling rate, in the folder structure dataset/TRAIN/speakerName/videoName/. Each videoName/ directory contains a videoName.wav and a videoName.phn. The phn file lists the sample numbers (at 16 kHz) where each phoneme starts and ends.
    • If your files are in a different format, you can use the functions in fixDataset/ (run transform.py with the appropriate arguments; see the bottom of that file) to:
      • fix wav headers and resample wavs. The results are stored under dataRoot/dataset/fixed(nbPhonemes)/:
        transform.py phonemes -i dataRoot/TIMIT/original/ -o dataRoot/TIMIT/fixed
      • fix label files: replace phonemes (e.g. to use a reduced phoneme set; I used the 39 phonemes from Lee and Hon (1989)). These are stored next to the fixed wavs, under root/dataset/fixed(nbPhonemes)/
      • create an MLF file (as produced by the HTK toolkit, and as used in the TCD-TIMIT dataset)
      • The scripts should be case-agnostic, but you can convert lowercase to uppercase and vice versa by running find . -depth -print0 | xargs -0 rename '$_ = lc $_' in the root dataset directory (change 'lc' to 'uc' to convert to uppercase). Repeat until you get no more output.
    • Then set the variables in datasetToPkl.py (source and target directories, number of MFCCs to use, etc.) and run the file.
      • The result is stored as root/dataset/binary(nbPhonemes)/dataset/dataset_nbPhonemes_ch.pkl, e.g. root/TIMIT/binary39/TIMIT/TIMIT_39_ch.pkl.
      • The mean and standard deviation of the training data are stored as root/dataset/binary_nbPhonemes/dataset_MeanStd.pkl. They're useful for normalization when evaluating.
  3. Use RNN.py to start training. Its functions are implemented in RNN_tools_lstm.py, but you can set the parameters from RNN.py.

    • set location of pkl generated by datasetToPkl.py

    • specify number of LSTM layers and number of units per layer

    • use bidirectional LSTM layers

    • add some dense layers (though it did not improve performance for me)

    • learning rate and decay (the LR update happens at the end of RNN_tools_lstm.py): the rate is decreased when performance hasn't improved for some time.

    • it will automatically give the model a name based on the specified parameters. A log file, the model parameters, and a pkl file containing training info (accuracy, error, etc. for each epoch) are stored as well. The storage location is root/dataset/results

  4. To evaluate a dataset, change the test_dataset variable to whichever you want (TIMIT/TCDTIMIT/combined)

    1. You can generate test datasets with noise (either white noise or simultaneous speakers) at a chosen level: use mergeAudioFiles.py to create the wavs and testdataToPkl.py to convert them to pkl files.
    2. If this noisy audio is to be used for combinedSR, you need to generate the pkl files a bit differently, using audioToPkl_perVideo.py. That pkl file can then be combined with the images and labels generated by combinedSR/datasetToPkl. Enable this by setting the relevant parameters in combinedSR/combinedNN.py.
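Step 2 assumes TIMIT-style .phn label files: one "start_sample end_sample phoneme" triple per line, with sample indices at 16 kHz. A minimal parser might look like this (an illustrative sketch, not the repo's actual code):

```python
import os
import tempfile

def read_phn(path):
    """Parse a TIMIT-style .phn file into (start, end, phoneme) triples."""
    intervals = []
    with open(path) as f:
        for line in f:
            start, end, phoneme = line.split()
            intervals.append((int(start), int(end), phoneme))
    return intervals

# Toy example mirroring TIMIT's format:
with tempfile.NamedTemporaryFile("w", suffix=".phn", delete=False) as f:
    f.write("0 3050 h#\n3050 4559 sh\n")
    tmp_path = f.name
parsed = read_phn(tmp_path)
os.remove(tmp_path)
print(parsed)  # [(0, 3050, 'h#'), (3050, 4559, 'sh')]
```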
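The plateau-based learning-rate decay mentioned in step 3 can be sketched as below. The function name, patience, and decay factor are illustrative choices, not the repo's actual API:

```python
def update_lr(lr, history, patience=5, factor=0.5):
    """Halve the learning rate when the best validation accuracy over the
    last `patience` epochs does not beat the best seen before them.
    history: one validation accuracy per epoch, oldest first."""
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        return lr * factor
    return lr

# Accuracy has plateaued for the last 5 epochs, so the LR is halved:
accs = [0.60, 0.70, 0.71, 0.71, 0.70, 0.71, 0.70, 0.71]
print(update_lr(0.01, accs))  # 0.005
```

Calling this once at the end of each epoch reproduces the "decrease the LR if performance hasn't improved for some time" behavior described above.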
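The white-noise option in step 4 amounts to mixing noise into the signal at a target signal-to-noise ratio. A rough sketch for illustration — this is not mergeAudioFiles.py's actual code, and it assumes the usual dB definition SNR = 10·log10(P_signal / P_noise):

```python
import math
import random

def add_white_noise(signal, snr_db):
    """Return signal plus zero-mean Gaussian noise scaled to snr_db."""
    sig_power = sum(s * s for s in signal) / len(signal)
    noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(noise_power)
    return [s + random.gauss(0.0, scale) for s in signal]

# 10 ms of a 440 Hz tone at 16 kHz, corrupted at 10 dB SNR:
random.seed(0)  # for reproducibility
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
noisy = add_white_noise(clean, snr_db=10)
```

The "simultaneous speakers" variant would scale and add a second speech signal instead of Gaussian noise, using the same power-ratio computation.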

On TIMIT, you should get about 82% accuracy using a 2-layer bidirectional LSTM with 256 units per layer, and about 67% on TCD-TIMIT.

The TIMIT dataset is non-free and available from https://catalog.ldc.upenn.edu/LDC93S19.
The TCD-TIMIT dataset is free for research and available from https://sigmedia.tcd.ie/TCDTIMIT/.
If you want to use TCD-TIMIT, I recommend using my TCDTIMITprocessing repo to download and extract the database; it's quite a nasty job otherwise. You can use extractTCDTIMITaudio.py to get the phoneme and wav files.

If you want to do lipreading or audio-visual speech recognition, check out my other repository, MultimodalSR.

timitspeech's People

Contributors: matthijsvk, mvankeirsbilck


timitspeech's Issues

RUN.py

Can you tell me where I should set my TIMIT_39_ch.pkl? Is dataset = the path of TIMIT_39_ch.pkl? Thanks a lot.
