Giter Club home page Giter Club logo

ilid's Introduction

iLID

Automatic spoken language identification (LID) using deep learning.

Motivation

We wanted to classify the spoken language within audio files, a process that usually serves as the first step for NLP or speech transcription.

We used two deep learning approaches using the Tensorflow and Caffe frameworks for different model configuration.

Repo Structure

  • /data
    • Scripts to download training data from Voxforge and Youtube. For usage details see below.
  • /Evaluation
    • Prediction scripts for single audio files or list of files using Caffe
  • /Preprocessing
    • Includes all scripts to convert a WAV audio file into spectrogram and mel-filter spectrogram images using a Spark Pipeline.
    • All scripts to create/extract the audio features
    • To convert a directory of WAV audio files using the Spark pipeline run: ./run.sh --inputPath {input_path} --outputPath {output_path} | tee sparkline.log -
  • /models
  • /tensorflow
    • All the code for setting up and training various models with Tensorflow.
    • Includes training and prediction script. See train.py and predict.py.
    • Configure your learning parameters in config.yaml.
    • Add or change network under /tensorflow/networks/instances/.
  • /tools
    • Some handy scripts to clean filenames, normalize audio files and other stuff.
  • /webserver
    • A web demo to upload audio files for prediction.
    • See the included README

Requirements

  • Caffe
  • TensorFlow
  • Spark
  • Python 2.7
  • OpenCV 2.4+
  • youtube_dl
// Install additional Python requirements
pip install -r requirements.txt
pip install youtube_dl

Datasets

Downloads training data / audio samples from various sources.

Voxforge

/data/voxforge/download-data.sh
/data/voxforge/extract_tgz.sh {path_to_german.tgz} german

Youtube

  • Downloads various news channels from Youtube.
  • Configure channels/sources in youtube/sources.yml
python /data/youtube/download.py

Models

We trained models for 2/4 languages (English, German, French, Spanish).

Best Performing Models

The top scoring networks were trained with 15.000 images per languages, a batch size of 64, and a learning rate of 0.001 that was decayed to 0.0001 after 7.000 iterations.

Shallow Network EN/DE

Shallow Network EN/DE/FR/ES

Training

// Caffe:
/models/{model_name}/training.sh
// Tensorflow:
python /tensorflow/train.py

Labels

0 English, 
1 German, 
2 French, 
3 Spanish

Training Data

For training we used both the public Voxforge dataset and downloaded news reel videos from Youtube. Check out the /data directory for download scripts.

ilid's People

Contributors

hotzenklotz avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.