
DeepSpeech for Russian language

Project DeepSpeech is an open-source Speech-to-Text engine developed by Mozilla with TensorFlow, based on Baidu's Deep Speech research paper.

This repository focuses on building a large-vocabulary ASR system for the Russian language (paper).

Large training datasets are crawled from YouTube videos with captions using a method developed for this project (paper).

Datasets:

Labeled Russian speech (CSVs + WAV):

Used language model:

The resulting speech recognition system for Russian achieves 18% WER on a custom dataset crawled from voxforge.com.

The ASR system was applied to a speech-search task over a large collection of video files. The resulting search service lets you jump to the exact moment in a video where the requested text is spoken.

Here is a demo:

(animated demo: search-demo-5)

Please write to [email protected] for any questions and support.

Using released files for inference

You will need macOS or Linux.

  1. Follow Mozilla's DeepSpeech guide to install the deepspeech package with pip3.
  2. Go to releases and download tensorflow_pb_models.tar.gz and language_model.tar.gz.
  3. Unpack all files (output_graph.pb, lm.binary, trie and alphabet.txt) into one folder, then run inference:
deepspeech output_graph.pb alphabet.txt lm.binary trie my_russian_speech_audio_file.wav
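The released model expects 16 kHz, mono, 16-bit PCM WAV input (the format requirement is from the upstream DeepSpeech documentation). A minimal sketch of a pre-flight check before running the command above; the helper is our own, not part of DeepSpeech:

```python
# Check that a WAV file matches DeepSpeech's expected input format
# (16 kHz sample rate, 1 channel, 2-byte samples).
import math
import struct
import tempfile
import wave

def check_wav(path):
    """Return True if the file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Write a half-second 440 Hz test tone in the expected format.
path = tempfile.gettempdir() + "/test16k.wav"
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(8000)))

print(check_wav(path))  # True
```

If a recording is in another format, resample it (e.g. with sox or ffmpeg) before inference.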

Continue training from released checkpoint

A checkpoint is the directory you pass as the --checkpoint_dir parameter when training with DeepSpeech.py. You can continue training the TensorFlow acoustic model from the released checkpoint with your own datasets.
The released checkpoints use --n_hidden=2048 (the number of neurons in each hidden layer of the neural network); this value cannot be changed if you want to use the released checkpoint (any value other than 2048 will cause an error).

To use the released checkpoint in your training:

  1. Follow Training setup guide
  2. Extract the checkpoint_dir.tar.gz archive downloaded from the release into your /network/checkpoints directory (or any other). Update the absolute paths to the main checkpoint file in the checkpoint text file. Example checkpoint file contents:
model_checkpoint_path: "/network/DeepSpeech-ru-v1.0-checkpoint_dir/model.ckpt-126656"
all_model_checkpoint_paths: "/network/DeepSpeech-ru-v1.0-checkpoint_dir/model.ckpt-126656"
  3. Train with your own datasets, setting the --checkpoint_dir parameter to the directory you extracted the checkpoint into.
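The path edit in the checkpoint text file boils down to a string replacement; a minimal sketch (the directories below are illustrative, use the ones on your machine):

```python
# Rewrite the absolute paths in the extracted 'checkpoint' index file so
# they point at the directory you actually extracted the archive into.
old = "/network/DeepSpeech-ru-v1.0-checkpoint_dir"
new = "/network/checkpoints/DeepSpeech-ru-v1.0-checkpoint_dir"

# Contents as shipped in the release (both lines reference the old path).
sample = (
    'model_checkpoint_path: "{0}/model.ckpt-126656"\n'
    'all_model_checkpoint_paths: "{0}/model.ckpt-126656"\n'
).format(old)

fixed = sample.replace(old, new)
print(fixed)
```

In practice you would read the real checkpoint file, apply the same replacement and write it back (or simply edit the two lines by hand).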

Training setup

Requirements:

  • Good NVIDIA GPU with at least 8GB of VRAM
  • Linux OS
  • cuda-command-line-tools
  • docker
  • nvidia-docker (for CUDA support)

Setting up training environment

  1. Check that the nvidia-smi command works before moving to the next step
  2. Clone this repo: git clone https://github.com/GeorgeFedoseev/DeepSpeech and cd DeepSpeech
  3. Build a Docker image based on the Dockerfile from the cloned repo:
nvidia-docker build -t deep-speech-training-image -f Dockerfile .
  4. Run the container as a daemon. Link folders from the host machine into the Docker container using -v <host-dir>:<container-dir> flags. We will need the /datasets and /network folders in the container to access datasets and to store neural-network checkpoints. The -d parameter runs the container as a daemon (we will connect to it in the next step):
docker run --runtime=nvidia -dit --name deep-speech-training-container -v /<path-to-some-assets-folder-on-host>:/assets -v /<path-to-datasets-folder-on-host>:/datasets -v /<path-to-some-folder-to-store-NN-checkpoints-on-host>:/network deep-speech-training-image
  5. Connect to the running container (bash -c is used to sync the width and height of the console window):
docker exec -it deep-speech-training-container bash -c "stty cols $COLUMNS rows $LINES && bash"

Done! We are now inside training docker container.

Define alphabet in alphabet.txt

All training samples should have transcripts consisting only of characters defined in the data/alphabet.txt file. In this repository, alphabet.txt consists of the space character, the dash character and the Russian letters. If sample transcriptions in a dataset contain out-of-alphabet characters, DeepSpeech will throw an error.
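A quick pre-flight check can catch out-of-alphabet characters before training fails. This is our own sketch, assuming the alphabet.txt convention of one character per line with lines starting with # treated as comments:

```python
# Validate transcripts against the characters allowed by alphabet.txt.
def load_alphabet(lines):
    """Collect the allowed characters from alphabet.txt-style lines."""
    chars = set()
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        ch = line.rstrip("\n")
        if ch:
            chars.add(ch)
    return chars

def out_of_alphabet(transcript, alphabet):
    """Return the sorted list of characters not covered by the alphabet."""
    return sorted(set(transcript) - alphabet)

# A tiny alphabet: space, dash and a few Russian letters (the real file
# lists the full Russian alphabet).
alphabet = load_alphabet([" \n", "-\n", "а\n", "б\n", "в\n"])
print(out_of_alphabet("а б-в", alphabet))  # []
print(out_of_alphabet("а b", alphabet))    # ['b']
```

Running such a check over every CSV before training saves a failed run deep into preprocessing.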

Generate language model (using KenLM toolkit and generate_trie under the hood)

Run the Python script with the first parameter pointing to a long text file from which the language model will be estimated (for example, a Wikipedia dump in txt format):

python /DeepSpeech/maintenance/create_language_model.py /assets/big-vocabulary.txt

This script also has parameters:

  • o:int - maximum length (order) of word sequences in the language model
  • prune:int - minimum number of occurrences a sequence needs in the vocabulary to be kept in the language model

Example with extra parameters:

python /DeepSpeech/maintenance/create_language_model.py /assets/big-vocabulary.txt 3 2

It will create 3 files in the data/lm folder: lm.binary, trie and words.arpa. words.arpa is an intermediate file; DeepSpeech uses the trie and lm.binary files for language modelling. The trie is a tree representing all prefixes of words in the LM: each node is a prefix, and its children are prefixes with one more letter added.
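The prefix structure can be illustrated with a tiny Python trie. This is illustrative only: the real trie produced by generate_trie is a binary file consumed by the native decoder, not a Python object.

```python
# A minimal prefix trie over a toy LM vocabulary: each node is a prefix,
# each child extends the prefix by one letter.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def has_prefix(root, prefix):
    """True if some vocabulary word starts with this prefix."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = TrieNode()
for w in ["мама", "мыла", "раму"]:
    insert(root, w)

print(has_prefix(root, "мы"))  # True
print(has_prefix(root, "ко"))  # False
```

During beam search the decoder uses exactly this property to discard partial transcriptions that cannot be extended into any vocabulary word.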

Training on sample dataset data/tiny-dataset

The dataset consists of 3 sets: train, dev and test. For each set there is a CSV file and a folder containing WAV files. Each CSV row contains the full path to an audio file, its file size in bytes and the text transcription. To save space in the repo, the sample dataset has only 9 audio recordings (enough for a demo, not enough for a good WER). The train-set CSV repeats the same 3 rows 7 times to simulate more data.
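The row layout can be sketched in a few lines of Python. The column names wav_filename, wav_filesize and transcript follow the upstream DeepSpeech importers; the paths below are made up for illustration:

```python
# Build a DeepSpeech-style training CSV in memory.
import csv
import io

rows = [
    ("/datasets/tiny/train/0001.wav", 123456, "привет мир"),
    ("/datasets/tiny/train/0002.wav", 98304, "добрый день"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["wav_filename", "wav_filesize", "transcript"])
writer.writerows(rows)

print(buf.getvalue())
```

The file-size column lets DeepSpeech sort samples by length for efficient batching without opening every WAV file.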
To run a demonstration of the training process, execute:

bash bin/train-tiny-dataset.sh

You should see training and validation progress bars for each epoch. The training process stops when the validation error stops decreasing (early stopping). Then the testing phase starts; it uses the language model (the LM is not used during training), which is why each sample takes longer to process (the beam-search implementation that uses the language model is single-threaded and CPU-only).

Obviously, a good WER is not achievable with such a tiny dataset. To achieve a good WER (at least below 20%), use datasets with more than 500 hours of speech.

You can examine which parameters are passed to the DeepSpeech.py script by checking the contents of the train-tiny-dataset.sh file.

Setup Telegram notifications

Because training big RNNs like the one in DeepSpeech takes time (from a few hours to days or even weeks on weak hardware), it is good to be notified about training results instead of checking manually all the time.
You can use a Telegram bot to send yourself log messages. Create a bot in Telegram and get its accessToken, start chatting with the bot and get the chatId. Then create a telegram_credentials.json file in the root folder of the project with the following contents:

{
  "accessToken": "<your-access-token>",
  "chatId": "<your-chat-id>"
}

To have DeepSpeech.py send log messages to the specified chat through your Telegram bot, add the flag --log_telegram=1 when running training.
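Under the hood a notifier like this boils down to one HTTP POST to the Telegram Bot API's sendMessage endpoint. A sketch of building such a request from the credentials file above; the helper name is our own, not taken from this repo:

```python
# Build a Telegram Bot API sendMessage request from the credentials JSON.
import json
from urllib import parse, request

def build_send_message_request(creds_json, text):
    """Return a POST request for api.telegram.org/bot<token>/sendMessage."""
    creds = json.loads(creds_json)
    url = "https://api.telegram.org/bot{}/sendMessage".format(creds["accessToken"])
    data = parse.urlencode({"chat_id": creds["chatId"], "text": text}).encode()
    return request.Request(url, data=data)  # data= makes it a POST

req = build_send_message_request(
    '{"accessToken": "123:ABC", "chatId": "42"}',
    "Epoch 3 finished, dev loss 12.4",
)
print(req.full_url)
```

Sending is then just `urllib.request.urlopen(req)`; keep the credentials file out of version control, since the access token grants full control of the bot.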
