Speech Corpus Downloader

Download and prepare common free speech corpora. Tested with Python 3.6 and 3.7.

Installation

Requires: Python 3, SoX.

  • Go to your workspace directory.
  • Optional: Create a virtualenv.
git clone git@github.com:mdangschat/speech-corpus-dl.git

Set the default corpus directory (DATA_DIR) in config.py and create that directory. The default path is: ~/workspace/speech-corpus. The corpora used for train, dev, and test are currently selected within the generate.py file.

cd speech-corpus-dl
pip install -r requirements.txt

To start generating the corpus run:

ipython generate.py

Be aware that this downloads about 90 GiB of data and needs over 300 GiB of disk space for extraction (if the downloaded archives are not deleted).
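Given these space requirements, it can be worth checking the free disk space before starting. The following is a minimal sketch and not part of the project: the function names and the REQUIRED_GIB constant are hypothetical, and only the standard library's shutil.disk_usage is used.

```python
import shutil

# Assumption: the 300 GiB figure quoted above; lower it if archives are deleted.
REQUIRED_GIB = 300


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_enough_space(path, required_gib=REQUIRED_GIB):
    """Return True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib
```

Call has_enough_space with the directory configured as DATA_DIR. The directory must already exist, otherwise shutil.disk_usage raises an error.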

Default Configuration

By default, all audio is converted to 16 kHz mono WAV files, and the resulting examples are listed in CSV files (e.g. train.csv, dev.csv). Examples shorter than 0.7 or longer than 17.0 seconds are removed. TED-LIUM examples whose labels contain fewer than 5 words are also removed, because they subjectively show a higher transcription error rate.
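The filtering rules can be summarized in a few lines of Python. This is only a sketch of the rules as stated here, not the code from generate.py: the function name, the constant names, and the is_tedlium flag are hypothetical.

```python
# Thresholds taken from the description above.
MIN_LENGTH_S = 0.7
MAX_LENGTH_S = 17.0
MIN_TEDLIUM_WORDS = 5


def keep_example(label, length, is_tedlium=False):
    """Return True if an example passes the default filters.

    `label` is the transcription, `length` the audio length in seconds.
    """
    # Drop examples shorter than 0.7 s or longer than 17.0 s.
    if not MIN_LENGTH_S <= length <= MAX_LENGTH_S:
        return False
    # Drop TED-LIUM examples whose label has fewer than 5 words.
    if is_tedlium and len(label.split()) < MIN_TEDLIUM_WORDS:
        return False
    return True
```

An example that is exactly 0.7 or 17.0 seconds long is kept in this sketch, since only shorter or longer examples are described as removed.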

The generated corpus contains about 1275 hours of speech for training and takes up 144 GiB of disk space. Example of the generated output folder structure:

speech-corpus
├── cache
│   ├── dev-clean.tar.gz
│   ├── en.tar.gz
│   ├── tatoeba_audio_eng.zip
│   ├── TEDLIUM_release2.tar.gz
│   ├── test-clean.tar.gz
│   ├── train-clean-100.tar.gz
│   └── train-clean-360.tar.gz
├── commonvoicev2_train.csv
├── corpus
│   ├── cvv2
│   ├── LibriSpeech
│   ├── tatoeba_audio_eng
│   ├── TEDLIUM_release2
│   └── timit
├── corpus.json
├── dev.csv
├── librispeech_dev.csv
├── librispeech_test.csv
├── librispeech_train.csv
├── tatoeba_train.csv
├── tedlium_dev.csv
├── tedlium_test.csv
├── tedlium_train.csv
├── test.csv
└── train.csv

The generated CSV files have the following format:

path;label;length
relative/path/to/example;lower case transcription without punctuation;3.14
...

Here, path is the path of the WAV file relative to the DATA_DIR/corpus/ directory (String). By default, label is the lower case transcription without punctuation (String). Finally, length is the audio length in seconds (Float).
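Files in this format can be read with Python's standard csv module by setting the delimiter to a semicolon. The sketch below assumes the header line shown above. The function name read_corpus_csv is hypothetical, and the sample row is made up for illustration.

```python
import csv
import io


def read_corpus_csv(file_obj):
    """Parse a corpus CSV into a list of (path, label, length) tuples."""
    reader = csv.DictReader(file_obj, delimiter=";")
    return [(row["path"], row["label"], float(row["length"])) for row in reader]


# In-memory stand-in for a file such as train.csv (illustrative content only).
sample = io.StringIO(
    "path;label;length\n"
    "relative/path/to/example;hello world;3.14\n"
)
examples = read_corpus_csv(sample)
```

For a real file, pass an object opened with open(csv_path, newline="", encoding="utf-8"), as the csv module documentation recommends.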

Composition

  • train.csv:
    • Common Voice v2, all validated files
    • LibriSpeech train set
    • Tatoeba
    • TED-LIUM v2 train set
  • dev.csv:
    • LibriSpeech dev set
  • test.csv:
    • LibriSpeech test set
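Composing a combined file from per-corpus files amounts to concatenating their rows under a single header. The sketch below illustrates that idea under the CSV format described earlier. It is not the logic used in generate.py; merge_csvs and the sample rows are hypothetical.

```python
import csv
import io

FIELDS = ["path", "label", "length"]


def merge_csvs(sources, target):
    """Copy all rows from the `sources` file objects into `target`.

    The header is written once. Returns the number of rows copied.
    """
    writer = csv.writer(target, delimiter=";", lineterminator="\n")
    writer.writerow(FIELDS)
    count = 0
    for source in sources:
        for row in csv.DictReader(source, delimiter=";"):
            writer.writerow([row[field] for field in FIELDS])
            count += 1
    return count


# In-memory stand-ins for two per-corpus CSV files (illustrative rows only).
librispeech = io.StringIO("path;label;length\nLibriSpeech/a.wav;hello world;3.14\n")
tatoeba = io.StringIO("path;label;length\ntatoeba_audio_eng/b.wav;good morning;1.5\n")
merged = io.StringIO()
total = merge_csvs([librispeech, tatoeba], merged)
```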

Supported Corpora

Common Voice

LibriSpeech

Tatoeba

TED-LIUM

TODO: Document

TIMIT

  • Website: https://catalog.ldc.upenn.edu/LDC93S1
  • License: LDC User Agreement for Non-Members
  • Note: TIMIT is a special case. It is not a free corpus, so no download is available. If you have a license, the corpus can be included:
    • Enable the use_timit flag in generate.py.
    • Place the extracted TIMIT data in the corpus/timit/TIMIT/ directory of your destination folder.

Statistics

$ ipython tools/wav_lengths.py 
Reading audio samples: 100%|██████████████████████████████████████████████████████████████| 897948/897948 [35:38<00:00, 419.93samples/s]
Total sample length=4592557.368s (~1275h) of workspace/speech-corpus/train.csv.
Mean sample length=81832 (5.115)s.
Plot saved to: /tmp/_plot_wav_lengths.png

Example length distribution plot
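The length of a single WAV file in seconds is its frame count divided by its sample rate. Assuming the first number in the mean sample length above is a count of samples at 16 kHz, 81832 samples correspond to roughly 5.11 seconds. The sketch below shows the calculation with the standard wave module. It is not the code of tools/wav_lengths.py, and wav_length_seconds is a hypothetical name.

```python
import io
import wave


def wav_length_seconds(file):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Build one second of 16 kHz, mono, 16-bit silence in memory for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)
length = wav_length_seconds(buffer)
```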

$ ipython tools/word_counts.py 
Calculating statistics for workspace/speech-corpus/train.csv
Word based statistics:
	total_words = 11,716,026
	number_unique_words = 82,352
	mean_sentence_length = 13.05 words
	min_sentence_length = 1 words
	max_sentence_length = 84 words
	Most common words:  [('the', 649751), ('to', 347617), ('and', 296845), ('a', 279516), ('of', 271555), ('i', 220023), ('you', 181157), ('in', 174778), ('that', 162348), ('it', 139181)]
	27344 words occurred only 1 time; 37,415 words occurred only 2 times; 49,962 words occurred only 5 times; 57,979 words occurred only 10 times.

Character based statistics:
	total_characters = 60,402,553
	mean_label_length = 67.27 characters
	min_label_length = 2 characters
	max_label_length = 422 characters
	Most common characters: [(' ', 10818079), ('e', 6112206), ('t', 4864057), ('o', 4006837), ('a', 3917617), ('i', 3437314), ('n', 3319309), ('s', 3051207), ('h', 3000228), ('r', 2703305), ('d', 2059374), ('l', 1976372), ('u', 1444798), ('m', 1347476), ('w', 1231782), ('c', 1175514), ('y', 1145448), ('g', 1045719), ('f', 979325), ('p', 838584), ('b', 756279), ('v', 486572), ('k', 465861), ('j', 73971), ('x', 73567), ('q', 39701), ('z', 32051)]
	Most common characters: [' ', 'e', 't', 'o', 'a', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'y', 'g', 'f', 'p', 'b', 'v', 'k', 'j', 'x', 'q', 'z']
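Word-based statistics like these can be computed with collections.Counter. The following is a minimal sketch of the idea, not the implementation of tools/word_counts.py. The function name and the sample labels are hypothetical, and it assumes a non-empty list of whitespace-separated labels.

```python
from collections import Counter


def word_statistics(labels):
    """Compute simple word statistics for a non-empty list of labels."""
    words = Counter()
    sentence_lengths = []
    for label in labels:
        tokens = label.split()
        words.update(tokens)
        sentence_lengths.append(len(tokens))
    return {
        "total_words": sum(words.values()),
        "number_unique_words": len(words),
        "mean_sentence_length": sum(sentence_lengths) / len(sentence_lengths),
        "min_sentence_length": min(sentence_lengths),
        "max_sentence_length": max(sentence_lengths),
        "most_common": words.most_common(1),
    }


stats = word_statistics(["the cat sat", "the dog", "the end of the day"])
```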

speech-corpus-dl's Issues

Corpus Validation

Add a script to check if every file listed in a created CSV file actually exists. Maybe there are some other reasonable checks, as well.
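One possible shape for such a check is sketched below. It assumes the CSV format described in the README (semicolon-separated, with a path column relative to the corpus directory). The function name find_missing_files and the sample data are hypothetical.

```python
import csv
import io
from pathlib import Path


def find_missing_files(csv_file, corpus_dir):
    """Return the paths listed in `csv_file` that do not exist under `corpus_dir`."""
    corpus_dir = Path(corpus_dir)
    reader = csv.DictReader(csv_file, delimiter=";")
    return [
        row["path"]
        for row in reader
        if not (corpus_dir / row["path"]).is_file()
    ]


# Illustrative input: the listed file does not exist, so it is reported.
sample = io.StringIO("path;label;length\nno/such/file.wav;hello world;3.14\n")
missing = find_missing_files(sample, "nonexistent-corpus-dir")
```

An empty result means every listed file was found. Further checks, such as comparing the length column against the actual audio duration, could be added in the same loop.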

Update Documentation

Update the README.md with:

  • Updated docstrings
  • The installation process
  • Information about the CSV file format
  • The amount of downloaded data and the disk space required for extraction:
    • CV v2: 20.9 GiB
    • LibriSpeech: ~6.5 + 21.5 GiB
    • Tatoeba: 3.9 GiB
    • TEDLIUM v2: 34.3 GiB
  • Updated tree output
