NPTEL2020 - Indian English Speech Dataset

A speech-to-text dataset of Indian-accented English, scraped from NPTEL lectures in the education domain.

Crawl Information

| Metadata | Details |
| --- | --- |
| Crawl Period | April 2020 |
| Source | NPTEL-HRD YouTube (Playlist) |
| Crawl Type | Only videos with manually uploaded subtitles |
| Content License | Creative Commons |
| Language | English (most videos are in South-Asian accent) |
| Domain | General & Technical Education |
| Crawling and Processing Code | Fast-KTSpeechCrawler |
| Total Videos Crawled | 19,500 |
| Average Video Duration | 40 minutes |
| Dataset Format | LibriSpeech (audio in wav, transcript in txt, metadata in json) |
| No. of Chunks Created | 6,253,389 (6.2M) |
| Average Chunk Length | 3-10 seconds |
| Total Duration | 15,700 hours |
| Total Compressed Dataset Size | 1.1 TB (1.7 TB upon decompression) |
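
As a rough illustration of how chunks in this kind of layout could be read, here is a minimal sketch. It assumes that each wav file has a transcript with the same file stem and a `.txt` extension next to it; the actual directory layout of the released archive is not described here, so treat that pairing rule and the `iter_chunks` name as hypothetical.

```python
import wave
from pathlib import Path


def iter_chunks(root):
    """Yield (wav_path, transcript, duration_seconds) for each paired chunk.

    Assumption: a chunk's transcript is stored beside the audio file,
    with the same stem and a .txt extension.
    """
    for wav_path in sorted(Path(root).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # skip audio that has no transcript next to it
        transcript = txt_path.read_text(encoding="utf-8").strip()
        with wave.open(str(wav_path), "rb") as wav_file:
            # duration = number of frames / frames per second
            duration = wav_file.getnframes() / wav_file.getframerate()
        yield wav_path, transcript, duration
```

Summing the yielded durations over a folder gives a quick check against the chunk lengths listed in the table.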

Dataset Quality

We did not manually annotate the dataset. We assume NPTEL generated the subtitles with Google ASR and then made a reasonable amount of manual corrections on top of them.

We split the dataset by random sampling as follows:

| Split | Number of chunks |
| --- | --- |
| Train set | 5M |
| Validation set | 625k |
| Test set | 625k |
| Sample Set | 1k |
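
A random split with these proportions can be sketched as below. This is only an illustration: the procedure and seed actually used for the released splits are not documented here, and `split_chunks` is a hypothetical helper. The example is scaled down by a factor of 1,000.

```python
import random


def split_chunks(chunk_ids, n_val, n_test, seed=0):
    """Shuffle chunk IDs once and cut them into train/validation/test lists."""
    ids = list(chunk_ids)
    if n_val + n_test > len(ids):
        raise ValueError("validation and test sizes exceed the number of chunks")
    # A fixed seed keeps the split reproducible (the seed value is an assumption).
    random.Random(seed).shuffle(ids)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


# Scaled-down example: 6,250 IDs instead of about 6.25M chunks.
train, val, test = split_chunks(range(6_250), n_val=625, n_test=625)
print(len(train), len(val), len(test))  # 5000 625 625
```

Shuffling once and slicing guarantees that the three sets are disjoint and together cover every chunk.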

Note:

  • The Sample Set is a small subset that we annotated manually to measure the quality of the data. We refer to it as the Pure Set.
  • We computed the following results on the Pure Set (in May 2020) for benchmarking purposes.
    • All the DL models listed below use a language model (LM).
    • For Jasper and QuartzNet, we use models from Nvidia's NeMo framework.
    • For DeepSpeech, we use Baidu's official model.
    • The benchmarking code for others can be found here.

| Model/Service | Word Error Rate |
| --- | --- |
| Actual transcripts | 0.1451 |
| Google Speech-to-Text | 0.4895 |
| AWS Transcribe | 0.3438 |
| Rev.ai (Temi Speech) API | 0.3218 |
| ESP-Net | 0.4321 |
| DeepSpeech v.0.7 | 0.5872 |
| Nvidia Jasper | 0.3939 |
| QuartzNet (Ultra-Tiny) | 0.3981 |
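
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below shows that calculation for a single utterance. It is not the benchmarking code used for the table: the text normalization applied there (casing, punctuation) is not described here, and `word_error_rate` is a hypothetical helper that splits on whitespace only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")  # 0.3333
```

Corpus-level WER is usually computed by summing edits and reference words over all utterances before dividing, rather than by averaging per-utterance values.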

To check whether the crawled data is useful, we sampled 500k chunks from the train set and fine-tuned an English ASR model on them for one epoch. We chose the pre-trained QuartzNet (Ultra-Tiny) model for fine-tuning because it is lightweight and, as seen above, competitive in accuracy. (The training code can be found in this same repo.)

On the Pure Set, WER improved from 0.5034 (pre-trained model without LM) to 0.3207 (fine-tuned model without LM), which suggests that the crawled dataset offers promising scope for much further improvement.
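
To put the two reported figures in perspective, the absolute and relative WER reductions can be worked out directly. The snippet uses only the two numbers quoted above; the variable names are illustrative.

```python
wer_pretrained = 0.5034  # pre-trained QuartzNet, no LM (reported above)
wer_finetuned = 0.3207   # after one epoch of fine-tuning, no LM (reported above)

absolute = wer_pretrained - wer_finetuned
relative = absolute / wer_pretrained

print(f"absolute: {absolute:.4f}, relative: {relative:.1%}")
# absolute: 0.1827, relative: 36.3%
```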

The preliminary fine-tuned model can be found in the GitHub Releases section of the repo. (Due to a lack of compute, we have not yet been able to use the data exhaustively or compare results across different models and methods.)

Suggestions and Future Works

  • Even though the dataset is noisy compared to other publicly available datasets, we believe it would serve as good initial data for building models.
  • In particular, this dataset focuses on South Asian English accents and on the education domain.
  • Even the raw audio from this dataset would be useful for pre-training ASR models like Wav2Vec 2.0.
    (As can be seen on this recent leaderboard)
  • For a better but closed dataset, check this recent competition: IIT-M Speech Lab - Indian English ASR Challenge
    (could be available on request)

Downloads

Crawl your own playlist

In some cases, we might need data from a single speaker (for TTS, speaker recognition, etc.). For that, pick a YouTube playlist of your choice and crawl it.

Please check here for the instructions to do that on the crawler we used.

Contact us

For clarifications, please open an issue in the GitHub Issues section.

Contributors

gokulnc, prem-kumar27