
espnet-asr

espnet-asr is an End-to-end Automatic Speech Recognition (ASR) system using ESPnet.

End-to-end ASR systems are reported to outperform conventional approaches. However, it is not simple to train robust end-to-end ASR models and make recognition efficient.

In this project, we provide easy-to-use inference code, pre-trained models, and training recipes to address these problems.

The pre-trained models are tuned to achieve competitive performance for each dataset at the time of release, and an intuitive inference code is provided for easy evaluation.

1. Installation

To run the end-to-end ASR examples, you must install PyTorch and ESPnet. We recommend using a virtual environment created by conda.

conda create -n ESPnet python=3

conda activate ESPnet


Install PyTorch according to your CUDA version (or the CPU-only build); see the PyTorch installation page for details.

CPU

(ESPnet) conda install pytorch torchvision cpuonly -c pytorch

CUDA 10.2

(ESPnet) conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

CUDA 11.3

(ESPnet) conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch


Then, install ESPnet.

(ESPnet) pip install espnet

2. Downloading pre-trained models

You can download pre-trained models for the Zeroth-Korean, ClovaCall, KsponSpeech, and LibriSpeech datasets. You can check the performance of the pre-trained models here.

(ESPnet) tools/download_mdl.sh

3. Decoding

Inference is simple. For example, to recognize the utterances listed in evalset/zeroth_korean/data/wav.scp with mdl/zeroth_korean.zip, the model pre-trained on the Zeroth-Korean dataset, using the decoding options in conf/decode_asr.yaml, run the following command.

python3 bin/asr_inference.py \
--mdl mdl/zeroth_korean.zip \
--wav_scp evalset/zeroth_korean/data/wav.scp \
--config conf/decode_asr.yaml \
--output_dir output/zeroth_korean \
--ngpu 1
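The wav.scp file follows the Kaldi convention: one utterance per line, an utterance ID followed by the path to its audio file. A minimal sketch for reading such a file (the helper name and paths below are illustrative, not part of this repository):

```python
def read_wav_scp(path):
    """Parse a Kaldi-style wav.scp into {utterance_id: wav_path}."""
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            utt_id, wav_path = line.split(maxsplit=1)
            entries[utt_id] = wav_path
    return entries
```

Each utterance ID reappears in the output transcript file, so IDs must be unique within one wav.scp.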

If this fails with errors such as "ModuleNotFoundError: No module named 'espnet'", your python3 may not point to the conda environment's interpreter; use python instead, as follows:

python bin/asr_inference.py \
--mdl mdl/zeroth_korean.zip \
--wav_scp evalset/zeroth_korean/data/wav.scp \
--config conf/decode_asr.yaml \
--output_dir output/zeroth_korean \
--ngpu 1

You can check the recognition result:

(ESPnet) cat output/zeroth_korean/1best_recog/text 
104_003_0019 지난해 삼 월 김 전 장관의 동료 인 장동 련 홍익대 교수가 민간 자문단 장으로 위촉 되면서 본격적인 공모 와 개발 작업에 들어갔다
104_003_0193 그 바람 에 나 의 몸 도 겹쳐 쓰러지 며 한창 피어난 노란 동백꽃 속으로 폭 파묻혀 버렸다
104_003_0253 현재 백화점과 영화관 등 은 오픈 해 영업 하고 있고 테마파크 및 아파트 등 의 공사는 이천 십 팔 년 완공 을 목표로 진행돼 왔다
...
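The repository does not ship a scorer, but the hypothesis file above (one "utterance_id transcription" per line) can be compared against reference transcripts with a standard word error rate (WER) computation. A self-contained sketch via word-level edit distance (not part of this repository):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For Korean, character error rate (CER) over the same alignment is often reported instead, since word boundaries depend on the tokenization.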

4. Fast Decoding

Recognition latency can be reduced by changing decoding options, but it can hurt recognition performance.

python3 bin/asr_inference.py \
--mdl mdl/zeroth_korean.zip \
--wav_scp evalset/zeroth_korean/data/wav.scp \
--config conf/fast_decode_asr.yaml \
--output_dir output/fast_zeroth_korean \
--ngpu 1
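Fast-decoding configurations typically shrink the search space. A hedged sketch of the kinds of ESPnet2 inference options involved (the exact keys and values in this repository's conf/fast_decode_asr.yaml may differ):

```yaml
# Illustrative decoding options; values are examples only.
beam_size: 5      # smaller beam -> lower latency, possibly more errors
ctc_weight: 0.3   # weight of CTC scores in joint CTC/attention decoding
lm_weight: 0.0    # disabling language-model rescoring also speeds up decoding
```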

5. Other pre-trained models

5.1 KsponSpeech

You can evaluate KsponSpeech samples by running the following commands.

python3 bin/asr_inference.py \
--mdl mdl/ksponspeech.zip \
--wav_scp evalset/ksponspeech/data/wav.scp \
--config conf/decode_asr.yaml \
--output_dir output/ksponspeech \
--ngpu 1 

You can use conf/fast_decode_asr_ksponspeech.yaml for fast decoding.

python3 bin/asr_inference.py \
--mdl mdl/ksponspeech.zip \
--wav_scp evalset/ksponspeech/data/wav.scp \
--config conf/fast_decode_asr_ksponspeech.yaml \
--output_dir output/fast_ksponspeech \
--ngpu 1 

5.2 ClovaCall

Redistribution of the ClovaCall dataset is prohibited. You can download the dataset from the official page.

5.3 Librispeech

You can evaluate Librispeech samples by running the following commands.

python3 bin/asr_inference.py \
--mdl mdl/librispeech.zip \
--wav_scp evalset/librispeech/data/wav.scp \
--config conf/decode_asr.yaml \
--output_dir output/librispeech \
--ngpu 1

Or, with the fast decoding configuration:

python3 bin/asr_inference.py \
--mdl mdl/librispeech.zip \
--wav_scp evalset/librispeech/data/wav.scp \
--config conf/fast_decode_asr.yaml \
--output_dir output/fast_librispeech \
--ngpu 1

5.4 ESPnet Model Zoo

You can find more pre-trained models and further details in the ESPnet Model Zoo.

6. Limitations

  • Voice activity detection (VAD) is not supported: speech utterances must be segmented in advance for correct evaluation.
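Because VAD is not built in, long recordings have to be pre-segmented externally. A naive energy-threshold segmenter over 16-bit PCM samples illustrates the idea (the function and its parameters are hypothetical; a real pipeline would use a dedicated VAD such as WebRTC VAD):

```python
def find_speech_regions(samples, frame_len=400, threshold=500.0):
    """Return (start, end) sample indices of runs of frames whose mean
    absolute amplitude exceeds a fixed threshold."""
    regions = []
    in_speech = False
    start = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy >= threshold and not in_speech:
            in_speech, start = True, i      # speech onset
        elif energy < threshold and in_speech:
            in_speech = False               # speech offset
            regions.append((start, i))
    if in_speech:
        regions.append((start, len(samples)))
    return regions
```

Each returned region would then be written out as its own WAV file and listed in wav.scp.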

7. Inference testing on YouTube data

To perform inference testing on YouTube data, you need to install youtube-dl, ffmpeg, and sox as follows.

(ESPnet) conda install -c conda-forge youtube-dl

(ESPnet) conda install -c conda-forge ffmpeg

(ESPnet) yum install sox

You can use "tools/recog_youtube.sh" for inference testing. The script extracts the audio stream from a given YouTube URL, splits the audio into files of 5 seconds each, and then runs the inference program on each segment. For example, to recognize the audio stream of "https://www.youtube.com/watch?v=foLYddwKDcs&ab_channel=KBSNews", you can run the following command.

(ESPnet) tools/recog_youtube.sh --url foLYddwKDcs --download-dir download/foLYddwKDcs
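The 5-second splitting step can be sketched with Python's standard wave module (an illustrative re-implementation, not necessarily how the script does it; the script installs sox, which can do the same with its trim/newfile options):

```python
import wave

def split_wav(in_path, out_prefix, seconds=5):
    """Split a WAV file into consecutive chunks of `seconds` each;
    returns the list of files written (the last chunk may be shorter)."""
    written = []
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        chunk_frames = params.framerate * seconds
        idx = 0
        while True:
            frames = src.readframes(chunk_frames)
            if not frames:
                break
            out_path = f"{out_prefix}{idx:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)   # same channels, width, rate as input
                dst.writeframes(frames)
            written.append(out_path)
            idx += 1
    return written
```

Fixed-length splitting can cut through words at chunk boundaries, which is one source of the recognition errors visible in the output below.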

The command downloads the audio of YouTube video "foLYddwKDcs" to the "download/foLYddwKDcs" directory and runs "bin/asr_inference.py" with the "mdl/ksponspeech.zip" model. You can check the recognition result:

(ESPnet) cat output/foLYddwKDcs/1best_recog/text 
foLYddwKDcs001 아 정부는 이번 주말이 중대 분기점이 들 거라면서 주말 상황을 지켜본
foLYddwKDcs002 그게 거리 두 개 3단계 격상력으로 결정하겠다고 했습니다 또 방역에 힘을 모아주
foLYddwKDcs003 어 될 수 있으면 집에 머물러 줄 것을 겉음료증 있습니다 안다 연계잡니다
foLYddwKDcs004 정부네는 사회적 벌이 둘이 3단계 격상을
foLYddwKDcs005 배우에 보루라고 보고 있습니다 1월 판단을 중대 분기점이 성탄절 
...

To run only the inference step, run the script with "--stage" set to 2.

tools/recog_youtube.sh --stage 2 --url foLYddwKDcs --download-dir download/foLYddwKDcs --mdl mdl/ksponspeech.zip --config conf/fast_decode_asr_ksponspeech.yaml --output output/foLYddwKDcs

Contact

Feel free to send questions to [email protected], or file requests in the issues.
