transfertts's Introduction

TransferTTS (Zero-shot VITS) - PyTorch Implementation (-Ongoing-)

Note!!(09.23.)

Currently, this is only an implementation of the zero-shot system, not of the paper's first contribution: the transfer learning framework using wav2vec 2.0. As future work, a model with complete implementations of both contributions (zero-shot and transfer learning) will be implemented in the following repository. Congratulations on being awarded the best paper at INTERSPEECH 2022.

Overview

Unofficial PyTorch implementation of Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. Most of the code is based on VITS.

MelStyleEncoder from StyleSpeech is used instead of the reference encoder.
Training on untranscribed data is not implemented.
The LibriTTS dataset (train-clean-100 and train-clean-360) is used. The sampling rate is set to 22050 Hz.
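
As a rough illustration of the data setup above, the sketch below groups the audio files of the two training subsets by speaker. It assumes the standard LibriTTS layout `<root>/<subset>/<speaker>/<chapter>/*.wav`. The function name `collect_wavs_by_speaker` is hypothetical and is not part of this repository.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Sequence

# Subsets named in the overview above.
SUBSETS = ("train-clean-100", "train-clean-360")


def collect_wavs_by_speaker(
    root: Path, subsets: Sequence[str] = SUBSETS
) -> Dict[str, List[Path]]:
    """Group .wav files by speaker ID.

    Assumes the layout <root>/<subset>/<speaker>/<chapter>/<file>.wav.
    """
    by_speaker: Dict[str, List[Path]] = defaultdict(list)
    for subset in subsets:
        subset_dir = root / subset
        if not subset_dir.is_dir():
            continue  # skip subsets that are not present on disk
        for wav_path in sorted(subset_dir.glob("*/*/*.wav")):
            speaker_id = wav_path.parts[-3]  # <speaker>/<chapter>/<file>
            by_speaker[speaker_id].append(wav_path)
    return dict(by_speaker)
```

Calling `collect_wavs_by_speaker(Path("/data/LibriTTS"))` (the path is a placeholder) returns a dictionary from speaker ID to file list, which gives a quick count of how many speakers and utterances are available before training.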

Pre-requisites (from VITS)

Python >= 3.6
Clone this repository
Install the Python requirements. Please refer to requirements.txt
1. You may need to install espeak first: apt-get install espeak
Build Monotonic Alignment Search and run preprocessing if you use your own datasets.

# Cython version of Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
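
For readers unfamiliar with Monotonic Alignment Search, the following is a plain-Python sketch of the dynamic program as described for Glow-TTS and VITS. It is not the code in `monotonic_align` (which is a Cython build operating on batched arrays). The function name and the `log_p[frame][token]` list layout are assumptions made for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def monotonic_alignment_search(log_p: List[List[float]]) -> List[int]:
    """Return, for each frame, the index of the text token it is aligned to.

    log_p[j][i] is the log-likelihood of frame j under text token i.
    The path starts at token 0, ends at the last token, and advances
    by 0 or 1 token per frame.
    """
    n_frames = len(log_p)
    n_tokens = len(log_p[0])
    if n_frames < n_tokens:
        raise ValueError("need at least as many frames as tokens")

    # q[j][i]: best cumulative log-likelihood of a path ending at (j, i).
    q = [[NEG_INF] * n_tokens for _ in range(n_frames)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[j - 1][i]
            advance = q[j - 1][i - 1] if i > 0 else NEG_INF
            best = max(stay, advance)
            if best > NEG_INF:
                q[j][i] = best + log_p[j][i]

    # Backtrack from the last frame and the last token.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        # Move to the previous token if forced (i == j) or if it scores at least as well.
        if j > 0 and i > 0 and (i == j or q[j - 1][i - 1] >= q[j - 1][i]):
            i -= 1
    return alignment
```

Counting how often each token index appears in the returned list gives per-token durations. The nested Python loops are slow for real utterances, which is presumably why the repository builds the Cython version.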

Preprocessing

Run

python prepare_wav.py --data_path [LibriTTS DATAPATH]

to prepare the audio data before training.
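
One way to sanity-check the prepared data is to confirm that the audio files match the 22050 Hz sampling rate stated in the overview. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. What `prepare_wav.py` writes and where it writes it are not described here, so the directory passed in and the helper names are assumptions.

```python
import wave
from pathlib import Path
from typing import List, Tuple

TARGET_SAMPLE_RATE = 22050  # value stated in the overview


def read_sample_rate(wav_path: Path) -> int:
    """Return the sample rate stored in a PCM WAV header."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate()


def find_mismatched_files(
    wav_dir: Path, expected_rate: int = TARGET_SAMPLE_RATE
) -> List[Tuple[Path, int]]:
    """List (path, rate) pairs for WAV files whose rate differs from expected_rate."""
    mismatched = []
    for wav_path in sorted(wav_dir.rglob("*.wav")):
        rate = read_sample_rate(wav_path)
        if rate != expected_rate:
            mismatched.append((wav_path, rate))
    return mismatched
```

An empty result from `find_mismatched_files` suggests every file under the directory is already at the expected rate. LibriTTS is distributed at 24 kHz, so unprocessed files would show up in the list.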

Training

Train your model with

python train_ms.py -c configs/libritts.json -m libritts_base
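
Before starting a long run, it can help to check that the config agrees with the data. The sketch below reads the sampling rate from a JSON config, assuming the VITS-style layout in which it is stored under `data.sampling_rate`. Whether `configs/libritts.json` follows that layout exactly is an assumption, and `load_sampling_rate` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_sampling_rate(config_path: Path) -> int:
    """Read data.sampling_rate from a VITS-style JSON config (assumed layout)."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    try:
        return int(config["data"]["sampling_rate"])
    except KeyError as exc:
        # The key path is an assumption borrowed from VITS configs.
        raise KeyError("expected a 'data.sampling_rate' entry in the config") from exc
```

For example, `load_sampling_rate(Path("configs/libritts.json"))` would be expected to match the rate of the prepared audio. A mismatch between the two is a common cause of distorted output.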

Inference

python inference.py --ref_audio [REF AUDIO PATH] --text [INPUT TEXT]
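
When inference is driven from another script, input text containing spaces or punctuation needs careful quoting. The sketch below builds the command as an argument list, using only the two flags shown above. The helper names are hypothetical and the script is assumed to be run from the repository root.

```python
import shlex
import sys
from typing import List


def build_inference_command(ref_audio: str, text: str) -> List[str]:
    """Build the argument list for inference.py using the flags shown above."""
    return [sys.executable, "inference.py", "--ref_audio", ref_audio, "--text", text]


def to_shell_string(command: List[str]) -> str:
    """Render an argument list as a safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in command)
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting entirely, while `to_shell_string` is useful for logging the exact command that was run.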


transfertts's Issues

Speech synthesis results

Hello @hcy71o ,

I liked your work on TransferTTS and SC VITS. I have trained a model for 350,000 steps using only the LibriTTS train-clean-100 subset, but when I synthesize speech using a random audio file, the output is not clear.

So, my questions are:

How many steps did you train your model for?
What should the length (duration) of the audio files passed to inference.py be?
Should the reference audio come from a speaker in the training data, or can it be from an unseen speaker?
Do you have a demo page where we can compare TransferTTS-generated audio with VITS?

Thanks
