GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI

Demo Video and Features

Check out our demo video in Chinese: Bilibili Demo


Features:

Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
Cross-lingual Support: Run inference in languages different from the training dataset; English, Japanese, and Chinese are currently supported.
WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, which help beginners create training datasets and GPT/SoVITS models.

Todo List

High Priority:
- Localization in Japanese and English.
- User guide.
Features:
- Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
- TTS speaking speed control.
- Enhanced TTS emotion control.
- Experiment with changing SoVITS token inputs to a probability distribution over the vocabulary.
- Improve English and Japanese text frontend.
- Develop tiny and larger-sized TTS models.
- Colab scripts.
- Expand training dataset (2k -> 10k).

Requirements (How to Install)

Python and PyTorch Version

Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.
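
To compare a local setup against the tested configuration, a small check like the one below may help. It is only a sketch: the function name `describe_environment` is made up for illustration and is not part of GPT-SoVITS. PyTorch is imported only if it is already installed.

```python
import importlib.util
import sys

# From the line above: "Tested with Python 3.9".
TESTED_PYTHON = (3, 9)


def describe_environment():
    """Return a small report comparing this interpreter with the tested setup."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_matches_tested": sys.version_info[:2] == TESTED_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also works without PyTorch

        report["torch"] = torch.__version__
        report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A mismatch does not necessarily mean the project will fail; it only means the setup differs from the one that was tested.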

Pip Packages

pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
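
Several packages in the command above are pinned to exact versions. The following sketch, which uses only the standard library, reports whether the installed versions match those pins. The helper name `check_pins` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the pip command above.
PINNED = {
    "librosa": "0.9.2",
    "numba": "0.56.4",
    "gradio": "3.14.0",
    "tqdm": "4.59.0",
}


def check_pins(pins=PINNED):
    """Map each package name to (expected_version, installed_version_or_None)."""
    result = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        result[name] = (expected, installed)
    return result


if __name__ == "__main__":
    for name, (expected, installed) in check_pins().items():
        status = "ok" if installed == expected else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```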

Additional Requirements

If you need Chinese ASR (supported by FunASR), install:

pip install modelscope torchaudio sentencepiece funasr

FFmpeg

Ubuntu/Debian Users

sudo apt install ffmpeg

macOS Users

brew install ffmpeg

Windows Users

Download ffmpeg.exe and ffprobe.exe and place them in the GPT-SoVITS root directory.
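
Whichever platform you are on, you can confirm that both tools are discoverable before launching the WebUI. This is a minimal sketch: `find_ffmpeg_tools` is an illustrative helper, not a GPT-SoVITS function, and the optional `extra_dir` argument is assumed to be the repository root for the Windows case.

```python
import os
import shutil


def find_ffmpeg_tools(extra_dir=None):
    """Return {"ffmpeg": path_or_None, "ffprobe": path_or_None}.

    extra_dir, if given, is searched before the directories on PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir is not None:
        search_path = os.pathsep.join([str(extra_dir), search_path])
    return {
        tool: shutil.which(tool, path=search_path)
        for tool in ("ffmpeg", "ffprobe")
    }


if __name__ == "__main__":
    # "." assumes the script is run from the GPT-SoVITS root directory.
    for tool, location in find_ffmpeg_tools(extra_dir=".").items():
        print(f"{tool}: {location or 'not found'}")
```

On Windows, `shutil.which` takes the `PATHEXT` variable into account, so searching for `ffmpeg` should also match `ffmpeg.exe`.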

Pretrained Models

Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models.

For Chinese ASR, download models from Damo ASR Models and place them in tools/damo_asr/models.

For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
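
After downloading, a quick check that each directory exists and is not empty can catch misplaced files. The sketch below uses the three paths listed above; the helper name `find_empty_model_dirs` is an assumption for illustration. It does not verify which files are inside, only that something is there.

```python
from pathlib import Path

# Directories named in this section, relative to the repository root.
MODEL_DIRS = {
    "GPT-SoVITS pretrained models": "GPT_SoVITS/pretrained_models",
    "Chinese ASR models (optional)": "tools/damo_asr/models",
    "UVR5 weights (optional)": "tools/uvr5/uvr5_weights",
}


def find_empty_model_dirs(repo_root):
    """Return (label, relative_path) pairs for missing or empty directories."""
    root = Path(repo_root)
    problems = []
    for label, rel in MODEL_DIRS.items():
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            problems.append((label, rel))
    return problems


if __name__ == "__main__":
    # "." assumes the script is run from the GPT-SoVITS root directory.
    for label, rel in find_empty_model_dirs("."):
        print(f"Missing or empty: {rel} ({label})")
```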

Dataset Format

The TTS annotation .list file format:

vocal_path|speaker_name|language|text

Example:

D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.

Language dictionary:

'zh': Chinese
'ja': Japanese
'en': English
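
The format above can be validated with a short parser before training. This is a sketch under stated assumptions rather than the project's own loader: the names `Annotation`, `parse_list_line`, and `parse_list_file` are hypothetical, UTF-8 encoding is assumed, and the line is split on at most three `|` characters so that a `|` inside the text field is preserved.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Annotation(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line):
    """Parse one 'vocal_path|speaker_name|language|text' line."""
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    entry = Annotation(*(part.strip() for part in parts))
    if entry.language not in LANGUAGES:
        raise ValueError(f"unknown language code {entry.language!r} in: {line!r}")
    return entry


def parse_list_file(path, encoding="utf-8"):
    """Parse every non-blank line of a .list file."""
    with open(path, encoding=encoding) as handle:
        return [parse_list_line(line) for line in handle if line.strip()]


if __name__ == "__main__":
    example = r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin."
    print(parse_list_line(example))
```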

Credits

Special thanks to the following projects and contributors:
