gokulkarthik / text2speech

Towards Building Text-To-Speech Systems for the Next Billion Users - Microsoft Research Intern Work - Accepted at ICASSP 2023

Home Page: https://arxiv.org/abs/2211.09536

Jupyter Notebook 53.79% Python 28.08% Shell 18.13%

assamese bengali bodo gujarati hindi indian-languages kannada malayalam manipuri odia

text2speech's Introduction

Towards Building Text-To-Speech Systems for the Next Billion Users

🎉 Accepted at ICASSP 2023

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

TL;DR: We open-source SOTA Text-To-Speech models for 13 Indian languages: Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil and Telugu.

Authors: Gokul Karthik Kumar*, Praveen S V*, Pratyush Kumar, Mitesh M. Khapra, Karthik Nandakumar

[ArXiv Preprint] [Audio Samples] [Try It Live] [Video]

Unified architecture of our TTS system

Results

Setup:

Environment Setup:

# 1. Create environment
sudo apt-get install libsndfile1-dev
conda create -n tts-env
conda activate tts-env

# 2. Setup PyTorch
pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

# 3. Setup Trainer
git clone https://github.com/gokulkarthik/Trainer
cd Trainer
pip3 install -e .[all]
cd ..
# [or] patch an existing Trainer installation by hand:
#   copy Trainer/trainer/logging/wandb_logger.py to the local Trainer installation  # fixed wandb logger
#   copy Trainer/trainer/trainer.py to the local Trainer installation  # fixed model.module.test_log and added code to log epoch
#   add `gpus = [str(gpu) for gpu in gpus]` at line 53 of trainer/distribute.py

# 4. Setup TTS
git clone https://github.com/gokulkarthik/TTS
cd TTS
pip3 install -e .[all]
cd ..
# [or] patch an existing TTS installation by hand:
#   copy TTS/TTS/bin/synthesize.py to the local TTS installation  # added multiple output support for TTS.bin.synthesize

# 5. Install other requirements
pip3 install -r requirements.txt

Data Setup:

Format IndicTTS dataset in LJSpeech format using preprocessing/FormatDatasets.ipynb
Analyze IndicTTS dataset to check TTS suitability using preprocessing/AnalyzeDataset.ipynb
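
As a rough illustration of the target layout (not the notebook's actual code): LJSpeech-style datasets keep the audio clips under `wavs/` and list the transcripts in a pipe-delimited `metadata.csv`, one `id|text|normalized text` row per clip. The sketch below writes such a file. The `write_ljspeech_metadata` helper, the sample clip IDs, and the choice to reuse the raw text as the normalized text are assumptions made for illustration.

```python
from pathlib import Path


def write_ljspeech_metadata(utterances, out_dir):
    """Write an LJSpeech-style metadata.csv (id|text|normalized text).

    `utterances` is an iterable of (clip_id, transcript) pairs. The matching
    audio is expected at <out_dir>/wavs/<clip_id>.wav and is not created here.
    """
    out_dir = Path(out_dir)
    (out_dir / "wavs").mkdir(parents=True, exist_ok=True)

    lines = []
    for clip_id, text in utterances:
        # Collapse newlines and repeated whitespace so each clip stays on one row.
        text = " ".join(text.split())
        if "|" in clip_id or "|" in text:
            raise ValueError(f"'|' is the field delimiter and cannot appear in {clip_id!r}")
        # Assumption: no separate normalization step, so the text is repeated.
        lines.append(f"{clip_id}|{text}|{text}")

    metadata_path = out_dir / "metadata.csv"
    metadata_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return metadata_path


if __name__ == "__main__":
    # Hypothetical clip IDs and transcripts.
    path = write_ljspeech_metadata(
        [("hi_0001", "नमस्ते दुनिया"), ("hi_0002", "आप कैसे हैं")],
        "formatted_dataset",
    )
    print(path.read_text(encoding="utf-8"))
```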

Training Steps:

Set the configuration with main.py, vocoder.py, configs and run.sh. Make sure to update the CUDA_VISIBLE_DEVICES in all these files.
Train and test by executing sh run.sh

Inference:

Trained model weight and config files can be downloaded at this link.

python3 -m TTS.bin.synthesize --text <TEXT> \
    --model_path <LANG>/fastpitch/best_model.pth \
    --config_path <LANG>/config.json \
    --vocoder_path <LANG>/hifigan/best_model.pth \
    --vocoder_config_path <LANG>/hifigan/config.json \
    --out_path <OUT_PATH>
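
When the command is launched from a script, building it as an argument list avoids shell-quoting problems with sentences that contain spaces or Indic characters. The sketch below is one way to do that, using only the flags from the command above. The `build_synthesize_command` helper and the example paths are hypothetical, so point them at wherever the downloaded files actually sit.

```python
import shlex


def build_synthesize_command(text, model_path, config_path,
                             vocoder_path, vocoder_config_path, out_path):
    """Build the argv list for the `TTS.bin.synthesize` call shown above."""
    return [
        "python3", "-m", "TTS.bin.synthesize",
        "--text", text,
        "--model_path", str(model_path),
        "--config_path", str(config_path),
        "--vocoder_path", str(vocoder_path),
        "--vocoder_config_path", str(vocoder_config_path),
        "--out_path", str(out_path),
    ]


if __name__ == "__main__":
    # Hypothetical paths for a downloaded Hindi model.
    cmd = build_synthesize_command(
        text="नमस्ते, आप कैसे हैं?",
        model_path="hi/fastpitch/best_model.pth",
        config_path="hi/fastpitch/config.json",
        vocoder_path="hi/hifigan/best_model.pth",
        vocoder_config_path="hi/hifigan/config.json",
        out_path="output",
    )
    # Print a correctly quoted, copy-pasteable shell command.
    print(shlex.join(cmd))
    # To run it instead: subprocess.run(cmd, check=True)
```

Because the text is passed as a single list element, no manual escaping of quotes or spaces is needed.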

Code Reference: https://github.com/coqui-ai/TTS

text2speech's Issues

hifigan discriminator error

While training the HiFi-GAN vocoder, I encounter the following error:

Traceback (most recent call last):
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1500, in fit
self._fit()
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1484, in _fit
self.train_epoch(epoch)
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1261, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1124, in train_step
step_optimizer=step_optimizer,
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 979, in _optimize
outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 937, in _model_train_step
return model.train_step(*input_args)
File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/gan.py", line 136, in train_step
D_out_fake = self.model_d(y_hat.detach())
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 218, in forward
scores_, feats_ = self.msd(x)
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 194, in forward
score, feat = d(x)
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 155, in forward
x = l(x)
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
result = hook(self, input)
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 107, in __call__
setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight
v = normalize(torch.mv(weight_mat.t(), u), dim=0, eps=self.eps, out=v)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

This is not an OOM error; I have confirmed that. It seems as though the input size does not match the conv1d layer.

Punjabi

Hi! Thanks for the great job! I am wondering if Punjabi is also open-sourced, since it is not mentioned in your language list, but there are weights available for that language.

Thanks!

Link to download models

Hello,

Can you please provide a link to download the models?

Unable to find options for varying pitch in inferencing

Hi @gokulkarthik, I was able to train the FastPitch model using this repo.

However, at the inference stage for the FastPitch model, I am unable to find arguments for varying the pitch. Can you please point me in the right direction on how to achieve pitch variation at inference time using this repo?

multi speaker training

I wanted to ask: will training on multiple male and multiple female voices have any effect on the final results, such as whether the voices are generated properly?

Unable to launch Wandb dashboard

Hi @gokulkarthik,

I am fairly new to Wandb. I was trying to run this repo, but I am getting the following error:

self.dashboard_logger, self.c_logger = self.init_loggers(

File "/hdd4/Nayan/envs/Nayan_ai_or_bharat_tts/lib/python3.8/site-packages/trainer/trainer.py", line 575, in init_loggers
dashboard_logger = logger_factory(config, output_path)
File "/hdd4/Nayan/envs/Nayan_ai_or_bharat_tts/lib/python3.8/site-packages/trainer/logging/__init__.py", line 35, in logger_factory
dashboard_logger = WandbLogger( # pylint: disable=abstract-class-instantiated
TypeError: Can't instantiate abstract class WandbLogger with abstract methods add_audio, add_figure, add_scalar

Can you please guide me on how to resolve this issue?
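
For context, this `TypeError` is standard Python behaviour: a class that inherits abstract methods cannot be instantiated until it overrides every one of them. The sketch below reproduces the same kind of message with stand-in classes (`BaseLogger`, `IncompleteLogger` and `CompleteLogger` are hypothetical, not the Trainer classes). Read that way, the message suggests the installed `WandbLogger` does not define `add_audio`, `add_figure` and `add_scalar`; the setup steps in the README mention a fixed `wandb_logger.py` in the forked Trainer.

```python
from abc import ABC, abstractmethod


class BaseLogger(ABC):
    """Stand-in for a dashboard logger interface (hypothetical)."""

    @abstractmethod
    def add_scalar(self, title, value, step): ...

    @abstractmethod
    def add_figure(self, title, figure, step): ...

    @abstractmethod
    def add_audio(self, title, audio, step, sample_rate): ...


class IncompleteLogger(BaseLogger):
    """Overrides nothing, so it is still abstract."""


class CompleteLogger(BaseLogger):
    """Overrides every abstract method, so it can be instantiated."""

    def __init__(self):
        self.scalars = []

    def add_scalar(self, title, value, step):
        self.scalars.append((title, value, step))

    def add_figure(self, title, figure, step):
        pass  # a real logger would upload the figure

    def add_audio(self, title, audio, step, sample_rate):
        pass  # a real logger would upload the audio


try:
    IncompleteLogger()
except TypeError as err:
    # The message names the abstract methods that are still missing.
    print(err)

logger = CompleteLogger()
logger.add_scalar("loss", 0.5, step=1)
```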

Does the model take reference audio for TTS like Coqui TTS (zero-shot)?

I saw that you have mentioned CoquiTTS as the code reference in the Readme.md. Does your model take a reference audio wav file as input along with the text and produce speech in that voice?

Assamese model inferencing

Hi,
I am trying to run inference with the Assamese model from the shared link, but I am getting an error related to the config file. I am using the command below, as mentioned in the repo.

python3 -m TTS.bin.synthesize --text "কেনেকুৱা আছ? আজি আহিব নালাগে" --model_path as/fastpitch//best_model.pth --config_path as/fastpitch/config.json --vocoder_path as/hifigan/best_model.pth --vocoder_config_path as/hifigan/config.json --out_path ./

Getting the below issue:
symbols, phonemes = make_symbols(**self.tts_config.characters)
TypeError: make_symbols() got an unexpected keyword argument 'characters_class'

It seems that the config contains some extra keywords. I tried removing them, but then phoneme generation also fails with an error.

Unable to run inference: TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

@gokulkarthik I am trying to run inference with the following command:
python3 -m TTS.bin.synthesize --text "राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है" \
--model_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/fastpitch/best_model.pth \
--config_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/fastpitch/config.json \
--vocoder_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/hifigan/best_model.pth \
--vocoder_config_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/hifigan/config.json \
--out_path /home/raj/Downloads/ai4bharat/output

I am getting this error

> Using model: fast_pitch

Text: राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं
उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है
> Text splitted to sentences.
['राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है']
Traceback (most recent call last):
File "/home/raj/anaconda3/envs/tts-env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/raj/anaconda3/envs/tts-env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/raj/.local/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 418, in <module>
main()
File "/home/raj/.local/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 396, in main
wav = synthesizer.tts(
File "/home/raj/.local/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 323, in tts
outputs = synthesis(
File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/utils/synthesis.py", line 213, in synthesis
outputs = run_model_torch(
File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
outputs = _func(
File "/home/raj/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/models/forward_tts.py", line 684, in inference
o_en, x_mask, g, _ = self._forward_encoder(x, x_mask, g)
File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/models/forward_tts.py", line 399, in _forward_encoder
g = self.emb_g(g) # [B, C, 1]
File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType