
text2speech's Introduction

Towards Building Text-To-Speech Systems for the Next Billion Users

🎉 Accepted at ICASSP 2023

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

TL;DR: We open-source SOTA Text-To-Speech models for 13 Indian languages: Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil and Telugu.


Authors: Gokul Karthik Kumar*, Praveen S V*, Pratyush Kumar, Mitesh M. Khapra, Karthik Nandakumar

[ArXiv Preprint] [Audio Samples] [Try It Live] [Video]

Unified architecture of our TTS system

Results

Setup:

Environment Setup:

# 1. Create environment
sudo apt-get install libsndfile1-dev
conda create -n tts-env
conda activate tts-env

# 2. Setup PyTorch
pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

# 3. Setup Trainer
git clone https://github.com/gokulkarthik/Trainer 

cd Trainer
pip3 install -e .[all]
cd ..
# [or] patch an existing Trainer installation instead of installing the fork:
#   - copy Trainer/trainer/logging/wandb_logger.py into the local Trainer installation (fixes the wandb logger)
#   - copy Trainer/trainer/trainer.py into the local Trainer installation (fixes model.module.test_log and adds code to log the epoch)
#   - add `gpus = [str(gpu) for gpu in gpus]` at line 53 of trainer/distribute.py

# 4. Setup TTS
git clone https://github.com/gokulkarthik/TTS 

cd TTS
pip3 install -e .[all]
cd ..
# [or] patch an existing TTS installation instead of installing the fork:
#   - copy TTS/TTS/bin/synthesize.py into the local TTS installation (adds multiple output support to TTS.bin.synthesize)

# 5. Install other requirements
pip3 install -r requirements.txt

Data Setup:

  1. Format IndicTTS dataset in LJSpeech format using preprocessing/FormatDatasets.ipynb
  2. Analyze IndicTTS dataset to check TTS suitability using preprocessing/AnalyzeDataset.ipynb
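
As a rough illustration of the target layout in step 1: the LJSpeech convention is a `wavs/` folder of audio clips plus a pipe-delimited `metadata.csv` whose rows are `<clip id>|<transcript>|<normalized transcript>`. The sketch below writes such a file. The function name and the choice to reuse the raw transcript as the normalized one are assumptions for illustration; the notebook in this repo may do this differently.

```python
from pathlib import Path


def write_ljspeech_metadata(entries, dataset_dir):
    """Write metadata.csv in the LJSpeech layout: <id>|<text>|<normalized text>.

    `entries` is an iterable of (clip_id, transcript) pairs. Audio for each
    clip is expected at <dataset_dir>/wavs/<clip_id>.wav (not created here).
    """
    dataset_dir = Path(dataset_dir)
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)

    lines = []
    for clip_id, text in entries:
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        if "|" in clip_id or "|" in text:
            raise ValueError(f"'|' is the field delimiter and cannot appear in {clip_id!r}")
        # Assumption: no separate normalization step, so the transcript is repeated.
        lines.append(f"{clip_id}|{text}|{text}")

    metadata_path = dataset_dir / "metadata.csv"
    metadata_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return metadata_path
```

Writing with an explicit UTF-8 encoding matters here because the transcripts are in Indic scripts.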

Training Steps:

  1. Set the configuration in main.py, vocoder.py, configs and run.sh. Make sure to update CUDA_VISIBLE_DEVICES in all of these files.
  2. Train and test by executing sh run.sh
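
For the CUDA_VISIBLE_DEVICES setting mentioned in step 1, a minimal sketch of how a Python entry point can restrict the visible GPUs is shown below. The helper name is hypothetical and is not taken from main.py or vocoder.py; the string conversion mirrors the `gpus = [str(gpu) for gpu in gpus]` patch noted in the environment setup.

```python
import os


def set_visible_gpus(gpu_ids):
    """Expose only the given GPU ids to CUDA.

    Must be called before the first CUDA call in the process (typically before
    importing or using torch.cuda), otherwise the setting has no effect.
    """
    gpus = [str(gpu) for gpu in gpu_ids]  # environment values must be strings
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(gpus)
    return gpus
```

For example, `set_visible_gpus([0, 1])` sets the variable to `0,1`, which has the same effect as prefixing the launch command with `CUDA_VISIBLE_DEVICES=0,1`.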

Inference:

Trained model weights and config files can be downloaded at this link.

python3 -m TTS.bin.synthesize --text <TEXT> \
    --model_path <LANG>/fastpitch/best_model.pth \
    --config_path <LANG>/fastpitch/config.json \
    --vocoder_path <LANG>/hifigan/best_model.pth \
    --vocoder_config_path <LANG>/hifigan/config.json \
    --out_path <OUT_PATH>
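
When synthesizing many sentences from a script, it can help to assemble the same command as an argument list rather than a shell string, so that text containing quotes or spaces is passed through unchanged. The sketch below only builds the list, using the flags shown above. The function name is an assumption, and all paths are supplied by the caller.

```python
def build_synthesize_command(text, model_path, config_path,
                             vocoder_path, vocoder_config_path, out_path):
    """Return the TTS.bin.synthesize invocation as an argv-style list.

    Passing a list (not a single string) to subprocess.run avoids shell
    quoting problems with Indic text, quotes, or spaces in paths.
    """
    return [
        "python3", "-m", "TTS.bin.synthesize",
        "--text", text,
        "--model_path", str(model_path),
        "--config_path", str(config_path),
        "--vocoder_path", str(vocoder_path),
        "--vocoder_config_path", str(vocoder_config_path),
        "--out_path", str(out_path),
    ]
```

The result can be run with `subprocess.run(cmd, check=True)`, which raises an exception if the synthesis command exits with an error.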

Code Reference: https://github.com/coqui-ai/TTS

text2speech's People

Contributors

gokulkarthik, tahirjmakhdoomi


text2speech's Issues

hifigan discriminator error

While training the HiFi-GAN vocoder, I encounter an error:

Traceback (most recent call last):
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1500, in fit
    self._fit()
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1484, in _fit
    self.train_epoch(epoch)
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1261, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 1124, in train_step
    step_optimizer=step_optimizer,
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 979, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
  File "/home/haider/Documents/TTS_files/ai4bharat/Trainer/trainer/trainer.py", line 937, in _model_train_step
    return model.train_step(*input_args)
  File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/gan.py", line 136, in train_step
    D_out_fake = self.model_d(y_hat.detach())
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 218, in forward
    scores_, feats_ = self.msd(x)
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 194, in forward
    score, feat = d(x)
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/haider/Documents/TTS_files/ai4bharat/TTS/TTS/vocoder/models/hifigan_discriminator.py", line 155, in forward
    x = l(x)
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 107, in __call__
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
  File "/home/haider/anaconda3/envs/tts-env/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight
    v = normalize(torch.mv(weight_mat.t(), u), dim=0, eps=self.eps, out=v)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

This is not an OOM error; I have confirmed that. It seems as though the input size does not match the Conv1d layer.

Punjabi

Hi! Thanks for the great job! I am wondering if Punjabi is also open-sourced, since it is not mentioned in your language list, but there are weights available for that language.

Thanks!

Unable to find options for varying pitch in inferencing

Hi @gokulkarthik, I was able to train the FastPitch model using this repo.

However, at the inference stage for the FastPitch model, I am unable to find arguments for varying the pitch. Can you please point me in the right direction on how to achieve pitch variation at inference time with the FastPitch model in this repo?

multi speaker training

I wanted to ask: will training on multiple male and multiple female voices have any effect on the final results, such as whether voices are generated properly?

Unable to launch Wandb dashboard

Hi @gokulkarthik,

I am fairly new to Wandb. I was trying to run this repo, but I am getting the following error:

self.dashboard_logger, self.c_logger = self.init_loggers( 

File "/hdd4/Nayan/envs/Nayan_ai_or_bharat_tts/lib/python3.8/site-packages/trainer/trainer.py", line 575, in init_loggers
dashboard_logger = logger_factory(config, output_path)
File "/hdd4/Nayan/envs/Nayan_ai_or_bharat_tts/lib/python3.8/site-packages/trainer/logging/__init__.py", line 35, in logger_factory
dashboard_logger = WandbLogger( # pylint: disable=abstract-class-instantiated
TypeError: Can't instantiate abstract class WandbLogger with abstract methods add_audio, add_figure, add_scalar

Can you please guide me on how to resolve this issue?
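
For context on what this TypeError means: Python refuses to instantiate a class that inherits abstract methods without overriding them. The sketch below reproduces that behavior with hypothetical stand-in classes (they are not the real Trainer classes). It is consistent with, but does not confirm, the setup note above that the forked Trainer ships a fixed wandb_logger.py.

```python
from abc import ABC, abstractmethod


class BaseDashboardLogger(ABC):
    """Hypothetical stand-in for a logger base class with abstract methods."""

    @abstractmethod
    def add_scalar(self, title, value, step): ...

    @abstractmethod
    def add_figure(self, title, figure, step): ...

    @abstractmethod
    def add_audio(self, title, audio, step, sample_rate): ...


class IncompleteLogger(BaseDashboardLogger):
    """Overrides nothing, so it is still abstract and cannot be instantiated."""


class CompleteLogger(BaseDashboardLogger):
    """Overrides every abstract method, so instantiation succeeds."""

    def __init__(self):
        self.records = []

    def add_scalar(self, title, value, step):
        self.records.append(("scalar", title, step))

    def add_figure(self, title, figure, step):
        self.records.append(("figure", title, step))

    def add_audio(self, title, audio, step, sample_rate):
        self.records.append(("audio", title, step))


def can_instantiate(cls):
    """Return True if cls() succeeds, False if Python raises TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

If the installed logger class is in the `IncompleteLogger` situation, the usual remedy is to install a version whose logger overrides all three methods.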

Assamese model inferencing

Hi,
I am trying to run inference with the Assamese model from the shared link, and I am hitting a problem with the config file. I am using the command given in the repo:

python3 -m TTS.bin.synthesize --text "কেনেকুৱা আছ? আজি আহিব নালাগে" --model_path as/fastpitch//best_model.pth --config_path as/fastpitch/config.json --vocoder_path as/hifigan/best_model.pth --vocoder_config_path as/hifigan/config.json --out_path ./

Getting the below issue:
symbols, phonemes = make_symbols(**self.tts_config.characters)
TypeError: make_symbols() got an unexpected keyword argument 'characters_class'

It seems that an extra keyword has ended up in the config. I tried removing it, but phoneme generation then fails with another error.
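
An error of this shape arises when a config dictionary is unpacked into a function that does not declare one of its keys, which often points to a config written for a different library version than the one installed (an assumption here, not something confirmed in this thread). The sketch below is a diagnostic only: it lists which keys a function would reject. `make_symbols_stub` is a hypothetical stand-in with an invented signature, not the real `make_symbols`.

```python
import inspect


def make_symbols_stub(characters, phonemes=None, punctuations="", pad="<PAD>"):
    """Hypothetical stand-in for a function that receives **config['characters']."""
    return list(characters), list(phonemes or [])


def split_supported_kwargs(func, kwargs):
    """Split kwargs into (accepted by func, names func would reject)."""
    params = inspect.signature(func).parameters
    # A function declaring **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), []
    supported = {k: v for k, v in kwargs.items() if k in params}
    unsupported = sorted(k for k in kwargs if k not in params)
    return supported, unsupported
```

Dropping the rejected keys silences the TypeError but may hide a real incompatibility, so installing the TTS version the checkpoint was trained with (step 4 of the setup) is likely the safer route.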

Unable to run inference; TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

@gokulkarthik I am trying to run inference with the following command:

python3 -m TTS.bin.synthesize --text "राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है" \
    --model_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/fastpitch/best_model.pth \
    --config_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/fastpitch/config.json \
    --vocoder_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/hifigan/best_model.pth \
    --vocoder_config_path /home/raj/Downloads/ai4bharat/text2speech/models/v1/hi/hifigan/config.json \
    --out_path /home/raj/Downloads/ai4bharat/output

I am getting this error

> Using model: fast_pitch

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Init speaker_embedding layer.
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...

Text: राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं
उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है
 > Text splitted to sentences.
['राजस्थान और उत्तर प्रदेश से लेकर हरियाणा मध्य प्रदेश एवं उत्तराखंड में सेना में भर्ती से जुड़ी अग्निपथ स्कीमका विरोध जारी है']
Traceback (most recent call last):
  File "/home/raj/anaconda3/envs/tts-env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/raj/anaconda3/envs/tts-env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 418, in <module>
    main()
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 396, in main
    wav = synthesizer.tts(
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 323, in tts
    outputs = synthesis(
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/utils/synthesis.py", line 213, in synthesis
    outputs = run_model_torch(
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "/home/raj/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/models/forward_tts.py", line 684, in inference
    o_en, x_mask, g, _ = self._forward_encoder(x, x_mask, g)
  File "/home/raj/.local/lib/python3.10/site-packages/TTS/tts/models/forward_tts.py", line 399, in _forward_encoder
    g = self.emb_g(g)  # [B, C, 1]
  File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/raj/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType
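
The log prints "Init speaker_embedding layer" and the failure occurs at `self.emb_g(g)`, which suggests (an assumption, not confirmed in this thread) that the checkpoint is multi-speaker and no speaker was selected, so `g` is None. Coqui TTS's synthesize script accepts a `--speaker_idx` argument for choosing a speaker; the valid names depend on the checkpoint. The sketch below shows the kind of guard that turns this failure into a readable message. The function name and the `female`/`male` mapping in the usage note are hypothetical.

```python
def resolve_speaker_id(speaker_name, speaker_to_id):
    """Map a speaker name to its embedding index, failing early with a clear message.

    A multi-speaker model looks up a speaker embedding by index, so passing
    None further down ends in an opaque error inside the embedding layer.
    """
    available = ", ".join(sorted(speaker_to_id))
    if speaker_name is None:
        raise ValueError(
            f"This model is multi-speaker; choose a speaker. Available: {available}"
        )
    if speaker_name not in speaker_to_id:
        raise ValueError(
            f"Unknown speaker {speaker_name!r}. Available: {available}"
        )
    return speaker_to_id[speaker_name]
```

With a mapping such as `{"female": 0, "male": 1}`, calling `resolve_speaker_id(None, mapping)` raises a ValueError that lists both names instead of failing inside `torch.embedding`.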
