fenrlr / mb-istft-vits2 Goto Github PK

View Code? Open in Web Editor NEW

106.0 5.0 26.0 4.13 MB

Application of MB-iSTFT-VITS components to vits2_pytorch

License: MIT License

Python 95.59% Jupyter Notebook 3.99% Cython 0.42%

text-to-speech tts speech-synthesis pytorch vits

mb-istft-vits2's Introduction

MB-iSTFT-VITS2

A... vits2_pytorch and MB-iSTFT-VITS hybrid... Gods, an abomination! Who created this atrocity?

This is an experimental build. Does not guarantee performance, therefore.

According to shigabeev's experiment, it can now dare claim the word SOTA for its performance (at least for Russian).

pre-requisites

Python >= 3.8
CUDA
Pytorch version 1.13.1 (+cu117)
Clone this repository
Install python requirements.
```
pip install -r requirements.txt
```
~~1. You may need to install espeak first: apt-get install espeak~~

If you want to proceed with those cleaned texts in filelists, you need to install espeak.
```
apt-get install espeak
```
Prepare datasets & configuration

~~1. ex) Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1~~
1. wav files (22050Hz Mono, PCM-16)
2. Prepare text files. One for training^(ex) and one for validation^(ex). Split your dataset to each files. As shown in these examples, the datasets in validation file should be fewer than the training one, while being unique from those of training text.
  - Single speaker^(ex)
```
wavfile_path|transcript
```
  - Multi speaker^(ex)
```
wavfile_path|speaker_id|transcript
```
3. Run preprocessing with a cleaner of your interest. You may change the symbols as well.
  - Single speaker
```
python preprocess.py --text_index 1 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
python preprocess.py --text_index 1 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
```
  - Multi speaker
```
python preprocess.py --text_index 2 --filelists PATH_TO_train.txt --text_cleaners CLEANER_NAME
python preprocess.py --text_index 2 --filelists PATH_TO_val.txt --text_cleaners CLEANER_NAME
```
  The resulting cleaned text would be like this(single). ^{ex - multi}
Build Monotonic Alignment Search.

# Cython-version Monotonoic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

Edit configurations based on files and cleaners you used.

Setting json file in configs

Model	How to set up json file in configs	Sample of json file configuration
iSTFT-VITS2	`"istft_vits": true,` `"upsample_rates": [8,8],`	istft_vits2_base.json
MB-iSTFT-VITS2	`"subbands": 4,` `"mb_istft_vits": true,` `"upsample_rates": [4,4],`	mb_istft_vits2_base.json
MS-iSTFT-VITS2	`"subbands": 4,` `"ms_istft_vits": true,` `"upsample_rates": [4,4],`	ms_istft_vits2_base.json
Mini-iSTFT-VITS2	`"istft_vits": true,` `"upsample_rates": [8,8],` `"hidden_channels": 96,` `"n_layers": 3,`	mini_istft_vits2_base.json
Mini-MB-iSTFT-VITS2	`"subbands": 4,` `"mb_istft_vits": true,` `"upsample_rates": [4,4],` `"hidden_channels": 96,` `"n_layers": 3,` `"upsample_initial_channel": 256,`	mini_mb_istft_vits2_base.json

Training Example

# train_ms.py for multi speaker
python train.py -c configs/mb_istft_vits2_base.json -m models/test

Credits

mb-istft-vits2's People

Contributors

Stargazers

Watchers

mb-istft-vits2's Issues

My experience and pronunciation problems

Hi,
First, I wanted to congratulate you for this amazing work and wanted to share my experience.

I decided to give it a go and tested a dataset on a RTX 3090 for 1500 epochs (~217k steps). It took almost 4 days.

I can say that the audio quality is perfect, I can't notice any difference between the source audios and the inferenced audios. This is something I just cound't get with VITS1, I can always notice differences in audio quality between source clips and inferenced clips. Again, this is in terms of audio quality.

However, when it comes to pronunciation, there are always some words that are mispronunced. I trained the same datasets with a few differente configs.

This is the config I'm using:

{
  "use_mel_posterior_encoder": true,
  "use_transformer_flows": true,
  "transformer_flow_type": "pre_conv2",
  "use_spk_conditioned_encoder": false,
  "use_noise_scaled_mas": true,
  "use_duration_discriminator": ***,
  "duration_discriminator_type": "dur_disc_2",
  "ms_istft_vits": true,
  "mb_istft_vits": false,
  "istft_vits": false,
  "use_sdp": ***
}

*** I trained with: ***
use_sdp = true and use_duration_discriminator = true;
The inferenced clips seems like the person is drunk, some times it pronounces words wrong and does not produce good outputs at all.

use_sdp = false and use_duration_discriminator = true;
Does not seem to have pronunciation problems, but the output is very robotic. It does not sound natural, but might be good for some use cases.

use_sdp = true and use_duration_discriminator = false;
This is the one I trained for 1500 epochs, the output was much more natural than the other 2, but it always had pronunciation problems. It seems like the more I train, the less this problem appear, but I decided to stop training after 1500 epochs as I cound't see any improvements after that. Regardless, this is the one I would use if I had to pick one.

Does anyone have any tips on how to improce the pronunciation issues? I have the same model trained with vits1, I trained with sdp = true for around 1500 epochs as well, and it outputs really good results, in termos of sounding natural, but I can never get the audio quality to match the original clips. I have hopes that I'm close to get perfect pronunciation with perfect audio quality (which is already being done) but I don't know what else I could try.

To make it clear, it's not always the same words that are mispronunced, every time I run inference with the same input, the output varies in terms of what is mispronounced, some times it outputs a perfect audio, but not aways.

Does anyone have any tips on how to improve it?

Unfortunatelly I can't share audio samples as I don't have authorization to do so.

Thank you!

Multispeaker training issue

I have multispeaker data. I am using train_ms.py script. I have few questions.

a) Do I need to update the "n_speakers" to match the number of distinct speakers in my dataset?
b) I am currently running training with "n_speakers" = 0 (default) and the training is around 60K. The generated output seems bad specially in terms of duration of generated audio vs ground truth duration. The generated output duration is 7 seconds whereas ground truth duration is 11 seconds.
c) During the training I have updated the config "max_text_len" to 320 from default 190, to allow more segments to pass the threshold as my dataset has some longer utterances/transcript. Can this be an cause for the issue for quality issue that is being noticed?

Boolean value subbands in istft_vits2_base.json?

MB-iSTFT-VITS2/configs/istft_vits2_base.json

Line 51 in 17593f6

"subbands": false,

MB-iSTFT-VITS2/pqmf.py

Line 56 in 17593f6

subbands (int): The number of subbands.

Was wondering if this is accurate? I usually see people set it to values like 4 or 8

rank error

Hi，there

I have a question about the training of the model.

I encountered the following error in my training.

-- Process 1 terminated with the following error: Traceback (most recent call last): File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap fn(i, *args) File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 241, in run train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 359, in train_and_evaluate scaler.scale(loss_gen_all).backward() File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward torch.autograd.backward( File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[139681], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).

I observed that they disappeared after 16 epochs of training.
Then i try training again,
When the training reached 40 epoch, it stopped again.
Why is this?

Best regards.

Training more Chinese speaking person error: ValueError: too many values to unpack (expected 3)

**Traceback (most recent call last):
File "/content/MB-iSTFT-VITS2/train_ms.py", line 492, in
main()
File "/content/MB-iSTFT-VITS2/train_ms.py", line 57, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

I set print(self.audiopaths_sid_text) to try to analyze the problem, and the result is as follows:
[['./Auwa/Auwa_AllAge_00055.wav', 'Auwa_All', 'ㄏㄠˇㄉㄞˇ ㄧㄝˇ ㄕˋ ㄅㄢˋ ㄌㄜ˙ ㄋㄧㄢˊ ㄎㄚˇㄐㄧˋ ㄎㄚˇ ㄉㄜ˙ ㄖㄣˊ，ㄧㄠˋㄕˋ ㄅㄨˋ ㄌㄞˊ，ㄋㄚˋ ㄎㄜˇ ㄌㄤˋㄈㄟˋ ㄌㄜ˙。'], ['./Auwa_18R/Auwa_AdultOnly_00012.wav', 'Auwa_18R', 'ㄏㄚˉ ㄚ˙ㄏㄚˉ ㄚ˙，ㄜˋ ㄅㄨˊㄧㄠˋ~ ㄅㄨˋㄒㄧㄥˊ ㄏㄚˉ ㄣˊ~ ㄋㄧˇ ㄅㄚˇㄕㄡˇ ㄕㄡˇㄓˇ ㄔㄡˉㄔㄨˉㄌㄞˊ。'],……]]
And then the same content keeps getting output for about a few dozen lines
How should I change data_utils.py to eliminate errors? Or how to find the fault? (I train on Google Colab)

Comparative Analysis and Training Results of VITS2 with HifiGAN, iSTFT and BigVGAN

Greetings,

First and foremost, I'd like to extend my commendations on developing such an outstanding model; its performance surpasses anything I have personally trained thus far. It's a noteworthy contribution to the field, and I applaud your work.

I've conducted a series of training experiments to validate the efficiency and efficacy of your model. For ease of reference, I've made the training results, model weights, and TensorBoard logs publicly accessible. You can review them via the following Google Drive link:
Training Results and Model Weights

Moreover, I've prepared audio samples that compare the performance of your model with that of VITS2, HifiGAN, and BigVGAN. This will offer a comprehensive perspective on how your model stacks up against other state-of-the-art solutions in the domain.
Comparative Audio Samples

Best wishes

“ZeroDivisionError: integer division or modulo by zero” in Google Colab

The Chinese-Japanese bilingual MB-iSTFT-VITS was trained using a custom dataset, which was formatted in Python3.10, Torch1.12.1, and Torchvision0.13.1 environments, but the following error occurred:
/content/MB-iSTFT-VITS2/dataset_raw/auwa2-MB-iSTFT-VITS2
INFO:auwa2-MiVITS-W1:{'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'fft_sizes': [384, 683, 171], 'hop_sizes': [30, 60, 10], 'win_lengths': [150, 300, 60], 'window': 'hann_window'}, 'data': {'use_mel_posterior_encoder': True, 'training_files': '/content/MB-iSTFT-VITS2/dataset_raw/auwa2-MB-iSTFT-VITS2/character_scripts.txt.cleaned', 'validation_files': '/content/MB-iSTFT-VITS2/dataset_raw/auwa2-MB-iSTFT-VITS2/character_scripts_val.txt.cleaned', 'text_cleaners': ['zh_ja_mixture_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'use_mel_posterior_encoder': True, 'use_transformer_flows': True, 'transformer_flow_type': 'pre_conv', 'use_spk_conditioned_encoder': False, 'use_noise_scaled_mas': True, 'use_duration_discriminator': True, 'ms_istft_vits': False, 'mb_istft_vits': True, 'istft_vits': False, 'subbands': 4, 'gen_istft_n_fft': 16, 'gen_istft_hop_size': 4, 'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [4, 4], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16], 'n_layers_q': 3, 'use_spectral_norm': False, 'use_sdp': False}, 'model_dir': './logs/models/auwa2-MiVITS-W1'}
2023-09-28 13:38:10.745201: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-09-28 13:38:12.103548: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:jaxlib.mlir._mlir_libs:Initializing MLIR with module: _site_initialize_0
DEBUG:jaxlib.mlir._mlir_libs:Registering dialects from initializer <module 'jaxlib.mlir._mlir_libs._site_initialize_0' from '/usr/local/lib/python3.10/dist-packages/jaxlib/mlir/_mlir_libs/_site_initialize_0.so'>
DEBUG:jax._src.xla_bridge:No jax_plugins namespace packages available
DEBUG:jax._src.path:etils.epath found. Using etils.epath for file I/O.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Using mel posterior encoder for VITS2
/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Using transformer flows pre_conv for VITS2
Using normal encoder for VITS1 (cuz it's single speaker after all)
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
Mutli-band iSTFT VITS2
Loading training data: 0% 0/7 [00:00<?, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.del at 0x7addb89dc3a0>
Traceback (most recent call last):
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in del
self._shutdown_workers()
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1441, in _shutdown_workers
if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
File "/content/MB-iSTFT-VITS2/train.py", line 461, in
main()
File "/content/MB-iSTFT-VITS2/train.py", line 55, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/MB-iSTFT-VITS2/train.py", line 207, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc],
File "/content/MB-iSTFT-VITS2/train.py", line 240, in train_and_evaluate
for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(loader):
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1182, in iter
for obj in iterable:
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 444, in iter
return self._get_iterator()
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1038, in init
super(_MultiProcessingDataLoaderIter, self).init(loader)
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 651, in init
self._sampler_iter = iter(self._index_sampler)
File "/content/MB-iSTFT-VITS2/data_utils.py", line 400, in iter
ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero
I hope you can give me the idea of troubleshooting the problem, thank you very much!

ValueError: too many values to unpack (expected 2)

Hi, I was using cjke_cleaners2 to clean the texts. However, I got the error ValueError: too many values to unpack (expected 2) as shown below

But I don't think I got more than 2 values to unpack. Here is the structure of my cleaned texts

So I wonder which step could go wrong. Thank you!

The model for Chinese speech doesn't work after training for 1700 epochs

Hi, I trained the model on a Chinese dataset (~13min high-quality Chinese speech), and after 1700 epochs I still could not infer from the model (there was no sound at all). I used chinese_cleaners and followed your instruction. So I wonder which step might go wrong. Should I continue to train the model for more epochs? Thank you!

Does batch size make a difference to end quality?

Apologies if this is too general a question - but would using a smaller batch size impact the end-quality of the model? I'm ~5 days into a size 16 batch run on a 3080ti. (using MB-Mini) I've noticed the quality hasn't improved very much past the first 100k iterations. Is this just an issue with the batch size?

I'll post results after another few days.

AttributeError: 'ResidualCouplingTransformersLayer2' object has no attribute 'remove_weight_norm'

Hi there, I have this error when I want to convert model weight to onnx model.
How can I solve the problem?

Using mel posterior encoder for VITS2
Multi-band iSTFT VITS2
INFO:root:dec.stft.forward_basis is not in the checkpoint
INFO:root:dec.stft.inverse_basis is not in the checkpoint
INFO:root:Loaded checkpoint 'logs/models/G_48000.pth' (iteration 242)
Removing weight norm...
Traceback (most recent call last):
File "onnx_export.py", line 79, in
net_g.flow.remove_weight_norm() # Remove weightnorm in flows - a8d9f74
File "/MB-iSTFT-VITS2/models.py", line 759, in remove_weight_norm
l.remove_weight_norm()
File "/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1269, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'ResidualCouplingTransformersLayer2' object has no attribute 'remove_weight_norm'

New feature: Deleting the previous .pth files when training

Hi, I found that there was no funtion in train.py that can delete the existing .pth files, which made the training process require a lot of space. It would be very helpful if you could add a function to deleting the previous .pth files when training. Thank you!

ONNX format

You could use this PR: p0p4k/vits2_pytorch#43

KSS dataset for 490 epochs but the quality is not as good as I expected

First of all, thank you for sharing such a wonderful code.
I trained using the KSS dataset for 490 epochs, but the quality is not as good as I expected.
It seems that the TTS speaks a bit fast.
wav
What might have gone wrong during the training?

{
"train": {
"log_interval": 200,
"eval_interval": 3000,
"seed": 1234,
"epochs": 20000,
"learning_rate": 2e-4,
"betas": [0.8, 0.99],
"eps": 1e-9,
"batch_size": 32,
"fp16_run": false,
"lr_decay": 0.999875,
"segment_size": 8192,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"fft_sizes": [384, 683, 171],
"hop_sizes": [30, 60, 10],
"win_lengths": [150, 300, 60],
"window": "hann_window"
},
"data": {
"use_mel_posterior_encoder": true,
"training_files":"kss/kss_cjke_train.txt.cleaned",
"validation_files":"kss/kss_cjke_val.txt.cleaned",
"text_cleaners":["cjke_cleaners2"],
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 0,
"cleaned_text": true
},
"model": {
"use_mel_posterior_encoder": true,
"use_transformer_flows": true,
"transformer_flow_type": "pre_conv",
"use_spk_conditioned_encoder": false,
"use_noise_scaled_mas": true,
"use_duration_discriminator": true,
"ms_istft_vits": false,
"mb_istft_vits": true,
"istft_vits": false,
"subbands": 4,
"gen_istft_n_fft": 16,
"gen_istft_hop_size": 4,
"inter_channels": 192,
"hidden_channels": 96,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 3,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [4,4],
"upsample_initial_channel": 256,
"upsample_kernel_sizes": [16,16],
"n_layers_q": 3,
"use_spectral_norm": false,
"use_sdp": false
}
}

Would it be possible to train and initialize training with previous checkpoint?

I have trained a model with n speakers and now a new speaker is being added. Given remaining speaker data remains same and a new speaker (n+1) is being added. Is there a way to use previous checkpoint (n speakers), so that convergence is faster?

How to go ahead by adding emotion as a trainable parameter?

I would need emotion to be an additional parameter during training. How can I add this in the model?