
emo-stargan's Introduction

Emo-StarGAN

This repository contains the source code for the paper Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion, accepted at Interspeech 2023. An overview of the method and the results can be found here.

Concept of our method. For details we refer to our paper at .....

Highlights:

Samples

Samples can be found here.

Demo

The demo can be found at Demo/EmoStarGAN Demo.ipynb.

Pre-requisites:

  1. Python >= 3.9
  2. Install the python dependencies mentioned in the requirements.txt
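As a quick sanity check before installing the dependencies, the interpreter version can be verified programmatically. This is a minimal sketch (the helper name is illustrative, not part of the repository); the version floor matches the prerequisite above:

```python
import sys

def python_ok(min_version=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version

if not python_ok():
    raise SystemExit("Emo-StarGAN requires Python >= 3.9")
```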

Training:

Before Training

  1. Before starting the training, specify the number of target speakers in num_speaker_domains, along with other details such as the training and validation data, in the config file.
  2. Download the VCTK and ESD datasets. For the VCTK dataset, preprocessing is needed, which can be carried out using Preprocess/getdata.py. The dataset paths need to be adjusted in the training (train_list.txt) and validation (val_list.txt) lists present in Data/.
  3. Download and copy the emotion embedding weights to the folder Utils/emotion_encoder.
  4. Download and copy the vocoder weights to the folder Utils/Vocoder.
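The exact format of the data lists is defined by the repository; as a hedged illustration, StarGANv2-VC-style lists use one `<wav_path>|<speaker_index>` entry per line, which could be parsed as follows (the separator and helper name are assumptions, not the repository's API):

```python
def parse_data_list(lines):
    """Parse 'path|speaker_index' entries, skipping blank lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, speaker = line.rsplit("|", 1)  # split on the last separator
        entries.append((path, int(speaker)))
    return entries
```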

Train

python train.py --config_path ./Configs/speaker_domain_config.yml

Model Weights

The Emo-StarGAN model weights can be downloaded from here.

Common Errors

When a speaker index in train_list.txt or val_list.txt is greater than or equal to the number of speakers (the hyperparameter num_speaker_domains in speaker_domain_config.yml), the following error is encountered:

[train]:   0%| | 0/66 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Also note that the speaker index starts with 0 (not with 1!) in the training and validation lists.
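A hypothetical pre-flight check (illustrative names only, not part of the repository) can catch this misconfiguration before the CUDA assertion fires: with 0-based indexing, every speaker index must lie in [0, num_speaker_domains - 1].

```python
def validate_speaker_indices(indices, num_speaker_domains):
    """Raise if any speaker index falls outside [0, num_speaker_domains - 1]."""
    bad = sorted({i for i in indices if not 0 <= i < num_speaker_domains})
    if bad:
        raise ValueError(
            f"Speaker indices {bad} are out of range "
            f"[0, {num_speaker_domains - 1}]; increase num_speaker_domains "
            "in speaker_domain_config.yml or fix the data lists."
        )
```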

References and Acknowledgements

emo-stargan's People

Contributors

arnabdas8901, suhitaghosh10


emo-stargan's Issues

Readme

Hello,
Thank you for your great work!

Is it possible to add some guidelines for running the experiments?

Best regards,

config.yml

Dear suhitaghosh10, thank you very much for sharing your research and publishing your results!
I'm trying to launch the Demo/EmoStarGAN Demo.ipynb notebook, and it seems the config.yml file is missing. Could you please provide the config file?
Thank you very much once again

Speaker and Emotional domain training

There are two config files: a speaker-domain config for training on the VCTK dataset and an emotional-domain config for training on ESD, yet the training instructions mention only speaker-domain training.

Would you explain, please?

Task Conversion

Can this model be used for voice conversion tasks that preserve accents?

Is splitting into 5-second segments necessary?

I understand you took it from StarGAN-VC2, but how does a variable segment length (3-10 seconds) influence training compared to a fixed 5-second length?

GRU and Transformer Encoder Layer

Hello, thank you for your research, very interesting results.
I noticed you have some lines of code in the generator that you don't actually use, but published.
Could you please elaborate a bit on the GRU and Transformer encoder methods and how you used them?

Evaluation index issues

I would like to ask how the evaluation metrics MAE and Acc in the paper are calculated, and whether there is any corresponding code? Thank you very much for your answer.

Emotion transfer inference

Good day, thank you very much for your results!

I have a question concerning emotion inference: it would be nice to have a brief illustration of emotion transfer inference in your Demo code. Do you plan to include it as well?

Kind regards,
V.D.

emotional transfer for target speakers with only neutral data

Hello!

First of all, congrats for your awesome work!

I am playing with your demo and trying to convert an emotional source utterance (from ESD) to a target speaker that only has neutral data (from VCTK) while preserving the source's emotion, but unfortunately I found the outputs to be always neutral.

I was wondering if you tried any experiment in this direction, since from the paper, as far as I understood, for a target neutral speaker you only tried with a neutral source utterance (VCTK -> VCTK).

(When the output is from a speaker that has already seen emotional data, the conversion is indeed neat!)

Thanks!

Hello, I would like to inquire about the experimental results.

Hello, thank you very much for sharing the code. I noticed that you have two papers on the same task; the other one is StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings. However, in the experimental results shown in Table 2, the Acc results are different, with some being slightly worse. What is the reason?

How well does it generalise for A2M?

Looking at the results of the subjective evaluation, there are no SMOS (similarity) results. How well does your model actually generalise for A2M tasks?
