
wgansing's Introduction

WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN

Pritish Chandna, Merlijn Blaauw, Jordi Bonada, Emilia Gómez

Music Technology Group, Universitat Pompeu Fabra, Barcelona

This repository contains the source code for multi-voice singing voice synthesis.

Installation

To install, clone the repository and run
pip install -r requirements.txt
to install the required packages.

The main code is in the main.py file.

Training and inference

To use WGANSing, you will have to download the model weights and place them in the log_dir directory defined in config.py.

The NUS-48E dataset can be downloaded from here. Once downloaded, please change wav_dir_nus in config.py to the directory containing the dataset.
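For reference, the relevant config.py entries might look like this (illustrative placeholder values, not the repository's defaults):

log_dir = './log/'                    # where the downloaded model weights go
wav_dir_nus = './nus-smc-corpus_48/'  # directory containing the NUS-48E dataset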

To prepare the data for use, please use prep_data_nus.py.

Once set up, you can run the following commands. To train the model:

python main.py -t

To synthesize a .lab file, use:

python main.py -e filename alternate_singer_name 

If no alternate singer is given, the original singer will be used for synthesis. If an invalid singer is entered, a list of valid singer names will be displayed.
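For example, using a file name and target singer that appear elsewhere on this page (an illustrative invocation; the exact argument format may differ):

python main.py -e nus_JLEE_sing_15 MPOL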

You will also be prompted on whether plots should be displayed; press y or Y to view them.

Acknowledgments

The TITAN X used for this research was donated by the NVIDIA Corporation. This work is partially supported by the Towards Richer Online Music Public-domain Archives (TROMPA) European project (H2020 770376).

[1] Duan, Zhiyan, et al. "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech." 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 2013.

[2] Blaauw, Merlijn, and Jordi Bonada. "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs." Applied Sciences 7.12 (2017): 1313.

[3] Blaauw, Merlijn, et al. "Data Efficient Voice Cloning for Neural Singing Synthesis." 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

wgansing's People

Contributors

dependabot[bot], kant, pc2752, tom5079


wgansing's Issues

dataset from TROMPA

Hi, thanks for sharing.

In the Acknowledgments you write: "This work is partially supported by the Towards Richer Online Music Public-domain Archives (TROMPA) (H2020 770376) European project." Did they provide you with the musical scores, lyrics and pitch annotations for NUS-48E? Could you share them with me? Thank you very much.

Two problems with this repo.

Hi,

I have two questions about this repo that need your help:

(1) The model weights you provided are from epoch 950, but the network described in the paper was trained for 3000 epochs. Could you please provide the model that generates the same audio as the samples at "https://pc2752.github.io/sing_synth_examples/"? I used the epoch-950 model for inference, and the generated audio for "JLEE" and "MCUR" is worse than the samples. By the way, could you please tell me which identities you converted "JLEE" and "MCUR" to on the webpage?

(2) In the WGANSing paper, you propose adjusting the input f0 by an octave to account for the different ranges of the genders, but I can't find the corresponding code for inference. I tried the model you provided, and I think the gender-change results are not good. I have attached the relevant audio: "SAMF_original.wav" is the audio from the dataset ("\nus-smc-corpus_48\SAMF\01.wav"), and "SAMF_output.wav" is the audio generated from "nus_MCUR_sing_04.hdf5". Could you please tell me how to adjust the input f0, or provide the corresponding code?

Thanks in advance!!

SAMF.zip

Missing 'models' module?

I can't seem to find the models.py module, which I think contains the WGANSing model. Has it been released yet?

Potential typo in code

in modules_tf.py:

method signature:

  • def full_network(f0, phos, singer_label, is_train):

in model.py:

method call:

  • self.output = modules.full_network(self.phone_onehot_labels, self.f0_placeholder, self.singer_onehot_labels, self.is_train)

I think self.phone_onehot_labels (Pho_in) and self.f0_placeholder (F0_in) need to be swapped in the method call to be consistent with the signature (or the signature order should be changed). This does not break the model's functionality, since the two layers have a similar construction signature and are interchangeable in the concatenated input tensor. A sketch of the fix is below.
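A sketch of the suggested swap at the call site (keeping the signature as-is):

# model.py: arguments reordered to match full_network(f0, phos, singer_label, is_train)
self.output = modules.full_network(self.f0_placeholder, self.phone_onehot_labels, self.singer_onehot_labels, self.is_train)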

Syntax Error

File "main.py", line 121
self.final_loss = tf.reduce_sum(tf.abs(self.output_placeholder- (self.output/2+0.5)))/(config.batch_size*config.max_phr_len*64) * config.lambda + tf.reduce_mean(self.D_fake+1e-12)
^
SyntaxError: invalid syntax
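Two things seem to be going on here (my reading, not a confirmed diagnosis): the asterisks in the denominator were apparently swallowed by the issue tracker's markdown, and, independently, lambda is a reserved keyword in Python, so config.lambda can never parse. A sketch of a corrected line, assuming the config attribute is renamed to a hypothetical lambda_w:

# `lambda` is a Python keyword; `config.lambda` is itself a SyntaxError.
# Renaming the attribute (hypothetically, to lambda_w) fixes the parse error.
self.final_loss = tf.reduce_sum(tf.abs(self.output_placeholder - (self.output / 2 + 0.5))) \
    / (config.batch_size * config.max_phr_len * 64) * config.lambda_w \
    + tf.reduce_mean(self.D_fake + 1e-12)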

met "Segmentation fault": added a lstm layer in the full_network, when run train

Hi, thanks for sharing. When I add an LSTM layer, I get a "Segmentation fault".

I added a simple LSTM layer to full_network; the code is below:

output = tf.squeeze(output)  # original code, unchanged
# The LSTM layer I added:
lstm_cell = tf.contrib.rnn.BasicLSTMCell(64, forget_bias=1.0, state_is_tuple=True)
init_state = lstm_cell.zero_state(batch_size=config.batch_size, dtype=tf.float32)
lstm_out, final_state = tf.nn.dynamic_rnn(lstm_cell, output, initial_state=init_state, time_major=False)
print("-----------lstm_out.shape------------", lstm_out.shape)  # (30, 128, 64)
return lstm_out

When I train, I get a "Segmentation fault" with no other error information.
It's puzzling, since the code I added is simple and looks correct.

assertion error

Training stops, both locally (my computer) and remotely (Google Colab), at:

/content/WGanSing/data_pipeline.py in data_gen(mode, sec_mode)
    128         feats_targs = np.array(feats_targs)
    129 
--> 130         assert feats_targs.max()<=1.0 and feats_targs.min()>=0.0
    131 
    132         yield feats_targs, targets_f0_1, np.array(pho_targs), np.array(targets_singers)

AssertionError: 

When I comment out the assert line, training starts, but I want to be sure this is not actually a problem. (A possible workaround is sketched below.)
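A clamping sketch (an editorial suggestion, not the author's fix): bound the features instead of removing the check, so values marginally outside the range cannot crash training:

import numpy as np

feats_targs = np.array(feats_targs)
# Clip instead of asserting: values slightly outside [0, 1]
# (e.g. from normalisation edge cases) are clamped.
feats_targs = np.clip(feats_targs, 0.0, 1.0)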

ERROR: Failed building wheel for h5py

Build error on Ubuntu 18.04
I get the build error ERROR: Failed building wheel for h5py.

and then

     #warning "Using deprecated NumPy API, disable it by " \
      ^~~~~~~
    In file included from /tmp/pip-install-z6jkh6m5/h5py/h5py/defs.c:639:0:
    /tmp/pip-install-z6jkh6m5/h5py/h5py/api_compat.h:27:10: fatal error: hdf5.h: No such file or directory
     #include "hdf5.h"
              ^~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------
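The fatal error (hdf5.h: No such file or directory) indicates the HDF5 development headers are missing. On Ubuntu, installing them before retrying pip usually resolves this (a general h5py build note, not repo-specific advice):

sudo apt-get install libhdf5-dev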

Redundant call to PyWorld

In utils.py, line 262, there is a redundant call to wav2world:
>>> feats=pw.wav2world(vocals,fs,frame_period=5.80498866)
Removing this line speeds up the preprocessing in prep_data_nus.py.

EDIT:
My assertion was wrong (feats is used before the second call), sorry. But maybe the code can be adapted to simply reuse feats for ap and harm instead of calling pw.wav2world again, which is computationally intensive (and called 96 × 2 times in the NUS database case). A sketch is below.
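A sketch of the suggested reuse (an illustration; the variable names are guesses, not verified against the repo):

import pyworld as pw

# Analyse once and unpack: wav2world returns (f0, spectral envelope, aperiodicity).
f0, harm, ap = pw.wav2world(vocals, fs, frame_period=5.80498866)
# Reuse harm and ap downstream instead of calling wav2world a second time.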

Error while synthesizing: "a mismatch between the current graph and the graph"

Hi,

I am trying to run the trained model by following the instructions to synthesize a file. I was able to set up the dataset, but when I run main.py, it fails while restoring from a checkpoint with the following error:

line 1322, in restore
err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that
you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [3,64] rhs shape= [12,64]
[[node save/Assign_177 (defined at D:\GT_Robotic Musicianship\Voice Synthesis\Multi_Voice_Sing_Speak_Sing-master\models.py:59) ]]
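A hedged reading of the reported shapes (an editorial note, not the reporter's): in a restore, the lhs is the in-graph variable and the rhs is the checkpoint tensor, so the graph appears to have been built with a 3-singer embedding while the checkpoint was trained with 12 singers (NUS-48E contains 12). Checking the singer list in config.py would confirm this:

import config

# NUS-48E has 12 singers; a 3-entry list here would explain lhs shape [3, 64].
print(len(config.singers))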

The code includes a comment saying:
# There is a mismatch between the graph and the checkpoint being loaded.
# We add a more reasonable error message here to help users (b/110263146)

Could you please help me understand how to get around this error?

Thank you!

Error occurred when running prep_data_nus.py

Python version: 3.7.4 x64
OS version: Windows 10 1809

C:\Users\JackLin\Desktop\WGANSing-mtg>python prep_data_nus.py
[Repeated NumPy FutureWarning messages from tensorflow\python\framework\dtypes.py and tensorboard\compat\tensorflow_stub\dtypes.py omitted: "Passing (type, 1) or '1type' as a synonym of type is deprecated ..."]
Processing singer ADIZ
C:\Users\JackLin\Desktop\WGANSing-mtg\utils.py:267: RuntimeWarning: divide by zero encountered in log2
y=69+12*np.log2(f0/440)
Traceback (most recent call last):
File "prep_data_nus.py", line 153, in
main()
File "prep_data_nus.py", line 87, in main
hdf5_file["phonemes"][:,] = strings_p
File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "C:\Python374x64\lib\site-packages\h5py_hl\dataset.py", line 707, in setitem
for fspace in selection.broadcast(mshape):
File "C:\Python374x64\lib\site-packages\h5py_hl\selections.py", line 304, in broadcast
raise TypeError("Can't broadcast %s -> %s" % (target_shape, self.mshape))
TypeError: Can't broadcast (10963, 1) -> (10963,)
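A hedged guess at a fix (inferred from the broadcast error alone, not a tested patch): strings_p apparently has shape (10963, 1) while the HDF5 dataset expects (10963,), so flattening it before the assignment may help:

import numpy as np

# Drop the trailing singleton dimension: (10963, 1) -> (10963,)
hdf5_file["phonemes"][:] = np.asarray(strings_p).reshape(-1)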

Building an extended singing corpus

I'm working on a corpus based on various sources to be trained with your model. I have some questions:

  1. Is the paired read/sung version of the same words/songs relevant to the quality of the produced audio? What if only sung songs and unrelated spoken audio (random words from dialogue/interviews) of the same singer are provided?

  2. The current corpus is limited to 4 songs / 4 readings per singer. How many songs (or better, what total duration) per singer are needed to improve the output, and, approximately, up to what amount will the model still reliably converge during training?

  3. I've noticed that the NUS corpus is not consistent in terms of audio amplitude. Some recordings are quieter than others, and the produced audio can vary greatly. Do you think it's a good idea to normalize the input to a common loudness?

  4. The NUS corpus is recorded dry (without reverberation). What is the impact of reverb in this kind of model, where the input is in any case decomposed by WORLD and the early reflections (the decay of the reverb) mainly affect the F0?

  5. About the F0: there are some great new algorithms like CREPE or, more recently, SPICE (Google AI). Do you think it's possible to combine WORLD's aperiodicity and spectral envelope with a third-party F0 analysis, or are the steps too intertwined? As I see it, in the pyworld process the first call is DIO (the F0 estimation) before calling StoneMask, CheapTrick and D4C. WORLD's F0 estimation is clearly not the best, and F0 is crucial in our case (see the sketch after this list).
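A minimal sketch of the decoupling asked about in point 5, assuming a hypothetical external F0 track my_f0 already aligned to WORLD's frame times t (pyworld's analysis functions accept an externally supplied F0):

import pyworld as pw

# WORLD's own pipeline: DIO -> StoneMask -> CheapTrick -> D4C.
f0, t = pw.dio(vocals, fs, frame_period=5.80498866)
f0 = pw.stonemask(vocals, f0, t, fs)

# Swapping in a third-party F0 (hypothetical my_f0, same length as t):
sp = pw.cheaptrick(vocals, my_f0, t, fs)  # spectral envelope
ap = pw.d4c(vocals, my_f0, t, fs)         # aperiodicity
audio = pw.synthesize(my_f0, sp, ap, fs, 5.80498866)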

Thank you so much if you find time to answer my questions.

language independent

Hi,
I want to train your model on a corpus in another language, so I'd like to ask: is your model language-dependent or language-independent?
Thanks in advance.

variable max_phr_len in config.py

I am trying to use the GAN toolkit to generate read speech and am currently trying to understand your code. It would really help if you could tell me the role of the variable max_phr_len in config.py. I see the value is set to 128. What would happen if I increased or decreased it? I have been reading your paper but can't seem to work it out.

Are singers independent?

I wonder whether already-trained singers improve when new singers are trained.
In the paper, it seems the model takes singer-identity information along with the other input variables.
Does this network learn general voice features, or does it just treat singers independently?

Performance bottleneck (not from model)

When I first ran prep_data_nus.py, I noticed the long preprocessing time to generate the hdf5 files: approximately 3 hours on my computer for the 96 files. I traced the bottleneck to sp_to_mgc (the SPTK dependency).

To produce a 2m54s song (the Elton John one from the NUS database), my computer needs more than 13 minutes, 10 minutes longer than the song itself. I thought it was because I run the model on my CPU (not GPU), but I did some measurements and found that the problem is clearly not the model or the 'AI' part.

The inference call:

import models
import config

file_name = 'nus_JLEE_sing_15.hdf5'

singer_name = 'MPOL'
singer_index = config.singers.index(singer_name)

model = models.WGANSing()
model.test_file_hdf5_no_question(file_name, singer_index)

test_file_hdf5_no_question is the same as test_file_hdf5 without the prompts, but with per-function timing measurement, and only the synthesized audio is generated (not the ground truth).

The timing results (in seconds):

- load_model [*]   :   2.7976150512695
- read_hdf5_file   :   0.0341496467590
- process_file [*] :   3.0663671493530
- feats_to_audio   : 770.0193181037903

[*] Tensorflow calls

Clearly, the AI part is very fast, even on CPU. The problem comes from the audio regeneration.

Details of the feats_to_audio calls (again in seconds):

- f0_to_hertz   :   0.0130412578582
- mfsc_to_mgc   :   0.7175555229187
- mgc_to_sp     : 737.2016060352325
- pw.synthesize :  25.4196729660034
- sf.write      :   0.7051553726196

The PyWorld synthesize call is acceptable at 25 seconds (14% of the total audio duration), but the SPTK call is not.

Sadly, to my knowledge, this is the only fast (C) implementation of the Mel-Generalized Cepstrum conversion. And this is not a GPU question, because it is pure CPU code. What is going on with this algorithm?!

I know my computer is an old-school one: a Dell T7400 workstation with a 4-core Intel Xeon @ 2.33 GHz and 16 GB RAM. But it works very well for many things apart from pure deep learning.

I don't know how much is possible here for WGANSing, because MGC is at the heart of the project, but I will investigate ways to optimize this step. I'm sure the computation time can be reduced with some tricks.

In any case, well done with WGANSing; I love this kind of project!

Small but important problems with this repo.

Hi,

I would like to make some minor comments on this repo:

  1. The README doesn't specify how this library is meant to be used in prediction mode: what is the input and what is the output? I was curious, for example, whether you could provide a new voice and a text and make that voice sing the text, or take a singing voice and style-transfer it to another singer from his or her speaking voice. It remains unclear to me how customizable it is and what its use case is.
  2. Secondly, the requirements file is missing dependencies such as matplotlib, tqdm and cython (for pyworld), and it contains version conflicts that need to be resolved by manually installing and downgrading packages (such as tensorboard). Also, numpy must already be installed for pyworld to compile (a classic Cython compilation problem; not on you, of course).
  3. Although config.py and the download links are well set up, the help text of main.py is not that helpful (for example, regarding how to predict: the eval mode conflicts with the wave mode in the arguments, compared with what is in your README), and I was also curious what the folder voice_dir = ../ss_synthesis/voice/ should point to.

Thanks in advance!!
