auspicious3000 / speechsplit
Unsupervised Speech Decomposition Via Triple Information Bottleneck
Home Page: http://arxiv.org/abs/2004.11284
License: MIT License
Great work! My question is how to get the Mel and normed-F0 values in the structure below.
Can anyone help?
[Speaker_Name , One-hot , [Mel, normed-F0, length, utterance_name] ]
How do I solve this error when executing the last cell?
AttributeError Traceback (most recent call last)
in ()
10 os.makedirs('results')
11
---> 12 model = build_model().to(device)
13 checkpoint = torch.load("/content/SpeechSplit/checkpoint_step001000000_ema.pth")
14 model.load_state_dict(checkpoint["state_dict"])
/root/wavenet_vocoder/autovc/synthesis.py in build_model()
AttributeError: 'HParams' object has no attribute 'builder'
Can I get confirmation on tuning the bottlenecks in line with Appendix B.4?
"The first operation is to increase the channel dimension of the encoder output"
Does this refer to dim_enc, dim_enc_2 and dim_enc_3?
"The second operation is to increase the sampling rate of the down-sampled code"
Does this refer to freq, freq_2 and freq_3?
Is this project an evolution of autovc?
Can I transfer only rhythm, timbre, or pitch from audio1 to audio2 when they have different content?
Hello, I want to retrain the G model. I used the VCTK corpus and deleted some flac files that were too long or too short. I tried several times and the results were not good. I only modified the number of speakers in the model parameters. Can you send me a copy of your training data, or tell me what preprocessing you applied to the VCTK corpus and which speakers you selected? Thanks!!!
Does anyone know how to make a file like demo.pkl?
I tried printing one entry, but I still have no idea. Below is my code and the output:
demoData = pickle.load(open(os.path.join('assets', 'demo.pkl'), "rb"))
print(demoData[0][0])
print(demoData[0][1].shape)
print(demoData[0][2][0].shape)
print(demoData[0][2][1].shape)
print(demoData[0][2][2])
print(demoData[0][2][3])
result:
>> p226
>> (1, 82)
>> (135, 80)
>> (135,)
>> 135
>> 003002
I suppose it contains 3 things:
Thanks in advance
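For readers with the same question: the printed shapes above, together with the layout [Speaker_Name, One-hot, [Mel, normed-F0, length, utterance_name]] quoted earlier in this thread, suggest one entry per speaker. The sketch below builds such an entry and round-trips it through pickle. It is an illustration under assumptions, not the authors' script: plain nested lists stand in for the NumPy arrays in the real file, make_entry, the speaker index 0, and the output filename are hypothetical, and the zero-filled values are placeholders rather than real features.

```python
import os
import pickle
import tempfile


def make_entry(speaker, speaker_index, num_speakers, mel, f0, utterance_id):
    """Build one entry: [speaker, one_hot, (mel, f0, length, utterance_id)]."""
    if len(mel) != len(f0):
        raise ValueError("mel and f0 must have the same number of frames")
    # One-hot speaker ID with shape (1, num_speakers), like the printed (1, 82).
    one_hot = [[1.0 if i == speaker_index else 0.0 for i in range(num_speakers)]]
    return [speaker, one_hot, (mel, f0, len(mel), utterance_id)]


# Placeholder features: 135 frames of 80 mel bins and 135 normalized F0 values.
frames, n_mels = 135, 80
mel = [[0.0] * n_mels for _ in range(frames)]
f0 = [0.0] * frames

# The speaker index (0) is an assumption; the real index depends on the training set.
entry = make_entry("p226", 0, 82, mel, f0, "003002")

# Save a list of entries and read it back, the same way the question loads demo.pkl.
path = os.path.join(tempfile.mkdtemp(), "my_demo.pkl")
with open(path, "wb") as handle:
    pickle.dump([entry], handle)
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

speaker, one_hot, (mel_out, f0_out, length, utt_id) = loaded[0]
print(speaker, len(one_hot[0]), len(mel_out), len(mel_out[0]), length, utt_id)
```

In a real file the mel and F0 placeholders would be replaced by the arrays that make_spect_f0.py writes to the spmel and raptf0 folders.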
Hello,
Thanks for uploading the code! I wanted to let you know I'm having some issues with running the code from the demo, getting this error:
RuntimeError: Error(s) in loading state_dict for WaveNet:
Missing key(s) in state_dict: "upsample_net.conv_in.weight", "upsample_net.upsample.up_layers.1.weight_g", "upsample_net.upsample.up_layers.1.weight_v", "upsample_net.upsample.up_layers.3.weight_g", "upsample_net.upsample.up_layers.3.weight_v", "upsample_net.upsample.up_layers.5.weight_g", "upsample_net.upsample.up_layers.5.weight_v", "upsample_net.upsample.up_layers.7.weight_g", "upsample_net.upsample.up_layers.7.weight_v".
Unexpected key(s) in state_dict: "upsample_conv.0.bias", "upsample_conv.0.weight_g", "upsample_conv.0.weight_v", "upsample_conv.2.bias", "upsample_conv.2.weight_g", "upsample_conv.2.weight_v", "upsample_conv.4.bias", "upsample_conv.4.weight_g", "upsample_conv.4.weight_v", "upsample_conv.6.bias", "upsample_conv.6.weight_g", "upsample_conv.6.weight_v", "conv_layers.0.conv1x1c.bias", "conv_layers.1.conv1x1c.bias", "conv_layers.2.conv1x1c.bias", "conv_layers.3.conv1x1c.bias", "conv_layers.4.conv1x1c.bias", "conv_layers.5.conv1x1c.bias", "conv_layers.6.conv1x1c.bias", "conv_layers.7.conv1x1c.bias", "conv_layers.8.conv1x1c.bias", "conv_layers.9.conv1x1c.bias", "conv_layers.10.conv1x1c.bias", "conv_layers.11.conv1x1c.bias", "conv_layers.12.conv1x1c.bias", "conv_layers.13.conv1x1c.bias", "conv_layers.14.conv1x1c.bias", "conv_layers.15.conv1x1c.bias", "conv_layers.16.conv1x1c.bias", "conv_layers.17.conv1x1c.bias", "conv_layers.18.conv1x1c.bias", "conv_layers.19.conv1x1c.bias", "conv_layers.20.conv1x1c.bias", "conv_layers.21.conv1x1c.bias", "conv_layers.22.conv1x1c.bias", "conv_layers.23.conv1x1c.bias".
I used to have size mismatches as well, but then I edited these rows from inside the wavenet_vocoder repo:
residual_channels=512,
gate_channels=512, # split into 2 gropus internally for gated activation
skip_out_channels=256,
Maybe it's something obvious to you. Thank you for publishing your code, and of course for your time. Much obliged.
Although the tuning process mentioned is very intuitive, there seems to be no theoretical guarantee that the same bottleneck sizes will work for all speakers. I think deciding the bottlenecks directly from the speech (without going through the manual tuning process) is a research problem in itself.
Practically speaking, though, one set of bottleneck sizes might work well for most cases. Is that true of the sizes used in the repo? Has anyone tried the same sizes on a different dataset? Since training takes a long time, repeating the tuning process for every new speaker or dataset could make the approach very impractical.
@auspicious3000 any insights or help is very much appreciated
Hi, thanks for your great work.
I know that the pre-trained model, which you ask us to download before running demo.ipynb, was trained for 660000 steps.
May I ask which dataset, and how much data, you used to train the model you provided?
Hi, I am trying to replace the WaveNet vocoder with MelGAN. However, I noticed that the dB scaling of the spectrogram generated in demo.ipynb differs from that of the spectrogram MelGAN generates.
On the left is c.T from demo.ipynb plotted, while on the right is the spectrogram MelGAN generates from wav files. Note that the scales differ: one is positive, the other negative.
The demo.ipynb spectrogram synthesizes fine with WaveNet but produces garbage when fed into MelGAN. Linearly rescaling the dB values to approximate MelGAN's range works acceptably, but is there a proper method to convert between the two dB scalings?
Is it possible to use the PWG vocoder (https://github.com/kan-bayashi/ParallelWaveGAN) instead of WaveNet on the output of the decoder? Specifically, do I need to change the frame length and frame hop to make the mel spectrograms compatible with PWG?
WaveNet inference is very slow, so it would help if we could use other neural vocoders directly. That way we could just fine-tune the given pretrained SpeechSplit models instead of training again from scratch.
File "SpeechSplit/data_loader.py", line 108, in __call__
pdb.set_trace()
NameError: name 'pdb' is not defined
Where should pdb be defined or imported in the data_loader file?
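pdb is Python's standard-library debugger, so this NameError usually just means the module was never imported in the file that calls pdb.set_trace(). A minimal sketch of the fix, assuming the call at line 108 is a leftover debugging breakpoint; debug_hook is a hypothetical stand-in, not a function from the repository:

```python
import pdb  # standard library; must be imported before pdb.set_trace() is used


def debug_hook(enabled=False):
    """Drop into the debugger only when explicitly enabled."""
    if enabled:
        pdb.set_trace()  # pauses execution here and opens an interactive prompt
    return enabled


# With the breakpoint disabled, the function returns without pausing.
debug_hook(enabled=False)
```

If the breakpoint is not needed, deleting the pdb.set_trace() line is an alternative to adding the import.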
Hi everyone,
I was trying to make my own Generator model; however, I found the result always carries Vibrato.
datasets: VCTK + LibriSpeech clean-100 + LibriSpeech clean-360 (with no data augmentation)
Instead of using one-hot speaker id, I was using speaker embedding.
The validation loss is 47.18.
Here is my result.
The intonation and naturalness sound okay, but the voice sounds like a man/woman speaking in front of a fan,
and the microphone is three steps away from the speaker.
Could anyone give me some advice or suggestion that may fix this kind of issue?
Should I change the datasets or maybe all I need is data augmentation?
Thanks in advance.
If the content, rhythm, and pitch codes have different lengths, how are they aligned, given that the decoder has no attention mechanism?
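My understanding, which should be checked against the paper and model.py, is that no attention is needed because each code is downsampled by a fixed factor and then upsampled back to the original frame count by repetition before it reaches the decoder. The toy sketch below illustrates that idea on plain lists. It is not the repository's code: downsample, upsample_by_repeat, and the factors are hypothetical.

```python
def downsample(frames, factor):
    """Keep one code per `factor` frames (toy stand-in for encoder downsampling)."""
    return frames[::factor]


def upsample_by_repeat(codes, factor):
    """Repeat every code `factor` times to restore the original frame count."""
    return [code for code in codes for _ in range(factor)]


num_frames = 16
frames = list(range(num_frames))

# Hypothetical downsampling factors; each stream may use a different one.
factors = {"content": 8, "rhythm": 8, "pitch": 4}

codes = {name: downsample(frames, f) for name, f in factors.items()}
restored = {name: upsample_by_repeat(codes[name], factors[name]) for name in factors}

# The code lengths differ, but the restored streams all have num_frames entries,
# so they can be concatenated frame by frame.
for name in factors:
    print(name, len(codes[name]), len(restored[name]))
```

This only works cleanly when the frame count is a multiple of every factor, which is one reason inputs are typically padded to a fixed length.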
Hi there,
I am getting the following error:
ImportError: cannot import name 'pad_seq_to_2'
No documentation online seems to support this function in the package 'utils'. I would be really grateful to have some clarity on this issue!
Thanks :)
The sampling rate of the VCTK corpus is 48 kHz, while the model requires 16 kHz. To match the sampling rate, I used librosa's resample function, and my code looks like this:
import librosa
y, sr = librosa.load(wav_file, sr=48000)
y_16k = librosa.resample(y, sr, 16000)
Is this the same code you used for downsampling the audio? I am asking because I want to make sure the data distribution is the same.
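Whether the authors used this exact call is not stated in the thread. As a side note for anyone reusing the snippet: recent librosa releases (0.10 and later) accept orig_sr and target_sr only as keyword arguments, so the positional form above fails there. A hedged sketch, where load_16k is a hypothetical helper and librosa is imported lazily so that defining the function has no hard dependency:

```python
def load_16k(wav_path, target_sr=16000):
    """Load an audio file at its native rate, then resample it to `target_sr`."""
    import librosa  # third-party; imported here so the module loads without it

    # sr=None keeps the file's native sampling rate (48 kHz for VCTK).
    y, native_sr = librosa.load(wav_path, sr=None)
    if native_sr == target_sr:
        return y, target_sr
    # Keyword arguments are required in librosa 0.10 and later.
    y_resampled = librosa.resample(y, orig_sr=native_sr, target_sr=target_sr)
    return y_resampled, target_sr
```

Calling librosa.load(wav_path, sr=16000) should give an equivalent result in one step, since load resamples to the requested rate.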
I am attempting to retrain at 22050 Hz. At this sampling rate, the validation losses for G and P do not decrease (P's actually increases steadily). I am using test samples from every speaker in the training set. Both loss_id values decrease as expected.
I train G according to this code in solver.py:
self.G = self.G.train()
# G Identity mapping loss
x_f0 = torch.cat((x_real_org, f0_org), dim=-1)
x_f0_intrp = self.Interp(x_f0, len_org)
f0_org_intrp = quantize_f0_torch(x_f0_intrp[:,:,-1])[0]
x_f0_intrp_org = torch.cat((x_f0_intrp[:,:,:-1], f0_org_intrp), dim=-1)
# G forward
x_pred = self.G(x_f0_intrp_org, x_real_org, emb_org)
g_loss_id = F.mse_loss(x_pred, x_real_org, reduction='mean')
# Backward and optimize.
self.g_optimizer.zero_grad()
g_loss_id.backward()
self.g_optimizer.step()
loss['G/loss_id'] = g_loss_id.item()
and train P according to this code:
self.P = self.P.train()
# Preprocess f0_trg for P
x_f0_trg = torch.cat((x_real_org, f0_org), dim=-1)
x_f0_intrp_trg = self.Interp(x_f0_trg, len_org)
# Target for P
f0_trg_intrp = quantize_f0_torch(x_f0_intrp_trg[:,:,-1])[0]
f0_trg_intrp_indx = f0_trg_intrp.argmax(2)
# P forward
f0_pred = self.P(x_real_org,f0_trg_intrp)
p_loss_id = F.cross_entropy(f0_pred.transpose(1,2),f0_trg_intrp_indx, reduction='mean')
self.p_optimizer.zero_grad()
p_loss_id.backward()
self.p_optimizer.step()
loss['P/loss_id'] = p_loss_id.item()
I feel this may be due to the LSTMs in the encoders and decoders, since at a different SR the vocal features appear over a different scale, however any other suggestions would be appreciated.
I have integrated training for P in solver.py but am unsure of what learning rate to use.
The default for G is 0.001, but I doubt this is also correct for P.
What initial LR should I use for P?
Many thanks.
What I have done: I purposely set a 0-like speaker embedding vector during testing for both image representation and loss measure (MSE, I assume higher is better).
As a result, I can clearly observe a significant MSE (around 33) after a few days of training. However, when doing real voice conversion (from one speaker to another), the model only achieves reconstruction, without voice conversion.
If possible, I would really appreciate knowing whether there are other ways to test voice conversion during training.
Many thanks.
Greetings, thanks for such a good project.
In my experiment, I used the same dataset as you (VCTK), and I have only trained for 68000 steps. The log of my experiment looks like this:
I noticed that the validation loss is rising and the training loss also has some fluctuating peaks. Is it a normal phenomenon?
thank you in advance :)
Can the model handle inference on audio that is not in the dataset?
Is the spk2gen.pkl file available?
Hi, I observed that the spectrogram values saved in the npy files range from about -0.2 to 0.8. Why do you normalize the spectrogram into this range?
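One way to see where this range comes from is to trace the two normalization lines in the preprocessing code quoted later in this thread (D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16, then S = (D_db + 100) / 100). The sketch below redoes that arithmetic for a single magnitude value using only the standard library. normalize and to_db are hypothetical names, and the upper end of the range depends on the signal, so only the floor is computed exactly.

```python
import math

# Same floor as in the quoted code: exp(-100 / 20 * ln 10) = 10 ** -5, i.e. -100 dB.
MIN_LEVEL = math.exp(-100 / 20 * math.log(10))
REF_OFFSET_DB = 16


def normalize(magnitude):
    """Map one mel magnitude to the stored value, following the quoted lines."""
    db = 20 * math.log10(max(MIN_LEVEL, magnitude)) - REF_OFFSET_DB
    return (db + 100) / 100


def to_db(s):
    """Invert the last step: recover the offset dB value from a stored value."""
    return s * 100 - 100


floor = normalize(0.0)  # silence hits the floor: (-100 - 16 + 100) / 100
unit = normalize(1.0)   # a magnitude of 1.0 maps to (0 - 16 + 100) / 100
print(round(floor, 2), round(unit, 2))
```

The floor of about -0.16 is consistent with the observed lower bound near -0.2, and a magnitude of 1.0 maps to 0.84. The to_db helper may also be a starting point for converting to another vocoder's scale, although that vocoder's own convention would need to be checked.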
How do I use this with my own voice recordings?
Which script do I use to extract the components from them?
Does this mean the provided pre-trained model is somewhat overfitted, having been trained on a small dataset?
OK, I have downloaded Visual Studio Code to debug and understand the code.
I see that make_spect_f0.py is used to generate the raptf0 and spmel folders and fill them with values.
So make_spect_f0.py reads a folder and decides whether it holds a male or female voice from the spk2gen.pkl file.
To begin, I deleted the raptf0, spmel and wavs folders,
then created a new wavs folder and, inside it, a folder named p285, which is assigned as a male speaker.
Inside p285 I put my wav file, myfile.wav, which is more than 2 hours long.
Question 1: Does it have to be 16 kHz and mono, or can we use maximum quality?
After I ran make_spect_f0.py, it created myfile.npy in both the raptf0 and spmel folders.
Then I ran make_metadata.py, and it created train.pkl inside spmel.
Then when I run main.py, I get the error below at solver.py.
I want to train a model. I don't want to test.
Then I want to use this model to convert the style of a speech to that of the trained model.
So I need help, thank you.
Hi. I wanted to ask whether you normalized the audio after trimming all the silences.
And if you did, what method did you use? (maybe link to a paper or lecture or some package, please?)
What was the final validation loss of G and P when training was almost done? My result is something like this, and I'm not sure whether it's an acceptable number.
When running demo.ipynb, I am presented with this error:
AttributeError Traceback (most recent call last)
in ()
10 os.makedirs('results')
11
---> 12 model = build_model().to(device)
13 checkpoint = torch.load("/content/SpeechSplit/checkpoint_step001000000_ema.pth")
14 model.load_state_dict(checkpoint["state_dict"])
/root/wavenet_vocoder/autovc/synthesis.py in build_model()
AttributeError: 'HParams' object has no attribute 'builder'
This has been mentioned before in #1, but there have been no solutions posted. This repo has very few instructions. The ones that exist are vague and lack detail. It would be helpful to have a more comprehensive installation tutorial.
How to get the right x_org, f0_org, len_org, uid_org value?
x_org, f0_org, len_org, uid_org = sbmt_i[2]
Thanks for your awesome contributions in this paper. I want to evaluate my synthesized audio. Can you share the code to compute GPE, VDE and FFE? Thank you!
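The thread does not include the authors' evaluation code. For orientation, here is a minimal sketch of the three metrics using their commonly cited definitions, which may differ in detail from what the paper used: GPE is the share of frames voiced in both tracks whose pitch deviates by more than 20%, VDE is the share of all frames with a voicing mismatch, and FFE is the share of all frames with either error. The function name, the 20% tolerance default, and the convention that an F0 of 0 marks an unvoiced frame are assumptions.

```python
def f0_error_metrics(ref_f0, est_f0, tolerance=0.2):
    """Return (GPE, VDE, FFE) for two frame-aligned F0 tracks in Hz.

    A value of 0 (or below) marks an unvoiced frame in either track.
    """
    if len(ref_f0) != len(est_f0):
        raise ValueError("F0 tracks must have the same number of frames")
    total = len(ref_f0)
    if total == 0:
        raise ValueError("F0 tracks must not be empty")

    both_voiced = 0      # frames voiced in reference and estimate
    gross_errors = 0     # both voiced, but pitch off by more than the tolerance
    voicing_errors = 0   # voiced in one track, unvoiced in the other

    for ref, est in zip(ref_f0, est_f0):
        ref_voiced, est_voiced = ref > 0, est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) > tolerance * ref:
                gross_errors += 1

    gpe = gross_errors / both_voiced if both_voiced else 0.0
    vde = voicing_errors / total
    ffe = (gross_errors + voicing_errors) / total
    return gpe, vde, ffe


reference = [100.0, 100.0, 0.0, 0.0, 200.0]
estimate = [100.0, 130.0, 0.0, 50.0, 0.0]
gpe, vde, ffe = f0_error_metrics(reference, estimate)
print(gpe, vde, ffe)
```

In this toy example one of the two jointly voiced frames is a gross pitch error and two of the five frames have a voicing mismatch. Both tracks must use the same hop size so that frames line up.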
The loss on my training set looks normal, but the loss on the validation set keeps rising. The structure of the validation set is:
[speaker, speaker_onehot, (spmel, raptf0, len, chapter)],
spmel and raptf0 were extracted by make_spect_f0.py directly.
Is there any problem with this?
I tried several times and the loss of validation set is rising.
Hi. Thank you for the fantastic project.
Is your model capable of transferring content, rhythm, and pitch between different sentences?
I've prepared a demo.pkl file in which metadata[0] is
metadata[0].wav.zip
and metadata[1] was left the same.
Here is the result:
p226_p231_003002_RFU.wav.zip
Did I do something wrong, or is your model not intended for such conversions?
Extract spectrogram and f0: python make_spect_f0.py
Generate training metadata: python make_metadata.py
My code is based on the above step!
Who can help me?
import os
import sys
import pickle
import numpy as np
import soundfile as sf
from scipy import signal
from librosa.filters import mel
from numpy.random import RandomState
from pysptk import sptk
from utils import butter_highpass
from utils import speaker_normalization
from utils import pySTFT
import torch
from autovc.model_bl import D_VECTOR
from collections import OrderedDict

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
c_checkpoint = torch.load('assets/3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_key = key[7:]
    new_state_dict[new_key] = val
C.load_state_dict(new_state_dict)

num_uttrs = 1
len_crop = 128
spk2gen = pickle.load(open('assets/spk2gen.pkl', "rb"))
rootDir = 'assets/wavs'
targetDir_f0 = 'assets/raptf0'
targetDir = 'assets/spmel'
dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)

speakers = []
for subdir in sorted(subdirList):
    print(subdir)
    if not os.path.exists(os.path.join(targetDir, subdir)):
        os.makedirs(os.path.join(targetDir, subdir))
    if not os.path.exists(os.path.join(targetDir_f0, subdir)):
        os.makedirs(os.path.join(targetDir_f0, subdir))
    _,_, fileList = next(os.walk(os.path.join(dirName,subdir)))
    if spk2gen[subdir] == 'M':
        lo, hi = 50, 250
    elif spk2gen[subdir] == 'F':
        lo, hi = 100, 600
    else:
        raise ValueError
    utterances = []
    utterances.append(subdir)
    _, _, fileList = next(os.walk(os.path.join(dirName, subdir)))
    # make speaker embedding [Speaker_Name , One-hot , [Mel, normed-F0, length, utterance_name] ]
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    utterances.append(idx_uttrs)
    prng = RandomState(int(subdir[1:]))
    for i in range(num_uttrs):
        dirName2=dirName.replace("wavs", "spmel")
        npyfile=fileList[idx_uttrs[i]].replace("wav", "npy")
        tmp = np.load(os.path.join(dirName2, subdir, npyfile))
        # choose another utterance if the current one is too short
        embs = []
        left = np.random.randint(0, tmp.shape[0]-len_crop)
        melsp = torch.from_numpy(tmp[np.newaxis, left:left+len_crop, :]).cuda()
        emb = C(melsp)
        embs.append(emb.detach().squeeze().cpu().numpy())
        #embs1=emb.detach().squeeze().cpu().numpy()
        # read audio file
        x, fs = sf.read(os.path.join(dirName, subdir, fileList[idx_uttrs[i]]))
        assert fs == 16000
        if x.shape[0] % 256 == 0:
            x = np.concatenate((x, np.array([1e-06])), axis=0)
        y = signal.filtfilt(b, a, x)
        wav = y * 0.96 + (prng.rand(y.shape[0]) - 0.5) * 1e-06
        # compute spectrogram
        D = pySTFT(wav).T
        D_mel = np.dot(D, mel_basis)
        D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
        S = (D_db + 100) / 100
        # extract f0 [Speaker_Name , One-hot , [Mel, normed-F0, length, utterance_name] ]
        f0_rapt = sptk.rapt(wav.astype(np.float32) * 32768, fs, 256, min=lo, max=hi, otype=2)
        index_nonzero = (f0_rapt != -1e10)
        mean_f0, std_f0 = np.mean(f0_rapt[index_nonzero]), np.std(f0_rapt[index_nonzero])
        f0_norm = speaker_normalization(f0_rapt, index_nonzero, mean_f0, std_f0)
        embs.append(f0_norm)
        #embs2=f0_norm
        embs.append(tmp.shape[0])
        #embs3= tmp.shape[0]
        embs.append(subdir)
        #embs4= subdir
        embss = tuple(embs)
        utterances.append(embss)
    speakers.append(utterances)

with open(os.path.join(rootDir, 'train.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)
The paper caught my interest and I want to run the demo, but I am running into the following problem:
pip install wavenet-vocoder==0.1.1
The following error is thrown:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-e12167bbae28> in <module>
4 import pickle
5 import os
----> 6 from synthesis import build_model
7 from synthesis import wavegen
8
ModuleNotFoundError: No module named 'synthesis'
I think this is because installing wavenet-vocoder through pip does not install the required 'synthesis' module. Neither the wavenet-vocoder repo nor this one makes it clear how the synthesis module is supposed to be accessed.
Any help would be appreciated!
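The traceback earlier in this thread shows build_model being loaded from /root/wavenet_vocoder/autovc/synthesis.py, which suggests that synthesis is a standalone file rather than part of the pip package. One possible workaround, sketched under that assumption, is to put the directory that holds synthesis.py on sys.path before the import. add_module_dir and the example directory are hypothetical; substitute wherever the file lives on your machine.

```python
import sys
from pathlib import Path


def add_module_dir(directory):
    """Prepend `directory` to sys.path so modules inside it become importable."""
    resolved = str(Path(directory).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical location; replace with the folder that contains synthesis.py.
module_dir = add_module_dir("./autovc")

# After this, `from synthesis import build_model` can be attempted in the notebook.
print(module_dir in sys.path)
```

Copying synthesis.py next to demo.ipynb would have the same effect, since the notebook's working directory is already searched for imports.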
Hi,
Thanks for this amazing work!
I was quickly trying to run the demo. I am currently stuck with the synthesis module being imported to generate the audio from mels.
Kindly let me know where I can find it!
Thanks,
Shubham
In the file demo.pkl, the last element is the utterance ID, which is 003002 in the demo. How do I find the wav in the VCTK corpus that corresponds to ID 003002? I am also wondering how the utterance ID is obtained. You don't seem to mention how to derive it from the VCTK corpus.
Thank you!
Hi, thanks for the great work.
Sorry for the rudimentary question.
I have a question about the pre-trained model in demo.ipynb.
The paper says it was trained on 20 speakers, but the speaker ID vector used in demo.ipynb has a size of 82, so it looks like it holds information for 82 speakers.
Please tell me how many speakers were used in the pre-trained model and why the speaker ID in demo.ipynb has a size of 82.
demo.pkl is a list of 6 entries.
What does each entry of the list represent? I figured out the last entry is for identification. But I still have no idea what the other entries mean.
How can we manually construct a demo.pkl-like file? Are there any APIs to make one?
Thank you.
I am trying to replicate your work. I am currently making F0 converter model for P checkpoint generation. I am stuck at loss calculation.
I see when I use F0_Converter model to generate P, I get a 257 dimension one-hot encoded feature P.
Demo.ipynb
f0_pred = P(uttr_org_pad, f0_trg_onehot)[0]
f0_pred.shape
> torch.Size([192, 257])
I wanted to ask you when training the F0 converter model, what is the value that you are using to calculate the loss?
I tried using the following value but I am not sure if that is the right way.
This is what I am doing to generate f0_pred and to calculate the loss:
f0_pred = self.P(x_real_org,f0_org_intrp)[0]
p_loss_id = F.mse_loss(f0_pred,f0_org_intrp,reduction='mean')
I just want to know if I am on the right track.
Can you help me out here @auspicious3000
I have trained the Generator model with my own data. However, I could not find any code for generating speech from the trained Generator, so I checked demo.ipynb to find out how. It indicates that a trained F0_Converter is needed.
So I would like to ask the author: is it necessary to train an F0_Converter first in order to generate speech from the trained Generator (I found no code for training the F0_Converter)? Or can we just use the pretrained F0_Converter?
To Run Demo
(done) Download pre-trained models to assets
(done) Download the same WaveNet vocoder model as in AutoVC to assets
(done) wavenet_vocoder git checkout 44e0e36 for more information, please refer to
(done) Run demo.ipynb
ModuleNotFoundError: No module named 'synthesis'
Hi,
Thanks for the complete code!
I wanted to check how I can train the F0 converter P.
train.py only trains the SpeechSplit model G.
Kindly help.
Hi,
Did you experiment with a pretrained encoder for getting the speaker embedding, similar to your previous work (AutoVC)?
PS: Amazing work by the way!
Thanks,
thanks for your great work!
I have 2 questions:
Thanks
Thanks for the codebase. Good work!
In the paper, speech is split into timbre (using the speaker embedding), pitch, rhythm, and content. If I am not wrong, the speaker's accent information is not captured by the speaker embedding. (I know this because when I experimented with the AutoVC codebase, the speaker embedding did not capture the accent. The accent of the source speech always showed up in the voice conversion output.)
Any ideas on how to split the accent information from speech?
Thanks,
Pravin