
randomcnn-voice-transfer's Introduction

Voice style transfer with random CNN

Maybe the fastest voice style transfer with reasonable results?

What is voice style transfer?

Inspired by the paper A Neural Algorithm of Artistic Style, the idea of Neural Voice Transfer is to do something like "using Obama's voice to sing songs of Beyoncé".

We aim to transfer the voice style of one speaker onto the speech content of another.

Highlights of my work

  • Use 2-D CONV rather than 1-D on the audio spectrogram (see the sketch after this list).
  • Compute grams over the time axis.
  • Fast training: 5-10 minutes to train and transfer on a single GPU (Tesla P40).
  • No dataset needed! You can transfer between any 2 pieces of audio. (Some audio formats may raise errors; if so, run sudo apt-get install libav-tools.)
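
To make the first two points concrete, here is a minimal sketch, not the repo's exact code, of a single random 2-D CONV applied to a log-magnitude spectrogram followed by a Gram matrix accumulated over the time axis; the channel count, kernel placement, and tensor layout are illustrative assumptions.

import torch
import torch.nn as nn

class RandomCNN(nn.Module):
    def __init__(self, out_channels=16):
        super().__init__()
        # Random, untrained weights; the kernel spans 3 frequency bins x 1 time frame.
        self.conv = nn.Conv2d(1, out_channels, kernel_size=(3, 1))
        self.relu = nn.ReLU()

    def forward(self, spec):
        # spec: (1, 1, n_freq, n_time) log-magnitude spectrogram
        return self.relu(self.conv(spec))

def gram_over_time(features):
    # features: (1, C, F, T) -> correlate (channel, frequency) pairs across the time axis.
    n, c, f, t = features.shape
    feats = features.view(n * c * f, t)
    return feats @ feats.t() / t

net = RandomCNN()
spec = torch.randn(1, 1, 257, 244)   # e.g. 257 frequency bins, 244 time frames
gram = gram_over_time(net(spec))     # style statistic to match between the two audios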

Performance compared with other works

Some other projects with audio results are listed below.

  • Dmitry Ulyanov: Audio texture synthesis and style transfer: the first project to propose using a shallow random CNN for voice transfer. However, the results at that URL are not very good; they sound like two voices mixed together. I ran his code on my boy.wav and girl.wav to generate audio, and the result has the same problem. You can hear the comparison at Stairway2Nightcall; the audio used for comparison was downloaded from Dmitry Ulyanov's website.
  • Google's style transfer results. The paper is On Using Backpropagation for Speech Texture Generation and Voice Conversion; the architecture uses 13 CONV layers plus a pre-trained CTC model. Our results sound comparable with theirs; you can hear them on soundcloud.com.
  • Voice Style Transfer to Kate Winslet with deep neural networks. This project has the best results among all current works (maybe?). And the cost is:
    • Heavy architecture. As we can see from its GitHub repo, the approach trains 2 networks, a Net1 classifier and a Net2 synthesizer, and combines them.
    • Delicate dataset. Besides widely known datasets such as TIMIT, the author used 2 hours of the girl's audio and 1,000+ recordings of <boy, girl> pairs speaking the same sentence, which may be impractical when training on other people's voices.
    • Not general. The model was trained only to transfer to Kate Winslet's voice. If we want to transfer to Obama's voice, we need to gather Obama's voice data and train the network again.

To sum up, our results are far better than the original random CNN results, which use the same data (only two audio clips) as we do. Compared with the pre-trained deep neural networks built on huge datasets, our results are comparable, and they can be produced in 5 minutes without any external dataset. (Still, all of these conclusions are based on human listening.)

Results

You can listen to my current results now! They are on SoundCloud: link1, link2.

The generated spectrogram compared with content and style.

Comparing the spectrogram of the generated audio with those of the content and style (X axis: time, Y axis: frequency), we can see that:

  • The structure is almost the same as the content's, while the spacing along the frequency axis, which largely determines the voice texture, is closer to the style's.
  • The base skeleton is shifted upward a little to resemble the style (the style is a girl's voice, which has higher frequencies than the boy's).

Reproduce it yourself

pip install -r requirements.txt 
# remove `CUDA_VISIBLE_DEVICES` when using a CPU, though it will be slow. 
CUDA_VISIBLE_DEVICES=0 python train.py -content input/boy18.wav -style input/girl52.wav

Tip: changing the 3x1 CONV to a 3x3 CONV can produce a smoother generated spectrogram; a sketch of the change is shown below.
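
For illustration, the change amounts to widening the convolution kernel. The layers below are a hypothetical stand-in (channel count assumed), not the repo's actual layer definition:

import torch.nn as nn

# Original: the kernel covers 3 frequency bins x 1 time frame.
conv_3x1 = nn.Conv2d(1, 16, kernel_size=(3, 1))
# Alternative: a 3x3 kernel also mixes neighbouring time frames,
# which tends to smooth the generated spectrogram.
conv_3x3 = nn.Conv2d(1, 16, kernel_size=(3, 3))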

But... does the gram of a random CNN's output really work?

Below are my experimental results using the texture gram after a 1-layer RandomCNN to capture speaker identity, using it as the only feature in a simple nearest-neighbor speaker identification system. The table shows the speaker identification accuracy of this system over the first 15 utterances of the first 30 speakers of the VCTK dataset, and over 100 utterances of the first 4 speakers.

Speakers | Train/Test | Accuracy
30       | 270/180    | 45.6%
4        | 240/160    | 92.5%

It seems the texture gram along the time axis really captures something. You can check it yourself with:

python vctk_identify
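
For reference, here is a minimal sketch of such a nearest-neighbor check, assuming each utterance has already been reduced to a flattened texture-gram vector; the function and argument names are illustrative, not the repo's code.

import numpy as np

def nearest_neighbor_accuracy(train_grams, train_labels, test_grams, test_labels):
    # train_grams / test_grams: (n, d) arrays, one flattened texture-gram vector per utterance.
    correct = 0
    for gram, label in zip(test_grams, test_labels):
        # Euclidean distance to every training gram; predict the speaker of the closest one.
        dists = np.linalg.norm(train_grams - gram, axis=1)
        if train_labels[int(np.argmin(dists))] == label:
            correct += 1
    return correct / len(test_labels)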

randomcnn-voice-transfer's People

Contributors

akniele, marianbasti, mazzzystar, naveenarun


randomcnn-voice-transfer's Issues

When I run train.py, some problems come up:

`C:\Users\Administrator\AppData\Local\Programs\Python\Python35\python.exe E:/PyCharmProject/randomCNN-voice-transfer-master3/train.py
Traceback (most recent call last):
File "E:/PyCharmProject/randomCNN-voice-transfer-master3/train.py", line 4, in <module>
from torch.autograd import Variable
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\torch\__init__.py", line 84, in <module>
from torch._C import *
ImportError: DLL load failed: The specified procedure could not be found.

Process finished with exit code 1`

no voice

Hi, why is there no sound in my output.wav? I replaced librosa.output.write_wav() with wavfile.write.

Is there a way to use this solution without CUDA?

First, thanks for sharing this solution. Pardon me if I'm using Issues the wrong way; I am not an expert programmer. I'm trying to run this on a laptop without an NVIDIA video card, but PyTorch points out that it needs the CUDA libraries. I know that without CUDA the process will be slower.

Question: difference between this repo and others?

https://github.com/andabi/deep-voice-conversion seems to do the same thing, but is simpler, correct?

And then there is https://github.com/auspicious3000/autovc

And this https://github.com/marcoppasini/MelGAN-VC

They all seem to point to the idea that the different kinds of information within a piece of audio can be separated.

Just a note https://github.com/JRMeyer/open-speech-corpora has some good stuff.

Question

Do you think the results would be better if I saved the .npy spectrogram and used my own vocoder to do the inference?

Is it possible to retain a model of the voice conversion?

Hello, @mazzzystar !

First of all, thank you for this implementation and for providing the source code. This is not really an issue, but a question: is it possible to save a model after the conversion and reuse it on the same source and target voices (with different samples)?

can't optimize a non-leaf Tensor

Hi,
Recently, when running your code on a GPU, I get the error: can't optimize a non-leaf Tensor.
Tested on a local machine and on GCP Compute Engine, with PyTorch and CUDA.

Phase restoration algorithm

Hi @mazzzystar,
Thank you for the great work.
I am interested in your phase restoration algorithm:

def spectrum2wav(spectrum, sr, outfile):
    # Recover the magnitude from the log-compressed spectrogram.
    a = np.exp(spectrum) - 1
    # Start from a random phase in [-pi, pi).
    p = 2 * np.pi * np.random.random_sample(spectrum.shape) - np.pi
    for i in range(50):
        # Combine the fixed magnitude with the current phase estimate,
        # go back to the time domain, then re-estimate the phase from the result.
        S = a * np.exp(1j * p)
        x = librosa.istft(S)
        p = np.angle(librosa.stft(x, N_FFT))
    librosa.output.write_wav(outfile, x, sr)

Here we just make an initial random assumption about the phase, and after several iterative forward and backward transformations our phase estimate somehow improves.

So my question is: why should it converge? Do you know of any relevant literature about this? I am relatively new to audio processing and can't understand why this algorithm should work well.

Installation Environment

Hi, I would like to train this demo. What is the installation environment? Python version, PyTorch version, or other software?

thank you.

Very long training time compared to the original DmitryUlyanov project

I ran !python train.py -content input/boy18.wav -style input/girl52.wav on Google Colab and it took a long time to finish:

torch.Size([1, 1, 257, 244])
torch.Size([1, 1, 257, 355])
1000 5.0% 1m 38s content_loss:3.222046 style_loss:183.106659 total_loss:186.328705
2000 10.0% 3m 19s content_loss:3.329478 style_loss:32.088646 total_loss:35.418125
3000 15.0% 4m 59s content_loss:3.537519 style_loss:12.308197 total_loss:15.845716
4000 20.0% 6m 40s content_loss:3.672033 style_loss:4.532965 total_loss:8.204998

and it keeps going.

In Dmitry's original project this doesn't happen.

Some unclear details.

Hi,

Your code seems interesting and I tried to run it on a personal example.
First, you don't specify the torch version, though it seems to work with the latest one.
Secondly, does this model work better with more data?
I tried to run it with a style of about 30 minutes and a target of a bit more than 4 minutes, but I get an out-of-memory error on the GPU.
I looked into the code, but I don't see any batch size that I could edit to get your code to work on my example.

Can you propose anything?
Thanks in advance!

Any updates to this?

Just curious if you plan to update this project further, as I think it's really cool!

Input output???

I see several input files in the input folder.
How does this work? Can I take my voice, convert an existing wav into my voice, and create a new output?
The intention and usage are not clearly documented.

some problems

In image style transfer, the pixel values of a randomly initialized image are iteratively updated to obtain the optimal stylized image, so the key data being processed are the image's pixel values (i.e. the RGB values in 3-D). In voice style transfer, however, the audio signal is converted into a 2-D spectrum whose horizontal axis is time and whose vertical axis is frequency, and each value in the 2-D map is the energy at the corresponding time and frequency.
So I would like to ask:
1. In voice style transfer, is the key data being processed (iteratively updated) the energy in the 2-D spectrum?
2. Image pixel values are easy to understand: they represent color and can be regarded as the image's feature vector. But if the answer to 1 is "yes", what aspect of the speech does the energy represent? Pitch, timbre, or something else?
3. "I have also tried a similar approach before, i.e. using speech samples with the same content from different speakers to compute one loss that removes speaker information, and then using different sentences from the same speaker to compute another loss that removes content information, but the training results were not good." Could I have a look at the code for your algorithm?

RuntimeError: CUDA out of memory

Traceback (most recent call last):
File "train.py", line 75, in <module>
content_loss = content_param * compute_content_loss(a_C, a_G)
File "/content/randomCNN-voice-transfer/utils.py", line 64, in compute_content_loss
J_content = 1.0 / (4 * m * n_C * n_H * n_W) * torch.sum((a_C_unrolled - a_G_unrolled) ** 2)
RuntimeError: CUDA out of memory. Tried to allocate 2.99 GiB (GPU 0; 14.73 GiB total capacity; 10.14 GiB already allocated; 1.89 GiB free; 1.97 GiB cached)

How to train this repo? Can you please elaborate?

Hey @mazzzystar, I went through this repo. It looks good. I checked the audio samples. I would like to do voice cloning / voice conversion.
How much data do we need for training? I didn't see any data for training. How do we evaluate the results, and where do we find the output?
Could you please elaborate on this repo?

RuntimeError: CUDA error: unspecified launch failure

Hi, I frequently get this error. I reduced the inputs but it still appears very often, unpredictably.

2300 46.0% 129m 3s content_loss:66.045113 style_loss:150.286469 total_loss:216.331573
Traceback (most recent call last):
File "train.py", line 100, in <module>
content_loss.item(),
RuntimeError: CUDA error: unspecified launch failure

my conda environment:
absl-py @ file:///D:/bld/absl-py_1598382479526/work
astor @ file:///home/conda/feedstock_root/build_artifacts/astor_1593610464257/work
audioread==2.1.8
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache==1.6.1
certifi==2020.6.20
cffi @ file:///D:/bld/cffi_1595805721274/work
colorama==0.4.3
colorlog==4.2.1
cycler==0.10.0
decorator==4.4.2
future==0.18.2
gast @ file:///home/conda/feedstock_root/build_artifacts/gast_1596839682936/work
google-pasta==0.2.0
grpcio @ file:///D:/bld/grpcio_1596715850903/work
h5py @ file:///D:/bld/h5py_1595110299148/work
hydra-colorlog==0.1.4
hydra-core==0.11.3
importlib-metadata @ file:///D:/bld/importlib-metadata_1593211612489/work
inflect==0.2.5
ipykernel @ file:///D:/bld/ipykernel_1595447157738/work/dist/ipykernel-5.3.4-py3-none-any.whl
ipython @ file:///D:/bld/ipython_1598750239682/work
ipython-genutils==0.2.0
jedi @ file:///D:/bld/jedi_1595018503144/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1593624380152/work
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1598486169312/work
jupyter-core==4.6.3
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.2.0
librosa==0.6.3
llvmlite==0.31.0
Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1589366472132/work
matplotlib==2.2.5
mkl-service==2.3.0
numba==0.48.0
numpy @ file:///D:/bld/numpy_1597938567428/work
olefile==0.46
omegaconf==1.4.1
opt-einsum==3.3.0
pandas==1.1.1
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1595548966091/work
pickleshare==0.7.5
Pillow @ file:///D:/bld/pillow_1594213048891/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1598885455507/work
protobuf==3.13.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1593275161868/work
Pygments==2.6.1
pyparsing==2.4.7
PyQt5==5.12.3
PyQt5-sip==4.19.18
PyQtWebEngine==5.12.1
pyreadline==2.1
pystoi==0.3.3
python-dateutil==2.8.1
pytz==2020.1
pywin32==227
PyYAML==5.3.1
pyzmq==19.0.2
resampy==0.2.2
scikit-learn @ file:///D:/bld/scikit-learn_1596546337481/work
scipy @ file:///C:/ci/scipy_1597686737426/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
sounddevice==0.4.0
tensorboard==1.15.0
tensorboardX @ file:///home/conda/feedstock_root/build_artifacts/tensorboardx_1594067496847/work
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.2
termcolor==1.1.0
threadpoolctl @ file:///tmp/tmp79xdzxkt/threadpoolctl-2.1.0-py3-none-any.whl
torch==1.6.0
torchaudio==0.6.0
torchvision==0.7.0
tornado==6.0.4
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1596476591553/work
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1599170348041/work
Unidecode==1.0.22
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1595859607677/work
Werkzeug==1.0.1
wincertstore==0.2
wrapt==1.12.1
zipp==3.1.0
