Comments (7)
@OlaWod could you share the code you used for the results? I wanted to reproduce it with a different audio set, thanks.
WER, CER: here
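(The linked script is the authoritative one. As a rough stand-in, below is a minimal sketch that scores pre-computed ASR transcripts with jiwer; the transcripts.txt "reference|hypothesis" format is an assumption, not the author's actual pipeline.)

import argparse
import jiwer

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="transcripts.txt", help="path to txt file")
    args = parser.parse_args()

    refs, hyps = [], []
    with open(args.txtpath, "r") as f:
        for rawline in f:
            # Hypothetical format: ground-truth transcript | ASR transcript of converted audio
            ref, hyp = rawline.strip().split("|")
            refs.append(ref)
            hyps.append(hyp)

    # jiwer aggregates the edit distance over the whole list of sentence pairs.
    print(f"WER: {jiwer.wer(refs, hyps) * 100:.2f}%")
    print(f"CER: {jiwer.cer(refs, hyps) * 100:.2f}%")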
F0-PCC:
from tqdm import tqdm
import numpy as np
import pyworld as pw
import argparse
import librosa


def get_f0(x, fs=16000, n_shift=160):
    # Extract F0 with DIO, then refine it with StoneMask.
    x = x.astype(np.float64)
    frame_period = n_shift / fs * 1000
    f0, timeaxis = pw.dio(x, fs, frame_period=frame_period)
    f0 = pw.stonemask(x, f0, timeaxis, fs)
    return f0


def compute_f0(wav, sr=16000, frame_period=10.0):
    # Fallback extractor: Harvest is slower but more robust when DIO fails.
    wav = wav.astype(np.float64)
    f0, timeaxis = pw.harvest(
        wav, sr, frame_period=frame_period, f0_floor=20.0, f0_ceil=600.0)
    return f0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    pccs = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = librosa.load(src, sr=16000)[0]
            src_f0 = get_f0(src)
            tgt = librosa.load(tgt, sr=16000)[0]
            tgt_f0 = get_f0(tgt)
            if sum(src_f0) == 0:
                # DIO found no voiced frames in the source; retry both with Harvest.
                src_f0 = compute_f0(src)
                tgt_f0 = compute_f0(tgt)
                print(rawline)
            # Truncate both contours to the common length before correlating.
            pcc = np.corrcoef(src_f0[:tgt_f0.shape[-1]], tgt_f0[:src_f0.shape[-1]])[0, 1]
            if not np.isnan(pcc.item()):
                pccs.append(pcc.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for pcc in pccs:
            f.write(f"{pcc}\n")
        pcc = sum(pccs) / len(pccs)
        f.write(f"mean: {pcc}")
        print("mean: ", pcc)
O-Nat.: here
O-Sim.:
from resemblyzer import VoiceEncoder, preprocess_wav
from tqdm import tqdm
import numpy as np
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    encoder = VoiceEncoder()
    ssims = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            # preprocess_wav accepts a file path and returns a trimmed 16kHz waveform.
            src = preprocess_wav(src)
            src = encoder.embed_utterance(src)
            tgt = preprocess_wav(tgt)
            tgt = encoder.embed_utterance(tgt)
            # Embeddings are L2-normalized, so the inner product is cosine similarity.
            ssim = np.inner(src, tgt)
            ssims.append(ssim.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for ssim in ssims:
            f.write(f"{ssim}\n")
        ssim = sum(ssims) / len(ssims)
        f.write(f"mean: {ssim}")
        print("mean: ", ssim)
from freevc.
- About 10 days.
- We have not attempted to train in this low-resource setting. I'll give it a try in a few weeks.
- Audio with a higher sampling rate sounds better than audio with a lower sampling rate. The WavLM module operates at 16kHz, so the model structure needs to be redesigned for it to synthesize audio at a different sampling rate. For example, this paper uses a length resampling decoder to tackle this problem (see the sketch at the end of this comment). Also, there are many works on speech super-resolution, and it is possible to jointly train a 16kHz VC model and a 16kHz-to-xxkHz speech super-resolution model.
- At the very beginning of our experiments we used a HiFi-GAN trained by ourselves, and trained the VC model to 800k steps. Later we switched to the official HiFi-GAN, as it is available to everyone. But after training the new VC model to 800k steps, we found that the objective results (WER, CER, SSIM, etc.) were slightly worse than the old model's. I was unhappy with this, so I continued training to 900k steps so that its performance could match our old model's. So, I think a better vocoder can make a difference, but it won't be huge.
- Below are the testing results:

Model 1: FreeVC trained up to 540k steps with data from only 6 VCTK speakers (2079 utterances, 69.753 minutes in total)
Model 2: FreeVC trained up to 540k steps with the same dataset split as in the paper

Results of 1200 VCTK-to-seen conversions:

| | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---|---|---|---|---|---|
| Model 1 | 7.17 | 2.85 | 76.69 | 4.30 | 78.70 |
| Model 2 | 7.71 | 2.97 | 81.79 | 4.47 | 80.10 |

Results of 1200 LibriTTS-to-seen conversions:

| | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---|---|---|---|---|---|
| Model 1 | 3.53 | 1.20 | 66.69 | 4.48 | 81.67 |
| Model 2 | 3.22 | 1.05 | 71.64 | 4.59 | 82.59 |
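(Regarding the length resampling decoder mentioned above: a minimal, hypothetical sketch of the general idea, not the cited paper's actual implementation. Hidden features produced at the 16kHz frame rate are stretched 3x along the time axis so a vocoder head could emit 48kHz audio.)

import torch
import torch.nn.functional as F

# (batch, channels, frames): content features at the 16kHz frame rate
hidden = torch.randn(1, 192, 100)
# Stretch the time axis 3x so the decoder can generate 3x as many samples (48kHz)
stretched = F.interpolate(hidden, scale_factor=3.0, mode="linear", align_corners=False)
print(stretched.shape)  # torch.Size([1, 192, 300])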
from freevc.
No problem. Anyway, the results of model2-540k are amazing; there is a big difference in the quality and naturalness of the s2s & u2s conversions between model2-540k and model1-540k. The model2-540k samples are outstanding; I would argue they are better than the original demo, or almost the same! Since model2 is trained with the same dataset split as in the paper, I think training up to 900k steps would improve things a lot, given that 540k already produces such good results.
from freevc.
@OlaWod OMG, thank you so much, Mr. Jingyi, for updating me with your results under the low-resource setting. Could you upload some .wav results so I can hear the quality and naturalness of the low-resource outputs? I also upsampled some of the 16kHz results on your page to 48kHz (I used the 3x model that can upsample from 16kHz to 48kHz; for comparison, a naive resampling baseline is sketched below):
1-https://drive.google.com/file/d/1LVoVoknVy-Y0iz6psqIlTFqf33w8vFGx/view?usp=share_link
2-https://drive.google.com/file/d/1D3vYuBnOGLyCbhp5l_V7dYW7md50LjY4/view?usp=share_link
3-https://drive.google.com/file/d/1ItMHQajGxhGiUOXkBMId73QXCoZLkJkf/view?usp=share_link
5-Finally, do you think the vocoder would make a huge difference? https://arxiv.org/abs/2206.13404 claims it is artifact-free, although some people who trained it said it is not that impressive. I haven't tried it myself yet, so I can't judge, but I think they might be doing something wrong!
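(A naive, non-learned baseline for comparison with the 3x model; the filenames are hypothetical. Plain sinc resampling only raises the sample rate and cannot restore content above the original 8kHz Nyquist limit, which is exactly what a learned super-resolution model adds.)

import librosa
import soundfile as sf

# Load a 16kHz mono file (hypothetical filename)
y, sr = librosa.load("input_16k.wav", sr=16000)
# Sinc interpolation to 48kHz; no new high-frequency content is created
y48 = librosa.resample(y, orig_sr=16000, target_sr=48000)
sf.write("output_48k.wav", y48, 48000)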
from freevc.
I've uploaded some results here.
4. Sorry, I don't understand what you are trying to say.
5. I think the difference won't be huge.
from freevc.
Related Issues (20)
- Asking for help understanding the code.
- The audio suffix of the VCTK dataset is not '_mic2.flac'? HOT 2
- Question for hps.data.n_mel_channels
- Inference or train with WavLM-Base or WavLM-Base+? HOT 1
- Condition decoder on desired output length to have control over speech rate in inference?
- Training with aishell3 on top of your existing model: roughly how long would it take? Has the author tried this?
- Unseen Male to Male results in Female output HOT 1
- Inconsistent degree of timbre conversion
- Epoch duration
- About the type of the algorithm HOT 1
- Trained for 500 epochs with the freevc.json config; no matter what timbre wav_tgt has, the output timbre at test time is always the same?
- Changing batch size to 16 or 32
- poor performance on seen-to-unseen task while finetuning on Hindi language HOT 2
- 2023.01.10 update: code below can deteriorate model performance HOT 3
- Vocoder version
- Fine tuning with custom (multilingual) data HOT 1
- How to start inference example? HOT 1
- About a training issue
- target pitch issue after training (not appearing if using the pretrained checkpoint) HOT 1
- Config file for the FreeVC-24 checkpoint HOT 1