Comments (7)
@OlaWod could you share the code you used for the results? I wanted to reproduce it with a different audio set, thanks.
WER, CER: here
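(The linked script is the authoritative one. As a rough stand-in, below is a minimal sketch that scores pre-computed ASR transcripts with jiwer; the transcripts.txt "reference|hypothesis" format is an assumption, not the author's actual pipeline.)

import argparse
import jiwer

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="transcripts.txt", help="path to txt file")
    args = parser.parse_args()

    refs, hyps = [], []
    with open(args.txtpath, "r") as f:
        for rawline in f:
            # Hypothetical format: ground-truth transcript | ASR transcript of converted audio
            ref, hyp = rawline.strip().split("|")
            refs.append(ref)
            hyps.append(hyp)

    # jiwer aggregates the edit distance over the whole list of sentence pairs.
    print(f"WER: {jiwer.wer(refs, hyps) * 100:.2f}%")
    print(f"CER: {jiwer.cer(refs, hyps) * 100:.2f}%")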
F0-PCC:
from tqdm import tqdm
import numpy as np
import pyworld as pw
import argparse
import librosa


def get_f0(x, fs=16000, n_shift=160):
    # Extract F0 with DIO, then refine it with StoneMask.
    x = x.astype(np.float64)
    frame_period = n_shift / fs * 1000
    f0, timeaxis = pw.dio(x, fs, frame_period=frame_period)
    f0 = pw.stonemask(x, f0, timeaxis, fs)
    return f0


def compute_f0(wav, sr=16000, frame_period=10.0):
    # Fallback extractor: Harvest is slower but more robust when DIO fails.
    wav = wav.astype(np.float64)
    f0, timeaxis = pw.harvest(
        wav, sr, frame_period=frame_period, f0_floor=20.0, f0_ceil=600.0)
    return f0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    pccs = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = librosa.load(src, sr=16000)[0]
            src_f0 = get_f0(src)
            tgt = librosa.load(tgt, sr=16000)[0]
            tgt_f0 = get_f0(tgt)
            if sum(src_f0) == 0:
                # DIO found no voiced frames in the source; retry both with Harvest.
                src_f0 = compute_f0(src)
                tgt_f0 = compute_f0(tgt)
                print(rawline)
            # Truncate both contours to the common length before correlating.
            pcc = np.corrcoef(src_f0[:tgt_f0.shape[-1]], tgt_f0[:src_f0.shape[-1]])[0, 1]
            if not np.isnan(pcc.item()):
                pccs.append(pcc.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for pcc in pccs:
            f.write(f"{pcc}\n")
        pcc = sum(pccs) / len(pccs)
        f.write(f"mean: {pcc}")
        print("mean: ", pcc)
O-Nat.: here
O-Sim.:
from resemblyzer import VoiceEncoder, preprocess_wav
from tqdm import tqdm
import numpy as np
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()

    encoder = VoiceEncoder()
    ssims = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            # preprocess_wav accepts a file path and returns a trimmed 16kHz waveform.
            src = preprocess_wav(src)
            src = encoder.embed_utterance(src)
            tgt = preprocess_wav(tgt)
            tgt = encoder.embed_utterance(tgt)
            # Embeddings are L2-normalized, so the inner product is cosine similarity.
            ssim = np.inner(src, tgt)
            ssims.append(ssim.item())

    with open(f"result/{args.title}.txt", "w") as f:
        for ssim in ssims:
            f.write(f"{ssim}\n")
        ssim = sum(ssims) / len(ssims)
        f.write(f"mean: {ssim}")
        print("mean: ", ssim)
from freevc.
- About 10 days.
- We have not attempted to train in this low-resource setting. I'll give it a try in a few weeks.
- Audio with a higher sampling rate sounds better than audio with a lower sampling rate. The WavLM module operates at 16kHz, so the model structure needs to be redesigned for it to synthesize audio at a different sampling rate. For example, this paper uses a length resampling decoder to tackle this problem (see the sketch at the end of this comment). Also, there are many works on speech super-resolution, and it is possible to jointly train a 16kHz VC model and a 16kHz-to-xxkHz speech super-resolution model.
- At the very beginning of our experiments we used a HiFi-GAN trained by ourselves, and trained the VC model to 800k steps. Later we switched to the official HiFi-GAN, as it is available to everyone. But after training the new VC model to 800k steps, we found that the objective results (WER, CER, SSIM, etc.) were slightly worse than the old model's. I was unhappy with this, so I continued training to 900k steps so that its performance could match our old model's. So, I think a better vocoder can make a difference, but it won't be huge.
- Below are the testing results:

Model 1: FreeVC trained up to 540k steps with data from only 6 VCTK speakers (2079 utterances, 69.753 minutes in total)
Model 2: FreeVC trained up to 540k steps with the same dataset split as in the paper

Results of 1200 VCTK-to-seen conversions:

| | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---|---|---|---|---|---|
| Model 1 | 7.17 | 2.85 | 76.69 | 4.30 | 78.70 |
| Model 2 | 7.71 | 2.97 | 81.79 | 4.47 | 80.10 |

Results of 1200 LibriTTS-to-seen conversions:

| | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---|---|---|---|---|---|
| Model 1 | 3.53 | 1.20 | 66.69 | 4.48 | 81.67 |
| Model 2 | 3.22 | 1.05 | 71.64 | 4.59 | 82.59 |
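(Regarding the length resampling decoder mentioned above: a minimal, hypothetical sketch of the general idea, not the cited paper's actual implementation. Hidden features produced at the 16kHz frame rate are stretched 3x along the time axis so a vocoder head could emit 48kHz audio.)

import torch
import torch.nn.functional as F

# (batch, channels, frames): content features at the 16kHz frame rate
hidden = torch.randn(1, 192, 100)
# Stretch the time axis 3x so the decoder can generate 3x as many samples (48kHz)
stretched = F.interpolate(hidden, scale_factor=3.0, mode="linear", align_corners=False)
print(stretched.shape)  # torch.Size([1, 192, 300])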
from freevc.
No problem. Anyway, the results of model2-540k are amazing; there is a big difference in the quality and naturalness of the s2s & u2s conversions between model2-540k and model1-540k. The model2-540k samples are outstanding; I would argue they are better than the original demo, or almost the same! Since model2 is trained with the same dataset split as in the paper, I think training up to 900k steps would improve things a lot, given that 540k already produces such good results.
from freevc.
@OlaWod OMG, thank you so much, Mr. Jingyi, for updating me with your results under the low-resource setting. Could you upload some .wav results so I can hear the quality and naturalness of the low-resource outputs? I also upsampled some of the 16kHz results on your page to 48kHz (I used the 3x model that can upsample from 16kHz to 48kHz; for comparison, a naive resampling baseline is sketched below):
1-https://drive.google.com/file/d/1LVoVoknVy-Y0iz6psqIlTFqf33w8vFGx/view?usp=share_link
2-https://drive.google.com/file/d/1D3vYuBnOGLyCbhp5l_V7dYW7md50LjY4/view?usp=share_link
3-https://drive.google.com/file/d/1ItMHQajGxhGiUOXkBMId73QXCoZLkJkf/view?usp=share_link
5-Finally, do you think the vocoder would make a huge difference? https://arxiv.org/abs/2206.13404 claims it is artifact-free, although some people who trained it said it is not that impressive. I haven't tried it myself yet, so I can't judge, but I think they might be doing something wrong!
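(A naive, non-learned baseline for comparison with the 3x model; the filenames are hypothetical. Plain sinc resampling only raises the sample rate and cannot restore content above the original 8kHz Nyquist limit, which is exactly what a learned super-resolution model adds.)

import librosa
import soundfile as sf

# Load a 16kHz mono file (hypothetical filename)
y, sr = librosa.load("input_16k.wav", sr=16000)
# Sinc interpolation to 48kHz; no new high-frequency content is created
y48 = librosa.resample(y, orig_sr=16000, target_sr=48000)
sf.write("output_48k.wav", y48, 48000)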
from freevc.
I've uploaded some results here.
4. Sorry, I don't understand what you are trying to say.
5. I think the difference won't be huge.
from freevc.
Related Issues (20)
- Asking for help understanding the code.
- The audio suffix of the VCTK dataset is not '_mic2.flac'? HOT 2
- Question for hps.data.n_mel_channels
- Inference or train with WavLM-Base or WavLM-Base+? HOT 1
- Condition decoder on desired output length to have control over speech rate in inference?
- Training with aishell3 on top of your existing model: roughly how long would it take? Has the author tried this?
- Unseen Male to Male results in Female output HOT 1
- Inconsistent degree of timbre conversion
- Epoch duration
- About the type of the algorithm HOT 1
- Trained for 500 epochs with the freevc.json config; no matter what timbre wav_tgt has, the output timbre at test time is always the same?
- Changing batch size to 16 or 32
- poor performance on seen-to-unseen task while finetuning on Hindi language HOT 2
- 2023.01.10 update: code below can deteriorate model performance HOT 3
- Vocoder version
- Fine tuning with custom (multilingual) data HOT 1
- How to start inference example? HOT 1
- About a training issue
- target pitch issue after training (not appearing if using the pretrained checkpoint) HOT 1
- Config file for the FreeVC-24 checkpoint HOT 1