Comments (7)

OlaWod commented on July 3, 2024

> @OlaWod could you share the code you used for the results, I wanted to reproduce it with a different audio set, thanks

WER, CER: here
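
The linked WER/CER script is not reproduced here, so below is a minimal sketch of the usual recipe: transcribe each converted wav with an ASR model and score the transcript against the reference text. Whisper and jiwer are assumptions for illustration (not necessarily what the linked code uses), as is the `wav_path|reference text` line format.

import argparse

import jiwer
import whisper

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    args = parser.parse_args()

    # Any Whisper checkpoint works; "base" keeps the sketch light.
    model = whisper.load_model("base")

    refs, hyps = [], []
    with open(args.txtpath, "r") as f:
        for rawline in f:
            wav_path, ref_text = rawline.strip().split("|")
            hyp_text = model.transcribe(wav_path)["text"]  # ASR transcript
            refs.append(ref_text.lower())
            hyps.append(hyp_text.lower())

    # jiwer accepts lists of reference/hypothesis strings.
    print("WER:", jiwer.wer(refs, hyps))
    print("CER:", jiwer.cer(refs, hyps))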

F0-PCC:

import argparse
import os

import librosa
import numpy as np
import pyworld as pw
from tqdm import tqdm


def get_f0(x, fs=16000, n_shift=160):
    # Estimate F0 with DIO, then refine it with StoneMask.
    x = x.astype(np.float64)
    frame_period = n_shift / fs * 1000  # hop size in ms (10 ms here)
    f0, timeaxis = pw.dio(x, fs, frame_period=frame_period)
    f0 = pw.stonemask(x, f0, timeaxis, fs)
    return f0
    
    
def compute_f0(wav, sr=16000, frame_period=10.0):
    # Fallback F0 estimator: Harvest is slower but more robust
    # than DIO on utterances where DIO finds no voiced frames.
    wav = wav.astype(np.float64)
    f0, timeaxis = pw.harvest(
        wav, sr, frame_period=frame_period, f0_floor=20.0, f0_ceil=600.0)
    return f0
    

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()
    
    pccs = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = librosa.load(src, sr=16000)[0]
            src_f0 = get_f0(src)
            tgt = librosa.load(tgt, sr=16000)[0]
            tgt_f0 = get_f0(tgt)
            if sum(src_f0) == 0:
                # DIO found no voiced frames; retry both wavs with Harvest.
                src_f0 = compute_f0(src)
                tgt_f0 = compute_f0(tgt)
                print(rawline)
            # Truncate both contours to the shorter length, then take
            # the Pearson correlation coefficient.
            pcc = np.corrcoef(src_f0[:tgt_f0.shape[-1]], tgt_f0[:src_f0.shape[-1]])[0, 1]
            if not np.isnan(pcc.item()):
                pccs.append(pcc.item())
            
    with open(f"result/{args.title}.txt", "w") as f:
        for pcc in pccs:
            f.write(f"{pcc}\n")
        pcc = sum(pccs) / len(pccs)
        f.write(f"mean: {pcc}")
    print("mean: ", pcc)

O-Nat.: here

O-Sim.:

import argparse
import os

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from tqdm import tqdm

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--txtpath", type=str, default="samples.txt", help="path to txt file")
    parser.add_argument("--title", type=str, default="1", help="output title")
    args = parser.parse_args()
    
    encoder = VoiceEncoder()    
    
    ssims = []
    with open(args.txtpath, "r") as f:
        for rawline in tqdm(f.readlines()):
            src, tgt = rawline.strip().split("|")
            src = preprocess_wav(src)           # load and normalize the wav
            src = encoder.embed_utterance(src)  # 256-dim speaker embedding
            tgt = preprocess_wav(tgt)
            tgt = encoder.embed_utterance(tgt)
            # Embeddings are L2-normalized, so the inner product is
            # the cosine similarity between the two speakers.
            ssim = np.inner(src, tgt)
            ssims.append(ssim.item())
            
    with open(f"result/{args.title}.txt", "w") as f:
        for ssim in ssims:
            f.write(f"{ssim}\n")
        ssim = sum(ssims) / len(ssims)
        f.write(f"mean: {ssim}")
    print("mean: ", ssim)

OlaWod commented on July 3, 2024
  1. About 10 days.
  2. We have not attempted to train in this low-resource setting. I'll give it a try in a few weeks.
  3. Audio with a higher sampling rate sounds better than audio with a lower sampling rate. The WavLM module operates at 16kHz, so the model structure needs to be redesigned to synthesize audio at a different sampling rate. For example, this paper uses a length resampling decoder to tackle this problem. Also, there are many works on speech super-resolution, and it is possible to jointly train a 16kHz VC model and a 16kHz-to-xxkHz speech super-resolution model.
  4. At the very beginning of our experiment we used a HiFi-GAN trained by ourselves, and trained the VC model to 800k steps. Later we switched to the official HiFi-GAN, as it is available to everyone. But after training the new VC model to 800k steps we found that the objective results (WER, CER, SSIM, etc.) were slightly worse than with the old one. I was unhappy with this, so I continued training to 900k steps so that its performance could match our old model. So I think a better vocoder can make a difference, but it won't be huge.
Below are the testing results:

Model 1: FreeVC trained up to 540k steps on data from only 6 VCTK speakers (2079 utterances, 69.753 minutes in total)
Model 2: FreeVC trained up to 540k steps with the same dataset split as in the paper

results of 1200 VCTK-to-seen conversions:

|         | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---------|----------|----------|------------|------------|------------|
| Model 1 | 7.17     | 2.85     | 76.69      | 4.30       | 78.70      |
| Model 2 | 7.71     | 2.97     | 81.79      | 4.47       | 80.10      |

results of 1200 LibriTTS-to-seen conversions:

|         | WER% (↓) | CER% (↓) | F0-PCC (↑) | O-Nat. (↑) | O-Sim. (↑) |
|---------|----------|----------|------------|------------|------------|
| Model 1 | 3.53     | 1.20     | 66.69      | 4.48       | 81.67      |
| Model 2 | 3.22     | 1.05     | 71.64      | 4.59      | 82.59      |


francqz31 commented on July 3, 2024

No problem, anyways the results of model2-540k are amazing; there is a big difference in the quality and naturalness of the s2s & u2s between model2-540k and model1-540k. The model2-540k samples are outstanding; I would argue they are better than the original demo, or almost the same! Since model2 is trained with the same dataset split as in the paper, I think training up to 900k steps would be a lot, since 540k alone already gives such good results.


francqz31 commented on July 3, 2024

@OlaWod OMG Thank you so much Mr. Jingyi for updating me with your results under the low-resource setting. Can you upload some .wav results so I can hear the quality and naturalness of the low-resource output? I also attempted to upsample some of the results on your page from 16kHz to 48kHz (I used the 3x model that can upsample from 16kHz to 48kHz):
1-https://drive.google.com/file/d/1LVoVoknVy-Y0iz6psqIlTFqf33w8vFGx/view?usp=share_link
2-https://drive.google.com/file/d/1D3vYuBnOGLyCbhp5l_V7dYW7md50LjY4/view?usp=share_link
3-https://drive.google.com/file/d/1ItMHQajGxhGiUOXkBMId73QXCoZLkJkf/view?usp=share_link
5-Finally, do you think that the vocoder would make a huge difference? https://arxiv.org/abs/2206.13404 claims it is artifact-free, although some people trained it and said it is not that impressive. I still haven't tried it myself to judge, but I think they might be doing something wrong!


OlaWod commented on July 3, 2024

I've uploaded some results here.
4. Sorry I don't understand what you are trying to say.
5. I think the difference won't be huge.


Ashraf-Ali-aa commented on July 3, 2024

@OlaWod could you share the code you used for the results, I wanted to reproduce it with a different audio set, thanks

