ssr_eval's Introduction

Speech Super-resolution Evaluation and Benchmarking

What this repo does:

  • Provides a toolbox for the evaluation of speech super-resolution algorithms.
  • Unifies the evaluation pipeline of speech super-resolution algorithms for an easier comparison between different systems.
  • Benchmarks speech super-resolution methods (pull requests are welcome) and encourages reproducible research.

I built this repo while writing my paper for INTERSPEECH 2022: Neural Vocoder is All You Need for Speech Super-resolution. The model mentioned in this paper, NVSR, will also be open-sourced here.

Some notes

  1. Suggestions for comparing your model with NVSR:
  • At a sampling rate <= 44.1 kHz: you can resample the NVSR result to that sampling rate.
  • At a sampling rate > 44.1 kHz (usually 48 kHz): the first option is to resample your result to 44.1 kHz. Another option is to train a 48 kHz NVSR, which I'm currently working on. I'll release the 48 kHz NVSR in the next month.
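The resampling step suggested above could look like the following minimal sketch. It uses `scipy.signal.resample_poly`; the helper name `resample_audio` and the synthetic tone standing in for an NVSR output are assumptions made for illustration, not part of ssr_eval.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_audio(x: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono waveform with a polyphase filter (hypothetical helper)."""
    if orig_sr == target_sr:
        return x
    g = gcd(orig_sr, target_sr)
    # Upsample by target_sr/g, then downsample by orig_sr/g.
    return resample_poly(x, target_sr // g, orig_sr // g)


# One second of a 440 Hz tone at 44.1 kHz stands in for an NVSR output.
nvsr_output = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
resampled = resample_audio(nvsr_output, orig_sr=44100, target_sr=16000)
print(resampled.shape)  # (16000,)
```

Any other resampler (sox, librosa, torchaudio) should work as well; what matters is that both systems are compared at the same sampling rate.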

Installation

Install via pip:

pip3 install ssr_eval

Please make sure you have already installed sox.
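A quick way to check the sox requirement from Python is sketched below. It only looks for a `sox` executable on the PATH; the function name `sox_available` is an assumption for illustration, and how ssr_eval itself invokes sox is not shown in this README.

```python
import shutil


def sox_available() -> bool:
    """Return True if a 'sox' executable can be found on the PATH."""
    return shutil.which("sox") is not None


if not sox_available():
    print("sox was not found on PATH; install it before running ssr_eval.")
```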

Quick Example

A basic example: evaluate a system that does nothing:

from ssr_eval import test 
test()
  • The evaluation result JSON file will be stored in the ./results directory: Example file
  • The code automatically handles things like downloading the test sets.
  • You will find an "averaged" field at the bottom of the JSON file that looks like the one below. This field marks the performance of the system.
"averaged": {
        "proc_fft_24000_44100": {
            "lsd": 5.152331300436993,
            "log_sispec": 5.8051057146229095,
            "sispec": 30.23394207533686,
            "ssim": 0.8484425044157442
        }
    }
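To pull the headline numbers out of a result file, something like the sketch below may help. It assumes only the structure of the "averaged" field shown above; the helper name `summarize_averaged` is hypothetical, and the example data is the snippet from this README rather than a real file.

```python
import json


def summarize_averaged(result: dict) -> dict:
    """Map each evaluation setting in the 'averaged' field to its LSD value."""
    return {name: metrics["lsd"] for name, metrics in result["averaged"].items()}


# The "averaged" field from the example above, parsed from a JSON string.
example = json.loads(
    """
    {
        "averaged": {
            "proc_fft_24000_44100": {
                "lsd": 5.152331300436993,
                "log_sispec": 5.8051057146229095,
                "sispec": 30.23394207533686,
                "ssim": 0.8484425044157442
            }
        }
    }
    """
)

for setting, lsd in summarize_averaged(example).items():
    print(f"{setting}: LSD = {lsd:.2f}")
```

For a real run, load the file from the ./results directory with `json.load` and pass the resulting dict to the same function.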

Here we report four metrics:

  1. Log spectral distance (LSD).
  2. Log scale invariant spectral distance [1] (log-sispec).
  3. Scale invariant spectral distance [1] (sispec).
  4. Structural similarity (SSIM).

⚠️ LSD is the most widely used metric for super-resolution. The other three metrics are included in case you need them.
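For readers unfamiliar with LSD, here is a minimal sketch of the common definition: the root-mean-square difference of log power spectra over frequency, averaged over frames. The STFT parameters (`n_fft`, `hop`), the `eps` floor, and the function name are assumptions; ssr_eval's own implementation may use different settings, so values from this sketch are not guaranteed to match the numbers it reports.

```python
import numpy as np
from scipy.signal import stft


def log_spectral_distance(reference, estimate, sr=44100, n_fft=2048, hop=512, eps=1e-8):
    """LSD between two equal-length mono waveforms (illustrative settings)."""
    _, _, ref_spec = stft(reference, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, est_spec = stft(estimate, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    # Log power spectra, shape (frequency, time).
    ref_log = np.log10(np.abs(ref_spec) ** 2 + eps)
    est_log = np.log10(np.abs(est_spec) ** 2 + eps)
    # RMS over frequency for each frame, then the mean over frames.
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=0))
    return float(np.mean(per_frame))


rng = np.random.default_rng(0)
target = rng.standard_normal(44100)
print(log_spectral_distance(target, target))        # identical signals give 0.0
print(log_spectral_distance(target, 0.5 * target))  # a scaled copy gives a positive value
```

Lower is better: a system that reconstructs the target spectrum perfectly scores 0.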



Below is the code of test():

from ssr_eval import SSR_Eval_Helper, BasicTestee

# You need to implement a class for the model to be evaluated.
class MyTestee(BasicTestee):
    def __init__(self) -> None:
        super().__init__()

    # You need to implement this function
    def infer(self, x):
        """A testee that does nothing.

        Args:
            x (np.array): [sample,], with model_input_sr sample rate

        Returns:
            np.array: [sample,], with model_output_sr sample rate
        """
        return x

def test():
    testee = MyTestee()
    # Initialize an evaluation helper
    helper = SSR_Eval_Helper(
        testee,
        test_name="unprocessed",  # Test name for storing the result
        input_sr=44100,  # The sampling rate of the input x in the 'infer' function
        output_sr=44100,  # The sampling rate of the output x in the 'infer' function
        evaluation_sr=48000,  # The sampling rate to calculate evaluation metrics.
        setting_fft={
            "cutoff_freq": [
                12000
            ],  # The cutoff frequency of the input x in the 'infer' function
        },
        save_processed_result=True
    )
    # Perform evaluation
    ## Use all eight speakers in the test set for evaluation (limit_test_speaker=-1) 
    ## Evaluate on 10 utterances for each speaker (limit_test_nums=10)
    helper.evaluate(limit_test_nums=10, limit_test_speaker=-1)

The code automatically handles things like downloading the test sets. The evaluation result will be saved in the ./results directory.

Baselines

We provide several pretrained baselines. For example, to run the NVSR baseline, you can click the link in the following table for more details.


Table 1. Log-spectral distance (LSD) at different input sampling rates (evaluated at 44.1 kHz).

| Method | One for all | Params | 2kHz | 4kHz | 8kHz | 12kHz | 16kHz | 24kHz | 32kHz | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| NVSR [Pretrained Model] | Yes | 99.0M | 1.04 | 0.98 | 0.91 | 0.85 | 0.79 | 0.70 | 0.60 | 0.84 |
| WSRGlow(24kHz→48kHz) | No | 229.9M | - | - | - | - | - | 0.79 | - | - |
| WSRGlow(12kHz→48kHz) | No | 229.9M | - | - | - | 0.87 | - | - | - | - |
| WSRGlow(8kHz→48kHz) | No | 229.9M | - | - | 0.98 | - | - | - | - | - |
| WSRGlow(4kHz→48kHz) | No | 229.9M | - | 1.12 | - | - | - | - | - | - |
| Nu-wave(24kHz→48kHz) | No | 3.0M | - | - | - | - | - | 1.22 | - | - |
| Nu-wave(12kHz→48kHz) | No | 3.0M | - | - | - | 1.40 | - | - | - | - |
| Nu-wave(8kHz→48kHz) | No | 3.0M | - | - | 1.42 | - | - | - | - | - |
| Nu-wave(4kHz→48kHz) | No | 3.0M | - | 1.42 | - | - | - | - | - | - |
| Unprocessed | - | - | 5.69 | 5.50 | 5.15 | 4.85 | 4.54 | 3.84 | 2.95 | 4.65 |

Click the link of the model for more details.

Here "one for all" means that a single model can handle inputs at different sampling rates.
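As a sanity check on the AVG column, the short sketch below recomputes it for the two rows that report all seven sampling rates. It assumes AVG is the plain arithmetic mean of the seven per-rate LSD values rounded to two decimals, which is consistent with the table but is not stated explicitly in this README.

```python
# Per-rate LSD values copied from Table 1 (2, 4, 8, 12, 16, 24, 32 kHz).
nvsr = [1.04, 0.98, 0.91, 0.85, 0.79, 0.70, 0.60]
unprocessed = [5.69, 5.50, 5.15, 4.85, 4.54, 3.84, 2.95]


def average(values):
    """Arithmetic mean rounded to two decimals, as in the AVG column."""
    return round(sum(values) / len(values), 2)


print(average(nvsr))         # 0.84
print(average(unprocessed))  # 4.65
```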

Features

The following code demonstrates all the options of SSR_Eval_Helper:

testee = MyTestee()
helper = SSR_Eval_Helper(testee, # Your testee object with the 'infer' function implemented
                        test_name="unprocess",  # The name of this test. Used for saving the log file in the ./results directory
                        test_data_root="./your_path/vctk_test", # The directory to store the test data, which will be automatically downloaded.
                        input_sr=44100, # The sampling rate of the input x in the 'infer' function
                        output_sr=44100, # The sampling rate of the output x in the 'infer' function
                        evaluation_sr=48000, # The sampling rate to calculate evaluation metrics. 
                        save_processed_result=False, # If True, save model output in the dataset directory.
                        # (Recommended/Default) Use the Fourier method to simulate low-resolution effect
                        setting_fft = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], # The cutoff frequency of the input x in the 'infer' function
                        }, 
                        # Use lowpass filtering to simulate low-resolution effect. All possible combinations will be evaluated. 
                        setting_lowpass_filtering = {
                            "filter":["cheby","butter","bessel","ellip"], # The type of filter 
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], 
                            "filter_order": [3,6,9] # Filter orders
                        }, 
                        # Use subsampling method to simulate low-resolution effect
                        setting_subsampling = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
                        }, 
                        # Use mp3 compression method to simulate low-resolution effect
                        setting_mp3_compression = {
                            "low_kbps": [32, 48, 64, 96, 128],
                        },
)

helper.evaluate(limit_test_nums=10, # For each speaker, only evaluate on 10 utterances.
                limit_test_speaker=-1 # Evaluate on all the speakers. 
                )
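To get a feel for how many evaluations the lowpass-filtering setting implies when all combinations are evaluated, the sketch below enumerates them with `itertools.product`. The variable names are illustrative, and the order in which ssr_eval actually iterates over the combinations is an assumption.

```python
from itertools import product

# Values copied from setting_lowpass_filtering in the example above.
filters = ["cheby", "butter", "bessel", "ellip"]
cutoff_freqs = [1000, 2000, 4000, 6000, 8000, 12000, 16000]
filter_orders = [3, 6, 9]

combinations = list(product(filters, cutoff_freqs, filter_orders))
print(len(combinations))  # 4 filters x 7 cutoffs x 3 orders = 84
print(combinations[0])    # ('cheby', 1000, 3)
```

With 84 lowpass settings on top of the other simulation methods, a full run can take a while; trimming the lists is a simple way to shorten it.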

⚠️ I recommend that all users use the Fourier method (setting_fft) to simulate the low-resolution effect, for the convenience of comparing different systems.
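As an illustration of what a Fourier-based low-resolution simulation can look like, the sketch below zeroes every frequency bin above the cutoff and transforms back. This is a generic brick-wall lowpass written for this note; the function name `fft_lowpass` is hypothetical, and the exact procedure behind `setting_fft` in ssr_eval may differ.

```python
import numpy as np


def fft_lowpass(x: np.ndarray, sr: int, cutoff_freq: float) -> np.ndarray:
    """Remove all content above cutoff_freq by zeroing FFT bins (illustrative)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_freq] = 0
    return np.fft.irfft(spectrum, n=len(x))


sr = 44100
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 1000 * t)    # below the cutoff, should survive
high_tone = np.sin(2 * np.pi * 10000 * t)  # above the cutoff, should vanish
filtered = fft_lowpass(low_tone + high_tone, sr, cutoff_freq=4000)
print(np.allclose(filtered, low_tone, atol=1e-8))  # True
```

Because the cutoff is applied exactly in the frequency domain, the result does not depend on a filter type or order, which is one reason a Fourier-style simulation is convenient for comparing systems.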

Dataset Details

We built the test sets using VCTK (version 0.92), a multi-speaker English corpus that contains 110 speakers with different accents.

  • Speakers used for the test set: p360, p361, p362, p363, p364, p374, p376, s5
  • Of the remaining 100 speakers, p280 and p315 are omitted due to technical issues.
  • The other 98 speakers are used for training.
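If you need to reproduce this split when preparing your own training data, a helper along these lines may be enough. The speaker IDs come from the list above; the function name `assign_split` and the "omitted" label are assumptions for illustration.

```python
TEST_SPEAKERS = {"p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"}
OMITTED_SPEAKERS = {"p280", "p315"}


def assign_split(speaker_id: str) -> str:
    """Return 'test', 'omitted', or 'train' for a VCTK speaker ID."""
    if speaker_id in TEST_SPEAKERS:
        return "test"
    if speaker_id in OMITTED_SPEAKERS:
        return "omitted"
    return "train"


for speaker in ["p225", "p280", "p360", "s5"]:
    print(speaker, assign_split(speaker))
```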

Citation

If you find this repo useful for your research, please consider citing:

@misc{liu2022neural,
      title={Neural Vocoder is All You Need for Speech Super-resolution}, 
      author={Haohe Liu and Woosung Choi and Xubo Liu and Qiuqiang Kong and Qiao Tian and DeLiang Wang},
      year={2022},
      eprint={2203.14941},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Reference

[1] Liu, Haohe, et al. "VoiceFixer: Toward General Speech Restoration with Neural Vocoder." arXiv preprint arXiv:2109.13731 (2021).

ssr_eval's Issues

Release of Training Code

Hello,

Are you planning on releasing the training code? This repo seems great, but without the training code of NVSR, NuWave2 is more applicable to novel datasets. Hoping to retrain and use NVSR!

Is it possible to run this model on multiple GPUs?

(Talking of NVSR) I ran this model on a cloud GPU A100 80 GB and managed to get it up to 7 minutes to upsample. I'm now curious how far it can go. If I have 8x A100 GPUs would it be possible to be able to upsample a 56 min file? Or is this model not designed to run inference on multiple GPUs?

(I know my method so far has been splitting long audio files then upsampling but I'd like to avoid splitting).

metrics inconsistent with paper

Hi, when I evaluated the example model NVSR like this:

import soundfile as sf
device="cuda"
print("device", device)
for test_name in ["NVSRPostProcTestee"]:
    testee = eval(test_name)(device=device)
    helper = SSR_Eval_Helper(
        testee,
        test_name=test_name,
        input_sr=16000,
        output_sr=16000,
        evaluation_sr=16000,
        setting_fft={
            "cutoff_freq": [4000],
        },
        save_processed_result=False,
    )
    helper.evaluate(limit_test_nums=-1, limit_test_speaker=-1)

I got the final "lsd": 1.0318345543848155, which is not consistent with the result you provide in your paper, could you help me review my test code above?
FYI: I am using the following conda environment:
librosa 0.10.1
lightning-utilities 0.8.0
llvmlite 0.39.1
matplotlib 3.5.1
matplotlib-inline 0.1.6
numba 0.56.4
numpy 1.21.6
opencv-python 4.1.2.30
opencv-python-headless 4.5.4.60
packaging 21.3
pandas 1.1.5
Pillow 9.2.0
pip 23.0.1
protobuf 3.20.1
psutil 5.8.0
pyparsing 3.0.9
pytest-runner 5.3.0
pytorch-lightning 1.9.5
PyWavelets 1.3.0
PyYAML 6.0
requests 2.27.1
scikit-image 0.19.3
scikit-learn 0.22.1
scipy 1.5.2
semantic-version 2.8.5
setuptools 65.3.0
six 1.16.0
sklearn 0.0
soundfile 0.12.1
soxr 0.3.6
ssr-eval 0.0.7
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.0
terminaltables 3.1.10
text-unidecode 1.3
thop 0.1.1.post2207130030
tifffile 2021.11.2
torch 1.10.2
torchaudio 0.10.2
torchlibrosa 0.0.7
torchlightning 0.0.0
torchmetrics 0.11.1
torchvision 0.11.3
tornado 6.2
tqdm 4.64.1
voicefixer 0.0.17
Wave 0.0.2
wheel 0.37.1
yarl 1.8.1
zipp 3.8.1

16kHz waveform generation

When I used your code to generate waveforms with a target sampling rate of 16kHz, I found some mismatches between the high-frequency and low-frequency parts of the spectrogram. Should I first downsample the GT waveforms to 44.1kHz and then do the inference?

Running pre-trained NVSR

Hello, I am trying to run the pre-trained NVSR. After successfully installing the requirements, running "python main.py" results in an EOFError. Here is the traceback:

Traceback (most recent call last):
  File "main.py", line 172, in <module>
    testee = eval(test_name)(device=device)
  File "main.py", line 114, in __init__
    super(NVSRPostProcTestee, self).__init__(device)
  File "main.py", line 56, in __init__
    self.model = Model(channels=1)
  File "\ssr_eval-main\examples\NVSR\nvsr_unet.py", line 84, in __init__
    self.vocoder = Vocoder(sample_rate=44100).to(device)
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\base.py", line 14, in __init__
    self._load_pretrain(Config.ckpt)
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\base.py", line 19, in _load_pretrain
    checkpoint = load_checkpoint(pth, torch.device("cpu"))
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\model\util.py", line 92, in load_checkpoint
    checkpoint = torch.load(checkpoint_path, map_location=device)
  File "E:\Anaconda\lib\site-packages\torch\serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "E:\Anaconda\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
