istftnet-pytorch's Introduction

iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform

This repo attempts to implement iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform, specifically the C8C8I model. Disclaimer: this repo was built for testing purposes. The code is not optimized for performance.

Training :

python train.py --config config_v1.json

Note:

  • We obtain good audio quality with about 30% less training than the original HiFi-GAN.
  • This model is approximately 60% faster than its HiFi-GAN counterpart.

Citations :

@inproceedings{kaneko2022istftnet,
title={{iSTFTNet}: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform},
author={Takuhiro Kaneko and Kou Tanaka and Hirokazu Kameoka and Shogo Seki},
booktitle={ICASSP},
year={2022},
}

istftnet-pytorch's People

Contributors

aqtq314, pranjalya, rishikksh20


istftnet-pytorch's Issues

Predicted phase not in range [-pi .. pi], but in range [-1 .. 1]

The phase output of the generator can currently only range from -1 to 1, which is not enough, because a full phase in radians is expected later in stft.inverse() (either 0..2*pi or -pi..pi).

The paper mentions somewhat cryptically that "we apply a sine activation function to represent the periodic characteristics of the phase spectrogram", but in any case the current implementation is faulty, since it cannot represent the full range of possible phases.

phase = torch.sin(x[:, self.post_n_fft // 2 + 1:, :])

As a suggestion, either try scaling the output by 2*pi, or directly predict sin(phase) and cos(phase) in the generator (the predicted values can be normalized by dividing both by sqrt(sin(phase)**2 + cos(phase)**2)).
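The two suggestions can be sketched in plain Python on scalars. This is only an illustration of the value ranges involved, not the repo's generator code: the function names are hypothetical, and the factor pi (rather than 2*pi) is an assumption chosen so that the scaled output spans exactly [-pi, pi].

```python
import math

def phase_from_sin(x):
    """Behaviour described in the issue: sin() bounds the phase to [-1, 1]."""
    return math.sin(x)

def phase_scaled(x):
    """Option 1 (assumed factor pi): scaled sine spans [-pi, pi]."""
    return math.pi * math.sin(x)

def phase_from_pair(sin_like, cos_like):
    """Option 2: read two raw outputs as (sin, cos) components.

    atan2 depends only on the direction of the pair, so the explicit
    normalization to unit length is not needed to recover the angle.
    """
    return math.atan2(sin_like, cos_like)

# Sample raw network outputs on a grid (hypothetical values).
raw = [i * 0.1 - 10.0 for i in range(201)]
max_plain = max(abs(phase_from_sin(x)) for x in raw)
max_scaled = max(abs(phase_scaled(x)) for x in raw)
```

With the first variant the largest reachable magnitude stays at or below 1 radian, while the other two can reach any angle in [-pi, pi].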

A multi-gpu training bug

stft.py, lines 164-165:
window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum
inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]
These lines raise an error in multi-GPU training, because inverse_transform may be on cuda:1 while window_sum.cuda() places window_sum on cuda:0.
Changing line 164 to window_sum = window_sum.to(inverse_transform.device) if magnitude.is_cuda else window_sum fixes the problem (note that device is an attribute, not a method).

Directly model complex numbers

Has anyone tried to directly model the complex numbers instead of the phase and magnitude? What would be the problem if we model the real and imaginary parts directly?

How about the audio quality?

Hi, thanks for the implementation; the inference speed is impressive. How about the audio quality? And have you tried the v2 config? Thanks in advance.

Pretrained models

Hello, thank you very much for this repo

Can you please provide pre-trained models for tests?

window_sum in stft is just a constant?

I printed window_sum in stft (line 155) and found that its value is constant, except at the leading and trailing padding positions, so the window function only acts as a linear scaling. Does this result meet the windowing expectations?
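A constant window_sum is what one would expect when the window and hop satisfy the constant overlap-add property. The sketch below checks this in plain Python for a periodic Hann window with a hop of one quarter of the window length. The window type, the length 16 and the hop 4 are assumptions for illustration and are not read from the repo's config.

```python
import math

def periodic_hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def window_sumsquare(window, hop, n_frames):
    """Overlap-add the squared window at every frame position."""
    total = [0.0] * (len(window) + hop * (n_frames - 1))
    for frame in range(n_frames):
        for i, w in enumerate(window):
            total[frame * hop + i] += w * w
    return total

win_length, hop, n_frames = 16, 4, 10   # assumed values
ws = window_sumsquare(periodic_hann(win_length), hop, n_frames)

# Positions covered by a full set of overlapping frames.
edge = win_length - hop
interior = ws[edge:len(ws) - edge]
```

Under these assumptions every interior value equals 1.5, and only the edges, where fewer frames overlap, are smaller. Dividing by window_sum is then a plain rescaling in the interior, which matches the observation above.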

Single frequency line problem

Thanks for the implementation of iSTFT. It has better inference speed than HiFi-GAN v1. However, I found a single frequency line that causes a little noise. I use a 16 kHz dataset for training, and the line is always exactly at 4 kHz, which is the middle of the frequency range. I'm trying to fix this problem; do you have the same problem?

RuntimeError: istft input and window must be on the same device but got self on cuda:0 and window on cpu

My command to run:

python3 train.py --config config_v1.json --input_wavs_dir /home/yehor/iSTFTNet-pytorch/lada_wavs --input_training_file /home/yehor/iSTFTNet-pytorch/training_list.txt --input_validation_file /home/yehor/iSTFTNet-pytorch/validation_list.txt

Error:

...        (2): Conv1d(128, 128, kernel_size=(11,), stride=(1,), padding=(5,))
      )
    )
  )
  (conv_post): Conv1d(128, 18, kernel_size=(7,), stride=(1,), padding=(3,))
  (reflection_pad): ReflectionPad1d((1, 0))
)
checkpoints directory :  cp_hifigan
Epoch: 1
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train.py", line 280, in <module>
    main()
  File "train.py", line 276, in main
    train(0, a, h)
  File "train.py", line 126, in train
    y_g_hat = stft.inverse(spec, phase)
  File "/home/yehor/iSTFTNet-pytorch/stft.py", line 198, in inverse
    inverse_transform = torch.istft(
RuntimeError: istft input and window must be on the same device but got self on cuda:0 and window on cpu
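The error says that the spectrogram passed to torch.istft is on the GPU while the window tensor is still on the CPU. A common remedy is to move the window to the input's device before the call. The following is a minimal sketch of that idea, not the repo's stft.py: the wrapper name istft_on_input_device and the sizes (n_fft 16, hop 4) are assumptions, and the demo runs on the CPU so the move is a no-op there.

```python
import torch

def istft_on_input_device(spec, n_fft, hop_length, win_length, window):
    """Call torch.istft with the window moved to the device of `spec`."""
    window = window.to(spec.device)  # no-op when both are already co-located
    return torch.istft(
        spec,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
    )

n_fft, hop = 16, 4                       # assumed sizes
window = torch.hann_window(n_fft)        # created on the CPU
signal = torch.randn(1, 256)

# Build a complex spectrogram to feed back through the inverse transform.
spec = torch.stft(
    signal, n_fft, hop_length=hop, win_length=n_fft,
    window=window, return_complex=True,
)
audio = istft_on_input_device(spec, n_fft, hop, n_fft, window)
```

Inside a module, registering the window with register_buffer is another option, since buffers follow the module when it is moved with .to(device).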

Different sample rate

Hi @rishikksh20 , thanks for your work.

I have a question. If I want to use a 16 kHz sampling rate, how do I modify the configuration file?
Presumably it is not enough to change only sampling_rate in the JSON.
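One constraint that usually has to hold when such a config is changed is that the generator produces exactly one mel hop of audio samples per input frame. For an iSTFT-based generator this means the product of the upsampling factors times the iSTFT hop must equal the mel hop size. The check below is a sketch under that assumption. The parameter names and the example values (two upsampling layers of 8, iSTFT hop 4, mel hop 256 or 160) are hypothetical and should be compared against the actual config file.

```python
from math import prod  # available from Python 3.8

def frames_match(upsample_rates, istft_hop, mel_hop):
    """True if one mel frame maps to exactly `mel_hop` output samples."""
    return prod(upsample_rates) * istft_hop == mel_hop

# Assumed baseline: 8 * 8 * 4 = 256 samples per mel frame.
baseline_ok = frames_match([8, 8], 4, 256)

# Hypothetical 16 kHz setup with a mel hop of 160 samples.
unchanged_ok = frames_match([8, 8], 4, 160)   # 256 != 160
adjusted_ok = frames_match([8, 5], 4, 160)    # 8 * 5 * 4 = 160
```

Other sample-rate-dependent settings, such as the mel filterbank's upper frequency limit, would need a similar review.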
