
wavegrad's Introduction

WaveGrad


We're hiring! If you like what we're building here, come join us at LMNT.

WaveGrad is a fast, high-quality neural vocoder designed by the folks at Google Brain. The architecture is described in WaveGrad: Estimating Gradients for Waveform Generation. In short, this model takes a log-scaled Mel spectrogram and converts it to a waveform via iterative refinement.

Status (2020-10-15)

  • stable training (22 kHz, 24 kHz)
  • high-quality synthesis
  • mixed-precision training
  • multi-GPU training
  • custom noise schedule (faster inference)
  • command-line inference
  • programmatic inference API
  • PyPI package
  • audio samples
  • pretrained models
  • precomputed noise schedule

Audio samples

24 kHz audio samples

Pretrained models

24 kHz pretrained model (183 MB, SHA256: 65e9366da318d58d60d2c78416559351ad16971de906e53b415836c068e335f3)
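
After downloading, you can check the file against the published checksum before using it. A minimal sketch (the local filename wavegrad-24kHz.pt is an assumption; use whatever name you saved the file under):

import hashlib

# Verify the downloaded checkpoint against the SHA256 listed above.
# 'wavegrad-24kHz.pt' is a hypothetical local filename.
with open('wavegrad-24kHz.pt', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == '65e9366da318d58d60d2c78416559351ad16971de906e53b415836c068e335f3'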

Install

Install using pip:

pip install wavegrad

or from GitHub:

git clone https://github.com/lmnt-com/wavegrad.git
cd wavegrad
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22 kHz. If you need to change this value, edit params.py.
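
If you want to confirm a file meets these requirements before preprocessing, a quick sanity check using only the Python standard library might look like this (the path is a hypothetical example; adjust the expected rate if you edited params.py):

import wave

# Check that a training file is 16-bit mono at the expected sample rate.
# The path is a hypothetical example; 22050 Hz is the default assumed here.
with wave.open('/path/to/dir/containing/wavs/example.wav', 'rb') as f:
    assert f.getnchannels() == 1, 'expected mono audio'
    assert f.getsampwidth() == 2, 'expected 16-bit samples'
    assert f.getframerate() == 22050, f'expected 22050 Hz, got {f.getframerate()}'

Once your data checks out, preprocess and train: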

python -m wavegrad.preprocess /path/to/dir/containing/wavs
python -m wavegrad /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible speech by ~20k steps (~1.5h on a 2080 Ti).

Inference API

Basic usage:

from wavegrad.inference import predict as wavegrad_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = wavegrad_predict(spectrogram, model_dir)

# audio is a GPU tensor in [N,T] format.
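
The snippet above leaves spectrogram computation to you. As a hedged sketch of one way to produce an input in [N,C,W] format with torchaudio (the Mel parameters below, 128 bands and a hop of 300 samples, are assumptions based on this repo's defaults; for exact parity with training, mirror whatever wavegrad.preprocess computes):

import torch
import torchaudio

# Load a mono wav and compute a log-scaled Mel spectrogram.
# n_mels=128 and hop_length=300 are assumptions (see hop_samples in
# params.py); mirror wavegrad.preprocess for exact parity with training.
waveform, sr = torchaudio.load('/path/to/audio.wav')
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=300, n_mels=128)(waveform)
spectrogram = torch.log(mel.clamp(min=1e-5))  # [N,C,W]: batch, bands, frames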

If you have a custom noise schedule (see below):

import numpy as np
from wavegrad.inference import predict as wavegrad_predict

params = { 'noise_schedule': np.load('/path/to/noise_schedule.npy') }
model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = wavegrad_predict(spectrogram, model_dir, params=params)

# `audio` is a GPU tensor in [N,T] format.

Inference CLI

python -m wavegrad.inference /path/to/model /path/to/spectrogram -o output.wav

Noise schedule

The default implementation uses 1000 iterations to refine the waveform, which runs slower than real time. WaveGrad can achieve high-quality, faster-than-real-time synthesis with as few as 6 iterations, without re-training the model with new hyperparameters.

To achieve this speed-up, you will need to search for a noise schedule that works well for your dataset. This implementation provides a script to perform the search for you:

python -m wavegrad.noise_schedule /path/to/trained/model /path/to/preprocessed/validation/dataset
python -m wavegrad.inference /path/to/trained/model /path/to/spectrogram -n noise_schedule.npy -o output.wav

The default settings should give good results without spending too much time on the search. If you'd like to find a better noise schedule or use a different number of inference iterations, run the noise_schedule script with --help to see additional configuration options.
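
The search writes the schedule to a .npy file; its length is the number of refinement iterations used at inference. A minimal sketch for inspecting a schedule before using it (assuming the default output filename noise_schedule.npy from the commands above):

import numpy as np

# The schedule is a 1-D array of per-iteration noise levels (betas);
# its length equals the number of refinement iterations at inference.
schedule = np.load('noise_schedule.npy')
print(f'{len(schedule)} iterations, betas in [{schedule.min()}, {schedule.max()}]')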

References

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan. "WaveGrad: Estimating Gradients for Waveform Generation." arXiv:2009.00713, 2020.

wavegrad's People

Contributors

0xflotus, shaper, sharvil


wavegrad's Issues

How do I use it?

I ran this code:

import torch
import torchaudio
import wavegrad.inference
import librosa
import numpy as np
import IPython

waveform, sample_rate = torchaudio.load('CantinaBand3.wav')
sgram = librosa.stft(waveform.flatten().numpy())
mel_sgram = librosa.feature.melspectrogram(S=np.abs(sgram) ** 2, sr=sample_rate, n_fft=2048, hop_length=512, n_mels=128)
mel_sgram = librosa.amplitude_to_db(mel_sgram, ref=np.min)

audio, sample_rate = wavegrad.inference.predict(torch.from_numpy(mel_sgram), "wavegrad-24kHz.pt")
IPython.display.display(IPython.display.Audio(audio.flatten().detach().cpu().numpy(), rate=sample_rate))

The resulting audio is high-frequency crackles and pops, not the cantina band.
Also, is this model trained specifically for speech? Is there anything for music or general audio? I can't train my own because my computer isn't powerful enough.

Dataset and train test split for pretrained model

Hi!
What dataset did you use to train the 24 kHz model? Also, what train/test split did you use for training?

I am getting poor sample quality when running inference on LJSpeech with this model (I resampled the dataset beforehand), so I assume the model was not trained on it?

help

It's saving a model checkpoint every epoch, which fills up storage fast. Is there a way to make it save a checkpoint only every N steps?

Details about this pretrained model

Can the author give details about the pretrained model: the dataset and the number of training steps? (Unfortunately, my training results are not as good as the provided samples.) Thank you!

Multi-GPU Training does not seem to work

Hi, thanks a lot for your awesome implementation!

However, I could not manage to train a model on multiple GPUs. I tried adding CUDA_VISIBLE_DEVICES=1,2,3 before the training command, but nvidia-smi showed that only the first GPU was being used for training.

Do you have any ideas on this issue? Is that the right way to use multi-GPU in this repo?

Thanks again!

NotImplementedError when running inference on Google Colab

NotImplementedError Traceback (most recent call last)
in ()
3 model_dir = '/content/drive/MyDrive/MURF files/wavegrad-24kHz.pt'
4 spectrogram = torch.FloatTensor(mels)
----> 5 audio, sample_rate = wavegrad_predict(spectrogram, model_dir)
6
7 # audio is a GPU tensor in [N,T] format.

1 frames
/usr/local/lib/python3.7/dist-packages/wavegrad/inference.py in predict(spectrogram, model_dir, params, device)
40
41 model = models[model_dir]
---> 42 model.params.override(params)
43 with torch.no_grad():
44 beta = np.array(model.params.noise_schedule)

/usr/local/lib/python3.7/dist-packages/wavegrad/params.py in override(self, attrs)
29 self.override(attr)
30 else:
---> 31 raise NotImplementedError
32 return self
33

NotImplementedError:

Samples quality generated from Pretrained model support

Hi,

I tried to generate samples from your pretrained model, but the quality is not good. Specifically, the pitch is much higher than in the original sample, and there is a lot of noise in the output, like below:
[spectrogram screenshots illustrating the pitch shift and added noise]

Here is what I did:

  1. Clone the repo.
  2. Change the hparam sample rate from 22050 to 24000.
  3. Download the reference audio you provided: https://lmnt.com/assets/wavegrad/24kHz
  4. Preprocess the wav files to extract mels with preprocess.py.
  5. Run inference on the extracted mels and generate audio with inference.py.

I didn't modify any code during this process. Could you please take a look? Am I missing anything here?

Thanks

Hop size parameter

Hi, thanks for sharing your work with the community.

I saw your warning regarding the hop_samples parameter (hop_samples=300, # Don't change this. Really.)

I was hoping you could give us a hint about what we would need to change in the code to work with different hop lengths.

Thank you!

KeyError: 'audio'

I'm getting KeyError: 'audio' in the dataloader after 36 iterations.
Not sure how this can happen; posting in case you can see any potential causes while I look myself.

https://github.com/CookiePPP/wavegrad/tree/d0469094947e9cd2c6be1f37e9e948ff58be5560
Exact code used (I don't think any of my modifications would cause the crash).


edit:
Crashed again after 214 iters.
Adding some verbosity and running again...


CUDA_VISIBLE_DEVICES=3 python3 -m wavegrad "outdir/init" "/media/cookie/Samsung 860 QVO/ClipperDatasetV2"
Epoch 0:   0%|                                        | 0/11466 [00:00<?, ?it/s]
[five repeated numpy FutureWarning messages from tensorflow/python/framework/dtypes.py omitted]
Epoch 0:   0%|                             | 36/11466 [01:16<3:37:43,  1.14s/it]Traceback (most recent call last):
  File "/home/cookie/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cookie/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/__main__.py", line 37, in <module>
    main(parser.parse_args())
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/__main__.py", line 24, in main
    train(dataset_from_path(args.data_dirs, params), args, params)
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/learner.py", line 151, in train
    learner.train(max_steps=args.max_steps)
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/learner.py", line 94, in train
    for features in tqdm(self.dataset, desc=f'Epoch {self.step // len(self.dataset)}'):
  File "/home/cookie/.local/lib/python3.7/site-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 971, in _next_data
    return self._process_data(data)
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 4.
Original Traceback (most recent call last):
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/cookie/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/dataset.py", line 68, in collate
    audio = np.stack([record['audio'] for record in minibatch if record['audio'] is not None])
  File "/media/cookie/Samsung PM961/TwiBot/wavegrad/src/wavegrad/dataset.py", line 68, in <listcomp>
    audio = np.stack([record['audio'] for record in minibatch if record['audio'] is not None])
KeyError: 'audio'

Epoch 0:   0%|                             | 36/11466 [01:16<6:43:32,  2.12s/it]

Where is the number of epochs set?

Where is the number of epochs set? What criterion determines when training ends? I don't see an epoch setting in the training parameters, so I don't know how many epochs will run before training finishes.

About Config of pretrained model

I'm opening this issue because there is too much noise in the synthesized audio. I found that the pretrained model was applied to 24 kHz audio, but the default training config is for a 22050 Hz sample rate. Are there differences in the config? E.g., is the noise_schedule np.linspace(1e-4, 0.005, 1000) or Linear(1e-4, 0.005, 1000)?

Poor quality of samples

Hi!
So, I am trying to run inference on VCTK with your pretrained model and getting some inconsistent results here.
I am doing the following steps:
resample the audio from 48 kHz to 24 kHz with sox

sox <infile> -r 24000 <outfile>

change the sample rate to 24000 here

Then I run preprocessing and inference as suggested in your readme

python -m wavegrad.preprocess ...
python -m wavegrad.inference ...

and get intelligible but heavily distorted samples, with pitch-shift-like distortion (example).

Maybe you have some suggestions about what could be going wrong here?
