
speech-resynthesis's People

Contributors

adampolyak, adiyoss, tuanh208


speech-resynthesis's Issues

When I train with train_f0_vq.py, I run into an error

Epoch: 1
Traceback (most recent call last):
  File "train_f0_vq.py", line 217, in <module>
    main()
  File "train_f0_vq.py", line 213, in main
    train(a.local_rank, a, h)
  File "train_f0_vq.py", line 101, in train
    for i, batch in enumerate(train_loader):
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
soundfile.LibsndfileError: <exception str() failed>
Killing subprocess 2908
Traceback (most recent call last):
  File "/root/miniconda3/envs/test/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/test/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/test/bin/python', '-u', 'train_f0_vq.py', '--local_rank=0', '--checkpoint_path', 'checkpoints/lj_f0_vq', '--config', 'configs/LJSpeech/f0_vqvae.json']' returned non-zero exit status 1.
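Since <exception str() failed> hides which file libsndfile choked on, one way to narrow it down is to try opening every file in the training filelist directly with soundfile. A minimal sketch; the filelist path and its format here are assumptions, so adapt them to whatever your config actually points at:

    # Hypothetical debugging script: soundfile.LibsndfileError with
    # "<exception str() failed>" does not name the file, so try opening
    # every training file directly and report the unreadable ones.
    import soundfile as sf

    # Assumption: a plain-text filelist, one entry per line; adjust the
    # path and the parsing to match your config's training filelist.
    with open("LJSpeech-1.1/training.txt") as f:
        paths = [line.strip().split('|')[0] for line in f if line.strip()]

    bad = []
    for p in paths:
        try:
            sf.read(p)
        except Exception as e:  # catch broadly; the exact type varies by soundfile version
            bad.append((p, repr(e)))

    for p, err in bad:
        print(p, err)
    print(f"{len(bad)} unreadable files out of {len(paths)}")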

Zero-shot voice conversion

Can this framework perform zero-shot / few-shot voice conversion? If so, could you give some instructions on how to do it?

speaker information

Will the code used in the paper for extracting speaker information be released?

Bigger speech2unit HuBERT versions

Hi,

I was just wondering if you have tried hubert-large or hubert-xtralarge as alternatives to hubert-base for the speech2unit stage.
First, I tried to train a k-means model for hubert-base and retrain the vocoder part, to see whether I could replicate the results obtained with the pretrained k-means, but the results I get are worse.
I would very much appreciate it if you either released the pretrained k-means for hubert-large or hubert-xtralarge (if you have them), or gave me some guidelines for replicating your results.
Specifically, I want to know the number of k-means iterations, the number of centroids, the HuBERT layer, and the batch_size used. Currently, I'm training the k-means with 150 iterations, 100 centroids, the 6th-layer outputs, and a batch_size of 10000 (a sketch of this setup is given below), but I don't know whether these parameters are correct.

Thank you in advance
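For concreteness, here is a minimal sketch of the k-means setup described above, using HuggingFace transformers for the HuBERT features and scikit-learn's MiniBatchKMeans. The model name, layer index, and hyperparameters simply mirror the numbers in the question; they are not confirmed values from the authors, whose own pipeline uses fairseq:

    # Hypothetical sketch: cluster 6th-layer hubert-base features with
    # k-means using the hyperparameters from the question above
    # (100 centroids, 150 iterations, batch_size 10000).
    import numpy as np
    import torch
    import soundfile as sf
    from sklearn.cluster import MiniBatchKMeans
    from transformers import HubertModel

    model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

    def layer6_features(path):
        wav, sr = sf.read(path)  # assumed 16 kHz mono audio
        x = torch.tensor(wav, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            out = model(x, output_hidden_states=True)
        # hidden_states[0] is the CNN/embedding output; index 6 is the
        # output of the 6th transformer layer.
        return out.hidden_states[6].squeeze(0).numpy()

    # "a.wav"/"b.wav" are placeholders for your feature-extraction corpus.
    feats = np.concatenate([layer6_features(p) for p in ["a.wav", "b.wav"]])
    km = MiniBatchKMeans(n_clusters=100, max_iter=150, batch_size=10000,
                         n_init=1, verbose=1).fit(feats)
    units = km.predict(layer6_features("a.wav"))  # discrete speech units

In practice the corpus for fitting k-means should be large enough that 10000-frame mini-batches are representative; the layer choice matters as much as the centroid count, so it is worth sweeping layers if results stay worse than the pretrained clusters.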

Any pretrained models available?

Is there any chance you could release your pretrained models for evaluation purposes? I'd like to make a few comparisons before training. Thank you!

Coding new dataset for training

Hello!

I read the corresponding section in the README and understand that I need to download the LibriLight dataset to train the VQ-VAE. I downloaded the small.tar file, but upon unzipping I don't see files with paths like /checkpoint/pem/morgane/LibriBig/3717/9/3717_3120_9_0021.wav, only ones like LibriLight/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb.flac. Am I downloading the correct LibriLight? If not, what can I do?

Thank you!
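Those /checkpoint/... entries look like internal cluster paths baked into the bundled filelists, so they won't match any public download. One workaround is to regenerate the filelist from the layout you actually have. A minimal sketch, assuming the repo accepts a plain one-path-per-line filelist; verify this against the repo's actual filelist format before using it:

    # Hypothetical helper: rebuild a training filelist from the local
    # LibriLight layout by globbing the audio files actually present.
    from pathlib import Path

    root = Path("LibriLight")  # assumed local dataset root
    paths = sorted(str(p) for p in root.rglob("*.flac"))

    # Assumption: one audio path per line; check the config's
    # input_training_file for the real format before relying on this.
    Path("librilight_train.txt").write_text("\n".join(paths) + "\n")
    print(f"wrote {len(paths)} paths")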

I had a problem during the first step of training. How can I solve it?

Initializing Training Process..
Initializing Training Process..
Initializing Training Process..
Initializing Training Process..
Batch size per GPU : 2
Batch size per GPU : 2
Batch size per GPU : 2
Batch size per GPU : 2
Initializing Training Process..
Batch size per GPU : 2
Initializing Training Process..
Batch size per GPU : 2
Initializing Training Process..
Initializing Training Process..
Batch size per GPU : 2
Batch size per GPU : 2
Traceback (most recent call last):
  File "train_f0_vq.py", line 217, in <module>
    main()
  File "train_f0_vq.py", line 213, in main
    train(a.local_rank, a, h)
  File "train_f0_vq.py", line 37, in train
    generator = Quantizer(h).to(device)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
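"invalid device ordinal" usually means the launcher started more processes than there are visible GPUs, so some process's local_rank refers to a device that doesn't exist. A quick check, assuming the torch.distributed.launch invocation from the README:

    # Quick sanity check: the process count per node must not exceed the
    # number of GPUs PyTorch can see.
    import torch

    print("visible GPUs:", torch.cuda.device_count())
    # If this prints e.g. 4 but the job was launched with --nproc_per_node 8,
    # ranks 4..7 will fail with "invalid device ordinal". Either lower
    # --nproc_per_node or set CUDA_VISIBLE_DEVICES accordingly.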

Running yaapt on-the-fly slows training dramatically

Hi, thanks for kindly releasing the code for the paper. (Also, congratulations on the acceptance at INTERSPEECH!)

While I was running the code, I encountered a significant issue: pYAAPT.yaapt slows training down dramatically.
Here's how I found this speed bottleneck:

  • I tried to run train_f0_vq.py as specified in the README.
  • However, training was too slow; it looks like the f0 VQ model needs to train for 400000 steps, but a single epoch (about 700 steps) took 2657 seconds. GPU utilization was really low while the CPUs were running like crazy. (My server has a 3080 Ti and 64 CPU cores.)
  • I suspected pYAAPT.yaapt to be the cause. To test that, I forked the repository and added caching: https://github.com/seungwonpark/speech-resynthesis
  • After that, each epoch after the first (which does the initial caching) took only 36 seconds.

So my question is: how did you manage to run yaapt on-the-fly without caching? Though I succeeded in training the model fast enough, I will need to disable caching again, since it requires the _sample_interval method to sample the same interval for every audio file (i.e., it disables the data augmentation of randomly choosing the interval). A sketch of a possible workaround follows below.
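One way to keep the random-interval augmentation while still caching is to compute and cache the f0 track of the whole file once, then slice the cached track to match whatever interval _sample_interval picks. A minimal sketch using amfm_decompy; the frame spacing, sample rate, and cache layout are assumptions, not the repo's exact settings:

    # Hypothetical full-track f0 cache: run yaapt once per file, then
    # slice the cached track for any randomly sampled audio interval.
    import os
    import numpy as np
    import amfm_decompy.basic_tools as basic
    import amfm_decompy.pYAAPT as pYAAPT

    FRAME_SPACE_MS = 10.0  # assumed yaapt hop; must match the extraction settings

    def cached_f0(wav_path, cache_dir="f0_cache"):
        os.makedirs(cache_dir, exist_ok=True)
        cache = os.path.join(cache_dir, os.path.basename(wav_path) + ".npy")
        if os.path.exists(cache):
            return np.load(cache)
        signal = basic.SignalObj(wav_path)
        pitch = pYAAPT.yaapt(signal, frame_space=FRAME_SPACE_MS)
        np.save(cache, pitch.samp_values)
        return pitch.samp_values

    def f0_for_interval(wav_path, start_sample, num_samples, sr=16000):
        # Convert the sampled audio interval to f0 frame indices.
        f0 = cached_f0(wav_path)
        hop = int(sr * FRAME_SPACE_MS / 1000)
        return f0[start_sample // hop : (start_sample + num_samples) // hop]

Because the cache holds the full-length track, _sample_interval can keep choosing random intervals and the cache stays valid; only the slicing changes per batch.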

License question

Since this code has also been added to fairseq, is the license Creative Commons, or is it fairseq's MIT? Thanks!
