moonintheriver / neuralsvb
Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code
License: GNU General Public License v3.0
Hi! I really liked your DiffSinger project! NeuralSVB also looks very promising. You write that you provide a pre-trained model of NSVB, and I would like to try it on my data. How can I get the model?
There may also be an issue with loading the pre-trained model (line 61 of ckpt_utils.py).
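For anyone debugging a load like this, a minimal inspection sketch may help narrow it down (the path below is a placeholder and the dict layout is an assumption about Lightning-style checkpoints, not the repo's verified format):

```python
import torch

# Hypothetical path for illustration; point this at the actual checkpoint.
ckpt_path = "checkpoints/some_exp/model_ckpt_steps_100000.ckpt"

# map_location="cpu" lets the inspection run without a GPU.
ckpt = torch.load(ckpt_path, map_location="cpu")

# Lightning-style checkpoints are dicts; list the top-level keys and,
# if a state_dict is present, each parameter name and shape.
print(list(ckpt.keys()))
state = ckpt.get("state_dict", ckpt)
for name, value in state.items():
    if hasattr(value, "shape"):
        print(name, tuple(value.shape))
```

Comparing these parameter names against the model class you instantiate usually reveals whether the failure is a missing/unexpected key problem or a wrong checkpoint path.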
Hi, is it possible to make the trained SVB model available? Or could you describe what the input data should look like, so that I could train my own model?
After listening to the demo, I have some questions:
1. For real-world use in beautifying singing, inference requires the original singer's pitch curve, right?
2. Although the test samples are not in the training set, GT Professional and GT Amateur were recorded by the same person. At inference time, GT Professional cannot be the user themselves, so has generalization to other singers been tested?
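Regarding question 1: the hparams dump later in this thread sets pitch_extractor: parselmouth, so the pitch curve at inference is presumably a frame-level F0 track. A minimal sketch of extracting one (assuming the praat-parselmouth package; the hop size and sample rate mirror the hparams below, and the floor/ceiling values are my own placeholders):

```python
import numpy as np
import parselmouth

def extract_f0(wav_path, hop_size=128, sample_rate=22050):
    """Extract a frame-level F0 curve with Praat's pitch tracker."""
    snd = parselmouth.Sound(wav_path)
    time_step = hop_size / sample_rate  # one F0 value per mel frame
    pitch = snd.to_pitch(time_step=time_step, pitch_floor=80, pitch_ceiling=800)
    f0 = pitch.selected_array["frequency"]  # 0.0 marks unvoiced frames
    return f0

f0 = extract_f0("amateur_take.wav")
print(f0.shape, np.count_nonzero(f0), "voiced frames")
```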
NSVB is trained on PopBuTFy with 34 speakers. Even with the 30-hour internal singing dataset used in Stage 1 training, as described in the paper, I doubt that this amount of data lets it generalize to unseen singers; I believe it can only generalize to singers similar to those in the training set. I trained Stage 1 with the 50-hour OpenSinger dataset plus 3 other singers, and the resulting model still only generalizes to similar singers, not to a very different one. Has anyone achieved better generalization here?
Does this model work in English?
This work is outstanding and we are interested in it. Are there any plans to make the dataset and associated pretrained models public in the near future? Thank you.
This work is great. May I ask whether the model checkpoint or the dataset will be released? I want to test the results.
Hi, I'd like to run your model myself, but I cannot find a proper way to load the dataset from the .mp3 files you provided. Is there a chance you could share the dataloader you used, or give some hints on how to process the .mp3 files into a valid dataset that works with your usage examples? I'd be very grateful!
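One plausible preprocessing route, purely as a hedged sketch (librosa is my assumption here, not the repo's actual pipeline; the fft/hop/mel settings mirror the hparams dump later in this thread):

```python
import librosa
import numpy as np

def mp3_to_mel(path, sr=22050, n_fft=512, hop_length=128, n_mels=80,
               fmin=50, fmax=11025):
    """Decode an .mp3 and compute a log-mel spectrogram (frames x mels)."""
    wav, _ = librosa.load(path, sr=sr)  # librosa resamples on load
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, fmin=fmin, fmax=fmax)
    return np.log10(np.maximum(mel, 1e-5)).T

mel = mp3_to_mel("song.mp3")
print(mel.shape)  # (num_frames, 80)
```

The exact normalization the repo expects (e.g. the mel_vmin/mel_vmax clipping in the hparams) would still need to match its binarizer.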
Hi, how can I get the PopBuTFy dataset to train the model? I have sent the application form to the official mailbox, but there has been no response.
I got an error when running inference with the model:
Exception has occurred: RuntimeError
Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
| Hparams chains: ['egs/egs_bases/config_base.yaml', 'egs/egs_bases/tts/base.yaml', 'egs/egs_bases/tts/fs2.yaml', 'egs/egs_bases/tts/fs2_adv.yaml', 'egs/egs_bases/vc/vc_ppg.yaml', 'egs/egs_bases/tts/base_zh.yaml', 'egs/egs_bases/singing/base.yaml', 'egs/datasets/audio/PopBuTFy/base_text2mel.yaml', 'egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml']
| Hparams:
accumulate_grad_batches: 1, amp: False, asr_content_encoder: True, asr_dec_layers: 2, asr_enc_layers: 2,
asr_enc_type: conformer, asr_last_norm: False, asr_upsample_norm: bn, audio_num_mel_bins: 80, audio_sample_rate: 22050,
base_config: ['egs/egs_bases/vc/vc_ppg.yaml', './base_text2mel.yaml'], binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': False, 'with_align': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_word': True, 'trim_eos_bos': False, 'reset_phone_dict': True, 'reset_word_dict': True}, binarizer_cls: data_gen.tts.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/PopBuTFyENSpkEM_new, check_val_every_n_epoch: 10,
clip_grad_norm: 1, clip_grad_value: 0, concurrent_ways: , conv_use_pos: False, cross_way_no_disc_loss: False,
cross_way_no_recon_loss: False, cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1,
cwt_std_scale: 0.8, datasets: [], debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9,
dec_inp_add_noise: False, dec_kernel_size: 5, dec_layers: 4, dec_num_heads: 2, decoder_rnn_dim: 0,
decoder_type: conv, dict_dir: , disable_map: False, disc_hidden_size: 128, disc_interval: 1,
disc_lr: 0.0001, disc_norm: in, disc_reduction: stack, disc_start_steps: 0, disc_win_num: 3,
discriminator_grad_norm: 1, discriminator_optimizer_params: {'eps': 1e-06, 'weight_decay': 0.0}, discriminator_scheduler_params: {'step_size': 60000, 'gamma': 0.5}, dropout: 0.05, ds_workers: 2,
dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_dec_norm: ln,
enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 9, enc_kernel_size: 5, enc_layers: 4, encoder_K: 8,
encoder_type: rel_fft, endless_ds: True, exp_name: 1012_hifigan_all_songs_nsf, ffn_act: gelu, ffn_hidden_size: 1024,
ffn_padding: SAME, fft_size: 512, fmax: 11025, fmin: 50, frames_multiple: 4,
fvae_dec_n_layers: 4, fvae_enc_dec_hidden: 192, fvae_enc_n_layers: 8, fvae_kernel_size: 5, gen_dir_name: ,
generator_grad_norm: 5.0, griffin_lim_iters: 60, hidden_size: 256, hop_size: 128, infer: True,
lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 0.0, lambda_kl: 0.001, lambda_mel_adv: 0.1,
lambda_mle: 1.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
latent_size: 128, layers_in_block: 2, load_ckpt: , loud_norm: False, lr: 1.0,
map_lr: 0.001, map_scheduler_params: {'gamma': 0.5, 'step_size': 60000}, max_epochs: 100, max_frames: 5000, max_input_tokens: 1550,
max_sentences: 80, max_tokens: 40000, max_updates: 200000, max_valid_sentences: 1, max_valid_tokens: 60000,
mel_disc_hidden_size: 128, mel_disc_type: multi_window, mel_gan: True, mel_hidden_size: 256, mel_loss: ssim:0.5|l1:0.5,
mel_strides: [2, 1, 1], mel_vmax: 1.5, mel_vmin: -6, mfa_version: 2, min_frames: 0,
min_level_db: -100, normalize_pitch: False, num_ckpt_keep: 2, num_heads: 2, num_sanity_val_steps: 10,
num_spk: 100, num_techs: 3, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.5,
optimizer_adam_beta2: 0.999, out_wav_norm: False, phase_1_concurrent_ways: p2p, phase_1_steps: -1, phase_2_concurrent_ways: a2a,p2p,
phase_2_steps: 100000, phase_3_concurrent_ways: a2p, pitch_ar: False, pitch_embed_type: 0, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: standard, pitch_ssim_win: 11, pitch_type: frame,
pre_align_args: {'nsample_per_mfa_group': 1000, 'txt_processor': 'zh', 'use_tone': False, 'sox_resample': True, 'sox_to_wav': False, 'allow_no_txt': False, 'trim_sil': False, 'denoise': False}, pre_align_cls: data_gen.tts.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 2, pretrain_asr_ckpt: checkpoints/1009_pretrain_asr_english, pretrain_fs_ckpt: , print_nan_grads: False,
processed_data_dir: data/processed/popbutfy_0.75, profile_infer: False, raw_data_dir: data/raw/popbutfy_short_male_0.75, ref_attn: False, ref_enc_out: 256,
ref_hidden_stride_kernel: ['0,3,5', '0,3,5', '0,2,5', '0,2,5', '0,2,5'], ref_level_db: 20, ref_norm_layer: bn, rename_tmux: True, rerun_gen: False,
resume_from_checkpoint: 0, save_best: False, save_codes: [], save_f0: True, save_gt: True,
scheduler: rsqrt, seed: 1234, sort_by_len: True, task_cls: tasks.singing.svb_vae_task.SVBVAEMleTask, tb_log_interval: 100,
test_ids: [], test_input_dir: , test_num: 0, test_prefixes: [], test_set_name: test,
train_set_name: train, train_sets: , use_cond_disc: False, use_energy: False, use_energy_embed: False,
use_gt_dur: True, use_gt_f0: True, use_pitch_embed: True, use_pos_embed: True, use_ref_enc: False,
use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_tech: True, use_uv: True,
use_var_enc: False, use_word_input: False, val_check_interval: 2000, valid_infer_interval: 10000, valid_mel_timbre_id: 100,
valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False, var_enc_vq_codes: 64,
vocoder: hifigan, vocoder_ckpt: checkpoints/1012_hifigan_all_songs_nsf, vocoder_denoise_c: 0.0, warmup_updates: 2000, weight_decay: 0,
win_size: 512, word_size: 1000, work_dir: checkpoints/1012_hifigan_all_songs_nsf,
12/14 10:18:24 AM GPU available: True, GPU used: [0]
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
12/14 10:18:24 AM load module from checkpoint: checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt
| load 'model' from 'checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt'.
| Generator Arch: MleSVBVAE(
(pitch_embed): Embedding(300, 256, padding_idx=0)
(pitch_encoder): ConvStacks(
(conv): ModuleList(
(0-2): 3 x ConvBlock(
(conv): ConvNorm(
(conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(norm): GroupNorm(16, 256, eps=1e-05, affine=True)
(dropout): Dropout(p=0, inplace=False)
(relu): ReLU()
)
)
(in_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(vc_asr): VCASR(
(mel_prenet): Prenet(
(layers): ModuleList(
(0): Sequential(
(0): Conv1d(80, 256, kernel_size=(5,), stride=(2,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1-2): 2 x Sequential(
(0): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(content_encoder): ConformerLayers(
(layers): ModuleList()
(pos_embed): RelPositionalEncoding(
(dropout): Dropout(p=0.05, inplace=False)
)
(encoder_layers): ModuleList(
(0-1): 2 x EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
)
(feed_forward): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(feed_forward_macaron): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(31,), stride=(1,), padding=(15,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
)
(norm_ff): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.05, inplace=False)
)
)
(layer_norm): Linear(in_features=256, out_features=256, bias=True)
)
(token_embed): Embedding(88, 256, padding_idx=0)
(asr_decoder): TransformerASRDecoder(
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0-1): 2 x TransformerDecoderLayer(
(op): DecSALayer(
(layer_norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ffn): TransformerFFNLayer(
(ffn_1): Sequential(
(0): ConstantPad1d(padding=(8, 0), value=0.0)
(1): Conv1d(256, 1024, kernel_size=(9,), stride=(1,))
)
(ffn_2): Linear(in_features=1024, out_features=256, bias=True)
)
)
)
)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(project_out_dim): Linear(in_features=256, out_features=88, bias=False)
)
)
(upsample_layer): Sequential(
(0): Sequential(
(0): Upsample(scale_factor=2.0, mode='nearest')
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(2): ReLU()
(3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(spk_embed_proj): Linear(in_features=256, out_features=256, bias=True)
(encoded_embed_proj): Linear(in_features=768, out_features=256, bias=True)
(vae_model): GlobalFVAE(
(g_pre_net): Sequential(
(0): Conv1d(256, 256, kernel_size=(8,), stride=(4,), padding=(2,))
)
(encoder): GlobalFVAEEncoder(
(pre_net): Sequential(
(0): Conv1d(80, 192, kernel_size=(8,), stride=(4,), padding=(2,))
)
(wn): WN(
(in_layers): ModuleList(
(0-7): 8 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-6): 7 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(7): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 3072, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 256, kernel_size=(1,), stride=(1,))
(poolings): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(4): ReLU()
(5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
)
)
(decoder): GlobalFVAEDecoder(
(pre_net): Sequential(
(0): ConvTranspose1d(128, 192, kernel_size=(4,), stride=(4,))
)
(wn): WN(
(in_layers): ModuleList(
(0-3): 4 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-2): 3 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(3): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 1536, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 80, kernel_size=(1,), stride=(1,))
)
)
(z_mapping_function): GlobalLatentMap(
(convs): Sequential(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
(spk_proj): Sequential(
(0): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
(1): ReLU(inplace=True)
(2): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
)
)
| Generator Trainable Parameters: 10.056M
12/14 10:18:25 AM load module from checkpoint: checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt
Traceback (most recent call last):
File "tasks/run.py", line 17, in
run_task()
File "tasks/run.py", line 12, in run_task
task_cls.start()
File "/home/datnt114/Videos/doanpc/NeuralSVB/tasks/base_task.py", line 352, in start
trainer.test(cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 92, in test
self.fit(task_cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 100, in fit
self.run_single_process(self.task)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 120, in run_single_process
self.restore_weights(checkpoint)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 355, in restore_weights
getattr(task_ref, k).load_state_dict(v)
File "/home/datnt114/anaconda3/envs/diffsinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'SVBVAEMleTask' object has no attribute 'model_gen'
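For what it's worth, both errors in this log look like checkpoint/model mismatches: the vocoder checkpoint (1012_hifigan_all_songs_nsf) contains NSF-style source-module weights (m_source.*, noise_convs.*) that the plain HifiGanGenerator class doesn't define, and the trainer then tries to restore a 'model_gen' attribute the task doesn't have. A hedged workaround sketch, assuming the key names shown in the error above (not verified against the repo):

```python
import torch

ckpt = torch.load(
    "checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt",
    map_location="cpu")
state = ckpt.get("state_dict", ckpt)

# Drop the NSF-specific keys that a plain HifiGanGenerator does not define,
# then load non-strictly so any remaining mismatch is reported, not fatal.
nsf_prefixes = ("m_source.", "noise_convs.")
filtered = {k: v for k, v in state.items() if not k.startswith(nsf_prefixes)}

# model = HifiGanGenerator(...)  # instantiate the vocoder as the repo does
# missing, unexpected = model.load_state_dict(filtered, strict=False)
# print(missing, unexpected)
```

Filtering keys only silences the error, though; the cleaner fix is to instantiate the vocoder class that actually matches the checkpoint (an NSF variant, if that is what was trained).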
Hi,
NeuralSVB is interesting. Could I get the model checkpoint? Thanks a lot.
Has anyone gotten good results in English?