The pre-trained model of NSVB

Hi! I really like your DiffSinger project, and NeuralSVB also looks very promising. You write that you provide the pre-trained model of NSVB. I would like to try it on my own data. How can I get the model?

Issue when loading the pre-trained model

There may be an issue when loading the pre-trained model (line 61):

  • size mismatch for token_embed.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256]).
  • size mismatch for asr_decoder.project_out_dim.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256])
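
One way to narrow this down is to list every parameter whose shape differs between the checkpoint and the freshly built model. The sketch below is only an illustration, not code from the repository: `find_shape_mismatches` is a hypothetical helper that works on plain name-to-shape dictionaries. With PyTorch, such dictionaries could be built with `{k: tuple(v.shape) for k, v in state_dict.items()}`. A vocabulary-sized dimension that is off by one (88 vs. 87) may indicate that the phone dictionary was rebuilt with a different symbol set than the one used for the checkpoint, but that is an assumption.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    ckpt_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for parameters present
    in both dictionaries whose shapes differ, sorted by parameter name."""
    mismatches = []
    for name in sorted(ckpt_shapes.keys() & model_shapes.keys()):
        if ckpt_shapes[name] != model_shapes[name]:
            mismatches.append((name, ckpt_shapes[name], model_shapes[name]))
    return mismatches


# Shapes copied from the error message above; the third entry is an
# illustrative parameter that matches on both sides.
ckpt_shapes = {
    "token_embed.weight": (88, 256),
    "asr_decoder.project_out_dim.weight": (88, 256),
    "spk_embed_proj.weight": (256, 256),
}
model_shapes = {
    "token_embed.weight": (87, 256),
    "asr_decoder.project_out_dim.weight": (87, 256),
    "spk_embed_proj.weight": (256, 256),
}

mismatches = find_shape_mismatches(ckpt_shapes, model_shapes)
for name, ckpt_shape, model_shape in mismatches:
    print(f"{name}: checkpoint {ckpt_shape} vs. model {model_shape}")
```

If only vocabulary-sized tensors show up in the list, comparing the phone set used at training time with the one generated locally would be a reasonable next step.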

NSVB checkpoint or data request

Hi, is it possible to make the trained SVB model available? Or could you describe what the input data should look like so that I can train my own model?


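1. If this is used in practice to beautify singing, then inference requires the pitch curve of the original (professional) vocal, right?
2. Although the test samples are not in the training set, GT Professional and GT Amateur were recorded by the same person. In real inference, the GT Professional cannot come from the user themselves. Has generalization been tested in that setting?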

How can NSVB generalize to unseen singers?

NSVB is trained on PopBuTFy with 34 speakers. Even with the 30-hour internal singing dataset that the paper describes for Stage 1 training, I doubt this amount of data would let it generalize to unseen singers; I believe it can only generalize to singers similar to those in the training set. I trained Stage 1 on the 50-hour OpenSinger data plus 3 other singers, and the resulting model generalizes only to singers similar to those in the training set, not to very different ones. Has anyone achieved better generalization here?

hi, request for datasets and source code.

This work is outstanding and we are interested in it. Are there any plans to make the dataset and the associated pre-trained models public in the near future? Thank you.

Problem with proper data loading

Hi, I'd like to run your model myself, but I cannot find a proper way to load the dataset from the .mp3 files you provided. Could you share the dataloader you used, or give some hints on how to process the .mp3 files into a valid dataset that works with your usage examples? I'd be very grateful!
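
As a possible starting point, the .mp3 files could first be converted to mono WAV files before any further preprocessing. The sketch below only builds an `ffmpeg` command line for one file. It assumes `ffmpeg` is installed; the 22050 Hz target rate, the example paths, and the helper `build_ffmpeg_cmd` are illustrative assumptions, not the authors' actual pipeline.

```python
from pathlib import Path
from typing import List


def build_ffmpeg_cmd(mp3_path: Path, out_dir: Path, sample_rate: int = 22050) -> List[str]:
    """Build an ffmpeg command that converts one .mp3 file to a mono WAV
    file at the given sample rate, written into out_dir."""
    wav_path = out_dir / (mp3_path.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(mp3_path),      # input file
        "-ac", "1",               # downmix to one audio channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(wav_path),
    ]


# Hypothetical paths, for illustration only.
cmd = build_ffmpeg_cmd(Path("data/raw/example/song_001.mp3"), Path("data/wav"))
print(" ".join(cmd))

# To actually run the conversion:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The resulting WAV files would still need to be arranged in whatever directory layout and metadata format the repository's binarizer expects, which this sketch does not cover.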

How to get the PopBuTFy dataset

Hi, how can I get the PopBuTFy dataset to train the model? I have sent the application form to the official mailbox, but have received no response.

Help! Help! Help!

I got an error when running inference with the model:
Exception has occurred: RuntimeError
Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
CUDA_VISIBLE_DEVICES=0 python tasks/ --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
| Hparams chains: ['egs/egs_bases/config_base.yaml', 'egs/egs_bases/tts/base.yaml', 'egs/egs_bases/tts/fs2.yaml', 'egs/egs_bases/tts/fs2_adv.yaml', 'egs/egs_bases/vc/vc_ppg.yaml', 'egs/egs_bases/tts/base_zh.yaml', 'egs/egs_bases/singing/base.yaml', 'egs/datasets/audio/PopBuTFy/base_text2mel.yaml', 'egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml']
| Hparams:
accumulate_grad_batches: 1, amp: False, asr_content_encoder: True, asr_dec_layers: 2, asr_enc_layers: 2,
asr_enc_type: conformer, asr_last_norm: False, asr_upsample_norm: bn, audio_num_mel_bins: 80, audio_sample_rate: 22050,
base_config: ['egs/egs_bases/vc/vc_ppg.yaml', './base_text2mel.yaml'], binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': False, 'with_align': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_word': True, 'trim_eos_bos': False, 'reset_phone_dict': True, 'reset_word_dict': True}, binarizer_cls: data_gen.tts.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/PopBuTFyENSpkEM_new, check_val_every_n_epoch: 10,
clip_grad_norm: 1, clip_grad_value: 0, concurrent_ways: , conv_use_pos: False, cross_way_no_disc_loss: False,
cross_way_no_recon_loss: False, cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1,
cwt_std_scale: 0.8, datasets: [], debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9,
dec_inp_add_noise: False, dec_kernel_size: 5, dec_layers: 4, dec_num_heads: 2, decoder_rnn_dim: 0,
decoder_type: conv, dict_dir: , disable_map: False, disc_hidden_size: 128, disc_interval: 1,
disc_lr: 0.0001, disc_norm: in, disc_reduction: stack, disc_start_steps: 0, disc_win_num: 3,
discriminator_grad_norm: 1, discriminator_optimizer_params: {'eps': 1e-06, 'weight_decay': 0.0}, discriminator_scheduler_params: {'step_size': 60000, 'gamma': 0.5}, dropout: 0.05, ds_workers: 2,
dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_dec_norm: ln,
enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 9, enc_kernel_size: 5, enc_layers: 4, encoder_K: 8,
encoder_type: rel_fft, endless_ds: True, exp_name: 1012_hifigan_all_songs_nsf, ffn_act: gelu, ffn_hidden_size: 1024,
ffn_padding: SAME, fft_size: 512, fmax: 11025, fmin: 50, frames_multiple: 4,
fvae_dec_n_layers: 4, fvae_enc_dec_hidden: 192, fvae_enc_n_layers: 8, fvae_kernel_size: 5, gen_dir_name: ,
generator_grad_norm: 5.0, griffin_lim_iters: 60, hidden_size: 256, hop_size: 128, infer: True,
lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 0.0, lambda_kl: 0.001, lambda_mel_adv: 0.1,
lambda_mle: 1.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
latent_size: 128, layers_in_block: 2, load_ckpt: , loud_norm: False, lr: 1.0,
map_lr: 0.001, map_scheduler_params: {'gamma': 0.5, 'step_size': 60000}, max_epochs: 100, max_frames: 5000, max_input_tokens: 1550,
max_sentences: 80, max_tokens: 40000, max_updates: 200000, max_valid_sentences: 1, max_valid_tokens: 60000,
mel_disc_hidden_size: 128, mel_disc_type: multi_window, mel_gan: True, mel_hidden_size: 256, mel_loss: ssim:0.5|l1:0.5,
mel_strides: [2, 1, 1], mel_vmax: 1.5, mel_vmin: -6, mfa_version: 2, min_frames: 0,
min_level_db: -100, normalize_pitch: False, num_ckpt_keep: 2, num_heads: 2, num_sanity_val_steps: 10,
num_spk: 100, num_techs: 3, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.5,
optimizer_adam_beta2: 0.999, out_wav_norm: False, phase_1_concurrent_ways: p2p, phase_1_steps: -1, phase_2_concurrent_ways: a2a,p2p,
phase_2_steps: 100000, phase_3_concurrent_ways: a2p, pitch_ar: False, pitch_embed_type: 0, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: standard, pitch_ssim_win: 11, pitch_type: frame,
pre_align_args: {'nsample_per_mfa_group': 1000, 'txt_processor': 'zh', 'use_tone': False, 'sox_resample': True, 'sox_to_wav': False, 'allow_no_txt': False, 'trim_sil': False, 'denoise': False}, pre_align_cls: data_gen.tts.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 2, pretrain_asr_ckpt: checkpoints/1009_pretrain_asr_english, pretrain_fs_ckpt: , print_nan_grads: False,
processed_data_dir: data/processed/popbutfy_0.75, profile_infer: False, raw_data_dir: data/raw/popbutfy_short_male_0.75, ref_attn: False, ref_enc_out: 256,
ref_hidden_stride_kernel: ['0,3,5', '0,3,5', '0,2,5', '0,2,5', '0,2,5'], ref_level_db: 20, ref_norm_layer: bn, rename_tmux: True, rerun_gen: False,
resume_from_checkpoint: 0, save_best: False, save_codes: [], save_f0: True, save_gt: True,
scheduler: rsqrt, seed: 1234, sort_by_len: True, task_cls: tasks.singing.svb_vae_task.SVBVAEMleTask, tb_log_interval: 100,
test_ids: [], test_input_dir: , test_num: 0, test_prefixes: [], test_set_name: test,
train_set_name: train, train_sets: , use_cond_disc: False, use_energy: False, use_energy_embed: False,
use_gt_dur: True, use_gt_f0: True, use_pitch_embed: True, use_pos_embed: True, use_ref_enc: False,
use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_tech: True, use_uv: True,
use_var_enc: False, use_word_input: False, val_check_interval: 2000, valid_infer_interval: 10000, valid_mel_timbre_id: 100,
valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False, var_enc_vq_codes: 64,
vocoder: hifigan, vocoder_ckpt: checkpoints/1012_hifigan_all_songs_nsf, vocoder_denoise_c: 0.0, warmup_updates: 2000, weight_decay: 0,
win_size: 512, word_size: 1000, work_dir: checkpoints/1012_hifigan_all_songs_nsf,
12/14 10:18:24 AM GPU available: True, GPU used: [0]
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
12/14 10:18:24 AM load module from checkpoint: checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt
| load 'model' from 'checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt'.
| Generator Arch: MleSVBVAE(
(pitch_embed): Embedding(300, 256, padding_idx=0)
(pitch_encoder): ConvStacks(
(conv): ModuleList(
(0-2): 3 x ConvBlock(
(conv): ConvNorm(
(conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(norm): GroupNorm(16, 256, eps=1e-05, affine=True)
(dropout): Dropout(p=0, inplace=False)
(relu): ReLU()
(in_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
(vc_asr): VCASR(
(mel_prenet): Prenet(
(layers): ModuleList(
(0): Sequential(
(0): Conv1d(80, 256, kernel_size=(5,), stride=(2,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1-2): 2 x Sequential(
(0): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
(content_encoder): ConformerLayers(
(layers): ModuleList()
(pos_embed): RelPositionalEncoding(
(dropout): Dropout(p=0.05, inplace=False)
(encoder_layers): ModuleList(
(0-1): 2 x EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
(feed_forward): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
(feed_forward_macaron): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(31,), stride=(1,), padding=(15,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
(norm_ff): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.05, inplace=False)
(layer_norm): Linear(in_features=256, out_features=256, bias=True)
(token_embed): Embedding(88, 256, padding_idx=0)
(asr_decoder): TransformerASRDecoder(
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0-1): 2 x TransformerDecoderLayer(
(op): DecSALayer(
(layer_norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
(layer_norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
(layer_norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ffn): TransformerFFNLayer(
(ffn_1): Sequential(
(0): ConstantPad1d(padding=(8, 0), value=0.0)
(1): Conv1d(256, 1024, kernel_size=(9,), stride=(1,))
(ffn_2): Linear(in_features=1024, out_features=256, bias=True)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(project_out_dim): Linear(in_features=256, out_features=88, bias=False)
(upsample_layer): Sequential(
(0): Sequential(
(0): Upsample(scale_factor=2.0, mode='nearest')
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(2): ReLU()
(3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(spk_embed_proj): Linear(in_features=256, out_features=256, bias=True)
(encoded_embed_proj): Linear(in_features=768, out_features=256, bias=True)
(vae_model): GlobalFVAE(
(g_pre_net): Sequential(
(0): Conv1d(256, 256, kernel_size=(8,), stride=(4,), padding=(2,))
(encoder): GlobalFVAEEncoder(
(pre_net): Sequential(
(0): Conv1d(80, 192, kernel_size=(8,), stride=(4,), padding=(2,))
(wn): WN(
(in_layers): ModuleList(
(0-7): 8 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
(res_skip_layers): ModuleList(
(0-6): 7 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(7): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 3072, kernel_size=(1,), stride=(1,))
(out_proj): Conv1d(192, 256, kernel_size=(1,), stride=(1,))
(poolings): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(4): ReLU()
(5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(decoder): GlobalFVAEDecoder(
(pre_net): Sequential(
(0): ConvTranspose1d(128, 192, kernel_size=(4,), stride=(4,))
(wn): WN(
(in_layers): ModuleList(
(0-3): 4 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
(res_skip_layers): ModuleList(
(0-2): 3 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(3): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 1536, kernel_size=(1,), stride=(1,))
(out_proj): Conv1d(192, 80, kernel_size=(1,), stride=(1,))
(z_mapping_function): GlobalLatentMap(
(convs): Sequential(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(spk_proj): Sequential(
(0): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
(1): ReLU(inplace=True)
(2): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
| Generator Trainable Parameters: 10.056M
12/14 10:18:25 AM load module from checkpoint: checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt
Traceback (most recent call last):
File "tasks/", line 17, in
File "tasks/", line 12, in run_task
File "/home/datnt114/Videos/doanpc/NeuralSVB/tasks/", line 352, in start
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/", line 92, in test
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/", line 100, in fit
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/", line 120, in run_single_process
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/", line 355, in restore_weights
getattr(task_ref, k).load_state_dict(v)
File "/home/datnt114/anaconda3/envs/diffsinger/lib/python3.8/site-packages/torch/nn/modules/", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SVBVAEMleTask' object has no attribute 'model_gen'
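
The traceback shows `restore_weights` calling `getattr(task_ref, k).load_state_dict(v)`, so the error means the checkpoint being restored contains a `model_gen` entry that `SVBVAEMleTask` does not define. Because `--exp_name 1012_hifigan_all_songs_nsf` makes `work_dir` the same directory as `vocoder_ckpt`, one plausible cause (an assumption, not confirmed by the authors) is that the vocoder checkpoint is being restored into the SVB task. The sketch below uses a hypothetical helper, `unmatched_checkpoint_modules`, and stand-in objects to show how such a mismatch could be detected before loading.

```python
from typing import Any, Dict, List


def unmatched_checkpoint_modules(task: Any, ckpt_state_dict: Dict[str, Any]) -> List[str]:
    """Return the top-level checkpoint entries that have no attribute of
    the same name on the task object."""
    return sorted(k for k in ckpt_state_dict if not hasattr(task, k))


class FakeSVBTask:
    """Stand-in for the real task; it only exposes a `model` attribute."""

    def __init__(self) -> None:
        self.model = object()


# Stand-in checkpoints: only the top-level key names matter here.
fake_vocoder_ckpt = {"model_gen": {}}  # key name taken from the error above
fake_svb_ckpt = {"model": {}}          # key name taken from the "load 'model'" log line

print(unmatched_checkpoint_modules(FakeSVBTask(), fake_vocoder_ckpt))
print(unmatched_checkpoint_modules(FakeSVBTask(), fake_svb_ckpt))
```

If the first kind of mismatch is what happens here, pointing `--exp_name` at the SVB model's own experiment directory, and leaving the HiFi-GAN directory only in `vocoder_ckpt`, would be worth trying.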

I got an error when running inference with the model. How can I fix it?

CUDA_VISIBLE_DEVICES=0 python tasks/ --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
RuntimeError: Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
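
The unexpected keys (`m_source.*`, `noise_convs.*`) suggest that the checkpoint was saved from a generator with extra source and noise modules, as the `_nsf` suffix in the checkpoint name hints, while the instantiated `HifiGanGenerator` lacks them. This reading is an assumption. A minimal sketch for grouping unexpected keys by their top-level module follows. It uses a hypothetical helper, `group_unexpected_keys`, on plain key lists; with PyTorch these would come from `state_dict().keys()`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_unexpected_keys(
    ckpt_keys: Iterable[str], model_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Group checkpoint keys that the model does not have by the first
    component of their dotted name."""
    model_key_set = set(model_keys)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key in ckpt_keys:
        if key not in model_key_set:
            grouped[key.split(".", 1)[0]].append(key)
    return dict(grouped)


# A few keys from the error message, plus one illustrative shared key
# ("conv_pre.weight") that is assumed to exist on both sides.
ckpt_keys = [
    "m_source.l_linear.weight",
    "m_source.l_linear.bias",
    "noise_convs.0.weight",
    "noise_convs.0.bias",
    "conv_pre.weight",
]
model_keys = ["conv_pre.weight"]

grouped = group_unexpected_keys(ckpt_keys, model_keys)
for module_name, keys in grouped.items():
    print(f"{module_name}: {len(keys)} unexpected key(s)")
```

If whole modules are missing from the model rather than a few stray parameters, the generator is probably being built without an option that the checkpoint was trained with, so the vocoder configuration is the first place to look.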

