neuralsvb's People

Contributors

moonintheriver, mrzixi


neuralsvb's Issues

How can NSVB generalize to unseen singers?

NSVB is trained on PopBuTFy, which contains 34 speakers. Even with the additional 30 hours of internal singing data used in Stage 1 training (as described in the paper), I doubt this amount of data is enough to generalize to unseen singers; I believe the model can only generalize to singers similar to those in the training set. I trained Stage 1 with the 50-hour OpenSinger dataset plus 3 other singers, and the resulting model still only generalizes to singers similar to the training set, not to a very different singer. Has anyone achieved better generalization here?

Help! (救命啊!)

I got an error when running model inference:
Exception has occurred: RuntimeError
Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
| Hparams chains: ['egs/egs_bases/config_base.yaml', 'egs/egs_bases/tts/base.yaml', 'egs/egs_bases/tts/fs2.yaml', 'egs/egs_bases/tts/fs2_adv.yaml', 'egs/egs_bases/vc/vc_ppg.yaml', 'egs/egs_bases/tts/base_zh.yaml', 'egs/egs_bases/singing/base.yaml', 'egs/datasets/audio/PopBuTFy/base_text2mel.yaml', 'egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml']
| Hparams:
accumulate_grad_batches: 1, amp: False, asr_content_encoder: True, asr_dec_layers: 2, asr_enc_layers: 2,
asr_enc_type: conformer, asr_last_norm: False, asr_upsample_norm: bn, audio_num_mel_bins: 80, audio_sample_rate: 22050,
base_config: ['egs/egs_bases/vc/vc_ppg.yaml', './base_text2mel.yaml'], binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': False, 'with_align': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_word': True, 'trim_eos_bos': False, 'reset_phone_dict': True, 'reset_word_dict': True}, binarizer_cls: data_gen.tts.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/PopBuTFyENSpkEM_new, check_val_every_n_epoch: 10,
clip_grad_norm: 1, clip_grad_value: 0, concurrent_ways: , conv_use_pos: False, cross_way_no_disc_loss: False,
cross_way_no_recon_loss: False, cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1,
cwt_std_scale: 0.8, datasets: [], debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9,
dec_inp_add_noise: False, dec_kernel_size: 5, dec_layers: 4, dec_num_heads: 2, decoder_rnn_dim: 0,
decoder_type: conv, dict_dir: , disable_map: False, disc_hidden_size: 128, disc_interval: 1,
disc_lr: 0.0001, disc_norm: in, disc_reduction: stack, disc_start_steps: 0, disc_win_num: 3,
discriminator_grad_norm: 1, discriminator_optimizer_params: {'eps': 1e-06, 'weight_decay': 0.0}, discriminator_scheduler_params: {'step_size': 60000, 'gamma': 0.5}, dropout: 0.05, ds_workers: 2,
dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_dec_norm: ln,
enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 9, enc_kernel_size: 5, enc_layers: 4, encoder_K: 8,
encoder_type: rel_fft, endless_ds: True, exp_name: 1012_hifigan_all_songs_nsf, ffn_act: gelu, ffn_hidden_size: 1024,
ffn_padding: SAME, fft_size: 512, fmax: 11025, fmin: 50, frames_multiple: 4,
fvae_dec_n_layers: 4, fvae_enc_dec_hidden: 192, fvae_enc_n_layers: 8, fvae_kernel_size: 5, gen_dir_name: ,
generator_grad_norm: 5.0, griffin_lim_iters: 60, hidden_size: 256, hop_size: 128, infer: True,
lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 0.0, lambda_kl: 0.001, lambda_mel_adv: 0.1,
lambda_mle: 1.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
latent_size: 128, layers_in_block: 2, load_ckpt: , loud_norm: False, lr: 1.0,
map_lr: 0.001, map_scheduler_params: {'gamma': 0.5, 'step_size': 60000}, max_epochs: 100, max_frames: 5000, max_input_tokens: 1550,
max_sentences: 80, max_tokens: 40000, max_updates: 200000, max_valid_sentences: 1, max_valid_tokens: 60000,
mel_disc_hidden_size: 128, mel_disc_type: multi_window, mel_gan: True, mel_hidden_size: 256, mel_loss: ssim:0.5|l1:0.5,
mel_strides: [2, 1, 1], mel_vmax: 1.5, mel_vmin: -6, mfa_version: 2, min_frames: 0,
min_level_db: -100, normalize_pitch: False, num_ckpt_keep: 2, num_heads: 2, num_sanity_val_steps: 10,
num_spk: 100, num_techs: 3, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.5,
optimizer_adam_beta2: 0.999, out_wav_norm: False, phase_1_concurrent_ways: p2p, phase_1_steps: -1, phase_2_concurrent_ways: a2a,p2p,
phase_2_steps: 100000, phase_3_concurrent_ways: a2p, pitch_ar: False, pitch_embed_type: 0, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: standard, pitch_ssim_win: 11, pitch_type: frame,
pre_align_args: {'nsample_per_mfa_group': 1000, 'txt_processor': 'zh', 'use_tone': False, 'sox_resample': True, 'sox_to_wav': False, 'allow_no_txt': False, 'trim_sil': False, 'denoise': False}, pre_align_cls: data_gen.tts.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 2, pretrain_asr_ckpt: checkpoints/1009_pretrain_asr_english, pretrain_fs_ckpt: , print_nan_grads: False,
processed_data_dir: data/processed/popbutfy_0.75, profile_infer: False, raw_data_dir: data/raw/popbutfy_short_male_0.75, ref_attn: False, ref_enc_out: 256,
ref_hidden_stride_kernel: ['0,3,5', '0,3,5', '0,2,5', '0,2,5', '0,2,5'], ref_level_db: 20, ref_norm_layer: bn, rename_tmux: True, rerun_gen: False,
resume_from_checkpoint: 0, save_best: False, save_codes: [], save_f0: True, save_gt: True,
scheduler: rsqrt, seed: 1234, sort_by_len: True, task_cls: tasks.singing.svb_vae_task.SVBVAEMleTask, tb_log_interval: 100,
test_ids: [], test_input_dir: , test_num: 0, test_prefixes: [], test_set_name: test,
train_set_name: train, train_sets: , use_cond_disc: False, use_energy: False, use_energy_embed: False,
use_gt_dur: True, use_gt_f0: True, use_pitch_embed: True, use_pos_embed: True, use_ref_enc: False,
use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_tech: True, use_uv: True,
use_var_enc: False, use_word_input: False, val_check_interval: 2000, valid_infer_interval: 10000, valid_mel_timbre_id: 100,
valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False, var_enc_vq_codes: 64,
vocoder: hifigan, vocoder_ckpt: checkpoints/1012_hifigan_all_songs_nsf, vocoder_denoise_c: 0.0, warmup_updates: 2000, weight_decay: 0,
win_size: 512, word_size: 1000, work_dir: checkpoints/1012_hifigan_all_songs_nsf,
12/14 10:18:24 AM GPU available: True, GPU used: [0]
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
12/14 10:18:24 AM load module from checkpoint: checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt
| load 'model' from 'checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt'.
| Generator Arch: MleSVBVAE(
(pitch_embed): Embedding(300, 256, padding_idx=0)
(pitch_encoder): ConvStacks(
(conv): ModuleList(
(0-2): 3 x ConvBlock(
(conv): ConvNorm(
(conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(norm): GroupNorm(16, 256, eps=1e-05, affine=True)
(dropout): Dropout(p=0, inplace=False)
(relu): ReLU()
)
)
(in_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(vc_asr): VCASR(
(mel_prenet): Prenet(
(layers): ModuleList(
(0): Sequential(
(0): Conv1d(80, 256, kernel_size=(5,), stride=(2,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1-2): 2 x Sequential(
(0): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(content_encoder): ConformerLayers(
(layers): ModuleList()
(pos_embed): RelPositionalEncoding(
(dropout): Dropout(p=0.05, inplace=False)
)
(encoder_layers): ModuleList(
(0-1): 2 x EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
)
(feed_forward): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(feed_forward_macaron): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(31,), stride=(1,), padding=(15,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
)
(norm_ff): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.05, inplace=False)
)
)
(layer_norm): Linear(in_features=256, out_features=256, bias=True)
)
(token_embed): Embedding(88, 256, padding_idx=0)
(asr_decoder): TransformerASRDecoder(
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0-1): 2 x TransformerDecoderLayer(
(op): DecSALayer(
(layer_norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ffn): TransformerFFNLayer(
(ffn_1): Sequential(
(0): ConstantPad1d(padding=(8, 0), value=0.0)
(1): Conv1d(256, 1024, kernel_size=(9,), stride=(1,))
)
(ffn_2): Linear(in_features=1024, out_features=256, bias=True)
)
)
)
)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(project_out_dim): Linear(in_features=256, out_features=88, bias=False)
)
)
(upsample_layer): Sequential(
(0): Sequential(
(0): Upsample(scale_factor=2.0, mode='nearest')
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(2): ReLU()
(3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(spk_embed_proj): Linear(in_features=256, out_features=256, bias=True)
(encoded_embed_proj): Linear(in_features=768, out_features=256, bias=True)
(vae_model): GlobalFVAE(
(g_pre_net): Sequential(
(0): Conv1d(256, 256, kernel_size=(8,), stride=(4,), padding=(2,))
)
(encoder): GlobalFVAEEncoder(
(pre_net): Sequential(
(0): Conv1d(80, 192, kernel_size=(8,), stride=(4,), padding=(2,))
)
(wn): WN(
(in_layers): ModuleList(
(0-7): 8 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-6): 7 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(7): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 3072, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 256, kernel_size=(1,), stride=(1,))
(poolings): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(4): ReLU()
(5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
)
)
(decoder): GlobalFVAEDecoder(
(pre_net): Sequential(
(0): ConvTranspose1d(128, 192, kernel_size=(4,), stride=(4,))
)
(wn): WN(
(in_layers): ModuleList(
(0-3): 4 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-2): 3 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(3): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 1536, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 80, kernel_size=(1,), stride=(1,))
)
)
(z_mapping_function): GlobalLatentMap(
(convs): Sequential(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
(spk_proj): Sequential(
(0): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
(1): ReLU(inplace=True)
(2): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
)
)
| Generator Trainable Parameters: 10.056M
12/14 10:18:25 AM load module from checkpoint: checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt
Traceback (most recent call last):
File "tasks/run.py", line 17, in
run_task()
File "tasks/run.py", line 12, in run_task
task_cls.start()
File "/home/datnt114/Videos/doanpc/NeuralSVB/tasks/base_task.py", line 352, in start
trainer.test(cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 92, in test
self.fit(task_cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 100, in fit
self.run_single_process(self.task)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 120, in run_single_process
self.restore_weights(checkpoint)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 355, in restore_weights
getattr(task_ref, k).load_state_dict(v)
File "/home/datnt114/anaconda3/envs/diffsinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'SVBVAEMleTask' object has no attribute 'model_gen'
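
The final AttributeError means the checkpoint's state_dict contains an entry named model_gen, but SVBVAEMleTask defines no attribute with that name, so the checkpoint and the task/config appear to come from different setups. A minimal diagnostic sketch, assuming the checkpoint follows the layout utils/trainer.py expects (a "state_dict" key mapping module names to weights; the actual layout may differ):

import torch

# List the module names this checkpoint will try to restore onto the task;
# utils/trainer.py calls getattr(task_ref, k).load_state_dict(v) per entry.
ckpt = torch.load(
    "checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt",
    map_location="cpu")
for name, value in ckpt.get("state_dict", {}).items():
    size = len(value) if isinstance(value, dict) else type(value).__name__
    print(name, "->", size)

If model_gen shows up here, the checkpoint was saved by a task class that defines that attribute, and the config/exp_name combination needs to match it.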

How to get the PopBuTFy dataset

Hi, how can I get the PopBuTFy dataset to train the model? I have sent the application form to the official mailbox, but there has been no response.

About NSVB (关于NSVB)

After listening to the demo, I have some questions:
1. If NSVB is actually used to beautify singing, inference requires the original singer's pitch curve, right?
2. Although the test samples are not in the training set, GT Professional and GT Amateur were recorded by the same person. In real inference, the GT Professional cannot be the singer themselves; has generalization under that condition been tested?

Issue when loading the pre-trained model

There may be an issue when loading the pre-trained model (ckpt_utils.py, line 61):

  • size mismatch for token_embed.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256]).
  • size mismatch for asr_decoder.project_out_dim.weight: copying a param with shape torch.Size([88, 256]) from checkpoint, the shape in current model is torch.Size([87, 256])
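
For reference, the 88-vs-87 mismatch in token_embed and project_out_dim usually means the phone dictionary rebuilt from local data has one fewer entry than the dictionary used to train the checkpoint (note binarization_args above includes reset_phone_dict: True). A hedged sketch for listing every shape mismatch between a checkpoint and an instantiated model; the checkpoint layout and the model variable are assumptions:

import torch

ckpt = torch.load("path/to/model_ckpt.ckpt", map_location="cpu")  # hypothetical path
ckpt_state = ckpt.get("state_dict", ckpt)
model_state = model.state_dict()  # `model`: your instantiated MleSVBVAE (assumed)

for name, tensor in ckpt_state.items():
    if not hasattr(tensor, "shape"):
        continue  # skip nested dicts or non-tensor entries
    if name in model_state and tuple(model_state[name].shape) != tuple(tensor.shape):
        print(f"{name}: checkpoint {tuple(tensor.shape)} "
              f"vs model {tuple(model_state[name].shape)}")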

NSVB checkpoint or data request

Hi, is it possible to make the trained SVB model available? Or could you describe what the input data should look like, so that I could train my own model?

The pre-trained model of NSVB

Hi! I really liked your DiffSinger project, and NeuralSVB also looks very promising. You write that you provide a pre-trained model of NSVB; I would like to try it on my data. How can I get the model?

I got an error when running model inference. How can I fix it?

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
RuntimeError: Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
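
The unexpected m_source.* and noise_convs.* keys are characteristic of an NSF (neural source-filter) HiFi-GAN variant, while the config instantiates a plain HifiGanGenerator, so the vocoder class and the 1012_hifigan_all_songs_nsf checkpoint appear mismatched; the proper fix is presumably to use the matching NSF generator class and config. Purely as a diagnostic (not a fix, and likely to degrade audio quality), the extra keys can be filtered out and the remainder loaded non-strictly. A minimal sketch, assuming `vocoder` is the instantiated HifiGanGenerator and the checkpoint stores flat weights:

import torch

ckpt = torch.load(
    "checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt",
    map_location="cpu")
state = ckpt.get("state_dict", ckpt)

# Keep only keys the plain generator defines; NSF-specific weights
# (m_source.*, noise_convs.*) are discarded, which confirms the mismatch.
wanted = set(vocoder.state_dict().keys())  # `vocoder`: assumed generator instance
filtered = {k: v for k, v in state.items() if k in wanted}
missing, unexpected = vocoder.load_state_dict(filtered, strict=False)
print("missing:", missing)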

Hi, request for datasets and source code

This work is very impressive and we are interested in it. Are there any plans to make the dataset and associated pre-trained models public in the near future? Thank you!

Problem with proper data loading

Hi, I'd like to run your model myself, but I cannot find the proper way to load the dataset of .mp3 files you provided. Is there any chance you could share the dataloader you used, or give some hints on how to process the .mp3 files into a valid dataset that works with your usage examples? I'd be very grateful! (A rough preprocessing sketch follows below.)
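
Until the official dataloader is available, a rough preprocessing sketch: decode each .mp3 and compute log-mel spectrograms with the parameters shown in the hparams dump above (audio_sample_rate 22050, fft_size 512, hop_size 128, win_size 512, 80 mel bins, fmin 50, fmax 11025). This uses librosa as an assumed dependency and only approximates the repo's own binarizer, which may normalize differently:

import librosa
import numpy as np

def mp3_to_mel(path,
               sr=22050, n_fft=512, hop_length=128, win_length=512,
               n_mels=80, fmin=50, fmax=11025):
    """Decode an audio file and return a log-mel spectrogram.

    Parameter defaults mirror the hparams printed by tasks/run.py;
    the repo's actual feature extraction may differ in normalization.
    """
    wav, _ = librosa.load(path, sr=sr)  # librosa decodes mp3 via audioread/soundfile
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=win_length, n_mels=n_mels, fmin=fmin, fmax=fmax)
    return np.log10(np.maximum(mel, 1e-5))  # log compression; exact floor is a guess

mel = mp3_to_mel("data/raw/example.mp3")  # hypothetical file
print(mel.shape)  # (80, n_frames)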
