openvpi / diffsinger Goto Github PK

This project forked from moonintheriver/diffsinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

License: Apache License 2.0

Python 100.00%

acoustic-model diffusion diffussion-model melody-frontend midi pitch-prediction rectified-flow singing-voice singing-voice-synthesis svs

diffsinger's Introduction

OpenVpi

diffsinger's People

Contributors

Stargazers

Watchers

Forkers

xunmengshe meizu16 colourfulspring lzcsjtu ahuqyu amorjnyh zhzhongshi tisaneyang mia2832791324 sisanime ofshellohicy jm12138 jormungandr000 sdlibowen leyusf ziyangye566842 kongjian123 swqsldz jack-kks chicohan a22002020 vari-h hhy5562877 n0best1rn xlei666 aiarttsukuruyo ukaserge yorhamodelnumbernine oxygen-dioxide kekeguai 314857493 cuvalign ateren gwell1 atregret crazyboy12 ghostmember litture 5wj5 lomitt yuanyiqingmo hasazhegeyue 97littleleaf11 cqsr paramoeciu-bug wrs-wt 1328212169 adelose2020 chenmu2016 fyzhu97 one-sixth mizumiku yqty miskeylee leavesnight pakting dounsm qpmzs emilia-521 loveriswy aylitat helpscott jnightlee huoshenlaile ethpony zyxrtfjku tlbbpcg q447552640 piaoye2019 kemomi chenxing3 snowwolf2867 arxcpq emcsl2 yl520nzdm diangs601 newplayer486486 823129480 lightfollowed jmgm502 dajiangfu sammyalien pulluto gaohanas rushdie99 kingpon cryercipher linlanghupo wwfwwf2360 haoxitu shuaigedev xiaoguan-dot q933019 jjandnn toaoman qimingnana ztsinsun anya-rb-chen sao-bci 97001

diffsinger's Issues

Support pitch control

i. e. MIDI-A mode in the original DiffSinger repository

I trained a DDSP vocoder in torch 1.8.2, and I am getting an error when inferencing from .ds file. I followed all the documentation for training a DDSP vocoder and making it work with DiffSinger. NSF-HiFiGAN vocoder works fine with the same model and same .ds file.

| load phoneme set: ['A', 'AP', 'E', 'SP', 'Y', 'a', 'b', 'bj', 'c', 'ch', 'cl', 'd', 'dj', 'e', 'f', 'fj', 'g', 'gj', 'h', 'hj', 'i', 'j', 'k', 'kj', 'l', 'lj', 'm', 'mj', 'n', 'nj', 'o', 'p', 'pj', 'r', 'rj', 's', 'sh', 'shj', 'sj', 't', 'tj', 'u', 'v', 'vf', 'vj', 'y', 'z', 'zh', 'zj']
| load 'model' from 'checkpoints/crow/model_ckpt_steps_182000.ckpt'.
 [Loading] checkpoints/ddsp-crow/model_best-traced-torch1.8.2.jit
Processed 4 tokens: a tj e SP
Using manual phone duration
Using manual pitch curve
sample time step: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 84.55it/s]
Traceback (most recent call last):
  File "main.py", line 202, in <module>
    infer_once(os.path.join(out, f'{name}{suffix}'), save_mel=args.mel)
  File "main.py", line 180, in infer_once
    seg_audio = infer_ins.infer_once(param)
  File "/home/kei/Desktop/diffsinger2/DiffSinger/basics/base_svs_infer.py", line 147, in infer_once
    output = self.forward_model(inp, return_mel=return_mel)
  File "/home/kei/Desktop/diffsinger2/DiffSinger/inference/ds_cascade.py", line 273, in forward_model
    wav_out = self.run_vocoder(mel_out, f0=f0_pred)
  File "/home/kei/Desktop/diffsinger2/DiffSinger/basics/base_svs_infer.py", line 71, in run_vocoder
    y = self.vocoder.spec2wav_torch(c, **kwargs)
  File "/home/kei/Desktop/diffsinger2/DiffSinger/src/vocoders/ddsp.py", line 137, in spec2wav_torch
    signal, _, (s_h, s_n) = self.model(mel.to(self.device), f0.to(self.device))
  File "/home/kei/miniconda3/envs/diffsinger2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
RuntimeError: Unsupported value kind: ComplexDouble```

Multi-speaker training

crash when running variance branch

This checkpoint is of the category 'acoustic', but a checkpoint of category 'variance' is required.

I checked the migrate.py, only acoustic category is provided.
So can u share the variance ckpt file?

question about MIDI and F0

I wonder how to get F0 truth in practice ? In my opinion, MIDI could be labeled mannually, and the F0 truth are usually got from human voice? And i wonder why you focus on midiless version in future, the midiless performance better than midi one or the midi lable is harder to obtain than F0?
In my test, the ph_dur estimation is poor, when test your midi-A version. If given the ph_dur GT as input, the performance is improved a lot. However the performance in midi-B test is always better than midi-A test .
I am new to SVS, sorry for above questions.

Inference with local Python environment

main.py in openvpi/DiffSinger

ONNX deployment

AssertionError: ckpt not found in checkpoints/0102_xiaoma_pe

It seems that there is a missing pretrained pitch extractor error when using configs/midi/e2e/opencpop/ds100_adj_rel.yaml. Where can I download it?

Docs for tts/

Is there a doc for tts tasks? or could you give me some hint about how to run it. Thanks.

Diffusion-base variance predictors with high fidelity, accuracy and expressiveness

Demo of current progress: https://www.bilibili.com/video/BV12T411t7Qg

run without mel2ph raise error

inp = {
        'text': '小酒窝长睫毛AP是你最美的记号',
        'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
        'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
        'input_type': 'word'
    }
target = "/content/DiffSinger/infer_out/小酒窝.wav"
ds.DiffSingerE2EInfer.example_run(inp, target=target)
Audio(filename=target)

Output:

| load phoneme set: ['AP', 'E', 'En', 'SP', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
| load 'model' from 'checkpoints/0116_female_triplet_ds1000/model_ckpt_steps_320000.ckpt'.
| Load HifiGAN:  checkpoints/nsf_hifigan/model
Removing weight norm...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-2-af10976b8c2b>](https://localhost:8080/#) in <module>
      6     }
      7 target = "/content/DiffSinger/infer_out/小酒窝.wav"
----> 8 ds.DiffSingerE2EInfer.example_run(inp, target=target)
      9 Audio(filename=target)

7 frames
[/content/DiffSinger/basics/base_svs_infer.py](https://localhost:8080/#) in example_run(cls, inp, target)
    156         # call the model
    157         infer_ins = cls(hparams)
--> 158         out = infer_ins.infer_once(inp)
    159 
    160         # output to file

[/content/DiffSinger/basics/base_svs_infer.py](https://localhost:8080/#) in infer_once(self, inp, return_mel)
    145     def infer_once(self, inp, return_mel=False):
    146         inp = self.preprocess_input(inp, input_type=inp['input_type'] if inp.get('input_type') else 'word')
--> 147         output = self.forward_model(inp, return_mel=return_mel)
    148         output = self.postprocess_output(output)
    149         return output

[/content/DiffSinger/inference/ds_e2e.py](https://localhost:8080/#) in forward_model(self, inp, return_mel)
    151         spk_id = sample.get('spk_ids')
    152         with torch.no_grad():
--> 153             output = self.model(txt_tokens, spk_id=spk_id, ref_mels=None, infer=True,
    154                                 pitch_midi=sample['pitch_midi'], midi_dur=sample['midi_dur'],
    155                                 is_slur=sample['is_slur'],mel2ph=sample['mel2ph'])

[/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

[/content/DiffSinger/src/diff/diffusion.py](https://localhost:8080/#) in forward(self, txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy, infer, **kwargs)
    236             conditioning diffusion, use fastspeech2 encoder output as the condition
    237         '''
--> 238         ret = self.fs2(txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy,
    239                        skip_decoder=True, infer=infer, **kwargs)
    240         cond = ret['decoder_inp'].transpose(1, 2)

[/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

[/content/DiffSinger/modules/naive_frontend/encoder.py](https://localhost:8080/#) in forward(self, txt_tokens, mel2ph, spk_embed_id, ref_mels, f0, uv, energy, skip_decoder, spk_embed_dur_id, spk_embed_f0_id, infer, is_slur, **kwarg)
     53                 spk_embed_dur_id=None, spk_embed_f0_id=None, infer=False, is_slur=None, **kwarg):
     54         B, T = txt_tokens.shape
---> 55         dur = mel2ph_to_dur(mel2ph, T).float()
     56         dur_embed = self.dur_embed(dur[:, :, None])
     57         encoder_out = self.encoder(txt_tokens, dur_embed)

[/content/DiffSinger/modules/fastspeech/tts_modules.py](https://localhost:8080/#) in mel2ph_to_dur(mel2ph, T_txt, max_dur)
    241 
    242 def mel2ph_to_dur(mel2ph, T_txt, max_dur=None):
--> 243     B, _ = mel2ph.shape
    244     dur = mel2ph.new_zeros(B, T_txt + 1).scatter_add(1, mel2ph, torch.ones_like(mel2ph))
    245     dur = dur[:, 1:]

AttributeError: 'NoneType' object has no attribute 'shape'
SEARCH STACK OVERFLOW

Question about training in other languages like English

Hello, Really nice work you are doing here. I was just wondering if you can inform and guide me about how can I create the music score for my custom English dataset that has about 5-6 hours of English singing and yes it is already transcribed in just text , and also wondering if it is manual work or is there a tool or python code to automate it (the music score thing), I academically know that the music score consists of ,input text, input note and input duration , I'm asking about how can I create the input note and duration? because i want to train different SVS algorithms and they all require music score starting from hifi-singer until Diffsinger,, So i think i need to make my English dataset formatted like opencpop. here is an example of a single line of train.txt of opencpop:

"2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|0.609200 0.609200 2.137240 2.137240 3.058880|0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0"

To be honest I don't understand all the EBD# stuff and also the different decimal numbers and the bunch of 0000s

and also whether or not I'm going to need to do the "|f eng k uang SP" English phoneme translation between the input text |疯狂| and input note "D#4/Eb4 D#4/Eb4 D4 D4 rest" since my dataset is already in English because i think that "|f eng k uang SP" is the English character translation of this Chinese sung word ||疯狂|

If you can guide me in details , I would be happy to give back and donate the English singing dataset to you since you know there isn't a single one available.

btw i saw the "SingingVoice-MFA-Training" repository but i don't understand it and I'm not sure if it can help in my situation , And I don't know if it will work with English or just Chinese.

If you can answer this that would be so amazing because I'm lost , Thanks in advance. :)

有关DiffSinger中文译名的提议 / Suggestions about Chinese translation to "Diffsinger"

在过去一段时间内，DiffSinger的用户数量快速上升。作为一个主打面向中文的虚拟歌唱解决方案，一个中文译名可能有其必要。通过这个中文译名，一是能凸显DiffSinger的本地特色，更多的是能扩大其影响力，也提高其作为社区驱动的开源方案的正式程度。相较于基于学术名称的英文名“DiffSinger”，一个理想的中文译名，可以使更广大中文用户对于DiffSinger感到亲切，并进而推进用户社群扩大。典型的成功翻译案例，有如初音未来。典型的中文名称，有如洛天依，都让人感到是高度本地化、成气候的社群氛围。因此我提起此Issue，想对DiffSinger的中文译名做一些提议。个人能力有限，对AI知之甚少，只是想做些外围工作，贡献自己的微薄力量。如有不妥当之处，还需您劳神指出！

In the recent period of time, the number of users of DiffSinger has rapidly increased. As an open-source VSinger solution targeting Chinese audiences, having a Chinese translation may be necessary. Such a translation can not only highlight the local features of DiffSinger, but more importantly, expand its influence and increase its formality as a community-driven open-source solution. Compared to the name based upon academic English name "DiffSinger," an ideal Chinese translation can make more Chinese users feel familiar with DiffSinger, and thus promote the expansion of the user community. Successful cases of translation, such as Hatsune Miku, and typical Chinese names, such as Luo Tianyi, create a highly localized and influential community-driven content creating atmosphere. Therefore, I raise this issue, hoping to propose some Chinese translation for DiffSinger. My personal abilities are limited and my knowledge of AI is scant, but I still want to contribute my humble efforts by doing these non-technical works. If there is any improper wordings or opinions, please feel free to point it out. (This English version is originally created by OpenAI's GPT3 and later modified by myself)

以下是我的若干建议：

Here comes some of my suggestions:

根据技术原理：散音 (散音AI)
根据技术原理：漫音（漫音AI）
根据项目**：德发声 (Diff音译+人人发声，为每个人提供机会)
根据项目**：林逾荆（零大于金，代表不收费，同时有志成为商业替代品）
综合考量：迭咏（音译+意译）
综合考量：艾楷渊 (AI + 开源（谐音））

How do I inference from a file?

As per title, I cannot seem to infer a new project using main.py, and any of the samples projects. Seems to be an error with ph_dur being unable to split due to value being null.

Also I am using a Japanese model (albeit I haven't fixed all of the bugs) with a Japanese dict that I made. Training does seem to run fine, however.

Sample DS File

I think it would be good to add the sample DS file to the repo

创建config与训练时不匹配

refactor-v2分支
我在data文件夹下的数据集文件夹中启用了fixed_pitch_shifting和random_time_stretching，但是python data_gen/binarize.py --config data/expname/config.yaml时控制台显示

| Hparams chains:  ['configs/base.yaml', 'configs/acoustic.yaml', 'data/liuchan_23.06.25/config.yaml']
| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, augmentation_args: {'random_pitch_shifting': {'enabled': False, 'range': [-5.0, 5.0], 'scale': 1.0}, 'fixed_pitch_shifting': {'enabled': False, 'targets': [-5.0, 5.0], 'scale': 0.75}, 'random_time_stretching': {'enabled': False, 'range': [0.65, 2.0], 'domain': 'log', 'scale': 2.0}},
base_config: ['configs/acoustic.yaml'], binarization_args: {'shuffle': True, 'num_workers': 0}, binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer, binary_data_dir: data/liuchan_23.06.25/binary, breathiness_smooth_width: 0.12,
clip_grad_norm: 1, dataloader_prefetch_factor: 2, ddp_backend: nccl, dictionary: dictionaries/opencpop-extension.txt, diff_accelerator: ddim,

其中fixed_pitch_shifting': {'enabled': False还有fixed_pitch_shifting': {'enabled': False,都与我之前在acoustic_preparation.ipynb中选的不符
训练时也是这样，我认为可能与base_config: ['configs/acoustic.yaml'],有关，并且我在acoustic_preparation.ipynb中填写的训练集在tensorboard中只出现了一条

数据集制作3.2 cell 2

把beam后的数字调大，从100试到了10000还是报错
log导出能找到出错音频文件名字，改完后重试每次导出的log的文件名都不一样

Fix rhythm (CV to VC)

See issue: MoonInTheRiver#60

Notes below C3 are breathy

Hello, I have trained a model. It sounds amazing on notes above C3, but below C3 are very breathy. When I checked on tensorboard, it seems the ground truth has become breathy as well. The original wav files are not. I think there's something wrong on the binarization code or the vocoder.

New phone set & phoneme system

Survey at: https://www.wenjuan.com/s/UZBZJvbQJU/

Issue exporting to ONNX

I'm getting an issue exporting my ckpt file to ONNX format. When I got this error before, I used oxygen-dioxide's ONNX export instead, and it worked, but now neither of them are working. The error is as follows:

Running ONNX simplifier...
Traceback (most recent call last):
  File "onnx/export/export_acoustic.py", line 1292, in <module>
    fix(diff_model_path, diff_model_path)
  File "onnx/export/export_acoustic.py", line 836, in fix
    model, check = onnxsim.simplify(model, include_subgraph=True)
  File "/home/kei/miniconda3/envs/diffsinger2/lib/python3.8/site-packages/onnxsim/onnx_simplifier.py", line 186, in simplify
    model_opt_bytes = C.simplify(
onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] (op_type:Add, node name: Add_1384): [TypeInferenceError] Inferred elem type differs from existing elem type: (DOUBLE) vs (FLOAT)

Acoustic model speed-up and tests

关于训练数据的准备

请问数据在训练前要如何准备呢（想训练具有某人音色的歌姬的话），可否出一些教程呢？万分感谢！
都是国人就不用英文了[doge]

About Training Epochs error, valid, save model

how to fix thí error?
bruh
Siincylit — Hôm nay lúc 05:50
the old dataset i was set the wav file is 1.wav, 2.wav,.v.v. but it doesn save any checkpoints to my computer, so this new dataset, i was set file wav name is media_1.wav, media_2.wav and set testprefix is 'media' and train in colab, but i meet this 'bug'
and when i binarize the dataset its only process the phonemes distribution and the train, valid and test its only print "0 it/s [0/0it]"

this is my config
`base_config:

configs/basics/fs2.yaml

task_cls: src.naive_task.NaiveTask
datasets: [
'HaKhanhPhuong',
]
num_spk: 1
test_prefixes: [
'media',
'1',
'2',
'3'
]
test_num: 0
valid_num: 0

vocoder: NsfHifiGAN
vocoder_ckpt: checkpoints/nsf_hifigan/model
use_nsf: true
audio_sample_rate: 44100
audio_num_mel_bins: 128
hop_size: 512 # Hop size.
fft_size: 2048 # FFT size.
win_size: 2048 # FFT size.
fmin: 40
fmax: 16000
min_level_db: -120

binarization_args:
with_wav: false
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/HaKhanhPhuong/segments'
processed_data_dir: ''
binary_data_dir: '/content/drive/MyDrive/binary/HaKhanhPhuong'
binarizer_cls: data_gen.acoustic.AcousticBinarizer
g2p_dictionary: dictionaries/vi-strict.txt
pitch_extractor: 'parselmouth'
pitch_type: frame
content_cond_steps: [ ] # [ 0, 10000 ]
spk_cond_steps: [ ] # [ 0, 10000 ]
spec_min: [-5]
spec_max: [0]
keep_bins: 128
mel_loss: "ssim:0.5|l1:0.5"
mel_vmin: -6. #-6.
mel_vmax: 1.5
wav2spec_eps: 1e-6
save_f0: true

#pe_enable: true
#pe_ckpt: 'checkpoints/0102_xiaoma_pe'
max_frames: 8000
use_uv: false
use_midi: false
use_spk_embed: false
use_spk_id: false
use_gt_f0: false # for midi exp
use_gt_dur: false # for further midi exp
f0_embed_type: discrete
lambda_f0: 0.0
lambda_uv: 0.0
lambda_energy: 0.0
lambda_ph_dur: 0.0
lambda_sent_dur: 0.0
lambda_word_dur: 0.0
predictor_grad: 0.0

K_step: 1000
timesteps: 1000
max_beta: 0.02
predictor_layers: 5
rel_pos: true
gaussian_start: true
pndm_speedup: 10
hidden_size: 256
residual_layers: 20
residual_channels: 384
dilation_cycle_length: 4 # *
diff_decoder_type: 'wavenet'
diff_loss_type: l2
schedule_type: 'linear'

gen_tgt_spk_id: -1
num_sanity_val_steps: 1
lr: 0.0004
decay_steps: 50000
max_tokens: 80000
max_sentences: 9
val_check_interval: 2000
num_valid_plots: 10
max_updates: 320000
permanent_ckpt_start: 120000
permanent_ckpt_interval: 40000`

Support energy and breathiness control

RMS is used to represent energy
RMS of aperiodic parts is used to represent breathiness

i wanna use "forced automatic pitch modem" without F0 truth

however, following crashed

DiffSinger/inference/ds_cascade.py

Line 267 in d9f66c7

is_slur=sample['is_slur'], mel2ph=sample['mel2ph'], f0=sample['log2f0'],

File "openvpi/DiffSinger/modules/naive_frontend/encoder.py", line 72, in forward
delta_l = nframes - f0.size(1)
AttributeError: 'NoneType' object has no attribute 'size'

seems it stiil use input f0 as model param.

Tracing: deep refactoring of codebase and migration to new PyTorch & Lightning frameworks

Task list

Clean up useless codes
Re-organize the whole repository
Migrate to PyTorch 2.0 and Lightning 2.0: #72
Performance tuning and troubleshooting
Re-implement ONNX exporting scripts: #75
Update docs and README: #77, #86, #106
Move dataset making pipelines to new place

Re-implementation of the whole pipeline

你好, pipelines/asserts/*.lab 文件格式是什么样的

歌词中的标点符号怎么处理

Segmented synthesis

New vocoder with 44.1kHz sample rate

Alternatives:

iSTFTNet (https://github.com/rishikksh20/iSTFTNet-pytorch)
SawSing (https://github.com/YatingMusic/ddsp-singing-vocoders)
FastDiff (https://github.com/Rongjiehuang/FastDiff)

Suggestions for other papers/repositories are welcomed.

MFA training

添加自訂語言方法

想請問一下該如何自定義語言，畢竟被綑在中文其實有點讓人難以接觸

训练时hifigan model file is not found

Traceback (most recent call last):File "run. py"， line 15,in run_ task ( .
File "run.py"， line 11， in run_tasktask_cls.start
File "D:\DiffSinger-refactor\basics\base_task.py", line 199,in starttask = cls(
File "D:\DiffSinger-refactor\srcInaive_task.py"， line 8, in _initsuper (NaiveTask,self).init__O
File"D: \DiffSinger-refactor\srcldiffsinger_task.py"，line 271,in init_super (DiffSingerMIDITask,self).init(
File"D: \DiffSinger-refactor\lsrc\diffsinger_task.py"，line 32，in _init_super (DiffSingerTask,self)._init
File"D:\DiffSinger-refactor\src\diffspeech_task. py", line 21, in init_self.vocoder: BaseVocoder = get_vocoder_cls(hparams)()
File "D: DiffSinger-refactor \src\vocodersInsf hifigan. py"，line 18, in, _init_assert os.path.exists(model_path)，'HifiGAN model file is not found!'
AssertionError: HifiGAN model file is not found !

使用ddsp

您好啊~

我根据这个ddps生成了jit文件，但是调用的时候报错：

感觉是jit文件没有保存好。保存的时候我使用的是pytorch 1.8.2(同本程序的教程)

期待您的回复

Fusion algorithm for timbres and styles

Rhythmizers for other languages

Hi again, I was wondering if there were any documentation regarding the rhythmizers, as I would like to train them for Japanese...

Side question, would a rhythmizer work on CVVC languages such as English or Polish?

教程链接失效了

中文教程

切片时，两段歌声之间会留存超长的无声片段

比如切片结果里，有一段15秒的切片，0~~4秒是有歌声的，4~~11秒是静音，然后11~15秒又是有歌声的，请问slicer中的参数该如何修改，能让切片更精准。

目前参数：
db_threshold_ = -7.5, min_length_ = 5000, win_l_ = 400, win_s_ = 20, max_silence_kept_ = 100

AttributeError: 'DiffSingerE2EInfer' object has no attribute 'spk_map'

file base_svs_infer.py. Class BaseSVSInfer has no spk_map singletone if use_spk_id is false.

it cause me a error on d2_e2e.py

crash when running variance branch

DiffSinger/modules/toplevel.py

Line 71 in 30e26d4

self.predict_dur = hparams['predict_dur']

Exception has occurred: KeyError
'predict_dur'

hparams has no key 'predict_dur'

I wonder where the corresponding midi-version ckpt file and config.yaml could be downloaded?

Support phoneme mode

Feature proposal: finetuning from checkpoint

This proposed feature allows finetuning specific parameters from a given checkpoint. If finetuning is enabled, the trainer loads all matched parameters from the source ckpt when a new training task is started.

Need 4 new configuration keys:

finetune_enabled: whether finetuning is enabled
finetune_ckpt: path to the ckpt to be finetuned
finetune_ignored_params: params (name prefixes) to be dropped from the ckpt before finetuning starts
finetune_strict_shapes: if set to True, raise errors when tensor shapes mismatch; otherwise skip the mismatching parameters

Training data augmentation

Mainly aimed to expand the vocal range

mfa提示 There were no files found for this corpus

跟着pipelines中的那个笔记本进行操作，前面的步骤都没有报错，到了对齐这一步就不行了
输出

Pretrained model already exists.

Run the following command in your terminal manually if it fails here:
mfa align pipelines/segments dictionaries/opencpop-extension.txt pipelines/assets/mfa-opencpop-extension.zip pipelines/textgrids --beam 100 --clean --overwrite
INFO - Setting up corpus information...
INFO - Loading corpus from source files...
  0%|                                                   | 0/100 [00:02<?, ?it/s]
INFO - Stopped parsing early (0.019270027000000134 seconds)
ERROR - There was an error in the run, please see the log.
CorpusError:

  There were no files found for this corpus. Please validate the corpus.

请问一下这种情况是我的数据集有问题吗？我该怎么解决？

如何推理variance模型？

--predict这个参数填什么
看起来貌似是预测模型的预测参数
填pitch会报错

| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, augmentation_args: {'fixed_pitch_shifting': {'enabled': True, 'scale': 0.75, 'targets': [-5.0, 5.0]}, 'random_pitch_shifting': {'enabled': False, 'range': [-5.0, 5.0], 'scale': 1.0}, 'random_time_stretching': {'domain': 'log', 'enabled': True, 'range': [0.65, 2.0], 'scale': 2.0}},
base_config: [], binarization_args: {'num_workers': 0, 'shuffle': True}, binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer, binary_data_dir: data/liuchan_23.06.26/binary, breathiness_smooth_width: 0.12,
clip_grad_norm: 1, dataloader_prefetch_factor: 2, ddp_backend: nccl, dictionary: dictionaries/opencpop-extension.txt, diff_accelerator: ddim,
diff_decoder_type: wavenet, diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4,
enc_ffn_kernel_size: 9, enc_layers: 4, energy_smooth_width: 0.12, exp_name: 0627_liuchan_ds1000_23.06.26, f0_embed_type: continuous,
ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40,
hidden_size: 256, hop_size: 512, infer: True, interp_uv: True, log_interval: 100,
lr_scheduler_args: {'gamma': 0.5, 'scheduler_cls': 'torch.optim.lr_scheduler.StepLR', 'step_size': 52500, 'warmup_steps': 2000}, max_batch_frames: 80000, max_batch_size: 14, max_beta: 0.02, max_updates: 420000,
max_val_batch_frames: 60000, max_val_batch_size: 1, mel_vmax: 1.5, mel_vmin: -6.0, num_ckpt_keep: 2,
num_heads: 2, num_pad_tokens: 1, num_sanity_val_steps: 1, num_spk: 3, num_valid_plots: 10,
optimizer_args: {'beta1': 0.9, 'beta2': 0.98, 'lr': 0.00035, 'optimizer_cls': 'torch.optim.AdamW', 'weight_decay': 0}, permanent_ckpt_interval: 42000, permanent_ckpt_start: 120000, pl_trainer_accelerator: auto, pl_trainer_devices: auto,
pl_trainer_num_nodes: 1, pl_trainer_precision: 32-true, pl_trainer_strategy: auto, pndm_speedup: 10, raw_data_dir: ['data/liuchan_23.06.26/raw'],
rel_pos: True, residual_channels: 512, residual_layers: 20, sampler_frame_count_grid: 6, save_codes: ['configs', 'modules', 'training', 'utils'],
schedule_type: linear, seed: 1234, sort_by_len: True, speakers: ['liuchan'], spec_max: [0],
spec_min: [-5], spk_ids: [], task_cls: training.acoustic_task.AcousticTask, test_prefixes: ['p_1_jz yq_(Vocals)_1_cq_185', 'p_1_jz yq_(Vocals)_1_cq_208', 'p_1_jz yq_(Vocals)_1_cq_280', 'p_1_jz yq_(Vocals)_2_cq_211', 'p_1_jz yq_(Vocals)_3_cq_215', 'p_1_jz yq_(Vocals)_4_cq_146', 'p_1_jz yq_(Vocals)_4_cq_270', 'p_1_jz yq_(Vocals)_4_cq_271', 'p_1_jz yq_(Vocals)_6_cq_194', 'sample2_-4key_liuchan_0.5_sovdiff_1'], timesteps: 1000,
train_set_name: train, use_breathiness_embed: False, use_energy_embed: False, use_key_shift_embed: False, use_pos_embed: True,
use_speed_embed: True, use_spk_id: True, val_check_interval: 3000, val_with_vocoder: True, valid_set_name: valid,
vocoder: NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/model, win_size: 2048, work_dir: checkpoints\0627_liuchan_ds1000_23.06.26,
', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
Traceback (most recent call last):
  File ".\scripts\infer.py", line 218, in <module>
    main()
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File ".\scripts\infer.py", line 208, in variance
    infer_ins = DiffSingerVarianceInfer(ckpt_steps=ckpt, predictions=set(predict))
  File "E:\DiffSinger\inference\ds_variance.py", line 42, in __init__
    self.model: DiffSingerVariance = self.build_model(ckpt_steps=ckpt_steps)
  File "E:\DiffSinger\inference\ds_variance.py", line 67, in build_model
    model = DiffSingerVariance(
  File "E:\DiffSinger\modules\toplevel.py", line 71, in __init__
    self.predict_dur = hparams['predict_dur']
KeyError: 'predict_dur'```
填dur也不行
另外问一下导出成onnx模型后openutau支持这个预测吗

crash runing variance branch

DiffSinger/basics/base_svs_infer.py

Line 34 in 30e26d4

def preprocess_input(self, param: dict, idx=0) -> dict[str, torch.Tensor]:

following error: