yxlllc / ddsp-svc


Real-time end-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing)

License: MIT License

Python 100.00%

pytorch

ddsp-svc's People

Contributors

Stargazers

Watchers

Forkers

huanlinoto maxmax2016 ishine erythrocyte3803 leedaga shaun95 nakatsusizuru kakaruhayate 6bpencle ai-jie01 justinjohn0306 veneno1213822 tarepan hongwen-sun xcyoloxcy origamistationery chunping-xt entropyriser bohemian-self fatinghenji williamrbt08 f0restw0w hexiaozhidi tps-f ddpn08 wow55qq l4ph aiczk ondorela cardroid objectin wlsdml1114 ms903x1 jigokuai muruganr96 pewpew11111 magic-akari childbird katakk decem6er petyin nakamotojp xuyaqin imiskolee kimzuo ryuanerin rygtx tam9life shoorito aylitat bee10301 yi-shi-qy kaking-and-keqing whitefu fujohnwang sdlibowen therealkamisama hironow tylorshine kkpan11 riteahoo ifgcguitarclub fox2011622 estrangement-29 umoufuton likeboo algacez shy19960813 dmql98 12si27 narusemioshirakana asksasasa83 juo2 rrrmannn yida-9527 garjune xuanqingdao lafi2333 trueto w4l6 ylzz1997 yingaiyong aron-prc neomindstd drinkwang bei123 raffaelelu jjxjun qsily anycall blockchain-pro panzy2023 hanshifu2023 beo202202 chinabiubiubiu scottsln nullnan2023 kennyhuangml100 jonnyboyhc f901107

ddsp-svc's Issues

Training

How long does it take usually to train a voice model? I have 1 hour wav file, GTX 1660 super.

Error during inference

Command used: python main.py -i G:\DDSP-SVC\samples\source.wav -m G:\DDSP-SVC\exp\combsub-test\model_68000.pt -o G:\DDSP-SVC\test.wav --enhancer_adaptive_key 0 -id 1 -k 0 -e true -pe crepe
Log:

 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] G:\DDSP-SVC\exp\combsub-test\model_68000.pt
Pitch extractor type: crepe
Extracting the pitch curve of the input audio...
Extracting the volume envelope of the input audio...
 [Encoder Model] Content Vec
 [Loading] pretrain/hubert/checkpoint_best_legacy_500.pt
2023-04-01 22:15:12 | INFO | fairseq.tasks.hubert_pretraining | current directory is G:\DDSP-SVC
2023-04-01 22:15:12 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-04-01 22:15:12 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Enhancer type: nsf-hifigan
| Load HifiGAN:  pretrain/nsf_hifigan/model
Traceback (most recent call last):
  File "main.py", line 197, in <module>
    enhancer = Enhancer(args.enhancer.type, args.enhancer.ckpt, device=device)
  File "G:\DDSP-SVC\enhancer.py", line 15, in __init__
    self.enhancer = NsfHifiGAN(enhancer_ckpt, device=self.device)
  File "G:\DDSP-SVC\enhancer.py", line 85, in __init__
    self.model, self.h = load_model(model_path, device=self.device)
  File "G:\DDSP-SVC\nsf_hifigan\models.py", line 17, in load_model
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'pretrain/nsf_hifigan\\config.json'

The error log mentions pretrain/nsf_hifigan\\config.json, but I don't know how this file should be generated or where to download it. How can I resolve this?
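
The path in the error suggests the loader looks for config.json in the same directory as the vocoder checkpoint (pretrain/nsf_hifigan/model). A minimal Python sketch of that lookup, assuming this is how the path is derived; expected_config_path and missing_files are hypothetical helpers, not part of DDSP-SVC:

```python
import os


def expected_config_path(model_path):
    """Return the config.json path that sits next to a checkpoint file.

    Assumption: the loader joins the checkpoint's directory with
    'config.json', which would explain the mixed separators in the error.
    """
    return os.path.join(os.path.dirname(model_path), "config.json")


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    model = "pretrain/nsf_hifigan/model"
    # Prints whichever of the two files cannot be found from the current directory.
    print(missing_files([model, expected_config_path(model)]))
```

If config.json appears in the printed list, it is worth checking whether the vocoder download was unpacked completely and into the pretrain/nsf_hifigan folder.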

Voice conversion fails from the command line

I used the command below to convert WAV files. The sound changes, but not to chino's voice. Did I miss anything? This is the command:

cd F:\sd-webui\DDSP\DDSP-SVC; ./runtime/Scripts/activate.bat; ./runtime/python.exe main.py -i F:\sd-webui\DDSP\test\test.wav -m exp/model_chino.pt -o F:\sd-webui\DDSP\test\chino.wav -k 0 -id 1 -e true -eak 0

output:
`
2023-04-17 10:05:36 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX

[DDSP Model] Combtooth Subtractive Synthesiser

[Loading] exp/model_chino.pt

Pitch extractor type: crepe

Extracting the pitch curve of the input audio...

Extracting the volume envelope of the input audio...

[Encoder Model] HuBERT Soft

[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt

Enhancer type: nsf-hifigan

| Load HifiGAN: pretrain/nsf_hifigan/model

Removing weight norm...

Speaker ID: 1

Cut the input audio into 2 slices

100%
`

This command fails too:
cd F:\sd-webui\DDSP\DDSP-SVC; ./runtime/Scripts/activate.bat; ./runtime/python.exe main.py -i F:\sd-webui\DDSP\test\test.wav -m exp/model_chino.pt -o F:\sd-webui\DDSP\test\chino.wav -k 0 -id 1 -e false

output:
`
2023-04-17 10:07:58 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX

[DDSP Model] Combtooth Subtractive Synthesiser

[Loading] exp/model_chino.pt

Pitch extractor type: crepe

Extracting the pitch curve of the input audio...

Extracting the volume envelope of the input audio...

[Encoder Model] HuBERT Soft

[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt

Enhancer type: none (using raw output of ddsp)

Speaker ID: 1

Cut the input audio into 2 slices

100%
`

Inference error: OSError('Model file not found: pretrain/checkpoint_best_legacy_500.pt') even though checkpoint_best_legacy_500.pt is in the correct location

Inference fails with OSError('Model file not found: pretrain/checkpoint_best_legacy_500.pt'). I have already placed checkpoint_best_legacy_500.pt in the correct location, but the error persists.
Error message:
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 488, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1431, in process_api
result = await self.call_function(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1109, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 706, in wrapper
response = f(*args, **kwargs)
File "E:\GIT\so-vits-svc\webUI.py", line 129, in modelAnalysis
raise gr.Error(e)
gradio.exceptions.Error: OSError('Model file not found: pretrain/checkpoint_best_legacy_500.pt')
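
The traceback above comes from E:\GIT\so-vits-svc\webUI.py rather than from DDSP-SVC, and the path in the message is relative, so it is resolved against the directory the program was started from. A minimal sketch of that resolution; resolve_from is a hypothetical helper used only for illustration:

```python
import os


def resolve_from(working_dir, relative_path):
    """Show where a relative path ends up for a given working directory."""
    return os.path.normpath(os.path.join(working_dir, relative_path))


if __name__ == "__main__":
    wanted = "pretrain/checkpoint_best_legacy_500.pt"
    # The process looks here, regardless of where the script file lives.
    print(resolve_from(os.getcwd(), wanted))
    print(os.path.isfile(wanted))
```

If the printed location is not where the checkpoint was placed (other logs on this page show it under pretrain/hubert/ and pretrain/contentvec/), moving the file or starting the program from the project root may resolve the error.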

Traceback

I saw the following traceback after uploading the vocal-only target audio and running inference with the two models I had just trained (DDSP model: 4000 steps, diffusion model: 6000 steps). How can I fix this?
Can anyone please help? Thanks!

Traceback (most recent call last):
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\routes.py", line 393, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 1111, in process_api
data = self.postprocess_data(fn_index, result["prediction"], state)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 1036, in postprocess_data
prediction_value = postprocess_update_dict(
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 432, in postprocess_update_dict
prediction_value["value"] = block.postprocess(prediction_value["value"])
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 2427, in postprocess
file_path = self.make_temp_copy_if_needed(y)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 245, in make_temp_copy_if_needed
temp_dir = self.hash_file(file_path)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 217, in hash_file
with open(file_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'output\1751_Vocal_WAV__1688874956.wav'

Samples

Hello author, thank you for the work. If you don't mind, I would like to listen to some samples, if any exist, ideally the best-quality ones this algorithm has produced.

Thanks in advance.

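Vocal range issue

If the vocal range of the training speaker does not match that of the target speaker, is there a good way to handle it?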

Opening gui.py reports sounddevice.PortAudioError: Error opening Stream: Illegal combination of I/O devices [PaErrorCode -9993]

python 3.8
cuda11.8
torch2.0.0
torchaudio 2.0.1
windows10
All the models are set up. The error is shown below; how should I resolve it?
PS D:\AI\Audio\DDSP-SVC\DDSP-SVC> D:\AI\Audio\DDSP-SVC\Python38\python.exe gui.py --help
2023-05-06 07:16:43 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
event: stop_vc
event: start_vc
input device:1:阵列麦克风 (AMD Audio Device) (MME)
output device:7:扬声器 (Realtek(R) Audio) (Windows DirectSound)
crossfade_time:0.04
buffer_num:4
samplerate:44100
block_time:0.3
prefix_pad_length:1.13
mix_mode:None
enhancer:True
using_cuda:True
[DDSP Model] Combtooth Subtractive Synthesiser
[Loading] exp\multi_speaker\model_300000.pt
[Encoder Model] HuBERT Soft
[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt
| Load HifiGAN: pretrain/nsf_hifigan/model
Removing weight norm...
Exception in thread Thread-1:
Traceback (most recent call last):
File "D:\AI\Audio\DDSP-SVC\Python38\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "D:\AI\Audio\DDSP-SVC\Python38\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "gui.py", line 370, in soundinput
with sd.Stream(callback=self.audio_callback, blocksize=self.block_frame, samplerate=self.config.samplerate,
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 1800, in __init__
_StreamBase.__init__(self, kind='duplex', wrap_callback='array',
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 898, in __init__
_check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 2747, in _check
raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening Stream: Illegal combination of I/O devices [PaErrorCode -9993]
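
The log shows the input device on MME and the output device on Windows DirectSound. PortAudio opens a duplex stream through a single host API, so mixing the two is a likely cause of error -9993. A minimal sketch of a pre-flight check, assuming device entries shaped like those returned by sounddevice.query_devices() (dictionaries with a 'hostapi' index); same_host_api and the device table are hypothetical:

```python
def same_host_api(devices, input_index, output_index):
    """Return True when both devices belong to the same host API.

    `devices` is a sequence of dicts with a 'hostapi' key, the shape
    sounddevice.query_devices() is assumed to return.
    """
    return devices[input_index]["hostapi"] == devices[output_index]["hostapi"]


if __name__ == "__main__":
    # Hypothetical device table: host API 0 stands for MME, 1 for DirectSound.
    table = [
        {"name": "Microphone (MME)", "hostapi": 0},
        {"name": "Speakers (Windows DirectSound)", "hostapi": 1},
        {"name": "Speakers (MME)", "hostapi": 0},
    ]
    print(same_host_api(table, 0, 1))
    print(same_host_api(table, 0, 2))
```

The corresponding manual fix is to select an input and an output in the GUI that report the same host API (both MME, or both DirectSound).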

I can't get it to work.

I tried to do it, but I get this.
Traceback (most recent call last):
File "D:\DDSP-SVC\train.py", line 94, in <module>
train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
File "D:\DDSP-SVC\solver.py", line 83, in train
for batch_idx, data in enumerate(loader_train):
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
data = self._next_data()
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1348, in _next_data
return self._process_data(data)
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1374, in _process_data
data.reraise()
File "D:\DDSP-SVC\venv\lib\site-packages\torch\_utils.py", line 665, in reraise
raise exception
RecursionError: Caught RecursionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\DDSP-SVC\data_loaders.py", line 187, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
File "D:\DDSP-SVC\data_loaders.py", line 187, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
File "D:\DDSP-SVC\data_loaders.py", line 187, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
[Previous line repeated 1988 more times]
File "D:\DDSP-SVC\data_loaders.py", line 186, in __getitem__
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison
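
The repeated frame at data_loaders.py line 187 suggests the dataset skips any clip shorter than waveform_sec + 0.1 seconds by calling itself on the next index. If every clip is too short, that recursion never terminates and Python hits its recursion limit. A minimal sketch of a bounded, iterative alternative, assuming that reading of the traceback; first_usable_index and its arguments are hypothetical names, not DDSP-SVC code:

```python
def first_usable_index(durations, start, waveform_sec, margin=0.1):
    """Find the first clip at or after `start` (wrapping) that is long enough.

    Mirrors the check seen in the traceback,
    duration < waveform_sec + 0.1, but gives up after one full pass.
    """
    n = len(durations)
    for offset in range(n):
        idx = (start + offset) % n
        if durations[idx] >= waveform_sec + margin:
            return idx
    raise ValueError(
        f"no clip is at least {waveform_sec + margin:.1f} s long; "
        "use longer clips or lower the training duration"
    )


if __name__ == "__main__":
    # Clip durations in seconds; only the last one exceeds 2 + 0.1 s.
    print(first_usable_index([1.0, 1.5, 3.2], start=0, waveform_sec=2))
```

The configuration quoted on this page notes that the training duration "must be less than the duration of the shortest audio clip", so the practical fix is usually on the data side: provide clips longer than the configured duration, or reduce that value.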

Could someone teach me how to read this chart?

I really don't know what loss value is the right one to train down to.

gui_diff.py -> ValueError: [x] Unknown Model: DiffusionNew

Ran on Python 3.8 (Windows 11) + CUDA 11.8 + torch 2.0.0 + torchaudio 2.0.1.
gui_diff.py closes itself on "start conversion". I ran it in PyCharm to see what happened:

Traceback (most recent call last):
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 580, in <module>
gui = GUI()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 229, in __init__
self.launcher() # start
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 311, in launcher
self.event_handler()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 335, in event_handler
self.start_vc()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 447, in start_vc
self.svc_model.update_model(self.config.checkpoint_path)
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 52, in update_model
self.model, self.args = load_model(model_path, device=self.device)
File "[...file name...]\DDSP-SVC-4.0\ddsp\vocoder.py", line 486, in load_model
raise ValueError(f" [x] Unknown Model: {args.model.type}")
ValueError: [x] Unknown Model: DiffusionNew

The model name "DiffusionNew" comes from https://github.com/yxlllc/DDSP-SVC/blob/master/train_diff.py and certainly shouldn't be read by
https://github.com/yxlllc/DDSP-SVC/blob/master/ddsp/vocoder.py

Does this mean I loaded the wrong model?
The model was downloaded from https://github.com/yxlllc/DDSP-SVC/releases/download/4.0/opencpop+kiritan.zip
and I selected .../DDSP-SVC-4.0/exp/diffusion-new-demo/model_200000.pt (I suppose choosing it is what loads it).
The demo itself works fine from the command line, but the GUI closes itself.

Failure at "audio_callback" in gui_diff.py preventing usage

Of my sound devices, it works fine with my USB headset, but attempting to use pipewire, default (which is a Pulse backend), or Jack results in different errors. I'm not convinced one (or all) of these aren't a sounddevice issue.

Still, the result is no audio with any device selections besides directly to my USB headset.

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5f96b6f70>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC

When using "pipewire":

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
Audio block passed.
Removing weight norm...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5801e1700>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC
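
Both logs fail on the same line: the callback writes a 2-channel array into an output buffer that has 64 channels, which appears to be what the selected ALSA default device exposes. A minimal NumPy sketch of a channel-agnostic write, assuming a mono block and an outdata buffer of shape (frames, channels); fill_output is a hypothetical helper and this is not the project's fix:

```python
import numpy as np


def fill_output(outdata, mono_block):
    """Copy a mono block into every channel of `outdata` in place.

    outdata:    float array of shape (frames, channels)
    mono_block: float array of shape (frames,)
    """
    # A (frames, 1) column broadcasts across any number of channels.
    outdata[:] = mono_block[:, None]


if __name__ == "__main__":
    out = np.zeros((4, 64), dtype=np.float32)
    fill_output(out, np.arange(4, dtype=np.float32))
    print(out.shape, out[:, 0], out[:, 63])
```

An alternative that leaves the callback untouched would be to request two channels explicitly when the stream is opened, provided the device accepts that.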

The last one, JACK, is the most baffling. It dies with SIGKILL, which I'm not issuing myself. I see no messages in the journalctl about it whatsoever, either, so I'm not sure what's causing it:

event: start_vc
input device:22:G733 Gaming Headset Mono (JACK Audio Connection Kit)
output device:25:G733 Gaming Headset Analog Stereo (JACK Audio Connection Kit)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt

Starting callback
Infering...
Audio block passed.
Killed
(venv) [doneill@galena DDSP-SVC]$

Console error "name 'f0_extractor' is not defined"

A while back I wrote a small helper Python script, and I recently updated it to fit a better folder structure.
For some reason I misspelled one of the pitch extraction methods and came across this issue in the console:

Looking at the line and the surrounding code here, it seems it's not using self.f0_extractor and thus doesn't print it properly
https://github.com/yxlllc/DDSP-SVC/blob/master/ddsp/vocoder.py#L136

Since it's a single line I thought I'd bring it to your attention since I don't think a PR for a single line would be necessary :)
(And even then it's in my opinion good to track this as a small issue anyway)

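How low does the loss need to go for good results?

When I train, I only get about 1 batch/s, so training for 100000 epochs takes too long. Also, when I use the pretrained model, the training loss barely decreases at all...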

ModuleNotFoundError: jax requires jaxlib to be installed. Windows 11

Hello. Recently, I've downgraded my CUDA to version 11.8(V11.8.89) and cuDNN to 8.6.0 for running another program. I'm not sure if this has caused the issue, but when I try to run DDSP, I encounter an error stating that 'jaxlib' is not installed, and thus I'm unable to use it.

Here are the commands I ran:

PS C:\Users\Pawn\DDSP-SVC> python main_diff.py -i input\audio.wav -ddsp exp\test\conbsub\model_90000.pt -diff exp\test\diff\model_15000.pt -o output\audio.wav -k 4 -id 1 -diffid 1 -speedup 1 -method dpm-solver -kstep 30
Traceback (most recent call last):
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\_src\lib\__init__.py", line 24, in <module>
import jaxlib as jaxlib
ModuleNotFoundError: No module named 'jaxlib'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\Pawn\DDSP-SVC\main_diff.py", line 12, in <module>
from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
File "C:\Users\Pawn\DDSP-SVC\ddsp\vocoder.py", line 10, in <module>
from transformers import HubertModel, Wav2Vec2FeatureExtractor
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\__init__.py", line 26, in <module>
from . import dependency_versions_check
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\dependency_versions_check.py", line 17, in <module>
from .utils.versions import require_version, require_version_core
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\__init__.py", line 30, in <module>
from .generic import (
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\generic.py", line 33, in <module>
import jax.numpy as jnp
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\__init__.py", line 35, in <module>
from jax import config as _config_module
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\config.py", line 17, in <module>
from jax._src.config import config # noqa: F401
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\_src\config.py", line 24, in <module>
from jax._src import lib
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\_src\lib\__init__.py", line 26, in <module>
raise ModuleNotFoundError(
ModuleNotFoundError: jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.

and I have already tried these:
C:\Users\Pawn>pip install --upgrade "jax"
Requirement already satisfied: jax in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (0.4.12)
Requirement already satisfied: ml-dtypes>=0.1.0 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (0.2.0)
Requirement already satisfied: numpy>=1.21 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (1.23.5)
Requirement already satisfied: opt-einsum in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (3.3.0)
Requirement already satisfied: scipy>=1.7 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (1.9.3)

C:\Users\Pawn>pip install --upgrade "jaxlib"
ERROR: Could not find a version that satisfies the requirement jaxlib (from versions: none)
ERROR: No matching distribution found for jaxlib

How can I resolve this issue?
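
The traceback shows transformers importing jax.numpy, which then fails because jax is installed without jaxlib. DDSP-SVC is a PyTorch project, so one option worth trying is removing the unused jax package (pip uninstall jax) rather than installing jaxlib. A minimal sketch that detects this half-installed state using only the standard library; jax_without_jaxlib is a hypothetical helper:

```python
import importlib.util


def jax_without_jaxlib(find_spec=importlib.util.find_spec):
    """Return True when jax is importable but jaxlib is missing."""
    has_jax = find_spec("jax") is not None
    has_jaxlib = find_spec("jaxlib") is not None
    return has_jax and not has_jaxlib


if __name__ == "__main__":
    if jax_without_jaxlib():
        print("jax is present without jaxlib; consider uninstalling jax")
    else:
        print("no half-installed jax detected")
```

The find_spec argument exists only so the check can be exercised without changing the environment; in normal use the function is called with no arguments.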

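What causes background noise in the inferred audio?

For some reason, audio inferred with the 46K model has background noise, like the static you hear on a radio. Does anyone know what the problem is, and how should I adjust for it?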

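[DDSP + Diff-SVC refactored version] How to use the pretrained vocoder to enhance the DDSP output and adapt it to a higher vocal range

As the title says: when using DDSP alone, setting enhancer_adaptive_key > 0 adapts the enhancer to a higher vocal range, but how can the vocal range be extended when using the diffusion model together with DDSP?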

Problem encountered during training

Traceback (most recent call last):
File "train_diff.py", line 86, in <module>
train(args, initial_global_step, model, optimizer, scheduler, vocoder, loader_train, loader_valid)
File "/home/csmxj/DDSP-SVC/diffusion/solver_new.py", line 188, in train
test_ddsp_loss, test_diff_loss = test(args, model, vocoder, loader_test, saver)
File "/home/csmxj/DDSP-SVC/diffusion/solver_new.py", line 25, in test
for bidx, data in enumerate(loader_test):
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in __getitem__
return self.__getitem__( (file_idx + 1) % len(self.paths))
[Previous line repeated 989 more times]
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 202, in __getitem__
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison

After each training interval a validation pass runs, and that is where it falls into the RecursionError.

Will cn_hubert be supported in the future to improve annotation?

Link: chinese_speech_pretrain (Chinese speech pretraining)

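Besides timbre conversion, when will it be possible to change the lyrics?

Thank you for your excellent work. Besides timbre conversion, how can the lyrics be changed?
Compared with sovits, adding the ability to modify lyrics could be a huge step forward. Thanks~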

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

I don't know what to do.
Please help.

Can I use a TPU?

If anyone has used one, I was wondering whether it's faster than the Tesla T4?

Multi gpu support?

Hi, how can I use more than one GPU for training models?

Question

Is there somewhere people share their trained models?

I wonder what enhancer_adaptive_key does

I'm not sure what enhancer_adaptive_key does or how it differs from the plain key option. After using it, the key of the original music stays the same, but the singer's tone seems to be the one that would be applied if the pitch were a little higher. Is this correct?
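
Assuming both options are expressed in semitones, a shift of k semitones corresponds to a frequency ratio of 2^(k/12). A small worked sketch of that arithmetic; the 800 Hz ceiling is an assumption borrowed from the f0_max comment ("about G5") in the configuration quoted on this page, and both function names are hypothetical:

```python
import math


def semitone_ratio(k):
    """Frequency ratio produced by shifting pitch by k semitones."""
    return 2.0 ** (k / 12.0)


def semitones_to_fit(peak_hz, ceiling_hz=800.0):
    """Smallest whole number of semitones that lowers peak_hz to the ceiling."""
    if peak_hz <= ceiling_hz:
        return 0
    return math.ceil(12.0 * math.log2(peak_hz / ceiling_hz))


if __name__ == "__main__":
    # Ratio for a 4-semitone shift, then the shift needed for a 1000 Hz peak.
    print(round(semitone_ratio(4), 3))
    print(semitones_to_fit(1000.0))
```

Under this reading, the key option transposes the melody itself, while enhancer_adaptive_key only changes the range the enhancer works in, so the output melody stays at the original pitch. That matches the behaviour described in the question but is not confirmed by anything on this page.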

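Question about saving diff models

Does interval_force_save mean keeping the models from the most recent 20k steps?
Currently only one diff model is kept; should this be changed to interval_force_save==0?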
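
One plausible reading of the option, not confirmed by anything on this page: a checkpoint is written at every validation interval, the previous one is deleted afterwards, and only steps that are multiples of interval_force_save are kept permanently. A minimal sketch of that retention rule; the deletion behaviour, the interval_val parameter, and the name kept_checkpoints are all assumptions:

```python
def kept_checkpoints(current_step, interval_val, interval_force_save):
    """List the checkpoint steps that would remain on disk at current_step.

    Assumed rule: every interval_val steps a checkpoint is saved; older
    ones are removed unless their step is a multiple of interval_force_save.
    """
    steps = range(interval_val, current_step + 1, interval_val)
    kept = [s for s in steps if s % interval_force_save == 0]
    # The most recent checkpoint is always present until the next save.
    latest = (current_step // interval_val) * interval_val
    if latest > 0 and latest not in kept:
        kept.append(latest)
    return kept


if __name__ == "__main__":
    print(kept_checkpoints(26000, interval_val=2000, interval_force_save=10000))
```

Under that assumption, seeing a single file early in training is expected until the step count first reaches a multiple of interval_force_save.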

Language

When I downloaded and installed, it was in Chinese (I think) and I couldn't change it to English

Error in the inference process

Traceback (most recent call last):
File "D:\mnt\0)DDSP-SVC\main.py", line 261, in <module>
seg_output, _, (s_h, s_n) = model(seg_units, seg_f0, seg_volume, spk_id = spk_id, spk_mix_dict = spk_mix_dict)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\mnt\0)DDSP-SVC\ddsp\vocoder.py", line 628, in forward
ctrls, hidden = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict, aug_shift=aug_shift)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\mnt\0)DDSP-SVC\ddsp\unit2control.py", line 78, in forward
x = self.stack(units.transpose(1,2)).transpose(1,2)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\container.py", line 215, in forward
input = module(input)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [256, 768, 3], expected input[1, 256, 450] to have 768 channels, but got 256 channels instead

Is there a way to fix this?
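One possible reading of the error above: for a 1-D convolution, a weight of size [256, 768, 3] means 256 output channels, 768 input channels and a kernel size of 3, while the input [1, 256, 450] is (batch, channels, length) with only 256 channels. That would be consistent with a checkpoint trained on 768-dimensional encoder features being fed 256-dimensional features at inference, but that cause is an assumption, not something the traceback confirms. The helper below is hypothetical (it is not part of DDSP-SVC) and only illustrates the shape rule.

```python
def conv1d_channel_mismatch(weight_shape, input_shape, groups=1):
    """Return (expected, got) input channels if they differ, else None.

    weight_shape: (out_channels, in_channels // groups, kernel_size)
    input_shape:  (batch, channels, length)
    """
    expected = weight_shape[1] * groups
    got = input_shape[1]
    return None if expected == got else (expected, got)


# Shapes copied from the RuntimeError in the traceback.
mismatch = conv1d_channel_mismatch((256, 768, 3), (1, 256, 450))
print(mismatch)
```

If the mismatch really comes from the encoder, comparing the encoder settings in the config saved with the checkpoint against the ones used at inference would be the first thing to check.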

Error when loading the base (pre-trained) model for training

PS G:\DDSP-SVC> python train.py -c configs/combsub.yaml
 > config: configs/combsub.yaml
 >    exp: exp/combsub-test
 [DDSP Model] Combtooth Subtractive Synthesiser
 [*] restoring model from exp/combsub-test\model_300000.pt
Traceback (most recent call last):
  File "train.py", line 68, in <module>
    initial_global_step, model, optimizer = utils.load_model(args.env.expdir, model, optimizer, device=args.device)
  File "G:\DDSP-SVC\logger\utils.py", line 119, in load_model
    model.load_state_dict(ckpt['model'])
  File "C:\Users\29099\.virtualenvs\DDSP-SVC-YOgpXN-h\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CombSubFast:
        Unexpected key(s) in state_dict: "unit2ctrl.spk_embed.weight".

Configuration file:

data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clip in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clip in it
model:
  type: 'CombSubFast'
  n_spk: 1 # max number of different speakers
enhancer:
    type: 'nsf-hifigan'
    ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 0 # If your cpu and gpu are both very strong, set to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Save Internal-Memory or Graphics-Memory if it is false, but may be slow
  cache_device: 'cuda' # Set to 'cuda' to cache the data into the Graphics-Memory, fastest speed for strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
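The error says the checkpoint contains `unit2ctrl.spk_embed.weight`, which the freshly built model does not have. One plausible cause, stated here as an assumption, is that the checkpoint was trained with a multi-speaker setting while this config has `n_spk: 1`. A minimal sketch for comparing the two key sets follows; in practice the keys would come from `ckpt['model'].keys()` and `model.state_dict().keys()`, and the `layer.weight` key below is a made-up placeholder.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Report keys present on only one side of a state_dict load."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    return {
        "unexpected": sorted(ckpt - model),  # in the checkpoint, not in the model
        "missing": sorted(model - ckpt),     # in the model, not in the checkpoint
    }


# Placeholder key names, except the one quoted in the error message.
checkpoint_keys = ["layer.weight", "unit2ctrl.spk_embed.weight"]
model_keys = ["layer.weight"]

report = diff_state_dict_keys(checkpoint_keys, model_keys)
print(report)
```

Listing the differences before calling `load_state_dict` makes it easier to tell whether the config or the checkpoint is the side that needs to change.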

Solution to serious memory leaks in preprocessing under Linux

Please use one of the following commands to force PyTorch to update to the nightly version:

cu118: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall
cu121: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --force-reinstall

Is there a pre-trained model for finetuning?

Hi, as the title says, I was wondering whether a pre-trained model is available somewhere that could be fine-tuned to get better results more quickly.

Questions about preprocessing methods

Is there a big difference between using the Combtooth subtractive synthesizer method and the Sinusoids additive synthesizer method in the preprocessing process? If so, which one produces better results?

CUDA out of Memory Error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.66 GiB (GPU 0; 8.00 GiB total capacity; 809.64 MiB already allocated; 5.02 GiB free; 1.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for the wonderful program! I'm running into this problem, though. Could you tell me how to fix it?
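Reading the numbers in the message can help narrow this down: a single request of 7.66 GiB is close to the card's entire 8.00 GiB, so the request exceeds the free memory on its own and the fragmentation hint in the message is unlikely to be the main issue. The sketch below parses the message with a regular expression; the function name is hypothetical, and what actually triggers such a large allocation (for example, a very long input processed in one piece) is not known from the report.

```python
import re

OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 7.66 GiB "
    "(GPU 0; 8.00 GiB total capacity; 809.64 MiB already allocated; "
    "5.02 GiB free; 1.18 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message):
    """Extract requested, free and total memory (in GiB) from the message."""
    def grab(pattern):
        value, unit = re.search(pattern, message).groups()
        return float(value) * UNIT_TO_GIB[unit]

    return {
        "requested": grab(r"Tried to allocate ([\d.]+) (MiB|GiB)"),
        "free": grab(r"([\d.]+) (MiB|GiB) free"),
        "total": grab(r"([\d.]+) (MiB|GiB) total capacity"),
    }


info = parse_oom(OOM_MESSAGE)
shortfall = info["requested"] - info["free"]
print(f"requested {info['requested']:.2f} GiB, "
      f"free {info['free']:.2f} GiB, short by {shortfall:.2f} GiB")
```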

Error while starting training

File "D:\New folder (4)\DDSP-SVC\diffusion\data_loaders.py", line 202, in __getitem__
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison
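The failing line compares each clip's duration with `self.waveform_sec + 0.1`. Assuming the loader skips a too-short clip by calling itself on another file (an assumption about the code, not something the traceback shows), the recursion would never end if every clip in the dataset is shorter than that threshold. The sketch below lists the clips that would fail the check; the function name, file names and durations are hypothetical.

```python
def too_short_clips(durations, waveform_sec, margin=0.1):
    """Return the names of clips shorter than waveform_sec + margin seconds."""
    threshold = waveform_sec + margin
    return sorted(name for name, sec in durations.items() if sec < threshold)


# Made-up durations in seconds, checked against a 2-second training window.
durations = {"a.wav": 1.5, "b.wav": 2.05, "c.wav": 3.0}
short = too_short_clips(durations, waveform_sec=2)
print(short)

if len(short) == len(durations):
    print("every clip is too short for this training duration")
```

If that is the cause, either using longer clips or lowering the training duration in the config should let the loader find a usable clip.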

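Question

I'd like to ask why the crepe f0 extractor keeps failing when I run preprocessing with DDSP: RuntimeError: cuFFT error: CUFFT_INVALID_SIZE

I'm using the all-in-one package from the Bilibili uploader 羽毛布球.

I have 4 GB of VRAM.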

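Training has no effect

I trained with version 4.0 to 10k and to 100k and compared the results: the converted audio shows no difference between the two, and both are very far from the target timbre.

Both runs used the default configuration and the default pre-trained model, with no changes.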

Questions about inference

When I run inference with the model I trained, a lot of the original source voice seems to remain in the output. Can I increase the proportion of the voice I trained? (The same function as add_noise_step in the diff-svc model.)

Is there a way to finetune a pre-existing model?

NaN loss every time I train

Traceback (most recent call last):
File "C:\Users\bencj\Desktop\DDSP\DDSP-SVC-master\train.py", line 92, in <module>
train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
File "C:\Users\bencj\Desktop\DDSP\DDSP-SVC-master\solver.py", line 100, in train
raise ValueError(' [x] nan loss ')
ValueError: [x] nan loss

Do you know how to resolve this issue?

The webUI suddenly won't open and keeps showing "loading"

It ran normally for about a week and worked fine the whole time (successful training and successful inference), but an hour ago the webUI suddenly stopped opening and just keeps showing "loading". Since then I have tried many different browsers, disabled all browser plugins, closed all other software including the antivirus, and restarted the computer many times, all to no avail.

Meaning of DDSP GUI options

What do kstep and phase vocoder mean? Are they for accent control?

Which Units_Encoder is preferable?

Hi @yxlllc, great work. Thank you.

'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768' or 'contentvec768l12'

Which Units_Encoder is preferable for balancing content information loss against timbre leakage?

Quick-start Colab (no longer updated)

ImportError: Not enough memory resources are available to process this command

❯ python train_diff.py -c configs/diffusion-new.yaml
Traceback (most recent call last):
  File "train_diff.py", line 7, in <module>
    from diffusion.vocoder import Vocoder, Unit2Mel, Unit2Wav
  File "D:\AI\DDSP-SVC\diffusion\vocoder.py", line 11, in <module>
    from ddsp.vocoder import CombSubFast
  File "D:\AI\DDSP-SVC\ddsp\vocoder.py", line 7, in <module>
    import parselmouth
ImportError: DLL load failed while importing parselmouth: Not enough memory resources are available to process this command.

But I still have 20 GB of free memory?

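Real-time inference seems to have some problems

When it runs there is no sound at all, and for audio input/output I can only select the Windows Direct devices; any other choice gets stuck at loading the model.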

DDSP-SVC Question Area

FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

Hi,
First, I got the error ValueError: [x] Unknown Model: DiffusionNew. After reading your solution, I left the model path on the left side of the GUI empty, and I read and saved the config file in the exp/diffusion-test directory in the GUI. When I press start conversion, I see this error: FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'
I checked, and config.yaml is in the exp/diffusion-test directory.
Please let me know what to do. Thanks.
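A possible explanation, offered as an assumption about how the GUI resolves the file rather than a confirmed diagnosis: if config.yaml is looked up in the same directory as the model checkpoint, an empty model path yields an empty directory, and the lookup falls back to a bare 'config.yaml' in the current working directory. The function name and the checkpoint file name below are hypothetical.

```python
import os


def config_path_for(model_path):
    """Build the config path expected next to a model checkpoint."""
    model_dir = os.path.split(model_path)[0]
    return os.path.join(model_dir, "config.yaml")


# An empty model path leaves only the bare file name.
print(config_path_for(""))

# A checkpoint inside the experiment directory points at that directory.
print(config_path_for(os.path.join("exp", "diffusion-test", "model_0.pt")))
```

Under that assumption, pointing the model path at a checkpoint inside exp/diffusion-test would make the lookup land on the config.yaml that is already there.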

This error occurs every 2000 steps during DDSP-SVC training

This error occurs every 2000 steps during DDSP-SVC training.
I had reinstalled torch and xformers.
Could that be the cause of the error?

What should I do?

Is there a pre-trained model for ContentVec encoder?

Thanks for releasing the pretrained model for DDSP training, but the model seems to only be applicable to the Hubertsoft encoder. I would like to ask if there are any pre-trained models based on ContentVec(768layer12). If not, are there any plans to release such models in the future?