plachtaa / vits-fast-fine-tuning Goto Github PK

This repo is a pipeline of VITS finetuning for fast speaker adaptation TTS, and many-to-many voice conversion

License: Apache License 2.0

Python 99.39% Cython 0.61%

vits-fast-fine-tuning's Introduction

VITS Fast Fine-tuning

This repo will guide you to add your own character voices, or even your own voice, into existing VITS TTS model to make it able to do the following tasks in less than 1 hour:

Many-to-many voice conversion between any characters you added & preset characters in the model.
English, Japanese & Chinese Text-to-Speech synthesis with the characters you added & preset characters

Welcome to play around with the base models!
Chinese & English & Japanese： Author: Me

Chinese & Japanese： Author: SayaSS

Chinese only：(No running huggingface spaces) Author: Wwwwhy230825

Currently Supported Tasks:

Clone character voice from 10+ short audios
Clone character voice from long audio(s) >= 3 minutes (one audio should contain single speaker only)
Clone character voice from videos(s) >= 3 minutes (one video should contain single speaker only)
Clone character voice from BILIBILI video links (one video should contain single speaker only)

Currently Supported Characters for TTS & VC:

Any character you wish as long as you have their voices! (Note that voice conversion can only be conducted between any two speakers in the model)

Fine-tuning

See LOCAL.md for local training guide.
Alternatively, you can perform fine-tuning on Google Colab

How long does it take?

Install dependencies (3 min)
Choose pretrained model to start. The detailed differences between them are described in Colab Notebook
Upload the voice samples of the characters you wish to add，see DATA.MD for detailed uploading options.
Start fine-tuning. Time taken varies from 20 minutes ~ 2 hours, depending on the number of voices you uploaded.

Inference or Usage (Currently support Windows only)

Remember to download your fine-tuned model!
Download the latest release
Put your model & config file into the folder inference, which are named G_latest.pth and finetune_speaker.json, respectively.
The file structure should be as follows:

inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth

run inference.exe, the browser should pop up automatically.
Note: you must install ffmpeg to enable voice conversion feature.

Use in MoeGoe

Prepare downloaded model & config file, which are named G_latest.pth and moegoe_config.json, respectively.
Follow MoeGoe page instructions to install, configure path, and use.

Looking for help?

If you have any questions, please feel free to open an issue or join our Discord server.

vits-fast-fine-tuning's People

Contributors

Stargazers

Watchers

Forkers

shaun95 barleyj21 0x114514bb shiciki jiangzhiyu1016 great1001 ikaros-521 clyimeng kaillliu cj0596 qjmxjly justinjohn0306 vic233333 yutian-wang entn-at hufeihu techthiyanes moonmiracle jianyuyanyu lopezjuanma96 iraday ning-quantvortex oceans0423 droughtsoftware soulflarerc dumpmemory lwd-temp owwwo dreamleepku 412137794 lererun eminvita yuki1sntsnow aleixa3102018 ctf2023 eve2ptp quellamc elaina1919 tukinokage snowcold liangwh2001 lrioxh nekonabe-tokumori echoyey jackylee1 ylyayzbl daabaa jjandnn larthas atarasin kerwin-z-shadows lovemino monup165 zeroblock0 chh2000day allforwife leonard2z oxforevero zsxplace a-sunhw yagamihikari moyvyv233 rainbloo guxingan1 newaiguy meteor-2023 wujohns wuhu404 tocmike mzltest ayanamiroyal april727 ari-yan nuooos bobo-paopao goooice csyangwen nkxingxh magu-yu suhu777 dasudiy gjlsx heheboom beacon1096 notfoundnamelol baizhi958216 joeggg0401 ultraman-tiga-sky chiaoso jiange91 bychen-asm hakureiranun colijian mistletoe0628 z1141922290 cggpro qinglan2002 coryyi logsattuber suwanwa

vits-fast-fine-tuning's Issues

Chinese TTS generation error using latest release.

The console outputted:

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\MyPath\\inference\\jieba\\dict.txt'

It turned out to be fine after I manually add this file from the repository of Jieba.

To reproduce this error I deleted dict.txt and found that Jieba will use cache:

DEBUG:jieba:Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\0x114514BB\AppData\Local\Temp\jieba.cache
DEBUG:jieba:Loading model from cache C:\Users\0x114514BB\AppData\Local\Temp\jieba.cache
Loading model cost 0.657 seconds.
DEBUG:jieba:Loading model cost 0.657 seconds.

I reproduced this error after deleting jieba.cache. Please add jieba/dict.txt in the release pack.

Colab 步骤四报错，错误代码

进行步骤2，录音后保存，进行步骤2.5时似乎没有处理录音，进行步骤4后出现以下错误

请教一下为什么到训练这一步会报错

提示ZeroDivisionError: integer division or modulo by zero
之前成功过一次，后来再试就一直报这个错误

Error in step 3:超出数据范围，最长支持 16 位

ValueError Traceback (most recent call last)
/content/VITS-fast-fine-tuning/preprocess_v2.py in
67 if len(txt) > 150:
68 continue
---> 69 cleaned_text = text._clean_text(txt, hps['data']['text_cleaners'])
70 cleaned_text += "\n" if not cleaned_text.endswith("\n") else ""
71 cleaned_new_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text)

7 frames
/usr/local/lib/python3.8/dist-packages/cn2an/an2cn.py in __integer_convert(self, integer_data, mode)
154 len_integer_data = len(integer_data)
155 if len_integer_data > len(unit_list):
--> 156 raise ValueError(f"超出数据范围，最长支持 {len(unit_list)} 位")
157
158 output_an = ""

ValueError: 超出数据范围，最长支持 16 位

跑了25个epoch 声音非常糊

请问如果音源有一点点噪音（几段音源有声音很小的bgm）的话是不是就会这样？

STEP 5 Error rearrange_speaker

Hi ! I got an error on step 5 for rearrange_speaker
It's appear

rearrange_speaker.py:18: SyntaxWarning: list indices must be integers or slices, not str; perhaps you missed a comma?
old_emb_g = model_sd(['model']['emb_g.weight'])
Traceback (most recent call last):
File "rearrange_speaker.py", line 18, in
old_emb_g = model_sd(['model']['emb_g.weight'])
TypeError: list indices must be integers or slices, not str

关于训练过程中属性错误

作者您好，我在尝试训练的时候出现了如下错误。上面的部分都没有报错，但是如果不勾选的话就会出现bug，同时tensorborad的GUI也没有输出。这里我确认了一下预处理最后的那个文件 路径|说话人名|因素标注 的final_anno....train和val.txt文件是有输出的（两个文件是相同的），不知道哪里出了问题。

步骤4报错

萌新请教一下怎么判断训练情况

文档中提到** “根据声线和样本质量的不同，所需的训练epochs数也不同。但是一般建议设置为30 epochs。你也可以在Tensorboard中预览合成效果，若效果满意可提前停止。” **
我试着提取了150段短音频和3段长音频，30epochs 之后感觉差点意思，又试着跑了60 epochs，但是跑完了看那几个曲线都还在跳。网上的资料看完一知半解，想请教一下大佬怎么简要地判断模型已经训练完成或者过拟合了？
附我的例子

运行inference时报错

我下载了模型文件和配置文件并且都改名放在了同一目录下，确定没有中文和空格，但还是报错

It's possible to train VITS in other language?

Hi again!
I found Thai cleaner in the text folder so I have a question that possible to train the model in Thai or other language?

Thank!

STEP 4 line 279: local variable 'x' referenced before assignment

The end of output from STEP 3:

Detected language: ja
こんな便利なもの持ってたんだ
Detected language: ja
あの人はもう戦わなくていいって
Detected language: ja
今の私は 誰が何と言おうと
Downloading: "https://github.com/r9y9/open_jtalk/releases/download/v1.11.1/open_jtalk_dic_utf_8-1.11.tar.gz"
dic.tar.gz: 100% 22.6M/22.6M [00:01<00:00, 18.3MB/s]
Extracting tar file /usr/local/lib/python3.8/dist-packages/pyopenjtalk/dic.tar.gz
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.172 seconds.
DEBUG:jieba:Loading model cost 1.172 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
*** buffer overflow detected ***: terminated

The output of STEP 4:

Reusing TensorBoard on port 6006 (pid 37700), started 0:02:42 ago. (Use '!kill 37700' to kill it.)

INFO:OUTPUT_MODEL:{'train': {'log_interval': 100, 'eval_interval': 1000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 12, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'final_annotation_train.txt', 'validation_files': 'final_annotation_val.txt', 'text_cleaners': ['cjke_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 1001, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256}, 'symbols': ['_', ',', '.', '!', '?', '-', '~', '…', 'N', 'Q', 'a', 'b', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ɑ', 'æ', 'ʃ', 'ʑ', 'ç', 'ɯ', 'ɪ', 'ɔ', 'ɛ', 'ɹ', 'ð', 'ə', 'ɫ', 'ɥ', 'ɸ', 'ʊ', 'ɾ', 'ʒ', 'θ', 'β', 'ŋ', 'ɦ', '⁼', 'ʰ', '`', '^', '#', '*', '=', 'ˈ', 'ˌ', '→', '↓', '↑', ' '], 'speakers': {'特别周 Special Week (Umamusume Pretty Derby)': 0, '无声铃鹿 Silence Suzuka (Umamusume Pretty Derby)': 1, '东海帝王 Tokai Teio (Umamusume Pretty Derby)': 2, '丸善斯基 Maruzensky (Umamusume Pretty Derby)': 3, '富士奇迹 Fuji Kiseki (Umamusume Pretty Derby)': 4, '小栗帽 Oguri Cap (Umamusume Pretty Derby)': 5, '黄金船 Gold Ship (Umamusume Pretty Derby)': 6, '伏特加 Vodka (Umamusume Pretty Derby)': 7, '大和赤骥 Daiwa Scarlet (Umamusume Pretty Derby)': 8, '大树快车 Taiki Shuttle (Umamusume Pretty Derby)': 9, '草上飞 Grass Wonder (Umamusume Pretty Derby)': 10, '菱亚马逊 Hishi Amazon (Umamusume Pretty Derby)': 11, '目白麦昆 Mejiro Mcqueen (Umamusume Pretty Derby)': 12, '神鹰 El Condor Pasa (Umamusume Pretty Derby)': 13, '好歌剧 T.M. Opera O (Umamusume Pretty Derby)': 14, '成田白仁 Narita Brian (Umamusume Pretty Derby)': 15, '鲁道夫象征 Symboli Rudolf (Umamusume Pretty Derby)': 16, '气槽 Air Groove (Umamusume Pretty Derby)': 17, '爱丽数码 Agnes Digital (Umamusume Pretty Derby)': 18, '青云天空 Seiun Sky (Umamusume Pretty Derby)': 19, '玉藻十字 Tamamo Cross (Umamusume Pretty Derby)': 20, '美妙姿势 Fine Motion (Umamusume Pretty Derby)': 21, '琵琶晨光 Biwa Hayahide (Umamusume Pretty Derby)': 22, '重炮 Mayano Topgun (Umamusume Pretty Derby)': 23, '曼城茶座 Manhattan Cafe (Umamusume Pretty Derby)': 24, '美普波旁 Mihono Bourbon (Umamusume Pretty Derby)': 25, '目白雷恩 Mejiro Ryan (Umamusume Pretty Derby)': 26, '雪之美人 Yukino Bijin (Umamusume Pretty Derby)': 28, '米浴 Rice Shower (Umamusume Pretty Derby)': 29, '艾尼斯风神 Ines Fujin (Umamusume Pretty Derby)': 30, '爱丽速子 Agnes Tachyon (Umamusume Pretty Derby)': 31, '爱慕织姬 Admire Vega (Umamusume Pretty Derby)': 32, '稻荷一 Inari One (Umamusume Pretty Derby)': 33, '胜利奖券 Winning Ticket (Umamusume Pretty Derby)': 34, '空中神宫 Air Shakur (Umamusume Pretty Derby)': 35, '荣进闪耀 Eishin Flash (Umamusume Pretty Derby)': 36, '真机伶 Curren Chan (Umamusume Pretty Derby)': 37, '川上公主 Kawakami Princess (Umamusume Pretty Derby)': 38, '黄金城市 Gold City (Umamusume Pretty Derby)': 39, '樱花进王 Sakura Bakushin O (Umamusume Pretty Derby)': 40, '采珠 Seeking the Pearl (Umamusume Pretty Derby)': 41, '新光风 Shinko Windy (Umamusume Pretty Derby)': 42, '东商变革 Sweep Tosho (Umamusume Pretty Derby)': 43, '超级小溪 Super Creek (Umamusume Pretty Derby)': 44, '醒目飞鹰 Smart Falcon (Umamusume Pretty Derby)': 45, '荒漠英雄 Zenno Rob Roy (Umamusume Pretty Derby)': 46, '东瀛佐敦 Tosen Jordan (Umamusume Pretty Derby)': 47, '中山庆典 Nakayama Festa (Umamusume Pretty Derby)': 48, '成田大进 Narita Taishin (Umamusume Pretty Derby)': 49, '西野花 Nishino Flower (Umamusume Pretty Derby)': 50, '春乌拉拉 Haru Urara (Umamusume Pretty Derby)': 51, '青竹回忆 Bamboo Memory (Umamusume Pretty Derby)': 52, '待兼福来 Matikane Fukukitaru (Umamusume Pretty Derby)': 55, '名将怒涛 Meisho Doto (Umamusume Pretty Derby)': 57, '目白多伯 Mejiro Dober (Umamusume Pretty Derby)': 58, '优秀素质 Nice Nature (Umamusume Pretty Derby)': 59, '帝王光环 King Halo (Umamusume Pretty Derby)': 60, '待兼诗歌剧 Matikane Tannhauser (Umamusume Pretty Derby)': 61, '生野狄杜斯 Ikuno Dictus (Umamusume Pretty Derby)': 62, '目白善信 Mejiro Palmer (Umamusume Pretty Derby)': 63, '大拓太阳神 Daitaku Helios (Umamusume Pretty Derby)': 64, '双涡轮 Twin Turbo (Umamusume Pretty Derby)': 65, '里见光钻 Satono Diamond (Umamusume Pretty Derby)': 66, '北部玄驹 Kitasan Black (Umamusume Pretty Derby)': 67, '樱花千代王 Sakura Chiyono O (Umamusume Pretty Derby)': 68, '天狼星象征 Sirius Symboli (Umamusume Pretty Derby)': 69, '目白阿尔丹 Mejiro Ardan (Umamusume Pretty Derby)': 70, '八重无敌 Yaeno Muteki (Umamusume Pretty Derby)': 71, '鹤丸刚志 Tsurumaru Tsuyoshi (Umamusume Pretty Derby)': 72, '目白光明 Mejiro Bright (Umamusume Pretty Derby)': 73, '樱花桂冠 Sakura Laurel (Umamusume Pretty Derby)': 74, '成田路 Narita Top Road (Umamusume Pretty Derby)': 75, '也文摄辉 Yamanin Zephyr (Umamusume Pretty Derby)': 76, '真弓快车 Aston Machan (Umamusume Pretty Derby)': 80, '骏川手纲 Hayakawa Tazuna (Umamusume Pretty Derby)': 81, '小林历奇 Kopano Rickey (Umamusume Pretty Derby)': 83, '奇锐骏 Wonder Acute (Umamusume Pretty Derby)': 85, '秋川理事长 President Akikawa (Umamusume Pretty Derby)': 86, '綾地 寧々 Ayachi Nene (Sanoba Witch)': 87, '因幡 めぐる Inaba Meguru (Sanoba Witch)': 88, '椎葉 紬 Shiiba Tsumugi (Sanoba Witch)': 89, '仮屋 和奏 Kariya Wakama (Sanoba Witch)': 90, '戸隠 憧子 Togakushi Touko (Sanoba Witch)': 91, '九条裟罗 Kujou Sara (Genshin Impact)': 92, '芭芭拉 Barbara (Genshin Impact)': 93, '派蒙 Paimon (Genshin Impact)': 94, '荒泷一斗 Arataki Itto (Genshin Impact)': 96, '早柚 Sayu (Genshin Impact)': 97, '香菱 Xiangling (Genshin Impact)': 98, '神里绫华 Kamisato Ayaka (Genshin Impact)': 99, '重云 Chongyun (Genshin Impact)': 100, '流浪者 Wanderer (Genshin Impact)': 102, '优菈 Eula (Genshin Impact)': 103, '凝光 Ningguang (Genshin Impact)': 105, '钟离 Zhongli (Genshin Impact)': 106, '雷电将军 Raiden Shogun (Genshin Impact)': 107, '枫原万叶 Kaedehara Kazuha (Genshin Impact)': 108, '赛诺 Cyno (Genshin Impact)': 109, '诺艾尔 Noelle (Genshin Impact)': 112, '八重神子 Yae Miko (Genshin Impact)': 113, '凯亚 Kaeya (Genshin Impact)': 114, '魈 Xiao (Genshin Impact)': 115, '托马 Thoma (Genshin Impact)': 116, '可莉 Klee (Genshin Impact)': 117, '迪卢克 Diluc (Genshin Impact)': 120, '夜兰 Yelan (Genshin Impact)': 121, '鹿野院平藏 Shikanoin Heizou (Genshin Impact)': 123, '辛焱 Xinyan (Genshin Impact)': 124, '丽莎 Lisa (Genshin Impact)': 125, '云堇 Yun Jin (Genshin Impact)': 126, '坎蒂丝 Candace (Genshin Impact)': 127, '罗莎莉亚 Rosaria (Genshin Impact)': 128, '北斗 Beidou (Genshin Impact)': 129, '珊瑚宫心海 Sangonomiya Kokomi (Genshin Impact)': 132, '烟绯 Yanfei (Genshin Impact)': 133, '久岐忍 Kuki Shinobu (Genshin Impact)': 136, '宵宫 Yoimiya (Genshin Impact)': 139, '安柏 Amber (Genshin Impact)': 143, '迪奥娜 Diona (Genshin Impact)': 144, '班尼特 Bennett (Genshin Impact)': 146, '雷泽 Razor (Genshin Impact)': 147, '阿贝多 Albedo (Genshin Impact)': 151, '温迪 Venti (Genshin Impact)': 152, '空 Player Male (Genshin Impact)': 153, '神里绫人 Kamisato Ayato (Genshin Impact)': 154, '琴 Jean (Genshin Impact)': 155, '艾尔海森 Alhaitham (Genshin Impact)': 156, '莫娜 Mona (Genshin Impact)': 157, '妮露 Nilou (Genshin Impact)': 159, '胡桃 Hu Tao (Genshin Impact)': 160, '甘雨 Ganyu (Genshin Impact)': 161, '纳西妲 Nahida (Genshin Impact)': 162, '刻晴 Keqing (Genshin Impact)': 165, '荧 Player Female (Genshin Impact)': 169, '埃洛伊 Aloy (Genshin Impact)': 179, '柯莱 Collei (Genshin Impact)': 182, '多莉 Dori (Genshin Impact)': 184, '提纳里 Tighnari (Genshin Impact)': 186, '砂糖 Sucrose (Genshin Impact)': 188, '行秋 Xingqiu (Genshin Impact)': 190, '奥兹 Oz (Genshin Impact)': 193, '五郎 Gorou (Genshin Impact)': 198, '达达利亚 Tartalia (Genshin Impact)': 202, '七七 Qiqi (Genshin Impact)': 207, '申鹤 Shenhe (Genshin Impact)': 217, '莱依拉 Layla (Genshin Impact)': 228, '菲谢尔 Fishl (Genshin Impact)': 230, 'User': 999}, 'model_dir': '././OUTPUT_MODEL', 'max_epochs': 20}
2023-02-23 03:10:15.392600: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-02-23 03:10:16.901032: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-23 03:10:16.901213: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-23 03:10:16.901242: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
...
...
...
  0% 0/55 [00:34<?, ?it/s]
Traceback (most recent call last):
  File "finetune_speaker.py", line 320, in <module>
    main()
  File "finetune_speaker.py", line 55, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/VITS_voice_conversion/finetune_speaker.py", line 133, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/VITS_voice_conversion/finetune_speaker.py", line 241, in train_and_evaluate
    evaluate(hps, net_g, eval_loader, writer_eval)
  File "/content/VITS_voice_conversion/finetune_speaker.py", line 279, in evaluate
    y_hat, attn, mask, *_ = generator.module.infer(x, x_lengths, speakers, max_len=1000)
UnboundLocalError: local variable 'x' referenced before assignment

Colab finetune error

I've loaded the dataset into colab but got this error on step 4:

0% 0/55 [00:37<?, ?it/s]
Traceback (most recent call last):
File "finetune_speaker.py", line 320, in
main()
File "finetune_speaker.py", line 55, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in wrap
fn(i, *args)
File "/content/VITS_voice_conversion/finetune_speaker.py", line 133, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/content/VITS_voice_conversion/finetune_speaker.py", line 241, in train_and_evaluate
evaluate(hps, net_g, eval_loader, writer_eval)
File "/content/VITS_voice_conversion/finetune_speaker.py", line 279, in evaluate
y_hat, attn, mask, * = generator.module.infer(x, x_lengths, speakers, max_len=1000)
UnboundLocalError: local variable 'x' referenced before assignment

My dataset is 10 voices which include 10-30 10 second .mp3s.

下载安装后运行inference报错，请问要如何解决。

Traceback (most recent call last):File "inference.py"，line 86，in File "utils.py"，line 194in get_hparams_from_fileFileNotFoundError: [Errno 2] No such file or directory: ./finetune-speaker.json![23136] Failed to execute script iinference' due to unhandled exception!

步骤3报错

3.1没报错，但看下面运行好像出错，已经开了GPU加速，应该是出问题了，文件目录和格式应该没问题

3.2报错情况和issue26的情况一样，样本是60多条wav

请问如何替换pretrain models?

请问如何替换pretrain models?我在.../configs/modified_finetune_speaker.json中没有看到pretrain model的参数，是不是将“pretrain_models”文件夹中的的G_0.pth替换为自己的模型即可？是否要改名为G_0.pth?

BTW：我的数据集大概有10个小时（600MB的wav）的语音，训练了（带aux数据）30epoch之后声音很清晰，但是说出来的并不是地球话:rofl:。请问下这个量级大概需要多少epoch？将产出的G_lastest.pth替换上述的G_0.pth能否实现在之前的训练的基础上继续训练？

步骤三报错

使用视频链接上传，一直报错。也尝试过在谷歌云盘上传长音频，也报错，错误信息相同。

denoised_audio 和 /separated/htdemucs文件夹中无文件

使用txt文件:
characters.txt
报错信息:

100%|██████████| 59539/59539 [00:28<00:00, 2061.93it/s][MoviePy] Done.

[MoviePy] Writing audio in ./raw_audio/TeacherShen_62695.wav
100%|██████████| 47538/47538 [00:22<00:00, 2101.01it/s][MoviePy] Done.

[MoviePy] Writing audio in ./raw_audio/TeacherShen_881627.wav
100%|██████████| 67062/67062 [00:34<00:00, 1969.92it/s][MoviePy] Done.

Important: the default model was recently changed to `htdemucs` the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use `-n mdx_extra_q`.
Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100% 80.2M/80.2M [00:06<00:00, 13.3MB/s]
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/TeacherShen_881627.wav
100%|██████████████████████████████████████████████████████████████████████| 3042.0/3042.0 [02:25<00:00, 20.88seconds/s]
Killed
Important: the default model was recently changed to `htdemucs` the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use `-n mdx_extra_q`.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/TeacherShen_62695.wav
100%|████████████████████████████████████████████████████████████████████| 2158.65/2158.65 [01:45<00:00, 20.50seconds/s]
Killed
Important: the default model was recently changed to `htdemucs` the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use `-n mdx_extra_q`.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/TeacherShen_37232.wav
100%|██████████████████████████████████████████████████████████████████████| 2702.7/2702.7 [02:11<00:00, 20.55seconds/s]
Killed
Traceback (most recent call last):
  File "denoise_audio.py", line 12, in <module>
    wav, sr = torchaudio.load(f"./separated/htdemucs/{file}/vocals.wav", frame_offset=0, num_frames=-1, normalize=True,
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/backend/sox_io_backend.py", line 246, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/io/_compat.py", line 103, in load_audio
    s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError: Failed to open the input "./separated/htdemucs/TeacherShen_881627/vocals.wav" (No such file or directory).
2023-03-02 12:30:14.745081: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:30:18.161604: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-02 12:30:18.161791: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-02 12:30:18.161814: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Warning: no long audios & videos found, this IS expected if you have only uploaded short audios
this IS NOT expected if you have uploaded any long audios, videos or video links. Please check your file structure or make sure your audio/video language is supported.
2023-03-02 12:30:56.773756: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:30:57.789432: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-02 12:30:57.789562: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-02 12:30:57.789583: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Warning: no short audios found, this IS expected if you have only uploaded long audios, videos or video links.
this IS NOT expected if you have uploaded a zip file of short audios. Please check your file structure or make sure your audio language is supported.```

虽然不是在谷歌colab上运行的，但还是想问问

压缩包文件层级如图


报错

请问如何在原版vits中使用预训练的底模？

我在其它远程Linux服务器上无法正常部署该项目环境。所以我想直接使用该项目中已有的模型并在此基础上微调。但是配置文件似乎有点问题，一开始训练就会直接覆盖掉原本的预训练模型。
请问如何实现原本vits中的预训练模型微调部署？非常感谢！

请教一下，英文合成效果非常糟糕

英语合成出来几乎无法正常发音，是需要更换其他的底模吗？有推荐的么？

步骤4报错

内容为约1500条mp3格式短语音。
未勾选加入辅助训练数据。

错误信息如下：

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/VITS-fast-fine-tuning/finetune_speaker_v2.py", line 133, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/VITS-fast-fine-tuning/finetune_speaker_v2.py", line 153, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(tqdm(train_loader)):
  File "/usr/local/lib/python3.8/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "/content/VITS-fast-fine-tuning/data_utils.py", line 233, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero

您好，请问如何更换预训练模型（例如使用AIDATATANG）

【期望】Linux版的语音合成，提供API

期望如图

关于在其他云平台部署的问题

经过b站热心用户测试，只要下载通过colab生成的音频标注文件，就可以正常进行后续训练。
想问问标注文件的格式，以便通过其他方式进行生成

AttributeError: 'SynthesizerTrn' object has no attribute 'emb_g'

=============================================================
ni↓↑xɑʊ↓↑.
length:10
length:10
Traceback (most recent call last):
File "gradio\routes.py", line 380, in run_predict
event_id=event_id,
File "gradio\blocks.py", line 1018, in process_api
fn_index, inputs, iterator, request, event_id
File "gradio\blocks.py", line 836, in call_function
fn, *processed_input, limiter=self.limiter
File "anyio\to_thread.py", line 32, in run_sync
File "anyio_backends_asyncio.py", line 937, in run_sync_in_worker_thread
File "anyio_backends_asyncio.py", line 867, in run
File "inference.py", line 42, in tts_fn
File "models_infer.py", line 370, in infer
File "torch\nn\modules\module.py", line 1270, in getattr
type(self).name, name))
AttributeError: 'SynthesizerTrn' object has no attribute 'emb_g'

程序报错，colab上导出前是正常的，导出后到本地就报错

步骤4报错

步骤3：
Your-zip-file.zip(application/x-zip-compressed) - 9877593 bytes, last modified: 2023/2/24 - 100% done
Saving Your-zip-file.zip to Your-zip-file.zip
Archive: ./custom_character_voice/custom_character_voice.zip
creating: ./custom_character_voice/Your-zip-file/
creating: ./custom_character_voice/Your-zip-file/RPK16/
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_ALLHALLOWS_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_BREAK_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_BUILDOVER_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_COMBINE_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_DIALOGUE1_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_DIALOGUE3_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_FEED_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_FORMATION_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_GAIN_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_GOATTACK_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_HELLO_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_LOADING_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_MEET_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_NEWYEAR_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_OPERATIONBEGIN_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_OPERATIONOVER_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_RETREAT_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_TIP_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_VALENTINE_JP.wav
inflating: ./custom_character_voice/Your-zip-file/RPK16/RPK16_WIN_JP.wav
2023-02-24 00:46:22.335055: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 00:46:23.282110: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 00:46:23.282236: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 00:46:23.282258: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
finished

步骤4节选：
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in wrap
fn(i, *args)
File "/content/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/finetune_speaker.py", line 133, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/content/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/finetune_speaker.py", line 241, in train_and_evaluate
evaluate(hps, net_g, eval_loader, writer_eval)
File "/content/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/VITS_voice_conversion/finetune_speaker.py", line 279, in evaluate
y_hat, attn, mask, * = generator.module.infer(x, x_lengths, speakers, max_len=1000)
UnboundLocalError: local variable 'x' referenced before assignment

Colab 步骤4~5 错误代码

Pretraining details

Great work! Was hoping you could give some brief details on your pretraining - mostly how many hours of data per speaker and how many epochs.

Quality

Hello, I was wondering if you reached any groundbreaking quality with your method or at least the same quality as the many-to-many in the original vits demo , and if you can share the results

P.S about the architecture , I also thought about any-to-any vits before , it was to extract speaker embeddings from a pretrained speaker encoder like escapa tdnn and train on large dataset of so many speakers 1k+ and scale up the parameter.

Thanks in advance

请问一下这个错误是什么意思

昨天按照文件要求格式的压缩包层级上传了文件成功了，但是今天换了个同样格式带有更多语音的压缩包后就又开始报错
UnboundLocalError: local variable 'x' referenced before assignment
中途上传后也有识别到语音内容，请问是什么原因呢
$XP{K )E3K9J8Y_$PE2U(5$

新步骤三报错+如何获得transcibe标注文件。

训练一个多少时候出现报错，与之前issue中的error似乎不同：
Imageio: 'ffmpeg-linux64-v3.3.1' was not found on your computer; downloading it now.
Try 1. Download from https://github.com/imageio/imageio-binaries/raw/master/ffmpeg/ffmpeg-linux64-v3.3.1 (43.8 MB)
Downloading: 45929032/45929032 bytes (100.0%)
Done
File saved as /root/.imageio/ffmpeg/ffmpeg-linux64-v3.3.1.
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100% 80.2M/80.2M [00:04<00:00, 20.3MB/s]
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_6.wav
100%|██████████████████████████████████████████████| 3708.8999999999996/3708.8999999999996 [01:16<00:00, 48.46seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_0.wav
100%|████████████████████████████████████████████████| 620.0999999999999/620.0999999999999 [00:14<00:00, 43.96seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_3.wav
100%|██████████████████████████████████████████████████████████████████████| 2843.1/2843.1 [00:55<00:00, 50.85seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_5.wav
100%|████████████████████████████████████████████████████████████████████| 4124.25/4124.25 [01:20<00:00, 51.52seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_7.wav
100%|██████████████████████████████████████████████| 3650.3999999999996/3650.3999999999996 [01:11<00:00, 51.35seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_4.wav
100%|██████████████████████████████████████████████| 3469.0499999999997/3469.0499999999997 [01:07<00:00, 51.24seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_2.wav
100%|██████████████████████████████████████████████████████████████████████| 3077.1/3077.1 [01:00<00:00, 50.85seconds/s]
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS-fast-fine-tuning/separated/htdemucs
Separating track raw_audio/66_1.wav
100%|████████████████████████████████████████████████████████████████████| 4592.25/4592.25 [01:29<00:00, 51.51seconds/s]
2023-02-27 11:45:40.268184: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 11:45:40.419090: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-02-27 11:45:41.854468: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-27 11:45:41.854570: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-27 11:45:41.854591: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
100%|██████████████████████████████████████| 1.42G/1.42G [00:06<00:00, 228MiB/s]
transcribing ./denoised_audio/66_6.wav...

transcribing ./denoised_audio/66_0.wav...

transcribing ./denoised_audio/66_3.wav...

transcribing ./denoised_audio/66_5.wav...

transcribing ./denoised_audio/66_7.wav...

nn not supported, ignoring...

Traceback (most recent call last):
File "long_audio_transcribe.py", line 50, in
text = lang2token[lang] + text.replace("\n", "") + lang2token[lang]
KeyError: 'nn'
2023-02-27 12:32:19.528138: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 12:32:19.687193: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-02-27 12:32:20.538124: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-27 12:32:20.538216: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-27 12:32:20.538236: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

##################################################################################################
可能存在的问题是音频样本量过大，通过google drive上传了7段2个小时（700mb）左右的音频，第三步处理了一个多小时，通过打印结果似乎很多都处理成功了。

第三步的一个多少小时已经消耗了我colab pro一半的额度了:( 请问下有什么办法能获得已经成功的文字标注的txt文档吗？并且如何通过现成的短wav+txt标注的数据集进行训练？（第三步中长音频已经成功降噪，也分割成了短音频，我下载成zip了，现在需要对应的文字标注）

##################################################################################################
音频文件结构：

An error occurred while attempting to run finetune_speaker.py on Windows

Ideas on how to fix this error?

DataLoader的TextAudioSpeakerCollate错误

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "C:\Users\Yan\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
fn(i, *args)
File "G:\VITS\finetune_speaker_v2.py", line 134, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "G:\VITS\finetune_speaker_v2.py", line 242, in train_and_evaluate
evaluate(hps, net_g, eval_loader, writer_eval)
File "G:\VITS\finetune_speaker_v2.py", line 265, in evaluate
for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(eval_loader):
File "C:\Users\Yan\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 628, in next
data = self._next_data()
File "C:\Users\Yan\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Users\Yan\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data_utils\fetch.py", line 61, in fetch
return self.collate_fn(data)
File "G:\VITS\data_utils.py", line 159, in call
spec_padded[i, :, :spec.size(1)] = spec
RuntimeError: expand(torch.FloatTensor{[2, 513, 478]}, size=[513, 513]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)

请问自动去除背景音包括音乐吗？

请问自动去除背景音包括音乐吗？我自己通过Vocal Remover类似的软件分离音乐和说话声数据会不会效果更好些。

Getting requirements to build wheel ... error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [25 lines of output]
setup.py:26: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_CYTHON_INSTALLED = ver >= LooseVersion(min_cython_ver)
Traceback (most recent call last):
File "C:\python\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "C:\python\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\python\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "C:\Users\Night\AppData\Local\Temp\pip-build-env-xmh3jw0x\overlay\Lib\site-packages\setuptools\build_meta.py", line 162, in get_requires_for_build_wheel
return self._get_build_requires(
File "C:\Users\Night\AppData\Local\Temp\pip-build-env-xmh3jw0x\overlay\Lib\site-packages\setuptools\build_meta.py", line 143, in _get_build_requires
self.run_setup()
File "C:\Users\Night\AppData\Local\Temp\pip-build-env-xmh3jw0x\overlay\Lib\site-packages\setuptools\build_meta.py", line 267, in run_setup
super(_BuildMetaLegacyBackend,
File "C:\Users\Night\AppData\Local\Temp\pip-build-env-xmh3jw0x\overlay\Lib\site-packages\setuptools\build_meta.py", line 158, in run_setup
exec(compile(code, file, 'exec'), locals())
File "setup.py", line 153, in
File "C:\python\lib\subprocess.py", line 503, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\python\lib\subprocess.py", line 971, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\python\lib\subprocess.py", line 1440, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] 系统找不到指定的文件。
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

[Bug]使用kaggle时STEP 4报错无final_annotation_train.txt

在进行STEP 3时已经完成后似乎并没有生成任何txt

Detected language: ja
ですが……
Detected language: ja
キュッ!
Detected language: ja
絶好の探検日よりですね!
Detected language: ja
天文班の皆さんのように、私も大きな夢を持ちたいです!
Detected language: ja
見てください!あんなところに昔川だった痕跡が!
Detected language: ja
昔はこうやって測量しながら地図を作っていたみたいですね。私も自分の地図を作るために、少しずつ歩いていかないと。仙里の道も一歩から、ですね。
Detected language: ja
このあたりの測量はバッチリです!
Detected language: ja
こうです!
Detected language: ja
本当に今日バーベキューをやるんですか?
Downloading: "https://github.com/r9y9/open_jtalk/releases/download/v1.11.1/open_jtalk_dic_utf_8-1.11.tar.gz"
dic.tar.gz: 100%|██████████████████████████| 22.6M/22.6M [00:02<00:00, 10.1MB/s]
Extracting tar file /opt/conda/lib/python3.7/site-packages/pyopenjtalk/dic.tar.gz
finished

但是ls ./时并未有根目录更新文件

之前

LICENSE			   download_model.py	   sampled_audio4ft
README.md		   finetune_speaker.py	   sampled_audio4ft.txt
README_EN.md		   losses.py		   sampled_audio4ft.zip
README_ZH.md		   mel_processing.py	   text
VC_inference.py		   models.py		   transforms.py
attentions.py		   models_infer.py	   user_voice
commons.py		   modules.py		   user_voice_collect.py
configs			   monotonic_align	   utils.py
custom_character_anno.txt  preprocess.py	   video_transcribe.py
custom_character_voice	   pretrained_models	   voice_upload.py
data_utils.py		   requirements.txt	   whisper_transcribe.py
demucs_denoise.py	   requirements_infer.txt

之后

LICENSE			   demucs_denoise.py	   sampled_audio4ft
OUTPUT_MODEL		   download_model.py	   sampled_audio4ft.txt
README.md		   finetune_speaker.py	   sampled_audio4ft.zip
README_EN.md		   losses.py		   text
README_ZH.md		   mel_processing.py	   transforms.py
VC_inference.py		   models.py		   user_voice
__pycache__		   models_infer.py	   user_voice_collect.py
attentions.py		   modules.py		   utils.py
commons.py		   monotonic_align	   video_transcribe.py
configs			   preprocess.py	   voice_upload.py
custom_character_anno.txt  pretrained_models	   whisper_transcribe.py
custom_character_voice	   requirements.txt
data_utils.py		   requirements_infer.txt

导致STEP 4无法找到final_annotation_train.txt

Traceback (most recent call last):
  File "finetune_speaker.py", line 320, in <module>
    main()
  File "finetune_speaker.py", line 55, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/kaggle/working/VITS_voice_conversion/finetune_speaker.py", line 72, in run
    train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps.data)
  File "/kaggle/working/VITS_voice_conversion/data_utils.py", line 164, in __init__
    self.audiopaths_sid_text = load_filepaths_and_text(audiopaths_sid_text)
  File "/kaggle/working/VITS_voice_conversion/utils.py", line 144, in load_filepaths_and_text
    with open(filename, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'final_annotation_train.txt'

但是因为看见 #14 中的问题，所以将colab运行完成的STEP 3中生成的custom_character_anno.txt下载下来上传到根目录，但是也不行

custom_character_anno.txt大概内容

./custom_character_voice/mai/processed_0.wav|1000|ga↑sʃɯ*kɯmi↓taina mo↑no↓desɯ*ka?
./custom_character_voice/mai/processed_1.wav|1000|a↑ɾi↓gatoo go↑zaima↓sɯ*!

请问如何才能生成final_annotation_train.txt？（或者final_annotation_train.txt是在哪里？）

kaggle笔记本链接

CUDA out of memory

1660Ti 6GB，类似vits项目改batch_size为4勉强能跑，但是这里依然爆显存

然后是一点小建议：
训练的 backend 部分建议这么写

from sys import platform
if platform == "win32":
    backend = 'gloo'
else:
    backend = 'nccl'
dist.init_process_group(backend=backend, init_method='env://', world_size=n_gpus, rank=rank)

重采样部分建议拆出来用多进程处理
参考 resample.py

现在用谷歌也生成不了标注文件了

输出里显示识别出了音频的语种，但是输出的txt是空白的

你好我是B站北海水母。

我想请问能否替换掉那个大佐口音的预训练模型？

能否支持挂载谷歌硬盘？

COLAB自带的上传和下载功能非常不好用，限速非常严重，而且会因为意外的断连导致数据全部丢失。可以添加几行简单的代码就集成这个功能，请问能不能集成到那个Jupiter里面呢？
举个例子，so-vits-svc的Jupiter就实现了这个功能。

想问问这个是什么意思

求助关于windows环境下本地部署需要哪些注意事项呢

首先感谢大佬的分享和优化！我自己在尝试本地windows 环境下CONDA python3.9部署后，安装环境未出现错误，于是尝试下一步
Q1，两个预训练模型都下载了（CJ与CJE），如果想更换其他标准汉语单语种（嗯……现在大佐味道很浓），这个是不是就是关键的部分？（当然还有configs/finetune_speaker.json的配置文件）
Q2，sampled_audio4ft这个文件以及.TXT是用来做什么的呢，（应该是数据集验证吧，看到里面有音素的标注，但是800多条，缺失了很大一部分中文的标注）自己更换的话，是否需要同步进行替换？或者如何进行音素的转换？
Q3，在本地命令行输入“python denoise_audio.py”去噪，“python short_audio_transcribe.py --languages "CJE" --whisper_size medium”后，发现并未识别出中文字幕，只是一直一行一行显示“Detected language: zh”直到结束。
Q4，由于失败未进行下一步。若使用辅助训练，则是否命令行输入“python finetune_speaker_v2.py -m "./OUTPUT_MODEL" --max_epochs "20"”
期待大佬您的答疑解惑！

unzip: cannot find or open

大佬，我不是很懂github，我尝试问个我不太懂的
我上传结构就是压缩包压缩包里就一个文件夹放着对应的语音结构是压缩包/文件夹a/多个语音
但是我上传会 unzip: cannot find or open

wo我不是很懂，问问大佬:(

Step 3 Error

When trying to run step 3 after step 1, I get the following error:
unzip: cannot find or open ./custom_character_voice/custom_character_voice.zip, ./custom_character_voice/custom_character_voice.zip.zip or ./custom_character_voice/custom_character_voice.zip.ZIP.
python3: can't open file 'whisper_transcribe.py': [Errno 2] No such file or directory

Not sure what's causing this, as they're both in the file browser.

步骤2.5，步骤4报错

步骤2：Audio saved to ./user_voice/4.wav successfully!
步骤2.5
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS_voice_conversion/separated/htdemucs
Separating track user_voice/21.wav
Traceback (most recent call last):
File "/usr/local/bin/demucs", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/demucs/separate.py", line 159, in main
wav = load_track(track, model.audio_channels, model.samplerate)
File "/usr/local/lib/python3.8/dist-packages/demucs/separate.py", line 41, in load_track
wav = convert_audio(wav, sr, samplerate, audio_channels)
File "/usr/local/lib/python3.8/dist-packages/demucs/audio.py", line 175, in convert_audio
return julius.resample_frac(wav, from_samplerate, to_samplerate)
File "/usr/local/lib/python3.8/dist-packages/julius/resample.py", line 166, in resample_frac
return ResampleFrac(old_sr, new_sr, zeros, rolloff).to(x)(x, output_length, full)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/julius/resample.py", line 132, in forward
x = x.reshape(-1, length)
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
Important: the default model was recently changed to htdemucs the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use -n mdx_extra_q.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/VITS_voice_conversion/separated/htdemucs
Separating track user_voice/12.wav
Traceback (most recent call last):
File "/usr/local/bin/demucs", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/demucs/separate.py", line 159, in main
wav = load_track(track, model.audio_channels, model.samplerate)
File "/usr/local/lib/python3.8/dist-packages/demucs/separate.py", line 41, in load_track
wav = convert_audio(wav, sr, samplerate, audio_channels)
File "/usr/local/lib/python3.8/dist-packages/demucs/audio.py", line 175, in convert_audio
return julius.resample_frac(wav, from_samplerate, to_samplerate)
File "/usr/local/lib/python3.8/dist-packages/julius/resample.py", line 166, in resample_frac
return ResampleFrac(old_sr, new_sr, zeros, rolloff).to(x)(x, output_length, full)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/julius/resample.py", line 132, in forward
x = x.reshape(-1, length)
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

步骤3（节选）：
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 06:49:25.956540: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-21 06:49:25.956650: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-21 06:49:25.956689: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
100%|██████████████████████████████████████| 1.42G/1.42G [00:11<00:00, 133MiB/s]
Detected language: ja
よしよし、ありがとうよ
Detected language: ja
龍は大嫌いだ。だが食ったらうまい。
Detected language: ja
今気づいたが、むしみだな、俺。殴り合いが好きってのはダメか?ダメだよな。
Detected language: ja
安心しろって。バーサーカーでもサーヴァント。お前の身は俺が守ってやるぞ。
Detected language: ja
なかなかやるな、てめえ。ま、楽しめたぜ。
Detected language: ja
聖杯ね。ま、欲しいんならいいんじゃねえか。俺はいらねえよ。
Detected language: ja
こいつは最高だ!
Detected language: ja
いい汗かいたぜ。じゃあ二回戦やるか。ダメか?そりゃ残念。
Detected language: ja
なかなかいいパンチだったぜ。しかし、俺の方が殴り慣れてる。
Detected language: ja
しょうがねえ。腹割って付き合ってやろうじゃねえか。何が望みだ?
Detected language: ja
おっと、いい感じじゃねーか
Detected language: ja
ありがとうよお前さんのおかげださあ一緒殴り合うかああ断るそうか残念だはっはっは
Detected language: ja
ああ、くそ。悪いな。先行くわ。
Detected language: ja
たけ、何か用か?もてやましてんだが
Detected language: ja
気なくさいな。何かあるんだろう。行ってみるか。
Detected language: ja
悪いことするときは目を背けてやるさもちろん限度ってもんがあるがな
Detected language: ja
おいおいマスター、引きこもって何になる?え?
Detected language: ja
いいね、強くなってらしい
Detected language: ja
ソラよ、クレティール
Detected language: ja
悪い悪い、なんでもねえよ
Detected language: ja
いいじゃねーか気に入ったこれからも気に食わない連中は殴って殴ってもう一度殴っちまえよ
Detected language: ja
オラオラオラ、どしたどした!
Detected language: ja
さーて、ぶん殴り合いのお時間だ。男女問わず倒れるまでやろうや!
Detected language: ja
来たか?ならいいさ、殴って蹴ってそっぱりしてやる!
Detected language: ja
サーヴァント・バーサーか。真名・ベオウルフ。じゃあ殴りに行こうぜ、マスター。おいおい、引くなよ。
Detected language: ja
これが戦いの根源だ。要するに、殴って蹴って立っていた方の勝ちってやつよ!オラオラオラ!ぶっ飛ぶへ!
Detected language: ja
おっと、てめえの生まれた日じゃねえか。おら、空に向かって感謝しな。
Downloading: "https://github.com/r9y9/open_jtalk/releases/download/v1.11.1/open_jtalk_dic_utf_8-1.11.tar.gz"
dic.tar.gz: 100% 22.6M/22.6M [00:01<00:00, 12.9MB/s]
Extracting tar file /usr/local/lib/python3.8/dist-packages/pyopenjtalk/dic.tar.gz
finished

步骤4：
DEBUG:matplotlib:CACHEDIR=/root/.cache/matplotlib
DEBUG:matplotlib.font_manager:Using fontManager instance from /root/.cache/matplotlib/fontlist-v310.json
DEBUG:matplotlib.pyplot:Loaded backend agg version unknown.
DEBUG:matplotlib.pyplot:Loaded backend agg version unknown.
0% 0/84 [00:33<?, ?it/s]
Traceback (most recent call last):
File "finetune_speaker.py", line 320, in
main()
File "finetune_speaker.py", line 55, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/VITS_voice_conversion/finetune_speaker.py", line 133, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/content/VITS_voice_conversion/finetune_speaker.py", line 241, in train_and_evaluate
evaluate(hps, net_g, eval_loader, writer_eval)
File "/content/VITS_voice_conversion/finetune_speaker.py", line 264, in evaluate
for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(eval_loader):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/VITS_voice_conversion/data_utils.py", line 248, in getitem
return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
File "/content/VITS_voice_conversion/data_utils.py", line 206, in get_audio_text_speaker_pair
spec, wav = self.get_audio(audiopath)
File "/content/VITS_voice_conversion/data_utils.py", line 223, in get_audio
spec = spectrogram_torch(audio_norm, self.filter_length,
File "/content/VITS_voice_conversion/mel_processing.py", line 52, in spectrogram_torch
if torch.min(y) < -1.:
RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

步骤三报错

用的是b站视频链接上传
步骤四也有错误，不知道是否有关，同样贴上

这是步骤四的错误

第五步下载不了

总是显示
python3: can't open file 'rearrange_speaker.py': [Errno 2] No such file or directory
ERROR:root:File 'download_model.py' not found.

本地部署后使用日语时报错

在colab上进行微调后本地部署使用，中文与英语正常，但是使用日语时会报错。

执行 `STEP 3 自动处理所有上传的数据` 时报错RuntimeError

在步骤2中上传的文件：

啊啊啊啊啊_000001.mp3 30秒英语
啊啊啊啊啊_000002.mp3 40秒英语
啊啊啊啊啊_000003.mp3 120秒中文

在执行 `STEP 3 自动处理所有上传的数据` 时报错

Traceback (most recent call last):
  File "denoise_audio.py", line 12, in <module>
    wav, sr = torchaudio.load(f"./separated/htdemucs/{file}/vocals.wav", frame_offset=0, num_frames=-1, normalize=True,
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/backend/sox_io_backend.py", line 246, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/io/_compat.py", line 103, in load_audio
    s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError: Failed to open the input "./separated/htdemucs/啊啊啊啊啊_000001.mp3/vocals.wav" (No such file or directory).
2023-03-01 04:05:04.300610: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-01 04:05:05.375819: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-01 04:05:05.375961: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-01 04:05:05.375984: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-01 04:05:31.699668: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-01 04:05:32.933781: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-01 04:05:32.933914: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-01 04:05:32.933939: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

运行`nvidia-smi`输出

Wed Mar  1 04:14:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    31W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

若继续执行下面的代码块会出现断言错误no speaker found

Previous Colab Notebook

Hello!

Yesterday I tested the Colab notebook that included an interface to record your own voice clips and fine tune with that, and did not have the video part. I was starting to build a fork of the repo to repeat the process in spanish but it seems the notebook was updated along with these new video features, is there a chance I can still access to that old notebook, to adapt it for my fork?

Of course as soon as I manage to adapt it I can provide you with the changes I made and try to adapt it so that you can have an extra language option.

Thanks in advance, and congratulations on such an amazing work, I had already contacted you from the hugging face repo but the more and more I look at your work I am more amazed.

Greeting from Argentina,
Juanma

PS: btw, my fork of the repo is here, I just started it today so it has almost nothing changed but you can take a look at the ToDo list and give me your feedback.. Otherwise I will still contact you as soon as I have something solid working on spanish.

plachtaa / vits-fast-fine-tuning Goto Github PK

vits-fast-fine-tuning's Introduction

VITS Fast Fine-tuning

Currently Supported Tasks:

Currently Supported Characters for TTS & VC:

Fine-tuning

How long does it take?

Inference or Usage (Currently support Windows only)

Use in MoeGoe

Looking for help?

vits-fast-fine-tuning's People

Contributors

Stargazers

Watchers

Forkers

vits-fast-fine-tuning's Issues

在步骤2中上传的文件：

在执行 STEP 3 自动处理所有上传的数据 时报错

运行nvidia-smi输出

Recommend Projects

Recommend Topics

Recommend Org

在执行 `STEP 3 自动处理所有上传的数据` 时报错

运行`nvidia-smi`输出