
mockingbird's Introduction

Hi there 👋

My Interested Topics

  • Artificial Intelligence 🎉🎉🎉🎉
  • Robotics 🎉🎉🎉🎉
  • Metaverse/Virtual Reality 🎉🎉🎉🎉
  • Cloud Native 🎉🎉🎉
  • Front-end/Client-side Programming/Ecosystem 🎉🎉🎉
  • Distributed Systems 🎉🎉

My Languages

  • Java 🌟🌟🌟🌟🌟
  • JavaScript (ReactJS) 🌟🌟🌟🌟🌟
  • JavaScript (NodeJS) 🌟🌟🌟🌟
  • Python 🌟🌟🌟🌟
  • PHP 🌟🌟🌟🌟
  • Golang 🌟🌟🌟🌟
  • C++ 🌟🌟🌟
  • C# 🌟🌟🌟
  • Mandarin 🌟🌟🌟
  • Dart 🌟🌟
  • Rust 🌟🌟
  • English 🌟🌟

mockingbird's People

Contributors

1044690543, alexzhangji, babysor, castleking1997, cocucola, earmer, everschen, fawenyo, flysmart, hertz-pj, ibb233, illustar0, jenkey2011, jerryuhoo, jethrochow, kagurazakanyaa, kslz, lonelyman0108, lzy2006, maxoyed, moosewoler, oceanarium, pansila, wei-z-git, wenqingl, whitescent, wwdok, xiuchen-liu, xumeng, zzxiang

mockingbird's Issues

torch.Size mismatch problem

There is a problem; it shows:
Exception: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512])
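This size mismatch means the checkpoint was trained with a different text symbol set (70 entries) than the code you are running now builds (75 entries), so the embedding table cannot be loaded. A minimal, hedged way to see which side differs, assuming an SV2TTS-style checkpoint that stores its weights under a "model_state" key (your checkpoint layout and module paths may differ):

    import torch

    ckpt = torch.load("synthesizer/saved_models/pretrained.pt", map_location="cpu")
    state = ckpt.get("model_state", ckpt)  # some checkpoints keep the weights at the top level
    print(state["encoder.embedding.weight"].shape)   # symbol count the checkpoint was trained with

    # Symbol count the current code builds the model with (module path follows the
    # upstream SV2TTS layout and may differ in this repo)
    from synthesizer.utils.symbols import symbols
    print(len(symbols))

If the two counts disagree, use a checkpoint that matches your checkout of the code, or check out the revision of the repo the checkpoint was trained against.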

How do I use the trained models?

As the title says.
I put the training results downloaded from Baidu Cloud into E:\Voice\trainmodel and ran python demo_toolbox.py -d E:\Voice\trainmodelc, but it does not seem to run successfully.
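For reference, demo_toolbox.py's -d argument points at the datasets root, not at a model folder; the downloaded checkpoints are expected inside each module's saved_models directory. A hedged sketch of the expected layout (the directory names below follow the paths mentioned elsewhere in these issues and the upstream SV2TTS convention; they may differ slightly between releases of this repo):

    <repo root>/
        encoder/saved_models/pretrained.pt
        synthesizer/saved_models/<model name>/...
        vocoder/saved_models/pretrained/pretrained.pt

    python demo_toolbox.py -d <datasets_root>    # -d: datasets root, not the model folder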

LibriSpeech alignments?

(base) F:\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py "F:\Realtime-Voice-Clone-Chinese-main/data1"
Arguments:
datasets_root: F:\Realtime-Voice-Clone-Chinese-main\data1
out_dir: F:\Realtime-Voice-Clone-Chinese-main\data1\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [02:47<00:00, 2.51speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "F:\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

The same problem is reported here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/486
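"The dataset consists of 0 utterances" means the preprocessor found no usable audio/transcript pairs, so the later max() over the empty metadata list fails with this confusing error. The most commonly reported cause is that the per-speaker archives under aidatatang_200zh\corpus\train were never extracted (they contain the actual wav and transcript files). A quick, hedged check that the audio really is on disk (adjust the path to your datasets root):

    from pathlib import Path

    train_dir = Path(r"F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train")
    wavs = list(train_dir.rglob("*.wav"))
    archives = list(train_dir.rglob("*.tar.gz"))
    print(f"{len(wavs)} wav files, {len(archives)} unextracted archives")
    # 0 wav files with archives still present means the per-speaker .tar.gz files need extracting first.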

A question about training the synthesizer, guidance requested!

Hello,
I have downloaded the aidatatang_200zh dataset and extracted all the archives under aidatatang_200zh\corpus\train.
But when I run python synthesizer_preprocess_audio.py D:\google download (the files live under the path D:\google download),
the following happens:

D:\python_demo\Realtime-Voice-Clone-Chinese>python synthesizer_preprocess_audio.py D:\google download\
D:\python_demo\Realtime-Voice-Clone-Chinese\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
usage: synthesizer_preprocess_audio.py [-h] [-o OUT_DIR] [-n N_PROCESSES] [-s] [--hparams HPARAMS] [--no_trim] [--no_alignments] [--dataset DATASET] datasets_root
synthesizer_preprocess_audio.py: error: unrecognized arguments: download\

How can I solve this? I looked through the earlier issue discussions and did not find a similar problem. Below are the possible causes I can think of; I would appreciate the author's answer, thank you!

1. I only extracted the archives under aidatatang_200zh\corpus\train. Do the files under the other folders also need to be extracted?
2. Should I pull all of the wav files out and put them directly under aidatatang_200zh\corpus\train before running python synthesizer_preprocess_audio.py D:\google download?
3. The command I entered is wrong (see the example after this list).
4. Do the wav and txt files need preprocessing that I have not done?
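Cause 3 is the likeliest: the space in D:\google download makes argparse treat "download\" as an extra positional argument, which is exactly what the "unrecognized arguments: download\" message says. A hedged fix is to quote the path, or to move the data to a path without spaces:

    python synthesizer_preprocess_audio.py "D:\google download"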

Automatic shutdown during "Preprocess the embeddings"

Does anyone have the same problem? Shortly after running python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer the machine shuts down by itself. If you have run into this, how did you solve it?

Error in the first preprocessing step after extracting the 200zh dataset

(RVCC) D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py D:\data
Arguments:
datasets_root: D:\data
out_dir: D:\data\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
D:\data\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [01:02<00:00, 6.71speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

Some thoughts on this project

As things stand, the project is far from "usable" in real-world use. The problems include:
1. The synthesized audio sometimes contains no normal human voice at all, only noise and broken fragments.
2. The synthesized timbre does not match the target timbre; the difference is large.

My current analysis is that problem 1 is caused by:
1. Some of the ASR data has clearly excessive background noise, so the audio cannot be aligned with the text or phoneme data. (Some data-cleaning steps should be added.)
2. The current d-vector and vocoder components are the universal versions trained on English datasets, so using them on Chinese data is bound to produce a mismatch. (My understanding is that the d-vector model and the vocoder need to be retrained on Chinese datasets for better results.)
3. The dataset contains too few voices, which makes it hard to find a "reference timbre" close enough to the target to use for generation. (Mixing several ASR and TTS datasets into one large dataset would improve coverage of target timbres.)

I will probably also start on some work to try to improve the model, and I hope to have the chance to collaborate with the author.

Questions about training and inference

As far as I know, the datatang and slr68 datasets are ASR corpora, so they carry no phoneme annotations. During training, is the raw text used directly as tokens, or is the text converted to phonemes first? Also, in your demo video I seem to see phonemes being used as input; if training uses raw text but inference uses phonemes, what processing bridges the two?
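For reference, the Chinese text front end in this family of repos normally converts characters to pinyin before they reach the synthesizer, so training and inference see the same symbol set. A minimal sketch of that conversion with pypinyin (this assumes the repo's cleaner is pypinyin-based; the exact tone style it uses may differ):

    from pypinyin import Style, lazy_pinyin

    text = "欢迎使用语音克隆工具"
    # Tone-numbered pinyin, e.g. ['huan1', 'ying2', 'shi3', 'yong4', ...]
    print(" ".join(lazy_pinyin(text, style=Style.TONE3)))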

Out of GPU memory when training the model

Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.85 GiB reserved in total by PyTorch)

Could you expose a parameter for adjusting batch_size? My GPU only has 4 GB of VRAM (GTX 1050 Ti), and training with the default settings keeps running out of memory...
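Until such a flag exists, the batch size can usually be lowered directly in the synthesizer hyperparameters. A hedged sketch, assuming the schedule format used by the upstream SV2TTS synthesizer, where each tuple is (reduction factor, learning rate, step, batch size); check synthesizer/hparams.py in your checkout for the actual layout and defaults:

    # synthesizer/hparams.py - illustrative values only
    tts_schedule = [(2, 1e-3,  20_000, 8),    # last field is the batch size, lowered for a 4 GB GPU
                    (2, 5e-4,  40_000, 8),
                    (2, 1e-4, 100_000, 8)]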

Deploy as a web service

Is there any way to deploy it as an HTTP service so that it can be called remotely?
I have two computers.
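The repo does not ship a server, but a thin HTTP wrapper is straightforward to sketch. A hypothetical Flask endpoint, assuming a synthesize(text, reference_wav) helper that wraps the encoder/synthesizer/vocoder calls the toolbox makes (that helper is not part of the repo and is only illustrative):

    # server.py - hypothetical HTTP wrapper around the cloning pipeline
    import io

    import soundfile as sf
    from flask import Flask, request, send_file

    from my_clone import synthesize  # hypothetical helper: (text, reference wav) -> (waveform, sample rate)

    app = Flask(__name__)

    @app.route("/tts", methods=["POST"])
    def tts():
        text = request.form["text"]                  # text to speak
        reference = request.files["reference"]       # short wav of the target speaker
        wav, sample_rate = synthesize(text, reference)
        buf = io.BytesIO()
        sf.write(buf, wav, sample_rate, format="WAV")
        buf.seek(0)
        return send_file(buf, mimetype="audio/wav")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)

The second machine can then POST text plus a reference recording to the /tts endpoint and receive the generated wav.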

Using the models from Baidu Cloud, everything generated plays back as noise

Environment

Windows 10
Python 3.7

Description

After putting the .pt models from Baidu Cloud into synthesizer/saved_models/, python .\demo_toolbox.py runs, but everything it produces is noise; neither Chinese characters nor pinyin work well.

Screenshots

(screenshots omitted)

I am a complete beginner; I hope someone experienced can offer some pointers when they have time.

How to fix "DLL load failed: The paging file is too small for this operation to complete" when running python synthesizer_preprocess_audio.py

I hit the above error when running python synthesizer_preprocess_audio.py. I found a fix on CSDN:
1. If the Python environment is not on the C: drive, open Advanced system settings -> Advanced -> Performance Settings -> Advanced -> Virtual memory -> Change, untick "Automatically manage paging file size for all drives", choose "Custom size", and set both the initial size and maximum size to 10240.
2. Change the num_workers parameter of the DataLoader to 0.
But I am not sure how exactly to set that parameter to 0.
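For step 2, the change goes wherever the training code constructs its PyTorch DataLoader. A minimal, self-contained sketch (the dataset here is a stand-in; in the repo it would be the synthesizer's own dataset object):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(8, 4))  # stand-in for the real training dataset

    # num_workers=0 keeps data loading in the main process, which sidesteps the
    # Windows "paging file too small" error raised when worker processes are spawned.
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)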

A question about aidatatang_200zh

I tried downloading aidatatang_200zh from its official site. Do I need to extract all of the files under aidatatang_200zh\aidatatang_200zh\aidatatang_200zh\corpus\train?

Could you make a video tutorial?

I am a beginner and I really have tried. Fortunately other people have published tutorials for some of the installation, download, and configuration steps, but tutorials from different people do not fit together, which leaves me rather lost. A lot of this comes down to details: the method someone teaches may work for their particular problem but not for this project. Please consider it.

What does the speaker encoder's output vector look like?

I came over from the SV2TTS comment section. I trained the speaker encoder myself on the aishell3 dataset, which has 214 speakers, while the output speaker embedding is 256-dimensional. As a result each speaker's vector is very sparse: most dimensions are 0, so it is almost one-hot. Using these embeddings to train the synthesizer does not work at all; the loss is NaN.
When you trained the synthesizer, did you notice roughly what the speaker embedding vectors looked like?
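A quick way to see whether the embeddings have collapsed toward near-one-hot vectors is to load a few of the .npy files written by the embedding preprocessing step and check their sparsity and norm. A hedged sketch, assuming the embeddings live under SV2TTS/synthesizer/embeds as in the usual SV2TTS layout (adjust the path to your datasets root):

    import numpy as np
    from pathlib import Path

    embed_dir = Path("SV2TTS/synthesizer/embeds")
    for fpath in sorted(embed_dir.glob("*.npy"))[:5]:
        e = np.load(fpath)
        zero_frac = float(np.mean(np.isclose(e, 0.0)))
        print(f"{fpath.name}: shape={e.shape}, L2 norm={np.linalg.norm(e):.3f}, "
              f"near-zero fraction={zero_frac:.2f}")

A trained d-vector is L2-normalised (norm close to 1); if almost every entry is zero, the encoder has probably seen too few speakers or has not trained long enough.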

Backend Qt5Agg is interactive backend. Turning interactive mode on.

Running it directly works fine, but when debugging demo_toolbox.py the following error is raised:
Traceback (most recent call last):
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "E:/instance/tts/Realtime-Voice-Clone-Chinese-main/demo_toolbox.py", line 43, in <module>
    Toolbox(**vars(args))
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\__init__.py", line 75, in __init__
    self.ui = UI()
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\ui.py", line 450, in __init__
    self.projections_layout.addWidget(FigureCanvas(fig))
TypeError: addWidget(self, QWidget, stretch: int = 0, alignment: Union[Qt.Alignment, Qt.AlignmentFlag] = Qt.Alignment()): argument 1 has unexpected type 'FigureCanvasQTAgg'
Backend Qt5Agg is interactive backend. Turning interactive mode on.
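The type error suggests matplotlib bound to a different Qt package than the PyQt5 the toolbox UI uses, so the FigureCanvas it creates is not the QWidget subclass addWidget() expects; the debugger changes the import order and triggers it. A hedged workaround is to pin the Qt binding and backend before anything imports matplotlib.pyplot (whether this is the right fix depends on which Qt bindings are installed):

    # at the very top of demo_toolbox.py, before toolbox/ui.py is imported
    import os
    os.environ.setdefault("QT_API", "pyqt5")  # make matplotlib bind to PyQt5 rather than PySide2
    import matplotlib
    matplotlib.use("Qt5Agg")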

Voice samples

If the voice sample is a song, can the speaker's own voice still be cloned from it?

Could you provide pretrained encoder/vocoder models?

python synthesizer_preprocess_embeds.py <path-to-datasets_root>/SV2TTS/synthesizer

Output:

Arguments:
    synthesizer_root:      <path-to-datasets_root>/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0% 0/25308 [00:00<?, ?utterances/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 242, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "<path-to-Realtime-Voice-Clone-Chinese>/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))    
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 268, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
Embedding:   0% 0/25308 [00:01<?, ?utterances/s]
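The traceback boils down to the encoder checkpoint simply not being at the default path encoder/saved_models/pretrained.pt. A hedged pre-flight check; download the pretrained encoder and place it there, or point the script at its real location if your version of synthesizer_preprocess_embeds.py exposes an encoder_model_fpath option (the printed arguments above suggest it does):

    from pathlib import Path

    fpath = Path("encoder/saved_models/pretrained.pt")
    if not fpath.exists():
        print(f"Encoder checkpoint not found at {fpath.resolve()}; "
              "place pretrained.pt there before running synthesizer_preprocess_embeds.py")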

Does this project require training it yourself?

The downloaded pretrained models do not work when placed in the repository root; the toolbox only starts after copying them into the corresponding model directories.
It can only load audio from the datasets.
The analysis feature cannot be used.
The instructions are not detailed enough, so I cannot figure out how to use it.
Please write a detailed step-by-step guide.

Crash when generating audio

Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
Aborted (core dumped)
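The assertion is raised inside PortAudio/ALSA during playback, after synthesis has already finished (the vocoder loaded and built Wave-RNN normally). A hedged workaround is to skip in-process playback and write the generated waveform to disk instead, then play the file with an external player (the variable names here are placeholders for whatever the vocoder step returns):

    import numpy as np
    import soundfile as sf

    # in the real toolbox, `wav` and `sample_rate` come from the vocoder output
    wav = np.zeros(16000, dtype=np.float32)  # placeholder so this sketch runs on its own
    sample_rate = 16000
    sf.write("generated.wav", wav, sample_rate)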

What is kiwisolver...?

Traceback (most recent call last):
  File "D:\code\Realtime-Voice-Clone-Chinese\demo_toolbox.py", line 2, in <module>
    from toolbox import Toolbox
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\__init__.py", line 1, in <module>
    from toolbox.ui import UI
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\ui.py", line 1, in <module>
    import matplotlib.pyplot as plt
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 157, in <module>
    check_versions()
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 151, in check_versions
    module = importlib.import_module(modname)
  File "D:\software\install place\python3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'kiwisolver'
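kiwisolver is one of matplotlib's required dependencies (a constraint solver used for figure layout); the traceback just means it is missing from the environment. Installing it, or reinstalling matplotlib so that its dependencies are pulled in, normally resolves this:

    pip install kiwisolver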

Pre-trained Model

Hi, I am from outside China.

Is it possible to have the pre-trained models available for download from Google Drive?

sounddevice error

On Windows 10 the default system encoding is GBK, and running demo_toolbox.py raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

Open D:\Env\anaconda3\Lib\site-packages\sounddevice.py and go to line 573 (there is an existing issue about this error); after changing the codec to mbcs the error becomes:

UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1:xxxxxxxxxxxxxxxxxxx

Running python -m sounddevice reports the same error.
(screenshot of the Windows system-encoding setting omitted)
After changing the system encoding as shown above and rebooting, running python -m sounddevice no longer reports an error:

C:\Users\LM>python -m sounddevice
   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 mic (USBAudio2.0), MME (2 in, 0 out)
   2 麦克风阵列 (Realtek High Definition , MME (2 in, 0 out)
   3 立体声混音 (Realtek High Definition , MME (2 in, 0 out)
   4 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  5 ear (15- Meizu HiFi DAC Headpho, MME (0 in, 2 out)
   6 Speaker (Realtek High Definitio, MME (0 in, 2 out)
   7 DELL U2414H (NVIDIA High Defini, MME (0 in, 2 out)
   8 主声音捕获驱动程序, Windows DirectSound (2 in, 0 out)
   9 mic (USBAudio2.0), Windows DirectSound (2 in, 0 out)
  10 麦克风阵列 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  11 立体声混音 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  12 主声音驱动程序, Windows DirectSound (0 in, 2 out)
  13 Speaker (Realtek High Definition Audio), Windows DirectSound (0 in, 2 out)
  14 DELL U2414H (NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
  15 DSD 转码器 (DoP/Native), ASIO (0 in, 2 out)
  16 ear (15- Meizu HiFi DAC Headphone Amplifier), Windows WASAPI (0 in, 2 out)
  17 Speaker (Realtek High Definition Audio), Windows WASAPI (0 in, 2 out)
  18 DELL U2414H (NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
  19 麦克风阵列 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  20 立体声混音 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  21 mic (USBAudio2.0), Windows WASAPI (2 in, 0 out)
  22 Output (), Windows WDM-KS (0 in, 2 out)
  23 耳机 (), Windows WDM-KS (0 in, 2 out)
  24 Headphones (Meizu HiFi DAC Headphone Amplifier), Windows WDM-KS (0 in, 2 out)
  25 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  26 立体声混音 (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  27 麦克风阵列 (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  28 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (0 in, 1 out)
  29 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (1 in, 0 out)
  30 麦克风 (USBAudio2.0), Windows WDM-KS (2 in, 0 out)

Running demo_toolbox.py again then opens normally.

Error when running demo_cli.py

I downloaded both the original model and your model, but running demo_cli.py produces the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([70, 512]).
