
mockingbird's Introduction

Hi there 👋

My Interested Topics

  • Artificial Intelligence 🎉🎉🎉🎉
  • Robotics 🎉🎉🎉🎉
  • Metaverse/Virtual Reality 🎉🎉🎉🎉
  • Cloud Native 🎉🎉🎉
  • Front-end/Client-side Programming/Ecosystem 🎉🎉🎉
  • Distributed Systems 🎉🎉

My Languages

  • Java 🌟🌟🌟🌟🌟
  • JavaScript (ReactJS) 🌟🌟🌟🌟🌟
  • JavaScript (NodeJS) 🌟🌟🌟🌟
  • Python 🌟🌟🌟🌟
  • PHP 🌟🌟🌟🌟
  • Golang 🌟🌟🌟🌟
  • C++ 🌟🌟🌟
  • C# 🌟🌟🌟
  • Mandarin 🌟🌟🌟
  • Dart 🌟🌟
  • Rust 🌟🌟
  • English 🌟🌟

mockingbird's People

Contributors

1044690543, alexzhangji, babysor, castleking1997, cocucola, earmer, everschen, fawenyo, flysmart, hertz-pj, ibb233, illustar0, jenkey2011, jerryuhoo, jethrochow, kagurazakanyaa, kslz, lonelyman0108, lzy2006, maxoyed, moosewoler, oceanarium, pansila, wei-z-git, wenqingl, whitescent, wwdok, xiuchen-liu, xumeng, zzxiang

mockingbird's Issues

torch.Size mismatch problem

There is a problem; it shows:
Exception: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512])
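This size mismatch means the checkpoint was trained with a different text symbol set (70 entries) than the code you are running now builds (75 entries), so the embedding table cannot be loaded. A minimal, hedged way to see which side differs, assuming an SV2TTS-style checkpoint that stores its weights under a "model_state" key (your checkpoint layout and module paths may differ):

    import torch

    ckpt = torch.load("synthesizer/saved_models/pretrained.pt", map_location="cpu")
    state = ckpt.get("model_state", ckpt)  # some checkpoints keep the weights at the top level
    print(state["encoder.embedding.weight"].shape)   # symbol count the checkpoint was trained with

    # Symbol count the current code builds the model with (module path follows the
    # upstream SV2TTS layout and may differ in this repo)
    from synthesizer.utils.symbols import symbols
    print(len(symbols))

If the two counts disagree, use a checkpoint that matches your checkout of the code, or check out the revision of the repo the checkpoint was trained against.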

How do I use the trained models?

As the title says.
I put the training results downloaded from Baidu Cloud into E:\Voice\trainmodel and ran python demo_toolbox.py -d E:\Voice\trainmodelc, but it does not seem to run successfully.
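For reference, demo_toolbox.py's -d argument points at the datasets root, not at a model folder; the downloaded checkpoints are expected inside each module's saved_models directory. A hedged sketch of the expected layout (the directory names below follow the paths mentioned elsewhere in these issues and the upstream SV2TTS convention; they may differ slightly between releases of this repo):

    <repo root>/
        encoder/saved_models/pretrained.pt
        synthesizer/saved_models/<model name>/...
        vocoder/saved_models/pretrained/pretrained.pt

    python demo_toolbox.py -d <datasets_root>    # -d: datasets root, not the model folder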

LibriSpeech alignments?

(base) F:\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py "F:\Realtime-Voice-Clone-Chinese-main/data1"
Arguments:
datasets_root: F:\Realtime-Voice-Clone-Chinese-main\data1
out_dir: F:\Realtime-Voice-Clone-Chinese-main\data1\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [02:47<00:00, 2.51speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "F:\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

The same problem is reported here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/486
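"The dataset consists of 0 utterances" means the preprocessor found no usable audio/transcript pairs, so the later max() over the empty metadata list fails with this confusing error. The most commonly reported cause is that the per-speaker archives under aidatatang_200zh\corpus\train were never extracted (they contain the actual wav and transcript files). A quick, hedged check that the audio really is on disk (adjust the path to your datasets root):

    from pathlib import Path

    train_dir = Path(r"F:\Realtime-Voice-Clone-Chinese-main\data1\aidatatang_200zh\corpus\train")
    wavs = list(train_dir.rglob("*.wav"))
    archives = list(train_dir.rglob("*.tar.gz"))
    print(f"{len(wavs)} wav files, {len(archives)} unextracted archives")
    # 0 wav files with archives still present means the per-speaker .tar.gz files need extracting first.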

A question about training the synthesizer, guidance requested!

Hello,
I have downloaded the aidatatang_200zh dataset and extracted all the archives under aidatatang_200zh\corpus\train.
But when I run python synthesizer_preprocess_audio.py D:\google download (the files live under the path D:\google download),
the following happens:

D:\python_demo\Realtime-Voice-Clone-Chinese>python synthesizer_preprocess_audio.py D:\google download\
D:\python_demo\Realtime-Voice-Clone-Chinese\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
usage: synthesizer_preprocess_audio.py [-h] [-o OUT_DIR] [-n N_PROCESSES] [-s] [--hparams HPARAMS] [--no_trim] [--no_alignments] [--dataset DATASET] datasets_root
synthesizer_preprocess_audio.py: error: unrecognized arguments: download\

How can I solve this? I looked through the earlier issue discussions and did not find a similar problem. Below are the possible causes I can think of; I would appreciate the author's answer, thank you!

1. I only extracted the archives under aidatatang_200zh\corpus\train. Do the files under the other folders also need to be extracted?
2. Should I pull all of the wav files out and put them directly under aidatatang_200zh\corpus\train before running python synthesizer_preprocess_audio.py D:\google download?
3. The command I entered is wrong (see the example after this list).
4. Do the wav and txt files need preprocessing that I have not done?
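Cause 3 is the likeliest: the space in D:\google download makes argparse treat "download\" as an extra positional argument, which is exactly what the "unrecognized arguments: download\" message says. A hedged fix is to quote the path, or to move the data to a path without spaces:

    python synthesizer_preprocess_audio.py "D:\google download"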

Automatic shutdown during "Preprocess the embeddings"

Does anyone have the same problem? Shortly after running python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer the machine shuts down by itself. If you have run into this, how did you solve it?

Error in the first preprocessing step after extracting the 200zh dataset

(RVCC) D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main>python synthesizer_preprocess_audio.py D:\data
Arguments:
datasets_root: D:\data
out_dir: D:\data\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
dataset: aidatatang_200zh

Using data from:
D:\data\aidatatang_200zh\corpus\train
aidatatang_200zh: 100%|████████████████████████████████████████████████████████| 420/420 [01:02<00:00, 6.71speakers/s]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 64, in
preprocess_dataset(**vars(args))
File "D:\Realtime-Voice-Clone-Chinese-main\Realtime-Voice-Clone-Chinese-main\synthesizer\preprocess.py", line 76, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

Some thoughts on this project

As things stand, the project is far from "usable" in real-world use. The problems include:
1. The synthesized audio sometimes contains no normal human voice at all, only noise and broken fragments.
2. The synthesized timbre does not match the target timbre; the difference is large.

My current analysis is that problem 1 is caused by:
1. Some of the ASR data has clearly excessive background noise, so the audio cannot be aligned with the text or phoneme data. (Some data-cleaning steps should be added.)
2. The current d-vector and vocoder components are the universal versions trained on English datasets, so using them on Chinese data is bound to produce a mismatch. (My understanding is that the d-vector model and the vocoder need to be retrained on Chinese datasets for better results.)
3. The dataset contains too few voices, which makes it hard to find a "reference timbre" close enough to the target to use for generation. (Mixing several ASR and TTS datasets into one large dataset would improve coverage of target timbres.)

I will probably also start on some work to try to improve the model, and I hope to have the chance to collaborate with the author.

Questions about training and inference

As far as I know, the datatang and slr68 datasets are ASR corpora, so they carry no phoneme annotations. During training, is the raw text used directly as tokens, or is the text converted to phonemes first? Also, in your demo video I seem to see phonemes being used as input; if training uses raw text but inference uses phonemes, what processing bridges the two?
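For reference, the Chinese text front end in this family of repos normally converts characters to pinyin before they reach the synthesizer, so training and inference see the same symbol set. A minimal sketch of that conversion with pypinyin (this assumes the repo's cleaner is pypinyin-based; the exact tone style it uses may differ):

    from pypinyin import Style, lazy_pinyin

    text = "欢迎使用语音克隆工具"
    # Tone-numbered pinyin, e.g. ['huan1', 'ying2', 'shi3', 'yong4', ...]
    print(" ".join(lazy_pinyin(text, style=Style.TONE3)))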

Out of GPU memory when training the model

Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.85 GiB reserved in total by PyTorch)

Could you expose a parameter for adjusting batch_size? My GPU only has 4 GB of VRAM (GTX 1050 Ti), and training with the default settings keeps running out of memory...
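Until such a flag exists, the batch size can usually be lowered directly in the synthesizer hyperparameters. A hedged sketch, assuming the schedule format used by the upstream SV2TTS synthesizer, where each tuple is (reduction factor, learning rate, step, batch size); check synthesizer/hparams.py in your checkout for the actual layout and defaults:

    # synthesizer/hparams.py - illustrative values only
    tts_schedule = [(2, 1e-3,  20_000, 8),    # last field is the batch size, lowered for a 4 GB GPU
                    (2, 5e-4,  40_000, 8),
                    (2, 1e-4, 100_000, 8)]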

Deploy as a web service

Is there any way to deploy it as an HTTP service so that it can be called remotely?
I have two computers.
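The repo does not ship a server, but a thin HTTP wrapper is straightforward to sketch. A hypothetical Flask endpoint, assuming a synthesize(text, reference_wav) helper that wraps the encoder/synthesizer/vocoder calls the toolbox makes (that helper is not part of the repo and is only illustrative):

    # server.py - hypothetical HTTP wrapper around the cloning pipeline
    import io

    import soundfile as sf
    from flask import Flask, request, send_file

    from my_clone import synthesize  # hypothetical helper: (text, reference wav) -> (waveform, sample rate)

    app = Flask(__name__)

    @app.route("/tts", methods=["POST"])
    def tts():
        text = request.form["text"]                  # text to speak
        reference = request.files["reference"]       # short wav of the target speaker
        wav, sample_rate = synthesize(text, reference)
        buf = io.BytesIO()
        sf.write(buf, wav, sample_rate, format="WAV")
        buf.seek(0)
        return send_file(buf, mimetype="audio/wav")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)

The second machine can then POST text plus a reference recording to the /tts endpoint and receive the generated wav.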

Using the models from Baidu Cloud, everything generated plays back as noise

Environment

Windows 10
Python 3.7

Description

After putting the .pt models from Baidu Cloud into synthesizer/saved_models/, python .\demo_toolbox.py runs, but everything it produces is noise; neither Chinese characters nor pinyin work well.

Screenshots

(screenshots omitted)

I am a complete beginner; I hope someone experienced can offer some pointers when they have time.

How to fix "DLL load failed: The paging file is too small for this operation to complete" when running python synthesizer_preprocess_audio.py

I hit the above error when running python synthesizer_preprocess_audio.py. I found a fix on CSDN:
1. If the Python environment is not on the C: drive, open Advanced system settings -> Advanced -> Performance Settings -> Advanced -> Virtual memory -> Change, untick "Automatically manage paging file size for all drives", choose "Custom size", and set both the initial size and maximum size to 10240.
2. Change the num_workers parameter of the DataLoader to 0.
But I am not sure how exactly to set that parameter to 0.
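For step 2, the change goes wherever the training code constructs its PyTorch DataLoader. A minimal, self-contained sketch (the dataset here is a stand-in; in the repo it would be the synthesizer's own dataset object):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(8, 4))  # stand-in for the real training dataset

    # num_workers=0 keeps data loading in the main process, which sidesteps the
    # Windows "paging file too small" error raised when worker processes are spawned.
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)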

A question about aidatatang_200zh

I tried downloading aidatatang_200zh from its official site. Do I need to extract all of the files under aidatatang_200zh\aidatatang_200zh\aidatatang_200zh\corpus\train?

Could you make a video tutorial?

I am a beginner and I really have tried. Fortunately other people have published tutorials for some of the installation, download, and configuration steps, but tutorials from different people do not fit together, which leaves me rather lost. A lot of this comes down to details: the method someone teaches may work for their particular problem but not for this project. Please consider it.

What does the speaker encoder's output vector look like?

I came over from the SV2TTS comment section. I trained the speaker encoder myself on the aishell3 dataset, which has 214 speakers, while the output speaker embedding is 256-dimensional. As a result each speaker's vector is very sparse: most dimensions are 0, so it is almost one-hot. Using these embeddings to train the synthesizer does not work at all; the loss is NaN.
When you trained the synthesizer, did you notice roughly what the speaker embedding vectors looked like?
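A quick way to see whether the embeddings have collapsed toward near-one-hot vectors is to load a few of the .npy files written by the embedding preprocessing step and check their sparsity and norm. A hedged sketch, assuming the embeddings live under SV2TTS/synthesizer/embeds as in the usual SV2TTS layout (adjust the path to your datasets root):

    import numpy as np
    from pathlib import Path

    embed_dir = Path("SV2TTS/synthesizer/embeds")
    for fpath in sorted(embed_dir.glob("*.npy"))[:5]:
        e = np.load(fpath)
        zero_frac = float(np.mean(np.isclose(e, 0.0)))
        print(f"{fpath.name}: shape={e.shape}, L2 norm={np.linalg.norm(e):.3f}, "
              f"near-zero fraction={zero_frac:.2f}")

A trained d-vector is L2-normalised (norm close to 1); if almost every entry is zero, the encoder has probably seen too few speakers or has not trained long enough.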

Backend Qt5Agg is interactive backend. Turning interactive mode on.

Running it directly works fine, but when debugging demo_toolbox.py the following error is raised:
Traceback (most recent call last):
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\work\python\ide\pycharm\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "E:/instance/tts/Realtime-Voice-Clone-Chinese-main/demo_toolbox.py", line 43, in <module>
    Toolbox(**vars(args))
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\__init__.py", line 75, in __init__
    self.ui = UI()
  File "E:\instance\tts\Realtime-Voice-Clone-Chinese-main\toolbox\ui.py", line 450, in __init__
    self.projections_layout.addWidget(FigureCanvas(fig))
TypeError: addWidget(self, QWidget, stretch: int = 0, alignment: Union[Qt.Alignment, Qt.AlignmentFlag] = Qt.Alignment()): argument 1 has unexpected type 'FigureCanvasQTAgg'
Backend Qt5Agg is interactive backend. Turning interactive mode on.
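The type error suggests matplotlib bound to a different Qt package than the PyQt5 the toolbox UI uses, so the FigureCanvas it creates is not the QWidget subclass addWidget() expects; the debugger changes the import order and triggers it. A hedged workaround is to pin the Qt binding and backend before anything imports matplotlib.pyplot (whether this is the right fix depends on which Qt bindings are installed):

    # at the very top of demo_toolbox.py, before toolbox/ui.py is imported
    import os
    os.environ.setdefault("QT_API", "pyqt5")  # make matplotlib bind to PyQt5 rather than PySide2
    import matplotlib
    matplotlib.use("Qt5Agg")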

Voice samples

If the voice sample is a song, can the speaker's own voice still be cloned from it?

Could you provide pretrained encoder/vocoder models?

python synthesizer_preprocess_embeds.py <path-to-datasets_root>/SV2TTS/synthesizer

Output:

Arguments:
    synthesizer_root:      <path-to-datasets_root>/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0% 0/25308 [00:00<?, ?utterances/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 242, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "<path-to-Realtime-Voice-Clone-Chinese>/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))    
  File "<path-to-Realtime-Voice-Clone-Chinese>/synthesizer/preprocess.py", line 268, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
Embedding:   0% 0/25308 [00:01<?, ?utterances/s]
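The traceback boils down to the encoder checkpoint simply not being at the default path encoder/saved_models/pretrained.pt. A hedged pre-flight check; download the pretrained encoder and place it there, or point the script at its real location if your version of synthesizer_preprocess_embeds.py exposes an encoder_model_fpath option (the printed arguments above suggest it does):

    from pathlib import Path

    fpath = Path("encoder/saved_models/pretrained.pt")
    if not fpath.exists():
        print(f"Encoder checkpoint not found at {fpath.resolve()}; "
              "place pretrained.pt there before running synthesizer_preprocess_embeds.py")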

Does this project require training it yourself?

The downloaded pretrained models do not work when placed in the repository root; the toolbox only starts after copying them into the corresponding model directories.
It can only load audio from the datasets.
The analysis feature cannot be used.
The instructions are not detailed enough, so I cannot figure out how to use it.
Please write a detailed step-by-step guide.

Crash when generating audio

Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
Aborted (core dumped)
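The assertion is raised inside PortAudio/ALSA during playback, after synthesis has already finished (the vocoder loaded and built Wave-RNN normally). A hedged workaround is to skip in-process playback and write the generated waveform to disk instead, then play the file with an external player (the variable names here are placeholders for whatever the vocoder step returns):

    import numpy as np
    import soundfile as sf

    # in the real toolbox, `wav` and `sample_rate` come from the vocoder output
    wav = np.zeros(16000, dtype=np.float32)  # placeholder so this sketch runs on its own
    sample_rate = 16000
    sf.write("generated.wav", wav, sample_rate)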

What is kiwisolver...?

Traceback (most recent call last):
  File "D:\code\Realtime-Voice-Clone-Chinese\demo_toolbox.py", line 2, in <module>
    from toolbox import Toolbox
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\__init__.py", line 1, in <module>
    from toolbox.ui import UI
  File "D:\code\Realtime-Voice-Clone-Chinese\toolbox\ui.py", line 1, in <module>
    import matplotlib.pyplot as plt
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 157, in <module>
    check_versions()
  File "D:\software\install place\python3\lib\site-packages\matplotlib\__init__.py", line 151, in check_versions
    module = importlib.import_module(modname)
  File "D:\software\install place\python3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'kiwisolver'
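kiwisolver is one of matplotlib's required dependencies (a constraint solver used for figure layout); the traceback just means it is missing from the environment. Installing it, or reinstalling matplotlib so that its dependencies are pulled in, normally resolves this:

    pip install kiwisolver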

Pre-trained Model

Hi, I am from outside China.

Is it possible to have the pre-trained models available for download from Google Drive?

sounddevice error

On Windows 10 the default system encoding is GBK, and running demo_toolbox.py raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

Open D:\Env\anaconda3\Lib\site-packages\sounddevice.py and go to line 573 (there is an existing issue about this error); after changing the codec to mbcs the error becomes:

UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1:xxxxxxxxxxxxxxxxxxx

Running python -m sounddevice reports the same error.
(screenshot of the Windows system-encoding setting omitted)
After changing the system encoding as shown above and rebooting, running python -m sounddevice no longer reports an error:

C:\Users\LM>python -m sounddevice
   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 mic (USBAudio2.0), MME (2 in, 0 out)
   2 麦克风阵列 (Realtek High Definition , MME (2 in, 0 out)
   3 立体声混音 (Realtek High Definition , MME (2 in, 0 out)
   4 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  5 ear (15- Meizu HiFi DAC Headpho, MME (0 in, 2 out)
   6 Speaker (Realtek High Definitio, MME (0 in, 2 out)
   7 DELL U2414H (NVIDIA High Defini, MME (0 in, 2 out)
   8 主声音捕获驱动程序, Windows DirectSound (2 in, 0 out)
   9 mic (USBAudio2.0), Windows DirectSound (2 in, 0 out)
  10 麦克风阵列 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  11 立体声混音 (Realtek High Definition Audio), Windows DirectSound (2 in, 0 out)
  12 主声音驱动程序, Windows DirectSound (0 in, 2 out)
  13 Speaker (Realtek High Definition Audio), Windows DirectSound (0 in, 2 out)
  14 DELL U2414H (NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
  15 DSD 转码器 (DoP/Native), ASIO (0 in, 2 out)
  16 ear (15- Meizu HiFi DAC Headphone Amplifier), Windows WASAPI (0 in, 2 out)
  17 Speaker (Realtek High Definition Audio), Windows WASAPI (0 in, 2 out)
  18 DELL U2414H (NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
  19 麦克风阵列 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  20 立体声混音 (Realtek High Definition Audio), Windows WASAPI (2 in, 0 out)
  21 mic (USBAudio2.0), Windows WASAPI (2 in, 0 out)
  22 Output (), Windows WDM-KS (0 in, 2 out)
  23 耳机 (), Windows WDM-KS (0 in, 2 out)
  24 Headphones (Meizu HiFi DAC Headphone Amplifier), Windows WDM-KS (0 in, 2 out)
  25 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  26 立体声混音 (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  27 麦克风阵列 (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  28 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (0 in, 1 out)
  29 耳机 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free AG Audio%0
;(LM’s AirPods Pro)), Windows WDM-KS (1 in, 0 out)
  30 麦克风 (USBAudio2.0), Windows WDM-KS (2 in, 0 out)

Running demo_toolbox.py again then opens normally.

Error when running demo_cli.py

I downloaded both the original model and your model, but running demo_cli.py produces the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([70, 512]).
