so-vits-svc's People

Contributors

112292454, 2dipw, asdfw13, burtbai, cnchtu, erythrocyte3803, geraint-dou, huanlinoto, hxdnshx, innnky, kakaruhayate, limbang, magic-akari, misteo, miuzarte, mlbv, ms903x1, narusemioshirakana, njsgdd10086, quicksandznzn, ricecakey06, rvc-boss, sherkeyxd, stardust-minus, tps-f, umoufuton, xdedss, ylzz1997, zscharlie, zwa73

so-vits-svc's Issues

No response when resampling on Colab

Platform: Google Colab

Stage where the problem occurs: preprocessing / resampling to 44100 Hz

Python version: the default version used when installing dependencies in the Colab notebook

PyTorch version: 1.13.1+cu116

Branch: 4.0

Problem description: After extracting the dataset online into dataset_raw, running the next cell, "Resample to 44100hz", produces no response at all: no error log appears and the dataset folder is never created. Manually creating the dataset folder makes no difference.

Log screenshot:
(image)
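To surface whatever is failing silently, one can run a minimal version of the resampling step by hand and watch for exceptions. This is only a sketch: it assumes librosa and soundfile are installed and the project's usual layout of dataset_raw/<speaker>/*.wav resampled into dataset/44k.

import os
import librosa
import soundfile

src_root, dst_root = "dataset_raw", "dataset/44k"  # layout assumed from the project docs
for speaker in os.listdir(src_root):
    os.makedirs(os.path.join(dst_root, speaker), exist_ok=True)
    for name in os.listdir(os.path.join(src_root, speaker)):
        if not name.endswith(".wav"):
            continue
        wav, _ = librosa.load(os.path.join(src_root, speaker, name), sr=44100, mono=True)
        soundfile.write(os.path.join(dst_root, speaker, name), wav, 44100)
        print("resampled", speaker, name)  # any failure will raise here instead of hanging silently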

Comparison of content-extraction methods?

Hello, this project is very well done.
I see you used the 9th layer of HuBERT; what was the reasoning behind that choice? How do you weigh content-information loss against timbre leakage, and have you compared other layers, or an approach like Whisper?
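For context, here is a minimal sketch of pulling features from a specific HuBERT layer with fairseq, which is how such a layer comparison would presumably be run; it assumes the checkpoint_best_legacy_500.pt content encoder used by this repo and fairseq's HubertModel API, where output_layer selects the transformer layer.

import torch
from fairseq import checkpoint_utils

models, _, _ = checkpoint_utils.load_model_ensemble_and_task(["hubert/checkpoint_best_legacy_500.pt"])
hubert = models[0].eval()

wav = torch.randn(1, 16000)  # stand-in for 1 second of 16 kHz audio
with torch.no_grad():
    feats, _ = hubert.extract_features(source=wav, padding_mask=None, mask=False, output_layer=9)
print(feats.shape)  # (1, frames, hidden_dim); swap output_layer to compare layers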

Sharing a direct-download method for /hubert/checkpoint_best_legacy_500.pt from box.com

Asking for the Colab links

innnky's earlier Colab notebook no longer seems to work; training throws an error.
Has the project been moved here? Can the vits3.0 models still be used? I would like to ask for the vits3.0 and 4.0 Colab links.

validation loss

There are many types of losses. Given that a run can easily generate thousands of checkpoints, a validation loss would be particularly useful for deciding when to stop training. May I ask whether one is already there (and if so, which one)?
Screenshot from 2023-03-13 22-36-03
Thanks!

AttributeError: 'HParams' object has no attribute 'dataset_type'

Platform: Windows

Stage where the problem occurs: inference

Python version: 3.8

PyTorch version: 1.13.1+cu116

Branch: 4.0-v2

Dataset:

Authorization proof screenshot:

Problem description: Inference fails with AttributeError: 'HParams' object has no attribute 'dataset_type'; the 4.0 branch infers normally.

Log screenshot:

use_cuda, True
INFO:44k:{'log_interval': 200, 'eval_interval': 800, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 6, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 10240, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 512, 'port': '8001', 'keep_ckpts': 10}
INFO:44k:{'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 44100, 'filter_length': 2048, 'hop_length': 512, 'win_length': 2048, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': 22050}
INFO:44k:{'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 256, 'n_speakers': 200}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Traceback (most recent call last):
File "train.py", line 435, in
main()
File "train.py", line 57, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "D:\so-vits\so-vits-svc1\python38\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "D:\so-vits\so-vits-svc1\python38\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes while not context.join():
File "D:\so-vits\so-vits-svc1\python38\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "D:\so-vits\so-vits-svc1\python38\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
fn(i, *args)
File "D:\so-vits\so-vits-svc1\train.py", line 76, in run
dataset_constructor = DatasetConstructor(hps, num_replicas=n_gpus, rank=rank)
File "D:\so-vits\so-vits-svc1\data_utils.py", line 293, in init
self._get_components()
File "D:\so-vits\so-vits-svc1\data_utils.py", line 296, in _get_components
self._init_datasets()
File "D:\so-vits\so-vits-svc1\data_utils.py", line 301, in _init_datasets
self._train_dataset = self.dataset_function[self.hparams.data.dataset_type](self.hparams,
AttributeError: 'HParams' object has no attribute 'dataset_type'
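The trace dies on self.hparams.data.dataset_type, so the first thing worth checking is whether the config being loaded defines that key at all; a 4.0 config reused on the 4.0-v2 branch would not. A minimal sketch, assuming the key sits under the data section as the traceback implies:

import json

with open("configs/config.json") as f:
    cfg = json.load(f)
# None here means the config predates the 4.0-v2 schema and should be regenerated on this branch.
print(cfg.get("data", {}).get("dataset_type"))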

_pickle.UnpicklingError during training

Platform: CentOS 7.9
Stage where the problem occurs: training
Python version: 3.8.13
PyTorch version: 1.13.1+cu116
Branch: 4.0-v2
Dataset: my own voice
Authorization proof screenshot:

Problem description: Training fails shortly after starting and cannot continue; the error is below. Web searches suggest it is related to the torch version, but changing the torch version did not seem to help. It may be caused by torch.load().

Log:
Run with srun in a Slurm environment, on 4 GPUs.
Everything before this point was normal...
(Some directory names involve private information and have been replaced with asterisks.)

INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:44k:Train Epoch: 1 [0%]
INFO:44k:Losses: [4.583632469177246, 2.16941237449646, 11.800090789794922, 124.89070129394531, 616.9237060546875], step: 0, lr: 0.0002
/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
INFO:44k:Saving model and optimizer state at iteration 1 to ./logs/44k/G_0.pth
INFO:44k:Saving model and optimizer state at iteration 1 to ./logs/44k/D_0.pth
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "train.py", line 310, in <module>
    main()
  File "train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/****/so-vits-svc/train.py", line 122, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
  File "/scratch/****/so-vits-svc/train.py", line 141, in train_and_evaluate
    for batch_idx, items in enumerate(train_loader):
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
_pickle.UnpicklingError: Caught UnpicklingError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/****/so-vits-svc/data_utils.py", line 88, in __getitem__
    return self.get_audio(self.audiopaths[index][0])
  File "/scratch/****/so-vits-svc/data_utils.py", line 51, in get_audio
    spec = torch.load(spec_filename)
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/****/project/conda_envs/sov/lib/python3.8/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

After that, it exits with code 1.
The same dataset was also tried on the Windows platform, where no error occurred.
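Since the failure happens inside torch.load(spec_filename), a likely culprit is a truncated or corrupted cached *.spec.pt file, for example one left behind by an interrupted preprocessing run. A sketch that scans for unreadable spec files, assuming the usual dataset/44k layout:

import pathlib
import torch

for spec in pathlib.Path("dataset/44k").rglob("*.spec.pt"):
    try:
        torch.load(spec, map_location="cpu")
    except Exception as err:
        print("unreadable:", spec, "->", err)
        # spec.unlink()  # uncomment to delete it so the training code regenerates it on the next run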

Pre-trained base models

Do the authors have a planned date for releasing pre-trained base models?
The base models I downloaded online (Hugging Face, etc.) have not worked especially well (for generating Japanese songs).
Alternatively, do you know where I can find good base models that support v4?

A question about the source-audio parameters in the WebUI

After training a model for roughly 2400 steps I tried it and it works. During conversion I found that some audio converts fine once saved as 44100 Hz 16-bit WAV, but one WAV source file, itself already 16-bit, fails with ValueError: Audio data cannot be converted to 16-bit int format. Even cutting a short segment out in Audition and re-saving it as 44100 Hz 16-bit gives the same error. Those are the only WAV parameters there are; what else about a source file could prevent it from being converted?
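That ValueError typically comes from soundfile being asked to write data it cannot represent as 16-bit integers, and odd container subtypes can trigger it even when the file claims to be 16-bit. One way to rule the container out is to decode the file and rewrite it explicitly as 16-bit PCM; a sketch assuming the soundfile package (file names are placeholders):

import soundfile as sf

data, sr = sf.read("input.wav", dtype="float32")         # decode whatever the container actually holds
sf.write("input_pcm16.wav", data, sr, subtype="PCM_16")  # rewrite as plain 16-bit PCM
print(sf.info("input_pcm16.wav"))                        # confirm the subtype before retrying the WebUI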

A more accurate English Model Introduction

I found the English description a bit hard to understand. The English version misses the crucial detail that you use VITS. The Simplified Chinese intro is easier to understand.

... to extract source audio speech features, and inputs them together with F0 to replace the original text input to achieve the effect of song conversion. ...

The following would be better in my humble opinion:

... to extract source audio speech features, and inputs them together with F0 into VITS instead of the original text input to achieve the effect of song conversion. ...

Thanks!

P.S. Sorry, there is no Simplified Chinese input method on this laptop, so I am writing in English.

Training cannot continue at the final step; Google turned up no solution, could someone please advise?

Platform: Windows

Stage where the problem occurs: training

Python version: Python 3.7.0

PyTorch version: 1.13.1

Branch: 4.0 e701955 Unlock the version of numpy

Problem description:
All the preceding commands succeeded without errors; only this final step fails. Running the training command (python train.py -c configs/config.json -m 44k) raises the following error:
assert torch.cuda.is_available(), "CPU training is not allowed."
AssertionError: CPU training is not allowed.

Log screenshot:
(venv) (base) PS E:\voice\so-vits-svc> python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 专业版
GCC version: (Rev10, Built by MSYS2 project) 12.2.0
Clang version: Could not collect
CMake version: version 3.26.0-rc5
Libc version: N/A

Python version: 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: False
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070 SUPER
Nvidia driver version: 517.40
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.20.0
[pip3] torch==1.13.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchaudio==0.13.1
[conda] blas 1.0 mkl https://repo.anaconda.com/pkgs/main
[conda] mkl 2021.4.0 haa95532_640 https://repo.anaconda.com/pkgs/main
[conda] mkl-service 2.4.0 py39h2bbff1b_0 https://repo.anaconda.com/pkgs/main
[conda] mkl_fft 1.3.1 py39h277e83a_0 https://repo.anaconda.com/pkgs/main
[conda] mkl_random 1.2.2 py39hf11a4ad_0 https://repo.anaconda.com/pkgs/main
[conda] numpy 1.21.5 py39h7a0a035_3 https://repo.anaconda.com/pkgs/main
[conda] numpy-base 1.21.5 py39hca35cd5_3 https://repo.anaconda.com/pkgs/main
[conda] numpydoc 1.4.0 py39haa95532_0 https://repo.anaconda.com/pkgs/main

How do I configure CUDA so that training can continue on the GPU?
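The collected environment already shows the cause: PyTorch version: 1.13.1+cpu is a CPU-only wheel, even though a CUDA 11.7 runtime and an RTX 2070 SUPER are present. A quick check, with the usual reinstall noted in comments (the index URL is PyTorch's standard one for CUDA 11.7 wheels; confirm against pytorch.org for your setup):

import torch

print(torch.__version__)          # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # must print True before train.py's assertion will pass
# If it prints False, reinstall a CUDA build from the shell, e.g.:
#   pip uninstall torch torchaudio
#   pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117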

How can cross-gender inference be improved?

The training dataset I used is a female voice, trained to 9600 steps.

If the inference input is a female voice, the result is excellent! But if the inference input is a male voice, the result is noticeably worse. How should I improve this? My current idea is to train for more steps, to around 20000, and try again. If the result is still unsatisfactory, I will try converting the male voice to a female one by some other means first, and then run inference.

Are there any other suggestions?

Error at the dependency-installation step when running on Google Colab

Platform: Google Colab

Stage where the problem occurs: installing dependencies

Python version: Python 3.9.16

PyTorch version: 1.13.1+cu116

Branch: 4.0

Problem description: The dependency-installation step fails with the following error:

error: subprocess-exited-with-error

× Building wheel for fairseq (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for fairseq (pyproject.toml) ... error
ERROR: Failed building wheel for fairseq
Building wheel for antlr4-python3-runtime (setup.py) ... done
Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.8-py3-none-any.whl size=141231 sha256=6430468728cf59e967aae7f0619d2ccc82bc6e6626ac52048d1b2fee50c31878
Stored in directory: /root/.cache/pip/wheels/42/3c/ae/14db087e6018de74810afe32eb6ac890ef9c68ba19b00db97a
Successfully built pyworld antlr4-python3-runtime
Failed to build fairseq
ERROR: Could not build wheels for fairseq, which is required to install pyproject.toml-based projects

Log screenshot:
(image)
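Not a definitive diagnosis, but a failure to build a fairseq wheel on Colab usually means pip fell back to compiling from source and the build environment was lacking; a commonly tried first step is refreshing the build tooling before reinstalling, e.g. pip install --upgrade pip setuptools wheel, and then rerunning the dependency cell. The actual compiler output sits above the quoted summary, and that is where the real cause will be.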

KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

I'm trying to fine-tune 4.0-v2 using this checkpoint I found: https://huggingface.co/cr941131/sovits-4.0-v2-hubert/tree/main
(not sure whether it is any good)
But when I try to start training, this error happens:

Traceback (most recent call last):
  File "/home/manjaro/.conda/envs/soft-vc/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/media/manjaro/NVME_2tb/NeuralNetworks/so-vits-svc-v2-44100/train.py", line 112, in run
    scheduler_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
  File "/home/manjaro/.conda/envs/soft-vc/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 583, in __init__
    super(ExponentialLR, self).__init__(optimizer, last_epoch, verbose)
  File "/home/manjaro/.conda/envs/soft-vc/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 42, in __init__
    raise KeyError("param 'initial_lr' is not specified "
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

Where can I find official checkpoints if that one is bad?
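For what it's worth, this KeyError is raised by PyTorch's ExponentialLR whenever it is constructed with last_epoch != -1 (i.e. resuming) on an optimizer whose param groups lack initial_lr, which happens when the optimizer state from the checkpoint was not restored. A sketch of the commonly suggested workaround, seeding the field before the scheduler is built (names follow the train.py shown in the trace and are assumptions):

# inside run(), just before the ExponentialLR construction:
for group in optim_g.param_groups:
    group.setdefault("initial_lr", hps.train.learning_rate)
scheduler_g = torch.optim.lr_scheduler.ExponentialLR(
    optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)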

Questions regarding significant reorganization of the project

https://github.com/34j/so-vits-svc-fork

I forked so-vits-svc 4.0 v1, did some major refactoring, and added some (external) features:

  • realtime voice conversion
  • unified CLI
  • GUI for inference
  • automatic download of pretrained models
  • pre-commit to format code
  • upload to PyPI using CI

I am considering sending a PR based on the repository above. However, if that would be a problem for the svc-develop-team (for example, trouble using git, GitHub CI, or pre-commit, objections to removing Chinese from the code, or simply too much hassle), I will stop. What do you think about such a refactoring?

If you would rather reject this, I would appreciate it if you could link to my project instead. Thank you.

Requesting expert help with Colab

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.9.0 requires jedi>=0.10, which is not installed.
cvxpy 1.2.3 requires setuptools<=64.0.2, but you have setuptools 67.6.0 which is incompatible.

This message appears during "Clone repository and install requirements"; do I need to worry about it?
I am using the 4.0 version of the Colab notebook.

Training error on version 4.0

Training on version 4.0 fails with the following error:
INFO:44k:{'train': {'log_interval': 200, 'eval_interval': 800, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 6, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 10240, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 512, 'port': '8001', 'keep_ckpts': 3}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 44100, 'filter_length': 2048, 'hop_length': 512, 'win_length': 2048, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': 22050}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 256, 'n_speakers': 200}, 'spk': {'miaopeng': 0}, 'model_dir': './logs\44k'}
WARNING:44k:D:\Desktop\so-vits-svc-4.0 is not a git repository, therefore hash value comparison will be ignored.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs\44k\G_0.pth
error, emb_g.weight is not in the checkpoint
INFO:44k:emb_g.weight is not in the checkpoint
load
INFO:44k:Loaded checkpoint './logs\44k\G_0.pth' (iteration 1)
./logs\44k\D_0.pth
load
INFO:44k:Loaded checkpoint './logs\44k\D_0.pth' (iteration 1)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\autograd\__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:339.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "D:\Desktop\so-vits-svc-4.0\train.py", line 310, in
main()
File "D:\Desktop\so-vits-svc-4.0\train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
while not context.join():
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in wrap
fn(i, *args)
File "D:\Desktop\so-vits-svc-4.0\train.py", line 119, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
File "D:\Desktop\so-vits-svc-4.0\train.py", line 202, in train_and_evaluate
scaler.step(optim_g)
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 313, in step
return optimizer.step(*args, **kwargs)
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\optim\lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\optim\optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\optim\adamw.py", line 162, in step
adamw(params_with_grad,
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\optim\adamw.py", line 219, in adamw
func(params,
File "C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\optim\adamw.py", line 273, in single_tensor_adamw
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: output with shape [1, 256] doesn't match the broadcast shape [200, 256]

Here is my package information:
Python 3.9.16
Name: torch
Version: 1.13.1+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: c:\programdata\miniconda3\envs\so4.0\lib\site-packages
Requires: typing-extensions
Required-by: fairseq, torchaudio, torchvision

ChatGPT answered as follows:
Based on the information you provided, the following error message appeared during training:

C:\ProgramData\miniconda3\envs\so4.0\lib\site-packages\torch\autograd_init_.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:339.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
...
ProcessRaisedException:

-- Process 0 terminated with the following error:
Trace...
This error usually appears when using PyTorch's distributed data parallel (DDP) training. It indicates that a parameter's gradient and its gradient bucket have mismatched strides, which may hurt training performance.

The solution may vary from case to case, but the following steps may help:

Upgrade PyTorch: if you are using an older version of PyTorch, try upgrading to the latest release to see whether the problem has already been fixed.

Adjust the batch size: try different batch sizes to see whether the problem disappears.

Change the distributed data-parallel configuration: if you used any special DDP settings, try changing them and see whether the problem goes away.

Check the gradient shapes: use print statements or a debugger to verify that the gradients have the shapes you expect, to see whether the problem arises during gradient computation.
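Note that the real failure is the final RuntimeError: output with shape [1, 256] doesn't match the broadcast shape [200, 256], raised inside the AdamW update; together with the earlier emb_g.weight is not in the checkpoint line, it points at a speaker-embedding size mismatch between the loaded G_0.pth/D_0.pth (including their optimizer state) and the model the config builds with n_speakers: 200, or vice versa. A sketch for inspecting what the checkpoint actually holds (key names taken from the log above; the path is assumed):

import torch

ckpt = torch.load("logs/44k/G_0.pth", map_location="cpu")
emb = ckpt["model"].get("emb_g.weight")
print(None if emb is None else tuple(emb.shape))  # compare against [200, 256] implied by the config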

Dependency installation fails on an M1 Mac

Because numpy==1.20.3 does not yet support M1 Macs, installation fails with ERROR: Failed building wheel for numpy.
The numpy repository shows this is fixed from version 1.21.4 onward, which also resolves some x86_64 and pipenv installation issues:
numpy/numpy#17784 (comment)

May I ask which package's dependency pins numpy to 1.20, and could the project's numpy version please be upgraded?

A pile of "xxx is not in the checkpoint" errors after training starts

I get the same errors both locally and on Colab, and the environments are fine. After 10000 training steps, the inferred voice is nothing but noise.
Is it a problem with G_0 and D_0? The log shows both were loaded.
Could someone please take a look? Thanks!
INFO:44k:{'train': {'log_interval': 200, 'eval_interval': 800, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 6, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 10240, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 512, 'port': '8001', 'keep_ckpts': 3}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 44100, 'filter_length': 2048, 'hop_length': 512, 'win_length': 2048, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': 22050}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 256, 'n_speakers': 200}, 'spk': {'owen': 0}, 'model_dir': './logs/44k'}
2023-03-13 13:35:44.410924: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-03-13 13:35:45.367774: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-13 13:35:45.367898: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-13 13:35:45.367920: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
DEBUG:h5py.conv:Creating converter from 7 to 5
DEBUG:h5py.conv:Creating converter from 5 to 7
DEBUG:h5py.conv:Creating converter from 7 to 5
DEBUG:h5py.conv:Creating converter from 5 to 7
DEBUG:jaxlib.mlir.mlir_libs:Initializing MLIR with module: site_initialize_0
DEBUG:jaxlib.mlir.mlir_libs:Registering dialects from initializer <module 'jaxlib.mlir.mlir_libs.site_initialize_0' from '/usr/local/lib/python3.9/dist-packages/jaxlib/mlir/mlir_libs/site_initialize_0.so'>
DEBUG:jax.src.path:etils.epath found. Using etils.epath for file I/O.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/44k/G_0.pth
error, emb_g.weight is not in the checkpoint
INFO:44k:emb_g.weight is not in the checkpoint
error, pre.weight is not in the checkpoint
INFO:44k:pre.weight is not in the checkpoint
error, pre.bias is not in the checkpoint
INFO:44k:pre.bias is not in the checkpoint
error, enc_p.proj.weight is not in the checkpoint
INFO:44k:enc_p.proj.weight is not in the checkpoint
error, enc_p.proj.bias is not in the checkpoint
INFO:44k:enc_p.proj.bias is not in the checkpoint
error, enc_p.f0_emb.weight is not in the checkpoint
INFO:44k:enc_p.f0_emb.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.0.emb_rel_k is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.emb_rel_k is not in the checkpoint
error, enc_p.enc_.attn_layers.0.emb_rel_v is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.emb_rel_v is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_q.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_q.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_q.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_q.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_k.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_k.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_k.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_k.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_v.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_v.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_v.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_v.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_o.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_o.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.0.conv_o.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.0.conv_o.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.1.emb_rel_k is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.emb_rel_k is not in the checkpoint
error, enc_p.enc_.attn_layers.1.emb_rel_v is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.emb_rel_v is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_q.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_q.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_q.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_q.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_k.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_k.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_k.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_k.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_v.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_v.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_v.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_v.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_o.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_o.weight is not in the checkpoint
error, enc_p.enc_.attn_layers.1.conv_o.bias is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.1.conv_o.bias is not in the checkpoint
error, enc_p.enc_.attn_layers.2.emb_rel_k is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.2.emb_rel_k is not in the checkpoint
error, enc_p.enc_.attn_layers.2.emb_rel_v is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.2.emb_rel_v is not in the checkpoint
error, enc_p.enc_.attn_layers.2.conv_q.weight is not in the checkpoint
INFO:44k:enc_p.enc_.attn_layers.2.conv_q.weight is not in the checkpoint
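These messages mean the loaded G_0.pth/D_0.pth do not contain the weights the current branch's model expects, so most of the generator starts from random initialization, which is consistent with noise-only output after 10000 steps. A quick way to see how far apart checkpoint and code are is to list the checkpoint's keys (the "model" key follows the load log above; the path is assumed):

import torch

saved = torch.load("logs/44k/G_0.pth", map_location="cpu")["model"]
print(len(saved), "tensors in the checkpoint")
print(sorted(saved)[:10])  # compare these names against the ones reported missing above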

Colab inference not working at all

Whenever I run the final inference step, it fails with this:

Traceback (most recent call last):
  File "/content/so-vits-svc/inference_main.py", line 101, in <module>
    main()
  File "/content/so-vits-svc/inference_main.py", line 47, in main
    svc_model = Svc(args.model_path, args.config_path, args.device, args.cluster_model_path)
  File "/content/so-vits-svc/inference/infer_tool.py", line 127, in __init__
    self.load_model()
  File "/content/so-vits-svc/inference/infer_tool.py", line 138, in load_model
    self.hps_ms.data.filter_length // 2 + 1,
AttributeError: 'HParams' object has no attribute 'filter_length'

Any fix?
image_2023-03-14_212606368

Not saving checkpoints

I am running the Colab 4.0 notebook and everything works very well, but when I ran the actual training step I noticed that it is not saving any checkpoints. It saves once at the beginning, but then never again. I have now run it for over 80 epochs with absolutely no updates.

如何正确提issues (How to properly raise issues)

How to properly raise issues

  1. Before asking, try to solve the problem yourself first, with the help of search engines (Google/Bing, etc.). Open an issue only if you really cannot solve it on your own, and before doing so, please read "How To Ask Questions The Smart Way" carefully;
  2. When asking, you must provide the following information so the problem can be located: system platform, the stage where the problem occurs, Python version, torch version, branch used, dataset used, screenshot of authorization proof, problem description, and complete log screenshots;
  3. Be friendly when asking.

Which issues will be closed

  1. Asking others to do everything for you;
  2. Anything related to one-click packages/bundled environments;
  3. Incomplete information;
  4. Training on an unauthorized dataset (game/anime characters are not counted in this category for now, but still train with care; if the rights holder can be reached, you must contact them first and clarify the situation).

Reference format (you can copy it directly)

Platform: fill in the platform you are using, e.g. Windows

Stage where the problem occurs: installing dependencies / inference / training / preprocessing / other

Python version: fill in the Python version you are using, obtainable with python -V

PyTorch version: fill in the PyTorch version you are using, obtainable with pip show torch

Branch: fill in the code branch you are using

Dataset: fill in the source of the dataset you trained on; if you are only doing inference, this may be left blank

Authorization proof screenshot:
add the authorization proof screenshot here; if the dataset is your own voice, or is a game/anime character, or you have no training needs, this may be left blank

Problem description: describe your problem here, in as much detail as possible

Log screenshot:
add complete log screenshots here to help locate the problem

ValueError: numpy.ndarray has the wrong size, try recompiling. Expected 88, got 96

Great work! This error occurred while extracting the HuBERT and f0 features. Is there anything wrong with how hubert/checkpoint_best_legacy_500.pt is used? It was downloaded from http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
BTW, I am on the 4.0 branch with Python 3.8. Thanks for your information.

Traceback (most recent call last):
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/sdb/mike/repo/so-vits-svc/preprocess_hubert_f0.py", line 44, in process_batch
process_one(filename, hmodel)
File "/data/sdb/mike/repo/so-vits-svc/preprocess_hubert_f0.py", line 34, in process_one
f0 = utils.compute_f0_dio(wav, sampling_rate=sampling_rate, hop_length=hop_length)
File "/data/sdb/mike/repo/so-vits-svc/utils.py", line 156, in compute_f0_dio
import pyworld
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/pyworld/__init__.py", line 7, in <module>
from .pyworld import *
File "__init__.pxd", line 199, in init pyworld.pyworld
ValueError: numpy.ndarray has the wrong size, try recompiling. Expected 88, got 96
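For what it's worth, this particular ValueError is the classic symptom of a binary ABI mismatch: the installed pyworld was compiled against a different numpy than the one now present, so it is unrelated to the HuBERT checkpoint itself. The usual suggestion is to rebuild pyworld against the current numpy, e.g. pip install --no-cache-dir --force-reinstall pyworld (standard pip flags; adjust to the environment).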

Many-speaker voice conversion task

Hi

If I use your model for a voice-conversion task with around 100 speakers, would its performance be better than that of FreeVC?
And could I get a link to the checkpoint repository?

Training is too slow

2023-03-19 17:26:01,202	44k	INFO	{'train': {'log_interval': 200, 'eval_interval': 800, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 6, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 10240, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 512, 'port': '8001', 'keep_ckpts': 10}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 44100, 'filter_length': 2048, 'hop_length': 512, 'win_length': 2048, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': 22050}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 256, 'n_speakers': 200}, 'spk': {'miria': 0}, 'model_dir': './logs\\44k'}
2023-03-19 17:26:22,608	44k	INFO	Train Epoch: 1 [0%]
2023-03-19 17:26:22,608	44k	INFO	Losses: [5.9978346824646, 5.234327793121338, 0.53354412317276, 109.33016204833984, 392.0679626464844], step: 0, lr: 0.0001
2023-03-19 17:26:27,268	44k	INFO	Saving model and optimizer state at iteration 1 to ./logs\44k\G_0.pth
2023-03-19 17:26:28,176	44k	INFO	Saving model and optimizer state at iteration 1 to ./logs\44k\D_0.pth
2023-03-19 17:28:11,282	44k	INFO	====> Epoch: 1, cost 130.08 s
2023-03-19 17:29:18,080	44k	INFO	Train Epoch: 2 [61%]
2023-03-19 17:29:18,080	44k	INFO	Losses: [2.2427873611450195, 2.418828010559082, 4.7479071617126465, 51.7297477722168, 3.8419177532196045], step: 200, lr: 9.99875e-05
2023-03-19 17:29:54,746	44k	INFO	====> Epoch: 2, cost 103.46 s
2023-03-19 17:31:38,868	44k	INFO	====> Epoch: 3, cost 104.12 s
2023-03-19 17:32:10,387	44k	INFO	Train Epoch: 4 [23%]
2023-03-19 17:32:10,387	44k	INFO	Losses: [1.7529761791229248, 3.1881754398345947, 5.160671234130859, 43.579063415527344, 2.4448511600494385], step: 400, lr: 9.996250468730469e-05
2023-03-19 17:33:23,387	44k	INFO	====> Epoch: 4, cost 104.52 s
2023-03-19 17:34:53,327	44k	INFO	Train Epoch: 5 [84%]
2023-03-19 17:34:53,327	44k	INFO	Losses: [2.7421751022338867, 1.9685697555541992, 2.512032985687256, 37.84978103637695, 1.9366132020950317], step: 600, lr: 9.995000937421877e-05
2023-03-19 17:35:08,414	44k	INFO	====> Epoch: 5, cost 105.03 s
2023-03-19 17:36:53,399	44k	INFO	====> Epoch: 6, cost 104.99 s

(image)

Sorry, I could not find a CUDA utilization option under the Video Encode item; the only change made in config.json was the number of checkpoints to keep.

The training set has 747 clips. Training is currently very slow; people using a GPU seem to get a few seconds per step, so I would like to know how to speed training up.
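From the log, each epoch takes roughly 104 s at batch_size 6, which usually means either a CPU/disk bottleneck or a GPU running far below capacity. Things commonly checked, all environment-dependent suggestions rather than guarantees: whether the training process shows up under the CUDA or Compute graph in Task Manager (the Video Encode graph will not show compute work), or under nvidia-smi on the command line; whether train.batch_size in configs/config.json can be raised within VRAM limits; and whether fp16_run can be enabled.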

High-note distortion

After a period of training on conversational speech, my model has reached very good quality in the normal vocal range, but high notes are severely distorted. I also have recordings of the character singing, but after stripping the accompaniment and harmonies with Ultimate Vocal Remover's demucs and karaoke-uvr models, the audio quality is poor, nowhere near the speech training set, and training on it produced severe buzzing and all sorts of other problems.
Do you have any suggestions?

EOFError: Ran out of input

Hi, the error occurred when extracting the spectrogram: it is caused by loading an empty *.spec.pt file, which interrupts the training process.

Env: Ubuntu / Python 3.8 / branch 4.0
Stage: training. The HuBERT and f0 features were extracted before the training stage.

Traceback (most recent call last):
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/sdb/mike/repo/so-vits-svc/train.py", line 122, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
File "/data/sdb/mike/repo/so-vits-svc/train.py", line 141, in train_and_evaluate
for batch_idx, items in enumerate(train_loader):
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1313, in _next_data
return self._process_data(data)
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
EOFError: Caught EOFError in DataLoader worker process 3.
Original Traceback (most recent call last):
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/sdb/mike/repo/so-vits-svc/data_utils.py", line 90, in getitem
return self.get_audio(self.audiopaths[index][0])
File "/data/sdb/mike/repo/so-vits-svc/data_utils.py", line 53, in get_audio
spec = torch.load(spec_filename)
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/serialization.py", line 795, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/mike/anaconda3/envs/sovits/lib/python3.8/site-packages/torch/serialization.py", line 1002, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Training stopped at epoch 10000, but inference has severe noise

Training stops once it reaches 10000 epochs; inference still produces severe noise, though the speech can now be faintly heard.

Should I raise "epochs": 10000 in the config file so it keeps training? Or might one of the steps have gone wrong?

I did not use the pre-trained model files G_0.pth and D_0.pth; could that be related?

What causes this error during ONNX export?

After running the export, the output below appears and the CPU does work briefly, but the resulting ONNX model is only 114 MB (the pth is 517 MB), and it cannot be loaded in MoeSS.
Could someone advise how to solve this? Many thanks!
Also, my model was trained in the cloud; after downloading it locally, inference works but the ONNX conversion fails. Suspecting an environment mismatch, I also tried the conversion in the cloud, and it failed there too.

load
2023-03-20 22:40:07 | INFO | root | Loaded checkpoint 'checkpoints/tomovoice/model.pth' (iteration 204)
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:2020: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input c
warnings.warn(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:2020: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input f0
warnings.warn(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:2020: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input mel2ph
warnings.warn(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:2020: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input uv
warnings.warn(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:2020: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input noise
warnings.warn(
E:\AI-barbara.v4.0\utils.py:178: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, (f0_coarse.max(), f0_coarse.min())
E:\AI-barbara.v4.0\modules\attentions.py:203: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert t_s == t_t, "Relative attention is only available for self-attention."
E:\AI-barbara.v4.0\modules\attentions.py:248: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
pad_length = max(length - (self.window_size + 1), 0)
E:\AI-barbara.v4.0\modules\attentions.py:249: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
slice_start_position = max((self.window_size + 1) - length, 0)
E:\AI-barbara.v4.0\modules\attentions.py:251: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if pad_length > 0:
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\_internal\jit_utils.py:258: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\_internal\jit_utils.py:258: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\shape_type_inference.cpp:1888.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:687: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\shape_type_inference.cpp:1888.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:687: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:1178: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\shape_type_inference.cpp:1888.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\TAKATSUKI\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\onnx\utils.py:1178: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
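Before blaming MoeSS, one cheap sanity check is validating the exported file with the onnx package, which reports truncated or structurally invalid graphs (a sketch; the file name is a placeholder). Note also that the size drop alone proves little: the .pth stores generator, discriminator, and optimizer state, while the export contains only the inference graph's weights.

import onnx

model = onnx.load("model.onnx")   # placeholder path to the exported file
onnx.checker.check_model(model)   # raises if the graph is malformed or truncated
print("opset:", model.opset_import[0].version)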

Questions about multiple characters, voice material, and training counts

I would appreciate answers to the following questions:
I have prepared dry vocals for 3 characters, roughly 10,000+ clips per character, each 2 to 13 seconds long.
batch_size: 4, learning_rate: 0.0001, 3060 Laptop (6 GB), 24 GB RAM.
1. Is it better to train the characters together, or to train each character separately?
2. Is it necessary to train with all 10,000+ clips per character? Is more data better, or are more training steps better?
3. Epoch: per cost 236.25 s. If I train the 3 characters together and give it enough time, will all 30,000+ clips be run through once per epoch?
4. If I start with only a small amount of dry-vocal data, can I add more later and continue training? What is the safe way to do that?
5. If I start with single-character training, is it true that characters cannot be added later, and the only option is to train a separate model?

Thanks a lot!

When does training end?

Watching the training process, it seems to save once it reaches a certain point; is there a way to make it save manually at any time?
I am using the 4.0 Colab notebook.
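For reference, the logs elsewhere on this page show checkpoints being written on the eval_interval boundary ('eval_interval': 800), so lowering train.eval_interval in configs/config.json should make saves happen more often; as far as these logs show there is no built-in save-on-demand command, so interrupting between intervals loses progress since the last save.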

cuda out of memory

File "D:\so-vits-svc\inference_main.py", line 51, in
out_audio, out_sr = svc_model.infer(spk, tran, raw_path)
File "D:\so-vits-svc\inference\infer_tool.py", line 224, in infer
audio = self.net_g_ms.infer(x_tst, f0=f0, g=sid)[0,0].data.float()
File "D:\so-vits-svc\models.py", line 346, in infer
z_p, m_p, logs_p, c_mask = self.enc_p_(c, c_lengths, f0=f0_to_coarse(f0))
File "C:\Users\userAppData\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "D:\so-vits-svc\models.py", line 119, in forward
x = self.enc
(x * x_mask, x_mask)
File "C:\Users\userAppData\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\so-vits-svc\attentions.py", line 39, in forward
y = self.attn_layers[i](x, x, attn_mask)
File "C:\Users\userAppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\so-vits-svc\attentions.py", line 143, in forward
x, self.attn = self.attention(q, k, v, mask=attn_mask)
File "D:\so-vits-svc\attentions.py", line 160, in attention
scores_local = self._relative_position_to_absolute_position(rel_logits)
File "D:\so-vits-svc\attentions.py", line 221, in _relative_position_to_absolute_position
x = F.pad(x, commons.convert_pad_shape([[0,0],[0,0],[0,0],[0,1]]))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.39 GiB (GPU 0; 6.00 GiB total capacity; 3.23 GiB already allocated; 53.94 MiB free; 3.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
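The allocator hint at the end of the message is actionable: PYTORCH_CUDA_ALLOC_CONF is a real PyTorch environment variable, though the value below is only an example rather than a tuned setting. On a 6 GiB card, splitting the source audio into shorter clips before inference remains the more reliable fix, since attention memory grows quickly with input length.

import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"  # example value; must be set before CUDA initializes
import torch  # import torch only after setting the variable so the allocator picks it up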

Large amounts of unnecessary DEBUG output during training

Hello!

After pulling the latest repository, I see large amounts of DEBUG output during model training. Has a recent update turned on a DEBUG switch somewhere? I would suggest adding args-based verbosity control in train.py to keep unnecessary output from polluting the logs.

Many thanks!
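The DEBUG lines visible elsewhere on this page (h5py.conv, jaxlib, numexpr, tensorflow, etc.) come from third-party libraries inheriting a root logger left at DEBUG level, not from so-vits-svc's own logger. A sketch of quieting them near the top of train.py (logger names taken from the logs above; extend the tuple as needed):

import logging

logging.getLogger().setLevel(logging.INFO)  # stop third-party loggers from inheriting DEBUG
for name in ("h5py", "jaxlib", "numexpr", "tensorflow"):
    logging.getLogger(name).setLevel(logging.WARNING)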
