Comments (13)
That just means you don't have enough memory on your GPU to run this. Try reducing batch_size and max_len in the config.
But my batch_size is already 2 and batch_percentage is 0.5. I am sharing my config file here:
```yaml
log_dir: "/hdd2/Sandipan/SDhar-Projects/StyleTTS2/Models/New_Hindi_Speech_2nd"
first_stage_path: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Log_files/epoch_1st_00037.pth"
save_freq: 2
log_interval: 10
device: "cuda"
#epochs_1st: 50
epochs_1st: 200 # number of epochs for first stage training (pre-training)
#epochs_2nd: 30
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 2
max_len: 100 # maximum number of frames
pretrained_model: ""
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if you do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
#/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Utils/PLBERT_all_languages
PLBERT_dir: 'Utils/PLBERT_all_languages/'
#"/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/val_list.txt"

data_params:
  train_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/train.txt"
  val_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/valid.txt"
  root_path: "/hdd2/Sandipan/database/Hindi_ASR_200/Hindi_Clean/"
  OOD_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/odd.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

data_params:
  train_data: "Data/train_list_new.txt"
  val_data: "Data/valid_list_new.txt"
  root_path: "/hdd5/Sandipan/SDhar-Projects/Grad-TTS-Libri/Speech-Backbones/Grad-TTS/LJSpeech-1.1/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true
  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80
  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size
  dropout: 0.2

  ######### config for decoder
  decoder:
    type: 'istftnet' # either hifigan or istftnet
    resblock_kernel_sizes: [3, 7, 11]
    upsample_rates: [10, 6]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    upsample_kernel_sizes: [20, 12]
    gen_istft_n_fft: 20
    gen_istft_hop_size: 5
  ##############################
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3, 7, 11]
    upsample_rates: [10, 5, 3, 2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    upsample_kernel_sizes: [20, 10, 6, 4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2
    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder if estimate_sigma_data is set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss
  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 50 # TMA starting epoch (1st stage)
  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)
  diff_epoch: 20 # style diffusion starting epoch (2nd stage)
  joint_epoch: 50 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 100 # minimum length of samples
  #min_len: 400
  max_len: 200 # maximum length of samples
  #max_len: 500
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this many generator updates
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
```
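For scale: going by the sr and hop_length above, and assuming max_len counts mel frames spaced hop_length samples apart (my own back-of-the-envelope arithmetic, not anything from the StyleTTS2 docs), max_len: 100 already limits training clips to about 1.25 seconds:

```python
# Rough sanity check on what max_len: 100 means in audio time.
# Assumption: max_len counts mel frames, one frame per hop_length samples.
sr = 24000        # preprocess_params.sr
hop_length = 300  # preprocess_params.spect_params.hop_length
max_len = 100     # maximum number of frames, from the config

frames_per_second = sr / hop_length  # 24000 / 300 = 80 frames per second
print(max_len / frames_per_second)   # 100 / 80 = 1.25 seconds per clip
```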
I assume this happens right at the beginning. It says here:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 7; 79.15 GiB total capacity; 2.32 GiB already allocated; 3.19 MiB free; 2.37 GiB reserved in total by PyTorch)
```

that only 2.37 GiB is reserved by torch, so is there anything else running on your GPU?
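You can double-check from Python with something like this (a minimal sketch; it assumes a recent enough torch to have torch.cuda.mem_get_info, and GPU index 7 as in your setup):

```python
import torch

# Free and total memory on the card, as CUDA reports it (in bytes).
free, total = torch.cuda.mem_get_info(7)
print(f"free: {free / 1024**3:.2f} GiB, total: {total / 1024**3:.2f} GiB")

# What this process itself has allocated/reserved through PyTorch.
print(f"allocated: {torch.cuda.memory_allocated(7) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(7) / 1024**3:.2f} GiB")
```

If free is far below total while allocated/reserved are small, some other process is holding the memory.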
Actually, I am running my code on our lab server. There are 8 GPUs, of which 4-5 are already in use for other people's jobs. I am running my code on a specific GPU id (7), which is not used by anyone else as of now.
Output of the nvidia-smi command for GPU id 7, which I am using:

```
|   7  NVIDIA L40S             Off | 00000000:24:00.0 Off |                    0 |
| N/A   36C    P8    23W / 350W |      3MiB / 46068MiB  |      0%      Default |
|                               |                       |                  N/A |
```
It seems that there is some issue somewhere, but I can't really put my finger on it. GPU 7 seems to be a 48 GB card, yet torch says it's an 80 GB one? What command are you using to run the code? There are also places in the code where it's .to("cuda") instead of .to(device); maybe fixing those would help.
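Roughly the pattern I mean, as a sketch (not the actual StyleTTS2 code):

```python
import torch

device = torch.device("cuda:7" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 10).to(device)  # model lives on GPU 7

x = torch.randn(4, 10)
# Bug: a bare "cuda" means cuda:0, so x ends up on a different GPU
# than the model and the forward pass fails or allocates on GPU 0.
# x = x.to("cuda")
x = x.to(device)  # correct: always route through the device variable
print(model(x).device)
```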
As I made changes to the specific lines of code where the issue was raised, the same error appeared again at different lines. For example:

```
  File "train_second.py", line 827, in <module>
    main()
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_second.py", line 417, in main
    s = model.style_encoder(st.unsqueeze(1) if multispeaker else gt.unsqueeze(1))
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/models.py", line 167, in forward
    h = self.shared(x)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/models.py", line 143, in forward
    x = self._shortcut(x) + self._residual(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 7; 79.15 GiB total capacity; 2.36 GiB already allocated; 9.19 MiB free; 2.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
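The error message itself suggests setting max_split_size_mb to avoid fragmentation. As I understand it, that is done via an environment variable that must be in place before CUDA is first used, e.g. (128 here is only an example value):

```python
import os

# Allocator option suggested by the OOM message; must be set before the
# first CUDA allocation, e.g. at the very top of train_second.py.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the allocator config is in place
```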
I am simply using this command:

```
python train_second.py
```
This is how I am setting the device id, and then using .to(device) in the required parts of the code:

```python
device_id = 7
device = torch.device(device_id if torch.cuda.is_available() else "cpu")
```
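As I understand it, torch.cuda.set_device would additionally make any bare "cuda" (with no index) resolve to the same card:

```python
import torch

device_id = 7
device = torch.device(device_id if torch.cuda.is_available() else "cpu")

# Also pin the current CUDA device, so tensors sent to plain "cuda"
# land on GPU 7 instead of the default GPU 0.
if torch.cuda.is_available():
    torch.cuda.set_device(device)
```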
In my code I have already replaced all .to("cuda") calls with .to(device).
This seems to be an issue that is not linked to StyleTTS; I tried to do something similar and it seemed okay. Have you tried changing the device to just "cuda" and using CUDA_VISIBLE_DEVICES=7?
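With CUDA_VISIBLE_DEVICES the process only ever sees the GPUs you list, and inside torch the listed card shows up as cuda:0, so a plain "cuda" cannot land anywhere else:

```python
# Run as: CUDA_VISIBLE_DEVICES=7 python train_second.py
import torch

print(torch.cuda.device_count())      # 1: only the masked-in GPU is visible
print(torch.cuda.get_device_name(0))  # the physical GPU 7
device = torch.device("cuda")         # safe: can only mean that card
```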
No, let me try that then.
Thank you. Actually, it seems the problem was on my end, with the GPU I was specifying. I had used CUDA_VISIBLE_DEVICES before, setting a different GPU id whenever I found an idle GPU on our server, but CUDA_VISIBLE_DEVICES=<gpu id> kept sending my code to other GPUs instead of the specific GPU id I was specifying. That's why I set the GPU id with

```python
device_id = 7
device = torch.device(device_id if torch.cuda.is_available() else "cpu")
```

However, I was still getting the CUDA out of memory error. But this time, when I executed my code with

```
CUDA_VISIBLE_DEVICES=5 python train_second.py
```

it started running. I understand that I will have to do this kind of trial and error. Thanks for your suggestion.