Hi, I'm absolutely new here and coming in from Corridor Crew and AItrepreneur and can'

Training Issue about dreambooth-stable-diffusion HOT 5 CLOSED

joepenna commented on July 18, 2024

Training Issue

from dreambooth-stable-diffusion.

Comments (5)

dadiwonton commented on July 18, 2024

My first error was that my training got killed immediately after it started. I kept retrying for some time until this error popped up.

jooshkins commented on July 18, 2024

Having similar issues.
This might just be a coincidence, but I got the RuntimeError: No CUDA GPUs are available error when running on an RTX 3090.
Switched to an RTX A5000 and that error went away.
But now I am having the issue dadiwonton described, where it is killed right after starting to train:

Epoch 0:   0%|                                         | 0/2020 [00:00<?, ?it/s]/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:72: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:229: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Epoch 0:   0%| | 1/2020 [00:02<1:40:08,  2.98s/it, loss=0.0382, v_num=0, train/lHere comes the checkpoint...
Killed
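
For the `RuntimeError: No CUDA GPUs are available` part, a quick check of what PyTorch can actually see may help narrow things down before relaunching training. This is only a sketch: the helper name `cuda_report` is made up for illustration, and it says nothing about why the process is later `Killed`.

```python
def cuda_report():
    """Summarise what PyTorch can see as a plain dict (hypothetical helper)."""
    try:
        # Imported lazily so the function still runs where torch is not installed.
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_installed": True,
        "cuda_available": available,
        # One entry per visible GPU, e.g. the card name reported by the driver.
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


print(cuda_report())
```

If `cuda_available` comes back `False` on a pod that is supposed to have a GPU, the problem is probably with the instance or driver rather than with the training script.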

InvixGG commented on July 18, 2024

Saw a YT Comment about your issue:

If you get the error about it being killed after 1 step, open the terminal, type "ps aux" and look for the pid for both python relauncher and webui, then type "kill (the id for either)" and kill both of them. Was stuck on that error for a while with an A5000 but this fixed my problem.

I'm having the same issue on an A5000 as @dadiwonton, where it doesn't even start an iteration. Same error.
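
The steps from the quoted YouTube comment (run `ps aux`, find the relauncher and webui processes, kill both) could be scripted roughly like this. It is a sketch, not a tested fix: the substrings `relauncher` and `webui` are guesses at what the command lines look like on a given pod, and the function names are made up for illustration.

```python
import os
import signal
import subprocess

# Assumption: the leftover processes have these substrings in their command line.
PATTERNS = ("relauncher", "webui")


def find_pids(ps_output, patterns=PATTERNS):
    """Return (pid, command) pairs from `ps aux` output whose command matches a pattern."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # `ps aux` has 11 columns, COMMAND is last
        if len(fields) < 11:
            continue
        pid, command = int(fields[1]), fields[10]
        if pid == os.getpid():
            continue  # never target this script itself
        if any(p in command for p in patterns):
            matches.append((pid, command))
    return matches


def kill_stale(dry_run=True):
    """List matching processes; send SIGTERM only when dry_run is False."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    found = find_pids(out)
    for pid, command in found:
        print(f"{'would kill' if dry_run else 'killing'} {pid}: {command}")
        if not dry_run:
            os.kill(pid, signal.SIGTERM)
    return found
```

Calling `kill_stale()` only prints what it would do; `kill_stale(dry_run=False)` actually sends the signal, so it is worth reading the dry-run output first.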

jooshkins commented on July 18, 2024

Ahh thanks mate! Killing those processes seemed to clear it up, and it is training now.
Will see if it finishes 🤞

djbielejeski commented on July 18, 2024

Lots of good help on discord.
