Comments (25)
Yes, every time I try generating an image with the Distributed checkbox ticked, it happens, regardless of whether it's the first image generated on the instance or not.
from stable-diffusion-webui-distributed.
What seems most likely is that optimize_jobs is being run multiple times before a single request, which should never happen. This could be due to an import conflict, but I'm not sure yet.
Could you add this debug statement, logger.debug(f"added job for worker {worker.label}"), after this line? Afterwards, reproduce the issue and post the logs like you did before; it may show where the issue is occurring.
The problem in this case is that your slave's ipm was ending up at around 120 for some reason (3 ipm like before sounds about right). The best thing to do would be to rebenchmark or manually adjust that ipm in the config. Then, the distribution logic should split your requests about evenly and this should be far less of an issue.
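To see why an inflated benchmark skews the split so badly, here is a toy sketch of ipm-weighted job splitting. The names and logic are illustrative only, not the extension's actual code:

```python
def split_jobs(total_images: int, ipm: dict[str, float]) -> dict[str, int]:
    """Split a batch between workers in proportion to their images-per-minute."""
    total_ipm = sum(ipm.values())
    # proportional share, floored, then leftovers handed to the fastest worker
    shares = {w: int(total_images * rate / total_ipm) for w, rate in ipm.items()}
    leftover = total_images - sum(shares.values())
    fastest = max(ipm, key=ipm.get)
    shares[fastest] += leftover
    return shares

# With the bogus ~120 ipm benchmark, nearly everything lands on the slave:
print(split_jobs(10, {"master": 3.47, "slave": 118.31}))  # → {'master': 0, 'slave': 10}
# With realistic, similar speeds, the split is roughly even:
print(split_jobs(10, {"master": 3.47, "slave": 3.5}))     # → {'master': 4, 'slave': 6}
```

This is why re-benchmarking (or hand-correcting the recorded ipm) evens out the distribution.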
After another benchmark and some restarting I was able to get it to work. Thank you for your time and help!
Could you add --distributed-debug
to your sdwui launch arguments and continue to run sdwui to see if this happens again? If it does, this will give me more verbose logs to work with. Whenever this happens, console output from the slave instance will be useful as well. Has sdwui ever gotten stuck near 100% for you before using this extension?
Also, yes, you're using it just fine as long as the compute device for each instance is different.
Thanks for the reply. I ran both instances with --distributed-debug
and this is the output when it hangs at that status:
main:
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00, 1.28it/s]
DISTRIBUTED | DEBUG  Took master 18.84s  distributed.py:161
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
slave node:
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00, 1.09it/s]
DISTRIBUTED | DEBUG  Took master 21.30s  distributed.py:161
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
this is the output of the log on the web-ui:
DEBUG - waiting for worker thread 'slave_request'
DEBUG - Took master 18.96s
DEBUG - had to substitute sampler index with name
DEBUG - worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm
DEBUG - worker 'slave' loaded weights in 0.01s
DEBUG - Worker 'slave' 3.54/3.94 GB VRAM free
DEBUG - 'slave' job's given starting seed is 1141091032 with 1 coming before it
INFO - Job distribution:
1 * 1 iteration(s) + 40 complementary: 41 images total
'master' - 1 image(s) @ 3.47 ipm
'slave' - 40 image(s) @ 118.31 ipm
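The "40 complementary" count in this log follows from numbers that appear elsewhere in these debug logs (19.01 s of slack time for the slave, 0.47 s per requested image). A quick reconstruction of the arithmetic, not the extension's actual code:

```python
# Reconstruction from the logged values: the slave's "complementary" image
# count is the slack time it has while the master renders its own share,
# divided by the slave's predicted cost per image.
slack_s = 19.01        # logged: slack time available for worker 'slave'
sec_per_image = 0.47   # logged: per-image cost implied by the bogus 118.31 ipm benchmark
complementary = int(slack_s / sec_per_image)
print(complementary)  # → 40
```

An honest ~3 ipm benchmark would make the per-image cost ~20 s, leaving no room for complementary images at all.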
I've not encountered this issue before using the extension, and it goes away if I disable it when generating.
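For what it's worth, the repeating "waiting for worker thread 'slave_request'" pattern matches a worker thread that never returns. A toy illustration of bounded versus unbounded waiting (this is not the extension's code; slave_request here just simulates a call that outlives the caller's patience):

```python
import threading
import time

def slave_request():
    # stands in for a long-running /sdapi/v1/txt2img call that never returns in time
    time.sleep(10)

t = threading.Thread(target=slave_request, name="slave_request", daemon=True)
t.start()

# A plain t.join() here would block indefinitely; joining with a timeout lets
# the caller re-check, log, and eventually give up instead of hanging forever.
deadline = time.monotonic() + 0.3
while t.is_alive() and time.monotonic() < deadline:
    print(f"waiting for worker thread '{t.name}'")
    t.join(timeout=0.1)
print("gave up waiting" if t.is_alive() else "worker finished")
```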
How long was your main instance running until you encountered this? Also, did you happen to use the interrupt/skip button on a request in the web-ui at any point before this happened?
I had just restarted both instances, so newly started. I didn't interrupt the generation. I tried with no prior generation to the distributed one, and with one on each instance beforehand; same result. D:
When you say restarted, do you mean you fully stopped sdwui and restarted both (i.e., didn't use the restart button built into the web interface)?
Correct, stopped and started again.
So this is happening every time, on the first try, after rebooting sdwui?
Could you also post your distributed-config.json file from the extension folder?
I tried generating one image.
Master:
DISTRIBUTED | DEBUG  config loaded  world.py:643
DISTRIBUTED | DEBUG  added job for worker master  world.py:401
DISTRIBUTED | DEBUG  added job for worker slave  world.py:401
DISTRIBUTED | DEBUG  World initialized!  distributed.py:237
DISTRIBUTED | DEBUG  The requested number of images(1) was not cleanly divisible by the number of realtime nodes(2) resulting in 1 that will be redistributed  world.py:483
DISTRIBUTED | DEBUG  There's 19.01s of slack time available for worker 'slave'  world.py:526
DISTRIBUTED | DEBUG  worker 'slave': 40 complementary image(s) = 19.01s slack / 0.47s per requested image  world.py:531
DISTRIBUTED | INFO   Job distribution:  world.py:555
                     1 * 1 iteration(s) + 40 complementary: 41 images total
                     'master' - 1 image(s) @ 3.47 ipm
                     'slave' - 40 image(s) @ 118.31 ipm
DISTRIBUTED | DEBUG  'slave' job's given starting seed is 143432277 with 1 coming before it  distributed.py:344
DISTRIBUTED | DEBUG  Worker 'slave' 3.54/3.94 GB VRAM free  worker.py:318
DISTRIBUTED | DEBUG  worker 'slave' loaded weights in 0.01s  worker.py:675
DISTRIBUTED | DEBUG  worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm  worker.py:335
DISTRIBUTED | DEBUG  local script(s): [Hypertile], [Comments] seem to be unsupported by worker 'slave'  worker.py:391
DISTRIBUTED | DEBUG  had to substitute sampler index with name  worker.py:422
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00, 1.28it/s]
DISTRIBUTED | DEBUG  Took master 18.89s  distributed.py:161
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
Slave:
DISTRIBUTED | DEBUG  config loaded  world.py:643
DISTRIBUTED | DEBUG  added job for worker master  world.py:401
DISTRIBUTED | DEBUG  added job for worker slave  world.py:401
DISTRIBUTED | DEBUG  World initialized!  distributed.py:237
DISTRIBUTED | DEBUG  The requested number of images(1) was not cleanly divisible by the number of realtime nodes(2) resulting in 1 that will be redistributed  world.py:483
DISTRIBUTED | DEBUG  There's 19.01s of slack time available for worker 'slave'  world.py:526
DISTRIBUTED | DEBUG  worker 'slave': 40 complementary image(s) = 19.01s slack / 0.47s per requested image  world.py:531
DISTRIBUTED | INFO   Job distribution:  world.py:555
                     1 * 1 iteration(s) + 40 complementary: 41 images total
                     'master' - 1 image(s) @ 3.47 ipm
                     'slave' - 40 image(s) @ 118.31 ipm
DISTRIBUTED | DEBUG  'slave' job's given starting seed is 143432277 with 1 coming before it  distributed.py:344
DISTRIBUTED | DEBUG  Worker 'slave' 3.54/3.94 GB VRAM free  worker.py:318
DISTRIBUTED | DEBUG  worker 'slave' loaded weights in 0.01s  worker.py:675
DISTRIBUTED | DEBUG  worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm  worker.py:335
DISTRIBUTED | DEBUG  local script(s): [Hypertile], [Comments] seem to be unsupported by worker 'slave'  worker.py:391
DISTRIBUTED | DEBUG  had to substitute sampler index with name  worker.py:422
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00, 1.28it/s]
DISTRIBUTED | DEBUG  Took master 18.89s  distributed.py:161
DISTRIBUTED | DEBUG  waiting for worker thread 'slave_request'  distributed.py:165
Config:
{
    "workers": [
        {
            "master": {
                "avg_ipm": 3.468163269192995,
                "master": true,
                "address": "0.0.0.0",
                "port": 7860,
                "eta_percent_error": [],
                "tls": false,
                "state": 1,
                "user": null,
                "password": null,
                "pixel_cap": -1
            }
        },
        {
            "slave": {
                "avg_ipm": 118.3148549027504,
                "master": false,
                "address": "0.0.0.0",
                "port": 7861,
                "eta_percent_error": [],
                "tls": false,
                "state": 1,
                "user": "None",
                "password": "None",
                "pixel_cap": -1
            }
        }
    ],
    "benchmark_payload": {
        "prompt": "A herd of cows grazing at the bottom of a sunny valley",
        "negative_prompt": "",
        "steps": 20,
        "width": 512,
        "height": 512,
        "batch_size": 1
    },
    "job_timeout": 3,
    "enabled": true,
    "complement_production": true
}
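Given the inflated avg_ipm recorded for the slave above, one way to hand-correct it (as suggested earlier in the thread) is to edit distributed-config.json directly. A minimal sketch, assuming the structure shown; fix_ipm is a hypothetical helper, not part of the extension's API:

```python
import json

def fix_ipm(config: dict, worker: str, ipm: float) -> dict:
    """Overwrite a worker's recorded benchmark speed in the config structure."""
    for entry in config["workers"]:
        if worker in entry:
            entry[worker]["avg_ipm"] = ipm
    return config

# Minimal config shaped like the one above:
cfg = {"workers": [{"master": {"avg_ipm": 3.47}}, {"slave": {"avg_ipm": 118.31}}]}
fix_ipm(cfg, "slave", 3.5)  # ~3 ipm, like the master, instead of the bogus ~118
print(json.dumps(cfg, indent=2))
```

In practice you would json.load the real file from the extension folder, apply the fix, json.dump it back, and restart; or simply use the extension's re-benchmark option instead.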
What happened to your speed ratings, by the way? Before, both of your instances showed about 3 ipm, but now the slave is at around 120 ipm?
It's strange: it sometimes shows something normal like 3, but usually the slave is really high, at about 120 ipm. I wish it could do 120 ipm, haha.
Can you let me know if this also happens consistently on 424a1c8?
Just tested, and on that commit I cannot generate anything with the extension enabled. :( It hangs like this.
master:
Launching Web UI with arguments: --listen --api --medvram --xformers --enable-insecure-extension-access --device-id=0 --port 7860 --distributed-debug
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | INFO   doing initial ping sweep to see which workers are reachable  distributed.py:51
DISTRIBUTED | DEBUG  checking if worker 'slave' is reachable...  world.py:693
DISTRIBUTED | INFO   worker 'slave' is online  world.py:720
Distributed: worker 'slave' is online
Loading weights [139ac005d4] from /home/main/programs/stablediff/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV60B1_v51VAE.ckpt
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | DEBUG  config loaded  world.py:658
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | DEBUG  config loaded  world.py:658
Startup time: 17.8s (prepare environment: 2.9s, import torch: 7.7s, import gradio: 1.0s, setup paths: 2.0s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 1.1s, create ui: 1.0s, gradio launch: 0.2s, add APIs: 0.9s).
Creating model from config: /home/main/programs/stablediff/stable-diffusion-webui/configs/v1-inference.yaml
/home/main/programs/stablediff/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Applying attention optimization: Doggettx... done.
Model loaded in 8.4s (load weights from disk: 5.3s, create model: 0.6s, apply weights to model: 1.6s, apply half(): 0.7s, calculate empty prompt: 0.2s).
DISTRIBUTED | DEBUG  config loaded  world.py:658
DISTRIBUTED | DEBUG  recorded speed for worker 'master' is invalid  world.py:214
DISTRIBUTED | DEBUG  recorded speed for worker 'slave' is invalid  world.py:214
DISTRIBUTED | DEBUG  worker 'slave' loaded weights in 0.01s  worker.py:686
slave:
Launching Web UI with arguments: --listen --api --medvram --xformers --enable-insecure-extension-access --device-id=1 --port 7861
DISTRIBUTED | INFO   doing initial ping sweep to see which workers are reachable  distributed.py:51
DISTRIBUTED | ERROR  HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x75b52fa81540>: Failed to establish a new connection: [Errno 111] Connection refused'))  worker.py:618
DISTRIBUTED | INFO   worker 'slave' is unreachable  world.py:725
Loading weights [139ac005d4] from /home/main/programs/stablediff/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV60B1_v51VAE.ckpt
Running on local URL: http://0.0.0.0:7861
To create a public link, set `share=True` in `launch()`.
Startup time: 15.3s (prepare environment: 2.6s, import torch: 5.4s, import gradio: 1.1s, setup paths: 2.1s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 1.0s, create ui: 1.1s, gradio launch: 0.2s, add APIs: 0.9s).
Creating model from config: /home/main/programs/stablediff/stable-diffusion-webui/configs/v1-inference.yaml
/home/main/programs/stablediff/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Applying attention optimization: Doggettx... done.
Model loaded in 13.2s (load weights from disk: 5.0s, create model: 0.5s, apply weights to model: 5.5s, apply half(): 2.0s, calculate empty prompt: 0.2s).
WebUI extension log:
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
INFO - worker 'slave' is online
DEBUG - checking if worker 'slave' is reachable...
INFO - doing initial ping sweep to see which workers are reachable
DEBUG - config loaded
It seems that it doesn't send the command to the slave instance?
I deleted the previous extension version, started SD, shut it down, and installed the new one.
Does this happen with no other extensions enabled (builtin ones should be fine)? Also, can you post a list of the extensions you've been using?
This is the only non-builtin extension I'm using, and it works fine if I disable it.
In that case:
- what commit of sdwui are you on?
- what version is your python interpreter?
- can you post your distributed.log file from the extension folder?
Were you reloading the config yourself multiple times in a row? At least initially it looks like your slave instance wasn't running yet as you were getting a connection refused error.
sdwui version: 1.9.3 (1c0a0c4)
python ver: 3.10.14
distributed.log:
2024-05-11 08:02:47,965 - ERROR - Config was not found at '/home/main/programs/stablediff/stable-diffusion-webui/extensions/stable-diffusion-webui-distributed/distributed-config.json'
2024-05-11 08:02:47,972 - INFO - Generated new config file at '/home/main/programs/stablediff/stable-diffusion-webui/extensions/stable-diffusion-webui-distributed/distributed-config.json'
2024-05-11 08:02:47,975 - ERROR - config is corrupt or invalid JSON, unable to load
2024-05-11 08:02:47,976 - DEBUG - cannot parse null config (present but empty config file?)
generating defaults for config
2024-05-11 08:02:47,979 - DEBUG - config saved
2024-05-11 08:02:47,981 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:02:48,115 - DEBUG - config loaded
2024-05-11 08:02:48,561 - DEBUG - config loaded
2024-05-11 08:02:49,560 - DEBUG - config loaded
2024-05-11 08:02:49,605 - DEBUG - config loaded
2024-05-11 08:02:49,754 - DEBUG - config loaded
2024-05-11 08:02:49,798 - DEBUG - config loaded
2024-05-11 08:02:50,067 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:03:12,589 - DEBUG - config saved
2024-05-11 08:03:24,649 - INFO - Redoing benchmarks...
2024-05-11 08:03:24,661 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:04:00,909 - DEBUG - master finished warming up
2024-05-11 08:04:18,209 - INFO - Sample 1: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute
2024-05-11 08:04:35,482 - INFO - Sample 2: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute
2024-05-11 08:04:52,793 - INFO - Sample 3: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute
2024-05-11 08:04:52,796 - DEBUG - Worker 'master' average ipm: 3.47
2024-05-11 08:04:52,798 - INFO - benchmarking worker 'master'
2024-05-11 08:04:52,800 - INFO - benchmarking worker 'slave'
2024-05-11 08:04:52,897 - DEBUG - Worker 'slave' 3.76/3.94 GB VRAM free
2024-05-11 08:06:14,143 - DEBUG - config loaded
2024-05-11 08:06:14,145 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:06:14,147 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:06:14,156 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:06:42,980 - DEBUG - handling interrupt signal
2024-05-11 08:06:42,984 - DEBUG - config saved
2024-05-11 08:06:57,760 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:06:57,769 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7a920dc857b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:06:57,772 - INFO - worker 'slave' is unreachable
2024-05-11 08:06:59,163 - DEBUG - config loaded
2024-05-11 08:06:59,170 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:06:59,172 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:06:59,210 - INFO - worker 'slave' is unreachable
2024-05-11 08:06:59,343 - DEBUG - config loaded
2024-05-11 08:06:59,784 - DEBUG - config loaded
2024-05-11 08:07:00,814 - DEBUG - config loaded
2024-05-11 08:07:00,862 - DEBUG - config loaded
2024-05-11 08:07:01,025 - DEBUG - config loaded
2024-05-11 08:07:01,073 - DEBUG - config loaded
2024-05-11 08:07:18,009 - DEBUG - config loaded
2024-05-11 08:07:18,011 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:07:18,013 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:07:18,023 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:07:47,572 - WARNING - config reports invalid speed (0 ipm) for worker 'master'
please re-benchmark
2024-05-11 08:07:47,575 - WARNING - config reports invalid speed (0 ipm) for worker 'slave'
please re-benchmark
2024-05-11 08:07:51,197 - INFO - Redoing benchmarks...
2024-05-11 08:07:51,207 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:08:39,355 - DEBUG - handling interrupt signal
2024-05-11 08:08:39,359 - DEBUG - config saved
2024-05-11 08:08:54,534 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:08:54,543 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7364e6e7d750>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:08:54,546 - INFO - worker 'slave' is unreachable
2024-05-11 08:08:55,413 - DEBUG - config loaded
2024-05-11 08:08:55,420 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:08:55,422 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:08:55,426 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b078428d4b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:08:55,429 - INFO - worker 'slave' is unreachable
2024-05-11 08:08:55,564 - DEBUG - config loaded
2024-05-11 08:08:56,008 - DEBUG - config loaded
2024-05-11 08:08:56,965 - DEBUG - config loaded
2024-05-11 08:08:57,010 - DEBUG - config loaded
2024-05-11 08:08:57,159 - DEBUG - config loaded
2024-05-11 08:08:57,204 - DEBUG - config loaded
2024-05-11 08:09:58,364 - DEBUG - handling interrupt signal
2024-05-11 08:09:58,369 - DEBUG - config saved
2024-05-11 08:10:15,019 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:10:15,028 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x75b52fa81540>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:10:15,031 - INFO - worker 'slave' is unreachable
2024-05-11 08:10:34,280 - DEBUG - config loaded
2024-05-11 08:10:34,287 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:10:34,289 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:10:34,414 - INFO - worker 'slave' is online
2024-05-11 08:10:34,550 - DEBUG - config loaded
2024-05-11 08:10:34,980 - DEBUG - config loaded
2024-05-11 08:10:35,970 - DEBUG - config loaded
2024-05-11 08:10:36,017 - DEBUG - config loaded
2024-05-11 08:10:36,163 - DEBUG - config loaded
2024-05-11 08:10:36,206 - DEBUG - config loaded
2024-05-11 08:11:05,151 - DEBUG - config loaded
2024-05-11 08:11:05,154 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:11:05,156 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:11:05,166 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:22:31,026 - DEBUG - handling interrupt signal
2024-05-11 08:22:31,029 - DEBUG - config saved
2024-05-11 12:04:14,157 - DEBUG - config loaded
2024-05-11 12:04:14,163 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 12:04:14,165 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 12:04:14,169 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x72c67007d270>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 12:04:14,172 - INFO - worker 'slave' is unreachable
2024-05-11 12:04:14,300 - DEBUG - config loaded
2024-05-11 12:04:14,736 - DEBUG - config loaded
2024-05-11 12:04:15,672 - DEBUG - config loaded
2024-05-11 12:04:15,714 - DEBUG - config loaded
2024-05-11 12:04:15,862 - DEBUG - config loaded
2024-05-11 12:04:15,906 - DEBUG - config loaded
2024-05-11 12:04:39,376 - DEBUG - config loaded
2024-05-11 12:04:39,378 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 12:04:39,380 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 13:00:21,789 - DEBUG - config loaded
2024-05-11 13:00:21,792 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 13:00:21,794 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 13:02:15,027 - DEBUG - config loaded
2024-05-11 13:02:15,029 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 13:02:15,031 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:54:30,599 - DEBUG - config loaded
2024-05-11 19:54:30,602 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:54:30,604 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:56:46,931 - DEBUG - config loaded
2024-05-11 19:56:46,935 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:56:46,937 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:58:26,201 - DEBUG - config loaded
2024-05-11 19:58:26,204 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:58:26,206 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:01:18,842 - DEBUG - config loaded
2024-05-11 20:01:18,845 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:01:18,847 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:03:07,938 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:03:07,946 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ef589479570>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:03:07,950 - INFO - worker 'slave' is unreachable
2024-05-11 20:05:22,187 - DEBUG - config loaded
2024-05-11 20:05:22,193 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:05:22,196 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:05:22,307 - DEBUG - worker 'slave' loaded weights in 0.11s
2024-05-11 20:05:52,491 - DEBUG - handling interrupt signal
2024-05-11 20:05:52,740 - DEBUG - config saved
2024-05-11 20:06:09,005 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:09,014 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa44527d450>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:06:09,017 - INFO - worker 'slave' is unreachable
2024-05-11 20:06:09,909 - DEBUG - config loaded
2024-05-11 20:06:09,916 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:09,918 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 20:06:09,922 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x715c71e79330>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:06:09,925 - INFO - worker 'slave' is unreachable
2024-05-11 20:06:10,059 - DEBUG - config loaded
2024-05-11 20:06:10,507 - DEBUG - config loaded
2024-05-11 20:06:11,448 - DEBUG - config loaded
2024-05-11 20:06:11,495 - DEBUG - config loaded
2024-05-11 20:06:11,643 - DEBUG - config loaded
2024-05-11 20:06:11,686 - DEBUG - config loaded
2024-05-11 20:06:34,215 - DEBUG - handling interrupt signal
2024-05-11 20:06:34,219 - DEBUG - config saved
2024-05-11 20:06:48,342 - DEBUG - config loaded
2024-05-11 20:06:48,349 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:48,351 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 20:06:48,478 - INFO - worker 'slave' is online
2024-05-11 20:06:48,623 - DEBUG - config loaded
2024-05-11 20:06:49,052 - DEBUG - config loaded
2024-05-11 20:06:49,988 - DEBUG - config loaded
2024-05-11 20:06:50,036 - DEBUG - config loaded
2024-05-11 20:06:50,185 - DEBUG - config loaded
2024-05-11 20:06:50,227 - DEBUG - config loaded
2024-05-11 20:08:55,680 - DEBUG - config loaded
2024-05-11 20:08:55,683 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:08:55,685 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:08:55,695 - DEBUG - worker 'slave' loaded weights in 0.01s
This is the log of both instances starting, and after I tried generating an image; they just hang like this on the latest commit. (In the screenshot, the master is on the left and the slave is on the right.)
> Were you reloading the config yourself multiple times in a row? At least initially it looks like your slave instance wasn't running yet as you were getting a connection refused error.
I always restart the instances by stopping and starting them. I believe the connection refused error is due to the slave trying to ping itself (after fetching the worker config) before it has started. Before trying to generate, I always check from the webui that the slave's status is IDLE.
Remove the extension from the slave worker; you only need it installed on the main instance. If you're using the same installation root for more than one instance, you probably need to use sdwui's command-line options to force the slave instance to use a separate config with the extension disabled. You can see in the slave instance's log that it's trying to connect back to itself as a worker (since the port is the same), which shouldn't happen.
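The self-ping is visible in the slave's log: it tries 0.0.0.0:7861, its own port. A trivial check of that failure mode (is_self_reference is a hypothetical helper written for this thread, not part of the extension):

```python
def is_self_reference(instance_port: int, worker_addr: str, worker_port: int) -> bool:
    """True if a worker entry points back at the instance doing the pinging."""
    # same port plus a wildcard/loopback address means the instance would
    # ping itself during the reachability sweep
    return worker_port == instance_port and worker_addr in ("0.0.0.0", "127.0.0.1", "localhost")

# The slave at :7861 with a worker entry of 0.0.0.0:7861 pings itself:
print(is_self_reference(7861, "0.0.0.0", 7861))  # → True
# The master at :7860 pinging the slave at :7861 is fine:
print(is_self_reference(7860, "0.0.0.0", 7861))  # → False
```

This is also why the sweep succeeds or fails depending on which instance happens to be up when the other starts.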
Thank you - I was able to get it working by adding --disable-all-extensions, which disables all extra extensions on the slave instance, apart from the built-in ones.
But I am now facing an issue where my slave instance seemingly runs out of VRAM when generating through the extension. Low sampling step counts work, but anything higher than ~10-15 seems to cause it to run out of VRAM.
It's strange, because it works just fine if I generate through its own web-ui, at any number of sampling steps and even at higher resolutions. Do you think this could be an issue with how the extension spreads the workload?
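On the VRAM question: note that the config above exposes a per-worker pixel_cap (-1 = unlimited), which is the kind of knob that can bound how much work is dispatched to a weaker card at once. A sketch of how such a cap could gate a job before dispatch (illustrative only; this is not the extension's actual dispatch logic):

```python
def within_cap(width: int, height: int, batch_size: int, pixel_cap: int) -> bool:
    """True if a job fits under a worker's pixel cap (-1 means unlimited)."""
    if pixel_cap < 0:
        return True
    return width * height * batch_size <= pixel_cap

# 40 images of 512x512 dispatched at once is ~10.5 MP of work:
print(within_cap(512, 512, 40, -1))             # → True (unlimited, current config)
print(within_cap(512, 512, 40, 512 * 512 * 8))  # → False (cap at 8 images' worth)
```

If the bad benchmark is piling 40-image jobs onto the slave (as in the logs above), a finite pixel_cap or a corrected avg_ipm would shrink what the 4 GB card is asked to hold at once.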
If there are still issues let me know and I can reopen the issue.