Comments (25)

shootie22 commented on June 24, 2024

Yes, every time I try generating an image with the Distributed checkbox ticked it will happen, regardless of whether it's the first image generated on the instance or not.

papuSpartan commented on June 24, 2024

What seems most likely is that optimize_jobs is being run multiple times before a single request, which should never happen. This could be due to an import conflict, but I'm not sure yet.

papuSpartan commented on June 24, 2024

Could you add this debug statement, logger.debug(f"added job for worker {worker.label}"), after this line? Then reproduce the issue and post the logs like you did before; it may show where the issue is occurring.
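Roughly speaking, somewhere like this; the surrounding function, loop, and job-creation call are just placeholders (not the extension's actual code), only the logger.debug line is the statement being requested:

   import logging

   logger = logging.getLogger("distributed")

   def add_jobs(world, workers):
       # placeholder sketch of the job-creation step in world.py; names are assumptions
       for worker in workers:
           world.jobs.append({"worker": worker})                 # stand-in for the real Job object
           logger.debug(f"added job for worker {worker.label}")  # the requested debug statement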

papuSpartan commented on June 24, 2024

The problem in this case is that your slave's ipm was ending up at around 120 for some reason (3 ipm like before sounds about right). The best thing to do would be to rebenchmark or manually adjust that ipm in the config. Then, the distribution logic should split your requests about evenly and this should be far less of an issue.
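For example, the slave entry in distributed-config.json (posted further down in this thread) could be edited so avg_ipm holds a realistic value before restarting; the 3.5 here is just a ballpark based on the master's measured 3.47 ipm, and a fresh benchmark will fill in the real number:

   {
      "slave": {
         "avg_ipm": 3.5,
         "master": false,
         "address": "0.0.0.0",
         "port": 7861,
         "eta_percent_error": [],
         "tls": false,
         "state": 1,
         "user": "None",
         "password": "None",
         "pixel_cap": -1
      }
   }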

shootie22 commented on June 24, 2024

After another benchmark and some restarting I was able to get it to work. Thank you for your time and help!

papuSpartan commented on June 24, 2024

Could you add --distributed-debug to your sdwui launch arguments and continue to run sdwui to see if this happens again? If it does, this will give me more verbose logs to work with. Whenever this happens, console output from the slave instance will be useful as well. Has sdwui ever gotten stuck near 100% for you before using this extension?
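For example, the launch line might look something like this (assuming a Linux install started through webui.sh; keep whatever other flags you normally use):

   ./webui.sh --listen --api --port 7860 --distributed-debug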

Also, yes, you're using it just fine as long as the compute device for each instance is different.

shootie22 commented on June 24, 2024

Thanks for the reply. I ran both instances with --distributed-debug and this is the output when it hangs at that status:
main:
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00, 1.28it/s]
DISTRIBUTED | DEBUG    Took master 18.84s                                                    distributed.py:161
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165
slave node:
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00, 1.09it/s]
DISTRIBUTED | DEBUG    Took master 21.30s                                                    distributed.py:161
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165

This is the output of the log in the web-ui:

DEBUG - waiting for worker thread 'slave_request'
DEBUG - Took master 18.96s
DEBUG - had to substitute sampler index with name
DEBUG - worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm

DEBUG - worker 'slave' loaded weights in 0.01s
DEBUG - Worker 'slave' 3.54/3.94 GB VRAM free

DEBUG - 'slave' job's given starting seed is 1141091032 with 1 coming before it
INFO - Job distribution:
1 * 1 iteration(s) + 40 complementary: 41 images total
'master' - 1 image(s) @ 3.47 ipm
'slave' - 40 image(s) @ 118.31 ipm

I've not encountered this issue before using the extension, and it goes away if I disable it when generating.

papuSpartan commented on June 24, 2024

How long had your main instance been running before you encountered this? Also, did you happen to use the interrupt/skip button on a request in the web-ui at any point before this happened?

shootie22 commented on June 24, 2024

I had just restarted both instances, so they were newly started. I didn't interrupt the generation. I tried both with no prior generation before the distributed one and with one generation on each instance beforehand; same result. D:

papuSpartan commented on June 24, 2024

When you say restarted, do you mean you fully stopped sdwui and restarted both (rather than using the restart button built into the web interface)?

shootie22 commented on June 24, 2024

Correct, stopped and started again.

papuSpartan commented on June 24, 2024

So this is happening every time, on the first try, after rebooting sdwui?

papuSpartan commented on June 24, 2024

Could you also post your distributed-config.json file from the extension folder?

shootie22 commented on June 24, 2024

I tried generating one image.
Master:

DISTRIBUTED | DEBUG    config loaded                                                              world.py:643
DISTRIBUTED | DEBUG    added job for worker master                                                world.py:401
DISTRIBUTED | DEBUG    added job for worker slave                                                 world.py:401
DISTRIBUTED | DEBUG    World initialized!                                                   distributed.py:237
DISTRIBUTED | DEBUG    The requested number of images(1) was not cleanly divisible by the number of realtime nodes(2) resulting in 1 that will be redistributed    world.py:483
DISTRIBUTED | DEBUG    There's 19.01s of slack time available for worker 'slave'                  world.py:526
DISTRIBUTED | DEBUG    worker 'slave': 40 complementary image(s) = 19.01s slack / 0.47s per requested image    world.py:531
DISTRIBUTED | INFO     Job distribution:                                                          world.py:555
                       1 * 1 iteration(s) + 40 complementary: 41 images total
                       'master' - 1 image(s) @ 3.47 ipm
                       'slave' - 40 image(s) @ 118.31 ipm
DISTRIBUTED | DEBUG    'slave' job's given starting seed is 143432277 with 1 coming before it distributed.py:344
DISTRIBUTED | DEBUG    Worker 'slave' 3.54/3.94 GB VRAM free                                      worker.py:318
DISTRIBUTED | DEBUG    worker 'slave' loaded weights in 0.01s                                     worker.py:675
DISTRIBUTED | DEBUG    worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm    worker.py:335
DISTRIBUTED | DEBUG    local script(s): [Hypertile], [Comments] seem to be unsupported by worker 'slave'    worker.py:391
DISTRIBUTED | DEBUG    had to substitute sampler index with name                                  worker.py:422
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.28it/s]
DISTRIBUTED | DEBUG    Took master 18.89s                                                    distributed.py:161
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165

Slave:

DISTRIBUTED | DEBUG    config loaded                                                              world.py:643
DISTRIBUTED | DEBUG    added job for worker master                                                world.py:401
DISTRIBUTED | DEBUG    added job for worker slave                                                 world.py:401
DISTRIBUTED | DEBUG    World initialized!                                                   distributed.py:237
DISTRIBUTED | DEBUG    The requested number of images(1) was not cleanly divisible by the number of realtime nodes(2) resulting in 1 that will be redistributed    world.py:483
DISTRIBUTED | DEBUG    There's 19.01s of slack time available for worker 'slave'                  world.py:526
DISTRIBUTED | DEBUG    worker 'slave': 40 complementary image(s) = 19.01s slack / 0.47s per requested image    world.py:531
DISTRIBUTED | INFO     Job distribution:                                                          world.py:555
                       1 * 1 iteration(s) + 40 complementary: 41 images total
                       'master' - 1 image(s) @ 3.47 ipm
                       'slave' - 40 image(s) @ 118.31 ipm
DISTRIBUTED | DEBUG    'slave' job's given starting seed is 143432277 with 1 coming before it distributed.py:344
DISTRIBUTED | DEBUG    Worker 'slave' 3.54/3.94 GB VRAM free                                      worker.py:318
DISTRIBUTED | DEBUG    worker 'slave' loaded weights in 0.01s                                     worker.py:675
DISTRIBUTED | DEBUG    worker 'slave' predicts it will take 18.772s to generate 40 image(s) at a speed of 118.31 ipm    worker.py:335
DISTRIBUTED | DEBUG    local script(s): [Hypertile], [Comments] seem to be unsupported by worker 'slave'    worker.py:391
DISTRIBUTED | DEBUG    had to substitute sampler index with name                                  worker.py:422
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.28it/s]
DISTRIBUTED | DEBUG    Took master 18.89s                                                    distributed.py:161
DISTRIBUTED | DEBUG    waiting for worker thread 'slave_request'                             distributed.py:165

Config:

{
   "workers": [
      {
         "master": {
            "avg_ipm": 3.468163269192995,
            "master": true,
            "address": "0.0.0.0",
            "port": 7860,
            "eta_percent_error": [],
            "tls": false,
            "state": 1,
            "user": null,
            "password": null,
            "pixel_cap": -1
         }
      },
      {
         "slave": {
            "avg_ipm": 118.3148549027504,
            "master": false,
            "address": "0.0.0.0",
            "port": 7861,
            "eta_percent_error": [],
            "tls": false,
            "state": 1,
            "user": "None",
            "password": "None",
            "pixel_cap": -1
         }
      }
   ],
   "benchmark_payload": {
      "prompt": "A herd of cows grazing at the bottom of a sunny valley",
      "negative_prompt": "",
      "steps": 20,
      "width": 512,
      "height": 512,
      "batch_size": 1
   },
   "job_timeout": 3,
   "enabled": true,
   "complement_production": true
}

papuSpartan commented on June 24, 2024

What happened to your speed ratings, by the way? Before, it showed both of your instances running at about 3 ipm, but now the slave is at around 120 ipm?

shootie22 commented on June 24, 2024

It's strange; it sometimes shows something normal like 3, but usually the slave is really high at about 120 ipm. I wish it could do 120 ipm haha.

papuSpartan commented on June 24, 2024

Can you let me know if this also happens consistently on 424a1c8?
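(For reference, assuming the extension was installed via git, checking out that commit would look roughly like this, using the extension folder path that appears in the log later in the thread; restart sdwui afterwards:

   cd /home/main/programs/stablediff/stable-diffusion-webui/extensions/stable-diffusion-webui-distributed
   git checkout 424a1c8
)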

shootie22 commented on June 24, 2024

Just tested, and on that commit I cannot generate anything with the extension enabled. :( It hangs like this.

master:

Launching Web UI with arguments: --listen --api --medvram --xformers --enable-insecure-extension-access --device-id=0 --port 7860 --distributed-debug
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | INFO     doing initial ping sweep to see which workers are reachable           distributed.py:51
DISTRIBUTED | DEBUG    checking if worker 'slave' is reachable...                                 world.py:693
DISTRIBUTED | INFO     worker 'slave' is online                                                   world.py:720
Distributed: worker 'slave' is online
Loading weights [139ac005d4] from /home/main/programs/stablediff/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV60B1_v51VAE.ckpt
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
Startup time: 17.8s (prepare environment: 2.9s, import torch: 7.7s, import gradio: 1.0s, setup paths: 2.0s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 1.1s, create ui: 1.0s, gradio launch: 0.2s, add APIs: 0.9s).
Creating model from config: /home/main/programs/stablediff/stable-diffusion-webui/configs/v1-inference.yaml
/home/main/programs/stablediff/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Applying attention optimization: Doggettx... done.
Model loaded in 8.4s (load weights from disk: 5.3s, create model: 0.6s, apply weights to model: 1.6s, apply half(): 0.7s, calculate empty prompt: 0.2s).
DISTRIBUTED | DEBUG    config loaded                                                              world.py:658
DISTRIBUTED | DEBUG    recorded speed for worker 'master' is invalid                              world.py:214
DISTRIBUTED | DEBUG    recorded speed for worker 'slave' is invalid                               world.py:214
DISTRIBUTED | DEBUG    worker 'slave' loaded weights in 0.01s                                    worker.py:686

slave:

Launching Web UI with arguments: --listen --api --medvram --xformers --enable-insecure-extension-access --device-id=1 --port 7861
DISTRIBUTED | INFO     doing initial ping sweep to see which workers are reachable           distributed.py:51
DISTRIBUTED | ERROR    HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x75b52fa81540>: Failed to establish a new connection: [Errno 111] Connection refused'))    worker.py:618
DISTRIBUTED | INFO     worker 'slave' is unreachable                                              world.py:725
Loading weights [139ac005d4] from /home/main/programs/stablediff/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV60B1_v51VAE.ckpt
Running on local URL:  http://0.0.0.0:7861

To create a public link, set `share=True` in `launch()`.
Startup time: 15.3s (prepare environment: 2.6s, import torch: 5.4s, import gradio: 1.1s, setup paths: 2.1s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 1.0s, create ui: 1.1s, gradio launch: 0.2s, add APIs: 0.9s).
Creating model from config: /home/main/programs/stablediff/stable-diffusion-webui/configs/v1-inference.yaml
/home/main/programs/stablediff/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Applying attention optimization: Doggettx... done.
Model loaded in 13.2s (load weights from disk: 5.0s, create model: 0.5s, apply weights to model: 5.5s, apply half(): 2.0s, calculate empty prompt: 0.2s).

WebUI extension log:

DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
DEBUG - config loaded
INFO - worker 'slave' is online
DEBUG - checking if worker 'slave' is reachable...
INFO - doing initial ping sweep to see which workers are reachable
DEBUG - config loaded

It seems that it doesn't send the command to the slave instance?
I deleted the previous extension version, started SD, shut it down, installed the new one.

papuSpartan commented on June 24, 2024

Does this happen with no other extensions enabled (built-in ones should be fine)? Also, can you post a list of the extensions you've been using?

shootie22 commented on June 24, 2024

This is the only non-builtin extension I'm using, and it works fine if I disable it.

papuSpartan commented on June 24, 2024

In that case:

  • what commit of sdwui are you on
  • what version is your python interpreter
  • can you post your distributed.log file from the extension folder

Were you reloading the config yourself multiple times in a row? At least initially it looks like your slave instance wasn't running yet as you were getting a connection refused error.

shootie22 commented on June 24, 2024

sdwui version: 1.9.3 (1c0a0c4)
python ver: 3.10.14
distributed.log:

2024-05-11 08:02:47,965 - ERROR - Config was not found at '/home/main/programs/stablediff/stable-diffusion-webui/extensions/stable-diffusion-webui-distributed/distributed-config.json'
2024-05-11 08:02:47,972 - INFO - Generated new config file at '/home/main/programs/stablediff/stable-diffusion-webui/extensions/stable-diffusion-webui-distributed/distributed-config.json'
2024-05-11 08:02:47,975 - ERROR - config is corrupt or invalid JSON, unable to load
2024-05-11 08:02:47,976 - DEBUG - cannot parse null config (present but empty config file?)
generating defaults for config
2024-05-11 08:02:47,979 - DEBUG - config saved
2024-05-11 08:02:47,981 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:02:48,115 - DEBUG - config loaded
2024-05-11 08:02:48,561 - DEBUG - config loaded
2024-05-11 08:02:49,560 - DEBUG - config loaded
2024-05-11 08:02:49,605 - DEBUG - config loaded
2024-05-11 08:02:49,754 - DEBUG - config loaded
2024-05-11 08:02:49,798 - DEBUG - config loaded
2024-05-11 08:02:50,067 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:03:12,589 - DEBUG - config saved
2024-05-11 08:03:24,649 - INFO - Redoing benchmarks...
2024-05-11 08:03:24,661 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:04:00,909 - DEBUG - master finished warming up

2024-05-11 08:04:18,209 - INFO - Sample 1: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute

2024-05-11 08:04:35,482 - INFO - Sample 2: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute

2024-05-11 08:04:52,793 - INFO - Sample 3: Worker 'master'(0.0.0.0:7860) - 3.47 image(s) per minute

2024-05-11 08:04:52,796 - DEBUG - Worker 'master' average ipm: 3.47
2024-05-11 08:04:52,798 - INFO - benchmarking worker 'master'
2024-05-11 08:04:52,800 - INFO - benchmarking worker 'slave'
2024-05-11 08:04:52,897 - DEBUG - Worker 'slave' 3.76/3.94 GB VRAM free

2024-05-11 08:06:14,143 - DEBUG - config loaded
2024-05-11 08:06:14,145 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:06:14,147 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:06:14,156 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:06:42,980 - DEBUG - handling interrupt signal
2024-05-11 08:06:42,984 - DEBUG - config saved
2024-05-11 08:06:57,760 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:06:57,769 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7a920dc857b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:06:57,772 - INFO - worker 'slave' is unreachable
2024-05-11 08:06:59,163 - DEBUG - config loaded
2024-05-11 08:06:59,170 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:06:59,172 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:06:59,210 - INFO - worker 'slave' is unreachable
2024-05-11 08:06:59,343 - DEBUG - config loaded
2024-05-11 08:06:59,784 - DEBUG - config loaded
2024-05-11 08:07:00,814 - DEBUG - config loaded
2024-05-11 08:07:00,862 - DEBUG - config loaded
2024-05-11 08:07:01,025 - DEBUG - config loaded
2024-05-11 08:07:01,073 - DEBUG - config loaded
2024-05-11 08:07:18,009 - DEBUG - config loaded
2024-05-11 08:07:18,011 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:07:18,013 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:07:18,023 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:07:47,572 - WARNING - config reports invalid speed (0 ipm) for worker 'master'
please re-benchmark
2024-05-11 08:07:47,575 - WARNING - config reports invalid speed (0 ipm) for worker 'slave'
please re-benchmark
2024-05-11 08:07:51,197 - INFO - Redoing benchmarks...
2024-05-11 08:07:51,207 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:08:39,355 - DEBUG - handling interrupt signal
2024-05-11 08:08:39,359 - DEBUG - config saved
2024-05-11 08:08:54,534 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:08:54,543 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7364e6e7d750>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:08:54,546 - INFO - worker 'slave' is unreachable
2024-05-11 08:08:55,413 - DEBUG - config loaded
2024-05-11 08:08:55,420 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:08:55,422 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:08:55,426 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b078428d4b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:08:55,429 - INFO - worker 'slave' is unreachable
2024-05-11 08:08:55,564 - DEBUG - config loaded
2024-05-11 08:08:56,008 - DEBUG - config loaded
2024-05-11 08:08:56,965 - DEBUG - config loaded
2024-05-11 08:08:57,010 - DEBUG - config loaded
2024-05-11 08:08:57,159 - DEBUG - config loaded
2024-05-11 08:08:57,204 - DEBUG - config loaded
2024-05-11 08:09:58,364 - DEBUG - handling interrupt signal
2024-05-11 08:09:58,369 - DEBUG - config saved
2024-05-11 08:10:15,019 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:10:15,028 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x75b52fa81540>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 08:10:15,031 - INFO - worker 'slave' is unreachable
2024-05-11 08:10:34,280 - DEBUG - config loaded
2024-05-11 08:10:34,287 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 08:10:34,289 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 08:10:34,414 - INFO - worker 'slave' is online
2024-05-11 08:10:34,550 - DEBUG - config loaded
2024-05-11 08:10:34,980 - DEBUG - config loaded
2024-05-11 08:10:35,970 - DEBUG - config loaded
2024-05-11 08:10:36,017 - DEBUG - config loaded
2024-05-11 08:10:36,163 - DEBUG - config loaded
2024-05-11 08:10:36,206 - DEBUG - config loaded
2024-05-11 08:11:05,151 - DEBUG - config loaded
2024-05-11 08:11:05,154 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 08:11:05,156 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 08:11:05,166 - DEBUG - worker 'slave' loaded weights in 0.01s
2024-05-11 08:22:31,026 - DEBUG - handling interrupt signal
2024-05-11 08:22:31,029 - DEBUG - config saved
2024-05-11 12:04:14,157 - DEBUG - config loaded
2024-05-11 12:04:14,163 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 12:04:14,165 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 12:04:14,169 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x72c67007d270>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 12:04:14,172 - INFO - worker 'slave' is unreachable
2024-05-11 12:04:14,300 - DEBUG - config loaded
2024-05-11 12:04:14,736 - DEBUG - config loaded
2024-05-11 12:04:15,672 - DEBUG - config loaded
2024-05-11 12:04:15,714 - DEBUG - config loaded
2024-05-11 12:04:15,862 - DEBUG - config loaded
2024-05-11 12:04:15,906 - DEBUG - config loaded
2024-05-11 12:04:39,376 - DEBUG - config loaded
2024-05-11 12:04:39,378 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 12:04:39,380 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 13:00:21,789 - DEBUG - config loaded
2024-05-11 13:00:21,792 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 13:00:21,794 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 13:02:15,027 - DEBUG - config loaded
2024-05-11 13:02:15,029 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 13:02:15,031 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:54:30,599 - DEBUG - config loaded
2024-05-11 19:54:30,602 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:54:30,604 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:56:46,931 - DEBUG - config loaded
2024-05-11 19:56:46,935 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:56:46,937 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 19:58:26,201 - DEBUG - config loaded
2024-05-11 19:58:26,204 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 19:58:26,206 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:01:18,842 - DEBUG - config loaded
2024-05-11 20:01:18,845 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:01:18,847 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:03:07,938 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:03:07,946 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ef589479570>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:03:07,950 - INFO - worker 'slave' is unreachable
2024-05-11 20:05:22,187 - DEBUG - config loaded
2024-05-11 20:05:22,193 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:05:22,196 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:05:22,307 - DEBUG - worker 'slave' loaded weights in 0.11s
2024-05-11 20:05:52,491 - DEBUG - handling interrupt signal
2024-05-11 20:05:52,740 - DEBUG - config saved
2024-05-11 20:06:09,005 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:09,014 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa44527d450>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:06:09,017 - INFO - worker 'slave' is unreachable
2024-05-11 20:06:09,909 - DEBUG - config loaded
2024-05-11 20:06:09,916 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:09,918 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 20:06:09,922 - ERROR - HTTPConnectionPool(host='0.0.0.0', port=7861): Max retries exceeded with url: /sdapi/v1/memory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x715c71e79330>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-05-11 20:06:09,925 - INFO - worker 'slave' is unreachable
2024-05-11 20:06:10,059 - DEBUG - config loaded
2024-05-11 20:06:10,507 - DEBUG - config loaded
2024-05-11 20:06:11,448 - DEBUG - config loaded
2024-05-11 20:06:11,495 - DEBUG - config loaded
2024-05-11 20:06:11,643 - DEBUG - config loaded
2024-05-11 20:06:11,686 - DEBUG - config loaded
2024-05-11 20:06:34,215 - DEBUG - handling interrupt signal
2024-05-11 20:06:34,219 - DEBUG - config saved
2024-05-11 20:06:48,342 - DEBUG - config loaded
2024-05-11 20:06:48,349 - INFO - doing initial ping sweep to see which workers are reachable
2024-05-11 20:06:48,351 - DEBUG - checking if worker 'slave' is reachable...
2024-05-11 20:06:48,478 - INFO - worker 'slave' is online
2024-05-11 20:06:48,623 - DEBUG - config loaded
2024-05-11 20:06:49,052 - DEBUG - config loaded
2024-05-11 20:06:49,988 - DEBUG - config loaded
2024-05-11 20:06:50,036 - DEBUG - config loaded
2024-05-11 20:06:50,185 - DEBUG - config loaded
2024-05-11 20:06:50,227 - DEBUG - config loaded
2024-05-11 20:08:55,680 - DEBUG - config loaded
2024-05-11 20:08:55,683 - DEBUG - recorded speed for worker 'master' is invalid
2024-05-11 20:08:55,685 - DEBUG - recorded speed for worker 'slave' is invalid
2024-05-11 20:08:55,695 - DEBUG - worker 'slave' loaded weights in 0.01s

Log of both instances starting, and after I tried generating an image. They just hang like this on the latest commit:
[screenshot of both console windows]

on the left is the master, on the right is the slave.

Were you reloading the config yourself multiple times in a row? At least initially it looks like your slave instance wasn't running yet as you were getting a connection refused error.

I always restart the instances by stopping and starting them. I believe the connection refused error is due to the slave trying to ping itself (after fetching the worker config) before it has started. Before trying to generate, I always check that the status is IDLE on the slave from the webui.

papuSpartan commented on June 24, 2024

Remove the extension from the slave worker; you only need it installed on the main instance. If you're using the same installation root for more than one instance, you'll probably need to use sdwui's command-line options to force the slave instance to use a separate config that has the extension disabled. You can see that the slave instance is trying to connect back to itself as a worker (since the port is the same), and this shouldn't happen.
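One hedged way to do that, assuming a recent enough sdwui where --data-dir relocates user data (config and extensions included), is to launch the slave with its own data directory so the distributed extension only ends up installed on the master side; the path below is just an example:

   ./webui.sh --listen --api --device-id=1 --port 7861 --data-dir /path/to/slave-data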

shootie22 commented on June 24, 2024

Thank you - I was able to get it working by adding --disable-all-extensions on the slave instance, which disables all extra extensions apart from the built-in ones.

But I am now facing an issue where my slave instance seemingly runs out of VRAM when generating through the extension. Low sampling step counts seem to work, but anything higher than ~10-15 causes it to run out of VRAM.

It's strange because it works just fine if I generate through its own web-ui, at any number of sampling steps and even at higher resolutions. Do you think this could be an issue with how the extension spreads the workload?

papuSpartan commented on June 24, 2024

If there are still issues, let me know and I can reopen the issue.
