
Stable Diffusion

Stable Diffusion was made possible thanks to a collaboration with Stability AI and Runway and builds upon our previous work:

High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
CVPR '22 Oral | GitHub | arXiv | Project page

Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, we were able to train a Latent Diffusion Model on 512x512 images from a subset of the LAION-5B database. Similar to Google's Imagen, this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM. See this section below and the model card.
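A quick way to check that your GPU meets this guideline (a minimal sketch, assuming a CUDA build of PyTorch):

import torch

# report the name and total VRAM of the first visible GPU
assert torch.cuda.is_available(), "no CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB VRAM")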

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

You can also update an existing latent diffusion environment by running

conda install pytorch torchvision -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
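To confirm the environment is usable, a minimal sanity check (the transformers pin follows the pip command above; everything else is generic):

import torch
import transformers

# CUDA must be available for the default sampling scripts
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected: 4.19.2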

Stable Diffusion v1

Stable Diffusion v1 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 860M UNet and CLIP ViT-L/14 text encoder for the diffusion model. The model was pretrained on 256x256 images and then finetuned on 512x512 images.

Note: Stable Diffusion v1 is a general text-to-image diffusion model and therefore mirrors biases and (mis-)conceptions that are present in its training data. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card.

The weights are available via the CompVis organization at Hugging Face under a license which contains specific use-based restrictions to prevent misuse and harm as informed by the model card, but otherwise remains permissive. While commercial use is permitted under the terms of the license, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations, since there are known limitations and biases of the weights, and research on safe and ethical deployment of general text-to-image models is an ongoing effort. The weights are research artifacts and should be treated as such.

The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying out in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based.

Weights

We currently provide the following checkpoints:

  • sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
  • sd-v1-2.ckpt: Resumed from sd-v1-1.ckpt. 515k steps at resolution 512x512 on laion-aesthetics v2 5+ (a subset of laion2B-en with estimated aesthetics score > 5.0, and additionally filtered to images with an original size >= 512x512, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using the LAION-Aesthetics Predictor V2).
  • sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
  • sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints (evaluation figure omitted).

Text-to-Image with Stable Diffusion


Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. We provide a reference script for sampling, but there also exists a diffusers integration, around which we expect to see more active community development.

Reference Sampling Script

We provide a reference sampling script, which incorporates an invisible watermarking of the outputs (to help viewers identify the images as machine-generated) and a safety checker module (to reduce the probability of explicit outputs).

After obtaining the stable-diffusion-v1-*-original weights, link them

mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt 

and sample with

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 

By default, this uses a guidance scale of --scale 7.5, Katherine Crowson's implementation of the PLMS sampler, and renders images of size 512x512 (which it was trained on) in 50 steps. All supported arguments are listed below (type python scripts/txt2img.py --help).

usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
                  [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
                  [--seed SEED] [--precision {full,autocast}]

optional arguments:
  -h, --help            show this help message and exit
  --prompt [PROMPT]     the prompt to render
  --outdir [OUTDIR]     dir to write results to
  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
  --skip_save           do not save individual samples. For speed measurements.
  --ddim_steps DDIM_STEPS
                        number of ddim sampling steps
  --plms                use plms sampling
  --laion400m           uses the LAION400M model
  --fixed_code          if enabled, uses the same starting code across samples
  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling)
  --n_iter N_ITER       sample this often
  --H H                 image height, in pixel space
  --W W                 image width, in pixel space
  --C C                 latent channels
  --f F                 downsampling factor
  --n_samples N_SAMPLES
                        how many samples to produce for each given prompt. A.k.a. batch size
  --n_rows N_ROWS       rows in the grid (default: n_samples)
  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
  --from-file FROM_FILE
                        if specified, load prompts from this file
  --config CONFIG       path to config which constructs model
  --ckpt CKPT           path to checkpoint of model
  --seed SEED           the seed (for reproducible sampling)
  --precision {full,autocast}
                        evaluate at this precision
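
For reference, the --scale formula above combines the unconditional and conditional noise predictions as follows (a sketch; tensor names are illustrative, not the script's actual variables):

import torch

def guided_eps(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float) -> torch.Tensor:
    # eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty));
    # at scale = 1.0 this reduces to the purely conditional prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)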

Note: The inference config for all v1 versions is designed to be used with EMA-only checkpoints. For this reason use_ema=False is set in the configuration, otherwise the code will try to switch from non-EMA to EMA weights. If you want to examine the effect of EMA vs no EMA, we provide "full" checkpoints which contain both types of weights. For these, use_ema=False will load and use the non-EMA weights.
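For orientation, checkpoint loading in the reference scripts boils down to the following (a simplified sketch of load_model_from_config from scripts/txt2img.py; the config path is an assumption):

import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

def load_model_from_config(config, ckpt):
    pl_sd = torch.load(ckpt, map_location="cpu")   # full Lightning checkpoint dict
    sd = pl_sd["state_dict"]                       # raw model weights
    model = instantiate_from_config(config.model)  # builds LatentDiffusion per the config
    model.load_state_dict(sd, strict=False)        # EMA-only ckpts require use_ema=False
    return model.cuda().eval()

config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")  # assumed path
model = load_model_from_config(config, "models/ldm/stable-diffusion-v1/model.ckpt")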

Diffusers Integration

A simple way to download and sample Stable Diffusion is by using the diffusers library:

# make sure you're logged in with `huggingface-cli login`
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]

image.save("astronaut_rides_horse.png")
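
If you are limited by GPU memory, a half-precision variant can help (a sketch using the fp16 branch of the Hugging Face weights; API as of the diffusers versions contemporary with this release, newer versions may differ):

import torch
from diffusers import StableDiffusionPipeline

# load the fp16 weights and keep the whole pipeline in half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
).to("cuda")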

Image Modification with Stable Diffusion

By using a diffusion-denoising mechanism as first proposed by SDEdit, the model can be used for different tasks such as text-guided image-to-image translation and upscaling. Similar to the txt2img sampling script, we provide a script to perform image modification with Stable Diffusion.

The following describes an example where a rough sketch made in Pinta is converted into a detailed artwork.

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8

Here, strength is a value between 0.0 and 1.0 that controls the amount of noise added to the input image. Values approaching 1.0 allow for many variations but will also produce images that are not semantically consistent with the input; see the example below.
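
Concretely, strength determines how many of the sampler's steps the init image is noised through before being denoised back (a sketch; variable names follow scripts/img2img.py but are simplified):

ddim_steps = 50
strength = 0.8
t_enc = int(strength * ddim_steps)  # noise the init image up to step t_enc,
print(f"encoding init image to step {t_enc} of {ddim_steps}")  # then denoise back down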

(Example input sketch and output images omitted.)

This procedure can, for example, also be used to upscale samples from the base model.


BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contributors

apolinario, cpacker, luchengthu, owenvincent, patrickvonplaten, pesser, rromb


Issues

Pretrained checkpoint

Hi, first of all, thank you for your invaluable work and sharing.
A week ago, I requested the pretrained model following your descriptions.
However, I only got an email (screenshot omitted), and I have not received anything about the checkpoint.

So, my questions are:
Did you stop publicly sharing the pretrained model?
Or, if not, how much longer should I wait?

Thank you.


OOM with Tesla T4 and small images

Hi there,

I'm running into an OOM error using a Tesla T4 on GCP Ubuntu 18.04 with the v1.4 checkpoint. I see Global seed set to 42 and Loading model from sd-v1-4.ckpt followed by Killed, even when I tried making the images 128x128, then 32x32, and finally 2x2.

Any tips for figuring out what the issue is would be greatly appreciated!

Running examples after linking `sd-v1-3.ckpt` causes error

scripts/txt2img.py finds the ckpt but throws an error while loading it:

Global seed set to 42
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Traceback (most recent call last):
  File "scripts/txt2img.py", line 279, in <module>
    main()
  File "scripts/txt2img.py", line 188, in main
    model = load_model_from_config(config, f"{opt.ckpt}")
  File "scripts/txt2img.py", line 27, in load_model_from_config
    pl_sd = torch.load(ckpt, map_location="cpu")
  File "/home/asathuluri/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/asathuluri/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Is there a compatibility issue between this branch and the checkpoint used?

txt2img.py always get killed

(ldm) hugo@DESKTOP:/mnt/d/stable-diffusion$ python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms --ckpt models/ldm/text2img256/model.ckpt
Global seed set to 42
Loading model from models/ldm/text2img256/model.ckpt
Global Step: 947666
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Killed

The kill comes from txt2img.py, line 34: model = instantiate_from_config(config.model).
Full trace (with some debug prints):

Global seed set to 42
<<CONFIG{Not changed}>>
{'model': {'base_learning_rate': 0.0001, 'target': 'ldm.models.diffusion.ddpm.LatentDiffusion', 'params': {'linear_start': 0.00085, 'linear_end': 0.012, 'num_timesteps_cond': 1, 'log_every_t': 200, 'timesteps': 1000, 'first_stage_key': 'jpg', 'cond_stage_key': 'txt', 'image_size': 64, 'channels': 4, 'cond_stage_trainable': False, 'conditioning_key': 'crossattn', 'monitor': 'val/loss_simple_ema', 'scale_factor': 0.18215, 'use_ema': False, 'scheduler_config': {'target': 'ldm.lr_scheduler.LambdaLinearScheduler', 'params': {'warm_up_steps': [10000], 'cycle_lengths': [10000000000000], 'f_start': [1e-06], 'f_max': [1.0], 'f_min': [1.0]}}, 'unet_config': {'target': 'ldm.modules.diffusionmodules.openaimodel.UNetModel', 'params': {'image_size': 32, 'in_channels': 4, 'out_channels': 4, 'model_channels': 320, 'attention_resolutions': [4, 2, 1], 'num_res_blocks': 2, 'channel_mult': [1, 2, 4, 4], 'num_heads': 8, 'use_spatial_transformer': True, 'transformer_depth': 1, 'context_dim': 768, 'use_checkpoint': True, 'legacy': False}}, 'first_stage_config': {'target': 'ldm.models.autoencoder.AutoencoderKL', 'params': {'embed_dim': 4, 'monitor': 'val/rec_loss', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}, 'lossconfig': {'target': 'torch.nn.Identity'}}}, 'cond_stage_config': {'target': 'ldm.modules.encoders.modules.FrozenCLIPEmbedder'}}}}
Loading model from models/ldm/text2img256/model.ckpt
Global Step: 947666
<class 'ldm.models.diffusion.ddpm.LatentDiffusion'>
LatentDiffusion: Running in eps-prediction mode
Hit: get_obj_from_str
<class 'ldm.modules.diffusionmodules.openaimodel.UNetModel'>
DiffusionWrapper has 859.52 M params.
Hit: get_obj_from_str
<class 'ldm.models.autoencoder.AutoencoderKL'>
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Hit: get_obj_from_str
<class 'torch.nn.modules.linear.Identity'>
Hit: get_obj_from_str
<class 'ldm.modules.encoders.modules.FrozenCLIPEmbedder'>
Killed

Does anyone know why it keeps getting killed?
I downloaded all the models, set up the conda env, etc.

Install missing and downgrade certain python packages

  • Running the following command python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms gives the following error on Ubuntu 20.04:
  File "scripts/txt2img.py", line 11, in <module>
    from pytorch_lightning import seed_everything
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics, void
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
    from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/home/thunder/anaconda3/envs/ldm/lib/python3.8/site-packages/torchmetrics/utilities/data.py)

I was able to solve it by downgrading torchmetrics to version 0.6.0: pip install torchmetrics==0.6.0

  • Running the command again, I got this error:
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 440000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Traceback (most recent call last):
  File "scripts/txt2img.py", line 279, in <module>
    main()
  File "scripts/txt2img.py", line 188, in main
    model = load_model_from_config(config, f"{opt.ckpt}")
  File "scripts/txt2img.py", line 31, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "/home/thunder/stable-diffusion/ldm/util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/thunder/stable-diffusion/ldm/models/diffusion/ddpm.py", line 461, in __init__
    self.instantiate_cond_stage(cond_stage_config)
  File "/home/thunder/stable-diffusion/ldm/models/diffusion/ddpm.py", line 519, in instantiate_cond_stage
    model = instantiate_from_config(config)
  File "/home/thunder/stable-diffusion/ldm/util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/thunder/stable-diffusion/ldm/util.py", line 93, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/home/thunder/anaconda3/envs/ldm/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/thunder/stable-diffusion/ldm/modules/encoders/modules.py", line 7, in <module>
    import kornia
ModuleNotFoundError: No module named 'kornia'

This can be solved by simply installing kornia: pip install kornia

T5 instead of CLIP

Dear stable-diffusion team,

Thank you for sharing this great work. I really like it.

Have you considered using a pretrained T5 encoder instead of the pretrained CLIP encoder? According to the Imagen paper, T5-XXL is better than CLIP:

'''
We also find that while T5-XXL and CLIP text encoders perform similarly on simple benchmarks such as MS-COCO, human evaluators prefer T5-XXL encoders over CLIP text encoders in both image-text alignment and image fidelity on DrawBench, a set of challenging and compositional prompts.
'''

Thank you for your help.

Best Wishes,

Zongze

Question regarding random seeds

I've been writing an interactive version of txt2img.py that mimics the behaviour of the Discord bot. What I can't figure out is how to retrieve the individual random seeds for multiple sampled images. Does anyone know how this is done?

I notice that there are two ways of generating multiple images in the original txt2img.py: either using --n_iter or --n_samples. The latter is faster, but with the former I can set a new random seed between each image, which I think allows me to reproduce the behaviour of the bot. Can someone advise on what the difference between these options is?
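
One possible approach (a sketch, not the bot's actual mechanism): re-seed before each iteration and record the seed alongside each output, so any single image can be regenerated later.

from pytorch_lightning import seed_everything  # the same helper txt2img.py uses

base_seed, n_iter = 42, 4
for i in range(n_iter):
    seed_everything(base_seed + i)  # save base_seed + i next to the resulting image
    # ... run one sampling batch here; re-running with this seed reproduces it ...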

module has no attribute 'FrozenCLIPEmbedder'

Hello, thanks for the repo, and I apologize in advance if it's a noob question. When I run python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms the code seems to run fine, the model is loaded, etc., then I get:
Traceback (most recent call last):
[...]
AttributeError: module 'ldm.modules.encoders.modules' has no attribute 'FrozenCLIPEmbedder'

Does anybody know how to solve it?
(I'm on Windows and I'm working with the latest release of Stable Diffusion, with the model sd-v1-4.ckpt, if it might help)

Thanks in advance

SPotaufeux

Tutorial

Hi, I hope you are doing well.
How can this be used for artwork samples? Is there a video tutorial?

RuntimeError when the number of prompts from a file causes a partial sample batch

Using the --from-file option and supplying a list of prompts that has a length that is not a multiple of n_samples causes the following error:

RuntimeError: einsum(): operands do not broadcast with remapped shapes [original->remapped]: [48, 4096, 40]->[48, 4096, 1, 40] [40, 77, 40]->[40, 1, 77, 40]

I'm on Windows 10, though I doubt it matters for this one.

Fine tune

Hello everyone! Thanks for your great project and your work! I have a little question: do you have any guides on fine-tuning Stable Diffusion?

Is it possible to use the cpu for sampling?

I wanted to try sampling on the CPU but can't get it to work. Are there any solutions to this problem?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

k_heun, k_lms and other samplers

Hi! Thanks for the release.
Are there plans to release samplers like k_heun and k_lms in this repo?
k_lms is the default sampler on the Discord server, and it seems to produce better results across wider ranges of cfg scale.

Difference between v1-3 and v1-4

Dear stable diffusion team,

Thank you for sharing this great work, I really enjoy it.

Could you tell me what the difference is between checkpoints v1-3 and v1-4? If I understand correctly, both of them are initialized from v1-2 but trained for a different number of steps. If so, why not initialize v1-4 from v1-3?
'''
sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
'''

Thank you for your help.

Best Wishes,

Zongze

GUI / notebook for generating images

A graphical interface for generating images (both specifying inputs and visualizing/saving outputs) would make it a lot easier to quickly iterate over prompts and other settings.

For anyone wanting something along these lines right now, I made a basic notebook version of txt2img.py that exposes the args and their defaults (and keeps the model loaded indefinitely):
https://colab.research.google.com/github/cpacker/stable-diffusion/blob/interactive-notebook/scripts/stable_diffusion_interactive_colab.ipynb

Basically you just change settings in a GUI (e.g., edit the prompt) and re-run a cell.

Given the number of people interested in using the repo, IMO it would be nice to have "official" notebook versions of the main scripts (both img2img.py and txt2img.py) that expose all the options in a similar way. I'm not sure the notebook I linked is the best/cleanest way to do it, though (e.g., in its current construction, if anyone changes the args in txt2img.py the notebook will go out of sync).

Set n_samples to 1 by default

The defaults of n_samples=3 and n_iter=2 cause an out-of-memory error on an RTX 2080/3060, which have 11/12GB VRAM respectively; this is completely unnecessary because the model fits fine with a batch size of 1.

This seems to be a constant question everywhere, and people are just running the model with defaults that have unnecessarily high memory consumption. The following command OOMs on a 12GB RTX 3060:

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 

but this works fine

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms --n_samples=1

Image callback

Do you have any pointers on how to set up the image callback in txt2img.py? That would be much appreciated.
And thank you for all of this; it's mind-blowing to have these tools.

img2img

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 15 for tensor number 1 in the list.

Please help me solve this

When I run the txt2img script following the readme, I find this error:
libssl.so.3: cannot open shared object file: No such file or directory
What should I do to solve this?

Access denied due to invalid subscription key or wrong API endpoint

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
eprint(line:60) :: Error when calling Cognitive Face API:
status_code: 401
code: 401
message: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.
Why does this arise?

Text encoder

Thanks for the fantastic work!
I was wondering if there is a reason you've replaced the BERT encoder (that was used in latent diffusion) with a CLIP encoder?

Cheers,
Eliahu

Unconditional image synthesis pre-trained models?

Hello folks!

Thanks for the very interesting work. Do you plan to release pre-trained models for the unconditional image synthesis tasks (e.g. on the FFHQ dataset) in Section 4.2 of the paper?

Thanks!

James.

License Clarification

This looks like an interesting project; however, my limited understanding of your licensing appears to make the code unusable for 3rd parties.

Perhaps it's just incomplete? :)

Any chance you could clarify more around this so the community can better understand your intent and what constitutes acceptable use of the https://github.com/CompVis/stable-diffusion/ repository?


Also, would it be possible to clarify the relationship between CompVis/stable-diffusion and https://github.com/pesser/stable-diffusion (MIT license)?

Best,
Ryan

Warning that some CLIP weights were not used

Just checking to see if this is expected behavior.

After setting up the model and running python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms I get this warning:

Global seed set to 42
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 440000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Downloading: 100%|███████████████████████████████████████████████████████████████████| 939k/939k [00:01<00:00, 738kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████| 512k/512k [00:00<00:00, 736kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████| 389/389 [00:00<00:00, 396kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████| 905/905 [00:00<00:00, 447kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████| 4.31k/4.31k [00:00<00:00, 2.16MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████| 1.59G/1.59G [06:18<00:00, 4.52MB/s]
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.23.mlp.fc2.bias', 'vision_model.encoder.layers.19.self_attn.v_proj.bias', 'vision_model.encoder.layers.18.layer_norm2.weight', ... (several hundred more vision_model.* and projection weight names omitted) ...]
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

It's just a warning, but I thought I should check.

This is on Windows 10, 64-bit.

CUDA out of memory

I'm receiving the following error but am unsure how to proceed. This is the output even with --n_samples 1!

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 8.00 GiB total capacity; 6.13 GiB already allocated; 0 bytes free; 6.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

AMD GPU not supported?

When running SD I get runtime errors saying that no NVIDIA GPU or driver is installed on the system. Is there a workaround for AMD owners, or is it unsupported?

Multi-GPU sampling with DataParallel

Have you tried wrapping the model in DataParallel to accelerate sampling? I'm not getting good samples after adding DataParallel inside of DiffusionWrapper, here https://github.com/CompVis/stable-diffusion/blob/main/ldm/models/diffusion/ddpm.py#L1398:

class DiffusionWrapper(pl.LightningModule):
    def __init__(self, diff_model_config, conditioning_key):
        super().__init__()
        # build the UNet, cast it to fp16, then wrap it in DataParallel
        model = instantiate_from_config(diff_model_config).half()
        self.diffusion_model = nn.DataParallel(model).half()
        self.conditioning_key = conditioning_key
        assert self.conditioning_key in [None, 'concat', 'crossattn', 'hybrid', 'adm']

    def forward(self, x, t, c_concat: list = None, c_crossattn: list = None):
        # cast inputs to fp16 to match the wrapped model
        x_half = x.half()
        t_half = t.half()
        ...
        elif self.conditioning_key == 'crossattn':
            # concatenate the cross-attention conditionings and run the wrapped UNet
            cc = torch.cat(c_crossattn, 1).half()
            out = self.diffusion_model(x_half, t_half, context=cc)
        ...
        return out

Sampling in fp16 doesn't seem to be the problem.

Samples without DataParallel, in half-precision: (image omitted)

Samples with DataParallel, in half-precision: (image omitted)

Both are run on a single GPU. Thanks!

--n_samples 1 returns two samples

txt2img.py --n_samples 1 returns two samples...

Just realized that the number of samples actually returned is double whatever the argument is.

Help solve the error

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
python: can't open file 'scripts/txt2img.py': [Errno 2] No such file or directory
I seem to be doing everything right, but this error comes up anyway.

Checkpoint trained on only 256x256 data?

The README says the v1.1 checkpoint was trained on 256x256 images and then finetuned on 512x512 images. Is there any way we can access this 256x256 model as a 1.0 checkpoint? There are various purposes for which a lower-resolution model would be more useful. For example, if I want to denoise ImageNet images, the 256x256 model better matches the size of ImageNet and so might perform better than the 512x512 model.

Inpainting model

Has a new inpainting model been released for researchers, or is the original latent diffusion model still the most recent release?

Where can I get the ckpt models?

"We currently provide three checkpoints, sd-v1-1.ckpt, sd-v1-2.ckpt and sd-v1-3.ckpt, which were trained as follows,"

But there is no info on where to get them.

Older diffusion models won't work with current txt2img script

Older diffusion models won't work anymore, for example using models/ldm/text2img256/config.yaml and its corresponding checkpoint.
Using the text2img256 model checkpoint, the command line was: python3 scripts/txt2img.py --ckpt="model.ckpt" --config="models/ldm/text2img256/config.yaml"

Leading to a tensor mismatch:

  File "/home/simon/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 3, 3, 3], expected input[6, 4, 64, 64] to have 3 channels, but got 4 channels instead

Probably an extra batch-size dimension?

How strict is the 10GB VRAM requirement?

I have 6GB VRAM, and here is what I got running txt2img:

RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 5.80 GiB total capacity; 4.12 GiB already allocated; 682.94 MiB free; 4.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

with only the options --prompt "tower" and --plms.

Maybe I can reduce quality or something?

Keep style in img2img

Is there a way to keep the style of the input image and only change the structure or content? Something like the example below (image omitted; caption: "keep style").

I know I could use a prompt like "a city between mountains, ms paint style", but sometimes it still generates a style that doesn't look like the style of the input image, or the input image has a style that I don't know, so it would be better to change the image but not the style.

Please default to CPU-only PyTorch if no suitable GPU is detected.

PyTorch is defaulting to NVIDIA GPU, but it would be good to fall back to CPU-only if no suitable GPU is found.

Something like this perhaps? ModeratePrawn#1

This would provide a better out-of-the-box setup experience for users with unsupported GPUs (AMD) until those versions are implemented, and for users like me who have no GPU but want to run locally.

Faster diffusion process evolution

Hi there,

Apologies if this is not the correct place to post this. I came across this paper and wanted to flag it up as a potential improvement on stable diffusion: https://nv-tlabs.github.io/CLD-SGM/
I am unsure whether the current optimisation tricks employed to speed up inference also apply to this new diffusion process, but thought it may be worthwhile to consider.

Thank you for all the work in bringing this project to life!

how to train a new model on a custom dataset

Dear Stable Diffusion Team,

Thanks for sharing the awesome work!

Would it be possible to provide some guidelines on training a new model on a custom dataset? E.g., how to prepare the dataset, how to start the training, how to set important hyper-parameters...

Ran the first part of the code and got this error

Anyone able to provide some input on how to fix?

Installing pip dependencies: / Ran pip subprocess with arguments:
['E:\AI\Conda\envs\ldm\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\Users\KageStudio\condaenv.pv04hvci.requirements.txt']
Pip subprocess output:

Pip subprocess error:
ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: C:\Users\KageStudio

failed

CondaEnvException: Pip failed

>> : The term '>>' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:2 char:1
+ conda activate ldm
+ ~~
    + CategoryInfo          : ObjectNotFound: (>>:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Dimensions

Am I missing it?
Is it possible to use command-line options to increase the width and height?
