stable-diffusion-pytorch's Introduction

stable-diffusion-pytorch

Open in Colab

Yet another PyTorch implementation of Stable Diffusion.

I tried my best to make the codebase minimal, self-contained, consistent, hackable, and easy to read. Features are pruned if not needed in Stable Diffusion (e.g., the attention mask in the CLIP tokenizer/encoder). Configs are hard-coded (based on Stable Diffusion v1.x). Loops are unrolled when that shape makes more sense.

Despite my efforts, I feel like I cooked another plate of spaghetti. Well, help yourself!

I referred heavily to the following repositories. Big kudos to them!

Dependencies

  • PyTorch
  • Numpy
  • Pillow
  • regex
  • tqdm

How to Install

  1. Clone or download this repository.
  2. Install dependencies: run pip install torch numpy Pillow regex tqdm or pip install -r requirements.txt.
  3. Download data.v20221029.tar from here and unpack it into the parent folder of stable_diffusion_pytorch. Your folders should look like this:
stable-diffusion-pytorch(-main)/
├─ data/
│  ├─ ckpt/
│  └─ ...
└─ stable_diffusion_pytorch/
   ├─ samplers/
   └─ ...

Note that the checkpoint files included in the data archive are distributed under a different license -- you should agree to that license before using them.

How to Use

Import stable_diffusion_pytorch as a submodule.

Here are some example scripts. You can also read the docstring of stable_diffusion_pytorch.pipeline.generate.
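
For instance, you can print that docstring from a Python session (plain built-in help(); nothing repo-specific assumed):

from stable_diffusion_pytorch import pipeline

# Print the full parameter documentation of pipeline.generate.
help(pipeline.generate)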

Text-to-image generation:

from stable_diffusion_pytorch import pipeline

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts)
images[0].save('output.jpg')

...with multiple prompts:

prompts = [
    "a photograph of an astronaut riding a horse",
    ""]
images = pipeline.generate(prompts)

...with unconditional (negative) prompts:

prompts = ["a photograph of an astronaut riding a horse"]
uncond_prompts = ["low quality"]
images = pipeline.generate(prompts, uncond_prompts)

...with seed:

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, uncond_prompts, seed=42)

Preload models (you will need enough VRAM):

from stable_diffusion_pytorch import model_loader
models = model_loader.preload_models('cuda')

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, models=models)

If you get an out-of-memory error with the code above but have enough RAM (not VRAM), you can keep the models in CPU memory and move each one to the GPU only while it is needed:

from stable_diffusion_pytorch import model_loader
models = model_loader.preload_models('cpu')

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, models=models, device='cuda', idle_device='cpu')

Image-to-image generation:

from PIL import Image

prompts = ["a photograph of an astronaut riding a horse"]
input_images = [Image.open('space.jpg')]
images = pipeline.generate(prompts, input_images=input_images)

...with custom strength:

prompts = ["a photograph of an astronaut riding a horse"]
input_images = [Image.open('space.jpg')]
images = pipeline.generate(prompts, input_images=input_images, strength=0.6)

Change the classifier-free guidance scale:

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, cfg_scale=11)

...or disable classifier-free guidance:

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, do_cfg=False)
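
For background (standard classifier-free guidance, not this repo's exact code): the model is run twice per step, once on the prompt and once on the unconditional prompt, and the two noise predictions are combined before sampling:

# Standard CFG combination (background sketch; cond and uncond are the two U-Net outputs).
pred = uncond + cfg_scale * (cond - uncond)

Disabling CFG skips the unconditional pass, which is faster but follows the prompt less strictly.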

Reduce steps (faster generation, lower quality):

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, n_inference_steps=28)

Use a different sampler:

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, sampler="k_euler")
# "k_lms" (default), "k_euler", or "k_euler_ancestral" is available

Generate images with a custom size:

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, height=512, width=768)

LICENSE

All code in this repository is licensed under the MIT License. Please see the LICENSE file.

Note that the Stable Diffusion checkpoint files are licensed under the CreativeML Open RAIL-M License. It has a use-based restriction clause, so you should read it.

stable-diffusion-pytorch's People

Contributors

  • aaavvv
  • kjsman
  • mspronesti

stable-diffusion-pytorch's Issues

Is 4 GB of VRAM too small for this program?

Thanks for the implementation!
I got
OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 3.31 GiB already allocated; 0 bytes free; 3.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
when running demo.ipynb. Is there a solution?
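
One workaround worth trying here is the idle_device option documented in the README above (a sketch; even this may not fit a 4 GB card):

from stable_diffusion_pytorch import model_loader, pipeline

# Keep the weights in CPU RAM and move each model to the GPU only while it runs.
models = model_loader.preload_models('cpu')

prompts = ["a photograph of an astronaut riding a horse"]
images = pipeline.generate(prompts, models=models,
                           device='cuda', idle_device='cpu')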

How are the models in data.zip made?

Thank you for making this repo; it's very educational. This minimal implementation is brilliant. The bigger SD repos are very hard to understand.

Did you have a script to convert official models like this one: https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt to the format you use in this repo?

Or are you using a model from some other source?

Are you using the SD 1.5 model?

How hard would it be to make this repo use models trained by others, like Inkpunk for example? https://huggingface.co/Envvi/Inkpunk-Diffusion/blob/main/Inkpunk-Diffusion-v2.ckpt

Should I scale the input image?

Hello, @kjsman, thanks for this easily readable implementation. I have a question: am I correct that I should scale the input with a sampler before passing the image to the U-Net model during the training process?

query

I hope this message finds you well. I recently came across your repository for Stable Diffusion in PyTorch, and I must say, your effort in making the codebase minimal and easy to read is commendable. I am new to generative models, and your implementation has piqued my interest.
I was wondering if you could provide some insights into the training process of your Stable Diffusion model. Specifically, I am curious about the following:

  1. Training Data: Could you please let me know which dataset you trained your model on? Understanding the dataset used would help me get a better sense of the model's capabilities and limitations.

  2. Training Time: I am also interested in how long it took for your model to train. This information will help me gauge the computational requirements and plan accordingly for any experiments or projects involving Stable Diffusion.

Moreover, I would like to know more about your approach to writing this code. Did you primarily refer to research papers, or did you take inspiration from other implementations? For instance, you mentioned using Andrej Karpathy's minGPT. Could you share your thought process behind choosing this reference, or any other methods you considered during your implementation?
I greatly appreciate your assistance and expertise in this matter. Thank you for your time and for sharing your work with the community. I look forward to your response.

Image in latent space gets shifted during encoding.

I am using a simple red image as input:

[Image: the red input image]

from stable_diffusion_pytorch import pipeline
from PIL import Image

prompts = ["a photograph of an astronaut riding a horse"]
input_images = [Image.open('red.png')]
images = pipeline.generate(prompts, input_images=input_images)
images[0].save('output.png')

But I am getting the input image shifted by 8px down and 8px right, and it generates an ugly brown border:

[Image: shifted output with a brown border]

I am pretty sure it happens during the encode pass, as it is already shifted in latent space. Here is a custom dump of the latent space to an image:

[Image: latent-space dump after encode/decode]

Something in the encode pass is shifting it by a pixel in the latent space, and I can't figure out what.
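
One possible way to isolate the round trip (a hypothetical diagnostic; it assumes strength accepts values near zero) is to run image-to-image with a very low strength, so the output is dominated by the VAE encode/decode path rather than by the diffusion steps:

from stable_diffusion_pytorch import pipeline
from PIL import Image

# With strength near 0, almost no noise is added, so the result approximates
# a pure encode/decode of the input and should expose the pixel shift.
input_images = [Image.open('red.png')]
images = pipeline.generate(["a red image"], input_images=input_images,
                           strength=0.05)
images[0].save('roundtrip.png')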

[Enhancement] automate weights download without user action

Hello @kjsman,
this is more a feature proposal than an actual issue. Instead of requiring the user to download and unpack the tar file containing the weights and the vocabulary from your Hugging Face Hub repository, one can make model_loader and the Tokenizer download and cache them directly.

For the first part, it only requires replacing torch.load(...) here (and in the other 3 functions in the same file) with:

torch.hub.load_state_dict_from_url(weights_url, check_hash=True)

All it takes on your side is to upload the 4 .pt files (not in a zipped file) to the Hugging Face Hub, and that's it.
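
A minimal sketch of what one loader could look like after the change (the URL and the CLIP import path below are assumptions; adapt them to the repository's actual layout):

import torch
from stable_diffusion_pytorch.clip import CLIP  # assumed module path

# Hypothetical URL; the four .pt files would need to be uploaded individually.
CLIP_URL = "https://huggingface.co/kjsman/stable-diffusion-pytorch/resolve/main/clip.pt"

def load_clip(device):
    model = CLIP().to(device)
    # Downloads once, caches under ~/.cache/torch/hub/checkpoints, and
    # verifies the hash when the file name embeds one.
    state_dict = torch.hub.load_state_dict_from_url(
        CLIP_URL, map_location=device, check_hash=True)
    model.load_state_dict(state_dict)
    return model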

As for the tokenizer, it just takes adding a default_bpe() method/function:

import os
from functools import lru_cache
from urllib.request import urlretrieve


@lru_cache()
def default_bpe():
    # Use the vocab file shipped next to this module if it is present.
    p = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz"
    )
    if os.path.exists(p):
        return p
    # Otherwise download it from the OpenAI CLIP repository.
    # urlretrieve returns (filename, headers); keep only the local path.
    filename, _headers = urlretrieve(
        "https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz?raw=true",
        "bpe_simple_vocab_16e6.txt.gz",
    )
    return filename

Another option, if you prefer to keep your vocab.json and merges.txt, is to upload them as well to the Hugging Face Hub (not in a tar file), or directly to GitHub like the original repository does with its vocab.

If you like the idea, I will open a new PR; otherwise, please let me know if you have a better idea, or close this issue if you are not interested in this feature 😄
