
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (ICML 2023)


MultiDiffusion is a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning, as described in the paper.

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image and fast adaptation to new tasks still remain an open challenge, currently addressed mostly by costly and lengthy re-training and fine-tuning, or by ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high-quality and diverse images that adhere to user-provided controls, such as a desired aspect ratio (e.g., panorama) and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
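
At its heart is a simple procedure: at every denoising step, the large latent is cropped into overlapping windows, each window is denoised by the pre-trained model, and the per-pixel results are fused by averaging. A minimal sketch of this fusion step (the function and variable names, the fixed square window, and the stride are illustrative, not the repository's exact code):

import torch

def multidiffusion_step(latent, denoise_view, t, window=64, stride=32):
    # latent:       [1, C, H, W] noisy latent at timestep t
    # denoise_view: runs one diffusion denoising step on a window-sized crop
    value = torch.zeros_like(latent)  # accumulated denoised values
    count = torch.zeros_like(latent)  # number of windows covering each pixel
    _, _, H, W = latent.shape
    for h in range(0, H - window + 1, stride):
        for w in range(0, W - window + 1, stride):
            view = latent[:, :, h:h + window, w:w + window]
            value[:, :, h:h + window, w:w + window] += denoise_view(view, t)
            count[:, :, h:h + window, w:w + window] += 1
    # The least-squares fusion has a closed form: the per-pixel average.
    return torch.where(count > 0, value / count, value)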

For more details, see the project webpage.

Diffusers Integration

MultiDiffusion Text2Panorama is integrated into diffusers and can be run as follows:

import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

# Load the DDIM scheduler and the panorama pipeline from the SD 2 base weights.
model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_ckpt, scheduler=scheduler, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a panorama from a single text prompt.
prompt = "a photo of the dolomites"
image = pipe(prompt).images[0]
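
By default the pipeline produces a 512x2048 panorama; other sizes can be requested through the height and width call arguments. For example (the filename is arbitrary):

image = pipe(prompt, height=512, width=4096).images[0]
image.save("dolomites_panorama.png")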

Gradio Demo

We provide a Gradio UI for our method. Running the following command in a terminal will launch the demo:

python app_gradio.py

This demo is also hosted on Hugging Face here.

Spatial Controls

A web demo for the spatial controls is hosted on Hugging Face here.

Citation

@article{bar2023multidiffusion,
  title={MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation},
  author={Bar-Tal, Omer and Yariv, Lior and Lipman, Yaron and Dekel, Tali},
  journal={arXiv preprint arXiv:2302.08113},
  year={2023}
}


Issues

My custom implementation in Automatic1111's WebUI

Dear authors,

I have implemented your algorithm in Automatic1111's WebUI with the following optimizations:

  • Cropping views in a more symmetric way to get better results.
  • Pre-calculating weights to save time, as the weights won't change once the views are determined (see the sketch after these lists).
  • Batched latent view processing for acceleration.

Some WebUI-related features:

  • Compatibility with all samplers.
  • Compatibility with ControlNet.
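
To make the pre-calculated weights and batched views concrete, here is a rough sketch of the idea (illustrative code, not the extension's actual implementation): the coverage counts are computed once before sampling, and all view crops go through the model as a single batch.

import torch

def precompute_count(shape, views):
    # Coverage is fixed once the views are chosen, so compute it one time.
    count = torch.zeros(shape)
    for h0, h1, w0, w1 in views:
        count[:, :, h0:h1, w0:w1] += 1
    return count

def fused_batched_step(latent, views, count, denoise_batch, t):
    # One batched model call for all equally-sized views.
    crops = torch.cat([latent[:, :, h0:h1, w0:w1] for h0, h1, w0, w1 in views])
    denoised = denoise_batch(crops, t)
    value = torch.zeros_like(latent)
    for i, (h0, h1, w0, w1) in enumerate(views):
        value[:, :, h0:h1, w0:w1] += denoised[i:i + 1]
    return torch.where(count > 0, value / count, value)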

Here is the link:

Many thanks for your fantastic work, especially on img2img and panorama generation! We are working on text prompt support now.

However, uncontrolled large-image generation is far from ideal: repeated patterns always appear, and the image is mostly unusable.

Could you please share some insights on whether we can generate large images without a user-specified prompt mask?

For example, I have an idea (without proof): we could generate a small reference image first, obtain the prompt attention map, scale it to a larger resolution, and finally route each prompt automatically to its correct views during MultiDiffusion.

Thank you very much!

Color coded?

Does this allow color-coding the text to the corresponding mask? The example images use differently colored text that apparently corresponds to the colored masks.

Equation 4 interpretation for the panorama use case

Thanks for the awesome paper and very clear code!

For the panorama use case, can the method be reduced to the following implementation:
at each denoising step, take the average pixel value of the overlapping regions:

latent = torch.where(count > 0, value / count, value)

If yes, how does the least squares formulation of the paper align with it?
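
As a quick check on that reduction: for a per-pixel quadratic objective, the least-squares minimizer is the mean. If the $n$ overlapping windows assign values $a_1, \dots, a_n$ to a pixel, then

$$\arg\min_{x} \sum_{i=1}^{n} (x - a_i)^2 = \frac{1}{n} \sum_{i=1}^{n} a_i,$$

which is exactly the count-normalized average computed by the line above.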

And again, thanks!

Problem with Formula Derivation

Hello! Thank you for your great work! But I have a problem with the formula derivation.
I ran into difficulties when trying to derive equation 5 from equation 3.
Could you please help me and show the process?
Thanks a lot!

Ablation on Scheduler/Guidance Scale?

Since MultiDiffusion is just a diffusion process, did you ever compare it with different choices of schedulers and guidance scales? If you increase the guidance scale, can you get an image with very little content variation that still interpolates smoothly between different regions?
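
For anyone wanting to run such an ablation with the diffusers integration above, the guidance scale and step count can be varied per call; a minimal sketch (filenames are arbitrary):

# Sweep guidance scales on the panorama pipeline set up earlier.
for gs in (5.0, 7.5, 12.0):
    image = pipe(prompt, guidance_scale=gs, num_inference_steps=50).images[0]
    image.save(f"panorama_gs{gs}.png")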

A small problem in diffusers

Hello author, I found a small problem at line 466 of pipeline_stable_diffusion_panorama.py when using the StableDiffusionPanoramaPipeline from diffusers. Should the panorama_height in the conditional statement be changed to panorama_width?

Question About Blurring of Overlapping Masks

I noticed in the paper that there are at most 3 overlapping masks.
Have you tried overlapping many masks on top of each other?

I'm attempting to extend MultiDiffusion to other applications, and have noticed significant blurring as more masks are stacked on top of each other.

I notice this also wasn't mentioned in the limitations, so I was wondering whether you have ever tried it.

Resolution or Projection Question about MultiDiffusion

Panoramas represented in equirectangular projection are usually generated at an Hx2H resolution, such as 512x1024.
But MultiDiffusion uses a resolution of 512x2048. I'm very curious which projection MultiDiffusion uses, and how I can map the generated image onto a sphere.

Region-based generation not working for multiple prompts

Hello. I ran into a problem; can anyone help me with this?
Here's the code I run:

import torch
from region_based import MultiDiffusion  # region_based.py from this repo

device = torch.device('cuda')
sd = MultiDiffusion(device)

# Two foreground masks: top half for 'dog', bottom half for 'cat'.
mask = torch.zeros(2, 1, 512, 512).cuda()
mask[0, :, :256] = 1
mask[1, :, 256:] = 1

fg_masks = mask
bg_mask = 1 - torch.sum(fg_masks, dim=0, keepdim=True)
bg_mask[bg_mask < 0] = 0
masks = torch.cat([bg_mask, fg_masks])  # background mask + 2 foreground masks

prompts = ['dog', 'cat']
#neg_prompts = [opt.bg_negative] + opt.fg_negative
print(masks.shape, len(prompts))
img = sd.generate(masks, prompts, '', width=512)

It gave the following error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[12], line 17
     15 #neg_prompts = [opt.bg_negative] + opt.fg_negative
     16 print(masks.shape , len(prompts))
---> 17 img = sd.generate(masks, prompts , '' , width = 512 )

File ~/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/Desktop/project/MultiDiffusion/region_based.py:142, in MultiDiffusion.generate(self, masks, prompts, negative_prompts, height, width, num_inference_steps, guidance_scale, bootstrapping)
    139     bg = self.scheduler.add_noise(bg, noise[:, :, h_start:h_end, w_start:w_end], t)
    140     #print(latent.shape , 'latent')
    141     #print(latent_view.shape ,bg.shape,masks_view.shape)
--> 142     latent_view[1:] = latent_view[1:] * masks_view[1:] + bg * (1 - masks_view[1:])
    144 # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    145 latent_model_input = torch.cat([latent_view] * 2)

RuntimeError: The expanded size of the tensor (1) must match the existing size (2) at non-singleton dimension 0.  Target sizes: [1, 4, 64, 64].  Tensor sizes: [2, 4, 64, 64]

Thank you.

What does FTD stand for?

In the paper, the abbreviation FTD is used for the loss measuring the deviation of MultiDiffusion from standard diffusion. It's never explicitly stated what the abbreviation stands for, so I was hoping it could be clarified.

Bug in vae_optimize.py, vae_tile_forward uses 'result' when result is None

(Updated)
The last line of vae_tile_forward (vae_optimize.py line 650) is

return result if result is not None else result_approx.to(device)

When you interrupt the generation using HiRes fix (click Interrupt in the WebUI), an exception is thrown from this line:

     File "stable-diffusion-webui/extensions/multidiffusion-upscaler-for-automatic1111/scripts/vae_optimize.py", line 650, in vae_tile_forward
        return result if result is not None else result_approx.to(device)
    AttributeError: 'NoneType' object has no attribute 'to'
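
One defensive rewrite that would avoid the crash (a sketch only, not necessarily the extension's actual fix) guards both values and makes the interrupted case explicit:

# Guard both fallbacks; if generation was interrupted before any tile
# finished, return None and let the caller handle it.
if result is not None:
    return result
if result_approx is not None:
    return result_approx.to(device)
return None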
