structured-diffusion-guidance's Introduction

Structured Diffusion Guidance (ICLR 2023)

We propose a method to fuse language structures into diffusion guidance for compositional text-to-image generation.

This is the official codebase for Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis.

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Weixi Feng1, Xuehai He2, Tsu-Jui Fu1, Varun Jampani3, Arjun Akula3, Pradyumna Narayana3, Sugato Basu3, Xin Eric Wang2, William Yang Wang1
1UCSB, 2UCSC, 3Google

Update:

Apr. 4th: updated links, uploaded benchmarks and GLIP eval scripts, updated bibtex.

Setup

Clone this repository and then create a conda environment with:

conda env create -f environment.yaml
conda activate structure_diffusion

If you already have a stable diffusion environment, you can run the following commands:

pip install stanza nltk scenegraphparser tqdm matplotlib
pip install -e .
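
Note that stanza needs to download its English models on first use (the demo loads the tokenize, pos, and constituency processors); a minimal sketch, assuming the default model packages:

# One-time model download for the parsers installed above.
import stanza

stanza.download("en")   # fetches the tokenize / pos / constituency packages used by the demo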

Inference

This repository supports Stable Diffusion v1.4 for now. Please refer to the official stable-diffusion repository to download the pre-trained model and place it under models/ldm/stable-diffusion-v1/. Our method is training-free and can be applied directly to the trained Stable Diffusion checkpoint.

To generate an image, run

python scripts/txt2img_demo.py --prompt "A red teddy bear in a christmas hat sitting next to a glass" --plms --parser_type constituency

By default, the guidance scale is set to 7.5 and the output image size is 512x512. We only support PLMS sampling with a batch size of 1 for now. Apart from the default Stable Diffusion arguments, we add --parser_type and --conjunction.

usage: txt2img_demo.py [-h] [--prompt [PROMPT]] ...
                       [--parser_type {constituency,scene_graph}] [--conjunction] [--save_attn_maps]

optional arguments:
    ...
  --parser_type {constituency,scene_graph}
  --conjunction         If True, the input prompt is a conjunction of two concepts like "A and B"
  --save_attn_maps      If True, the attention maps will be saved as a .pth file with the name same as the image
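
For intuition only, the sketch below (not the repository's actual parsing code; the function and prompt are illustrative) shows how noun-phrase spans can be pulled out of a prompt with stanza's constituency parser, which --parser_type constituency builds on:

# Sketch: extract noun-phrase spans from a prompt with stanza's constituency parser.
# Illustrative only; not the exact logic used in scripts/txt2img_demo.py.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency")
tree = nlp("A red teddy bear in a christmas hat sitting next to a glass").sentences[0].constituency

def noun_phrases(node, spans):
    # Recursively collect the text under every NP node.
    if node.label == "NP":
        spans.append(" ".join(node.leaf_labels()))
    for child in node.children:
        noun_phrases(child, spans)
    return spans

print(noun_phrases(tree, []))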

Without specifying the conjunction argument, the model applies one key and multiple values for each cross-attention layer. For concept conjunction prompts, you can run:

python scripts/txt2img_demo.py --prompt "A red car and a white sheep" --plms --parser_type constituency --conjunction
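
For intuition, the "one key and multiple values" behavior described above can be sketched as follows (illustrative PyTorch with assumed shapes and names; it is not the repository's CrossAttention module):

# Sketch: share one attention map across several value tensors and average the outputs.
import torch

def one_key_multi_value_attention(q, k, values):
    # q: (B, N_q, d), k: (B, N_k, d), values: list of (B, N_k, d) tensors
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)   # (B, N_q, N_k)
    outs = [attn @ v for v in values]                          # one output per value set
    return torch.stack(outs).mean(dim=0)                       # average over value sets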

Overall, compositional prompts remain a challenge for Stable Diffusion v1.4. It may still take several attempts to get a correct image with our method. The improvement is system-level rather than sample-level, and we are still looking for good evaluation metrics for compositional T2I synthesis. We observe fewer missing objects in Stable Diffusion v2 and are implementing our method on top of it as well. Please feel free to reach out for a discussion.

Benchmarks

CC-500.txt: concept conjunction prompts, each pairing two objects with different colors (lines 1-446).
ABC-6K.txt: ~6K attribute binding prompts collected and created from COCO captions.
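
To run a benchmark end to end, a small driver along these lines can loop over the prompts and call the demo script (a sketch only; paths and flags may need adjusting):

# Sketch: run the demo script over the CC-500 color-conjunction prompts (lines 1-446).
import subprocess

with open("CC-500.txt") as f:
    prompts = [line.strip() for line in f if line.strip()][:446]

for prompt in prompts:
    subprocess.run([
        "python", "scripts/txt2img_demo.py",
        "--prompt", prompt,
        "--plms", "--parser_type", "constituency", "--conjunction",
    ], check=True)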

GLIP Eval

For our GLIP eval, please first clone the official GLIP repo, set up your environment according to its instructions, and download the model checkpoint(s). Then refer to our GLIP_eval/eval.py; you may need to modify lines 59 and 82. We assume that each image file name contains the text prompt.
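
The Zero/One/Two obj. metrics then amount to counting, per image, how many of the two prompted objects GLIP detects. A rough sketch (the glip_results.json name and JSON layout below are assumptions; adapt them to eval.py's actual output):

# Sketch only: bucket images by how many of the two prompted objects GLIP detected.
import json
from collections import Counter

with open("glip_results.json") as f:
    results = json.load(f)              # assumed: {image_name: [detected object labels]}

buckets = Counter()
for image_name, detected in results.items():
    n = min(len(set(detected)), 2)      # 0, 1, or 2 distinct prompted objects found
    buckets[("zero", "one", "two")[n]] += 1

total = len(results)
for label in ("zero", "one", "two"):
    print(f"{label} obj.: {buckets[label] / total:.1%}")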

Comments

Our codebase builds heavily on Stable Diffusion. Thanks for open-sourcing!

Citing our Paper

If you find our code or paper useful for your research, please consider citing

@inproceedings{feng2023trainingfree,
title={Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis},
author={Weixi Feng and Xuehai He and Tsu-Jui Fu and Varun Jampani and Arjun Reddy Akula and Pradyumna Narayana and Sugato Basu and Xin Eric Wang and William Yang Wang},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=PUIqjT4rzq7}
}

structured-diffusion-guidance's Issues

How to reproduce the results using GLIP

Hi,

Thanks for the great work. I wonder how to reproduce the GLIP results in Table 2 of the paper. I ran the script GLIP_eval/eval.py and got the detection results saved in "glip_results.json", but have no idea how to compute the final numbers. Could you please give some advice on how to calculate the GLIP-related metrics, e.g., Zero/One obj. and Two obj.? Many thanks!

Ablation study of contextualized text embeddings

Dear authors, thanks for your great work.

I would like to ask one question about the ablation study of contextualized text embeddings.

In Section 4.2, you compare the results of using sentences of different lengths for image generation. I was wondering whether the operation of using only a few embeddings from CLIP is applied both when encoding the concept (obtained from the constituency tree) and when encoding the whole sentence?

Thanks in advance for your response.

How can I use the DDIM sampler instead of the PLMS sampler in this codebase?

When I tried to switch from the PLMS sampler to the DDIM sampler in this codebase, I got the following error.

Traceback (most recent call last):
  File "scripts/txt2img_timer.py", line 507, in <module>
    main()
  File "scripts/txt2img_timer.py", line 465, in main
    samples_ddim, intermediates = sampler.sample(S=opt.ddim_steps,
  File "/home/shawn/anaconda3/envs/structure_diffusion/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/shawn/local/szz/workspace/personal_structure/structured_stable_diffusion/models/diffusion/ddim.py", line 83, in sample
    cbs = conditioning[list(conditioning.keys())[0]].shape[0]
AttributeError: 'list' object has no attribute 'shape'

How should I solve this?
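
I suspect the batch-size check in ddim.py simply doesn't handle the list/dict conditioning this codebase passes in. Would something along these lines be the right direction (my own untested sketch, not code from this repository)?

# Untested sketch of a more tolerant batch-size check for ddim.py's sample(),
# replacing the line shown in the traceback above.
def conditioning_batch_size(conditioning):
    # Handle dict conditioning ({'k': [...], 'v': [...]}), plain lists, and tensors.
    if isinstance(conditioning, dict):
        conditioning = list(conditioning.values())[0]
    if isinstance(conditioning, (list, tuple)):
        conditioning = conditioning[0]
    return conditioning.shape[0]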

Be killed at the beginning

Hello:

2023-11-14 16:29:51 INFO: Loading these models for language: en (English):
===========================
| Processor    | Package  |
---------------------------
| tokenize     | combined |
| pos          | combined |
| constituency | wsj      |
===========================

2023-11-14 16:29:51 INFO: Use device: gpu
2023-11-14 16:29:51 INFO: Loading: tokenize
2023-11-14 16:30:14 INFO: Loading: pos
2023-11-14 16:30:14 INFO: Loading: constituency
2023-11-14 16:30:15 INFO: Done loading processors!
Global seed set to 42
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
[6]    24162 killed     python scripts/txt2img_demo.py --prompt 

What may be the problem?

Wrong link

The link to your paper on the main page readme is wrong.

It links to https://arxiv.org/ (the main site) but not your particular paper page.

Failing to reproduce results

Congratulations on the arXiv submission!

I tried to reproduce the results of this paper on top of Huggingface Diffusers, based on the reference implementation provided in the preprint.

I ended up implementing it like so:
  • Changes to txt2img
  • Changes to diffusers
  • Some explanation in a tweet.

In my independent implementation, structured diffusion changes the images only slightly, and in the 10 samples × 4 prompts that I tried, it never made the generations more relevant to the prompt.

structured (left) / regular (right), "two blue sheep and a red goat" (image comparison attached)
I attach the rest of my results:
A red bird and a green apple.zip
A white goat standing next to two black goats.zip
two blue sheep and a red goat.zip
Two ripe spotted bananas are sitting inside a green bowl on a gray counter.zip

Basically, I'm wondering whether:

  • this is exactly the kind of difference I should expect to see (in line with the claimed 5–8% advantage)
  • there's a mistake in my reproduction; better results are possible

Could you possibly read my attention.py and see if it looks like a reasonable interpretation of your algorithm? I changed it substantially to make it do more work in parallel. I think it should be equivalent, but did I miss something important?

Thanks in advance for any attention you can give this!

Attention Maps

Hi!

I ran the code with --save_attn_maps but got AttributeError: 'CrossAttention' object has no attribute 'attn_maps'. I printed the CrossAttention module and found that it is composed of several layers of networks. How can I get the attention maps?

Is SDv2 supported?

👋 Hello! The README notes that you are implementing your method on top of Stable Diffusion v2. Is it supported now?

ModuleNotFoundError: No module named 'structured_stable_diffusion'

Hello, and thank you for the beautiful work. I am facing an issue while trying to run your code. Specifically, I can't do any of the relative imports, as the title indicates (for reference, the original Stable Diffusion code, which also uses relative imports, works fine). Thank you!

Conjunction Issues

I wonder why you chose the feature replaced by the last noun phrase as the value in the conjunction case. Is there any intuitive explanation? I'm confused, since A and B are equally important in an 'A and B' prompt. So why use the feature replaced by B as the value instead of A?

I changed the value from v_c[-1:] to v_c[-2:-1] and didn't see many differences. Does your experiment show that v_c[-1] is a better choice?

if not conjunction:
    c = {'k': k_c[:1], 'v': v_c}
else:
    # c = {'k': k_c, 'v': v_c[-1:]}
    c = {'k': k_c, 'v': v_c[-2:-1]}

"ABB": generated with v_c[-1:] (image attached)
"AAB": generated with v_c[-2:-1] (image attached)
