weixi-feng / structured-diffusion-guidance
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
License: Other
Hello:
2023-11-14 16:29:51 INFO: Loading these models for language: en (English):
===========================
| Processor | Package |
---------------------------
| tokenize | combined |
| pos | combined |
| constituency | wsj |
===========================
2023-11-14 16:29:51 INFO: Use device: gpu
2023-11-14 16:29:51 INFO: Loading: tokenize
2023-11-14 16:30:14 INFO: Loading: pos
2023-11-14 16:30:14 INFO: Loading: constituency
2023-11-14 16:30:15 INFO: Done loading processors!
Global seed set to 42
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
[6] 24162 killed python scripts/txt2img_demo.py --prompt
What may be the problem?
Congratulations on the arxiv submission!
I tried to reproduce the results of this paper on top of Huggingface Diffusers, based on the reference implementation provided in the preprint.
I ended up implementing it like so:
Changes to txt2img
Changes to diffusers
Some explanation in a tweet.
In my independent implementation: structured diffusion changes the images only slightly, and in the 10 samples * 4 prompts that I tried, never made the generations more relevant to the prompt.
structured (left) / regular (right) "two blue sheep and a red goat":
I attach the rest of my results:
A red bird and a green apple.zip
A white goat standing next to two black goats.zip
two blue sheep and a red goat.zip
Two ripe spotted bananas are sitting inside a green bowl on a gray counter.zip
Basically, I'm wondering: could you possibly read my attention.py
and see whether it looks like a reasonable interpretation of your algorithm? I changed it substantially to make it do more work in parallel. I think it should be equivalent, but did I miss something important?
Thanks in advance for any attention you can give this!
Because there are too many attention maps, I could not get results comparable to your paper. Could you show me your post-processing code?
Hello, and thank you for the beautiful work. I am facing an issue while trying to run your code: specifically, I can't do any relative imports, as the title indicates. (For reference, the original Stable Diffusion code, which also uses relative imports, works fine.) Thank you!
I wonder why, in the conjunction case, you chose the feature replaced by the last noun phrase as the value. Is there an intuitive explanation? I'm confused, since A and B are equally important in an 'A and B' prompt. So why use the feature replaced by B as the value instead of A?
I changed the value from v_c[-1:] to v_c[-2:-1] and didn't see much difference. Do your experiments show that v_c[-1:] is a better choice?
if not conjunction:
    c = {'k': k_c[:1], 'v': v_c}
else:
    # c = {'k': k_c, 'v': v_c[-1:]}
    c = {'k': k_c, 'v': v_c[-2:-1]}
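For reference, a minimal sketch of what the two slices select. Here v_c is a stand-in list of strings; in the real code it holds value tensors, and the assumption that entries are ordered one per noun phrase is mine, for illustration only:

```python
# Stand-in list; the real v_c holds value tensors, one per noun phrase
# (this ordering is an assumption for illustration).
v_c = ["v(NP_1)", "v(NP_2)", "v(NP_3)"]

print(v_c[-1:])    # ['v(NP_3)']  - last entry, the original choice
print(v_c[-2:-1])  # ['v(NP_2)']  - second-to-last, the variant tried above
```

Both slices return a one-element list, so the swap only changes which noun phrase's value feature is used, not the shape of the conditioning.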
The link to your paper in the main-page README is wrong. It links to https://arxiv.org/ (the main site), not to your particular paper page.
Hi!
I ran the code with --save_atten_maps but got AttributeError: 'CrossAttention' object has no attribute 'attn_maps'. I printed the CrossAttention module and found that it is composed of several layers. How can I get the attention maps?
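One common workaround (a sketch, not the authors' code) is to register PyTorch forward hooks on the relevant submodules and store their outputs yourself. The Softmax layer below is only a stand-in for the real CrossAttention modules, and the module-matching condition is an assumption:

```python
import torch
import torch.nn as nn

# Sketch: capture intermediate outputs with forward hooks instead of
# relying on an `attn_maps` attribute existing on the module.
attn_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        attn_maps[name] = output.detach()
    return hook

# Stand-in model; in the real code you would iterate over the UNet and
# match the CrossAttention class instead of nn.Softmax.
model = nn.Sequential(nn.Linear(4, 4), nn.Softmax(dim=-1))
for name, module in model.named_modules():
    if isinstance(module, nn.Softmax):
        module.register_forward_hook(save_output(name))

_ = model(torch.randn(2, 4))
print(list(attn_maps))  # one entry per hooked module
```

After a forward pass, attn_maps holds one tensor per hooked layer, which you can then average or visualize.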
👋 Hello: I noticed that you are implementing your method on top of it; is it supported now?
Hi,
Thanks for the great work. I wonder how to reproduce the GLIP results in Table 2 of the paper. I ran the script GLIP_eval/eval.py and got the detection results saved in "glip_results.json", but had no idea how to calculate the GLIP results. Could you please give some advice to calculate the GLIP-related metrics, e.g., Zero/One obj. and Two obj.? Many thanks!
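In case it helps, here is a hedged sketch of how such per-category rates could be tallied from a per-image detection file. The JSON schema, field names, and category definitions below are all my assumptions, not the authors' actual evaluation code:

```python
# Hypothetical tally of Zero/One/Two-object rates from GLIP detections.
# Assumed schema: one record per generated image, listing which of the
# prompted objects GLIP detected. The real glip_results.json may differ.
records = [
    {"prompt": "two blue sheep and a red goat", "detected": ["sheep", "goat"]},
    {"prompt": "a red bird and a green apple", "detected": ["bird"]},
    {"prompt": "a red bird and a green apple", "detected": []},
]

counts = {"zero": 0, "one": 0, "two": 0}
for rec in records:
    n = min(len(rec["detected"]), 2)  # cap at 2 for the "Two obj." bucket
    counts[["zero", "one", "two"][n]] += 1

total = len(records)
rates = {k: v / total for k, v in counts.items()}
print(rates)
```

If the saved JSON instead stores raw boxes and scores, a confidence threshold would be needed before counting, which is another detail only the authors can confirm.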
Thanks for the great work. When I was testing the average inference speed of Stable Diffusion and this work, I found that it takes approx. 7 s for SD 1.4 to generate one image, but only 5 s for this codebase. Why would this happen? I have made sure that both models use the DDIM sampler.
So we can also add our own model and use it
Dear authors, thanks for your great work.
I would like to ask one question about the ablation study of contextualized text embeddings.
In Section 4.2, you compared results using sentences of different lengths for image generation. I was wondering whether the operation of using only a few embeddings from CLIP was applied both when encoding the concepts (obtained from the constituency tree) and when encoding the whole sentence?
Thanks in advance for your response.
When I tried to change the PLMS sampler to the DDIM sampler in this codebase, I got the following error.
Traceback (most recent call last):
File "scripts/txt2img_timer.py", line 507, in <module>
main()
File "scripts/txt2img_timer.py", line 465, in main
samples_ddim, intermediates = sampler.sample(S=opt.ddim_steps,
File "/home/shawn/anaconda3/envs/structure_diffusion/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/shawn/local/szz/workspace/personal_structure/structured_stable_diffusion/models/diffusion/ddim.py", line 83, in sample
cbs = conditioning[list(conditioning.keys())[0]].shape[0]
AttributeError: 'list' object has no attribute 'shape'
How should I solve this?
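As a hedged guess at the cause: the structured conditioning appears to be a dict whose values are lists of tensors (one per noun phrase), while the batch-size check in ddim.py assumes a dict of plain tensors. A minimal sketch of a tolerant check, with FakeTensor standing in for torch.Tensor and all names illustrative:

```python
# Sketch: batch-size check that tolerates list-valued conditioning,
# mirroring the failing `cbs = ...` line in ddim.py.
class FakeTensor:
    """Stand-in for torch.Tensor; only the .shape attribute matters here."""
    def __init__(self, *shape):
        self.shape = shape

def batch_size_of(conditioning):
    if isinstance(conditioning, dict):
        first = conditioning[next(iter(conditioning))]
    else:
        first = conditioning
    if isinstance(first, list):  # structured guidance: list of per-NP tensors
        first = first[0]
    return first.shape[0]

cond = {"c_crossattn": [FakeTensor(2, 77, 768), FakeTensor(2, 77, 768)]}
print(batch_size_of(cond))  # -> 2
```

If the real conditioning matches this structure, patching the batch-size line in ddim.py to unwrap the list the same way should silence the AttributeError.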