
dreamllm's People

Contributors

eltociear, runpeidong


dreamllm's Issues

Question about the generation of conditional embeddings

Hi, thanks for your great work. I have some questions when I read the paper.

The generation of conditional embeddings is shown in Equation (3). The learnable dream tokens, together with the interleaved document sequence so far, x, and the images generated so far, V, are fed into a cross-attention model to produce the conditional embeddings. I have a few detailed questions about this process:

  1. What is the architecture of the cross-attention model? How deep is it? Is it randomly initialized?
  2. For the input of this cross-attention model, do you feed in the raw text features before the LLM, or the LLM's last hidden states? And for the visual features V, do you feed in the features output by the visual encoder and the visual projection?

Thanks!
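
To make the question concrete, here is a rough sketch of what I imagine the conditioner in Equation (3) might look like: learnable dream queries cross-attending over the context features. The depth, number of queries, head count, and random initialization below are pure guesses, which is exactly what I am asking about:

```python
import torch
import torch.nn as nn

class DreamQueryConditioner(nn.Module):
    """Hypothetical sketch: learnable queries cross-attend over context
    features to produce conditional embeddings for the image decoder."""

    def __init__(self, num_queries=64, d_model=4096, n_heads=32, n_layers=2):
        super().__init__()
        # Learnable dream queries; random initialization is an assumption.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, seq_len, d_model) -- whether this should be the
        # raw pre-LLM features or the LLM's last hidden states is question 2.
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, context, context)  # queries attend over context
            q = self.norm(q + out)              # residual + norm
        return q  # (batch, num_queries, d_model), fed to the image decoder
```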

Clarification Regarding Visual embeddings

Hi, thanks for this new technique for extending MLLMs to interleaved documents.

I have a question regarding the visual encoder in Figure 2 and Section 3.

(screenshot of Figure 2 from the paper)

In Figure 2, as I understand it, during training itself the model learns to generate <dream> tokens (as in the cat example), for which dream queries are learned; these are essentially textual-inversion-style embeddings that can be synergized with the remaining context of the textual tokens. (The figure shows the inference stream, but the interleaved document shows text with a cat image, which confuses me a bit.)
But as that happens, are we also sending the input image back into the model via CLIP encodings through an extra projection, as shown in the diagram? (Is that right, or are we just using the dream queries further ahead?)

Also, for taking visual inputs, are we following a pipeline similar to Emu's, i.e., CLIP embeddings with a projection?

Thank you
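
To clarify what I mean by the Emu-style pipeline, here is a rough sketch of the visual-input path I am picturing: frozen CLIP image features mapped into the LLM's embedding space by a learned projection. The checkpoint name and dimensions are just placeholders for illustration:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisualProjector(nn.Module):
    """Hypothetical sketch: frozen CLIP vision encoder + learned linear
    projection into the LLM embedding space."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14", d_llm=4096):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        self.encoder.requires_grad_(False)        # keep CLIP frozen
        d_clip = self.encoder.config.hidden_size  # 1024 for ViT-L/14
        self.proj = nn.Linear(d_clip, d_llm)      # learned projection

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        # Drop the CLS token and project patch features into the LLM space.
        return self.proj(feats[:, 1:])  # (batch, num_patches, d_llm)
```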
