
dreamllm's People

Contributors

eltociear, runpeidong


dreamllm's Issues

Question about the generation of conditional embeddings

Hi, thanks for your great work. I have some questions when I read the paper.

The generation of conditional embeddings is shown in Equation (3). The learnable dream tokens, together with the interleaved document sequence so far, x, and the images generated so far, V, are fed into a cross-attention model to produce the conditional embeddings. I have a few detailed questions about this process:

  1. What is the architecture of the cross-attention model? How deep is it? Is it randomly initialized?
  2. For the input of this cross-attention model, do you feed in the raw text features before the LLM, or the LLM's last hidden states? And for the visual features V, do you feed in the features output by the visual encoder and the visual projection?

Thanks!
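
To make the question concrete, here is a rough sketch of what I imagine the conditioner in Equation (3) might look like: learnable dream queries cross-attending over the context features. The depth, number of queries, head count, and random initialization below are pure guesses, which is exactly what I am asking about:

```python
import torch
import torch.nn as nn

class DreamQueryConditioner(nn.Module):
    """Hypothetical sketch: learnable queries cross-attend over context
    features to produce conditional embeddings for the image decoder."""

    def __init__(self, num_queries=64, d_model=4096, n_heads=32, n_layers=2):
        super().__init__()
        # Learnable dream queries; random initialization is an assumption.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, seq_len, d_model) -- whether this should be the
        # raw pre-LLM features or the LLM's last hidden states is question 2.
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, context, context)  # queries attend over context
            q = self.norm(q + out)              # residual + norm
        return q  # (batch, num_queries, d_model), fed to the image decoder
```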

Clarification Regarding Visual embeddings

Hi, thanks for this new technique for extending MLLMs to interleaved documents.

I have a question regarding the visual encoder in Figure 2 and Section 3.

(screenshot of Figure 2 from the paper)

In Figure 2, as I understand it, during training itself the model learns to generate <dream> tokens (as in the cat example), for which dream queries are learned; these are essentially textual-inversion-style embeddings that can be synergized with the remaining context of the textual tokens. (The figure shows the inference stream, but the interleaved document shows text with a cat image, which confuses me a bit.)
But as that happens, are we also sending the input image back into the model via CLIP encodings through an extra projection, as shown in the diagram? (Is that right, or are we just using the dream queries further ahead?)

Also, for taking visual inputs, are we following a pipeline similar to Emu's, i.e., CLIP embeddings with a projection?

Thank you
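
To clarify what I mean by the Emu-style pipeline, here is a rough sketch of the visual-input path I am picturing: frozen CLIP image features mapped into the LLM's embedding space by a learned projection. The checkpoint name and dimensions are just placeholders for illustration:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisualProjector(nn.Module):
    """Hypothetical sketch: frozen CLIP vision encoder + learned linear
    projection into the LLM embedding space."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14", d_llm=4096):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        self.encoder.requires_grad_(False)        # keep CLIP frozen
        d_clip = self.encoder.config.hidden_size  # 1024 for ViT-L/14
        self.proj = nn.Linear(d_clip, d_llm)      # learned projection

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        # Drop the CLS token and project patch features into the LLM space.
        return self.proj(feats[:, 1:])  # (batch, num_patches, d_llm)
```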
