
Comments (5)

kohjingyu commented on September 13, 2024

Thanks for pointing this out! I realize this wasn't mentioned in the paper, but we follow the same scheme as Fromage in concatenating captions so that the model sees some interleaved image + text data (i.e., <image><text><image><text>):
[Screenshot attached in the original comment.]
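For illustration, a minimal sketch of what such concatenation might look like. This is not the actual GILL/FROMAGe data-loading code; the function name, arguments, and tensor shapes are assumptions:

    # Hypothetical sketch: build an interleaved <image><text><image><text>...
    # sequence from a batch of per-example image and caption embeddings.
    import torch

    def concat_interleaved(image_embs, caption_embs):
        # image_embs: list of (n_img_tokens, dim) tensors (visual prefixes)
        # caption_embs: list of (n_txt_tokens, dim) tensors (caption tokens)
        pieces = []
        for img, txt in zip(image_embs, caption_embs):
            pieces.append(img)  # image tokens for this example
            pieces.append(txt)  # its caption tokens immediately after
        # One long sequence, so the LM attends across image/text boundaries.
        return torch.cat(pieces, dim=0)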

We'll add this to the next version of the paper.


Epiphqny commented on September 13, 2024

Hi @kohjingyu, thanks for your quick reply; I understand the purpose now. But I have one additional question about the inference process. Why don't you append the [IMG] tokens after the caption during inference, as is done during training?

[Screenshot attached in the original comment.]


kohjingyu commented on September 13, 2024

During inference for evaluation, we do add the [IMG] tokens:

    # Set a really high ret scale so that we force the model to generate an image.
    # This is equivalent to explicitly appending the [IMG] tokens to the input.
    return_outputs = model.generate_for_images_and_texts(
        input_data, num_words=2, gen_scale_factor=1e5, generator=g_cuda)

If you mean for qualitative inference (e.g., a chatbot-based demo), we don't add it explicitly because we want the model to decide when it should produce [IMG] tokens (rather than always adding images at the end of generated text). Hope that makes sense!
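To make the equivalence concrete, here is a rough sketch of how a large gen_scale_factor could force [IMG] generation at decoding time. The function, the abs() trick, and the token-id argument are assumptions, not GILL's actual implementation:

    import torch

    def apply_gen_scale(logits, img_token_id, gen_scale_factor=1.0):
        # Hypothetical: boost the [IMG] token logit by the scale factor.
        # abs() guards against a negative logit being pushed further down.
        logits = logits.clone()
        logits[..., img_token_id] = logits[..., img_token_id].abs() * gen_scale_factor
        return logits

    # With gen_scale_factor=1e5, the [IMG] logit dominates every other token,
    # so the model is forced to emit [IMG] -- equivalent to appending it to
    # the input explicitly.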


Epiphqny commented on September 13, 2024

@kohjingyu Thanks for your detailed explanation. Could you tell me how to train on the generation task only? Have you tested the performance of training on only the generation task? Do I only need to keep the position and generation prediction losses? Where is the image token prediction loss in the code? It seems that the losses for generation and retrieval are tightly coupled, and I am not sure whether it makes sense to separate them.


kohjingyu commented on September 13, 2024

If you want to train a generation-only model, you can probably do so by changing this line to model_modes = ['generation']:

gill/main.py, line 479 (commit 232eb02):

    model_modes = ['captioning', 'retrieval', 'generation']
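That is, the change would look something like this (untested, per the caveat below):

    model_modes = ['generation']  # drop 'captioning' and 'retrieval'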

I haven't tested this in the latest version and you might need to make some minor changes in other parts of the code.


