
Comments (5)

kohjingyu commented on September 13, 2024

Thanks for pointing this out! I realize this wasn't mentioned in the paper, but we follow the same scheme as Fromage in concatenating captions so that the model sees some interleaved image + text data (i.e., <image><text><image><text>):
[Screenshot attached in the original comment.]
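For illustration, a minimal sketch of what such concatenation might look like. This is not the actual GILL/FROMAGe data-loading code; the function name, arguments, and tensor shapes are assumptions:

    # Hypothetical sketch: build an interleaved <image><text><image><text>...
    # sequence from a batch of per-example image and caption embeddings.
    import torch

    def concat_interleaved(image_embs, caption_embs):
        # image_embs: list of (n_img_tokens, dim) tensors (visual prefixes)
        # caption_embs: list of (n_txt_tokens, dim) tensors (caption tokens)
        pieces = []
        for img, txt in zip(image_embs, caption_embs):
            pieces.append(img)  # image tokens for this example
            pieces.append(txt)  # its caption tokens immediately after
        # One long sequence, so the LM attends across image/text boundaries.
        return torch.cat(pieces, dim=0)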

We'll add this to the next version of the paper.


Epiphqny commented on September 13, 2024

Hi @kohjingyu, thanks for your quick reply; I understand the purpose now. But I have one additional question about the inference process. Why don't you append the [IMG] tokens after the caption during inference, as is done during training?

[Screenshot attached in the original comment.]


kohjingyu commented on September 13, 2024

During inference for evaluation, we do add the [IMG] tokens:

    # Set a really high ret scale so that we force the model to generate an image.
    # This is equivalent to explicitly appending the [IMG] tokens to the input.
    return_outputs = model.generate_for_images_and_texts(
        input_data, num_words=2, gen_scale_factor=1e5, generator=g_cuda)

If you mean for qualitative inference (e.g., a chatbot-based demo), we don't add it explicitly because we want the model to decide when it should produce [IMG] tokens (rather than always adding images at the end of generated text). Hope that makes sense!
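To make the equivalence concrete, here is a rough sketch of how a large gen_scale_factor could force [IMG] generation at decoding time. The function, the abs() trick, and the token-id argument are assumptions, not GILL's actual implementation:

    import torch

    def apply_gen_scale(logits, img_token_id, gen_scale_factor=1.0):
        # Hypothetical: boost the [IMG] token logit by the scale factor.
        # abs() guards against a negative logit being pushed further down.
        logits = logits.clone()
        logits[..., img_token_id] = logits[..., img_token_id].abs() * gen_scale_factor
        return logits

    # With gen_scale_factor=1e5, the [IMG] logit dominates every other token,
    # so the model is forced to emit [IMG] -- equivalent to appending it to
    # the input explicitly.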


Epiphqny commented on September 13, 2024

@kohjingyu Thanks for your detailed explanation. Could you tell me how to train on the generation task only? Have you tested the performance of training on only the generation task? Do I only need to keep the position and generation prediction losses? Where is the image token prediction loss in the code? It seems that the losses for generation and retrieval are tightly coupled, and I am not sure whether it makes sense to separate them.


kohjingyu commented on September 13, 2024

If you want to train a generation-only model, you can probably do so by changing this line to model_modes = ['generation']:

gill/main.py, line 479 (commit 232eb02):

    model_modes = ['captioning', 'retrieval', 'generation']
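That is, the change would look something like this (untested, per the caveat below):

    model_modes = ['generation']  # drop 'captioning' and 'retrieval'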

I haven't tested this in the latest version and you might need to make some minor changes in other parts of the code.


