Comments (5)
Thanks for pointing this out! I realize this wasn't mentioned in the paper, but we follow the same scheme as Fromage in concatenating captions so that the model sees some interleaved image + text data (i.e., <image><text><image><text>).
We'll add this to the next version of the paper.
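As a rough illustration of the concatenation idea described above, here is a minimal sketch in plain Python. It is not the code from the repository: the function name `concat_examples`, the `group_size` parameter, and the literal `<image>` placeholder string are all assumptions made for this example. In the real model the placeholder would correspond to visual embeddings rather than text.

```python
from typing import List

# Hypothetical marker standing in for the visual embeddings of one image.
IMAGE_PLACEHOLDER = "<image>"


def concat_examples(captions: List[str], group_size: int = 2) -> List[str]:
    """Join consecutive captions into interleaved <image><text>... sequences.

    Each caption is assumed to belong to exactly one image, so every caption
    is preceded by one image placeholder.
    """
    sequences = []
    for start in range(0, len(captions), group_size):
        group = captions[start:start + group_size]
        sequences.append(" ".join(f"{IMAGE_PLACEHOLDER} {c}" for c in group))
    return sequences


sequences = concat_examples(["a dog on a beach", "a red car", "two cats"])
```

With `group_size=2`, the first two captions end up in one sequence of the form <image><text><image><text>, and the leftover caption forms a shorter sequence on its own.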
from gill.
Hi @kohjingyu, thanks for your quick reply. I understand the purpose now, but I have one more question about the inference process: why don't you append the [IMG] tokens after the caption during inference, just as in training?
from gill.
During inference for evaluation, we do add the [IMG] tokens:
gill/evals/generate_vist_images.py, lines 70 to 73 in 232eb02
If you mean for qualitative inference (e.g., a chatbot-based demo), we don't add it explicitly because we want the model to decide when it should produce [IMG] tokens (rather than always adding images at the end of generated text). Hope that makes sense!
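To make the distinction concrete, here is a small sketch of the two prompting modes. It is an illustration only, not the code at the lines referenced above: the helper name `build_prompt`, the `force_image` flag, the default of 8 tokens, and the `[IMG0]`, `[IMG1]`, ... naming are assumptions for this example.

```python
def build_prompt(caption: str, force_image: bool, num_img_tokens: int = 8) -> str:
    """Return the text prompt for one of the two inference modes.

    force_image=True  : evaluation-style prompt, [IMG] tokens appended explicitly.
    force_image=False : demo-style prompt, the model decides whether to emit them.
    """
    if not force_image:
        return caption
    # Assumed token naming; the actual vocabulary entries may differ.
    img_tokens = "".join(f"[IMG{i}]" for i in range(num_img_tokens))
    return f"{caption} {img_tokens}"


eval_prompt = build_prompt("A dog runs on the beach.", force_image=True, num_img_tokens=2)
demo_prompt = build_prompt("A dog runs on the beach.", force_image=False)
```

In the evaluation case the image is always requested, which makes outputs comparable across examples; in the demo case the prompt is left untouched so image generation stays a decision of the model.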
from gill.
@kohjingyu Thanks for your detailed explanation. Could you tell me how to train on the generation task only? Have you tested the performance when training on just the generation task? Do I only need to keep the position and generation prediction losses? Where is the image token prediction loss in the code? It seems that the losses for generation and retrieval are tightly coupled, and I am not sure whether it makes sense to separate them.
from gill.
If you want to train a generation-only model, you can probably do so by changing this line (line 479 in 232eb02) to model_modes = ['generation'].
I haven't tested this in the latest version and you might need to make some minor changes in other parts of the code.
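For intuition, the effect of restricting model_modes can be sketched as follows. This is a simplified stand-in, not the repository's training loop: the function `train_step_loss`, the `loss_fns` mapping, the mode names other than 'generation', and the constant loss values are hypothetical and only illustrate how a per-mode loop would skip the other objectives.

```python
from typing import Callable, Dict, List


def train_step_loss(model_modes: List[str],
                    loss_fns: Dict[str, Callable[[], float]]) -> float:
    """Sum the loss of every mode listed in model_modes for one training step."""
    total = 0.0
    for mode in model_modes:
        total += loss_fns[mode]()
    return total


# Stub losses with made-up constant values, purely for illustration.
loss_fns = {
    "captioning": lambda: 1.0,
    "retrieval": lambda: 2.0,
    "generation": lambda: 4.0,
}

all_modes_loss = train_step_loss(["captioning", "retrieval", "generation"], loss_fns)
generation_only_loss = train_step_loss(["generation"], loss_fns)
```

Under this assumption, modes that are not listed never contribute to the step, so any code that expects outputs from the skipped modes (for example, logging or metrics) may need the minor changes mentioned above.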
from gill.