Comments (10)
- I used the T5-Large provided by due-benchmark for preprocessing the data.
- The recommended transformers version 4.30.0 was giving a "loss does not have a grad function" error, so I had to replace the AdamW optimizer from transformers with the PyTorch one. I also tried 4.20 with AdamW from transformers, but there was no change in performance.
I used the last checkpoint (last.ckpt) to get the test predictions. Not sure what exactly is going wrong.
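The optimizer swap described above can be sketched as follows; the `model` here is a hypothetical stand-in for the actual UDOP model, but `torch.optim.AdamW` is a drop-in replacement for the (now-deprecated) `transformers` AdamW with the same core hyperparameters:

```python
import torch

# Hypothetical model: any torch.nn.Module would do; in the real setup this
# would be the UDOP / T5 model loaded from a checkpoint.
model = torch.nn.Linear(10, 2)

# Replace transformers' AdamW with PyTorch's own implementation.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# One toy training step: the loss carries a grad_fn, so backward() works.
x = torch.randn(4, 10)
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```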
from i-code.
Thank you for the quick reply. So is it not possible to get the results reported in the paper by running the published code without any changes? What is the exact prompt used for DocVQA? The prompt used in the RVL-CDIP code differs from what is mentioned in the paper, so I am not sure whether the prompt used for training DocVQA is the one from the paper either. It would be really helpful if you could provide all the details required to reproduce the results reported in the paper.
from i-code.
I think the main thing to focus on is the prompt. Finetuning with different prompts affects the performance. Properly adding the 2D and 1D position embeddings is also important. Anything missing could result in a performance drop.
from i-code.
The prompt should be the same as in the paper: "question answering on DocVQA. [question]. [context]".
I am mostly curious about the position embedding/bias addition to the model, which matters a lot if not set up properly. Could you provide more information? How many epochs did you run? If it still doesn't work, let me try to push the DocVQA finetuning code.
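For concreteness, the template above can be assembled like this. The function name and the handling of the question's trailing punctuation are my assumptions; the paper only gives the template string:

```python
def build_docvqa_prompt(question: str, context: str) -> str:
    # Template from the paper: "question answering on DocVQA. [question]. [context]"
    # Stripping the question's own trailing punctuation is an assumption here.
    return f"question answering on DocVQA. {question.rstrip('?.')}. {context}"

prompt = build_docvqa_prompt(
    "What is the invoice total?",
    "INVOICE no. 123 ... Total due: $45.00",
)
```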
from i-code.
I used the same prompt as above. The modifications I made are as follows:
- prepend the input_ids (item_dict["input_ids"]) with prompt_token_ids
- prepend the attention mask (item_dict["attention_mask"]) with N True values, where N is the length of prompt_token_ids
- prepend the bounding boxes (item_dict["seg_data"]["tokens"]["bboxes"]) with an Nx4 array of zeros, where N is the length of prompt_token_ids
I used this script for finetuning. Training always stops after around 4 epochs due to the early stopping criterion.
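The three modifications above can be sketched as follows. Plain Python lists stand in for the actual tensors, and the `item_dict` keys follow the comment; the surrounding data pipeline is assumed:

```python
def prepend_prompt(item_dict: dict, prompt_token_ids: list) -> dict:
    """Prepend prompt tokens to the input ids, attention mask, and bboxes."""
    n = len(prompt_token_ids)
    item_dict["input_ids"] = prompt_token_ids + item_dict["input_ids"]
    item_dict["attention_mask"] = [True] * n + item_dict["attention_mask"]
    # Prompt tokens get zero bounding boxes: they have no layout position.
    bboxes = item_dict["seg_data"]["tokens"]["bboxes"]
    zeros = [[0, 0, 0, 0] for _ in range(n)]
    item_dict["seg_data"]["tokens"]["bboxes"] = zeros + bboxes
    return item_dict

# Toy item with three document tokens and a two-token prompt.
item = {
    "input_ids": [10, 11, 12],
    "attention_mask": [True, True, True],
    "seg_data": {"tokens": {"bboxes": [[1, 2, 3, 4] for _ in range(3)]}},
}
item = prepend_prompt(item, prompt_token_ids=[5, 6])
```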
from i-code.
I was using the Unimodal 224 model. However, from the paper, the performance of the various models varies only within about ±2 points at most. Anyway, I will try the other models as well. Thanks for the input.
from i-code.
Hi, I tried the other two variants (512 and Dual) as well. These models also did not yield any significant improvement. So far the best score obtained on the DocVQA task in due-benchmark is 76.29, with the 512-resolution model.
from i-code.
Could you please provide the following details?
- Which model is used for preprocessing the data (generating memmaps)? Is it the t5-large provided by due-benchmark or the UDOP pretrained model?
- Which transformers version is used to train the model?
from i-code.
- T5-base is used for preprocessing the data; the t5-large is the one from Hugging Face transformers.
- I've tested with 4.20 and 4.30.
Btw, which checkpoint did you use for evaluation, the one with the lowest validation loss or the last one? I am asking because loss is usually not a good indicator of the language score, and we usually use the last checkpoint.
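To illustrate why the last checkpoint can beat the lowest-val-loss one: validation loss often bottoms out early while the language metric (e.g. ANLS) keeps improving. The numbers below are entirely invented for illustration, not measured results:

```python
# Hypothetical per-epoch records: (epoch, val_loss, anls).
history = [
    (1, 0.80, 60.1),
    (2, 0.62, 68.4),  # lowest validation loss...
    (3, 0.65, 72.0),
    (4, 0.70, 74.5),  # ...but the last epoch has the best ANLS
]

best_by_loss = min(history, key=lambda rec: rec[1])
last = history[-1]
# Checkpoint selection by loss picks epoch 2, yet the last checkpoint
# (epoch 4) scores higher on the language metric.
```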
from i-code.
What are the resource requirements for finetuning on the DocVQA task?
from i-code.