
vilt's Introduction

ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"


The main figure

Install

pip install -r requirements.txt
pip install -e .

Download Pretrained Weights

We provide five pretrained weights:

  1. ViLT-B/32 Pretrained with MLM+ITM for 200k steps on GCC+SBU+COCO+VG (ViLT-B/32 200k) link
  2. ViLT-B/32 200k finetuned on VQAv2 link
  3. ViLT-B/32 200k finetuned on NLVR2 link
  4. ViLT-B/32 200k finetuned on COCO IR/TR link
  5. ViLT-B/32 200k finetuned on F30K IR/TR link
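
For quick experimentation, here is a minimal sketch of loading one of these checkpoints into the model, following the pattern used by demo.py and by the snippets in the issues below; the checkpoint path and the exact loss_names values are illustrative, not prescriptive.

import copy

from vilt import config
from vilt.modules import ViLTransformerSS

# the Sacred config object is immutable, so deepcopy it before editing
conf = copy.deepcopy(config.config())
conf["load_path"] = "weights/vilt_200k_mlm_itm.ckpt"  # illustrative path to a downloaded checkpoint
conf["test_only"] = True
# a weight of 0.5 initializes a head without training it; 0 skips the head entirely
conf["loss_names"] = {
    "itm": 0.5, "mlm": 0.5, "mpp": 0, "vqa": 0,
    "imgcls": 0, "nlvr2": 0, "irtr": 0, "arc": 0,
}

model = ViLTransformerSS(conf)
model.eval()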

Out-of-the-box MLM + Visualization Demo

MLM + Visualization

pip install gradio==1.6.4
python demo.py with num_gpus=<0 if you have no gpus else 1> load_path="<YOUR_WEIGHT_ROOT>/vilt_200k_mlm_itm.ckpt"

ex)
python demo.py with num_gpus=0 load_path="weights/vilt_200k_mlm_itm.ckpt"

Out-of-the-box VQA Demo

VQA

pip install gradio==1.6.4
python demo_vqa.py with num_gpus=<0 if you have no gpus else 1> load_path="<YOUR_WEIGHT_ROOT>/vilt_vqa.ckpt" test_only=True

ex)
python demo_vqa.py with num_gpus=0 load_path="weights/vilt_vqa.ckpt" test_only=True

Dataset Preparation

See DATA.md

Train New Models

See TRAIN.md

Evaluation

See EVAL.md

Citation

If you use any part of this code or the pretrained weights for your own purposes, please cite our paper.

@InProceedings{pmlr-v139-kim21k,
  title = 	 {ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
  author =       {Kim, Wonjae and Son, Bokyung and Kim, Ildoo},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {5583--5594},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/kim21k/kim21k.pdf},
  url = 	 {http://proceedings.mlr.press/v139/kim21k.html},
  abstract = 	 {Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.}
}

Contact for Issues


vilt's Issues

ViLTransformerSS usage clarification

Hello @dandelin,

I am trying to understand the interface expected by ViLTransformerSS.

As far as I can see, the infer signature is as follows:

    def infer(
        self,
        batch,
        mask_text=False,
        mask_image=False,
        image_token_type_idx=1,
        image_embeds=None,
        image_masks=None,
    ):

If my understanding is correct, if I want to do retrieval and let ViLT compute the embeddings before they enter the co-attention layers, I should leave the default values untouched.

But as for the batch parameter, what is its exact structure? Reading the code, it seems that batch is a dictionary of lists with the following keys, which I would like to clarify:

        text_ids = batch[f"text_ids"] (I guess these are the token ids after tokenization; what is the dtype of this?)
        text_labels = batch[f"text_labels"] (What if I do not have labels? Is it okay for this to be None?)
        text_masks = batch[f"text_masks"] (Is it okay if this is None?)
        img = batch["image"][0] (I guess this is the image, but in what format and with what preprocessing?)

Another thing I observed is that this inference method seems to process a single image at a time, so I guess it works with one text and one image at a time. Is there an inference mode where it can be run with a batch size larger than 1?

The output of the function is also not entirely clear to me:

       ret = {
            "text_feats": text_feats,
            "image_feats": image_feats,
            "cls_feats": cls_feats,
            "raw_cls_feats": x[:, 0],
            "image_labels": image_labels,
            "image_masks": image_masks,
            "text_labels": text_labels,
            "text_ids": text_ids,
            "text_masks": text_masks,
            "patch_index": patch_index,
        }

Which of these keys can be used as the similarity metric? I guess cls_feats, or what else should I look for?

Thank you very much in advance
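
For context, here is a minimal sketch of how such a batch might be assembled for a single text-image pair. The tokenizer choice, the pixelbert transform, the file name, and the already-loaded `model` are assumptions; the authoritative preprocessing lives in the repository's datamodules.

import torch
from PIL import Image
from transformers import BertTokenizer

from vilt.transforms.pixelbert import pixelbert_transform

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("a dog playing in the park", return_tensors="pt")
image = pixelbert_transform(size=384)(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

batch = {
    "text_ids": encoded["input_ids"],         # LongTensor of token ids, shape [1, seq_len]
    "text_masks": encoded["attention_mask"],  # attention mask, shape [1, seq_len]
    "text_labels": None,                      # only needed for the MLM objective
    "image": [image],                         # list containing one [1, 3, H, W] tensor
}

with torch.no_grad():
    out = model.infer(batch)  # `model` is a loaded ViLTransformerSS (construction not shown)
    # for retrieval, the pooled multimodal feature is out["cls_feats"]; an irtr-finetuned
    # checkpoint maps it to a matching score through model.rank_output
    score = model.rank_output(out["cls_feats"])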

python3 run.py with data_root=../../data/COCO num_gpus=2 num_nodes=1 task_mlm_itm whole_word_masking=True step200k per_gpu_batchsize=32

I read your impressive paper and am now trying to reproduce your algorithm.

I have a similar issue to issue #4 (https://github.com//issues/4) in the pre-training step.
I read your comments on that issue (and the fixed version of the code), but the problem still occurs for me.
(I checked that no other processes occupy the GPUs, and there is no problem when only one GPU is used.)

I think the main reason is a difference in the running environment.
I'm not familiar with the torch.distributed package, so it is hard for me to fix this issue.
Could you give me some suggestions?

My running environment is as follows:
GPU: 2 x Quadro RTX 6000
Cudnn: 450.57
CUDA: 10.2
Python: 3.7.4
Other packages are the same as in your requirements.txt.

Thanks,

Error log:
#####################################################
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 74, in main
trainer.fit(model, datamodule=dm)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
self.train_loop.run_training_epoch()
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.trainer.hiddens)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 818, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 339, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
return self._step(args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 170, in _step
output = self.trainer.model(*args)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
output = self.module.training_step(*inputs[0], **kwargs[0])
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/vilt_module.py", line 219, in training_step
vilt_utils.set_task(self)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/vilt_utils.py", line 179, in set_task
picked = all_gather(current_tasks)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/dist_utils.py", line 169, in all_gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/home/byun/VLP/VLP_MS/ViLT/vilt/modules/dist_utils.py", line 133, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/home/byun/VLP/VLP_MS/ViLT/venv/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1870, in all_gather
work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

FLOPS calculation

Hi,
When you compute the FLOPs in Table 6 for the baseline models such as ViLBERT, do you also include the FLOPs of the feature extraction models?

Pretraining unstable

I carefully processed the pretraining datasets, but the loss is unstable during pretraining: it is not monotonically decreasing.
Is this expected? I worry that I made some mistakes in the complicated data preprocessing.
For example:

Epoch 0: 34%|███▍ | 12412/36673 [4:26:06<8:40:09, 1.29s/it, loss=2.58, v_num=0]

Epoch 0: 35%|███▍ | 12665/36673 [4:36:41<8:44:29, 1.31s/it, loss=3.12, v_num=0]

Epoch 0: 39%|███▉ | 14368/36673 [5:43:41<8:53:33, 1.44s/it, loss=2.57, v_num=0]

Epoch 0: 40%|███▉ | 14583/36673 [5:56:04<8:59:23, 1.47s/it, loss=3.36, v_num=0]

Question about GCC dataset download

root
├── images_train
│ ├── 0000 # First four letters of the image name
│ │ ├── 0000000 # Image Binary
│ │ ├── 0000001
│ │ └── ...
│ ├── 0001
│ │ ├── 0001000
│ │ ├── 0001001
│ │ └── ...

Hello, please forgive my stupid question. I don't understand what you mean by "0000 # First four letters of the image name" and "0000000 # Image Binary" in your DATA.md. Can you explain what "Image Binary" and "First four letters of the image name" refer to? Thanks
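
As far as the layout can be read, each file under images_train/ appears to be the raw bytes of one downloaded image, stored without an extension and grouped into a subfolder named by the first four digits of its index. A small sketch of reading such a file (paths are illustrative):

import io
from PIL import Image

# "0000031" is a made-up index; its first four digits ("0000") name the subfolder
path = "root/images_train/0000/0000031"
with open(path, "rb") as f:
    img = Image.open(io.BytesIO(f.read())).convert("RGB")
print(img.size)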

Unable to reproduce the 100k results

Dear Authors,
Thanks for open-sourcing the code. I pretrained for 100k steps and fine-tuned on VQAv2, but my test-dev score is about 65, unlike the 70.8 in the paper.

Here are my pretraining and fine-tuning commands:

python run.py with data_root=vilt_dataset/ \
	num_gpus=8 num_nodes=8 task_mlm_itm whole_word_masking=True step100k \
	per_gpu_batchsize=64 exp_name=pretrain 
python run.py with data_root=vilt_dataset/ \
	num_gpus=8 num_nodes=1 task_finetune_vqa_randaug \
	per_gpu_batchsize=32 load_path="result/pretrain_seed0_from_/version_0/checkpoints/last.ckpt" \
	exp_name=vqa_finetune

I generate the JSON with:

python run.py with data_root=vilt_dataset/ \
	num_gpus=4 num_nodes=1 task_finetune_vqa \
	per_gpu_batchsize=256 load_path="result/vqa_finetune_seed0_from_last/version_0/checkpoints/last.ckpt" \
	test_only=True  exp_name="test_vqa"

Here are my pretraining and fine-tuning TensorBoard logs:
Screen Shot 2021-06-10 at 6 34 22 PM
Screen Shot 2021-06-10 at 6 34 28 PM
Screen Shot 2021-06-10 at 6 35 14 PM

Time cost when fine-tuning on COCO-irtr dataset

Hi, thanks for your fantastic work!
I've been trying to reproduce the fine-tuning results from the pre-trained model. It takes only about 6 hours to fine-tune on the VQAv2 dataset, but when I try to fine-tune on the COCO dataset, it takes about 20 hours to run a single epoch. I wonder if this is the expected behavior.
Also, I find that after finishing one epoch of fine-tuning on COCO, the model is not automatically saved. Is there a problem with my settings?
My experiment was run on 8 V100 GPUs.

How to run the program on multiple machines, such as 4 machines with 8 GPUs each (4*8=32 in total)?

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU>

ex)
python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_mlm_itm whole_word_masking=True step200k per_gpu_batchsize=64

How do I set up these commands, and what other steps need to be done?

Reproduce Flickr30k Evaluation results - DataSet problem

Hello again @dandelin ,

I was trying to reproduce the steps from https://github.com/dandelin/ViLT/blob/master/EVAL.md to get the results from Flickr30k T2IR.

First I did what is suggested in https://github.com/dandelin/ViLT/blob/master/DATA.md.

So I have the following structure in the folder /content/flickr30k:

    /content/flickr30k
     ├── flickr30k_images            
     │   ├── ....jpg
     |   ├── ....jpg
     ├── karpathy          
         ├── dataset_flickr30k.json              

Then I do the transformation:

from vilt.utils.write_f30k_karpathy import make_arrow
make_arrow( '/content/flickr30k',  '/content/arrow')

But when I run:

python run.py with data_root='/content/arrow' num_gpus=1 num_nodes=1 per_gpu_batchsize=4 task_finetune_irtr_f30k_randaug test_only=True load_path="/content/TFM_Sparse_Embeddings/vilt_irtr_f30k.ckpt"

I get the error:

ERROR - ViLT - Failed after 0:00:06!
Traceback (most recent calls WITHOUT Sacred internals):
  File "run.py", line 73, in main
    trainer.test(model, datamodule=dm)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 755, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 820, in __test_given_model
    results = self.fit(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
    results = self.train_or_test()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 67, in train_or_test
    results = self.trainer.run_test()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 662, in run_test
    eval_loop_results, _ = self.run_evaluation(test_mode=True)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 566, in run_evaluation
    dataloaders, max_batches = self.evaluation_loop.get_evaluation_dataloaders(max_batches)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 56, in get_evaluation_dataloaders
    self.trainer.reset_test_dataloader(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/data_loading.py", line 299, in reset_test_dataloader
    self._reset_eval_dataloader(model, 'test')
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/data_loading.py", line 249, in _reset_eval_dataloader
    num_batches = len(dataloader) if has_len(dataloader) else float('inf')
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py", line 33, in has_len
    raise ValueError('`Dataloader` returned 0 length.'
ValueError: `Dataloader` returned 0 length. Please make sure that your Dataloader at least returns 1 batch

Data prepare

When I try to train the model, there are some problems with the DataLoader. I get many errors such as
'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded, and I don't know why. Do you have any suggestions? Or can you share the scripts for downloading the GCC and SBU datasets? Thank you very much! :)
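
One common cause of this error is truncated or failed downloads. A hedged sketch of pre-filtering files that PIL cannot decode before building the arrow files (paths are illustrative):

import io
import os
from PIL import Image

def is_valid_image(path):
    # returns False for truncated or non-image files that PIL cannot decode
    try:
        with open(path, "rb") as f:
            Image.open(io.BytesIO(f.read())).convert("RGB")
        return True
    except Exception:
        return False

root = "root/images_train"
bad = [
    os.path.join(d, name)
    for d, _, names in os.walk(root)
    for name in names
    if not is_valid_image(os.path.join(d, name))
]
print(f"{len(bad)} undecodable files")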

python run.py with data_root=/data/workspace/dataset num_gpus=4 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt"

Saving latest checkpoint...
INFO - lightning - Saving latest checkpoint...
ERROR - ViLT - Failed after 1:05:38!
Traceback (most recent calls WITHOUT Sacred internals):
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
self.train_loop.run_training_epoch()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
self.trainer.hiddens)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 818, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 339, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
return self._step(args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 170, in _step
output = self.trainer.model(*args)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
output = self.module.training_step(*inputs[0], **kwargs[0])
File "/data/workspace/ViLT/vilt/modules/vilt_module.py", line 219, in training_step
vilt_utils.set_task(self)
File "/data/workspace/ViLT/vilt/modules/vilt_utils.py", line 177, in set_task
picked = all_gather(current_tasks)
File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 165, in all_gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 129, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1870, in all_gather
work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete

During handling of the above exception, another exception occurred:

Traceback (most recent calls WITHOUT Sacred internals):
File "/data/workspace/ViLT/run.py", line 72, in main
trainer.fit(model, datamodule=dm)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 555, in train
self.train_loop.on_train_end()
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 200, in on_train_end
self.check_checkpoint_callback(should_save=True, is_last=True)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 234, in check_checkpoint_callback
callback.on_validation_end(self.trainer, model)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 203, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 238, in save_checkpoint
self._validate_monitor_key(trainer)
File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 516, in _validate_monitor_key
raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/the_metric') not found in the returned metrics: ['irtr/train/irtr_loss', 'itm/train/loss', 'itm/train/wpa_loss', 'itm/train/accuracy']. HINT: Did you call self.log('val/the_metric', tensor) in the LightningModule?

Epoch 0: 0%| | 24/9691 [30:14<202:59:08, 75.59s/it, loss=0.579, v_num=0]

Masked Patch Prediction

Can you please explain what the Masked Patch Prediction (MPP) objective means? I tried reading the paper but did not find any useful information.

Error in PyTorch-Lightning when Finetuning on VQA

Hello,

I am trying to finetune ViLT on the VQAv2 task - I created the arrow_root directory as instructed, and then ran:
python run.py with data_root=<PROJECT_DIR>/arrow_root/vqav2/ num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

However, once the model begins training, I get the following error:

Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 711, in run_training_batch
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 817, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 304, in training_step
closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

I printed the value of training_step_output right before the error: {'extra': {}, 'minimize': None}. I am not too familiar with PyTorch Lightning, but this doesn't seem to be the correct output.

Am I missing any steps here, apart from creating the arrow data and running the model?

Time Calculation

Could you please provide some details on how you calculated the time, for example in this figure?
image

integrate with Lightning ecosystem CI

Hello, and so happy to see you use PyTorch Lightning! 🎉
Just wondering if you have already heard about the new PyTorch Lightning (PL) ecosystem CI, which we would like to invite you to join. You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI.
As you use the PL framework for your cool project, we would like to enhance your experience and offer you safe updates with our future releases. At the moment you run tests against a particular PL version, but it may accidentally happen that the next version is incompatible with your project... 😕 We do not intend to change anything on our side, but we do have a solution: the ecosystem CI tests both your latest development head and ours, so we can catch incompatibilities early and avoid releasing a bad version. 👍

What needs to be done?

What will you get?

  • scheduled nightly testing configured for development/stable versions
  • slack notification if something went wrong to investigate
  • testing also on multi-GPU machine as our gift to you 🐰

cc: @Borda

Add ViLT to HuggingFace Transformers

Hi,

I've been reading the ViLT paper and was impressed by the simplicity, as it only adds text embeddings to a ViT.

As ViT is already available in HuggingFace Transformers, adding ViLT should be relatively easy.

I've currently implemented the model (see here for my current implementation). It includes a conversion script (convert_vilt_original_to_pytorch.py) to convert the weights from this repository (the PyTorch Lightning module) to its HuggingFace counterpart, for all models (base one + the ones with a head on top).

However, I'm facing some issues when performing a forward pass with the original implementation in Google Colab (when just doing pip install -r requirements.txt and running the demo_vqa.py script, you get the following):

Traceback (most recent call last):
  File "demo_vqa.py", line 17, in <module>
    from vilt.modules import ViLTransformerSS
  File "/content/ViLT/vilt/modules/__init__.py", line 1, in <module>
    from .vilt_module import ViLTransformerSS
  File "/content/ViLT/vilt/modules/vilt_module.py", line 3, in <module>
    import pytorch_lightning as pl
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 62, in <module>
    from pytorch_lightning import metrics
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.metric import Metric
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/metric.py", line 23, in <module>
    from pytorch_lightning.metrics.utils import _flatten, dim_zero_cat, dim_zero_mean, dim_zero_sum
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/utils.py", line 18, in <module>
    from pytorch_lightning.utilities import rank_zero_warn
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 24, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module>
    from torchtext.data import Batch
ImportError: cannot import name 'Batch' from 'torchtext.data' (/usr/local/lib/python3.7/dist-packages/torchtext/data/__init__.py)

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

Upgrading PyTorch Lightning to the latest version also returns an error:

Traceback (most recent call last):
  File "demo_vqa.py", line 17, in <module>
    from vilt.modules import ViLTransformerSS
  File "/content/ViLT/vilt/modules/__init__.py", line 1, in <module>
    from .vilt_module import ViLTransformerSS
  File "/content/ViLT/vilt/modules/vilt_module.py", line 7, in <module>
    from vilt.modules import heads, objectives, vilt_utils
  File "/content/ViLT/vilt/modules/vilt_utils.py", line 11, in <module>
    from vilt.gadgets.my_metrics import Accuracy, VQAScore, Scalar
  File "/content/ViLT/vilt/gadgets/my_metrics.py", line 2, in <module>
    from pytorch_lightning.metrics import Metric
ModuleNotFoundError: No module named 'pytorch_lightning.metrics'

This is because PL has deprecated the metrics module.

Are you able to provide a simple Colab notebook to perform inference on an image+text pair?

Thanks!

Finetune on F30K IR/TR, why use itm?

When fine-tuning on F30K IR/TR, I found the loss set to loss_names = _loss_names({'itm': 0.5, 'irtr': 1}).
Wouldn't it be better to use irtr alone here?

Visual Grounding

Is this model useful for visual grounding purposes? If so, how should I change it?

License

Hi, thank you for the great work! Could you upload a license for this repo?

Nearly 3 points below the provided VQAv2 fine-tuning result

Hi,

Thank you for releasing the code!

I recently tried to reproduce the result of fine-tuning on VQAv2.
The result generated by the provided vilt_vqa.ckpt (ViLT-B/32 200k finetuned on VQAv2) is [{"test-dev": {"yes/no": 87.44, "number": 50.2, "other": 62.38, "overall": 71.32}}].

Then, I used "num_gpus=4 num_nodes=1 task_finetune_vqa_randaug per_gpu_batchsize=16 load_path=".../vilt_200k_mlm_itm.ckpt"" to fine-tune on VQAv2 with my own GPUs and uploaded the predictions to VQA Challenge 2021 for evaluation. The result was [{"test-dev": {"yes/no": 85.28, "number": 46.63, "other": 58.87, "overall": 68.36}}].

Should I expect the same result if I use the same settings?

COCO split for pre-training

Hi @dandelin , thanks for this great repo and work! Could you please say what COCO split was used for pre-training? (was it 2014, 2017, Karpathy, or something else?) Thanks!

python run.py with data_root="/arrows_flickr30k" num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="vilt_200k_mlm_itm.ckpt"

I am encountering this error:

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
ERROR - ViLT - Failed after 0:00:12!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 17, in main
model = ViLTransformerSS(_config)
File "/others/cs16b114/ViLT/vilt/modules/vilt_module.py", line 61, in init
ckpt = torch.load(self.hparams.config["load_path"], map_location="cpu")
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 224, in init
super(_open_zipfile_reader, self).init(torch.C.PyTorchFileReader(name_or_buffer))
RuntimeError: version
<= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb11cc1a193 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fafcd29cafb in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fafcd29dd14 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c6296 (0x7fb0ad870296 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2957d4 (0x7fb0ad43f7d4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)

frame #6: python() [0x4d8067]
frame #8: python() [0x58f850]
frame #10: python() [0x54aa51]
frame #14: python() [0x58f6a3]
frame #19: python() [0x54a880]
frame #23: python() [0x58f6a3]
frame #30: python() [0x5db3e4]
frame #32: + 0x431b (0x7fb0e0b9531b in /usr/local/lib/python3.7/dist-packages/wrapt/_wrappers.cpython-37m-x86_64-linux-gnu.so)
frame #37: python() [0x59412b]
frame #42: python() [0x54a880]
frame #51: python() [0x6308e2]
frame #54: python() [0x65450e]
frame #56: __libc_start_main + 0xe7 (0x7fb12285abf7 in /lib/x86_64-linux-gnu/libc.so.6)

Do I have to download the pre-trained weights of ViLT or the pre-trained weights of ViT?

Hello, I didn't use the pre-trained weights you provided and got the following error:

INFO - timm.models.helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth)
Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth"

urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

Can you share the 'vqa_dict.json' file for the VQA demo?

Hi, @dandelin
Thank you for your interesting work. I am running demo_vqa.py and it fails at line 38 (https://dl.dropboxusercontent.com/s/otya4i5sagt4f5p/vqa_dict.json) because this link is unavailable now.
Do you have a copy of this file?
Also, the vilt_200k_mlm_itm.ckpt link is already unavailable...
Looking forward to your reply!

Regarding pretraining time

First of all, thanks for the great work.
Can you tell us how long the pretraining took on your setup with 64 V100s?
Thank you in advance.

AttributeError: module 'vilt' has no attribute 'modules'

I run into an error

File "run.py",line 6, in
from vilt.modules import ViLTransformerSS
File "ViLT/vilt/moudles/vilt/moudules/init.py",line 1, in
form .vilt_module import ViLTransformerSS
File "ViLT/vilt/moudles/vilt_moudule.py",line 4, in
import vilt.module.vision_transformer as vit
AttributeError: module 'vilt' has no attribute 'modules'

when I run the "Evaluate VQAv2" command

About finetune on f30k.

Hi, I am very interested in your work! I am wondering why 15 texts are used as negative samples instead of 1 text during fine-tuning. Also, what do you think about training the model from scratch using only the Flickr30k dataset?
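
For what it is worth, one way to read the 15 negatives: with one positive and N sampled negative captions per image, the irtr objective becomes an (N+1)-way classification, which gives a stronger ranking signal per image than a single binary pair. A minimal sketch of that kind of loss (illustrative; see vilt/modules/objectives.py for the actual implementation):

import torch
import torch.nn.functional as F

def listwise_irtr_loss(scores):
    # scores: [batch, 1 + num_negatives]; column 0 is the matching (positive) caption
    target = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, target)

scores = torch.randn(8, 16)  # 8 images, each scored against 1 positive + 15 negative captions
loss = listwise_irtr_loss(scores)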

Finetuning VQA from checkpoint has unmatched keys

Command

$PYTHONBIN run.py with data_root=dataset  \
        num_gpus=1 num_nodes=1 task_finetune_vqa \
        per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

And the error occurs in trainer.fit():

  File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 184, in setu[52/1882]
g
    self.trainer.checkpoint_connector.restore_weights(model)
  File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 63,
 in restore_weights
    self.hpc_load(checkpoint_path, self.trainer.on_gpu)
  File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 336
, in hpc_load
    self.restore_model_state(model, checkpoint)
  File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 119
, in restore_model_state
    model.load_state_dict(checkpoint['state_dict'])
  File "/data/home/lyuchen/miniconda/envs/vilt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ViLTransformerSS:
        Missing key(s) in state_dict: "vqa_classifier.0.weight", "vqa_classifier.0.bias", "vqa_classifier.1.weight", "vqa_classifier.1.bias", "vqa
_classifier.3.weight", "vqa_classifier.3.bias".
        Unexpected key(s) in state_dict: "mlm_score.bias", "mlm_score.transform.dense.weight", "mlm_score.transform.dense.bias", "mlm_score.transf
orm.LayerNorm.weight", "mlm_score.transform.LayerNorm.bias", "mlm_score.decoder.weight", "itm_score.fc.weight", "itm_score.fc.bias".
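
For what it is worth, the key mismatch itself is expected when moving from the MLM+ITM pretraining heads to a freshly initialized VQA head. A hedged sketch of inspecting the checkpoint and loading it non-strictly (the path follows the command above; the commented-out lines are illustrative):

import torch

ckpt = torch.load("weights/vilt_200k_mlm_itm.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]
# pretraining-only heads that the VQA model does not have
print([k for k in state_dict if k.startswith(("mlm_score", "itm_score"))])

# model = ViLTransformerSS(conf)  # construction omitted; conf must enable the vqa head
# missing, unexpected = model.load_state_dict(state_dict, strict=False)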

ITM Objectives task will not be enabled?

hello, @dandelin

I have tracked the code at vilt_module.py

from the training_step function -> set_task function

pl_module.current_tasks = [
        k for k, v in pl_module.hparams.config["loss_names"].items() if v >= 1
    ]

The ITM task will be enabled only when v >= 1.

However, no matter which pre-training task is used, the ITM entry in loss_names is set to 0.5 in config.py.

Is there a problem with my understanding?

thank you!
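
A tiny illustration of the filter quoted above: only losses with weight >= 1 become active tasks, while a weight of 0.5 (as used in the irtr fine-tuning configs) initializes the corresponding head without adding the task. Values here are illustrative.

loss_names = {"itm": 0.5, "mlm": 0, "mpp": 0, "vqa": 0, "irtr": 1}
current_tasks = [k for k, v in loss_names.items() if v >= 1]
print(current_tasks)  # ['irtr']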

Got better results than in the paper:

Hey @dandelin ,

I just want to share the results I reproduced with my own recall implementation. Here is my ViltModel

from typing import List, Dict

import torch
from transformers import BertTokenizer

from vilt.modules import ViLTransformerSS


class ViltModel(ViLTransformerSS):
    def __init__(
            self,
            config,
            *args,
            **kwargs,
    ):
        super().__init__(config)
        self._config = config
        if torch.cuda.is_available():
            dev = "cuda:0"
        else:
            dev = "cpu"
        self._device = torch.device(dev)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.eval()

    @property
    def in_cuda(self):
        return next(self.parameters()).is_cuda

    def rank_query_vs_images(self, query: str, images: List):
        rank_scores = []
        encoded_input = self.tokenizer(query, return_tensors='pt')
        input_ids = encoded_input['input_ids'][:, :self._config['max_text_len']]
        mask = encoded_input['attention_mask'][:, :self._config['max_text_len']]
        in_cuda = self.in_cuda
        if in_cuda:
            input_ids = input_ids.to(self._device)
            mask = mask.to(self._device)
        batch = {'text_ids': input_ids, 'text_masks': mask, 'text_labels': None}
        # no masking
        for image in images:
            if in_cuda:
                image = image.to(self._device)
            batch['image'] = [image.unsqueeze(0)]
            score = self.rank_output(self.infer(batch)['cls_feats'])[:, 0]
            rank_scores.append(score.detach().cpu().item())
        return rank_scores

The compute recall method:

def compute_recall():
    import copy
    from vilt import config
    from vilt.transforms.pixelbert import pixelbert_transform
    from src.dataset.dataset import get_image_data_loader, get_captions_data_loader
    from src.evaluate import evaluate

    # Sacred config is an immutable object, so you need to deepcopy it.
    conf = copy.deepcopy(config.config())
    conf['load_path'] = VILT_BASE_MODEL_LOAD_PATH
    conf['test_only'] = True
    conf['max_text_len'] = 40
    conf['max_text_len'] = 40
    conf['data_root'] = '/hdd/master/tfm/arrow'
    conf['datasets'] = ['f30k']
    conf['batch_size'] = 1
    conf['per_gpu_batchsize'] = 1
    conf['draw_false_image'] = 0
    conf['num_workers'] = 1

    # You need to properly configure loss_names to initialize heads (0.5 means it initializes head, but ignores the
    # loss during training)
    loss_names = {
        'itm': 0.5,
        'mlm': 0,
        'mpp': 0,
        'vqa': 0,
        'imgcls': 0,
        'nlvr2': 0,
        'irtr': 1,
        'arc': 0,
    }
    conf['loss_names'] = loss_names

    if torch.cuda.is_available():
        dev = 'cuda:0'
    else:
        dev = 'cpu'
    device = torch.device(dev)

    print(f' conf for ViltModel {conf}')

    vilt_model = ViltModel(conf)
    vilt_model.to(device)

    image_dataset = get_image_data_loader(root=DATASET_ROOT_PATH,
                                          split_root=DATASET_SPLIT_ROOT_PATH,
                                          split='test',
                                          transform=pixelbert_transform(384),
                                          batch_size=1) # loading the images with the pixelBert transformation

    text_dataset = get_captions_data_loader(root=DATASET_ROOT_PATH,
                                            split_root=DATASET_SPLIT_ROOT_PATH,
                                            split='test',
                                            batch_size=1) # loading the captions with the pixelBert transformation

    images = []
    filenames = []
    for filenames_batch, images_batch in image_dataset:
        filenames.extend(filenames_batch)
        images.extend(images_batch)

    retrieved_image_filenames = []
    groundtruth_expected_image_filenames = []
    print(f' number of queries {len(text_dataset)}, against {len(images)}') # this leads to 5000 captions against 1000 images
    for matching_filename, query in text_dataset:
        filename = matching_filename[0]
        groundtruth_expected_image_filenames.append([filename])
        q = query[0]
        start = time.time()
        scores = vilt_model.rank_query_vs_images(q, images)
        print(f' time to rank a single query {time.time() - start}s')
        retrieved_image_filenames.append([f for _, f in sorted(zip(scores, filenames), reverse=True)])

    evaluate(['recall', 'reciprocal_rank'], retrieved_image_filenames,
             groundtruth_expected_image_filenames,
             [1, 5, 10, 20, 100, 200, 500, None],
             {}, print_results=True)

The obtained results are:

 Mean Recall@1 0.7584
 Mean Recall@5 0.9554
 Mean Recall@10 0.9826
 Mean Recall@20 0.9932
 Mean Recall@100 0.9998
 Mean Recall@200 1.0
 Mean Recall@500 1.0
 Mean Recall@None 1.0
 Mean Reciprocal rank 0.8449181004476226

I know that the results could differ from those in the paper, but this seems like an extremely good result? Is there something I am obviously doing wrong?

Thanks in advance

Visualizations for VQA dataset

Hi,
I want to produce visualizations for VQA like you show in demo.py.

Can you please help with what needs to be changed in demo.py to make it work for VQA?

For instance, I see loss_names has a value of 0.5 for 'mlm'. What value should be kept for VQA?
Looking forward to your help.

Best,
a

I reproduced the code in a PyTorch version, but get different results

For image retrieval, my R@1 is 68.4, which is higher than the 61.9 in the paper.
For text retrieval, my R@1 is 73.5, which is lower than the 81.4 in the paper.

So I wonder whether the input format is wrong in my code.

For images, I use the "pixelbert_transform" function with size=384. For text, I use the BERT-base tokenizer with max length 40, which includes [CLS] and the word tokens but no [SEP]. For Flickr30k, I use dataset_flickr30k.json to get the test set, and I chose the first of the five captions for each image.

Thanks very much for your help!
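
For comparison, a small sketch of the tokenization used in the ViltModel snippet earlier on this page, which keeps the tokenizer's default special tokens ([CLS] and [SEP]) and simply truncates to max_text_len=40:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("a man rides a horse on the beach", return_tensors="pt")
text_ids = encoded["input_ids"][:, :40]
text_masks = encoded["attention_mask"][:, :40]
print(tokenizer.convert_ids_to_tokens(text_ids[0]))  # ['[CLS]', 'a', 'man', ..., '[SEP]']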

small difference between paper and code about token type embedding

Thanks for your paper and code; they help me a lot.
There is a small point that confuses me. In Section 3.1 of your paper, the text embedding consists of a word embedding, a position embedding, and a modal-type embedding.
vilt-3 1

while in the source code of vilt/modules/vilt_module.py, the text_embedding is implemented by:

from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings
...
  self.text_embeddings = BertEmbeddings(bert_config)

and an extra token_type embedding
self.token_type_embeddings = nn.Embedding(2, config["hidden_size"])
As far as I know, BertEmbeddings() already contains a token-type embedding operation inside, so there are actually two token-type embeddings applied to the text input and one token-type embedding applied to the image input.
I understand that self.token_type_embeddings is used as the modal-type embedding to distinguish between image and text.
Is this a mistake? Is it okay not to remove the token-type embedding inside BertEmbeddings(bert_config)? Will it cause any difference?
Hoping for your reply, thanks!
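
A self-contained toy of the two embeddings in question (an illustration, not the repository's exact code): with all-zero token_type_ids, the token-type term inside BertEmbeddings is a single constant learned vector added to every text token, while the separate nn.Embedding(2, hidden_size) is the modal-type embedding that actually distinguishes text (index 0) from image patches (index 1).

import torch
import torch.nn as nn
from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings

hidden = 768
text_embeddings = BertEmbeddings(BertConfig(hidden_size=hidden))  # word + position + constant token-type
token_type_embeddings = nn.Embedding(2, hidden)                   # modal-type embedding

text_ids = torch.randint(0, 30522, (1, 12))
text_masks = torch.ones(1, 12, dtype=torch.long)
image_embeds = torch.randn(1, 240, hidden)   # stand-in for the patch embeddings
image_masks = torch.ones(1, 240, dtype=torch.long)

text_embeds = text_embeddings(text_ids) + token_type_embeddings(torch.zeros_like(text_masks))
image_embeds = image_embeds + token_type_embeddings(torch.full_like(image_masks, 1))
co_embeds = torch.cat([text_embeds, image_embeds], dim=1)  # joint sequence fed to the transformer
print(co_embeds.shape)  # torch.Size([1, 252, 768])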

Download and process GCC and SBU

I'm very sorry for my stupid question.

The datasets from the websites come as '.tsv' (or similar) files, but before building the arrow files, some '.json' files are required.

If it is convenient for you, could you share your code for downloading the images and converting the TSV files into JSON?
I am very sorry to disturb you.

Flickr30k fine-tuning results do not match the provided checkpoint

Hi authors,

I took the provided pretrained 200k checkpoint and fine-tuned on Flickr30k. The IR and TR scores afterwards are 64.5 and 81.7; the TR score is lower than the one in the paper. My fine-tuning command is:

$PYTHONBIN run.py with data_root=vilt_dataset/ \
        num_gpus=8 num_nodes=1 task_finetune_irtr_f30k \
        per_gpu_batchsize=4 load_path="weights/vilt_200k.ckpt" \
        exp_name="f30k/finetune_official" 

Screen Shot 2021-06-15 at 12 53 24 PM

I also tested the given vilt_irtr_f30k.ckpt and the results are good, with IR=65.3 and TR=83.5. Can I ask what process was used to produce vilt_irtr_f30k.ckpt?

`self.mask_token` at line 553 in `vision_transformer.py`

Hi, I'm really impressed with your work and it has been a great help!

However, there is an error in vilt/modules/vision_transformer.py when mask_it == True.
The error occurs because self.mask_token at line 553 of vision_transformer.py is not initialized.
So I wonder, what does self.mask_token mean?

Thanks.
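
(A guess rather than a statement about the repository's code: mask_token is presumably a learnable vector that replaces the embeddings of masked patches for the masked patch prediction objective, e.g. initialized along these lines.)

import torch
import torch.nn as nn

embed_dim = 768
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # assumed shape: one vector broadcast over masked positions
nn.init.trunc_normal_(mask_token, std=0.02)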

Debug model

The training process uses DDP, which cannot be debugged with pdb. Could you please offer a version of the code without DDP, or without PyTorch Lightning?

Pre-training Time

Thanks for your great code!
In your paper, running the pre-training experiments requires 64 V100 GPUs.
How long did the training take with 64 V100 GPUs?
Thank you!

itm and mlm training accuracy

Hi,

Do you remember your final pretraining ITM and MLM training accuracy? I just want to estimate my pretraining performance in comparison with yours.

We would really appreciate it if you could share those numbers. Thanks.

Question about image transformation: short edge is still 384 for the fine-tuning task?

Thanks for your great code!
I have read your paper carefully.

(in your paper) We resize the shorter edge of input images to 384 and limit the longer edge to under 640 while preserving the aspect ratio. This resizing scheme is also used during object detection in other VLP models, but with a larger size of the shorter edge (800). Patch projection of ViLT-B/32 yields 12 × 20 = 240 patches for an image with a resolution of 384*640.

However, I find that "image_size=384" is used for all downstream tasks in this code?

Would this have an effect on the performance of the downstream tasks? At least, a shorter edge of 800 would greatly increase the sequence length, so a smaller batch size would be needed when using a shorter edge of 800.
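
For reference, the arithmetic from the quoted paragraph (the actual resizing lives in the pixelbert transform; this sketch only reproduces the numbers):

def resized_shape(h, w, shorter=384, longer_cap=640):
    # resize the shorter edge to `shorter`, but cap the longer edge at `longer_cap`
    scale = shorter / min(h, w)
    if max(h, w) * scale > longer_cap:
        scale = longer_cap / max(h, w)
    return round(h * scale), round(w * scale)

h, w = resized_shape(480, 800)      # a wide input image
print(h, w, (h // 32) * (w // 32))  # 384 640 240  -> 12 x 20 = 240 patches for 32x32 patch projection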

A very large batchsize requires 64 GPUs

Thanks for your great code!
In your paper, running the pre-training experiments requires 64 V100 GPUs.
For research purposes, that is too heavy.

If a smaller batch size is used, would the performance drop? By how much? Can you provide any empirical results?

python run.py with data_root=content/datasets num_gpus=2 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=64

I encounter this when I pre-train with COCO:
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
INFO - timm.models.helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth)
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:06!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 67, in main
val_check_interval=_config["val_check_interval"],
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in init
deterministic,
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
self.trainer.node_rank = self.determine_ddp_node_rank()
File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''
