
PEVL

This is the official PyTorch implementation of the EMNLP 2022 paper "PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models".

Overview

PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training and also enables flexible prompt tuning for various downstream tasks. PEVL shows impressive results for detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs such as visual commonsense reasoning, visual relation detection, and visual question answering (GQA). For more details, please see the PEVL paper.
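To make the reformulation concrete, the sketch below shows one way a bounding box can be discretized into [pos_*] tokens and appended to the phrase it grounds. This is a minimal illustration: the number of position bins and the rounding convention are assumptions, and the released data and code define the actual scheme.

# Illustration only: the bin count and rounding convention here are assumptions,
# not necessarily those used in the released PEVL data or code.
def box_to_pos_tokens(bbox, width, height, num_bins=512):
    """Map an (x, y, w, h) box to discretized [pos_*] tokens."""
    x, y, w, h = bbox
    corners = [x / width, y / height, (x + w) / width, (y + h) / height]
    bins = [min(int(c * num_bins), num_bins - 1) for c in corners]
    return " ".join(f"[pos_{b}]" for b in bins)

caption = "Two young guys with shaggy hair look at their hands"
bbox = [158.0, 184.0, 40.0, 41.0]   # a Flickr30k box (width=333, height=500), quoted later on this page
print(caption + " @@ " + box_to_pos_tokens(bbox, width=333, height=500))
# -> "... @@ [pos_242] [pos_188] [pos_304] [pos_230]" with this particular binning; the released
#    annotation for the same box reads "[pos_242] [pos_188] [pos_302] [pos_229]".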

Install

Please refer to INSTALL.

Pretraining Instructions

Before pre-training, we initialize PEVL's weights with the parameters of ALBEF[14M].

Our raw pre-training corpus comes from Visual Commonsense Reasoning (VCR) and MDETR, which collects images from the Flickr30k Entities, COCO, and Visual Genome datasets.

Second Stage Pre-training and Fine-tuning

You can download our first-stage pre-trained model from pre-trained pevl. We conduct second-stage pre-training and fine-tuning for all downstream tasks.

Referring Expression Comprehension

  1. Download the second-stage pre-trained checkpoint for position output tasks.
  2. Download the dataset json files for the position output downstream tasks. (The 'file_name' field in each json file needs to be changed to point to your own image directory; a path-rewriting sketch follows the commands below.)
  3. In configs/visual_grounding.yaml, set the paths for the json files.
  4. Fine-tune the model using 4 V100 GPUs:
##RefCOCO:
###train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset refcoco --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcoco --checkpoint grounding.pth --eval_step 500
###evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0  --pretrain 0 --test_dataset refcoco --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcoco_test --checkpoint [Finetuned checkpoint]

##RefCOCOg
###train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1  --pretrain 0 --test_dataset refcocog --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocog --checkpoint grounding.pth --eval_step 500
###evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0  --pretrain 0 --test_dataset refcocog --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocog_test --checkpoint [Finetuned checkpoint]

##RefCOCO+
###train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1  --pretrain 0 --test_dataset refcocop --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocop --checkpoint grounding.pth --eval_step 500
###evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0  --pretrain 0 --test_dataset refcocop --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocop_test --checkpoint [Finetuned checkpoint]
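The downloaded json files store image paths relative to the authors' directory layout, which is why step 2 above asks you to change the 'file_name' field. A minimal sketch of that rewrite, assuming each annotation file is a single JSON list of dicts (adjust the loading if a file is stored as json-lines); the function name and paths are illustrative, not part of the repo:

import json
import os

def localize_file_names(ann_path, image_root, out_path):
    """Prepend a local image root to every 'file_name' entry."""
    with open(ann_path) as f:
        entries = json.load(f)   # assumption: a JSON list of dicts with a 'file_name' key
    for entry in entries:
        entry["file_name"] = os.path.join(image_root, entry["file_name"])
    with open(out_path, "w") as f:
        json.dump(entries, f)

# Illustrative usage:
# localize_file_names("refcoco_train.json", "/data/coco_images", "refcoco_train_local.json")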

Phrase Grounding

  1. Download the second-stage pre-trained checkpoint for position output tasks.
  2. Download the dataset json files for the position output downstream tasks. (An example entry is shown after the commands below.)
  3. In configs/visual_grounding.yaml, set the paths for the json files.
  4. Fine-tune the model using 8 V100 GPUs:
##Flickr30k
###train
python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint grounding.pth --eval_step 500
###evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint  [Finetuned checkpoint]
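For reference, an entry in the Flickr30k grounding json looks roughly like the sketch below (reconstructed from the example quoted in the issues further down); the bounding box is discretized into the [pos_*] tokens that follow the '@@' marker in pseudo_caption:

{
  "file_name": "flickr30k_images/flickr30k_images/1000092795.jpg",
  "text_type": "caption",
  "height": 500,
  "width": 333,
  "pseudo_caption": "Two young guys with shaggy hair look at their hands @@ [pos_242] [pos_188] [pos_302] [pos_229] while hanging out in the yard .",
  "normal_caption": "Two young guys with shaggy hair look at their hands while hanging out in the yard .",
  "bbox": [158.0, 184.0, 40.0, 41.0],
  "bbox_list": [[158.0, 184.0, 40.0, 41.0]]
}

As noted in the issues, the RefCOCO json files carry additional keys (e.g. tokens_positive, not_crop_bbox_list) that the Flickr30k file does not.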

Visual Relation Detection (VRD)

  1. Download the second-stage pre-trained checkpoint for visual relation detection.
  2. Download PEVL's VRD dataset json files from pevl_vrd and the VRD images from Visual Genome.
  3. In configs/vrd.yaml, set the paths for the json files. (An illustrative sketch follows the commands below.)
  4. Fine-tune the model using 8 V100 GPUs:
##for finetuning on visual genome:
python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_vrd_train.py --train 1 --pretrain 0 --mode finetune --config ./configs/vrd.yaml --output_dir ./output/vrd --checkpoint vrd.pth

##for evaluation on visual genome:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_vrd_train.py --train 0 --pretrain 0 --config ./configs/vrd.yaml  --checkpoint [Finetuned checkpoint]
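Step 3 only requires pointing the config at your local copies. The sketch below is purely illustrative: every key name and path is a placeholder, so check the shipped configs/vrd.yaml for the real schema:

# Placeholder keys and paths; the actual key names live in configs/vrd.yaml.
train_file: ['/path/to/pevl_vrd_train.json']
test_file: ['/path/to/pevl_vrd_test.json']
image_root: '/path/to/visual_genome/images'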

Visual Commonsense Reasoning (VCR)

  1. Download the second-stage pre-trained checkpoint for visual commonsense reasoning.
  2. Download the fine-tuned checkpoint for visual commonsense reasoning.
  3. Download PEVL's VCR dataset json files from vcr data and the images from the original VCR website.
  4. In configs/vcr.yaml, set the paths for the json files and the VCR images. (A sketch of the relevant fields is shown below.)
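For step 4, the json-file entries below are quoted from the VCR config as reported in the issues further down; adjust the paths to your own copies. The image-root key is a placeholder, so check the shipped configs/vcr.yaml for its real name:

train_vcr_file: ['./2022_01_06_vcr_pretrain_prompt_data.json']
train_val_vcr_q2a_file: ['./pevl_vcr_itm_finetune_val_QA_data.json']
train_val_vcr_qa2r_file: ['./pevl_vcr_itm_finetune_val_QAR_data.json']
# Placeholder key name and path for the image directory:
vcr_image_root: '/path/to/vcr/vcr1images'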

Visual Question Answering (GQA)

  1. Download PEVL's GQA dataset json files from pevl_gqa and the images from the original GQA website.
  2. In configs/gqa.yaml, set the paths for the json files and the GQA images. (A quick sanity check for the image paths is sketched below.)
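After editing the config, a quick sanity check can confirm that the images referenced by a downloaded json file resolve under your image root. This generic sketch is not part of the repo and assumes the entries carry a 'file_name' field like the grounding files; adjust the key if the GQA json differs:

import json
import os

def check_image_paths(ann_path, image_root, key="file_name"):
    """Report entries whose image cannot be found under image_root."""
    with open(ann_path) as f:
        entries = json.load(f)   # assumption: a JSON list of dicts
    missing = [e[key] for e in entries if not os.path.exists(os.path.join(image_root, e[key]))]
    print(f"{len(missing)} of {len(entries)} images missing")
    return missing

# Illustrative usage:
# check_image_paths("pevl_gqa_train.json", "/data/gqa/images")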

Citations

If you find this project helpful for your research, please consider citing our paper in your publications.

@inproceedings{yao2022pevl,
  title={PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models},
  author={Yao, Yuan and Chen, Qianyu and Zhang, Ao and Ji, Wei and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong},
  booktitle={Proceedings of EMNLP},
  year={2022}
}

Acknowledgement

The implementation of PEVL relies on resources especially from ALBEF, as well as Hugging Face Transformers and timm. We thank the original authors for open-sourcing their excellent work.


pevl's Issues

Reproducing the pre-training

Hi, thanks for sharing the code of your interesting work.

I saw run_pretrain.py in your repo; however, the config file ./configs/Pretrain.yaml and the script for running the pre-training are missing.
Since the pre-training strategy is one of your novel contributions, do you plan to release the complete code for reproducing the pre-training?

Reproducing the Phrase Grounding task

Hi, thanks for sharing the code of your interesting work.

  1. I want to reproduce the phrase grounding task. When I tried running the following command on the Flickr dataset, I encountered the error below.
    The flickr json file does not have keys such as tokens_positive or not_crop_bbox_list. How can I resolve this issue?

python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint grounding.pth --eval_step 500

(error screenshots omitted)

  2. In flickr.json, an entry looks like this:

{"file_name": "flickr30k_images/flickr30k_images/1000092795.jpg", "text_type": "caption", "height": 500, "width": 333, "pseudo_caption": "Two young guys with shaggy hair look at their hands @@ [pos_242] [pos_188] [pos_302] [pos_229] while hanging out in the yard .", "normal_caption": "Two young guys with shaggy hair look at their hands while hanging out in the yard .", "bbox": [158.0, 184.0, 40.0, 41.0], "bbox_list": [[158.0, 184.0, 40.0, 41.0]]}


What is the meaning of '@@ [pos_242] [pos_188] [pos_302] [pos_229]'? If I want to fine-tune on my custom dataset, do I need to create a JSON file that follows the same input format?

  3. In Refcoco.json, what is the meaning of not_crop_bbox_list, tokens_positive, and tokens_negative? If I want to fine-tune on my custom dataset, do I need to create a JSON file that follows the same input format?

Thank you so much!

token position

Thanks to the authors for providing clear code. I wonder how to get the tokens_positive and tokens_negative?

Code for ALBEF's VCR

Hi, can you share the codebase for fine-tuning on the VCR task with the pure ALBEF model? Thanks.

train for vcr

Hi, I cannot find these files in this repo:
train_vcr_file: ['./2022_01_06_vcr_pretrain_prompt_data.json']
train_val_vcr_q2a_file: ['./pevl_vcr_itm_finetune_val_QA_data.json']
train_val_vcr_qa2r_file: ['./pevl_vcr_itm_finetune_val_QAR_data.json']

Also, could you provide the command to train the VCR task?

Finetuning for VCR

Hi, I have noticed that when fine-tuning for the VCR task, both ITA and ITM losses are used. The ITM loss is computed as the cross entropy between the ITM prediction and the label. But for the ITA loss, I think you directly regard the question and a negative answer as a positive pair, since you initialize sim_targets = torch.zeros(sim_i2t_m.size()).to(image.device) and then call sim_targets.fill_diagonal_(1).
Is this reasonable?

code and checkpoints for VQA

Hi,

Thank you for the amazing work and code! Do you plan to provide the code and checkpoints for the VQA task?

Best,
Ziyan

Checkpoint download speed is very slow (1KB/s)

It is currently very slow to download the checkpoints; the speed is only ~1 KB/s (I am based in the US). Could the authors upload the checkpoints to another site? Or could anyone who has already downloaded them upload them somewhere else?

Thanks!

checkpoint for VCR

Hi, could you share the second-stage pre-trained checkpoint file for VCR?

Query regarding downstream task

Hi Authors,

Thanks for making the code available for this awesome work. My question: just as Pix2seq was applied to object detection, can this work be applied to dense prediction tasks such as semantic segmentation?

training on vcr task

Hi, I notice that when fine-tuning the SSP model on the VCR task, the performance drops a lot every 5000 steps during the first epoch.
Before fine-tuning, the results for Q2A and QA2R are both above 74%.
step 5000: 67% and 66%
step 10000: 64.7% and 64.9%
step 15000: 61.3% and 60.9%
step 20000: 60.8% and 57.9%
The remaining results have not been obtained yet; training is still running.

Is this expected? Is the model being trained correctly? Did you notice this phenomenon when you fine-tuned it?
Note that I have no V100s, so I train the model on four 2080 Ti GPUs. Due to the memory limitation, I set batch_size=1, test_batch_size=4, gradient_accumulation_steps=8; the other settings are the same as in your vcr.yaml.

Looking forward to your help, thanks!

Visual Relation Detection Reproducibility

Hello,

Thank you for the wonderful work.

I was trying to reproduce the results shown in the paper for VRD.

Currently I am getting approximately the following scores:
R@20: 0.5417 R@50: 0.6160 R@100: 0.6350
mR@20: 0.1220 mR@50: 0.1606 mR@100: 0.1723

I fine-tuned the pre-trained checkpoint for 10 epochs on 8 V100 GPUs as instructed, with batch sizes of 8 and 32.

I also separately fine-tuned the pre-trained checkpoint for 10 epochs on 10 A100 GPUs with a batch size of 100.

Would you be kind enough to provide some insight into these results?

Thank you.

Runtime error when running run_vcr_train.py

When I run run_vcr_train.py on four 2080 Ti GPUs, I get:

 File "run_vcr_train.py", line 253, in main                                                                                                                                                                                                             
    config, args.training_mode, args, vcr_val_q2a_loader, vcr_val_qa2r_loader)                                                                                                                                                                           
  File "run_vcr_train.py", line 109, in train                                                                                                                                                                                                            
    loss_ita, loss_itm = model(images, text, alpha, itm_labels, mode='finetuning')                                                                                                                                                                       
  File "/anaconda3/envs/pevl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl                                                                                                                           
    result = self.forward(*input, **kwargs)                                                                                                                                                                                                              
  File "/anaconda3/envs/pevl/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 692, in forward                                                                                                                        
    if self.reducer._rebuild_buckets():                                        
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.  This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing t
he keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

In my opinion, the parameters of the momentum model are not used in producing the loss. How should I deal with this?

Also, is it true that the only difference between the fine-tuning code and the pre-training code for the VCR task is the removal of the MLM and soft losses?

Looking forward to your help, thanks!
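The first workaround listed in the error message itself is to enable unused-parameter detection when wrapping the model in DistributedDataParallel. A minimal sketch of that change (whether it matches the authors' intended configuration is not confirmed; `model` and `local_rank` are assumed to be set up as in run_vcr_train.py):

import torch

# Assumes `model` is already built and moved to the local GPU, and `local_rank`
# comes from torch.distributed.launch, as in run_vcr_train.py.
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # tolerate parameters (e.g. the momentum model's)
                                  # that do not contribute to the loss in a given step
)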
