Official implementation for the paper "An Efficient One-stage Prefix-based Generator for Image Captioning"
In our work, we use the X-VLM image encoder, which was pre-trained on a very large corpus of image-text pairs (4M), and GPT2 as the decoder, following the baseline ClipCap.
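To make the prefix paradigm concrete, here is a minimal sketch of the idea (the class names and dimensions below are illustrative stand-ins, not this repository's API): the image encoder's output is mapped to a short sequence of prefix embeddings, which are prepended to GPT2's token embeddings so the caption is generated conditioned on the image.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MappingNetwork(nn.Module):
    # Projects one global image feature into `prefix_len` GPT2-sized embeddings.
    def __init__(self, feat_dim, prefix_len, embed_dim):
        super().__init__()
        self.prefix_len, self.embed_dim = prefix_len, embed_dim
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, prefix_len * embed_dim // 2),
            nn.Tanh(),
            nn.Linear(prefix_len * embed_dim // 2, prefix_len * embed_dim),
        )
    def forward(self, feat):  # feat: (B, feat_dim)
        return self.proj(feat).view(-1, self.prefix_len, self.embed_dim)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MappingNetwork(feat_dim=1024, prefix_len=10, embed_dim=gpt2.config.n_embd)

image_feat = torch.randn(1, 1024)                  # stand-in for an X-VLM image feature
prefix = mapper(image_feat)                        # (1, 10, 768)
caption_ids = tokenizer("a group of people", return_tensors="pt").input_ids
token_embeds = gpt2.transformer.wte(caption_ids)   # (1, T, 768)
inputs = torch.cat([prefix, token_embeds], dim=1)  # prefix conditions the decoder
logits = gpt2(inputs_embeds=inputs).logits         # next-token predictions

During training, the cross-entropy loss is computed only over the caption tokens; the prefix positions carry the visual conditioning.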
Example captions generated by our model (demo images omitted):
- a group of people standing next to an elephant.
- a wooden table with a vase of flowers on top of it.
- a wooden crate filled with lots of ripe and unripe bananas.
- a woman is eating a bowl of food at a table.
- a wooden table topped with wooden spoons and wooden sticks.
- a motorcycle parked in a dirt field with horses in the background.
Clone the repository, create the conda environment, and install the dependencies:
git clone https://github.com/hyfwyy/OPG.git
cd OPG
conda env create -f environment.yml
conda activate opg
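As a quick sanity check that the core dependencies resolved (this assumes environment.yml pins PyTorch and Hugging Face Transformers, as ClipCap-derived code typically requires; adjust if the environment differs):

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())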
Download train_captions to data/coco/annotations.
Download the training and validation images and unzip them (we use the Karpathy et al. split).
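The Karpathy split is commonly distributed as a single JSON file (often named dataset_coco.json) in which every image carries a split field. Here is a minimal sketch for checking the split sizes once the annotations are in place (the file name and layout are assumptions about the standard Karpathy release, not this repository's loader):

import json
from collections import Counter

with open("data/coco/annotations/dataset_coco.json") as f:
    data = json.load(f)

# Karpathy's release uses the splits: train / restval / val / test
print(Counter(img["split"] for img in data["images"]))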
Download the pre-trained 4M checkpoint from X-VLM.
For the cross-entropy stage:
mlp+gpt2 tuning:
python train_scst.py --scst=False --device=cuda:0 --mapping_type=mlp --use_sparce_mask=True --use_aux_loss=True --threshold=0.1 --lamda=0.1
trans+gpt2 frozen:
python train_scst.py --scst=False --device=cuda:0 --mapping_type=transformer --only_prefix --use_sparce_mask=True --use_aux_loss=True --threshold=0.1 --lamda=0.1
trans+gpt2 tuning:
python train_scst.py --scst=False --device=cuda:0 --mapping_type=transformer --use_sparce_mask=True --use_aux_loss=True --threshold=0.1 --lamda=0.1
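The flags above are not documented in detail here. One plausible reading of --use_sparce_mask with --threshold (an assumption on our part, not taken from the paper) is that attention weights below the threshold are zeroed out and the survivors renormalized, with --lamda weighting an auxiliary loss added to the cross-entropy objective. A toy illustration of that reading:

import torch

attn = torch.softmax(torch.randn(1, 8, 10, 10), dim=-1)  # dummy attention weights
threshold, lamda = 0.1, 0.1

sparse = attn * (attn >= threshold).float()               # drop weak attention links
sparse = sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-8)

ce_loss, aux_loss = torch.tensor(2.3), torch.tensor(0.7)  # dummy loss values
total_loss = ce_loss + lamda * aux_loss                   # lamda-weighted auxiliary term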
For the CIDEr optimization stage:
python train_scst.py --scst=True --checkpoint=$checkpoint_path$ --mapping_type=mlp
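For context, CIDEr optimization with --scst=True follows the self-critical sequence training recipe (Rennie et al., 2017): a sampled caption is rewarded by how much its CIDEr score exceeds that of a greedy-decoded baseline. A minimal sketch of that policy-gradient loss (the tensors here are hypothetical stand-ins; computing CIDEr itself is out of scope):

import torch

def scst_loss(sample_logprobs, sample_cider, greedy_cider):
    # sample_logprobs: (B,) summed log-probs of each sampled caption
    # sample_cider:    (B,) CIDEr score of each sampled caption
    # greedy_cider:    (B,) CIDEr score of the greedy baseline caption
    reward = sample_cider - greedy_cider               # advantage over the baseline
    return -(reward.detach() * sample_logprobs).mean()

logp = torch.tensor([-12.3, -10.1], requires_grad=True)
loss = scst_loss(logp, torch.tensor([1.10, 0.85]), torch.tensor([0.95, 0.90]))
loss.backward()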
This repository is heavily based on the ClipCap repository. For training we used the COCO dataset.
For any inquiry, please contact us at [email protected].