
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning, ICCV 2023

Authors: Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng

This repository contains the official implementation of our paper: Transferable Decoding with Visual Entities for Zero-Shot Image Captioning.

Paper: https://openaccess.thecvf.com/content/ICCV2023/html/Fei_Transferable_Decoding_with_Visual_Entities_for_Zero-Shot_Image_Captioning_ICCV_2023_paper.html
arXiv: https://arxiv.org/abs/2307.16525


Introduction

This paper addresses the transferability of zero-shot captioning to out-of-domain images. We demonstrate the susceptibility of pre-trained vision-language models and large language models to modality bias induced by language models when adapting them to image-to-text generation. At the same time, these models tend to generate descriptions containing objects that do not actually exist in the image but frequently appear during training, a phenomenon known as object hallucination. We propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. This is the official repository for ViECap, in which you can easily reproduce our paper's results and try it on your own images.


Examples

Here are some examples of our model across diverse captioning scenarios!


The captioning results on the NoCaps, Flickr30k, and COCO datasets are presented here:


| Task | COCO $\Rightarrow$ NoCaps (In) | COCO $\Rightarrow$ NoCaps (Near) | COCO $\Rightarrow$ NoCaps (Out) | COCO $\Rightarrow$ NoCaps (Overall) | COCO $\Rightarrow$ Flickr30k | Flickr30k $\Rightarrow$ COCO | COCO | Flickr30k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Metric | CIDEr | CIDEr | CIDEr | CIDEr | CIDEr | CIDEr | CIDEr | CIDEr |
| MAGIC | ---- | ---- | ---- | ---- | 17.5 | 18.3 | 49.3 | 20.4 |
| DeCap | 65.2 | 47.8 | 25.8 | 45.9 | 35.7 | 44.4 | 91.2 | 56.7 |
| CapDec | 60.1 | 50.2 | 28.7 | 45.9 | 35.7 | 27.3 | 91.8 | 39.1 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ViECap | 61.1 | 64.3 | 65.0 | 66.2 | 38.4 | 54.2 | 92.9 | 47.9 |

Citation

If you find our paper and code helpful, we would greatly appreciate it if you could leave a star and cite our work. Thanks!

@InProceedings{Fei_2023_ICCV,
    author    = {Fei, Junjie and Wang, Teng and Zhang, Jinrui and He, Zhenyu and Wang, Chengjie and Zheng, Feng},
    title     = {Transferable Decoding with Visual Entities for Zero-Shot Image Captioning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {3136-3146}
}
@article{fei2023transferable,
  title={Transferable Decoding with Visual Entities for Zero-Shot Image Captioning},
  author={Fei, Junjie and Wang, Teng and Zhang, Jinrui and He, Zhenyu and Wang, Chengjie and Zheng, Feng},
  journal={arXiv preprint arXiv:2307.16525},
  year={2023}
}

Required Prerequisites

To run the code, begin by cloning this repository and downloading the annotations, checkpoints, and evaluation files from the Releases of this repository. Then unzip the files and place them in the root directory. Note that we have only run our code on Linux.

git clone git@github.com:FeiElysia/ViECap.git

Data Preparation

To use this code with your own dataset, the first step is to convert the dataset into the required format via data preprocessing. First, extract the entities from each caption in your chosen dataset using the following command (make sure you have placed all captions from the dataset into a list):

python entities_extraction.py
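
For reference, here is a minimal sketch of the input/output contract this step implies: one entity list per caption, aligned with the caption list. The real entities_extraction.py most likely uses a POS tagger rather than the toy entity vocabulary below, and every name in the sketch is illustrative.

captions = [
    "A little girl in pink pajamas sitting on a bed.",
    "A scenic view of a river with a waterfall.",
]
entity_vocabulary = {"girl", "pajamas", "bed", "river", "waterfall"}  # toy stand-in

def extract_entities(caption: str) -> list:
    # Keep the caption tokens that appear in the entity vocabulary, in order.
    tokens = caption.lower().rstrip(".").split()
    return [token for token in tokens if token in entity_vocabulary]

# One entity list per caption, aligned with the input list.
entities = [extract_entities(caption) for caption in captions]
print(entities)  # [['girl', 'pajamas', 'bed'], ['river', 'waterfall']]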

(Optional) You can pre-extract the training text features:

python texts_features_extraction.py
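
As a rough sketch of what this pre-extraction amounts to, assuming OpenAI's CLIP as the text encoder (the actual script may batch captions and store features differently, and the output filename here is hypothetical):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

captions = ["A little girl in pink pajamas sitting on a bed."]
with torch.no_grad():
    tokens = clip.tokenize(captions).to(device)
    text_features = model.encode_text(tokens)  # (len(captions), 512) for ViT-B/32

torch.save(text_features.cpu(), "text_features.pt")  # hypothetical output path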

Using these two scripts, you can transform any dataset you wish to use for training into the appropriate format for the dataloader. We have also made the processed COCO and Flickr30k datasets available in the Releases; feel free to use them directly!

To evaluate the trained ViECap, you should first construct the vocabulary and extract the embeddings of each category in the vocabulary. Use the vocabulary provided in the Releases and execute the following script (we also supply the extracted vocabulary embeddings in the Releases):

python generating_prompt_ensemble.py
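
A minimal sketch of prompt ensembling, following CLIP's standard recipe of averaging normalized text embeddings over several templates (the templates, vocabulary, and output path below are illustrative; the actual script's choices may differ):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a picture of a {}.", "an image of a {}."]
vocabulary = ["dog", "waterfall", "pajamas"]  # toy stand-in for the real vocabulary

embeddings = []
with torch.no_grad():
    for category in vocabulary:
        tokens = clip.tokenize([t.format(category) for t in templates]).to(device)
        features = model.encode_text(tokens)
        features = features / features.norm(dim=-1, keepdim=True)  # unit-normalize
        embeddings.append(features.mean(dim=0))  # average over templates

torch.save(torch.stack(embeddings).cpu(), "vocabulary_embeddings.pt")  # hypothetical path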

(Optional) You can also extract the image features beforehand for evaluation. Make sure to modify the script if you want to adapt it to your own dataset.

Note that if you choose not to use the image features we provide, you should download the source images for the COCO and Flickr30k datasets from their official websites. Afterwards, place these files in the 'ViECap/annotations/coco/val2014' directory for COCO images and the 'ViECap/annotations/flickr30k/flickr30k-images' directory for Flickr30k images.

python images_features_extraction.py
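
For a single image, this step boils down to the following sketch, assuming OpenAI's CLIP (the filename is hypothetical; the real script iterates over the whole evaluation split):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# preprocess resizes, center-crops, and normalizes the image for CLIP.
image = preprocess(Image.open("annotations/coco/val2014/example.jpg"))  # hypothetical file
with torch.no_grad():
    image_features = model.encode_image(image.unsqueeze(0).to(device))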

Training

To train ViECap on the COCO dataset or the Flickr30k dataset, use the corresponding script (bash train_*.sh n):

bash train_coco.sh 0
bash train_flickr30k.sh 0

where n is the ID of the GPU to use (i.e., 'cuda:n').


Evaluation

Now you can evaluate the captioning performance of your trained model on the test set using the command bash eval_*.sh EXP_NAME n OTHER_ARGS m, where EXP_NAME is the name of the directory storing the checkpoints, OTHER_ARGS specifies any additional arguments, and n and m are the GPU ID and the epoch of the checkpoint to evaluate, respectively.


Cross-domain Captioning

To evaluate the cross-domain captioning performance from COCO to NoCaps, run the following script:

bash eval_nocaps.sh train_coco 0 '--top_k 3 --threshold 0.2' 14

| Task | COCO $\Rightarrow$ NoCaps (In) | COCO $\Rightarrow$ NoCaps (In) | COCO $\Rightarrow$ NoCaps (Near) | COCO $\Rightarrow$ NoCaps (Near) | COCO $\Rightarrow$ NoCaps (Out) | COCO $\Rightarrow$ NoCaps (Out) | COCO $\Rightarrow$ NoCaps (Overall) | COCO $\Rightarrow$ NoCaps (Overall) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Metric | CIDEr | SPICE | CIDEr | SPICE | CIDEr | SPICE | CIDEr | SPICE |
| DeCap | 65.2 | ---- | 47.8 | ---- | 25.8 | ---- | 45.9 | ---- |
| CapDec | 60.1 | 10.2 | 50.2 | 9.3 | 28.7 | 6.0 | 45.9 | 8.3 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ViECap | 61.1 | 10.4 | 64.3 | 9.9 | 65.0 | 8.6 | 66.2 | 9.5 |

To evaluate the cross-domain captioning performance from COCO to Flickr30k, run the following script:

bash eval_flickr30k.sh train_coco 0 '--top_k 3 --threshold 0.2' 14

| Metric | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| MAGIC | 6.2 | 12.2 | 17.5 | 5.9 |
| DeCap | 16.3 | 17.9 | 35.7 | 11.1 |
| CapDec | 17.3 | 18.6 | 35.7 | ---- |
| ----- | ---- | ---- | ---- | ---- |
| ViECap | 17.4 | 18.0 | 38.4 | 11.2 |

To evaluate the cross-domain captioning performance from Flickr30k to COCO, run the following script:

bash eval_coco.sh train_flickr30k 0 '--top_k 3 --threshold 0.2 --using_greedy_search' 29

| Metric | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| MAGIC | 5.2 | 12.5 | 18.3 | 5.7 |
| DeCap | 12.1 | 18.0 | 44.4 | 10.9 |
| CapDec | 9.2 | 16.3 | 27.3 | ---- |
| ----- | ---- | ---- | ---- | ---- |
| ViECap | 12.6 | 19.3 | 54.2 | 12.5 |

In-domain Captioning

To evaluate the in-domain captioning performance on the COCO testing set, run the following script:

bash eval_coco.sh train_coco 0 '' 14

| Metric | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| ZeroCap | 7.0 | 15.4 | 34.5 | 9.2 |
| MAGIC | 12.9 | 17.4 | 49.3 | 11.3 |
| DeCap | 24.7 | 25.0 | 91.2 | 18.7 |
| CapDec | 26.4 | 25.1 | 91.8 | ---- |
| ----- | ---- | ---- | ---- | ---- |
| ViECap | 27.2 | 24.8 | 92.9 | 18.2 |

To evaluate the in-domain captioning performance on the Flickr30k testing set, run the following script:

bash eval_flickr30k.sh train_flickr30k 0 '' 29

| Metric | BLEU@4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- |
| ZeroCap | 5.4 | 11.8 | 16.8 | 6.2 |
| MAGIC | 6.4 | 13.1 | 20.4 | 7.1 |
| DeCap | 21.2 | 21.8 | 56.7 | 15.2 |
| CapDec | 17.7 | 20.0 | 39.1 | ---- |
| ----- | ---- | ---- | ---- | ---- |
| ViECap | 21.4 | 20.1 | 47.9 | 13.6 |

FlickrStyle10K

For FlickrStyle10K, you can follow the same steps described above. Begin by downloading the dataset!


We have provided the captioning results in the Releases. You can evaluate them directly using bash language_eval.sh </path>.

For example, if you wish to assess the cross-domain captioning performance from COCO to NoCaps, execute the following commands:

bash language_eval.sh ../checkpoints/train_coco/indomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/neardomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/outdomain_generated_captions.json
bash language_eval.sh ../checkpoints/train_coco/overall_generated_captions.json

Inference

You can caption any image you like with the following script:

python infer_by_instance.py --prompt_ensemble --using_hard_prompt --soft_prompt_first --image_path ./images/instance1.jpg

The generated caption is: A little girl in pink pajamas sitting on a bed.


Change --image_path to specify the path of any image you want to describe!

A little girl that is laying down on a bed.

A scenic view of a river with a waterfall in the background.

A girl with a ponytail is walking down the street.

(Optional) You can also execute the following script to generate captions for all the images within a specific directory:

python infer_by_batch.py --prompt_ensemble --using_hard_prompt --soft_prompt_first --image_path ./images
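
As a rough sketch, batched inference could gather and encode every image in the directory as follows, assuming OpenAI's CLIP as the encoder (infer_by_batch.py may organize its inputs differently):

import os

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_dir = "./images"
paths = [
    os.path.join(image_dir, name)
    for name in sorted(os.listdir(image_dir))
    if name.lower().endswith((".jpg", ".jpeg", ".png"))
]

# Stack the preprocessed images into one batch for a single encoder pass.
batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
with torch.no_grad():
    image_features = model.encode_image(batch)  # (len(paths), 512) for ViT-B/32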

Acknowledgments

Our repository builds on the CLIP, ClipCap, CapDec, MAGIC, and pycocotools repositories. Thanks for open-sourcing!


Contact

If you have any questions, please feel free to contact me at: [email protected].


viecap's Issues

The code is really well written

Very readable; I'm studying it as a template. Thanks, Hyperion's captain-programmer, haha.

Some questions about 'ClipCaptionPrefix'

Thank you for sharing this exciting work! The code and comments are pretty standard and I really learned a lot from it.

I would like to know: does the hyperparameter frozen_gpt mean freezing the whole GPT model during training? I notice that the ClipCaptionPrefix code is as follows:

class ClipCaptionPrefix(ClipCaptionModel):

    def parameters(self, recurse: bool = True):
        return self.mapping_network.parameters()

    def train(self, mode: bool = True):
        super(ClipCaptionPrefix, self).train(mode)
        self.gpt.eval()
        return self

I think gpt.eval() just stops the Batch Normalization and Dropout modules. I printed the params of GPT-2 after setting frozen_gpt=True, which goes as follows:

for name, param in model.gpt.named_parameters():
  print(name, ":", param.requires_grad)

the output:

transformer.wte.weight : True
transformer.wpe.weight : True
transformer.h.0.ln_1.weight : True
...

So I'm wondering whether the whole GPT-2 model is frozen, or just the BN and Dropout layers.

Thanks in advance!

about infer by batch

Congrats on your paper being accepted to ICCV 2023!
Looking at the infer_by_batch.py file in your source code, I don't see where batched inputs are actually used. Or am I just being careless and misunderstanding something?

Problem loading the pretrained GPT-2 model

Dear authors, I encountered the following problem when running train_coco.sh. How can I resolve it?
I believe it is an issue with loading the pretrained GPT-2 weights. I have searched for and tried many solutions, but none worked. I hope you can help. Thank you!
Traceback (most recent call last):
File "main.py", line 168, in
main()
File "main.py", line 152, in main
datasets = CaptionsDataset(
File "/private/ViECap-main/CaptionsDataset.py", line 31, in init
tokenizer = AutoTokenizer.from_pretrained(language_model)
File "/root/anaconda3/envs/Viecap/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 498, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/root/anaconda3/envs/Viecap/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 359, in get_tokenizer_config
resolved_config_file = get_file_from_repo(
File "/root/anaconda3/envs/Viecap/lib/python3.8/site-packages/transformers/utils/hub.py", line 678, in get_file_from_repo
resolved_file = cached_path(
File "/root/anaconda3/envs/Viecap/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
output_path = get_from_cache(
File "/root/anaconda3/envs/Viecap/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

How should clip_project_length be determined?

Hi, I've recently been trying to use ViT-L/14 as the backbone, and I found that the generated descriptions are all incomplete sentences missing a subject. After looking into it for a while, I believe the problem lies with clip_project_length. How did you determine this value, and do you have any advice for switching to ViT-L/14?
Thank you!
