Improving CLIP Training with Language Rewrites

This repo contains text data, code and pre-trained models for the paper Improving CLIP Training with Language Rewrites.

Overview:

We propose Language augmented CLIP (LaCLIP). LaCLIP enhances CLIP training by rewriting the text description associated with each image, using the in-context learning capability of large language models. The rewrites preserve key concepts while introducing diversity in sentence structure and vocabulary. During training, LaCLIP randomly selects either the original text or one of the rewritten versions as the text augmentation for each image. Experimental results on various datasets show that LaCLIP significantly improves transfer performance without additional computational or memory requirements during training.

Key steps:

Meta-Input-Output Generation: we explored different strategies for generating meta-input-output pairs that can be used as examples in the prompt context for LLaMA in-context learning, namely ChatGPT, Bard, MSCOCO and Human. Examples of generating such pairs with ChatGPT:

In-Context Learning with LLaMA: Utilizing the constructed context input as a prompt, LLaMA exhibits its ability to perform text completion and generate rewritten versions of the corresponding text samples. This process is conducted for each text sample present in the pre-training image-text dataset. Example of LLaMA rewriting a text sample:
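The prompt construction described above can be sketched as follows. This is a minimal illustration, not the repo's code: the function name `build_icl_prompt`, the ` => ` separator, and the example captions are all assumptions made for this sketch, and no LLaMA call is shown.

```python
from typing import List, Tuple


def build_icl_prompt(meta_pairs: List[Tuple[str, str]], caption: str) -> str:
    """Build a text-completion prompt from (original, rewrite) example pairs.

    The separator " => " is a hypothetical choice for this sketch.
    """
    lines = []
    for original, rewrite in meta_pairs:
        # Each meta-input-output pair becomes one in-context example.
        lines.append(f"{original} => {rewrite}")
    # The final line is left open so the language model completes the rewrite.
    lines.append(f"{caption} =>")
    return "\n".join(lines)


# Hypothetical meta-input-output pairs (not taken from the released data).
meta_pairs = [
    ("a dog on the grass", "A dog is lying on a green lawn."),
    ("red car, city street", "A red car is parked on a street in the city."),
]
prompt = build_icl_prompt(meta_pairs, "sunset over the lake")
print(prompt)
```

The resulting string would be passed to the language model as a completion prompt, and the text generated after the final separator would be taken as the rewrite.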

LaCLIP: Training with Rewritten Texts: Having generated M different rewrites for each caption, we randomly select either the original caption or one of its rewrites as the augmented text for each image. We then train CLIP with the augmented image-text pairs.
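A minimal sketch of the text augmentation step, assuming the candidate pool is the original caption plus its M rewrites (as the overview describes) and that sampling is uniform. The function name `sample_caption` and the uniform choice are assumptions of this sketch, not confirmed details of the training code.

```python
import random
from typing import List, Optional


def sample_caption(
    original: str,
    rewrites: List[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one caption for an image from the original and its rewrites."""
    rng = rng or random.Random()
    # Assumption: the original and every rewrite are equally likely.
    candidates = [original] + list(rewrites)
    return rng.choice(candidates)


# Example: a fresh caption is drawn each time the image is visited in training.
rng = random.Random(0)
caption = sample_caption(
    "sunset over the lake",
    ["The sun sets over a calm lake.", "A lake at sunset."],
    rng,
)
print(caption)
```

Because the choice is made per sample at load time, the augmentation only requires storing the extra text, which is consistent with the claim that training cost is unchanged.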

Code Overview

4 versions of augmented text on 3 datasets (CC3M, CC12M, RedCaps)
Pre-trained models with LaCLIP and vanilla CLIP
Zero-shot evaluation code on ImageNet

Dependencies

PyTorch 1.11.0
torchvision 0.12.0
timm 0.5.4
open_clip (optional, for LAION-400M models)

Augmented Texts

Original is the original caption associated with each image.
ChatGPT/Bard/MSCOCO/Human is the text generated by LLaMA ICL with the ChatGPT/Bard/MSCOCO/Human Meta-Input-Output pairs as in-context learning examples.

Dataset	Original	ChatGPT	Bard	MSCOCO	Human
CC3M	Link	Link	Link	Link	Link
CC12M	Link	Link	Link	Link	Link
RedCaps	Link	Link	Link	Link	Link

Pre-trained Models

Dataset	Method	ImageNet Zero-Shot Top-1 (%)	Checkpoint
CC3M	CLIP	15.8	ViT-B/16
CC3M	LaCLIP	21.5	ViT-B/16
CC12M	CLIP	40.2	ViT-B/16
CC12M	LaCLIP	48.4	ViT-B/16
RedCaps	CLIP	42.9	ViT-B/16
RedCaps	LaCLIP	46.2	ViT-B/16
LAION-400M	CLIP	62.0	ViT-B/32
LAION-400M	LaCLIP	64.4	ViT-B/32
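To make the size of the improvement explicit, the gains can be computed directly from the table above. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# (CLIP, LaCLIP) zero-shot scores copied from the table above.
results = {
    "CC3M": (15.8, 21.5),
    "CC12M": (40.2, 48.4),
    "RedCaps": (42.9, 46.2),
    "LAION-400M": (62.0, 64.4),
}

# Absolute gain of LaCLIP over CLIP, rounded to one decimal place.
gains = {
    name: round(laclip - clip, 1)
    for name, (clip, laclip) in results.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}")
```

The gains are 5.7 on CC3M, 8.2 on CC12M, 3.3 on RedCaps and 2.4 on LAION-400M.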

Zero-shot Evaluation on ImageNet

To perform zero-shot evaluation on ImageNet, use the following commands:

For CC3M, CC12M and RedCaps models:

python eval_zeroshot_imagenet.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model CLIP_VITB16 --batch-size 128 --workers 8

For LAION-400M models:

python eval_zeroshot_imagenet_laion.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model ViT-B-32 --batch-size 128 --workers 8
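For readers unfamiliar with zero-shot classification, the idea behind these scripts can be sketched as follows. This is not the repo's evaluation code: the toy three-dimensional embeddings stand in for the outputs of the image and text encoders, and the function names are assumptions of this sketch.

```python
import math
from typing import Dict, List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb: List[float], class_embs: Dict[str, List[float]]) -> str:
    """Return the class whose text embedding is most similar to the image."""
    img = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, normalize(emb)))  # cosine similarity
        for name, emb in class_embs.items()
    }
    return max(scores, key=scores.get)


# Toy embeddings; in practice each class embedding comes from encoding a
# text prompt for that class, and the image embedding from the image encoder.
class_embs = {"dog": [1.0, 0.1, 0.0], "cat": [0.0, 1.0, 0.2]}
pred = zero_shot_predict([0.9, 0.2, 0.0], class_embs)
print(pred)
```

Accuracy is then the fraction of validation images whose predicted class matches the label.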

Citation

@article{fan2023improving,
  title={Improving CLIP Training with Language Rewrites},
  author={Fan, Lijie and Krishnan, Dilip and Isola, Phillip and Katabi, Dina and Tian, Yonglong},
  journal={arXiv preprint arXiv:2305.20088},
  year={2023}
}
