This repo contains text data, code and pre-trained models for the paper Improving CLIP Training with Language Rewrites.
We propose Language augmented CLIP (LaCLIP). LaCLIP enhances CLIP training by rewriting the text description associated with each image, using the in-context learning capability of large language models. The rewrites preserve key concepts while introducing diversity in sentence structure and vocabulary. During training, LaCLIP randomly selects either the original text or one of its rewritten versions as the text augmentation. Experimental results on various datasets demonstrate that LaCLIP significantly improves transfer performance without additional computational or memory requirements.

Key steps:
- Meta-Input-Output Generation: we explored four sources for the meta-input-output pairs that serve as examples in the prompt context for LLaMA in-context learning: ChatGPT, Bard, MSCOCO and Human.
- In-Context Learning with LLaMA: using the constructed context as a prompt, LLaMA performs text completion to generate a rewritten version of a text sample. This process is applied to every text sample in the pre-training image-text dataset.
- LaCLIP: Training with Rewritten Texts: having generated M different rewrites for each caption, we randomly select either the original caption or one of its rewrites as the augmented text for each image during training. We then train CLIP on the augmented image-text pairs.
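
The steps above can be illustrated with a minimal sketch. It is not the code used in this repo: the function names `build_icl_prompt` and `sample_caption`, the `=>` separator, the example captions, and the assumption of uniform sampling over the original caption plus its M rewrites are all illustrative choices.

```python
import random


def build_icl_prompt(meta_pairs, caption):
    """Assemble a text-completion prompt from meta-input-output pairs.

    Assumption: each example is written as "input => output" on its own line,
    and the caption to rewrite is appended with an open "=>" for the LLM to complete.
    """
    lines = [f"{src} => {dst}" for src, dst in meta_pairs]
    lines.append(f"{caption} =>")
    return "\n".join(lines)


def sample_caption(original, rewrites, rng=random):
    """Pick the training text for one image.

    Assumption: the original caption and its M rewrites are equally likely.
    """
    candidates = [original] + list(rewrites)
    return rng.choice(candidates)


# Made-up captions, for illustration only.
meta_pairs = [("a dog runs on the beach", "a dog is running along the shore")]
prompt = build_icl_prompt(meta_pairs, "a cat sits on a sofa")

rng = random.Random(0)  # seeded so the run is reproducible
text = sample_caption("a cat sits on a sofa", ["a cat is resting on a couch"], rng)
```

Because the selection happens when a sample is drawn, the rewrites can be generated once offline and the training loop only pays for a random choice among stored strings.
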
- 4 versions of augmented text on 3 datasets (CC3M, CC12M, RedCaps)
- Pre-trained models with LaCLIP and vanilla CLIP
- Zero-shot evaluation code on ImageNet
- PyTorch 1.11.0
- torchvision 0.12.0
- timm 0.5.4
- open_clip (optional, for LAION-400M models)
- Original is the original caption associated with each image.
- ChatGPT/Bard/MSCOCO/Human is the text generated by LLaMA ICL with the ChatGPT/Bard/MSCOCO/Human Meta-Input-Output pairs as in-context learning examples.
Dataset | Original | ChatGPT | Bard | MSCOCO | Human |
---|---|---|---|---|---|
CC3M | Link | Link | Link | Link | Link |
CC12M | Link | Link | Link | Link | Link |
RedCaps | Link | Link | Link | Link | Link |
Dataset | Method | ImageNet Zero-Shot Top-1 (%) | Checkpoint |
---|---|---|---|
CC3M | CLIP | 15.8 | ViT-B/16 |
CC3M | LaCLIP | 21.5 | ViT-B/16 |
CC12M | CLIP | 40.2 | ViT-B/16 |
CC12M | LaCLIP | 48.4 | ViT-B/16 |
RedCaps | CLIP | 42.9 | ViT-B/16 |
RedCaps | LaCLIP | 46.2 | ViT-B/16 |
LAION-400M | CLIP | 62.0 | ViT-B/32 |
LAION-400M | LaCLIP | 64.4 | ViT-B/32 |
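
To make the comparison in the table above easier to read, the short snippet below computes the difference between the LaCLIP and CLIP zero-shot numbers for each dataset. The values are copied from the table; the variable names are illustrative.

```python
# Zero-shot numbers copied from the table above: (CLIP, LaCLIP).
results = {
    "CC3M": (15.8, 21.5),
    "CC12M": (40.2, 48.4),
    "RedCaps": (42.9, 46.2),
    "LAION-400M": (62.0, 64.4),
}

# Absolute gain of LaCLIP over CLIP, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
gains = {name: round(laclip - clip, 1) for name, (clip, laclip) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
# prints:
# CC3M: +5.7
# CC12M: +8.2
# RedCaps: +3.3
# LAION-400M: +2.4
```
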
To perform zero-shot evaluation on ImageNet, use the following commands:
For CC3M, CC12M and RedCaps models:
python eval_zeroshot_imagenet.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model CLIP_VITB16 --batch-size 128 --workers 8
For LAION-400M models:
python eval_zeroshot_imagenet_laion.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model ViT-B-32 --batch-size 128 --workers 8
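
For readers unfamiliar with zero-shot evaluation, the sketch below shows the general CLIP-style procedure: each class is represented by a text embedding, and an image is assigned to the class whose text embedding has the highest cosine similarity with the image embedding. This is a toy illustration in plain Python with made-up 2-dimensional embeddings, not the logic of `eval_zeroshot_imagenet.py`; the function names are hypothetical and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class with the highest cosine similarity."""
    img = l2_normalize(image_emb)
    scores = []
    for text_emb in class_text_embs:
        txt = l2_normalize(text_emb)
        scores.append(sum(a * b for a, b in zip(img, txt)))
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Percentage of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)


# Toy data: two classes, three images (made-up embeddings).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

accuracy = top1_accuracy(image_embs, labels, class_text_embs)
```

In this toy example the third image is closer to class 0 than to its labelled class 1, so two of the three predictions are correct.
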
@article{fan2023improving,
title={Improving CLIP Training with Language Rewrites},
author={Fan, Lijie and Krishnan, Dilip and Isola, Phillip and Katabi, Dina and Tian, Yonglong},
journal={arXiv preprint arXiv:2305.20088},
year={2023}
}