Improving CLIP Training with Language Rewrites

This repo contains the text data, code, and pre-trained models for the paper Improving CLIP Training with Language Rewrites.

Overview:

We propose Language augmented CLIP (LaCLIP). LaCLIP enhances CLIP training by rewriting the text description associated with each image using the in-context learning capability of large language models. The rewrites preserve key concepts while introducing diversity in sentence structure and vocabulary. During training, LaCLIP randomly selects either the original text or one of the rewritten versions as the text augmentation for each image. Experimental results on various datasets demonstrate that LaCLIP significantly improves transfer performance without additional computational or memory requirements.

Key steps:

  • Meta-Input-Output Generation: We explored four strategies for generating the meta-input-output pairs that serve as examples in the prompt context for LLaMA in-context learning: ChatGPT, Bard, MSCOCO and Human. Examples of generating such pairs with ChatGPT:

[Figure: generating meta-input-output pairs with ChatGPT]

  • In-Context Learning with LLaMA: Using the constructed context as a prompt, LLaMA performs text completion to generate a rewritten version of the given text sample. This process is repeated for every text sample in the pre-training image-text dataset. Example of LLaMA rewriting a text sample:

[Figure: LLaMA in-context learning (ICL) rewriting a text sample]

  • LaCLIP: Training with Rewritten Texts: Having generated M different rewrites for each caption, we randomly select one of them as the augmented text for each image. We then train CLIP with the augmented image-text pairs.

[Figure: result]
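The last two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this repo: the function names `build_icl_prompt` and `sample_text`, the instruction string, and the `=>` separator in the prompt template are all hypothetical. Uniform sampling over the original caption plus its M rewrites is also an assumption, since the text above only says the selection is random.

```python
import random


def build_icl_prompt(meta_pairs, caption, instruction="Rewrite the image caption:"):
    """Assemble an in-context prompt for a text-completion model.

    meta_pairs: list of (original, rewritten) example pairs.
    caption: the caption we want the model to rewrite.
    The template (instruction line, "=>" separator) is a hypothetical choice.
    """
    lines = [instruction]
    for original, rewritten in meta_pairs:
        lines.append(f"{original} => {rewritten}")
    # Leave the final pair open so the model completes the rewrite.
    lines.append(f"{caption} =>")
    return "\n".join(lines)


def sample_text(original, rewrites, rng=random):
    """Pick one training text among the original caption and its rewrites.

    Assumes a uniform choice over all M + 1 candidates.
    """
    return rng.choice([original] + list(rewrites))


if __name__ == "__main__":
    pairs = [("a dog", "a small puppy playing outside")]
    print(build_icl_prompt(pairs, "a cat on a sofa"))
    print(sample_text("a cat on a sofa", ["a kitten resting on a couch"]))
```

Because the text is re-sampled each time an image is drawn, the same image can be paired with different captions across epochs, while the per-step cost stays at one text per image.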

Code Overview

  • 4 versions of augmented text on 3 datasets (CC3M, CC12M, RedCaps)
  • Pre-trained models with LaCLIP and vanilla CLIP
  • Zero-shot evaluation code on ImageNet

Dependencies

  • PyTorch 1.11.0
  • torchvision 0.12.0
  • timm 0.5.4
  • open_clip (optional, for LAION-400M models)

Augmented Texts

  • Original is the original caption associated with each image.
  • ChatGPT/Bard/MSCOCO/Human is the text generated by LLaMA ICL with the ChatGPT/Bard/MSCOCO/Human Meta-Input-Output pairs as in-context learning examples.

Dataset   Original   ChatGPT   Bard   MSCOCO   Human
CC3M      Link       Link      Link   Link     Link
CC12M     Link       Link      Link   Link     Link
RedCaps   Link       Link      Link   Link     Link

Pre-trained Models

Dataset      Method   ImageNet Zero-Shot Acc. (%)   Checkpoint
CC3M         CLIP     15.8                          ViT-B/16
CC3M         LaCLIP   21.5                          ViT-B/16
CC12M        CLIP     40.2                          ViT-B/16
CC12M        LaCLIP   48.4                          ViT-B/16
RedCaps      CLIP     42.9                          ViT-B/16
RedCaps      LaCLIP   46.2                          ViT-B/16
LAION-400M   CLIP     62.0                          ViT-B/32
LAION-400M   LaCLIP   64.4                          ViT-B/32
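
To make the comparison easier to read, the per-dataset gain of LaCLIP over CLIP can be computed directly from the table. The snippet below only restates the numbers listed above; the variable names are illustrative.

```python
# (CLIP, LaCLIP) zero-shot scores copied from the table above.
results = {
    "CC3M": (15.8, 21.5),
    "CC12M": (40.2, 48.4),
    "RedCaps": (42.9, 46.2),
    "LAION-400M": (62.0, 64.4),
}

# Absolute gain in points, rounded to one decimal to avoid float noise.
gains = {
    dataset: round(laclip - clip, 1)
    for dataset, (clip, laclip) in results.items()
}

for dataset, gain in gains.items():
    print(f"{dataset}: +{gain}")
```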

Zero-shot Evaluation on ImageNet

To perform zero-shot evaluation on ImageNet, use one of the following commands, depending on the model:

For CC3M, CC12M and RedCaps models:

python eval_zeroshot_imagenet.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model CLIP_VITB16 --batch-size 128 --workers 8

For LAION-400M models:

python eval_zeroshot_imagenet_laion.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model ViT-B-32 --batch-size 128 --workers 8
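
For readers unfamiliar with zero-shot evaluation, the idea can be sketched as follows: each class name is encoded as a text embedding, each image is assigned to the class whose text embedding is most similar, and accuracy is the share of correct assignments. This is a toy illustration in pure Python with made-up 2-D embeddings, not the logic of the scripts above; the function names are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class with the highest cosine similarity."""
    img = normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, normalize(text_emb)))
        for text_emb in class_text_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Percentage of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)


# Toy data: two classes, three images (values are made up).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

if __name__ == "__main__":
    print(f"top-1 accuracy: {top1_accuracy(image_embs, labels, class_text_embs):.1f}%")
```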

Citation

@article{fan2023improving,
  title={Improving CLIP Training with Language Rewrites},
  author={Fan, Lijie and Krishnan, Dilip and Isola, Phillip and Katabi, Dina and Tian, Yonglong},
  journal={arXiv preprint arXiv:2305.20088},
  year={2023}
}
