FT-CLIP

This repo is the official implementation of "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet".

Introduction

Recent studies have shown that CLIP achieves remarkable zero-shot inference performance, while its fine-tuning performance is reported to be unsatisfactory. In this paper, we identify that fine-tuning performance is significantly affected by hyper-parameter choices. Through a comprehensive study, we examine several key hyper-parameters and empirically evaluate their impact when fine-tuning CLIP for classification tasks. We find that the fine-tuning performance of CLIP is substantially underestimated. With refined hyper-parameters, we demonstrate that fine-tuned CLIP is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP.

Results

| | ViT-Base/16 (224) | ViT-Base/16 (384) | ViT-Large/16 (384) | ViT-Large/14 (224) | ViT-Large/14 (336) |
|---|---|---|---|---|---|
| FLOPs | 17.5G | 55.4G | 190.7G | 80.7G | 190.6G |
| Supervised baseline | | | | | |
| ImageNet-21K | 84.0 | 86.2 | 87.1 | ---- | ---- |
| JFT-300M | ---- | 86.7 | 88.0 | ---- | ---- |
| JFT-3B | ---- | 86.6 | 88.5 | ---- | ---- |
| MIM with CLIP as prediction target | | | | | |
| MVP | 84.4 | ---- | ---- | ---- | ---- |
| FD-CLIP | 84.9 | ---- | ---- | ---- | ---- |
| CAE-v2 | 85.3 | ---- | ---- | ---- | ---- |
| BEiT-2 | 85.5 | ---- | ---- | ---- | ---- |
| FT-CLIP (ours, fine-tuning CLIP directly) | 85.7 | 86.6 | ---- | 88.0 | 88.3 |

Setup

PyTorch, timm, and DeepSpeed are required. Differences in CUDA version or GPU hardware may slightly affect the results.

pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user timm==0.4.12
pip install --user deepspeed==0.4.0
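
A quick way to confirm that the environment matches the versions above is a short check like the following. This is an illustrative snippet only, not part of this repo:

# Sanity-check the installed versions and GPU visibility (illustrative only).
import torch
import timm
import deepspeed

print("torch:", torch.__version__)          # expected: 1.10.2+cu113
print("timm:", timm.__version__)            # expected: 0.4.12
print("deepspeed:", deepspeed.__version__)  # expected: 0.4.0
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())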

Fine-tuning configs

The CLIP ViT-Base/16 model can be fine-tuned on ImageNet-1K using 8 A100-40GB GPUs:

MODEL=CLIP_B16
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

echo $OUTPUT_DIR
# Create the output directory and keep a copy of this script next to the logs
mkdir -p $OUTPUT_DIR
cp $0 $OUTPUT_DIR

# Launch distributed fine-tuning with one process per GPU
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune True \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 6e-4 --update_freq 1 \
    --warmup_epochs 10 --epochs 50 \
    --layer_decay 0.6 \
    --drop_path 0 \
    --dist_eval --eval_all --no_save_ckpt \
    --enable_deepspeed \
    --clip_mean_and_std \
    --layer_scale_init_value 0 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0 --cutmix 0 \
    --nb_classes 1000 --model_prefix visual. \
    --model_ema --model_ema_decay 0.9998 \
    2>&1 | tee -a ${OUTPUT_DIR}/log.txt
  • --batch_size: batch size per GPU.
  • Effective batch size = number of GPUs * --batch_size * --update_freq. In the example above, the effective batch size is 8 * 256 * 1 = 2048.
  • --lr: base learning rate.
  • --layer_decay: layer-wise learning rate decay. The LR of the i-th layer is lr * layer_decay ** i (see the sketch after this list).
  • --warmup_epochs: number of learning rate warmup epochs.
  • --epochs: total fine-tuning epochs.
  • --clip_mean_and_std: use the CLIP normalization mean and std instead of the ImageNet defaults.
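
For concreteness, here is a minimal Python sketch of the two computations described above: the effective batch size and the per-layer learning rates produced by layer-wise decay. It is illustrative only; the variable names and the layer indexing are assumptions, not the repository's actual implementation.

# Minimal sketch (not part of the repo) of the effective batch size and
# layer-wise LR decay described above; names and indexing are assumptions.
num_gpus = 8
batch_size = 256          # --batch_size (per GPU)
update_freq = 1           # --update_freq
effective_batch_size = num_gpus * batch_size * update_freq   # 8 * 256 * 1 = 2048

base_lr = 6e-4            # --lr
layer_decay = 0.6         # --layer_decay
num_layers = 12           # ViT-Base/16 has 12 transformer blocks

# i counts layers from the classification head downward, matching the rule
# "lr * layer_decay ** i": blocks closer to the input get smaller LRs.
layer_lrs = [base_lr * layer_decay ** i for i in range(num_layers + 1)]

print(effective_batch_size)         # 2048
print(layer_lrs[0], layer_lrs[-1])  # LR near the head vs. LR of the earliest block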

See scripts/ for more configurations.

Acknowledgments

This repository is modified from BEiT, built using the timm library, the DeiT repository and the CLIP repository. The CLIP model file is modified from DeCLIP.

Citation

If you use this code for your research, please cite our paper.

@article{dong2022ftclip,
  title={CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2212.06138},
  year={2022}
}
