
tip-adapter's Introduction

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

Official implementation of 'Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification'.

The paper has been accepted by ECCV 2022.

News

  • Our latest work, CaFo, builds on Tip-Adapter and has been accepted by CVPR 2023 🔥. Please refer here for the code.

Introduction

Tip-Adapter is a training-free adaption method that enables CLIP to conduct few-shot classification. It not only inherits the training-free advantage of zero-shot CLIP but also performs comparably to training-required approaches. Tip-Adapter constructs the adapter as a key-value cache model built from the few-shot training set, and updates the prior knowledge encoded in CLIP via feature retrieval. On top of that, Tip-Adapter's performance can be further boosted to state-of-the-art by fine-tuning the cache model for 10x fewer epochs than existing approaches, which is both effective and efficient.
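For intuition, the training-free inference can be sketched in a few lines of PyTorch. This is an illustrative sketch based on the description above and on the formulas quoted in the issues further below, not the repository's exact code; the tensor shapes and the function name are assumptions.

import torch

def tip_adapter_logits(test_features, clip_weights, cache_keys, cache_values, alpha, beta):
    # test_features: (B, D)  L2-normalized CLIP image features of the test batch
    # clip_weights:  (D, N)  L2-normalized CLIP text features, one column per class
    # cache_keys:    (NK, D) L2-normalized features of the N*K few-shot training images
    # cache_values:  (NK, N) one-hot ground-truth labels of the few-shot images
    clip_logits = 100. * test_features @ clip_weights          # zero-shot CLIP prior
    affinity = test_features @ cache_keys.t()                  # feature retrieval against the cache
    cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
    return clip_logits + cache_logits * alpha                  # residual fusion of the two predictions

The cache model itself is simply the pair (cache_keys, cache_values) extracted once from the few-shot training set, so no gradient updates are needed.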

Requirements

Installation

Create a conda environment and install dependencies:

git clone https://github.com/gaopengcuhk/Tip-Adapter.git
cd Tip-Adapter

conda create -n tip_adapter python=3.7
conda activate tip_adapter

pip install -r requirements.txt

# Install the appropriate versions of torch and torchvision
conda install pytorch torchvision cudatoolkit

Dataset

Follow DATASET.md to install ImageNet and the other 10 datasets, referring to CoOp.

Get Started

Configs

The running configurations can be modified in configs/dataset.yaml, including the number of shots, visual encoders, and hyperparameters.

For simplicity, we provide the hyperparameters that achieve the overall best performance on 1~16 shots for each dataset, which match the scores reported in the paper. If the hyperparameters are tuned separately for each shot number, the 1~16-shot performance can be improved further. You can edit search_scale, search_step, init_beta and init_alpha for fine-grained tuning.
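Roughly speaking, this search is a grid search over beta and alpha, with the range governed by search_scale and the resolution by search_step. The sketch below is a hedged illustration of that idea, not the repository's exact range construction; it evaluates each (beta, alpha) pair on pre-extracted validation features.

import torch

def search_hp(cfg, cache_keys, cache_values, val_features, val_labels, clip_weights):
    # Hypothetical grid search over beta/alpha on the validation split.
    best_acc, best_beta, best_alpha = 0.0, cfg['init_beta'], cfg['init_alpha']
    beta_scale, alpha_scale = cfg['search_scale']   # e.g. [12, 5]
    beta_steps, alpha_steps = cfg['search_step']    # e.g. [200, 20]
    clip_logits = 100. * val_features @ clip_weights
    affinity = val_features @ cache_keys.t()
    for i in range(1, beta_steps + 1):
        beta = beta_scale * i / beta_steps
        cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
        for j in range(1, alpha_steps + 1):
            alpha = alpha_scale * j / alpha_steps
            acc = ((clip_logits + cache_logits * alpha).argmax(-1) == val_labels).float().mean().item()
            if acc > best_acc:
                best_acc, best_beta, best_alpha = acc, beta, alpha
    return best_beta, best_alpha, best_acc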

Note that load_cache and load_pre_feat default to False for the first run, which stores the cache model and the val/test features in configs/dataset/. For later runs, they can be set to True for faster hyperparameter tuning.

Numerical Results

We provide Tip-Adapter's numerical results from Figures 4 and 5 of the paper in exp.log.

CLIP-Adapter's numerical results are also updated for comparison.

Running

For ImageNet dataset:

CUDA_VISIBLE_DEVICES=0 python main_imagenet.py --config configs/imagenet.yaml

For the other 10 datasets:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/dataset.yaml

The fine-tuning of Tip-Adapter-F will be automatically conducted after the training-free Tip-Adapter.
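Conceptually, Tip-Adapter-F makes only the cache keys learnable, fine-tuning them as a bias-free linear layer on top of frozen CLIP features, while the cache values and both CLIP encoders stay fixed. The following is a hedged sketch of what such a loop might look like; the optimizer, learning rate, and epoch count are assumptions rather than the repository's exact settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_tip_adapter_f(cache_keys, cache_values, clip_weights, train_loader, alpha, beta, epochs=20, lr=1e-3):
    # Wrap the cache keys in a bias-free linear layer so they become learnable;
    # cache_values and the CLIP encoders remain frozen.
    adapter = nn.Linear(cache_keys.shape[1], cache_keys.shape[0], bias=False)
    adapter.weight.data = cache_keys.clone()
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for _ in range(epochs):
        for features, target in train_loader:   # pre-extracted, L2-normalized CLIP features and labels
            affinity = adapter(features)        # learnable version of features @ cache_keys.t()
            cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
            clip_logits = 100. * features @ clip_weights
            loss = F.cross_entropy(clip_logits + cache_logits * alpha, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapter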

Contributors

Renrui Zhang, Peng Gao

Acknowledgement

This repo benefits from CLIP, CoOp and CLIP-Adapter. Thanks for their wonderful work.

Citation

@article{zhang2021tip,
  title={Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling},
  author={Zhang, Renrui and Fang, Rongyao and Gao, Peng and Zhang, Wei and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2111.03930},
  year={2021}
}

Contact

If you have any questions about this project, please feel free to contact [email protected] and [email protected].


tip-adapter's Issues

Details of data augmentation

In the paper, "the CLIP-style pre-processing resizes the cropped image’s short side to 224 while keeping its original aspect", and you said that you use the CLIP-style RandomResizedCrop.

However, I found that the standard RandomResizedCrop is used in the code.

I wonder whether this setting is important to the final performance, or whether I misunderstood something here?
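For concreteness, the two pre-processing pipelines being compared would look roughly as follows in torchvision; this is an illustrative sketch using CLIP's published normalization constants, and the exact transform parameters in the repository may differ.

from torchvision import transforms
from torchvision.transforms import InterpolationMode

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

# Standard RandomResizedCrop, as found in the code.
standard_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, interpolation=InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

# CLIP-style pre-processing described in the paper: resize the short side to 224
# while keeping the aspect ratio, then take a 224x224 crop.
clip_style_transform = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])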

Can't find 'dalle_caltech-101\\dalle_caltech.json'. How can I solve it?

Running configs.
{'root_path': '', 'load_cache': False, 'load_pre_feat': False, 'search_hp': True, 'search_scale': [12, 5], 'search_step': [200, 20], 'init_beta': 1, 'init_alpha': 1.3, 'gpt3_prompt_file': './gpt_file/caltech_prompt.json', 'dataset': 'caltech101', 'shots': 16, 'clip_backbone': 'RN50', 'dino_backbone': 'resnet50', 'dalle_dataset': 'dalle_caltech', 'dalle_shots': 1, 'lr': 0.001, 'augment_epoch': 10, 'train_epoch': 20, 'cache_dir': './caches\caltech101'}

Pretrained weights found at dino/dino_resnet50_pretrain.pth and loaded with msg:
Preparing dataset.
Reading split from E:/semester of junior year 2/graduation design/CaFo-main/DATA/caltech-101/split_zhou_Caltech101.json
Creating a 16-shot dataset
Reading split from dalle_caltech-101\dalle_caltech.json
Traceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 228, in main
    dalle_dataset = build_dataset(cfg['dalle_dataset'], cfg['root_path'], cfg['dalle_shots'])
  File "E:\semester of junior year 2\graduation design\CaFo-main\datasets\__init__.py", line 51, in build_dataset
    return dataset_list[dataset](root_path, shots)
  File "E:\semester of junior year 2\graduation design\CaFo-main\datasets\dalle_caltech.py", line 15, in __init__
    train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
  File "E:\semester of junior year 2\graduation design\CaFo-main\datasets\oxford_pets.py", line 120, in read_split
    split = read_json(filepath)
  File "E:\semester of junior year 2\graduation design\CaFo-main\datasets\utils.py", line 17, in read_json
    with open(fpath, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dalle_caltech-101\dalle_caltech.json'

Adapter used in the vision encoder or the text encoder?

Hey, thanks for the nice work. I have some confusion, as follows.
First, why is the adapter used only in the vision encoder? Did the authors try using the adapter in the text encoder?
Second, I don't understand why using an adapter performs better than using a learnable prompt. In addition, the "adapter" used in this paper is different from the adapters in NLP tasks, and the insertion position is also different; which one is better?

Bug when I try CIFAR-100

Thanks for your work.
When I try your code on CIFAR-100, I get this error and I don't know how to solve it.
Due to ImageNet's huge number of images, I can only do this.
Please help.

Torch version: 1.7.1
Namespace(alpha=1, augment_epoch=10, beta=1.17, lr=0.001, train_epoch=20)
Model parameters: 151,277,313
Input resolution: 224
Context length: 77
Vocab size: 49408
Load data finished.
start getting text features.
finish getting text features.
start getting image features
start saving training image features
Augment time: 0 / 10
3%|▉ | 6/196 [00:03<01:45, 1.81it/s]
Traceback (most recent call last):
  File "main.py", line 487, in <module>
    main()
  File "main.py", line 244, in main
    for i, (images, target) in enumerate(tqdm(train_loader)):
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torchvision/datasets/cifar.py", line 113, in __getitem__
    img, target = self.data[index], self.targets[index]
IndexError: list index out of range
[1]+  Killed                  python main.py

Sharpness & residual ratio range

Hi,
I can't find any information on how you set the alpha and beta hyperparameter ranges for each dataset. Why don't you use the same ranges for all datasets, and how did you determine these ranges?

Replicating your results on the Food101 dataset

Would you consider providing the script to replicate your results on the Food101 dataset? If someone were to adapt your ImageNet script, do you have suggestions on what to make sure to adjust?

Discrepancy in Logit Scales between clip_logits and cache_logits

Hello,

I have a question regarding the scaling and balance between CLIP logits and cache logits in the tip-adapter implementation. Specifically, I'm looking at the following code:

  1. clip_logits = 100. * val_features @ clip_weights

Here, val_features and clip_weights are L2-normalized vectors. The resulting clip_logits has a range of [-100, 100] due to the 100x scaling factor.

  2. cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
     affinity = val_features @ cache_keys

val_features and cache_keys are also L2-normalized. The affinity values range from [-1, 1].
The expression - (beta - beta * affinity) leads to a range of [-2*beta, 0], which is then exponentiated, yielding values in the range (0, 1].

The primary concern is that clip_logits and cache_logits are not on the same scale: clip_logits ranges over [-100, 100], while the adapter's exponentiated affinities are mostly in (0, 1]. This discrepancy might affect the effective fusion of these logits in the model, as seen in tip_logits = clip_logits + cache_logits * alpha.

Given that alpha is typically a single-digit number, I'm wondering if this difference in scale is intended or if there might be a need for additional scaling or normalization to align these logits more effectively? Any insights or suggestions would be greatly appreciated.

Thank you for your time and assistance!
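The ranges described in this question can be checked numerically with random L2-normalized features; this is an illustrative sketch with arbitrary shapes, not code from the repository.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, NK, N = 8, 512, 1600, 100   # batch size, feature dim, cache size (N*K), classes
alpha, beta = 1.0, 1.0

val_features = F.normalize(torch.randn(B, D), dim=-1)
clip_weights = F.normalize(torch.randn(D, N), dim=0)
cache_keys = F.normalize(torch.randn(NK, D), dim=-1)
cache_values = F.one_hot(torch.randint(N, (NK,)), N).float()

clip_logits = 100. * val_features @ clip_weights      # each entry bounded by [-100, 100]
affinity = val_features @ cache_keys.t()              # each entry in [-1, 1]
cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
# each cache_logits entry sums roughly NK/N exponentiated terms, each in (0, 1]

print('clip_logits range: ', clip_logits.min().item(), clip_logits.max().item())
print('cache_logits range:', cache_logits.min().item(), cache_logits.max().item())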

search_hp

Hi, could the authors please explain the use of the search_hp function in the code? It doesn't seem to be mentioned in the paper. How should search_step and search_scale be selected, and what are the ranges of alpha and beta?

Run Tip-Adapter on text2img retrieval instead

Hi, thanks for the amazing work on adapters for CLIP. Currently, the framework computes the affinities between the test query image and the cache keys before obtaining the corresponding few-shot label. This works well. I would just like your advice on how I can extend this to text2img retrieval, where I would like to query with a text search term and utilise the cache key-value adapter to return the corresponding images. Would it be as naive as doing text-to-text embedding affinity matching of the query text with the cache VALUES (instead of the keys), since they contain the ground-truth labels for the few-shot learning?

test

Can you provide code showing how to classify an arbitrary test image using the cache model and obtain its class probabilities?
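For what it is worth, classifying a single image with an already-built cache would look roughly like the sketch below. It uses the OpenAI clip package and assumes that cache_keys, cache_values, clip_weights, alpha and beta have already been prepared; the function name and the choice of the RN50 backbone are assumptions.

import torch
import clip
from PIL import Image

def classify_image(path, cache_keys, cache_values, clip_weights, alpha, beta, device='cuda'):
    model, preprocess = clip.load('RN50', device=device)
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image).float()
        feat = feat / feat.norm(dim=-1, keepdim=True)    # L2-normalize the image feature
        clip_logits = 100. * feat @ clip_weights
        affinity = feat @ cache_keys.t()
        cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
        probs = (clip_logits + cache_logits * alpha).softmax(dim=-1)
    return probs.squeeze(0)                              # per-class probabilities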

Are CLIP/TIP-Adapter only designed for the few-shot setting?

Sorry, I've got another question.
I did not find experiments under the base-to-new/domain generalization setting or the cross-dataset transfer setting, which are conducted by CoCoOp.
Are CLIP-Adapter/Tip-Adapter only designed for the few-shot setting? I wonder how their generalization abilities are. Could you give me some intuition?
