
clipn's Introduction

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

🚀 Updates

  • The code for CLIPN with hand-crafted prompts has been released (./hand-crafted).
  • The code for CLIPN with learnable prompts has been released (./src).
  • Thanks to the valuable suggestions from the reviewers of CVPR 2023 and ICCV 2023, our paper has been significantly improved and has been accepted to ICCV 2023.
  • If you are interested in CLIP-based open-vocabulary tasks, please feel free to check out our other work, "CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks" (github).

โญ Highlights of CLIPN

  • CLIPN attains SoTA performance in zero-shot OOD detection while inheriting the in-distribution (ID) classification ability of CLIP.
  • CLIPN offers an approach to unsupervised prompt learning using an image-text-paired web dataset.

🔨 Installation

  • The main Python libraries of our experimental environment are listed in requirements.txt. You can install CLIPN as follows:
git clone https://github.com/xmed-lab/CLIPN.git
cd CLIPN
conda create -n CLIPN
conda activate CLIPN
pip install -r ./requirements.txt

💻 Prepare Dataset

  • Pre-training dataset, CC3M. To download the CC3M dataset as a webdataset, please follow img2dataset.

When you have downloaded CC3M, please rewrite your data root in ./src/run.sh.
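
For example, a minimal download script might look like this (a sketch only; it assumes you have already obtained the CC3M metadata file, here named cc3m.tsv, with "caption" and "url" columns, and it uses the img2dataset Python API rather than this repo's scripts):

from img2dataset import download

download(
    url_list="cc3m.tsv",             # assumed name of the CC3M metadata file
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",      # CLIPN consumes CC3M in webdataset format
    output_folder="cc3m_webdataset", # assumed output path; point run.sh at it
    processes_count=16,
    thread_count=64,
    image_size=256,
)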

  • OOD detection datasets.
    • ID dataset, ImageNet-1K: the ImageNet-1K dataset (ILSVRC-2012) can be downloaded here.
    • OOD datasets, iNaturalist, SUN, Places, and Textures: please follow the instructions in the MOS and MCM repositories to download the subsampled versions in which classes that semantically overlap with ImageNet-1K have been removed.

When you have downloaded the above datasets, please rewrite your data root in ./src/tuning_util.py.

🔑 Pre-Train and Evaluate CLIPN

  • Pre-train CLIPN on CC3M. This step empowers the "no" logic within CLIP via the web dataset.
cd ./src
sh run.sh
  • Zero-shot evaluate CLIPN on ImageNet-1K.
    • Metrics and the pipeline are defined in ./src/zero_shot_infer.py. There you can find three baseline methods and our two inference algorithms, CTW and ATD (see Lines 91-96); a simplified sketch of both is given after the command below.
    • Dataset details are defined in ./src/tuning_util.py.
    • Inference models are defined in ./src/classification.py, including the conversion of the text encoders into classifiers.
    • You can download the models provided in the table below or use models you pre-trained yourself. Then rewrite the path to your models in the main function of ./src/zero_shot_infer.py. Finally, evaluate CLIPN by:
python3 zero_shot_infer.py
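
For intuition, here is a minimal sketch of how the two detectors score a single sample. It is written from Eqs. 4-8 of the paper rather than taken from the repository code; the function name and the variables fc_yes / fc_no (classifier weights from the standard and "no" prompts) are assumptions for illustration only.

import torch
import torch.nn.functional as F

def clipn_scores(image_feat, fc_yes, fc_no, tau=1.0):
    # image_feat: (D,) image embedding; fc_yes / fc_no: (C, D) text classifiers.
    feat = F.normalize(image_feat, dim=-1)
    logits_yes = feat @ fc_yes.T                        # (C,)
    logits_no = feat @ fc_no.T                          # (C,)

    # Class probabilities from the standard prompts (softmax with temperature tau).
    p_yes = torch.softmax(logits_yes / tau, dim=-1)
    # Per-class "no" probability: exp(l_no/tau) / (exp(l_yes/tau) + exp(l_no/tau)).
    p_no = torch.sigmoid((logits_no - logits_yes) / tau)

    pred = int(p_yes.argmax())
    # CTW: accept the winning class as ID only if its "no" probability is below 0.5.
    ctw_is_id = bool(p_no[pred] < 0.5)
    # ATD: aggregate an OOD probability over all classes and compare with max p_yes.
    p_ood = 1.0 - ((1.0 - p_no) * p_yes).sum()
    atd_is_id = bool(p_yes.max() > p_ood)
    return pred, ctw_is_id, atd_is_id

In short, CTW declares a sample OOD when the predicted class's "no" probability wins the pairwise competition, while ATD compares an aggregated OOD probability against the largest ID class probability.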

📘 Reproduced Results

To ensure the reproducibility of the results, we conducted three repeated experiments under each configuration. The table below shows the most recent reproduced results obtained before open-sourcing.

  • ImageNet-1K
ViT-B-16:

| Methods | Repeat | iNaturalist AUROC | iNaturalist FPR95 | SUN AUROC | SUN FPR95 | Textures AUROC | Textures FPR95 | Places AUROC | Places FPR95 | Avg AUROC | Avg FPR95 | Model/log |
| CLIPN-CTW | 1 | 93.12 | 26.31 | 88.46 | 37.67 | 79.17 | 57.14 | 86.14 | 43.33 | _ | _ | here |
| CLIPN-CTW | 2 | 93.48 | 21.06 | 89.79 | 30.31 | 83.31 | 46.44 | 88.21 | 33.85 | _ | _ | here |
| CLIPN-CTW | 3 | 91.79 | 25.84 | 89.76 | 31.30 | 76.76 | 59.25 | 87.66 | 36.58 | _ | _ | here |
| CLIPN-CTW | Avg | 92.80 | 24.41 | 89.34 | 33.09 | 79.75 | 54.28 | 87.34 | 37.92 | 87.31 | 37.42 | _ |
| CLIPN-ATD | 1 | 95.65 | 21.73 | 93.22 | 29.51 | 90.35 | 42.89 | 91.25 | 36.98 | _ | _ | here |
| CLIPN-ATD | 2 | 96.67 | 16.71 | 94.77 | 23.41 | 92.46 | 34.73 | 93.39 | 29.24 | _ | _ | here |
| CLIPN-ATD | 3 | 96.29 | 18.90 | 94.55 | 24.15 | 89.61 | 45.12 | 93.23 | 30.11 | _ | _ | here |
| CLIPN-ATD | Avg | 96.20 | 19.11 | 94.18 | 25.69 | 90.81 | 40.91 | 92.62 | 32.11 | 93.45 | 29.46 | _ |

The performance in this table is better than that reported in our paper because we add an average learnable "no" prompt (see Lines 600-616 in ./src/open_clip/model.py).

๐Ÿ“ Other Tips

There are several important factors that could affect the performance:

  • Class prompt texts. During inference, we need prompt texts to obtain the classifier weights (see ./src/prompt/prompt.txt). You are welcome to design higher-performing inference prompts for CLIPN; a sketch of how prompt texts are turned into classifier weights is given after this list.
  • The number of learnable "no" tokens. Currently, the number of learnable "no" tokens is set to 16. You can vary it to find an optimal value.
  • If you have any ideas to enhance CLIPN or want to transfer this idea to other topics, feel free to discuss them with me; I am happy to share ideas with you.
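
As a rough illustration of the first tip, here is a minimal sketch of how prompt texts can be turned into classifier weights. It uses the upstream open_clip API rather than the modified copy shipped in ./src/open_clip, and the prompt templates and class names below are placeholders.

import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

templates = ["a photo of a {}.", "a blurry photo of a {}."]  # placeholder prompts
classnames = ["dog", "cat"]                                  # placeholder class names

with torch.no_grad():
    weights = []
    for name in classnames:
        tokens = tokenizer([t.format(name) for t in templates])
        emb = model.encode_text(tokens)                      # (num_templates, D)
        emb = emb / emb.norm(dim=-1, keepdim=True)           # normalize each embedding
        weights.append(emb.mean(dim=0))                      # average over templates
    fc_yes = torch.stack(weights)                            # (num_classes, D) classifier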

📚 Citation

If you find our paper helpful, please consider citing it in your publications.

@inproceedings{wang2023clipn,
  title={CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No},
  author={Wang, Hualiang and Li, Yi and Yao, Huifeng and Li, Xiaomeng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1802--1812},
  year={2023}
}

๐Ÿป Acknowledge

We sincerely appreciate these three highly valuable repositories: open_clip, MOS, and MCM.


clipn's Issues

Question concerning OOD detection

First of all, thank you for your work.
The method is promising and your article is very interesting, so I tried to use it in two ways:

  • determining whether a detected object is a False Positive
  • determining the absence of an object in an image

I'm using the .pt weights you kindly provided, and I tried to implement the ATD and the CTW methods.
However, the results were really bad, leading me to think I missed something. In my first use case, the prompt was only:
"A photo of a person with a {}" ("A photo of a person without a {}") with "hat", "cap", "helmet" as the class names.
Using ATD, everything is considered OOD; using CTW, almost everything is considered ID.
I have some questions regarding your paper:
Do you have a reference or a paper explaining where Eq. 4 comes from? Regarding the CTW method, Eq. 4 should be over 0.5 for the classification to be OOD, right?
And where does Eq. 8 come from?
As for Eq. 6, to compute p_ij, this is a kind of softmax, right? Just with the temperature parameter added?
In that case, wouldn't the ATD method be unusable when you only have one class and just want to discard false positives, since p_ij is equal to 1?
The first thing that came to my mind was to find the index of the maximum value in logits and check logits[index] > logits_no[index] to decide whether it's ID or OOD. However, I suppose this is mathematically incorrect since you didn't mention it in your paper, and the test I ran also led to bad results.

Here are the functions I wrote for ATD and CTW from what I understood of your paper; they are kind of raw as it's a WIP. I used the code in the "hand-crafted" folder, which, from what I understood, is the one to use when dealing with custom prompts rather than the learned ones.
Both of them take logits and logits_no computed this way:
logits = F.normalize(feat, dim=-1, p=2) @ fc_yes.T
logits_no = F.normalize(feat, dim=-1, p=2) @ fc_no.T
They also take a tau parameter, which I set to 1 for now.

import math

def CTW(logits_yes, logits_no, tau):
    yes = logits_yes[0].detach().tolist()
    no = logits_no[0].detach().tolist()
    # p_ij: softmax over the "yes" logits (class probabilities).
    denominator = 0
    for i in range(len(yes)):
        denominator += math.exp(yes[i] / tau)
    pij = []
    for i in range(len(yes)):
        pij.append(math.exp(yes[i] / tau) / denominator)
    # p_ij^no: per-class "no" probability from the yes/no competition.
    pijno = []
    for i in range(len(no)):
        pijno.append(math.exp(no[i] / tau) / (math.exp(yes[i] / tau) + math.exp(no[i] / tau)))
    # Predicted class = argmax of p_ij; ID only if its "no" probability is below 0.5.
    index = pij.index(max(pij))
    bestood = pijno[index]
    return (index, 1 - bestood > bestood)

def ATD(logits_yes, logits_no, tau):
    yes = logits_yes[0].detach().tolist()
    no = logits_no[0].detach().tolist()
    # p_ij^no: per-class "no" probability from the yes/no competition.
    pijno = []
    for i in range(len(no)):
        pijno.append(math.exp(no[i] / tau) / (math.exp(yes[i] / tau) + math.exp(no[i] / tau)))
    # p_ij: softmax over the "yes" logits (class probabilities).
    denominator = 0
    for i in range(len(yes)):
        denominator += math.exp(yes[i] / tau)
    pij = []
    for i in range(len(yes)):
        pij.append(math.exp(yes[i] / tau) / denominator)
    index = pij.index(max(pij))
    # Aggregated OOD probability: 1 - sum_j (1 - p_j^no) * p_j.
    ood = 1.
    for i, pno in enumerate(pijno):
        ood -= (1 - pno) * pij[i]
    # ID if any class probability exceeds the OOD probability.
    res = 0
    for pyes in pij:
        if pyes > ood:
            res = 1
    return (index, res)

The second element of the return value is 1 (True) if the sample is ID and 0 otherwise.
The model is in eval mode, and I use the process_test function returned by load_model() to preprocess the images I load with PIL's Image.open().
So I don't know if I did something wrong or if I "just" need to retrain the model.
Thanks for your help!

Request for CC-3M Dataset Pre-trained Model Parameters

I highly appreciate your work and would like to explore the models in your research. However, I lack sufficient computational power for pre-training. Could you provide the pre-trained model parameters for the CC-3M dataset? Thank you very much.

Could you release the code for CIFAR100 dataset?

I really appreciate your work; it is very impressive.
I'd like to replicate the results using CIFAR-100 as the in-distribution dataset. Is it possible to access the code for the CIFAR-100 setting, or could you provide guidance on the image transformations and dataset splitting, e.g., using CIFAR-10 as the out-of-distribution dataset?
Thank you very much.

Weights for the ViT-B-32 model

Hello,

Thank you for the interesting work! Could you provide the weights for the ViT-B-32 model?
Only the ViT-B-16 version is available in the README.

Thanks in advance!
Best,

Having trouble installing requirements on MacOS with M3

I tried to run pip install -r ./requirements.txt as instructed but encountered a couple of issues:

  1. error in blessings setup command: use_2to3 is invalid, indicating an error when installing blessings==1.6, which can in fact be solved by pip install setuptools==58 (ref).
  2. Errors installing cloud-init, command-not-found, cupshelpers, defer==1.06, distro-info===0.23ubuntu1, language-selector==0.1, and nvidia-cublas-cu11==11.10.3.66 through pip on macOS:
ERROR: Could not find a version that satisfies the requirement cloud-init==23.1.2 (from versions: none)
ERROR: No matching distribution found for cloud-init==23.1.2

A quick Google search suggests that cloud-init (which is terribly under-documented) is usually something built into a virtual machine image.

Any suggestions for solving the issue of installing these packages?

Question about baseline results in Tab 2

I appreciate your impressive work.
In Table 2 of the main paper, are the MSP and MaxLogit results reproduced on CLIP or on CLIPN? I tested MaxLogit on CLIP (ViT-B-32) with CIFAR-100 (ID) and CIFAR-10 (OOD), but only got 74.8% AUROC.

Code issue

In the train_one_epoch function, your model outputs four variables (image_features, text_features, text_features_no, logit_scale). However, in the evaluate function, your model outputs only three variables, without text_features_no. This is quite strange because you train the "no" text encoder but do not use it during evaluation.
