clip_as_rnn's Introduction

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Shuyang Sun*, Runjia Li*, Philip Torr, Xiuye Gu, Siyang Li

[arXiv] [Project] [Code]

The code is fully released at Google Research.

The README doc is currently under development.

Installation

Requirements

Anaconda 3
PyTorch ≥ 1.7 and a torchvision version that matches the PyTorch installation. Install them together from pytorch.org to ensure they match.
Create the conda environment: conda env create --name ENV_NAME --file=car_env.yml
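
After creating the environment, it can be useful to confirm that the installed PyTorch satisfies the ≥ 1.7 requirement. The snippet below is a minimal sketch and not part of the repository: the helper name meets_minimum is hypothetical, and it only compares the major and minor version numbers. It reads the installed version from package metadata, so it does not need to import torch itself.

```python
import re
from importlib import metadata


def meets_minimum(version: str, minimum=(1, 7)) -> bool:
    """Return True if a version string such as '2.1.0+cu118' is >= minimum.

    Hypothetical helper: only the major and minor components are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    try:
        # Look up the installed distribution without importing it.
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("torch is not installed in this environment")
    else:
        status = "ok" if meets_minimum(installed) else "older than 1.7"
        print(f"torch {installed}: {status}")
```

This does not check that torchvision matches PyTorch; installing both together from pytorch.org remains the way to ensure that.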

Getting Started

Demo

We have set up an online demo. You can check it out at: TODO

Run Demo Locally

To test an image locally, run:

python3 demo.py --cfg-path=YOUR_CFG_PATH --output_path=SAVE_PATH

Evaluation with Benchmarks

Data preparation: see Preparing Datasets for CaR.
Evaluate: python3 evaluate.py --cfg-path=CFG_PATH
You can find configs for each dataset under configs.
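
To evaluate several benchmarks in one go, the evaluate command above can be wrapped in a small loop over the files under configs. The following is a sketch under stated assumptions, not a script shipped with the repository: it assumes the config files use a .yaml extension and may sit in subdirectories, and the names build_eval_commands and run_all are hypothetical. Only the --cfg-path flag is taken from this README.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_commands(config_dir="configs", pattern="*.yaml"):
    """Build one evaluate.py command per config file found under config_dir.

    Assumption: configs are YAML files; adjust `pattern` if they are not.
    """
    root = Path(config_dir)
    if not root.is_dir():
        return []
    config_paths = sorted(root.rglob(pattern))
    # sys.executable reuses the interpreter of the active environment.
    return [[sys.executable, "evaluate.py", f"--cfg-path={path}"]
            for path in config_paths]


def run_all(config_dir="configs", dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_eval_commands(config_dir):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Run from the repository root, this prints the commands it would execute; calling run_all(dry_run=False) runs them one after another.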

Citing CaR

@inproceedings{clip_as_rnn,
  title = {CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor},
  author = {Sun, Shuyang and Li, Runjia and Torr, Philip and Gu, Xiuye and Li, Siyang},
  year = {2024},
  booktitle = {CVPR},
}

clip_as_rnn's Issues

How did you get the visual prompts?

Thank you for your excellent work!

How did you get the visual prompts? Are the prompts learnable, or are they annotated by humans?

I would appreciate it if you could answer my questions.

Precision and recall of the remaining labels

Hello! Have you tested the precision and recall of the remaining labels on the COCO benchmark?

Set of referring image segmentation queries

Thanks for your interesting work!!

I could not find how the initial text queries are constructed for referring image segmentation.
As I understand it, open-vocabulary segmentation takes a set of text queries as input, which is why the recurrent filtering of non-existent concepts is necessary. Referring image segmentation, however, takes a single image-text pair as input, so I do not understand how CaR recurrently eliminates irrelevant texts. Knowing the initial text queries for this task would clear this up.

I apologize if this detail is already covered in the paper.

Best regards,

Namyup Kim.

Dataset preparation

Hi there,

Thanks for your amazing work on CaR, and congratulations on the CVPR acceptance!
Will you provide dataset preparation guidance for zero-shot segmentation, referring image segmentation, and referring video segmentation?

Best,
Zhihua
