
Open-Vocabulary DETR with Conditional Matching

This repository contains the implementation of the following paper:

Open-Vocabulary DETR with Conditional Matching
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy
European Conference on Computer Vision (ECCV), 2022

Installation

We use the same environment as Deformable DETR, plus a few additional packages.

We tested our models with python=3.8, pytorch=1.11.0, and cuda=10.1 on 8 NVIDIA V100 32GB GPUs.

Data

Please refer to dataset_prepare.md.

Running the Model

Please refer to run_scripts.md.

Model Zoo

  • Open-vocabulary COCO (AP50 metric)

| Base | Novel | All  | Model        |
|------|-------|------|--------------|
| 61.0 | 29.4  | 52.7 | Google Drive |

Citation

If you find our work useful for your research, please consider citing the paper:

@InProceedings{zang2022open,
  author    = {Zang, Yuhang and Li, Wei and Zhou, Kaiyang and Huang, Chen and Loy, Chen Change},
  title     = {Open-Vocabulary DETR with Conditional Matching},
  booktitle = {European Conference on Computer Vision},
  year      = {2022}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgement

We would like to thank Deformable DETR, CLIP, and ViLD for their open-source projects.

Contact

Please contact Yuhang Zang if you have any questions.


Issues

About training times

Hello,

Could you provide an estimate of the training time with 8 × V100 GPUs?

Thanks.

Several Questions on training and inference

Hello!
This is very interesting work on building open-vocabulary learning with DETR.

We read the paper but have several questions:

1. What does "R" mean, and how is it involved in training and testing? In Fig. 4, it seems that R denotes the number of classes?

2. What is the ground truth during matching (p in Eq. 6), and what is its relation to R? We are very confused by Fig. 3(b).

3. How are novel classes handled during training? How are novel proposals identified, and how can they be used for training when no ground truths are included?

Thanks!

The AP results on COCO?

Hi,
I am Tin, a student who found this work very interesting.
However, when I run the evaluation like this:

CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --dataset_file coco --coco_path ../../Detic/Detic/datasets/coco/ --output_dir ./output/ --num_queries 100 --with_box_refine --two_stage --label_map --eval --resume ../coco_model.pth

[Screenshot: evaluation results]

The results are even better than the results reported in the paper; concretely, compare with your Table 4.

[Screenshot: Table 4 from the paper]

I am perplexed and really need to understand why, so I can make sure I am doing things right before moving on to further steps.

Hope that you will clarify it. Thank you very much for your contribution.

Best,
Tin

How to train the model on customized dataset?

Hi, thank you for your excellent open-source work.

I am trying to train the model on my own dataset.

But I found that a class-agnostic detector needs to be trained first, and features need to be extracted using the CLIP model.

However, no information is provided about the class-agnostic detector. Is there a convenient way for us to train the model on customized datasets?

I am looking forward to receiving your reply!

Training with MOT17

Thank you for your amazing work.
I am trying to train your model on MOT17, but I always hit the error below and I don't know why; I do have a .pkl file for my dataset.
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
Loaded Categories: []
cat2label mapping: {}
Category IDs: []
cat2label mapping: {}
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loaded Categories: []
cat2label mapping: {}
Category IDs: []
cat2label mapping: {}
Start training
All keys in model.clip_feat: dict_keys([1])
Category labels: {}
Current key: 1
Warning: Category ID 1 not found in cat2label
Warning: Category ID -1 not found in clip_feat
Warning: Category ID 59 not found in clip_feat
Warning: Category ID 25 not found in clip_feat
Warning: Category ID 45 not found in clip_feat
Warning: Category ID 37 not found in clip_feat
Warning: Category ID 22 not found in clip_feat
Warning: Category ID 49 not found in clip_feat
Warning: Category ID 54 not found in clip_feat
Warning: Category ID 62 not found in clip_feat
Warning: Category ID 24 not found in clip_feat
Warning: Category ID 27 not found in clip_feat
Warning: Category ID 15 not found in clip_feat

assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
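
For anyone hitting the same wall: the log shows an empty cat2label mapping (the annotation file apparently has no categories) and a clip_feat dictionary whose only key is 1, so every category lookup fails. A minimal sanity check, sketched with hypothetical file paths:

```python
import json
import pickle

# Sketch of a sanity check for this failure mode; the file paths are
# placeholders, not the repo's actual layout.
with open("annotations/train.json") as f:
    ann = json.load(f)
# An empty `categories` list here would explain `cat2label mapping: {}`.
print("category ids in annotations:", [c["id"] for c in ann.get("categories", [])])

with open("clip_feat.pkl", "rb") as f:
    clip_feat = pickle.load(f)
# Only key 1 here would explain the `not found in clip_feat` warnings.
print("keys in clip_feat:", sorted(clip_feat.keys()))

# Training needs every annotation category id to appear in both mappings.
```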

Segmentation annotation error during training Instance Segmentation

Thanks for your implementation.

I have successfully trained object detection on COCO based on your code. But when I continued to train instance segmentation with --mask, I hit this error:

    segmentations = [obj["segmentation"] for obj in anno]
KeyError: 'segmentation'

My workaround is adding an if condition:

segmentations = [obj["segmentation"] for obj in anno if "segmentation" in obj]

But then, at the line

masks = masks[keep]

there is a new error:

IndexError: The shape of the mask [4] at index 0 does not match the shape of the indexed tensor [3, 480, 640] at index 0

So why does the first error appear, and how can I fix it?

Thank you a lot.
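
A likely explanation for the IndexError, sketched under the assumption that the COCO-style loader builds boxes and masks from the same anno list: filtering only the segmentations leaves more boxes (4) than masks (3), so the shared keep index no longer lines up. Filtering whole annotation objects keeps the two aligned:

```python
from typing import Dict, List

def filter_annotations(anno: List[Dict]) -> List[Dict]:
    """Drop annotation objects without a segmentation so that the boxes and
    masks later derived from `anno` have equal length and a shared `keep`
    index applies to both."""
    return [obj for obj in anno if "segmentation" in obj]

# Toy example: one box-only object reproduces the mismatch (4 boxes vs 3
# masks) if only the segmentation list is filtered, as in the workaround.
anno = [
    {"bbox": [0, 0, 10, 10], "segmentation": [[0, 0, 10, 0, 10, 10]]},
    {"bbox": [5, 5, 20, 20]},  # no "segmentation" key
    {"bbox": [1, 1, 4, 4], "segmentation": [[1, 1, 4, 1, 4, 4]]},
    {"bbox": [2, 2, 8, 8], "segmentation": [[2, 2, 8, 2, 8, 8]]},
]
anno = filter_annotations(anno)
boxes = [obj["bbox"] for obj in anno]
segmentations = [obj["segmentation"] for obj in anno]
assert len(boxes) == len(segmentations)
```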

A small question about the baseline of OV-DETR

Hi, thanks for the great work! I have a small question about the experimental claims in Table 2.

There is a performance decline (from 9.5 to 6.3) after introducing novel-class object proposals for self-training (row 2 of Table 2). I am confused by the explanation given in the paper: "Because we do not know the category id of these object proposals, we observe that the label assignment of these object proposals is inaccurate". However, I noticed that specific pseudo-labels for the novel classes are given in "coco_train2017_seen_2_proposal.json", albeit a little noisy.

Moreover, could you please provide the generated proposals for the LVIS dataset, i.e., lvis_train2017_seen_2_proposal.json and configs? I am following your great work and hope to keep a fair comparison with the same data usage.

Thank you so much!

Cannot find category id?

Hi,
When I ran the code, I got this error:

File "../datasets/coco.py", line 181, in
self.cat2label[obj["category_id"]]
KeyError: 64

Do you have any suggestions to fix this bug?
Thanks
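
Pending an answer, a common defensive pattern (a sketch; it assumes cat2label is intentionally restricted to the seen categories, so out-of-vocabulary ids such as 64 should simply be skipped rather than indexed directly):

```python
from typing import Dict, List

def map_category_ids(anno: List[Dict], cat2label: Dict[int, int]) -> List[int]:
    """Map COCO category ids to contiguous labels, skipping ids absent from
    the seen-class mapping instead of raising KeyError. Any boxes/masks built
    from `anno` must be filtered the same way to stay aligned."""
    return [
        cat2label[obj["category_id"]]
        for obj in anno
        if obj["category_id"] in cat2label
    ]

# Example: the offending id 64 is dropped instead of crashing the loader.
print(map_category_ids(
    [{"category_id": 1}, {"category_id": 64}],
    cat2label={1: 0, 2: 1},
))  # -> [0]
```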

Is OV-DETR really open-vocabulary?

Hi @yuhangzang ,

Thanks for your great work! There is something that confuses me a lot. In the function forward_test() of OVDETR, I found that the CLIP-based text query is still required for detection. However, in a real application, do we know the class names of all objects in the test image? If not, how does OV-DETR detect unknown objects?

Thanks.

The results of Table 3 in the paper?

Hi @yuhangzang,
Your work is really great in terms of detecting novel objects in the open world.
One thing I am confused about is how you generate Table 3 in the paper.

[Screenshot: Table 3 from the paper]

I calculated the average precision and recall myself and found that only a few top boxes have high scores. That means the higher the number of queries, the lower the precision; I found that num_queries = 5 gives the best precision, which is quite similar to the Table 3 result you obtained with num_queries = 100.

Could you elaborate on how to reproduce these results?
Best,
Tin

Performance not as expected

Using the code in the repo with the provided pre-trained model, my test results are consistent with the performance in the paper. But when retraining, the metrics are only AP_all = 51.3, AP_seen = 59.7, AP_unseen = 27.6. Also, the setting of the parameter R differs from the paper (3 in the paper vs. 12 in the code). What should be noted when reproducing the performance of the paper?

How to know which class the novel proposals belong to?

In your model, you randomly select CLIP image features according to the proposal's class. However, for novel proposals we only know that they are novel; we do not know which specific novel class they belong to. Did you actually classify these proposals using CLIP? More specifically, how is the "instances_train2017_seen_2_proposal.json" file, which annotates classes even for novel proposals, generated?
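
For context, one plausible way such proposal labels could be produced (a sketch only, not necessarily the authors' pipeline; the class list, image path, and box below are placeholders) is zero-shot classification of each cropped proposal with CLIP:

```python
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder vocabulary; a real pipeline would use the full base+novel list.
class_names = ["airplane", "bus", "cat"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Hypothetical class-agnostic proposal box on a hypothetical image.
x1, y1, x2, y2 = 10, 20, 200, 240
crop = Image.open("image.jpg").crop((x1, y1, x2, y2))
image = preprocess(crop).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    pred = (img_feat @ txt_feat.T).argmax(dim=-1).item()

print("pseudo-label for this proposal:", class_names[pred])
```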

segm head

Thanks for the great work!

You follow DETR in adding an external class-agnostic segmentation head, but it seems that you use C4 (lines 115 and 120) rather than C5 as in DETR. Could you explain this design choice?

Looking forward to your reply! Thanks a lot!

Inference code?

Hi, this is really an excellent idea.
Could you provide the demo code or inference code to see the outcome of the model?

Best,
Tin

Cannot reproduce the reported performance on COCO

Thanks for your great and inspiring work on open-vocabulary detection.

But after directly running the released code, I only get about 20 AP50 for unseen classes and 42 for seen classes, which is a big gap from the reported numbers (29.4 and 61.0). Is there any operation that has not been included in the currently released code?

Thanks.

About “clip_feat.pkl”

Hi!
Thanks for your interesting work on open-vocabulary detection.
I read the paper and tried to run the code, but ran into some trouble. Hoping for your help!

  1. How can I get the file "clip_feat.pkl"? What does it mean?
  2. Where is "self.all_ids" (line 290 of "ovdetr/models/model.py") defined? Does it mean "self.seen_ids" from line 245?
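
While waiting for an answer to the first question, a quick way to inspect the file (a sketch; it assumes clip_feat.pkl is a pickled dict keyed by category id, which the training log in the MOT17 issue above suggests):

```python
import pickle

# Peek at clip_feat.pkl; presumably it maps category ids to precomputed
# CLIP image embeddings (an assumption, not confirmed by the authors).
with open("clip_feat.pkl", "rb") as f:
    clip_feat = pickle.load(f)

print(type(clip_feat))
for k in list(clip_feat)[:5]:
    v = clip_feat[k]
    print(k, getattr(v, "shape", type(v)))
```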

About the selection of text and image conditional inputs?

Hi, Yuhang:

Thanks for your patient replies before! I still have a small question about the code for selecting the text and image conditional inputs.

In your code, clip_query = text_query * mask + img_query * (1 - mask) (line 308 of ovdetr/models/model.py), so the text and image conditional inputs are selected randomly by the mask generated by mask = (torch.rand(len(text_query)) < self.prob).float().unsqueeze(1).to(text_query.device) (line 302 of the same file).

But, as the paper says, the text conditional inputs of novel classes cannot be used during training. So the mask in line 302 needs to be further processed by setting the entries corresponding to novel classes to zero. Am I right?
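
If that reading is right, the adjustment might look like the sketch below. It is a standalone toy, not the repo's code: prob, novel_ids, labels, and the query tensors are all stand-ins for the variables around lines 302–308 of ovdetr/models/model.py.

```python
import torch

# Toy version of the proposed rule: text queries may be sampled for seen
# classes, but novel classes always fall back to CLIP image queries.
prob = 0.5                             # plays the role of self.prob
novel_ids = {5, 36}                    # hypothetical novel-class ids
labels = torch.tensor([1, 36, 2, 5])   # class id of each conditional query
text_query = torch.randn(4, 512)
img_query = torch.randn(4, 512)

mask = (torch.rand(len(text_query)) < prob).float().unsqueeze(1)
is_novel = torch.tensor([int(l) in novel_ids for l in labels]).unsqueeze(1)
mask = mask * (~is_novel).float()      # zero the mask for novel classes

clip_query = text_query * mask + img_query * (1 - mask)
```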

The incomplete files

Hi, it has been 4 months since you last updated the code, but the LVIS-related dataset files have still not been uploaded. Moreover, I can only find the train annotation JSONs, not the val annotation JSONs. Could you upload the related files? Thanks.

Unseen classes in COCO

Hi, I have found that the ground truth of unseen classes in COCO has been used in training; I wonder if I have misunderstood.

IDs of the unseen classes in COCO:
[Screenshot: unseen class ids]

Output of the model; as you can see, 'selected_id' contains '36', which is one of the unseen classes in COCO:
[Screenshot: model output]

During bipartite matching, the ground truth of label 36 is used in computing the cost:
[Screenshot: matching cost computation]
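
A small check one could run to confirm or rule this out (a sketch: unseen_ids is a placeholder set containing only the id 36 flagged above, not the repo's actual split, and targets follows the usual DETR per-image dict format):

```python
import torch

def assert_no_unseen_gt(targets, unseen_ids):
    """Raise if any training target contains an unseen-class label, i.e. if
    unseen ground truth would reach the bipartite-matching cost."""
    for t in targets:
        leaked = [int(l) for l in t["labels"] if int(l) in unseen_ids]
        assert not leaked, f"unseen class ids leaked into training: {leaked}"

# Placeholder ids; this toy target intentionally trips the assertion.
unseen_ids = {36}
targets = [{"labels": torch.tensor([1, 36, 7])}]
assert_no_unseen_gt(targets, unseen_ids)
```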
