
Open-Vocabulary DETR with Conditional Matching

This repository contains the implementation of the following paper:

Open-Vocabulary DETR with Conditional Matching
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy
European Conference on Computer Vision (ECCV), 2022

Installation

We use the same environment as Deformable DETR, plus a few additional packages.

We tested our models with python=3.8, pytorch=1.11.0, and cuda=10.1 on 8 NVIDIA V100 32GB GPUs.

Data

Please refer to dataset_prepare.md.

Running the Model

Please refer to run_scripts.md.

Model Zoo

  • Open-vocabulary COCO (AP50 metric)

| Base | Novel | All  | Model        |
|------|-------|------|--------------|
| 61.0 | 29.4  | 52.7 | Google Drive |

Citation

If you find our work useful for your research, please consider citing the paper:

@InProceedings{zang2022open,
  author    = {Zang, Yuhang and Li, Wei and Zhou, Kaiyang and Huang, Chen and Loy, Chen Change},
  title     = {Open-Vocabulary DETR with Conditional Matching},
  booktitle = {European Conference on Computer Vision},
  year      = {2022}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgement

We would like to thank Deformable DETR, CLIP, and ViLD for their open-source projects.

Contact

Please contact Yuhang Zang if you have any questions.


Issues

About training times

Hello,

Could you provide an estimate of the training time with 8 × V100 GPUs?

Thanks.

Several Questions on training and inference

Hello!
This is very interesting work on building open-vocabulary learning with DETR.

We read the paper but have several questions:

1. What does "R" mean, and how is it involved in training and testing? In Fig. 4, it seems that R denotes the number of classes?

2. What is the ground truth during matching (p in Eq. 6), and what is its relation to R? We are very confused by Fig. 3(b).

3. How are novel classes handled during training? How are novel proposals identified, and how can they be used for training when no ground truths are included?

Thanks!

The AP results on COCO?

Hi,
I am Tin, a student who found this work very interesting.
However, when I run the evaluation like this:

CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --dataset_file coco --coco_path ../../Detic/Detic/datasets/coco/ --output_dir ./output/ --num_queries 100 --with_box_refine --two_stage --label_map --eval --resume ../coco_model.pth

[Screenshot: evaluation results]

The results are even better than the results reported in the paper; concretely, compare with your Table 4.

[Screenshot: Table 4 from the paper]

I am perplexed and really need to understand why, so I can make sure I am doing things right before moving on to further steps.

Hope that you will clarify it. Thank you very much for your contribution.

Best,
Tin

How to train the model on customized dataset?

Hi, thank you for your excellent open-source work.

I am trying to train the model on my own dataset.

But I found that a class-agnostic detector needs to be trained first, and features need to be extracted using the CLIP model.

However, no information is provided about the class-agnostic detector. Is there a convenient way for us to train the model on customized datasets?

I am looking forward to receiving your reply!

Training with MOT17

Thank you for your amazing work.
I am trying to train your model on MOT17, but I always hit the error below and I don't know why; I do have a .pkl file for my dataset.
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
Loaded Categories: []
cat2label mapping: {}
Category IDs: []
cat2label mapping: {}
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loaded Categories: []
cat2label mapping: {}
Category IDs: []
cat2label mapping: {}
Start training
All keys in model.clip_feat: dict_keys([1])
Category labels: {}
Current key: 1
Warning: Category ID 1 not found in cat2label
Warning: Category ID -1 not found in clip_feat
Warning: Category ID 59 not found in clip_feat
Warning: Category ID 25 not found in clip_feat
Warning: Category ID 45 not found in clip_feat
Warning: Category ID 37 not found in clip_feat
Warning: Category ID 22 not found in clip_feat
Warning: Category ID 49 not found in clip_feat
Warning: Category ID 54 not found in clip_feat
Warning: Category ID 62 not found in clip_feat
Warning: Category ID 24 not found in clip_feat
Warning: Category ID 27 not found in clip_feat
Warning: Category ID 15 not found in clip_feat

assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
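
For anyone hitting the same wall: the log shows an empty cat2label mapping (the annotation file apparently has no categories) and a clip_feat dictionary whose only key is 1, so every category lookup fails. A minimal sanity check, sketched with hypothetical file paths:

```python
import json
import pickle

# Sketch of a sanity check for this failure mode; the file paths are
# placeholders, not the repo's actual layout.
with open("annotations/train.json") as f:
    ann = json.load(f)
# An empty `categories` list here would explain `cat2label mapping: {}`.
print("category ids in annotations:", [c["id"] for c in ann.get("categories", [])])

with open("clip_feat.pkl", "rb") as f:
    clip_feat = pickle.load(f)
# Only key 1 here would explain the `not found in clip_feat` warnings.
print("keys in clip_feat:", sorted(clip_feat.keys()))

# Training needs every annotation category id to appear in both mappings.
```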

Segmentation annotation error during training Instance Segmentation

Thanks for your implementation.

I have successfully trained object detection on COCO based on your code. But when I continued to train instance segmentation with --mask, I hit this error:

    segmentations = [obj["segmentation"] for obj in anno]
KeyError: 'segmentation'

My workaround is adding an if condition:

segmentations = [obj["segmentation"] for obj in anno if "segmentation" in obj]

But then, at the line

masks = masks[keep]

there is a new error:

IndexError: The shape of the mask [4] at index 0 does not match the shape of the indexed tensor [3, 480, 640] at index 0

So why does the first error appear, and how can I fix it?

Thank you a lot.
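
A likely explanation for the IndexError, sketched under the assumption that the COCO-style loader builds boxes and masks from the same anno list: filtering only the segmentations leaves more boxes (4) than masks (3), so the shared keep index no longer lines up. Filtering whole annotation objects keeps the two aligned:

```python
from typing import Dict, List

def filter_annotations(anno: List[Dict]) -> List[Dict]:
    """Drop annotation objects without a segmentation so that the boxes and
    masks later derived from `anno` have equal length and a shared `keep`
    index applies to both."""
    return [obj for obj in anno if "segmentation" in obj]

# Toy example: one box-only object reproduces the mismatch (4 boxes vs 3
# masks) if only the segmentation list is filtered, as in the workaround.
anno = [
    {"bbox": [0, 0, 10, 10], "segmentation": [[0, 0, 10, 0, 10, 10]]},
    {"bbox": [5, 5, 20, 20]},  # no "segmentation" key
    {"bbox": [1, 1, 4, 4], "segmentation": [[1, 1, 4, 1, 4, 4]]},
    {"bbox": [2, 2, 8, 8], "segmentation": [[2, 2, 8, 2, 8, 8]]},
]
anno = filter_annotations(anno)
boxes = [obj["bbox"] for obj in anno]
segmentations = [obj["segmentation"] for obj in anno]
assert len(boxes) == len(segmentations)
```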

A small question about the baseline of OV-DETR

Hi, thanks for the great work! I have a small question about the experimental claims in Table 2.

There is a performance decline (from 9.5 to 6.3) after introducing novel-class object proposals for self-training (row 2 of Table 2). I am confused by the explanation given in the paper: "Because we do not know the category id of these object proposals, we observe that the label assignment of these object proposals is inaccurate". However, I noticed that specific pseudo-labels for the novel classes are given in "coco_train2017_seen_2_proposal.json", albeit a little noisy.

Moreover, could you please provide the generated proposals for the LVIS dataset, i.e., lvis_train2017_seen_2_proposal.json and configs? I am following your great work and hope to keep a fair comparison with the same data usage.

Thank you so much!

Cannot find category id?

Hi,
When I ran the code, I got this error:

File "../datasets/coco.py", line 181, in
self.cat2label[obj["category_id"]]
KeyError: 64

Do you have any suggestions to fix this bug?
Thanks
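
Pending an answer, a common defensive pattern (a sketch; it assumes cat2label is intentionally restricted to the seen categories, so out-of-vocabulary ids such as 64 should simply be skipped rather than indexed directly):

```python
from typing import Dict, List

def map_category_ids(anno: List[Dict], cat2label: Dict[int, int]) -> List[int]:
    """Map COCO category ids to contiguous labels, skipping ids absent from
    the seen-class mapping instead of raising KeyError. Any boxes/masks built
    from `anno` must be filtered the same way to stay aligned."""
    return [
        cat2label[obj["category_id"]]
        for obj in anno
        if obj["category_id"] in cat2label
    ]

# Example: the offending id 64 is dropped instead of crashing the loader.
print(map_category_ids(
    [{"category_id": 1}, {"category_id": 64}],
    cat2label={1: 0, 2: 1},
))  # -> [0]
```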

Is OV-DETR really open-vocabulary?

Hi @yuhangzang ,

Thanks for your great work! There is something that confuses me a lot. In the function forward_test() of OVDETR, I found that the CLIP-based text query is still required for detection. However, in a real application, do we know the class names of all objects in the test image? If not, how does OV-DETR detect unknown objects?

Thanks.

The results of Table 3 in the paper?

Hi @yuhangzang,
Your work is really great in terms of detecting novel objects in the open world.
One thing I am confused about is how you generate Table 3 in the paper.

[Screenshot: Table 3 from the paper]

I calculated the average precision and recall myself and found that only a few top boxes have high scores. That means the higher the number of queries, the lower the precision; I found that num_queries = 5 gives the best precision, which is quite similar to the Table 3 result you obtained with num_queries = 100.

Could you elaborate on how to reproduce these results?
Best,
Tin

Performance not as expected

Using the code in the repo with the provided pre-trained model, my test results are consistent with the performance in the paper. But when retraining, the metrics are only AP_all = 51.3, AP_seen = 59.7, AP_unseen = 27.6. Also, the setting of the parameter R differs from the paper (3 in the paper vs. 12 in the code). What should be noted when reproducing the performance of the paper?

How to know which class the novel proposals belong to?

In your model, you randomly select CLIP image features according to the proposal's class. However, for novel proposals we only know that they are novel; we do not know which specific novel class they belong to. Did you actually classify these proposals using CLIP? More specifically, how is the "instances_train2017_seen_2_proposal.json" file, which annotates classes even for novel proposals, generated?
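
For context, one plausible way such proposal labels could be produced (a sketch only, not necessarily the authors' pipeline; the class list, image path, and box below are placeholders) is zero-shot classification of each cropped proposal with CLIP:

```python
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder vocabulary; a real pipeline would use the full base+novel list.
class_names = ["airplane", "bus", "cat"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Hypothetical class-agnostic proposal box on a hypothetical image.
x1, y1, x2, y2 = 10, 20, 200, 240
crop = Image.open("image.jpg").crop((x1, y1, x2, y2))
image = preprocess(crop).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    pred = (img_feat @ txt_feat.T).argmax(dim=-1).item()

print("pseudo-label for this proposal:", class_names[pred])
```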

segm head

Thanks for the great work!

You follow DETR in adding an external class-agnostic segmentation head, but it seems that you use C4 (lines 115 and 120) rather than C5 as in DETR. Could you explain this design choice?

Looking forward to your reply! Thanks a lot!

Inference code?

Hi, this is really an excellent idea.
Could you provide the demo code or inference code to see the outcome of the model?

Best,
Tin

Cannot reproduce the reported performance on COCO

Thanks for your great and inspiring work on open-vocabulary detection.

But after directly running the released code, I only get about 20 AP50 for unseen classes and 42 for seen classes, which is a big gap from the reported numbers (29.4 and 61.0). Is there any operation that has not been included in the currently released code?

Thanks.

About “clip_feat.pkl”

Hi!
Thanks for your interesting work on open-vocabulary detection.
I read the paper and tried to run the code, but ran into some trouble. Hoping for your help!

  1. How can I get the file "clip_feat.pkl"? What does it mean?
  2. Where is "self.all_ids" (line 290 of "ovdetr/models/model.py") defined? Does it mean "self.seen_ids" from line 245?
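
While waiting for an answer to the first question, a quick way to inspect the file (a sketch; it assumes clip_feat.pkl is a pickled dict keyed by category id, which the training log in the MOT17 issue above suggests):

```python
import pickle

# Peek at clip_feat.pkl; presumably it maps category ids to precomputed
# CLIP image embeddings (an assumption, not confirmed by the authors).
with open("clip_feat.pkl", "rb") as f:
    clip_feat = pickle.load(f)

print(type(clip_feat))
for k in list(clip_feat)[:5]:
    v = clip_feat[k]
    print(k, getattr(v, "shape", type(v)))
```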

About the selection of text and image conditional inputs?

Hi, Yuhang:

Thanks for your patient replies before! I still have a small question about the code for selecting the text and image conditional inputs.

In your code, clip_query = text_query * mask + img_query * (1 - mask) (line 308 of ovdetr/models/model.py), so the text and image conditional inputs are selected randomly by the mask generated by mask = (torch.rand(len(text_query)) < self.prob).float().unsqueeze(1).to(text_query.device) (line 302 of the same file).

But, as the paper says, the text conditional inputs of novel classes cannot be used during training. So the mask in line 302 needs to be further processed by setting the entries corresponding to novel classes to zero. Am I right?
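
If that reading is right, the adjustment might look like the sketch below. It is a standalone toy, not the repo's code: prob, novel_ids, labels, and the query tensors are all stand-ins for the variables around lines 302–308 of ovdetr/models/model.py.

```python
import torch

# Toy version of the proposed rule: text queries may be sampled for seen
# classes, but novel classes always fall back to CLIP image queries.
prob = 0.5                             # plays the role of self.prob
novel_ids = {5, 36}                    # hypothetical novel-class ids
labels = torch.tensor([1, 36, 2, 5])   # class id of each conditional query
text_query = torch.randn(4, 512)
img_query = torch.randn(4, 512)

mask = (torch.rand(len(text_query)) < prob).float().unsqueeze(1)
is_novel = torch.tensor([int(l) in novel_ids for l in labels]).unsqueeze(1)
mask = mask * (~is_novel).float()      # zero the mask for novel classes

clip_query = text_query * mask + img_query * (1 - mask)
```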

The incomplete files

Hi, it has been 4 months since you last updated the code, but the LVIS-related dataset files have still not been uploaded. Moreover, I can only find the train annotation JSONs, not the val annotation JSONs. Could you upload the related files? Thanks.

Unseen classes in COCO

Hi, I have found that the ground truth of unseen classes in COCO has been used in training; I wonder if I have misunderstood.

IDs of the unseen classes in COCO:
[Screenshot: unseen class ids]

Output of the model; as you can see, 'selected_id' contains '36', which is one of the unseen classes in COCO:
[Screenshot: model output]

During bipartite matching, the ground truth of label 36 is used in computing the cost:
[Screenshot: matching cost computation]
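
A small check one could run to confirm or rule this out (a sketch: unseen_ids is a placeholder set containing only the id 36 flagged above, not the repo's actual split, and targets follows the usual DETR per-image dict format):

```python
import torch

def assert_no_unseen_gt(targets, unseen_ids):
    """Raise if any training target contains an unseen-class label, i.e. if
    unseen ground truth would reach the bipartite-matching cost."""
    for t in targets:
        leaked = [int(l) for l in t["labels"] if int(l) in unseen_ids]
        assert not leaked, f"unseen class ids leaked into training: {leaked}"

# Placeholder ids; this toy target intentionally trips the assertion.
unseen_ids = {36}
targets = [{"labels": torch.tensor([1, 36, 7])}]
assert_no_unseen_gt(targets, unseen_ids)
```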
