
vldet's Introduction

VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection

Learning Object-Language Alignments for Open-Vocabulary Object Detection,
Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai,
ICLR 2023 (https://arxiv.org/abs/2211.14843)

Highlight

We are excited to announce that our paper was accepted to ICLR 2023! 🥳🥳🥳

A quick explanatory video demo for VLDet

vldet_demo.mp4

Performance

Open-Vocabulary on COCO

Open-Vocabulary on LVIS

Installation

Requirements

  • Linux or macOS with Python ≥ 3.7
  • PyTorch ≥ 1.9 and a matching torchvision. Install them together at pytorch.org. Note: please check that the PyTorch version matches the one required by Detectron2.
  • Detectron2: follow Detectron2 installation instructions.

Example conda environment setup

conda create --name VLDet python=3.7 -y
conda activate VLDet
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia

# under your working directory

git clone https://github.com/clin1223/VLDet.git
cd VLDet
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt

Features

  • Directly learn an open-vocabulary object detector from image-text pairs by formulating the task as a bipartite matching problem (see the sketch after this list).

  • State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.

  • Scaling and extending novel object vocabulary easily.
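The region-word matching mentioned in the first feature above can be illustrated with a small sketch. The snippet below is not VLDet code; it only shows, with random stand-in features and SciPy's Hungarian solver, how alignment scores S = WR⊤ can be turned into a cost matrix (by negation) and matched one-to-one.

# Illustrative sketch of region-word bipartite matching (not VLDet code).
# W stands in for caption word embeddings, R for region embeddings; the
# Hungarian solver minimizes cost, so the alignment scores are negated.
import numpy as np
from scipy.optimize import linear_sum_assignment

W = np.random.randn(5, 512)    # (num_words, dim) word embeddings
R = np.random.randn(20, 512)   # (num_regions, dim) region embeddings

S = W @ R.T                    # alignment scores, shape (num_words, num_regions)
word_idx, region_idx = linear_sum_assignment(-S)   # maximize total alignment
for w, r in zip(word_idx, region_idx):
    print(f"word {w} -> region {r}, score {S[w, r]:.2f}")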

Benchmark evaluation and training

Please first prepare datasets.

The VLDet models are finetuned from the corresponding Box-Supervised models (indicated by MODEL.WEIGHTS in the config files). Please train or download the Box-Supervised models and place them under VLDet_ROOT/models/ before training the VLDet models.

To train a model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

To evaluate a trained or pretrained model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth

Download the trained network weights here.

OV_COCO          box mAP50      box mAP50_novel
config_RN50      45.8           32.0

OV_LVIS          mask mAP_all   mask mAP_novel
config_RN50      30.1           21.7
config_Swin-B    38.1           26.3

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{VLDet,
  title={Learning Object-Language Alignments for Open-Vocabulary Object Detection},
  author={Lin, Chuang and Sun, Peize and Jiang, Yi and Luo, Ping and Qu, Lizhen and Haffari, Gholamreza and Yuan, Zehuan and Cai, Jianfei},
  journal={arXiv preprint arXiv:2211.14843},
  year={2022}
}

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgement

This repository was built on top of Detectron2, Detic, RegionCLIP and OVR-CNN. We thank them for their hard work.


vldet's Issues

There is a warning.

[03/07 20:19:00 d2.data.common]: Serialized dataset takes 424.32 MiB
[03/07 20:19:02 detectron2]: Starting training from iteration 0
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[03/07 20:19:45 d2.utils.events]: eta: 14:33:54 iter: 20 total_loss: 0.9773 caption_loss: 0.1988 loss_box_reg: 0.1348 loss_cls: 0.109 loss_rpn_cls: 0.08208 loss_rpn_loc: 0.06292 ot_loss: 0.3213 time: 0.5014 data_time: 1.6379 lr: 0.00039962 max_mem: 5244M.
...Processing...

I am not sure if it is normal.
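For reference, the warning refers to the find_unused_parameters flag of PyTorch's DistributedDataParallel. The sketch below is not VLDet's exact training code; it only shows where the flag lives and what turning it off would look like, assuming every parameter really does receive a gradient each iteration.

# Minimal DDP wrapping sketch (not VLDet's exact code). With
# find_unused_parameters=True, DDP walks the autograd graph every iteration;
# if no parameter is ever unused, the flag only adds overhead.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: nn.Module, device_id: int) -> DDP:
    return DDP(
        model,
        device_ids=[device_id],
        find_unused_parameters=False,  # assumption: all parameters get gradients
    )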

What's the meaning for alignment scores?

In Section 3.2, the cost between image regions and words is defined as the alignment scores S = WR⊤. The bipartite matching problem can then be efficiently solved by the off-the-shelf Hungarian Algorithm.

The input of the Hungarian algorithm is a cost matrix. So does that mean the alignment score S is the cost matrix? But then, why do you call WR⊤ an alignment score?

image and text retrieval

Does VLDet support image and text retrieval? For example, my purpose is to give a text query and retrieve the best-matching image. If the model supports it, should I use the image embedding or each instance embedding? As far as I understand, should I use
proj_x = self.linear(input_x) [VLDet/vldet/modeling/roi_heads/zero_shot_classifier.py, line 98] as the image/instance embedding?
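For what it's worth, a generic retrieval recipe with such embeddings looks like the sketch below. It is not part of the VLDet codebase; the function and tensor names are hypothetical, and it simply ranks candidate image (or instance) embeddings against a text embedding by cosine similarity.

# Hypothetical retrieval sketch (not VLDet code): rank image/instance
# embeddings against a text embedding by cosine similarity.
import torch
import torch.nn.functional as F

def rank_by_similarity(text_emb: torch.Tensor, visual_embs: torch.Tensor) -> torch.Tensor:
    # text_emb: (D,), visual_embs: (N, D); returns indices sorted best-first
    text_emb = F.normalize(text_emb, dim=-1)
    visual_embs = F.normalize(visual_embs, dim=-1)
    sims = visual_embs @ text_emb          # (N,) cosine similarities
    return sims.argsort(descending=True)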

Cannot replicate the final result.

Hi, I ran 'python train_net.py --num-gpus 3 --config-file configs/VLDet_OVCOCO_CLIP_R50_1x_caption.yaml'.
Here is the result, which is lower than yours.

[03/08 12:51:07] d2.evaluation.coco_evaluation INFO: Seen bbox AP50: 45.93429908569812
[03/08 12:51:07] d2.evaluation.coco_evaluation INFO: Unseen bbox AP50: 28.641015836555734
[03/08 12:51:07] detectron2 INFO: Evaluation results for coco_generalized_del_val in csv format:
[03/08 12:51:07] d2.evaluation.testing INFO: copypaste: Task: bbox
[03/08 12:51:07] d2.evaluation.testing INFO: copypaste: AP,AP50,AP75,APs,APm,APl
[03/08 12:51:07] d2.evaluation.testing INFO: copypaste: 24.7339,41.4114,25.6790,11.3174,27.5064,34.1043

Batch size: 24
GPUs: 3

Could you give me some suggestions?

There is no TEST division in LVIS?

Great work! How can I evaluate a pretrained or trained model on the LVIS (CC3M) dataset? For example, the configuration file "VLDet_LbaseCCcap_CLIP_R5021k_640b64_2x_ft4x_caption.yaml" contains DATASETS: TRAIN: ("lvis_v1_train_norare", "cc3m_v1_nouns_train_6250tags"). There is no corresponding TEST entry in it; is this correct? Can this correctly evaluate the model trained on LVIS? Thanks!

Cannot get the dataset.

You suggest following OVR-CNN to create the open-vocabulary COCO split.
However, I cannot get
coco/
  zero-shot/
    instances_train2017_seen_2.json
    instances_val2017_all_2.json

Besides, I didn't see 'coco_65_concepts.txt' in your Google Drive.

Could you provide these files? Thank you very much.

How did you train your Region Proposal Network?

Hi, thank you for your amazing work!
I want to know how you trained your Region Proposal Network.
In section 1, you said, "We introduce an open-vocabulary object detector method to learn object-language alignments directly from image-text pair data." It sounds like you didn't use any annotation bounding boxes.
However, in Section 3.1, you said, 'our goal is to build an object detector, trained on a dataset with base-class bounding box annotations and a dataset of image-caption pairs 〈I, C〉 associated with a large vocabulary C_open'. It sounds like some bounding boxes are used for supervision.

This confused me a lot. In my opinion, you probably use the ground-truth bounding boxes of the base classes to train the RPN.

Kindly look forward to your reply. Thank you very much.

TypeError: __init__() got an unexpected keyword argument 'default_cfg'

python train_net.py --num-gpus 8 --config-file configs/VLDet_LbaseCCcap_CLIP_R5021k_640b64_2x_ft4x_caption.yaml --eval-only MODEL.WEIGHTS models/lvis_vldet.pth
[03/06 02:12:52] timm.models.helpers WARNING: No pretrained configuration specified for resnet50_in21k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Traceback (most recent call last):
  File "train_net.py", line 269, in <module>
    launch(
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 230, in main
    model = build_model(cfg)
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/modeling/meta_arch/build.py", line 22, in build_model
    model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/config/config.py", line 189, in wrapped
    explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
    ret = from_config_func(*args, **kwargs)
  File "/content/drive/MyDrive/VLDet/vldet/modeling/meta_arch/custom_rcnn.py", line 67, in from_config
    ret = super().from_config(cfg)
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 73, in from_config
    backbone = build_backbone(cfg)
  File "/content/drive/MyDrive/VLDet/detectron2/detectron2/modeling/backbone/build.py", line 31, in build_backbone
    backbone = BACKBONE_REGISTRY.get(backbone_name)(cfg, input_shape)
  File "/content/drive/MyDrive/VLDet/vldet/modeling/backbone/timm.py", line 171, in build_p67_timm_fpn_backbone
    bottom_up = build_timm_backbone(cfg, input_shape)
  File "/content/drive/MyDrive/VLDet/vldet/modeling/backbone/timm.py", line 158, in build_timm_backbone
    model = TIMM(
  File "/content/drive/MyDrive/VLDet/vldet/modeling/backbone/timm.py", line 114, in __init__
    self.base = create_timm_resnet(
  File "/content/drive/MyDrive/VLDet/vldet/modeling/backbone/timm.py", line 73, in create_timm_resnet
    return build_model_with_cfg(
  File "/usr/local/lib/python3.8/dist-packages/timm/models/helpers.py", line 537, in build_model_with_cfg
    model = model_cls(**kwargs) if model_cfg is None else model_cls(cfg=model_cfg, **kwargs)
  File "/content/drive/MyDrive/VLDet/vldet/modeling/backbone/timm.py", line 31, in __init__
    super().__init__(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'default_cfg'
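A note on a likely cause: the unexpected default_cfg keyword and the pretrained_cfg warning above suggest a timm version newer than the one VLDet's backbone wrapper was written against (timm renamed default_cfg to pretrained_cfg around version 0.6). One possible workaround, offered here as an assumption rather than an official fix, is to pin an older timm release (e.g. pip install "timm<0.6") after checking the installed version:

# Quick check of the installed timm version (sketch, not VLDet code).
import timm
print(timm.__version__)   # if >= 0.6, the pre-0.6 `default_cfg` API is gone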

Inference code

Does VLDet support a simple script where I supply an image and a vocabulary file (and possibly embeddings) and get bounding boxes or segmentation masks as the output?
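The repository does not ship a dedicated demo script as far as this README shows, but the general Detectron2 inference pattern is sketched below. It is not an official VLDet script: it uses a stock model-zoo config so the snippet runs as-is; for VLDet, the repo's custom config keys (set up in train_net.py), its config file, and the trained weights from the table above would replace the placeholders.

# Generic Detectron2 inference pattern (not an official VLDet demo script);
# shown with a stock model-zoo config so the snippet runs as-is.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

image = cv2.imread("input.jpg")
outputs = predictor(image)
print(outputs["instances"].pred_boxes)
print(outputs["instances"].scores)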

Memory Leak problem.

Hi,

I am using pytorch==1.10.0 and A100x8 or V100x8 and VLDet_LbaseCCcap_CLIP_R5021k_640b64_2x_ft4x_caption.yaml config.

The GPU memory usage keeps on increasing after each iteration.

The increase is about 8-10 MB per iteration.

As far as I can tell, the increase occurs at

scaler.scale(losses).backward()

and

The process is finally killed at

word_features = torch.zeros_like(zs_weight)

Do you have any idea about this?
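A small diagnostic that may help narrow this down (offered as a generic sketch, not VLDet code): logging allocated and peak GPU memory each iteration, e.g. right after the backward call mentioned above, makes the per-iteration growth explicit.

# Generic GPU-memory logging sketch (not VLDet code); call it once per
# iteration, e.g. right after scaler.scale(losses).backward().
import torch

def log_gpu_memory(iteration: int) -> None:
    allocated = torch.cuda.memory_allocated() / 2**20      # MiB currently allocated
    peak = torch.cuda.max_memory_allocated() / 2**20        # MiB peak since start
    print(f"iter {iteration}: allocated {allocated:.1f} MiB, peak {peak:.1f} MiB")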

How to get `datasets/cc3m/train_image_info_tags.json`

Dear authors,

Thanks for presenting such great work. I'm very interested in this method and am trying to reproduce the results on my own, but I'm confused by the data preparation for Conceptual Captions.

According to this doc, it seems there are missing steps from datasets/cc3m/train_image_info.json to datasets/cc3m/train_image_info_tags.json, so python tools/get_tags_for_VLDet_concepts.py does not work.

BTW, could you provide some description of train_image_info.json and train_image_info_tags.json? They are a bit confusing, and I'm wondering where they are used in training.

The time required for training

Thanks for your great work. I would like to ask how long training on the COCO dataset takes and how many computing resources it requires (e.g., 8 GPUs?). Thank you.
