
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

Home Page: https://arxiv.org/abs/2303.05499

License: Apache License 2.0

Python 79.26% C++ 1.98% Cuda 17.57% Jupyter Notebook 0.89% Dockerfile 0.30%
object-detection open-world open-world-detection vision-language vision-language-transformer

groundingdino's Introduction

🦕 Grounding DINO


IDEA-CVR, IDEA-Research

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang📧.

[Paper] [Demo] [BibTex]

PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.

🌞 Helpful Tutorial

✨ Highlight Projects

💡 Highlight

  • Open-Set Detection. Detect everything with language!
  • High Performance. COCO zero-shot 52.5 AP (training without COCO data!). COCO fine-tune 63.0 AP.
  • Flexible. Collaborates with Stable Diffusion for image editing.

🔥 News

  • 2023/07/18: We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
  • 2023/06/17: We provide an example to evaluate Grounding DINO's zero-shot performance on COCO.
  • 2023/04/15: Refer to CV in the Wild Readings if you are interested in open-set recognition!
  • 2023/04/08: We release demos that combine Grounding DINO with GLIGEN for more controllable image editing.
  • 2023/04/08: We release demos that combine Grounding DINO with Stable Diffusion for image editing.
  • 2023/04/06: We build a new demo, Grounded-Segment-Anything, by marrying Grounding DINO with Segment-Anything to support segmentation in Grounding DINO.
  • 2023/03/28: A YouTube video about Grounding DINO and basic object detection prompt engineering. [SkalskiP]
  • 2023/03/28: Add a demo on Hugging Face Space!
  • 2023/03/27: Support CPU-only mode. Now the model can run on machines without GPUs.
  • 2023/03/25: A demo for Grounding DINO is available at Colab. [SkalskiP]
  • 2023/03/22: Code is available Now!
[Figures: paper introduction (Description), ODinW results, and marrying Grounding DINO with GLIGEN (gd_gligen).]

⭐ Explanations/Tips for Grounding DINO Inputs and Outputs

  • Grounding DINO accepts an (image, text) pair as input.
  • It outputs 900 (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
  • By default, we keep the boxes whose highest similarity is above a box_threshold (see the sketch after this list).
  • We extract the words whose similarities are higher than the text_threshold as predicted labels.
  • If you want to obtain objects for specific phrases, like the dogs in the sentence "two dogs with a stick.", you can select the boxes with the highest text similarity to dogs as the final outputs.
  • Note that each word can be split into more than one token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
  • We suggest separating different category names with . for Grounding DINO.
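
The following is a minimal sketch of the thresholding logic described in the list above. It is illustrative only; the tensor names and shapes are assumptions, not the repository's actual implementation.

import torch

def filter_outputs(logits: torch.Tensor, boxes: torch.Tensor,
                   box_threshold: float = 0.35, text_threshold: float = 0.25):
    # logits: (num_boxes, num_text_tokens) similarity scores; boxes: (num_boxes, 4).
    scores = logits.max(dim=1).values        # highest similarity per box
    keep = scores > box_threshold            # box-level filtering
    kept_logits, kept_boxes = logits[keep], boxes[keep]
    # Token-level mask per kept box: tokens above text_threshold form the predicted label.
    token_masks = kept_logits > text_threshold
    return kept_boxes, scores[keep], token_masks

# Example with random scores, just to show the shapes involved (900 boxes, 8 text tokens).
boxes, scores, token_masks = filter_outputs(torch.rand(900, 8), torch.rand(900, 4))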

🏷️ TODO

  • Release inference code and demo.
  • Release checkpoints.
  • Grounding DINO with Stable Diffusion and GLIGEN demos.
  • Release training codes.

🛠️ Install

Note:

  1. If you have a CUDA environment, please make sure the environment variable CUDA_HOME is set. The package will be compiled in CPU-only mode if CUDA is not available.

Please follow the installation steps strictly; otherwise, the program may raise:

NameError: name '_C' is not defined

If this happens, please reinstall GroundingDINO: re-clone the repository and run all the installation steps again.

How to check CUDA:

echo $CUDA_HOME

If it prints nothing, it means you haven't set up the path.

Run the following so the environment variable is set in the current shell:

export CUDA_HOME=/path/to/cuda-11.3

Note that the CUDA version should match your CUDA runtime, since multiple CUDA toolkits may coexist on the same machine.

If you want to set the CUDA_HOME permanently, store it using:

echo 'export CUDA_HOME=/path/to/cuda' >> ~/.bashrc

After that, source the ~/.bashrc file and check CUDA_HOME:

source ~/.bashrc
echo $CUDA_HOME

In this example, /path/to/cuda-11.3 should be replaced with the path where your CUDA toolkit is installed. You can find this by typing which nvcc in your terminal:

For instance, if the output is /usr/local/cuda/bin/nvcc, then:

export CUDA_HOME=/usr/local/cuda

Installation:

  1. Clone the GroundingDINO repository from GitHub.

git clone https://github.com/IDEA-Research/GroundingDINO.git

  2. Change the current directory to the GroundingDINO folder.

cd GroundingDINO/

  3. Install the required dependencies in the current directory.

pip install -e .

  4. Download the pre-trained model weights.
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
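
As an optional sanity check after installation (assuming the steps above completed without errors), you can verify that PyTorch sees your GPU and that the compiled custom ops import cleanly; a missing _C module is what the NameError above refers to.

import torch
print("CUDA available:", torch.cuda.is_available())

try:
    from groundingdino import _C  # compiled C++/CUDA ops built by `pip install -e .`
    print("GroundingDINO custom ops loaded")
except ImportError:
    print("Custom ops missing; expect CPU-only mode or the '_C' error described above")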

▶️ Demo

Check your GPU ID (only if you're using a GPU)

nvidia-smi

Replace {GPU ID}, image_you_want_to_detect.jpg, and "dir you want to save the output" with appropriate values in the following command:

CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair"
 [--cpu-only] # add this flag to run in CPU-only mode

If you would like to specify the phrases to detect, here is a demo:

CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p ./groundingdino_swint_ogc.pth \
-i .asset/cat_dog.jpeg \
-o logs/1111 \
-t "There is a cat and a dog in the image ." \
--token_spans "[[[9, 10], [11, 14]], [[19, 20], [21, 24]]]"
 [--cpu-only] # add this flag to run in CPU-only mode

The token_spans argument specifies the start and end character positions of phrases. For example, the first phrase is [[9, 10], [11, 14]]: "There is a cat and a dog in the image ."[9:10] is 'a' and "There is a cat and a dog in the image ."[11:14] is 'cat', so it refers to the phrase a cat. Similarly, [[19, 20], [21, 24]] refers to the phrase a dog.
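
To avoid counting characters by hand, a small helper like the following can compute the spans for a phrase. It is an illustrative sketch (phrase_to_spans is not part of the repository) and assumes each word of the phrase appears in order in the caption.

def phrase_to_spans(caption: str, phrase: str):
    # Return [start, end) character spans for each word of `phrase` inside `caption`,
    # in the format expected by --token_spans.
    spans = []
    cursor = caption.find(phrase)
    if cursor == -1:
        raise ValueError(f"phrase {phrase!r} not found in caption")
    for word in phrase.split():
        start = caption.index(word, cursor)
        end = start + len(word)
        spans.append([start, end])
        cursor = end
    return spans

caption = "There is a cat and a dog in the image ."
print(phrase_to_spans(caption, "a cat"))  # [[9, 10], [11, 14]]
print(phrase_to_spans(caption, "a dog"))  # [[19, 20], [21, 24]]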

See demo/inference_on_a_image.py for more details.

Running with Python:

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "weights/dog-3.jpeg"
TEXT_PROMPT = "chair . person . dog ."
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image.jpg", annotated_frame)
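
As a follow-up, if you need pixel coordinates, the boxes returned above can be converted with torchvision. This sketch assumes the boxes are normalized cxcywh (the format the annotate helper consumes); verify this against your version of the repository.

import torch
from torchvision.ops import box_convert

h, w, _ = image_source.shape
xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy")
for phrase, box, score in zip(phrases, xyxy.tolist(), logits.tolist()):
    print(f"{phrase}: {[round(v, 1) for v in box]} (score {score:.2f})")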

Web UI

We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See the file demo/gradio_app.py for more details.

Notebooks

COCO Zero-shot Evaluations

We provide an example to evaluate Grounding DINO's zero-shot performance on COCO. The result should be 48.5 box AP.

CUDA_VISIBLE_DEVICES=0 \
python demo/test_ap_on_coco.py \
 -c groundingdino/config/GroundingDINO_SwinT_OGC.py \
 -p weights/groundingdino_swint_ogc.pth \
 --anno_path /path/to/annotations/instances_val2017.json \
 --image_dir /path/to/val2017
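
For reference, the number the script reports is standard COCO box AP. A minimal sketch of the same evaluation with pycocotools, assuming you have dumped the detections to a results.json in COCO result format:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("/path/to/annotations/instances_val2017.json")  # ground truth
coco_dt = coco_gt.loadRes("results.json")                      # your detections
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # the first printed AP line is the box AP quoted above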

🧳 Checkpoints

name | backbone | Data | box AP on COCO | Checkpoint | Config
1 | GroundingDINO-T | Swin-T | O365, GoldG, Cap4M | 48.4 (zero-shot) / 57.2 (fine-tune) | GitHub link / HF link | link
2 | GroundingDINO-B | Swin-B | COCO, O365, GoldG, Cap4M, OpenImage, ODinW-35, RefCOCO | 56.7 | GitHub link / HF link | link

🎖️ Results

COCO Object Detection Results
ODinW Object Detection Results
Marrying Grounding DINO with Stable Diffusion for Image Editing: see our example notebook for more details.
Marrying Grounding DINO with GLIGEN for More Detailed Image Editing: see our example notebook for more details.

🦕 Model: Grounding DINO

The model includes a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.


♥️ Acknowledgement

Our model is related to DINO and GLIP. Thanks for their great work!

We also thank great previous work including DETR, Deformable DETR, SMCA, Conditional DETR, Anchor DETR, Dynamic DETR, DAB-DETR, DN-DETR, etc. More related work is available at Awesome Detection Transformer. A new toolbox, detrex, is available as well.

Thanks Stable Diffusion and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{liu2023grounding,
  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
  journal={arXiv preprint arXiv:2303.05499},
  year={2023}
}

groundingdino's People

Contributors

ahmedosman2001, ashwinunnikrishnan, bing-su, csuastt, deniz-birlikci, eltociear, everloom-129, gathierry, georgepearse, haoliuhust, haoran-hash, hardikdava, jishnujp-vp, junxnone, karimumar98, kazutomurase, luca-medeiros, mhd-medfa, pooya-mohammadi, rentainhe, sdy623, skalskip, slongliu, teenaxta, zvant


groundingdino's Issues

TensorRT with GroundingDINO

Hi, I am a little stuck on how to use TensorRT to speed up GroundingDINO inference. GroundingDINO takes in both an image and text prompt and I am a bit lost on how to convert the text prompt to tensor. Can someone please give me some example code or suggestions on how to make it work? Thank you!

how to install the repo ?

Hi,
As the title says, could I install it with pip? For example:
pip install git+https://github.com/IDEA-Research/GroundingDINO

training code

Thanks a lot for releasing the code!
Will the training code be released, please?

Thanks a lot again!

When I install Grounding DINO, the terminal reports this error

When I install Grounding DINO, the terminal reports this error:

PS D:\kitch-python\Grounded-Segment-Anything-main> python -m pip install -e GroundingDINO
Obtaining file:///D:/kitch-python/Grounded-Segment-Anything-main/GroundingDINO
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [24 lines of output]
Traceback (most recent call last):
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 132, in get_requires_for_build_editable
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 447, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 485, in run_setup
self).run_setup(setup_script=setup_script)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "", line 27, in
ModuleNotFoundError: No module named 'torch'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip

Question adding more training data

I have a couple of questions

  1. Is there a Slack or Discord for work in CV, etc.?
  2. Is it possible to add more training data? If so is there an example?

Thanks,
Thejas

About referring expression grounding

Thanks for the great work!

When I try the grounding demo file, I find that a sentence (e.g. "a man in blue coat") will be split into multiple phrases (e.g. "a man", "blue coat") and the model will predict multiple boxes corresponding to the phrases. In fact, the model is expected to generate just one box ("a man"), just like in the Referring Expression Grounding task. It seems this model can't handle this task, or do I need to adjust some hyper-parameters?

But if this model cuts the given sentence into multiple phrases, how is it tested on the RefCOCO dataset, and how does it achieve impressive performance there?

I'm a little confused about it.

Thanks!

Running in CPU-only mode! GPU cannot work!

I try this:
CUDA_VISIBLE_DEVICES=0 python demo/inference_on_a_image.py
-c /home/incar/tms/source/GroundingDINO/groundingdino/config/GroundingDINO_SwinB.cfg.py
-p groundingdino_swinb_cogcoor.pth
-i img/001.jpg
-o "outputs/0"
-t "animal,bird"

and I got this:
source/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!

my pytorch version:

'1.11.0+cu113'

So, why?

Evaluation code

Thank you for your excellent work. I wonder if you have a plan to release the evaluation code.

Difference between GroundingDINO and UniDetector.

I am new to open-set object detection. I am trying to understand the difference between your Grounding DINO and UniDetector. Both can perform open-set detection. I think the difference may be that, for each dataset, you need to prompt the novel labels into the network for zero-shot detection, while UniDetector takes many prompts as input, not limited to the labels of each dataset. Besides, could you provide more insight into the contributions of Grounding DINO? The answer would be helpful to me. Thanks!

nvcc cannot find cl.exe

When installing this project on Windows, nvcc could not find cl.exe while executing the command, but my environment variables are all fine and I have tested them. Why does this happen?

Error installing GroundingDINO via pip git due to missing __init__.py file in the 'datasets' folder

Issue Description:

When attempting to install GroundingDINO using pip from git, the 'datasets' folder is not installed properly. This is because the folder lacks an __init__.py file, which is required for the find_packages function to identify it.

This issue prevents the 'datasets' module from being properly imported and used, which results in errors when trying to run the program. It is recommended to include an empty __init__.py file in the 'datasets' folder to resolve this issue and allow for successful installation and usage of GroundingDINO.

Inference error on Windows - Permission denied

Hello, and thanks for your absolutely amazing work!

Although it works OK on an Oracle VM running Ubuntu 22 over Windows 10, on Windows 10 itself (conda environment) I encounter the following error:
PermissionError: [Errno 13] Permission denied: 'C:\Users\drago\AppData\Local\Temp\tmp3ta82pyj\tmpvv_bru9r.py'

full description:
"(ZeroShot_GroundingDINO) C:\ML_Projects\Computer_Vision\GroundingDINO>python demo/inference_on_a_image.py --config_file groundingdino/config/GroundingDINO_SwinT_OGC.py --checkpoint_path weights/groundingdino_swint_ogc.pth --image_path images/Truck1.jpg --text_prompt truck --output_dir inference/images --box_threshold 0.3 --text_threshold 0.25
Traceback (most recent call last):
File "C:\ML_Projects\Computer_Vision\GroundingDINO\demo\inference_on_a_image.py", line 153, in
model = load_model(config_file, checkpoint_path, cpu_only=args.cpu_only)
File "C:\ML_Projects\Computer_Vision\GroundingDINO\demo\inference_on_a_image.py", line 73, in load_model
args = SLConfig.fromfile(model_config_path)
File "c:\ml_projects\computer_vision\groundingdino\groundingdino\util\slconfig.py", line 182, in fromfile
cfg_dict, cfg_text = SLConfig._file2dict(filename)
File "c:\ml_projects\computer_vision\groundingdino\groundingdino\util\slconfig.py", line 83, in _file2dict
shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
File "C:\Users\drago\anaconda3\envs\ZeroShot_GroundingDINO\lib\shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: 'C:\Users\drago\AppData\Local\Temp\tmp3ta82pyj\tmpvv_bru9r.py'

The same error occurs when running as admin.

Thanks for help.

Dragos

Decreasing inference time on cpu

Thanks for this awesome model; it evaluates well with the pretrained model.

Right now I am getting a 15 s average inference time. Is there any way to reduce it to 2-3 s?

How to improve inference speed?

Hello, I want to know if there is a way to improve the model's inference speed. It takes about 1.5 seconds on my machine (i9-11900KF + RTX 3080 Ti) to perform one inference, while in Appendix I of the paper the FPS reaches 8.37. I would like to ask how you achieved this. What was the configuration during testing?

Thanks !

Issue related to Coco Class Mapper In Utils.py file

In the file utils.py, for the class CocoClassMapper, I can see that the classes are not mapped in order. Is this an actual issue, or was it done on purpose?
Line reference
Code block of the same:

class CocoClassMapper:
    def __init__(self) -> None:
        self.category_map_str = {
            "1": 1,
            "2": 2,
            "3": 3,
            "4": 4,
            "5": 5,
            "6": 6,
            "7": 7,
            "8": 8,
            "9": 9,
            "10": 10,
            "11": 11,
            "13": 12,
            "14": 13,
            "15": 14,
            "16": 15,
            "17": 16,
            "18": 17,
            "19": 18,
            "20": 19,
            "21": 20,
            "22": 21,
            "23": 22,
            "24": 23,
            "25": 24,
            "27": 25,
            "28": 26,
            "31": 27,
            "32": 28,
            "33": 29,
            "34": 30,
            "35": 31,
            "36": 32,
            "37": 33,
            "38": 34,
            "39": 35,
            "40": 36,
            "41": 37,
            "42": 38,
            "43": 39,
            "44": 40,
            "46": 41,
            "47": 42,
            "48": 43,
            "49": 44,
            "50": 45,
            "51": 46,
            "52": 47,
            "53": 48,
            "54": 49,
            "55": 50,
            "56": 51,
            "57": 52,
            "58": 53,
            "59": 54,
            "60": 55,
            "61": 56,
            "62": 57,
            "63": 58,
            "64": 59,
            "65": 60,
            "67": 61,
            "70": 62,
            "72": 63,
            "73": 64,
            "74": 65,
            "75": 66,
            "76": 67,
            "77": 68,
            "78": 69,
            "79": 70,
            "80": 71,
            "81": 72,
            "82": 73,
            "84": 74,
            "85": 75,
            "86": 76,
            "87": 77,
            "88": 78,
            "89": 79,
            "90": 80,
        }

As you can see, the keys are not contiguous: numbers such as 12, 26, and 29 are missing.

Thanks

Testing or evaluating code in COCO dataset

Hello,

I am trying to replicate the performance of Grounding DINO on the COCO dataset. Is there any code or other way to do so?

Thanks for the great work!

Great work and a few warnings to address

I appreciate the fantastic work on this project! The paper and implementation are outstanding. Just a heads-up, there are a few PyTorch-related warnings (e.g., floordiv, meshgrid) in the code.

/home/yanai-lab/foo/conda_env/gdino/lib/python3.10/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
_IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
/home/yanai-lab/foo/conda_env/gdino/lib/python3.10/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/backbone/position_encoding.py:114: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_tx = self.temperatureW ** (2 * (dim_tx // 2) / self.num_pos_feats)
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/backbone/position_encoding.py:118: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_ty = self.temperatureH ** (2 * (dim_ty // 2) / self.num_pos_feats)
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/utils.py:209: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = 10000 ** (2 * (dim_t // 2) / 128)

Here is my current environment for running your code:

Package            Version    Editable project location
------------------ ---------- --------------------------------------------------
addict             2.4.0 
brotlipy           0.7.0 
certifi            2022.12.7
cffi               1.15.1
chardet            5.1.0 
charset-normalizer 3.1.0 
click              8.1.3 
contourpy          1.0.7 
cryptography       39.0.1
cycler             0.11.0
filelock           3.10.0
flit_core          3.8.0  
fonttools          4.39.2
groundingdino      0.1.0      /host/space0/xx/GroundingDINO
huggingface-hub    0.13.3                                                                                                                                            
idna               3.4
joblib             1.2.0
kiwisolver         1.4.4
matplotlib         3.7.1
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
numpy              1.24.2
opencv-python      4.7.0.72
packaging          21.3
Pillow             9.4.0
pip                23.0.1
pycocotools        2.0.6
pycparser          2.21
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
sacremoses         0.0.53
setuptools         65.6.3
six                1.16.0
timm               0.6.12
tokenizers         0.10.3
torch              1.12.1
torchaudio         0.12.1
torchvision        0.13.1
tqdm               4.65.0
transformers       4.5.1
typing_extensions  4.4.0
urllib3            1.26.15
wheel              0.38.4
yapf               0.32.0

It does not work at all :(

My command:

CUDA_VISIBLE_DEVICES=0 python /workspace/GroundingDINO/demo/inference_on_a_image.py \
  -c /workspace/GroundingDINO/groundingdino/config/GroundingDINO_SwinB.cfg.py \
  -p /workspace/groundingdino_swinb_cogcoor.pth \
  -i /workspace/1e72fd7-1hordon-690.png \
  -o "outputs/0" \
  -t "cat ear." 

Input image: (attached)

Output: (attached)

CPU and GPU can't run the demo

(face19) ubuntu@ubuntu-X10SRA:~/seg/GroundingDINO$ python demo/inference_on_a_image.py -p /data/pic/groundingdino_swint_ogc.pth -i .asset/cats.png -c groundingdino/config/GroundingDINO_SwinT_OGC.py -t "cat" -o out/ or add --cpu-only
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']

  • This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    _IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
    /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/transformers/modeling_utils.py:830: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
    warnings.warn(
    /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
    warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    2023-04-07 16:26:51.924619: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-04-07 16:26:52.034800: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.034830: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2023-04-07 16:26:52.056150: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2023-04-07 16:26:52.605880: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.605964: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.605982: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
    (face19) ubuntu@ubuntu-X10SRA:~/seg/GroundingDINO$

How to choose the best prompt

Thanks a lot for this amazing repo!!!

During my personal experiments I found that it is sometimes difficult to find the best prompt for unusual objects. For such objects I tried to describe the object using some properties (size, color, shape, etc.), and sometimes this strategy worked, but sometimes not.

What do you think: is it possible to somehow extract information about the "best" prompt for an object whose bounding box we already know?

Thank you!

packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

When executing:

!python demo/inference_on_a_image.py \
  -c {CONFIG_PATH} \
  -p {WEIGHTS_PATH} \
  -i {IMAGE_NAME} \
  -o {OUTPUT_IMAGE_NAME} \
  -t "dog"

I got:

Traceback (most recent call last):
  File "/content/GroundingDINO/demo/inference_on_a_image.py", line 10, in <module>
    from groundingdino.models import build_model
  File "/content/GroundingDINO/groundingdino/models/__init__.py", line 8, in <module>
    from .GroundingDINO import build_groundingdino
  File "/content/GroundingDINO/groundingdino/models/GroundingDINO/__init__.py", line 15, in <module>
    from .groundingdino import build_groundingdino
  File "/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py", line 24, in <module>
    from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast
  File "/usr/local/lib/python3.9/dist-packages/transformers/__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "/usr/local/lib/python3.9/dist-packages/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/usr/local/lib/python3.9/dist-packages/transformers/utils/versions.py", line 101, in require_version_core
    return require_version(requirement, hint)
  File "/usr/local/lib/python3.9/dist-packages/transformers/utils/versions.py", line 92, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "/usr/local/lib/python3.9/dist-packages/packaging/version.py", line 52, in parse
    return Version(version)
  File "/usr/local/lib/python3.9/dist-packages/packaging/version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

Any idea how to solve it?

Cannot install

Got the following error:

      RuntimeError:
      The detected CUDA version (10.1) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for groundingdino
  Running setup.py clean for groundingdino
Failed to build groundingdino

How to train on my own dataset

Hello authors, how can I use your pretrained GroundingDINO model to train on my own dataset? (In the case where my own dataset is very small.)

dense similarity predictions

Hi! Thanks for your awesome work. I am wondering whether it is possible to extract dense similarity scores between an input image and textual prompts. Specifically, I have tried to extract dense similarity according to the following pseudo-code, using the text features and image features after the Feature Enhancer. However, I found that the resulting similarity is nearly nonsense. I would like to check whether you have any other suggestions, as dense similarity is vital for several open-world tasks.

import torch.nn.functional as F

# L2-normalize both modalities, then take the cosine similarity between every
# image token and every text token.
enhanced_image_features = F.normalize(enhanced_image_features, dim=-1)
enhanced_text_features = F.normalize(enhanced_text_features, dim=-1)
similarity = enhanced_image_features @ enhanced_text_features.T

questions about zero-shot detection

I tried to apply Grounding DINO to my custom data. However, I am confused about how to input the text or class names. I notice that you take 'dog.stick.' as input in the example in the README. However, if I do not know what objects are in the image, should I input all the possible class names as text?
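
Not an official answer, only an illustration of the prompt convention from the README above: candidate class names are typically joined with " . " separators, whether or not all of them appear in the image.

class_names = ["dog", "stick", "person"]
text_prompt = " . ".join(class_names) + " ."
print(text_prompt)  # "dog . stick . person ."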

Different order of detection prompts generate different results?

Hello, thanks for your great work 😄

I want to detect several classes in this image. However, when I change the order of detection prompts, I get different results.
Would you mind telling me why the order will affect the results ?
And what is the best practice when I want to use Grounding Dino to detect several classes?


Example image: (attached)

Detection prompt:

socks, shoes, dress, skirt

Result: (image attached)

Detection prompt:

dress, skirt, socks, shoes

Result: (image attached)

Simplify getting output labels

First, just wanted to say thanks for publishing this repo -- very cool and love that it's fully open-sourced.

I wanted to suggest a simplification to get_phrases_from_posmap. For reference, here is the current definition:

def get_phrases_from_posmap(posmap: torch.BoolTensor, tokenlized, caption: str):
    assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
    if posmap.dim() == 1:
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        words_list = caption.split()

        # build word idx list
        words_idx_used_list = []
        for idx in non_zero_idx:
            word_idx = tokenlized.token_to_word(idx)
            if word_idx is not None:
                words_idx_used_list.append(word_idx)
        words_idx_used_list = set(words_idx_used_list)

        # build phrase
        words_used_list = []
        for idx, word in enumerate(words_list):
            if idx in words_idx_used_list:
                words_used_list.append(word)

        sentence_res = " ".join(words_used_list)
        return sentence_res
    else:
        raise NotImplementedError("posmap must be 1-dim")

It seems like this is over-complicating the text decoding, since we can already do that with the tokenizer. An interesting edge case with your implementation:

  • If the input label contains a long word (e.g. "American flag"), the long word is often lost in the output label. I believe this is because the long word ("American") is split into multiple tokens. In that case, the for idx, word in enumerate(words_list) loop is incorrect, because the positional index of the word is offset. In my tests, it has often returned incorrect or empty text labels, due to this issue.

A potential fix:

from typing import Dict

import torch
from transformers import AutoTokenizer

def get_phrases_from_posmap(
    posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer
):
    assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
    if posmap.dim() == 1:
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        token_ids = [tokenized["input_ids"][i] for i in non_zero_idx]
        return tokenizer.decode(token_ids)
    else:
        raise NotImplementedError("posmap must be 1-dim")

We don't have to compare against the original caption, because all token information is already contained in the tokenizer. This gives the correct result "American flag" in my example above.
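
For illustration, a hypothetical usage of the proposed helper; the caption, indices, and expected output below are made up for the example, not taken from the repository.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = "american flag . chair ."
tokenized = tokenizer(caption)
posmap = torch.zeros(len(tokenized["input_ids"]), dtype=torch.bool)
posmap[1:3] = True  # positions of the "american" and "flag" tokens (after [CLS])
print(get_phrases_from_posmap(posmap, tokenized, tokenizer))  # -> "american flag"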

Reverse engineer prompt

Hi

Firstly thanks for the awesome work, very cool!

I was wondering if it's possible to reverse engineer a prompt given a bounding box?

Some questions about the detail in the paper.

Nice work! Excuse me, I have some questions about Table 7 and Figure 5 of the paper. Table 7 shows that training from scratch is higher than training from pretrained DINO on the COCO dataset, whereas Figure 5 shows the opposite. It's a little confusing.

cannot import name '_C' from 'groundingdino'

I'm able to run it properly on a local GPU machine I've got, but when I move this to the cloud in a Docker image, I get this issue. I have a feeling there's some system package I'm missing or something, but I'm not 100% sure.

Some sub-words are ignored when using long words

Thanks for the great code. I encountered an issue when using Grounding DINO (or maybe it is just expected?).
If I use a long word, like 'pottedplant', it will be tokenized into several sub-words.
When generating the output bounding boxes, some sub-words are ignored (I guess this is because the cross-attention is done at the token level, so the scores of some sub-words are lower than the text threshold), and the generated label is incomplete.
For example, 'pottedplant' -> 'pot' 'ted' 'pl' 'ant', and some box labels are wrong, like 'potted' or 'pottedpl'.
I wonder whether there is any solution for this?
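
For what it's worth, the splitting described above can be reproduced directly with the tokenizer; the exact pieces depend on the tokenizer version, so the output in the comment is only what the issue reports.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("pottedplant"))  # reported as sub-words like 'pot', 'ted', 'pl', 'ant'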

ERROR: No matching distribution found for supervision==0.4.0

pip install supervision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Looking in indexes: http://pypi.douban.com/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement supervision==0.4.0 (from versions: 0.1.0, 0.2.0, 0.2.1, 0.3.0, 0.3.1)
ERROR: No matching distribution found for supervision==0.4.0

bug?

please double check here:

inference_on_a_image.py

box_threshold = args.box_threshold
text_threshold = args.box_threshold
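
Presumably the second assignment was meant to read args.text_threshold, since the demo exposes both --box_threshold and --text_threshold flags; the intended code would then be:

box_threshold = args.box_threshold
text_threshold = args.text_threshold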

Installation error

Hi,

I'm getting this error message during installation:

UserWarning: The detected CUDA version (11.5) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
error: command '/usr/bin/nvcc' failed with exit code 1
[end of output]

Could you give some tips to fix this issue?

Some questions on the details in paper

To the Authors

This is a very interesting and good work on visual grounding tasks with a Query-based detector. The paper is also well written and clear. Super interesting results with GLIGEN as well. I do have a few very specific questions about the implementation or concepts in the paper.

  1. As for the language-guided query selection: this module makes a lot of sense, and you are basically saying that you want to extract the locations of the image tokens that have the greatest responses to the text tokens, and then use these as the location queries in the Mixed-Query-Selection design of DINO. I notice you describe the outer product between text/image tokens as logits. My questions are: (a) Is there any supervision at this level? If not, did you use any pretrained vision-language initialization so that they naturally respond? (b) Does it make more sense to use the normalized feature vectors so that the dot product is actually a correlation? (c) What happens if the selected image tokens all respond to the same text token or only a few text tokens, and is there any way to separate them out, like the 1st-stage training in Deformable DETR or DINO?
  2. As for the Sub-Sentence Level Text Feature: (a) How is the attention mask produced when dealing with weak annotations such as image-caption pairs (Cap4M)? Did you use a noun extraction method as described in DetCLIP? As a detailed example, how do you generate the attention mask for a concept like "fruit fly" or a human name such as "Harry Potter" when the detection dataset doesn't have this category? (b) And how do you handle the input length limit, as GLIP describes in their paper, when you have over 1000 categories like LVIS during training/inference? Was there a sparse negative category sampling strategy?
  3. Loss Function: Is the negative class handled similarly to the alignment loss described in GLIP or MDETR? I assume you apply sigmoid focal loss and the negative object queries simply learn 0 from the {0, 1} binary target?
  4. Last but not least, do you think it's possible to leverage other frameworks such as pretrained ALBEF, VLMo, or even BeiTv3 and inject your design into them? If not, what do you think are the limitations of these frameworks?

Thank you.
