
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

Home Page: https://arxiv.org/abs/2303.05499

License: Apache License 2.0

Python 79.26% C++ 1.98% Cuda 17.57% Jupyter Notebook 0.89% Dockerfile 0.30%
object-detection open-world open-world-detection vision-language vision-language-transformer

groundingdino's Introduction

🦕 Grounding DINO


IDEA-CVR, IDEA-Research

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang📧.

[Paper] [Demo] [BibTex]

PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.

🌞 Helpful Tutorial

✨ Highlight Projects

💡 Highlight

  • Open-Set Detection. Detect everything with language!
  • High Performance. COCO zero-shot 52.5 AP (training without COCO data!). COCO fine-tune 63.0 AP.
  • Flexible. Collaborates with Stable Diffusion for image editing.

🔥 News

  • 2023/07/18: We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
  • 2023/06/17: We provide an example to evaluate Grounding DINO's zero-shot performance on COCO.
  • 2023/04/15: Refer to CV in the Wild Readings if you are interested in open-set recognition!
  • 2023/04/08: We release demos that combine Grounding DINO with GLIGEN for more controllable image editing.
  • 2023/04/08: We release demos that combine Grounding DINO with Stable Diffusion for image editing.
  • 2023/04/06: We build a new demo, Grounded-Segment-Anything, by marrying Grounding DINO with Segment-Anything to support segmentation in Grounding DINO.
  • 2023/03/28: A YouTube video about Grounding DINO and basic object detection prompt engineering. [SkalskiP]
  • 2023/03/28: Add a demo on Hugging Face Space!
  • 2023/03/27: Support CPU-only mode. Now the model can run on machines without GPUs.
  • 2023/03/25: A demo for Grounding DINO is available at Colab. [SkalskiP]
  • 2023/03/22: Code is available Now!
[Figures: paper introduction (Description), ODinW results, and marrying Grounding DINO with GLIGEN (gd_gligen).]

⭐ Explanations/Tips for Grounding DINO Inputs and Outputs

  • Grounding DINO accepts an (image, text) pair as input.
  • It outputs 900 (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
  • By default, we keep the boxes whose highest similarity is above a box_threshold (see the sketch after this list).
  • We extract the words whose similarities are higher than the text_threshold as predicted labels.
  • If you want to obtain objects for specific phrases, like the dogs in the sentence "two dogs with a stick.", you can select the boxes with the highest text similarity to dogs as the final outputs.
  • Note that each word can be split into more than one token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
  • We suggest separating different category names with . for Grounding DINO.
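
The following is a minimal sketch of the thresholding logic described in the list above. It is illustrative only; the tensor names and shapes are assumptions, not the repository's actual implementation.

import torch

def filter_outputs(logits: torch.Tensor, boxes: torch.Tensor,
                   box_threshold: float = 0.35, text_threshold: float = 0.25):
    # logits: (num_boxes, num_text_tokens) similarity scores; boxes: (num_boxes, 4).
    scores = logits.max(dim=1).values        # highest similarity per box
    keep = scores > box_threshold            # box-level filtering
    kept_logits, kept_boxes = logits[keep], boxes[keep]
    # Token-level mask per kept box: tokens above text_threshold form the predicted label.
    token_masks = kept_logits > text_threshold
    return kept_boxes, scores[keep], token_masks

# Example with random scores, just to show the shapes involved (900 boxes, 8 text tokens).
boxes, scores, token_masks = filter_outputs(torch.rand(900, 8), torch.rand(900, 4))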

🏷️ TODO

  • Release inference code and demo.
  • Release checkpoints.
  • Grounding DINO with Stable Diffusion and GLIGEN demos.
  • Release training codes.

🛠️ Install

Note:

  1. If you have a CUDA environment, please make sure the environment variable CUDA_HOME is set. The package will be compiled in CPU-only mode if CUDA is not available.

Please follow the installation steps strictly; otherwise, the program may raise:

NameError: name '_C' is not defined

If this happens, please reinstall GroundingDINO: re-clone the repository and run all the installation steps again.

How to check CUDA:

echo $CUDA_HOME

If it prints nothing, it means you haven't set up the path.

Run the following so the environment variable is set in the current shell:

export CUDA_HOME=/path/to/cuda-11.3

Note that the CUDA version should match your CUDA runtime, since multiple CUDA toolkits may coexist on the same machine.

If you want to set the CUDA_HOME permanently, store it using:

echo 'export CUDA_HOME=/path/to/cuda' >> ~/.bashrc

After that, source the ~/.bashrc file and check CUDA_HOME:

source ~/.bashrc
echo $CUDA_HOME

In this example, /path/to/cuda-11.3 should be replaced with the path where your CUDA toolkit is installed. You can find this by typing which nvcc in your terminal:

For instance, if the output is /usr/local/cuda/bin/nvcc, then:

export CUDA_HOME=/usr/local/cuda

Installation:

  1. Clone the GroundingDINO repository from GitHub.

git clone https://github.com/IDEA-Research/GroundingDINO.git

  2. Change the current directory to the GroundingDINO folder.

cd GroundingDINO/

  3. Install the required dependencies in the current directory.

pip install -e .

  4. Download the pre-trained model weights.
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
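
As an optional sanity check after installation (assuming the steps above completed without errors), you can verify that PyTorch sees your GPU and that the compiled custom ops import cleanly; a missing _C module is what the NameError above refers to.

import torch
print("CUDA available:", torch.cuda.is_available())

try:
    from groundingdino import _C  # compiled C++/CUDA ops built by `pip install -e .`
    print("GroundingDINO custom ops loaded")
except ImportError:
    print("Custom ops missing; expect CPU-only mode or the '_C' error described above")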

▶️ Demo

Check your GPU ID (only if you're using a GPU)

nvidia-smi

Replace {GPU ID}, image_you_want_to_detect.jpg, and "dir you want to save the output" with appropriate values in the following command:

CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair"
 [--cpu-only] # add this flag to run in CPU-only mode

If you would like to specify the phrases to detect, here is a demo:

CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p ./groundingdino_swint_ogc.pth \
-i .asset/cat_dog.jpeg \
-o logs/1111 \
-t "There is a cat and a dog in the image ." \
--token_spans "[[[9, 10], [11, 14]], [[19, 20], [21, 24]]]"
 [--cpu-only] # add this flag to run in CPU-only mode

The token_spans argument specifies the start and end character positions of phrases. For example, the first phrase is [[9, 10], [11, 14]]: "There is a cat and a dog in the image ."[9:10] is 'a' and "There is a cat and a dog in the image ."[11:14] is 'cat', so it refers to the phrase a cat. Similarly, [[19, 20], [21, 24]] refers to the phrase a dog.
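
To avoid counting characters by hand, a small helper like the following can compute the spans for a phrase. It is an illustrative sketch (phrase_to_spans is not part of the repository) and assumes each word of the phrase appears in order in the caption.

def phrase_to_spans(caption: str, phrase: str):
    # Return [start, end) character spans for each word of `phrase` inside `caption`,
    # in the format expected by --token_spans.
    spans = []
    cursor = caption.find(phrase)
    if cursor == -1:
        raise ValueError(f"phrase {phrase!r} not found in caption")
    for word in phrase.split():
        start = caption.index(word, cursor)
        end = start + len(word)
        spans.append([start, end])
        cursor = end
    return spans

caption = "There is a cat and a dog in the image ."
print(phrase_to_spans(caption, "a cat"))  # [[9, 10], [11, 14]]
print(phrase_to_spans(caption, "a dog"))  # [[19, 20], [21, 24]]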

See demo/inference_on_a_image.py for more details.

Running with Python:

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "weights/dog-3.jpeg"
TEXT_PROMPT = "chair . person . dog ."
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image.jpg", annotated_frame)
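
As a follow-up, if you need pixel coordinates, the boxes returned above can be converted with torchvision. This sketch assumes the boxes are normalized cxcywh (the format the annotate helper consumes); verify this against your version of the repository.

import torch
from torchvision.ops import box_convert

h, w, _ = image_source.shape
xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy")
for phrase, box, score in zip(phrases, xyxy.tolist(), logits.tolist()):
    print(f"{phrase}: {[round(v, 1) for v in box]} (score {score:.2f})")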

Web UI

We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See the file demo/gradio_app.py for more details.

Notebooks

COCO Zero-shot Evaluations

We provide an example to evaluate Grounding DINO's zero-shot performance on COCO. The result should be 48.5 box AP.

CUDA_VISIBLE_DEVICES=0 \
python demo/test_ap_on_coco.py \
 -c groundingdino/config/GroundingDINO_SwinT_OGC.py \
 -p weights/groundingdino_swint_ogc.pth \
 --anno_path /path/to/annotations/instances_val2017.json \
 --image_dir /path/to/val2017
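
For reference, the number the script reports is standard COCO box AP. A minimal sketch of the same evaluation with pycocotools, assuming you have dumped the detections to a results.json in COCO result format:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("/path/to/annotations/instances_val2017.json")  # ground truth
coco_dt = coco_gt.loadRes("results.json")                      # your detections
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # the first printed AP line is the box AP quoted above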

🧳 Checkpoints

name | backbone | Data | box AP on COCO | Checkpoint | Config
1 | GroundingDINO-T | Swin-T | O365, GoldG, Cap4M | 48.4 (zero-shot) / 57.2 (fine-tune) | GitHub link / HF link | link
2 | GroundingDINO-B | Swin-B | COCO, O365, GoldG, Cap4M, OpenImage, ODinW-35, RefCOCO | 56.7 | GitHub link / HF link | link

🎖️ Results

COCO Object Detection Results
ODinW Object Detection Results
Marrying Grounding DINO with Stable Diffusion for Image Editing: see our example notebook for more details.
Marrying Grounding DINO with GLIGEN for More Detailed Image Editing: see our example notebook for more details.

🦕 Model: Grounding DINO

The model includes a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.


♥️ Acknowledgement

Our model is related to DINO and GLIP. Thanks for their great work!

We also thank great previous work including DETR, Deformable DETR, SMCA, Conditional DETR, Anchor DETR, Dynamic DETR, DAB-DETR, DN-DETR, etc. More related work is available at Awesome Detection Transformer. A new toolbox, detrex, is available as well.

Thanks Stable Diffusion and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{liu2023grounding,
  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
  journal={arXiv preprint arXiv:2303.05499},
  year={2023}
}

groundingdino's People

Contributors

ahmedosman2001, ashwinunnikrishnan, bing-su, csuastt, deniz-birlikci, eltociear, everloom-129, gathierry, georgepearse, haoliuhust, haoran-hash, hardikdava, jishnujp-vp, junxnone, karimumar98, kazutomurase, luca-medeiros, mhd-medfa, pooya-mohammadi, rentainhe, sdy623, skalskip, slongliu, teenaxta, zvant


groundingdino's Issues

TensorRT with GroundingDINO

Hi, I am a little stuck on how to use TensorRT to speed up GroundingDINO inference. GroundingDINO takes in both an image and text prompt and I am a bit lost on how to convert the text prompt to tensor. Can someone please give me some example code or suggestions on how to make it work? Thank you!

how to install the repo ?

Hi,
As the title says, could I install it with pip? For example:
pip install git+https://github.com/IDEA-Research/GroundingDINO

training code

Thanks a lot for releasing the code!
Will the training code be released, please?

Thanks a lot again!

When I install Grounding DINO, the terminal reports this error

When I install Grounding DINO, the terminal reports this error:

PS D:\kitch-python\Grounded-Segment-Anything-main> python -m pip install -e GroundingDINO
Obtaining file:///D:/kitch-python/Grounded-Segment-Anything-main/GroundingDINO
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [24 lines of output]
Traceback (most recent call last):
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 132, in get_requires_for_build_editable
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 447, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 485, in run_setup
self).run_setup(setup_script=setup_script)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\KitchXia\AppData\Local\Temp\pip-build-env-lr8tlhtk\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "", line 27, in
ModuleNotFoundError: No module named 'torch'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip

Question adding more training data

I have a couple of questions

  1. Is there a Slack or Discord for work in CV, etc.?
  2. Is it possible to add more training data? If so is there an example?

Thanks,
Thejas

About referring expression grounding

Thanks for the great work!

When I try the grounding demo file, I find that a sentence (e.g. "a man in blue coat") will be split into multiple phrases (e.g. "a man", "blue coat") and the model will predict multiple boxes corresponding to the phrases. In fact, the model is expected to generate just one box ("a man"), just like in the Referring Expression Grounding task. It seems this model can't handle this task, or do I need to adjust some hyper-parameters?

But if this model cuts the given sentence into multiple phrases, how is it tested on the RefCOCO dataset, and how does it achieve impressive performance there?

I'm a little confused about it.

Thanks!

Running in CPU-only mode! GPU cannot work!

I try this:
CUDA_VISIBLE_DEVICES=0 python demo/inference_on_a_image.py
-c /home/incar/tms/source/GroundingDINO/groundingdino/config/GroundingDINO_SwinB.cfg.py
-p groundingdino_swinb_cogcoor.pth
-i img/001.jpg
-o "outputs/0"
-t "animal,bird"

and I got this:
source/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!

my pytorch version:

'1.11.0+cu113'

So, why?

Evaluation code

Thank you for your excellent work. I wonder if you have a plan to release the evaluation code.

Difference between GroundingDINO and UniDetector.

I am new to open-set object detection. I am trying to understand the difference between your Grounding DINO and UniDetector. Both can perform open-set detection. I think the difference may be that, for each dataset, you need to prompt the novel labels into the network for zero-shot detection, while UniDetector takes many prompts as input, not limited to the labels of each dataset. Besides, could you provide more insight into the contributions of Grounding DINO? The answer would be helpful to me. Thanks!

nvcc cannot find cl.exe

When installing this project on Windows, nvcc could not find cl.exe while executing the command, but my environment variables are all fine and I have tested them. Why does this happen?

Error installing GroundingDINO via pip git due to missing __init__.py file in the 'datasets' folder

Issue Description:

When attempting to install GroundingDINO using pip from git, the 'datasets' folder is not installed properly. This is because the folder lacks an __init__.py file, which is required for the find_packages function to identify it.

This issue prevents the 'datasets' module from being properly imported and used, which results in errors when trying to run the program. It is recommended to include an empty __init__.py file in the 'datasets' folder to resolve this issue and allow for successful installation and usage of GroundingDINO.

Inference error on Windows - Permission denied

Hello, and thanks for your absolutely amazing work!

Although it works OK on an Oracle VM running Ubuntu 22 over Windows 10, on Windows 10 itself (conda environment) I encounter the following error:
PermissionError: [Errno 13] Permission denied: 'C:\Users\drago\AppData\Local\Temp\tmp3ta82pyj\tmpvv_bru9r.py'

full description:
"(ZeroShot_GroundingDINO) C:\ML_Projects\Computer_Vision\GroundingDINO>python demo/inference_on_a_image.py --config_file groundingdino/config/GroundingDINO_SwinT_OGC.py --checkpoint_path weights/groundingdino_swint_ogc.pth --image_path images/Truck1.jpg --text_prompt truck --output_dir inference/images --box_threshold 0.3 --text_threshold 0.25
Traceback (most recent call last):
File "C:\ML_Projects\Computer_Vision\GroundingDINO\demo\inference_on_a_image.py", line 153, in
model = load_model(config_file, checkpoint_path, cpu_only=args.cpu_only)
File "C:\ML_Projects\Computer_Vision\GroundingDINO\demo\inference_on_a_image.py", line 73, in load_model
args = SLConfig.fromfile(model_config_path)
File "c:\ml_projects\computer_vision\groundingdino\groundingdino\util\slconfig.py", line 182, in fromfile
cfg_dict, cfg_text = SLConfig._file2dict(filename)
File "c:\ml_projects\computer_vision\groundingdino\groundingdino\util\slconfig.py", line 83, in _file2dict
shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
File "C:\Users\drago\anaconda3\envs\ZeroShot_GroundingDINO\lib\shutil.py", line 266, in copyfile
with open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: 'C:\Users\drago\AppData\Local\Temp\tmp3ta82pyj\tmpvv_bru9r.py'

The same error occurs when running as admin.

Thanks for help.

Dragos

Decreasing inference time on cpu

Thanks for this awesome model; it evaluates well with the pretrained model.

Right now I am getting a 15 s average inference time. Is there any way to reduce it to 2-3 s?

How to improve inference speed?

Hello, I want to know if there is a way to improve the model's inference speed. It takes about 1.5 seconds on my machine (i9-11900KF + RTX 3080 Ti) to perform one inference, while in Appendix I of the paper the FPS reaches 8.37. I would like to ask how you achieved this. What was the configuration during testing?

Thanks !

Issue related to Coco Class Mapper In Utils.py file

In the file utils.py, for the class CocoClassMapper, I can see that the classes are not mapped in order. Is this an actual issue, or was it done on purpose?
Line reference
Code block of the same:

class CocoClassMapper:
    def __init__(self) -> None:
        self.category_map_str = {
            "1": 1,
            "2": 2,
            "3": 3,
            "4": 4,
            "5": 5,
            "6": 6,
            "7": 7,
            "8": 8,
            "9": 9,
            "10": 10,
            "11": 11,
            "13": 12,
            "14": 13,
            "15": 14,
            "16": 15,
            "17": 16,
            "18": 17,
            "19": 18,
            "20": 19,
            "21": 20,
            "22": 21,
            "23": 22,
            "24": 23,
            "25": 24,
            "27": 25,
            "28": 26,
            "31": 27,
            "32": 28,
            "33": 29,
            "34": 30,
            "35": 31,
            "36": 32,
            "37": 33,
            "38": 34,
            "39": 35,
            "40": 36,
            "41": 37,
            "42": 38,
            "43": 39,
            "44": 40,
            "46": 41,
            "47": 42,
            "48": 43,
            "49": 44,
            "50": 45,
            "51": 46,
            "52": 47,
            "53": 48,
            "54": 49,
            "55": 50,
            "56": 51,
            "57": 52,
            "58": 53,
            "59": 54,
            "60": 55,
            "61": 56,
            "62": 57,
            "63": 58,
            "64": 59,
            "65": 60,
            "67": 61,
            "70": 62,
            "72": 63,
            "73": 64,
            "74": 65,
            "75": 66,
            "76": 67,
            "77": 68,
            "78": 69,
            "79": 70,
            "80": 71,
            "81": 72,
            "82": 73,
            "84": 74,
            "85": 75,
            "86": 76,
            "87": 77,
            "88": 78,
            "89": 79,
            "90": 80,
        }

As you can see, the keys are not contiguous: numbers such as 12, 26, and 29 are missing.

Thanks

Testing or evaluating code in COCO dataset

Hello,

I am trying to replicate the performance of Grounding DINO on the COCO dataset. Is there any code or other way to do so?

Thanks for the great work!

Great work and a few warnings to address

I appreciate the fantastic work on this project! The paper and implementation are outstanding. Just a heads-up, there are a few PyTorch-related warnings (e.g., floordiv, meshgrid) in the code.

/home/yanai-lab/foo/conda_env/gdino/lib/python3.10/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
_IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
/home/yanai-lab/foo/conda_env/gdino/lib/python3.10/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/backbone/position_encoding.py:114: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_tx = self.temperatureW ** (2 * (dim_tx // 2) / self.num_pos_feats)
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/backbone/position_encoding.py:118: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_ty = self.temperatureH ** (2 * (dim_ty // 2) / self.num_pos_feats)
/host/space0/GroundingDINO/groundingdino/models/GroundingDINO/utils.py:209: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = 10000 ** (2 * (dim_t // 2) / 128)

Here is my current environment for running your code:

Package            Version    Editable project location
------------------ ---------- --------------------------------------------------
addict             2.4.0 
brotlipy           0.7.0 
certifi            2022.12.7
cffi               1.15.1
chardet            5.1.0 
charset-normalizer 3.1.0 
click              8.1.3 
contourpy          1.0.7 
cryptography       39.0.1
cycler             0.11.0
filelock           3.10.0
flit_core          3.8.0  
fonttools          4.39.2
groundingdino      0.1.0      /host/space0/xx/GroundingDINO
huggingface-hub    0.13.3                                                                                                                                            
idna               3.4
joblib             1.2.0
kiwisolver         1.4.4
matplotlib         3.7.1
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
numpy              1.24.2
opencv-python      4.7.0.72
packaging          21.3
Pillow             9.4.0
pip                23.0.1
pycocotools        2.0.6
pycparser          2.21
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
sacremoses         0.0.53
setuptools         65.6.3
six                1.16.0
timm               0.6.12
tokenizers         0.10.3
torch              1.12.1
torchaudio         0.12.1
torchvision        0.13.1
tqdm               4.65.0
transformers       4.5.1
typing_extensions  4.4.0
urllib3            1.26.15
wheel              0.38.4
yapf               0.32.0

It does not work at all :(

My command:

CUDA_VISIBLE_DEVICES=0 python /workspace/GroundingDINO/demo/inference_on_a_image.py \
  -c /workspace/GroundingDINO/groundingdino/config/GroundingDINO_SwinB.cfg.py \
  -p /workspace/groundingdino_swinb_cogcoor.pth \
  -i /workspace/1e72fd7-1hordon-690.png \
  -o "outputs/0" \
  -t "cat ear." 

Input image: (attached)

Output: (attached)

CPU and GPU can't run the demo

(face19) ubuntu@ubuntu-X10SRA:~/seg/GroundingDINO$ python demo/inference_on_a_image.py -p /data/pic/groundingdino_swint_ogc.pth -i .asset/cats.png -c groundingdino/config/GroundingDINO_SwinT_OGC.py -t "cat" -o out/ or add --cpu-only
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']

  • This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    _IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
    /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/transformers/modeling_utils.py:830: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
    warnings.warn(
    /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
    warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    2023-04-07 16:26:51.924619: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-04-07 16:26:52.034800: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.034830: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2023-04-07 16:26:52.056150: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2023-04-07 16:26:52.605880: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.605964: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/cv2/../../lib64:
    2023-04-07 16:26:52.605982: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
    (face19) ubuntu@ubuntu-X10SRA:~/seg/GroundingDINO$

How to choose the best prompt

Thanks a lot for this amazing repo!!!

During my personal experiments I found that it is sometimes difficult to find the best prompt for unusual objects. For such objects I tried to describe the object using some properties (size, color, shape, etc.), and sometimes this strategy worked, but sometimes not.

What do you think: is it possible to somehow extract information about the "best" prompt for an object whose bounding box we already know?

Thank you!

packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

When executing:

!python demo/inference_on_a_image.py \
  -c {CONFIG_PATH} \
  -p {WEIGHTS_PATH} \
  -i {IMAGE_NAME} \
  -o {OUTPUT_IMAGE_NAME} \
  -t "dog"

I got:

Traceback (most recent call last):
  File "/content/GroundingDINO/demo/inference_on_a_image.py", line 10, in <module>
    from groundingdino.models import build_model
  File "/content/GroundingDINO/groundingdino/models/__init__.py", line 8, in <module>
    from .GroundingDINO import build_groundingdino
  File "/content/GroundingDINO/groundingdino/models/GroundingDINO/__init__.py", line 15, in <module>
    from .groundingdino import build_groundingdino
  File "/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py", line 24, in <module>
    from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast
  File "/usr/local/lib/python3.9/dist-packages/transformers/__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "/usr/local/lib/python3.9/dist-packages/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/usr/local/lib/python3.9/dist-packages/transformers/utils/versions.py", line 101, in require_version_core
    return require_version(requirement, hint)
  File "/usr/local/lib/python3.9/dist-packages/transformers/utils/versions.py", line 92, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "/usr/local/lib/python3.9/dist-packages/packaging/version.py", line 52, in parse
    return Version(version)
  File "/usr/local/lib/python3.9/dist-packages/packaging/version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

Any idea how to solve it?

Cannot install

Got the following error:

      RuntimeError:
      The detected CUDA version (10.1) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for groundingdino
  Running setup.py clean for groundingdino
Failed to build groundingdino

How to train on my own dataset

Hello authors, how can I use your pretrained GroundingDINO model to train on my own dataset? (In the case where my own dataset is very small.)

dense similarity predictions

Hi! Thanks for your awesome work. I am wondering whether it is possible to extract dense similarity scores between an input image and textual prompts. Specifically, I have tried to extract dense similarity according to the following pseudo-code, using the text features and image features after the Feature Enhancer. However, I found that the resulting similarity is nearly nonsense. I would like to check whether you have any other suggestions, as dense similarity is vital for several open-world tasks.

import torch.nn.functional as F

# L2-normalize both modalities, then take the cosine similarity between every
# image token and every text token.
enhanced_image_features = F.normalize(enhanced_image_features, dim=-1)
enhanced_text_features = F.normalize(enhanced_text_features, dim=-1)
similarity = enhanced_image_features @ enhanced_text_features.T

questions about zero-shot detection

I tried to apply Grounding DINO to my custom data. However, I am confused about how to input the text or class names. I notice that you take 'dog.stick.' as input in the example in the README. However, if I do not know what objects are in the image, should I input all the possible class names as text?
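
Not an official answer, only an illustration of the prompt convention from the README above: candidate class names are typically joined with " . " separators, whether or not all of them appear in the image.

class_names = ["dog", "stick", "person"]
text_prompt = " . ".join(class_names) + " ."
print(text_prompt)  # "dog . stick . person ."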

Different order of detection prompts generate different results?

Hello, thanks for your great work 😄

I want to detect several classes in this image. However, when I change the order of detection prompts, I get different results.
Would you mind telling me why the order will affect the results ?
And what is the best practice when I want to use Grounding Dino to detect several classes?


Example image: (attached)

Detection prompt:

socks, shoes, dress, skirt

Result: (image attached)

Detection prompt:

dress, skirt, socks, shoes

Result: (image attached)

Simplify getting output labels

First, just wanted to say thanks for publishing this repo -- very cool and love that it's fully open-sourced.

I wanted to suggest a simplification to get_phrases_from_posmap. For reference, here is the current definition:

def get_phrases_from_posmap(posmap: torch.BoolTensor, tokenlized, caption: str):
    assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
    if posmap.dim() == 1:
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        words_list = caption.split()

        # build word idx list
        words_idx_used_list = []
        for idx in non_zero_idx:
            word_idx = tokenlized.token_to_word(idx)
            if word_idx is not None:
                words_idx_used_list.append(word_idx)
        words_idx_used_list = set(words_idx_used_list)

        # build phrase
        words_used_list = []
        for idx, word in enumerate(words_list):
            if idx in words_idx_used_list:
                words_used_list.append(word)

        sentence_res = " ".join(words_used_list)
        return sentence_res
    else:
        raise NotImplementedError("posmap must be 1-dim")

It seems like this is over-complicating the text decoding, since we can already do that with the tokenizer. An interesting edge case with your implementation:

  • If the input label contains a long word (e.g. "American flag"), the long word is often lost in the output label. I believe this is because the long word ("American") is split into multiple tokens. In that case, the for idx, word in enumerate(words_list) loop is incorrect, because the positional index of the word is offset. In my tests, it has often returned incorrect or empty text labels, due to this issue.

A potential fix:

from typing import Dict

import torch
from transformers import AutoTokenizer

def get_phrases_from_posmap(
    posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer
):
    assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
    if posmap.dim() == 1:
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        token_ids = [tokenized["input_ids"][i] for i in non_zero_idx]
        return tokenizer.decode(token_ids)
    else:
        raise NotImplementedError("posmap must be 1-dim")

We don't have to compare against the original caption, because all token information is already contained in the tokenizer. This gives the correct result "American flag" in my example above.
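
For illustration, a hypothetical usage of the proposed helper; the caption, indices, and expected output below are made up for the example, not taken from the repository.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = "american flag . chair ."
tokenized = tokenizer(caption)
posmap = torch.zeros(len(tokenized["input_ids"]), dtype=torch.bool)
posmap[1:3] = True  # positions of the "american" and "flag" tokens (after [CLS])
print(get_phrases_from_posmap(posmap, tokenized, tokenizer))  # -> "american flag"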

Reverse engineer prompt

Hi

Firstly thanks for the awesome work, very cool!

I was wondering if it's possible to reverse engineer a prompt given a bounding box?

Some questions about the detail in the paper.

Nice work! Excuse me, I have some questions about Table 7 and Figure 5 of the paper. Table 7 shows that training from scratch is higher than training from pretrained DINO on the COCO dataset, whereas Figure 5 shows the opposite. It's a little confusing.

cannot import name '_C' from 'groundingdino'

I'm able to run it properly on a local GPU machine I've got, but when I move this to the cloud in a Docker image, I get this issue. I have a feeling there's some system package I'm missing or something, but I'm not 100% sure.

Some sub-words are ignored when using long words

Thanks for the great code. I encountered an issue when using Grounding DINO (or maybe it is just expected?).
If I use a long word, like 'pottedplant', it will be tokenized into several sub-words.
When generating the output bounding boxes, some sub-words are ignored (I guess this is because the cross-attention is done at the token level, so the scores of some sub-words are lower than the text threshold), and the generated label is incomplete.
For example, 'pottedplant' -> 'pot' 'ted' 'pl' 'ant', and some box labels are wrong, like 'potted' or 'pottedpl'.
I wonder whether there is any solution for this?
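
For what it's worth, the splitting described above can be reproduced directly with the tokenizer; the exact pieces depend on the tokenizer version, so the output in the comment is only what the issue reports.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("pottedplant"))  # reported as sub-words like 'pot', 'ted', 'pl', 'ant'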

ERROR: No matching distribution found for supervision==0.4.0

pip install supervision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Looking in indexes: http://pypi.douban.com/simple
Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement supervision==0.4.0 (from versions: 0.1.0, 0.2.0, 0.2.1, 0.3.0, 0.3.1)
ERROR: No matching distribution found for supervision==0.4.0

bug?

please double check here:

inference_on_a_image.py

box_threshold = args.box_threshold
text_threshold = args.box_threshold
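
Presumably the second assignment was meant to read args.text_threshold, since the demo exposes both --box_threshold and --text_threshold flags; the intended code would then be:

box_threshold = args.box_threshold
text_threshold = args.text_threshold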

Installation error

Hi,

I'm getting this error message during installation:

UserWarning: The detected CUDA version (11.5) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
error: command '/usr/bin/nvcc' failed with exit code 1
[end of output]

Could you give some tips to fix this issue?

Some questions on the details in paper

To the Authors

This is a very interesting and good work on visual grounding tasks with a Query-based detector. The paper is also well written and clear. Super interesting results with GLIGEN as well. I do have a few very specific questions about the implementation or concepts in the paper.

  1. As for the language-guided query selection: this module makes a lot of sense, and you are basically saying that you want to extract the locations of the image tokens that have the greatest responses to the text tokens, and then use these as the location queries in the Mixed-Query-Selection design of DINO. I notice you describe the outer product between text/image tokens as logits. My questions are: (a) Is there any supervision at this level? If not, did you use any pretrained vision-language initialization so that they naturally respond? (b) Does it make more sense to use the normalized feature vectors so that the dot product is actually a correlation? (c) What happens if the selected image tokens all respond to the same text token or only a few text tokens, and is there any way to separate them out, like the 1st-stage training in Deformable DETR or DINO?
  2. As for the Sub-Sentence Level Text Feature: (a) How is the attention mask produced when dealing with weak annotations such as image-caption pairs (Cap4M)? Did you use a noun extraction method as described in DetCLIP? As a detailed example, how do you generate the attention mask for a concept like "fruit fly" or a human name such as "Harry Potter" when the detection dataset doesn't have this category? (b) And how do you handle the input length limit, as GLIP describes in their paper, when you have over 1000 categories like LVIS during training/inference? Was there a sparse negative category sampling strategy?
  3. Loss Function: Is the negative class handled similarly to the alignment loss described in GLIP or MDETR? I assume you apply sigmoid focal loss and the negative object queries simply learn 0 from the {0, 1} binary target?
  4. Last but not least, do you think it's possible to leverage other frameworks such as pretrained ALBEF, VLMo, or even BeiTv3 and inject your design into them? If not, what do you think are the limitations of these frameworks?

Thank you.
