
glip's Introduction

GLIP: Grounded Language-Image Pre-training

Updates

[Workshop] | [IC Challenge] | [OD Challenge]

  • 09/13/2022: Updated HuggingFace Demo! Feel free to give it a try!!!

    • Acknowledgement: Many thanks to the help from @HuggingFace for a Space GPU upgrade to host the GLIP demo!
  • 06/21/2022: GLIP has been selected as a Best Paper Finalist at CVPR 2022!

  • 06/16/2022: ODinW benchmark released! GLIP-T A&B released!

  • 06/13/2022: GLIPv2 is on Arxiv https://arxiv.org/abs/2206.05836!

  • 04/30/2022: Updated Colab Demo!

  • 04/14/2022: GLIP has been accepted to CVPR 2022 as an oral presentation! First version of code and pre-trained models are released!

  • 12/06/2021: GLIP paper on arxiv https://arxiv.org/abs/2112.03857.

  • 11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP. GLIP demonstrates strong zero-shot and few-shot transferability to various object-level recognition tasks.

  1. When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
  2. After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
  3. When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

We provide code for:

  1. pre-training GLIP on detection and grounding data;
  2. zero-shot evaluation of GLIP on standard benchmarks (COCO, LVIS, Flickr30K) and custom COCO-formatted datasets;
  3. fine-tuning GLIP on standard benchmarks (COCO) and custom COCO-formatted datasets;
  4. a Colab demo;
  5. toolkits for the Object Detection in the Wild (ODinW) benchmark with 35 downstream detection tasks.

Please see respective sections for instructions.

Demo

Please see a Colab demo at link!
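If you prefer to run the demo locally rather than on Colab, the snippet below is a minimal usage sketch based on the Colab notebook and the GLIPDemo helper referenced in the issues further down this page; the config file, checkpoint path, and image path are placeholders you should replace with a config/weight pair from the Model Zoo.

# Minimal local demo sketch (paths below are placeholders; pick a config/weight pair from the Model Zoo).
import cv2
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

config_file = "configs/pretrain/glip_Swin_T_O365_GoldG.yaml"   # example config from this repo
weight_file = "MODEL/glip_tiny_model.pth"                      # placeholder checkpoint path

cfg.local_rank = 0                                             # single-GPU settings, as in the Colab notebook
cfg.num_gpus = 1
cfg.merge_from_file(config_file)
cfg.merge_from_list(["MODEL.WEIGHT", weight_file, "MODEL.DEVICE", "cuda"])

glip_demo = GLIPDemo(cfg, min_image_size=800, confidence_threshold=0.5, show_mask_heatmaps=False)

image = cv2.imread("your_image.jpg")                           # BGR image, as used elsewhere in this repo
caption = "bobble heads on top of the shelf"                   # free-form text prompt
result, _ = glip_demo.run_on_web_image(image, caption, 0.5)    # returns the visualization and predictions
cv2.imwrite("prediction.jpg", result)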

Installation and Setup

Environment. This repo requires PyTorch>=1.9 and torchvision. We recommend using Docker to set up the environment. Depending on your GPU, you can use one of these pre-built Docker images: docker pull pengchuanzhang/maskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9 or docker pull pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9.

Then install the following packages:

pip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo
pip install transformers 
python setup.py build develop --user
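
Several issues further down this page report "RuntimeError: Not compiled with GPU support" when the custom CUDA ops were not built correctly. A quick post-install sanity check is sketched below; maskrcnn_benchmark._C is the compiled extension module that appears in the tracebacks later on this page.

# Sanity check (sketch): the import fails if `python setup.py build develop` did not build the extension.
import torch
import maskrcnn_benchmark._C  # compiled custom ops (deformable conv, etc.)

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # CUDA must be visible at build time for the GPU kernels to be compiled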

Backbone Checkpoints. Download the ImageNet pre-trained backbone checkpoints into the MODEL folder.

mkdir MODEL
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/swin_tiny_patch4_window7_224.pth -O MODEL/swin_tiny_patch4_window7_224.pth
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/swin_large_patch4_window12_384_22k.pth -O MODEL/swin_large_patch4_window12_384_22k.pth

Model Zoo

Checkpoint host move. The original checkpoint links have expired. We are moving the checkpoints to https://huggingface.co/harold/GLIP/tree/main; most checkpoints are already available, and we are working to host the remaining ones as soon as possible.

Model      | COCO [1]        | LVIS [2] | LVIS [3] | ODinW [4] | Pre-Train Data             | Config     | Weight
GLIP-T (A) | 42.9 / 52.9     | -        | 14.2     | ~28.7     | O365                       | config     | weight
GLIP-T (B) | 44.9 / 53.8     | -        | 13.5     | ~33.2     | O365                       | config     | weight
GLIP-T (C) | 46.7 / 55.1     | 14.3     | 17.7     | 44.4      | O365,GoldG                 | config     | weight
GLIP-T [5] | 46.6 / 55.2     | 17.6     | 20.1     | 42.7      | O365,GoldG,CC3M,SBU        | config [6] | weight
GLIP-L [7] | 51.4 / 61.7 [8] | 29.3     | 30.1     | 51.2      | FourODs,GoldG,CC3M+12M,SBU | config [9] | weight

[1] Zero-shot and fine-tuning performance on COCO val2017.

[2] Zero-shot performance on LVIS minival (APr) with the last pre-trained checkpoint.

[3] On LVIS, the model could overfit slightly during pre-training. Thus we report two numbers on LVIS: the performance of the last checkpoint (LVIS [2]) and the performance of the best checkpoint during pre-training (LVIS [3]).

[4] Zero-shot performance on the 13 ODinW datasets. The numbers reported in the GLIP paper are from the best checkpoint during pre-training, which may be slightly higher than the numbers for the released last checkpoint, similar to the case of LVIS.

[5] The GLIP-T released in this repo is pre-trained on Conceptual Captions 3M and SBU captions. It is referred to in the paper in Table 1 and Appendix C.3. It differs slightly from the GLIP-T in the main paper in terms of downstream performance. We will release pre-training support for the CC3M and SBU captions data in the next update.

[6] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M and SBU captions data will be updated.

[7] The GLIP-L released in this repo is pre-trained on Conceptual Captions 3M+12M and SBU captions. It slightly outperforms the GLIP-L in the main paper because the model used to annotate the caption data was improved compared to the main paper. We will release pre-training support for the CC3M+12M and SBU captions data in the next update.

[8] Multi-scale testing used.

[9] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M+12M and SBU captions data to be updated.
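
Since the checkpoints are being migrated to the Hugging Face repo mentioned above, one convenient way to fetch a checkpoint programmatically is via the huggingface_hub package; the sketch below assumes you have it installed, and the filename is a placeholder.

# Sketch: download a GLIP checkpoint from the Hugging Face host (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

# The filename is a placeholder; browse https://huggingface.co/harold/GLIP/tree/main for the actual file names.
ckpt_path = hf_hub_download(repo_id="harold/GLIP", filename="glip_tiny_model_o365_goldg.pth")
print("checkpoint downloaded to:", ckpt_path)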

Pre-Training

Required Data. Prepare Objects365, Flickr30K, and MixedGrounding data as in DATA.md. Support for training using caption data (Conceptual Captions and SBU captions) will be released soon.

Command.

Perform pre-training with the following command (change the config file as needed; see the Model Zoo for the corresponding config, and set {output_dir} to your desired output directory):

python -m torch.distributed.launch --nnodes 2 --nproc_per_node=16 tools/train_net.py \
    --config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml \
    --skip-test --use-tensorboard --override_output_dir {output_dir}

For training GLIP-T models, we used nnodes=2, nproc_per_node=16 on 32GB V100 machines. For training GLIP-L models, we used nnodes=4, nproc_per_node=16 on 32GB V100 machines. Please adjust these settings according to your local machine setup.

(Zero-Shot) Evaluation

COCO Evaluation

Prepare COCO/val2017 data as in DATA.md. Set {config_file}, {model_checkpoint} according to the Model Zoo; set {output_dir} to a folder where the evaluation results will be stored.

python tools/test_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \
        TEST.IMS_PER_BATCH 1 \
        MODEL.DYHEAD.SCORE_AGG "MEAN" \
        TEST.EVAL_TASK detection \
        MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \
        OUTPUT_DIR {output_dir}

LVIS Evaluation

We follow MDETR to evaluate with the FixedAP criterion. Set {config_file}, {model_checkpoint} according to the Model Zoo. Prepare COCO/val2017 data as in DATA.md.

python -m torch.distributed.launch --nproc_per_node=4 \
        tools/test_grounding_net.py \
        --config-file {config_file} \
        --task_config configs/lvis/minival.yaml \
        --weight {model_checkpoint} \
        TEST.EVAL_TASK detection OUTPUT_DIR {output_dir} \
        TEST.CHUNKED_EVALUATION 40 TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 \
        MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300

If you wish to evaluate on Val 1.0, set --task_config to configs/lvis/val.yaml.

ODinW / Custom Dataset Evaluation

GLIP supports easy evaluation on custom datasets. Currently, the code supports evaluation on COCO-formatted datasets.

We use the Aquarium dataset from ODinW as an example to show how to evaluate on a custom COCO-formatted dataset.

  1. Download the raw dataset from RoboFlow in the COCO format into DATASET/odinw/Aquarium. Each train/val/test split has a corresponding annotation file and an image folder.

  2. Remove the background class from the annotation file. This can be as simple as opening "_annotations.coco.json" and removing the entry with "id": 0 from "categories" (see the scripted sketch after this list). For convenience, we provide the modified annotation files for Aquarium:

    wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/test/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/test/annotations_without_background.json
    wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/train/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/train/annotations_without_background.json
    wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/valid/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/valid/annotations_without_background.json
    
  3. Then create a yaml file as in configs/odinw_13/Aquarium_Aquarium_Combined.v2-raw-1024.coco.yaml. A few fields in the yaml are worth noting:

    DATASETS.CAPTION_PROMPT allows manually changing the prompt (the default prompt simply concatenates all the category names);

    MODEL.*.NUM_CLASSES needs to be set to the number of categories in the dataset, including the background class. E.g., Aquarium has 7 non-background categories, so MODEL.*.NUM_CLASSES is set to 8;

  4. Run the following command to evaluate on the dataset. Set {config_file}, {model_checkpoint} according to the Model Zoo. Set {odinw_configs} to the path of the task yaml file we just prepared.

python tools/test_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \
      --task_config {odinw_configs} \
      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \
      TEST.EVAL_TASK detection \
      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \
      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \
      DATASETS.USE_OVERRIDE_CATEGORY True \
      DATASETS.USE_CAPTION_PROMPT True
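
For step 2 above, stripping the background category can also be scripted. Below is a minimal sketch (paths are placeholders) that removes the category with id 0 from a COCO-format annotation file:

# Sketch: remove the background category (id 0) from a COCO-format annotation file (paths are placeholders).
import json

src = "path/to/_annotations.coco.json"
dst = "path/to/annotations_without_background.json"

with open(src) as f:
    coco = json.load(f)

coco["categories"] = [c for c in coco["categories"] if c["id"] != 0]  # drop the background entry

with open(dst, "w") as f:
    json.dump(coco, f)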

Flickr30K Evaluation

Prepare Flickr30K data as in DATA.md. Set {config_file}, {model_checkpoint} according to the Model Zoo.

python tools/test_grounding_net.py \
        --config-file {config_file} \
        --task_config configs/flickr/test.yaml,configs/flickr/val.yaml \
        --weight {model_checkpoint} \
        OUTPUT_DIR {output_dir} TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 100 TEST.EVAL_TASK grounding MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False

Fine-Tuning

COCO Fine-Tuning

Prepare the COCO data as in DATA.md. Set {config_file}, {model_checkpoint} according to the Model Zoo.

Below is the fine-tuning script for tuning the Tiny models:

python -m torch.distributed.launch --nproc_per_node=16 tools/train_net.py \
       --config-file {config_file} \
       --skip-test \
       MODEL.WEIGHT {model_checkpoint} \
       DATASETS.TRAIN '("coco_grounding_train", )' \
       MODEL.BACKBONE.FREEZE_CONV_BODY_AT -1 SOLVER.IMS_PER_BATCH 32 SOLVER.USE_AMP True SOLVER.MAX_EPOCH 24 TEST.DURING_TRAINING False TEST.IMS_PER_BATCH 16 SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.BASE_LR 0.00001 SOLVER.LANG_LR 0.00001 SOLVER.STEPS \(0.67,0.89\) DATASETS.DISABLE_SHUFFLE True MODEL.DYHEAD.SCORE_AGG "MEAN" TEST.EVAL_TASK detection

For evaluation, please follow the instructions in COCO Evaluation. Scripts for tuning the Large model will be released soon.

ODinW / Custom Dataset Fine-Tuning

Prepare the dataset as in ODinW / Custom Dataset Evaluation.

Full Model Fine-Tuning

For tuning with 1/3/5/10-shot, set {custom_shot_and_epoch_and_general_copy} to "1_200_8", "3_200_4", "5_200_2", "10_200_1", respectively.

For tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to "0_200_1"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.

python -m torch.distributed.launch --nproc_per_node=4 tools/finetune.py \
      --config-file {config_file}  --ft-tasks {configs} --skip-test \
      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \
      --evaluate_only_best_on_test --push_both_val_and_test \
      MODEL.WEIGHT {model_checkpoint} \
      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full

Prompt Tuning

Follow the command as in Full Model Fine-Tuning. But set the following hyper-parameters:

SOLVER.WEIGHT_DECAY 0.25 \
SOLVER.BASE_LR 0.05 \
SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2

The Object Detection in the Wild Benchmark

ODinW was first proposed in GLIP and refined and formalized in ELEVATER. GLIP used 13 downstream tasks, while the full ODinW has 35 downstream tasks. It will be hosted as a challenge at the CV in the Wild Workshop @ ECCV 2022. We hope our code encourages the community to participate in this challenge!

ODinW was introduced in GLIP and initially contained 13 datasets. We further expanded it by including more datasets from RoboFlow; the final version contains 35 datasets.

To distinguish between the two versions, we denote the version used by GLIP as ODinW-13 and the version used by the CVinW workshop as ODinW-35.

This repo also provides the necessary code to train and evaluate on ODinW. See instructions below.

Download ODinW

RoboFlow hosts all the original datasets. We are also hosting the datasets and provide a simple script to download all the data.

python odinw/download_datasets.py

configs/odinw_35 contains the meta information for all 35 datasets; configs/odinw_13 covers the 13 datasets used by GLIP. Each dataset follows the COCO detection format.

All ODinW datasets are in the COCO format, so we can directly use similar scripts to the ones above to adapt and evaluate pre-trained models on ODinW. Below is a brief recap.

(Zero-Shot) Evaluation

odinw_configs can be any of the configs from configs/odinw_14 and configs/odinw_35.

python tools/test_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \
      --task_config {odinw_configs} \
      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \
      TEST.EVAL_TASK detection \
      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \
      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \
      DATASETS.USE_OVERRIDE_CATEGORY True \
      DATASETS.USE_CAPTION_PROMPT True

Full-Model Fine-Tuning

For tuning with 1/3/5/10-shot, set {custom_shot_and_epoch_and_general_copy} to "1_200_8", "3_200_4", "5_200_2", "10_200_1", respectively.

For tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to "0_200_1"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.

python -m torch.distributed.launch --nproc_per_node=4 tools/finetune.py \
      --config-file {config_file}  --ft-tasks {odinw_configs} --skip-test \
      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \
      --evaluate_only_best_on_test --push_both_val_and_test \
      MODEL.WEIGHT {model_checkpoint} \
      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full

Prompt Tuning

For tuning with 1/3/5/10-shot, set {custom_shot_and_epoch_and_general_copy} to "1_200_8", "3_200_4", "5_200_2", "10_200_1", respectively.

For tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to "0_200_1"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.

Follow the command as in Full Model Fine-Tuning. But set the following hyper-parameters:

SOLVER.WEIGHT_DECAY 0.25 \
SOLVER.BASE_LR 0.05 \
SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2

Linear Probing

For tuning with 1/3/5/10-shot, set {custom_shot_and_epoch_and_general_copy} to "1_200_8", "3_200_4", "5_200_2", "10_200_1", respectively.

For tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to "0_200_1"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.

Follow the command as in Full Model Fine-Tuning. But set the following hyper-parameters:

SOLVER.TUNING_HIGHLEVEL_OVERRIDE linear_prob

Knowledge-Augmented Inference

GLIP also supports knowledge-augmented inference. Please see our paper for details. Here we provide an example of how to use external knowledge. First, download the specialized GLIP-A model for knowledge-augmented inference:

wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_a_tiny_o365_knowledge.pth -O MODEL/glip_a_tiny_o365_knowledge.pth

python tools/test_grounding_net.py --config-file configs/pretrain/glip_A_Swin_T_O365.yaml --weight MODEL/glip_a_tiny_o365_knowledge.pth \
      --task_config {odinw_configs} \
      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \
      TEST.EVAL_TASK detection \
      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \
      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \
      DATASETS.USE_OVERRIDE_CATEGORY True \
      DATASETS.USE_CAPTION_PROMPT True \
      GLIPKNOW.KNOWLEDGE_FILE knowledge/odinw_benchmark35_knowledge_and_gpt3.yaml GLIPKNOW.KNOWLEDGE_TYPE gpt3_and_wiki GLIPKNOW.PARALLEL_LANGUAGE_INPUT True GLIPKNOW.LAN_FEATURE_AGG_TYPE first MODEL.DYHEAD.FUSE_CONFIG.USE_LAYER_SCALE True GLIPKNOW.GPT3_NUM 3 GLIPKNOW.WIKI_AND_GPT3 True

Submit Your Results to the ODinW Leaderboard

Participating teams are encouraged to upload their results to the ODinW leaderboard on EvalAI. From the perspective of data labeling cost, lowering the amount of labeled data required enables more scenarios, so the challenge considers several tracks: zero-shot, few-shot, and full-shot. Please see the ODinW website for more details about each phase.

  1. For the zero-shot and full-shot settings, the required format for the prediction json file is:
{
      "dataset_name (e.g., 'WildFireSmoke')": [value]
}

where value is a list of detections following COCO's result format, i.e., each entry contains "image_id", "category_id", "bbox", and "score".

Please see the provided examples: a zero-shot prediction file (all_predictions_zeroshot.json) and a full-shot prediction file (all_predictions_fullshot.json).

  2. For the few-shot setting (3-shot, according to the challenge description), three train-val subsets are generated with random seeds 3, 30, and 300, respectively. The required format for the prediction json file is:
{
      "dataset_name (e.g., 'WildFireSmoke')": {
            "rand_seed_num (e.g., '30')": [value]
      }
}

where value again is a list of detections following COCO's result format ("image_id", "category_id", "bbox", "score").

Please see the provided example of a few-shot prediction file: all_predictions_3_shot.json.
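
As a concrete illustration of the two submission formats described above, here is a minimal sketch (the dataset name, seeds, and detections are placeholders) that assembles and saves both prediction files:

# Sketch of the two submission formats described above (all values are placeholders).
import json

# Detections in the COCO result format: "image_id", "category_id", "bbox", "score".
coco_results = [{"image_id": 1, "category_id": 2, "bbox": [10, 20, 30, 40], "score": 0.9}]

# Zero-shot / full-shot: dataset name -> list of COCO-format detections.
with open("all_predictions_zeroshot.json", "w") as f:
    json.dump({"WildFireSmoke": coco_results}, f)

# Few-shot: dataset name -> random seed -> list of COCO-format detections.
few_shot = {"WildFireSmoke": {"3": coco_results, "30": coco_results, "300": coco_results}}
with open("all_predictions_3_shot.json", "w") as f:
    json.dump(few_shot, f)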

Citations

Please consider citing our papers if you use the code:

@inproceedings{li2021grounded,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2022},
      booktitle={CVPR},
}
@article{zhang2022glipv2,
  title={GLIPv2: Unifying Localization and Vision-Language Understanding},
  author={Zhang, Haotian* and Zhang, Pengchuan* and Hu, Xiaowei and Chen, Yen-Chun and Li, Liunian Harold and Dai, Xiyang and Wang, Lijuan and Yuan, Lu and Hwang, Jenq-Neng and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2206.05836},
  year={2022}
}
@article{li2022elevater,
  title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},
  author={Li*, Chunyuan and Liu*, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Lee, Yong Jae and Hu, Houdong and Liu, Zicheng and others},
  journal={arXiv preprint arXiv:2204.08790},
  year={2022}
}

glip's People

Contributors

chunyuanli, haotian-zhang, liunian-harold-li, pzzhang


glip's Issues

Using validation set loss for evaluation

I'd like to track the validation set loss for finetuning evaluation on a custom dataset (i.e., as shown in the original maskrcnn trainer here), rather than the COCO-style AP metrics.

Unfortunately, the evaluation data loader seems to be set up differently from the training data loader. In particular, the positive_map value used by the trainer does not seem to be created, and a branch of the code that attempts to cope with this absence fails.

Is there a good or easy way to accomplish this goal?

Starting from this line in GLIP's trainer.py, I've made a coarse attempt to see what is produced:

model.train()
with torch.no_grad():
  for i, batch in enumerate(val_data_loader):
    images, targets, image_ids, positive_map, *_ = batch
    images = images.to(device)
    if positive_map is None:
      loss_dict = model(images, targets)
    else:
      captions = [t.get_field("caption") for t in targets if "caption" in t.fields()]
      if len(captions) > 0:
        loss_dict = model(images, targets, captions, positive_map)
      else:
        loss_dict = model(images, targets)
      losses = sum(loss for loss in loss_dict.values())
      loss_dict_reduced = reduce_loss_dict(loss_dict)

All the evaluation batches come through without positive_map and the model does not seem to be in a state to accept the call without it. That is, the line loss_dict = model(images, targets) is the one called; it fails:

Traceback (most recent call last):
  File "./GLIP/tools/finetune.py", line 480, in <module>
    main()
  File "./GLIP/tools/finetune.py", line 455, in main
    model = train(
  File "./GLIP/tools/finetune.py", line 169, in train
    do_train(
  File "./GLIP/maskrcnn_benchmark/engine/trainer.py", line 306, in do_train
    loss_dict = model(images, targets)
  File "/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py", line 284, in forward
    proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
  File "/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 905, in forward
    embedding = language_dict_features['embedded']
KeyError: 'embedded'

I've dug around the data loaders and pipeline extensively, but haven't quite figured out how to connect the dots appropriately.

Thanks for any input anyone can offer.

Question about Object365

Dear author,

Did you use the V1 version only for Object365? In the paper, you said Object365 contains 0.66M images. The current Object365 training set has 1.72M images.

Thank you!

How to use negative samples to train the model?

Hi, in detection there are lots of false positives. For example, when we want to detect human faces, some animal faces or cartoon faces are detected as false positives, so we need to add negative samples to the training data.

But I found that during training, negative data without annotations is ignored. Could you please tell me the right way to add this negative data? Thanks.

It seems that the docker environment(pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9) has some bugs

I find the torch version in the docker is "1.9.0a0+c3d40fd", but when I run the code it raises an error: module 'torch.nn.functional' has no attribute 'mish'. In the other docker environment, the torch version is 1.9.1 and there is no error at that position.
What confuses me is that I can see the 'mish' function in version 1.9.0 in the official PyTorch documentation. I'm not sure how to find the source code for "1.9.0a0+c3d40fd".

Object365 in tsv format

Dear authors,

Thanks for presenting such a great work.

I'm interested in the pre-training part but quite confused by the data format of Object365. From #10, I know an old version of Object365 (v1) was used, but Object365 has since been updated to v2.
It seems some v1 images were deleted from v2: the provided train.label.tsv in this repo contains 608606 images, but I can only find 519789 of them in Object365 v2.
Would you mind sharing the script that generates the tsv format, so that we can generate the required data for Object365 v2?

During tsv data generation, it seems we need to load all the images into memory, which takes too much memory. Is there a way to do pre-training without the tsv format?

Best

Model weight file becomes smaller after prompt tuning?

Hi, I was prompt tuning "glip_large_model.pth" with the following commands:

python -m torch.distributed.launch --nproc_per_node=1 tools/finetune.py \
      --config-file configs/pretrain/glip_Swin_L.yaml \
      --ft-tasks configs/odinw_35/10435.yaml \
      --skip-test \
      --custom_shot_and_epoch_and_general_copy 10_200_4 \
      --evaluate_only_best_on_test --push_both_val_and_test \
      MODEL.WEIGHT MODEL/glip_large_model.pth \
      SOLVER.USE_AMP True TEST.DURING_TRAINING True SOLVER.IMS_PER_BATCH 3 SOLVER.WEIGHT_DECAY 0.25 TEST.EVAL_TASK detection \
      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True \
      SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 \
      DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 \
      SOLVER.TUNING_HIGHLEVEL_OVERRIDE full SOLVER.BASE_LR 0.05 SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2 \
      TEST.IMS_PER_BATCH 3 SOLVER.FIND_UNUSED_PARAMETERS False

I found that after prompt tuning, the weight file reduced from 6.9GB to 1.7GB. Is that normal? What's the reason behind that?

In order to use the finetuned model, I need to set ADD_LINEAR_LAYER: True. Is there anything else we need to change?

GLIP Demo - Bug Report

At this line, I think it should be tokens_positive.append([[len(caption_string), len(caption_string) + len(word)]]), similar to line 121, and line 195 should be removed.

Otherwise, when we call tokens_positive.append([len(caption_string), len(caption_string) + len(word)]), the function will only create a single entity.

Pretrained model weights without the deep fusion module

Hi, thanks for your great work! I really enjoy reading your paper.
I'm wondering if you could release the model weights pre-trained only on the GoldG & Cap24M datasets without the deep fusion module (both GLIP-T & GLIP-L). We want to run some ablation experiments in that setting. Thank you so much!

The code of GLIPv2

Hi, may I ask when the code of GLIPv2 will be open-sourced? We can't wait to try it.

Labels for the bounding boxes

So when I get the bounding boxes, to retrieve the class name, I do something like this:
self.entities[prediction.get_field('labels')[bbox_id]-1]

Is that correct?
However, sometimes prediction.get_field('labels')[bbox_id]-1 is bigger than the total number of entities. Is that a special entity for a "no object" class?

Equivalence between Patch Merging and Conv.

Hello, after looking at the code in the patch merging part, we found that the operation that slices the feature map, concatenates the slices, and then passes them through a linear layer to reduce the dimension from 4C to 2C is exactly equivalent to a conv layer with kernel size 2 and stride 2.

  1. The operation concatenates the 4 pixels of each 2x2 patch into 1 pixel with quadrupled channels.
    Every 2x2 patch shares the same weights in the linear layer (self.reduction).
    A conv (kernel size=2, stride=2) does the same thing.

  2. The number of parameters of this linear layer equals that of the conv layer:
    linear layer params = input channels * output channels = 4C * 2C = 8 * C^2
    conv layer params = kernel size * kernel size * input channels * output channels = 2 * 2 * C * (2 * C) = 8 * C^2
    So, linear layer params == conv layer params.
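
For readers who want to check this claim numerically, the following is a small PyTorch sketch (not part of the GLIP codebase) showing that Swin-style patch merging with a shared Linear(4C, 2C) matches a Conv2d(C, 2C, kernel_size=2, stride=2) whose 2x2 kernel is filled with the rearranged linear weights:

# Numerical check (sketch): Swin-style patch merging == Conv2d(C, 2C, kernel_size=2, stride=2).
import torch
import torch.nn as nn

B, H, W, C = 2, 8, 8, 16
x = torch.randn(B, H, W, C)

reduction = nn.Linear(4 * C, 2 * C, bias=False)

# Patch-merging path: gather the 4 pixels of each 2x2 patch, concatenate to 4C, then project to 2C.
x0 = x[:, 0::2, 0::2, :]   # top-left pixel of each 2x2 patch
x1 = x[:, 1::2, 0::2, :]   # bottom-left
x2 = x[:, 0::2, 1::2, :]   # top-right
x3 = x[:, 1::2, 1::2, :]   # bottom-right
merged = reduction(torch.cat([x0, x1, x2, x3], dim=-1))           # (B, H/2, W/2, 2C)

# Equivalent conv path: copy the linear weights into the 2x2 kernel positions.
conv = nn.Conv2d(C, 2 * C, kernel_size=2, stride=2, bias=False)
w0, w1, w2, w3 = reduction.weight.chunk(4, dim=1)                 # each (2C, C)
with torch.no_grad():
    conv.weight[:, :, 0, 0] = w0
    conv.weight[:, :, 1, 0] = w1
    conv.weight[:, :, 0, 1] = w2
    conv.weight[:, :, 1, 1] = w3
conv_out = conv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)        # back to (B, H/2, W/2, 2C)

print(torch.allclose(merged, conv_out, atol=1e-5))                # should print True (up to float tolerance)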

GLIP demo issue in colab

I am getting this issue:

/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py in overlay_entity_names(self, image, predictions, names, text_size, text_pixel, text_offset, text_offset_original)
345
346 cv2.putText(
--> 347 image, s, (int(x), int(y)-text_offset_original), cv2.FONT_HERSHEY_SIMPLEX, text_size, (self.color, self.color, self.color), text_pixel, cv2.LINE_AA
348 )
349 previous_locations.append((int(x), int(y)))

AttributeError: 'GLIPDemo' object has no attribute 'color'

The noun phrase extraction algorithm

Hi,

I am interested in the noun phrase extraction algorithm, but it seems this process is not described in much detail in the GLIP paper and code. Are more details about it available?

Thanks,
Blakey

Extract region features

Hi there! Thank you for your amazing work! How can I extract features for regions in a given image?

How to use prompt tuning (language_prompt_v2) on custom datasets?

Hi, thanks for sharing this nice work!
I would like to confirm the following questions:

  • When I follow the setting "SOLVER.WEIGHT_DECAY 0.25 SOLVER.BASE_LR 0.05 SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2", I cannot get a good result. There is still a big gap compared to full-model tuning (1-shot, 3-shot, 5-shot, 10-shot, ..., full data).
  • When I only use the category name in my custom dataset, I cannot achieve the effect reported in the paper. When I use a manual prompt, can you give me some suggestions?
  • Are there automated ways to set the prompt, e.g., from NLP?
  • I found 4 language prompt tuning modes in the code; can you tell me the difference between them?

Finally, thanks for this great work!

Do you use alignment loss?

In the paper, you introduced the text-region alignment loss. However, in the config files, USE_CONTRASTIVE_ALIGN_LOSS is set to False. Is there any explanation for not using it?

How to do prompt fine-tuning on a custom dataset

I am trying to do prompt fine-tuning on my custom dataset with only one class, so I set the data config as below:

DATALOADER:
  ASPECT_RATIO_GROUPING: false
  SIZE_DIVISIBILITY: 32

DATASETS:
  GENERAL_COPY: 16
  CAPTION_PROMPT: '[{"prefix": " ", "name": "cigaret", "suffix": ", tube-shaped product smoked near mouth or holding by finger"}, ]'
  REGISTER:
    train:
      ann_file: '/data/smoke/coco_format_json/coco_train_merge.json'
      img_dir:  ''
    val:
      ann_file: '/data/smoke/coco_format_json/smoke_val_102.json'
      img_dir: '' 
  TEST: ("val",)
  TRAIN: ("train",)
INPUT:
  MAX_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 640
  MIN_SIZE_TEST: 640
  MIN_SIZE_TRAIN: 640
MODEL:
  ATSS:
    NUM_CLASSES: 2
  DYHEAD:
    NUM_CLASSES: 2
  FCOS:
    NUM_CLASSES: 2
  ROI_BOX_HEAD:
    NUM_CLASSES: 2
SOLVER:
  CHECKPOINT_PERIOD: 100
  MAX_EPOCH: 0
  WARMUP_ITERS: 0
TEST:
  IMS_PER_BATCH: 1

Then I try to do inference following the demo (https://colab.research.google.com/drive/12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing), using the config file generated in ft_task_1/config.yaml. The important inference code is as follows:

caption = 'cigaret, tube-shaped product smoked near mouth or holding by finger'

image = cv2.imread(im_path)
result, preds = glip_demo.run_on_web_image(image, caption, 0.5)

If I use the num_class=2 configuration, it reports an index overflow, so I changed num_class back to the default 81; then the code runs without errors, but there are no predictions for other objects like "mouth"/"finger", which is strange.

Could you tell me the right way to do prompt fine-tuning and, after fine-tuning, the right way to run inference with the new model?

Thanks.

No box detected using code in the Colab demo

Thanks for the great work! I'm trying the code provided in your Colab demo. However, no bounding box is detected (no errors appear during compilation and execution). I've also tried re-installing packages to ensure the package versions on my server match those in your code. Are there any possible reasons for this result?

'GLIPDemo' object has no attribute 'color'

when calling self.overlay_entity_names(result, top_predictions),

cv2.putText(
image, s, (int(x), int(y)-text_offset_original), cv2.FONT_HERSHEY_SIMPLEX, text_size, (self.color, self.color, self.color), text_pixel, cv2.LINE_AA
)
self.color is not initialized when running the demo.

color is initialized in visualize_with_predictions, which is not called in the demo.


Some yaml files are missing

Hello, when we fine-tune the code, we find that some yaml files are missing,
e.g., Objects365/objects365_train_vgoiv6.cas2000.yaml, Objects365/train.cas2000.yaml, and so on.

Could you share these files with us?

Best regards.

Runtime Error in Colab demo

Hi, thanks a lot for releasing such amazing work. But I encountered something unexpected in the Colab demo.

image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')

caption = 'bobble heads on top of the shelf'

result, _ = glip_demo.run_on_web_image(image, caption, 0.5)

imshow(result, caption)

When I run the code above, "RuntimeError: Not compiled with GPU support" is reported. Can you provide some advice?

INPUT parameter in cfg (for example "Aquarium_Aquarium_Combined.v2-raw-1024.coco.yaml")

Hello,

thanks for your awesome work and that you provide the code here on github!

I have a quick question about the INPUT parameter in the cfg files:

for example:
INPUT: MIN_SIZE_TRAIN: 800 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MAX_SIZE_TEST: 1333

Could you please elaborate what this parameter does?

Does this define the maximum input size of the network?
For example if my images of a custom dataset are 2560x1440, does this parameter downscale them to width=1333?
And if I had smaller images, would this upscale them to 800?

Also does the parameter "min_image_size" of GLIP_demo
glip_demo = GLIPDemo( cfg, min_image_size=800, confidence_threshold=0.5, show_mask_heatmaps=False )
define the input size of the images during inference?

The reason I ask is the following:
When I prompt-finetuned my custom dataset, I used the following INPUT parameters: all set to 2560. The training worked fine and the results were good.
Then I used GLIP_demo to run inference and inspect the output. Here I used min_image_size=2560, and I got an OOM error.

thank you for your reply!
Patrick

RuntimeError: Not compiled with GPU support

Hi guys,

Thanks for the amazing work! I am trying to run the model but I got the following error:

Traceback (most recent call last):
  File "tools/test_grounding_net.py", line 222, in <module>
    main()
  File "tools/test_grounding_net.py", line 205, in main
    inference(
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/engine/inference.py", line 495, in inference
    output = model(images, captions=captions, positive_map=positive_map_label_to_token)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py", line 284, in forward
    proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 920, in forward
    proj_tokens, contrastive_logits, dot_product_logits, mlm_logits, shallow_img_emb_feats, fused_visual_features = self.head(features,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 739, in forward
    dyhead_tower = self.dyhead_tower(feat_inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 205, in forward
    temp_fea = [self.DyConv[1](feature, **conv_args)]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 135, in forward
    x = self.conv(input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 235, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 380, in forward
    return modulated_deform_conv(
  File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 184, in forward
    _C.modulated_deform_conv_forward(
RuntimeError: Not compiled with GPU support

My nvidia-smi output is below; I've run python setup.py build develop --user beforehand.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 30%   49C    P8    26W / 350W |    129MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Thanks :)

Cheers,

Francesco

Per-dataset mAP for ODinW

Dear all,

I hope to find you well. Thanks for the fantastic work and for releasing GLIP to the world :)

I am working at Roboflow and we are excited that you used our hosted datasets to build ODinW. We would like to know if you can share the mAP per dataset and not only the average. This would help us understand the performance on datasets with a substantial domain shift compared with the training data.

Thanks a lot and looking forward to GLIP-v2 :)

Cheers,

Francesco

text encoder for detection

Hello, great work! However, there is one detail that confuses me: the input length of the language encoder is limited (e.g., 512), but when there are many categories to be detected, up to thousands (e.g., LVIS), how are the category names fed into the model? And will the attention module in the language encoder cause some words to be ignored, thus affecting detection performance?

Error in GLIP Colab and question

Hi! Thanks for this repo. This is just a heads-up that the Colab for GLIP is not working as it complains about % cd. If replaced with %cd, it's all good.

I also have a question. I tried setting show_mask_heatmaps to True and got an error. What should I change for this to work?

Torch 1.9.0 does not have torch.nn.functional.mish

I have pulled the 11.3 version docker image from the README.md, which has torch 1.9.0 installed. However, the code could not be run with torch 1.9.0 since it does not have the function "mish". Is there anything we can do about this?

Mask prediction

Thank you for your great work! I tried the demo and it works insanely well!

I'm wondering if your model contains a mask prediction head, because the GLIPDemo has a show_mask_heatmaps parameter. When I set it to true, the prediction does not have a mask field and therefore fails.

Do you have pretrained model with a mask prediction head?

About the deep fusion module?

Dear authors,

Thanks for presenting such a great work.

In your paper, you conduct an ablation study on the early fusion module (GLIP-T(A) vs. GLIP-T(B)) and demonstrate that the deep fusion module brings large improvements (zero-shot: 42.9 -> 44.9, FT: 52.9 -> 53.8).

I have several questions about this module:

  1. In our re-implementation, I find that this 6-layer fusion module nearly doubles the amount of computation. The fusion layer is composed of three sub-modules: VLFuse (bi-attention), DyConv (vision only), and BERTLayer (language only). Have you run a more detailed ablation, for example keeping only VLFuse (bi-attention) and removing DyConv and BERTLayer?

  2. In CoCa and ALBEF, contrastive learning is applied before fusion. Have you tried this align-before-fuse paradigm?

Some questions about the paper GLIP_v1

In Appendix C.1 (Pre-training Details), you say "with a probability of 0.5, we uniformly choose an integer N from [1, 85 - |Cpos|] and put N categories in the prompt".
What is the meaning of [1, 85 - |Cpos|], and how is the probability of 0.5 applied? I cannot figure out how N is obtained. Can you explain it in more detail?

Custom Dataset - Some guidance

Hi there, I'm confused with the terms tokens_positive / tokens_negative and the image caption itself.

What should be the image caption if I have multiple objects with different attributes on the same image?
For instance: a pink elephant, a blue elephant and a normal elephant in the same image.
Should the image caption in the annotation file be "blue elephant,normal elephant,pink elephant"?
For the boxes, should each elephant's tokens_positive point to the corresponding phrase?
e.g., for the blue elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [0, 13] }
e.g., for the normal elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [14, 29] }
e.g., for the pink elephant => { "category_id": 1, "bbox": [xMin, yMin, width, height], "tokens_positive": [30, 44] }
"categories": [{
"supercategory": "animal",
"id": 1,
"name": "elephant"
}]

Do you know any guide for creating the dataset?
Thanks!
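
As a purely illustrative sketch of the character-offset convention used in the example above (this is not an official answer; it assumes tokens_positive spans are [start, end) character offsets into the caption):

# Hypothetical helper (not part of GLIP): compute character spans for each phrase in a comma-joined caption.
phrases = ["blue elephant", "normal elephant", "pink elephant"]
caption = ",".join(phrases)              # "blue elephant,normal elephant,pink elephant"

tokens_positive = []
for phrase in phrases:
    start = caption.find(phrase)         # first occurrence of the phrase in the caption
    tokens_positive.append([start, start + len(phrase)])

print(caption)
print(tokens_positive)                   # [[0, 13], [14, 29], [30, 43]]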

Colab example Cuda 10.2 install error

Hi

Firstly, thank you for sharing this project.

When running the demo on Colab, I get this error:

https://colab.research.google.com/drive/12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing#scrollTo=BtMdw_J6PprI

cuda-repo-ubuntu180 100%[===================>]   1.77G   233MB/s    in 7.7s    

2022-06-16 18:28:57 (235 MB/s) - ‘cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb’ saved [1896270068/1896270068]

Selecting previously unselected package cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01.
(Reading database ... 114088 files and directories currently installed.)
Preparing to unpack cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb ...
Unpacking cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
Setting up cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
OK
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package cuda

The rest of the notebook runs until this cell

image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
caption = 'bobble heads on top of the shelf'
result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
imshow(result, caption)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-6-d454bb231030>](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in <module>()
      1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
      2 caption = 'bobble heads on top of the shelf'
----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
      4 imshow(result, caption)

17 frames
[/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
    201             ctx.groups,
    202             ctx.deformable_groups,
--> 203             ctx.with_bias
    204         )
    205         return output

RuntimeError: Not compiled with GPU support

Thank you very much!

GLIP-L fine tuning/training

Hi,

I've tried to run GLIP-L fine-tuning on the COCO dataset and the estimate is around 44 days on a GeForce RTX 3090 and 8 days on 4×A100 GPUs.

Could you please provide weights for the fine-tuned model and clarify whether these estimates are realistic or whether I am doing something wrong :-)?

RuntimeError: Not compiled with GPU support on Colab

Thank you for sharing this interesting work.
When I try to run the Colab example, the execution of 6th cell of the notebook resulted in the following error.

[[[0, 12]], [[16, 19]], [[23, 32]]]

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

[<ipython-input-6-d454bb231030>](https://localhost:8080/#) in <module>()
      1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
      2 caption = 'bobble heads on top of the shelf'
----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
      4 imshow(result, caption)

17 frames

[/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in run_on_web_image(self, original_image, original_caption, thresh, custom_entity, alpha)
    138             custom_entity = None,
    139             alpha = 0.0):
--> 140         predictions = self.compute_prediction(original_image, original_caption, custom_entity)
    141         top_predictions = self._post_process(predictions, thresh)
    142 

[/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in compute_prediction(self, original_image, original_caption, custom_entity)
    217         # compute predictions
    218         with torch.no_grad():
--> 219             predictions = self.model(image_list, captions=[original_caption], positive_map=positive_map_label_to_token)
    220             predictions = [o.to(self.cpu_device) for o in predictions]
    221         print("inference time per image: {}".format(timeit.time.perf_counter() - tic))

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/content/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py](https://localhost:8080/#) in forward(self, images, targets, captions, positive_map, greenlight_map)
    283         else:
    284             proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
--> 285                                               captions, swint_feature_c4)
    286         if self.roi_heads:
    287             if self.cfg.MODEL.ROI_MASK_HEAD.PREDICTOR.startswith("VL"):

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, images, features, targets, language_dict_features, positive_map, captions, swint_feature_c4)
    921                                                                         language_dict_features,
    922                                                                         embedding,
--> 923                                                                         swint_feature_c4
    924                                                                         )
    925         anchors = self.anchor_generator(images, features)

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, x, language_dict_features, embedding, swint_feature_c4)
    737                        "lang": language_dict_features}
    738 
--> 739         dyhead_tower = self.dyhead_tower(feat_inputs)
    740 
    741         # soft token

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py](https://localhost:8080/#) in forward(self, input)
    137     def forward(self, input):
    138         for module in self:
--> 139             input = module(input)
    140         return input
    141 

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, inputs)
    203                 conv_args = dict(offset=offset, mask=mask)
    204 
--> 205             temp_fea = [self.DyConv[1](feature, **conv_args)]
    206 
    207             if level > 0:

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, input, **kwargs)
    133 
    134     def forward(self, input, **kwargs):
--> 135         x = self.conv(input, **kwargs)
    136         if self.bn:
    137             x = self.bn(x)

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

[/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/autocast_mode.py](https://localhost:8080/#) in decorate_fwd(*args, **kwargs)
    217                     return fwd(*_cast(args, cast_inputs), **_cast(kwargs, cast_inputs))
    218             else:
--> 219                 return fwd(*args, **kwargs)
    220     return decorate_fwd
    221 

[/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(self, input, offset, mask)
    380         return modulated_deform_conv(
    381             input, offset, mask, self.weight, self.bias, self.stride,
--> 382             self.padding, self.dilation, self.groups, self.deformable_groups)
    383 
    384     def __repr__(self):

[/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
    201             ctx.groups,
    202             ctx.deformable_groups,
--> 203             ctx.with_bias
    204         )
    205         return output

RuntimeError: Not compiled with GPU support

Here is the GPU information that I used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any advice is appreciated.

Sincerely,
