GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)

License: MIT License

GRiT: A Generative Region-to-text Transformer for Object Understanding

GRiT is a general and open-set object understanding framework that localizes objects and describes them with any style of free-form texts it was trained with, e.g., class names, descriptive sentences (including object attributes, actions, counts and many more).

GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
¹State University of New York at Buffalo, ²Microsoft
arXiv technical report (PDF)

Installation

Please follow Installation instructions.

ChatGPT with GRiT

We give ChatGPT GRiT's dense captioning outputs (object location and description) to have it describe the scene and even write poetry. ChatGPT can generate amazing scene descriptions given our dense captioning outputs. An example is shown below: 🤩🤩🤩

Object Understanding Demo - One Model Two tasks

Download the GRiT model or use the following command to download it:

mkdir models && cd models
wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth && cd ..

The downloaded GRiT model was jointly trained on the dense captioning and object detection tasks. The same trained model can output either rich descriptive sentences or short class names, depending on the --test-task flag. Try it as follows! 🤩

Output for Dense Captioning (rich descriptive sentences)

python demo.py --test-task DenseCap --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml  --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth

Output for Object Detection (short class names)

python demo.py --test-task ObjectDet --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml  --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
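
The two commands above differ only in the value of --test-task. As a minimal sketch, the helper below builds the same argument list for either task. The function name and the `output_dir` parameter are illustrative additions, not part of the repository; every flag and path is copied from the commands above.

```python
def demo_command(test_task, output_dir="visualization"):
    """Build the demo.py argument list for one task.

    Only the two task names shown in the commands above are accepted.
    `output_dir` is a convenience parameter (an assumption of this sketch)
    so that the two runs can write to separate folders.
    """
    if test_task not in ("DenseCap", "ObjectDet"):
        raise ValueError(f"unknown test task: {test_task!r}")
    return [
        "python", "demo.py",
        "--test-task", test_task,
        "--config-file", "configs/GRiT_B_DenseCap_ObjectDet.yaml",
        "--input", "demo_images",
        "--output", output_dir,
        "--opts", "MODEL.WEIGHTS", "models/grit_b_densecap_objectdet.pth",
    ]


# Example (run from the repository root):
#   import subprocess
#   for task in ("DenseCap", "ObjectDet"):
#       subprocess.run(demo_command(task, f"visualization_{task}"), check=True)
```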

Output images will be saved under the visualization folder.

You can also try the Colab demo provided by the TWC team:

Benchmark Inference and Evaluation

Please follow dataset preparation instructions to download datasets.

Download our trained models and put them in models/ for evaluation.

Object Detection on COCO 2017 Dataset

Model	val AP	test-dev AP	Download
GRiT (ViT-B)	53.7	53.8	model
GRiT (ViT-L)	56.4	56.6	model
GRiT (ViT-H)	60.4	60.4	model

To evaluate the trained GRiT on COCO 2017 val, run:

# GRiT (ViT-B)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet --eval-only MODEL.WEIGHTS models/grit_b_objectdet.pth
# GRiT (ViT-L)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_L_ObjectDet.yaml --output-dir-name ./output/grit_l_objectdet --eval-only MODEL.WEIGHTS models/grit_l_objectdet.pth
# GRiT (ViT-H)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_H_ObjectDet.yaml --output-dir-name ./output/grit_h_objectdet --eval-only MODEL.WEIGHTS models/grit_h_objectdet.pth

Dense Captioning on VG Dataset

Model	mAP	Download
GRiT (ViT-B)	15.5	model

To test on VG test set, run:

python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth

Inference results are saved to output/grit_b_densecap/vg_instances_results.json. We report results with the official VG dense captioning evaluation codebase, which we did not integrate into this project because it is written in Lua. To evaluate on VG, please follow the original codebase's instructions and run the test with it. If you run into problems using their code, we are happy to discuss them in our issue section.
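
Before handing the results file to another tool, it can help to inspect it in Python. The sketch below groups region records by image and sorts them by score. It assumes the file is a JSON list of dicts with the keys `image_id`, `bbox`, and `score`, plus a text field called `object_descriptions`; these key names are assumptions, so check one record of your own file first and adjust `text_key` if needed.

```python
import json
from collections import defaultdict


def load_results(path):
    """Load a results file that is assumed to contain a JSON list of dicts."""
    with open(path) as f:
        return json.load(f)


def group_regions_by_image(records, text_key="object_descriptions"):
    """Group region records by image id, highest score first.

    The keys "image_id", "bbox", "score" and `text_key` are assumed names;
    a record without `text_key` gets an empty description.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["image_id"]].append(
            (rec["score"], rec["bbox"], rec.get(text_key, ""))
        )
    for regions in grouped.values():
        regions.sort(key=lambda region: region[0], reverse=True)
    return dict(grouped)


# Example:
#   records = load_results("output/grit_b_densecap/vg_instances_results.json")
#   for image_id, regions in group_regions_by_image(records).items():
#       for score, bbox, text in regions:
#           print(image_id, round(score, 3), bbox, text)
```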

Training

To reduce training memory, we train with DeepSpeed, which works well with activation checkpointing in distributed training.

To train on a single machine node, run:

python train_deepspeed.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet

To train on multiple machine nodes, run:

python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet

Acknowledgement

Our code is based in part on Detic, CenterNet2, detectron2, GIT, and transformers. We thank the authors and appreciate their great work!

Citation

If you find our work interesting and would like to cite it, please use the following BibTeX entry.

@article{wu2022grit,
  title={GRiT: A Generative Region-to-text Transformer for Object Understanding},
  author={Wu, Jialian and Wang, Jianfeng and Yang, Zhengyuan and Gan, Zhe and Liu, Zicheng and Yuan, Junsong and Wang, Lijuan},
  journal={arXiv preprint arXiv:2212.00280},
  year={2022}
}

grit's Issues

KeyError: 'object_description'

I got this error:

  File "/public/Medical_image_segmentation/lixi/detectron2/detectron2/data/common.py", line 90, in __getitem__
    data = self._map_func(self._dataset[cur_idx])
  File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 53, in __call__
    dataset_dict_out = self.prepare_data(dataset_dict)
  File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 99, in prepare_data
    object_descriptions = [an['object_description'] for an in dataset_dict["annotations"]]
  File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 99, in <listcomp>
    object_descriptions = [an['object_description'] for an in dataset_dict["annotations"]]
KeyError: 'object_description'

What is going wrong?
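
The traceback shows the dataset mapper reading `an['object_description']` from every entry of `dataset_dict["annotations"]`, so the error means at least one annotation lacks that key. A minimal sketch for locating such annotations before training, assuming each dataset record is a plain dict with an "annotations" list as the traceback implies (the function name is hypothetical):

```python
def find_missing_descriptions(dataset_dicts, key="object_description"):
    """Return (record_index, annotation_index) pairs whose annotation lacks `key`.

    `dataset_dicts` is a list of dicts, each with an optional "annotations"
    list of dicts. This layout is inferred from the traceback, not taken
    from the repository's documentation.
    """
    missing = []
    for i, record in enumerate(dataset_dicts):
        for j, ann in enumerate(record.get("annotations", [])):
            if key not in ann:
                missing.append((i, j))
    return missing
```

An empty result means every annotation carries the key; otherwise the returned pairs point at the entries to fix in the annotation file.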

Willing to share the original annotations of the Visual Genome dataset?

Sorry to bother you, but the official website seems to be down, and I need the official dense captioning annotations. I'm hoping you can share them; it would be very useful. Thanks a lot!

Larger ViT backbone for dense captioning

Thank you for the nice work!

Is it possible to use larger ViT backbone for dense captioning?
Is there a reason that there is only ViT-B backbone for dense captioning?

Thank you.

Generate Caption on my own boxes

Hello. Could you please tell me how to use 'demo.py' to generate captions for boxes that I provide? I don't need it to do object detection.

Installation instructions seem out of date

The installation instructions say to clone the detectron2 repo and install from the clone:

git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
git checkout cc87e7ec
pip install -e .

However, detectron2 is already cloned inside the GRiT repo as third_party/CenterNet2, and pip install -e . should be run from there.

Perhaps the instructions were written before the library was included.

Maybe the install instructions should be something like this:

conda create --name grit python=3.8 -y
conda activate grit
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

cd ..
git clone https://github.com/JialianW/GRiT.git
cd GRiT
pip install -r requirements.txt

cd third_party/CenterNet2
git checkout cc87e7ec
pip install -e .

If PR #16 is merged, installation will be automated and this may become obsolete.

Output the result in the format of text

I want to get output like the image in the repo, but as text, so that I can use it in my next step. How can I achieve this?

Bug of corner case of proposals

Hi,
Thanks for your amazing work. I am trying to retrain the model on VG, but there seems to be a corner case that raises an error:

[01/16 12:04:41 d2.utils.events]:  eta: 1 day, 11:49:23  iter: 1360  total_loss: 2.975  loss_box_reg_stage0: 0.2477  loss_box_reg_stage1: 0.3255  loss_box_reg_stage2: 0.2068  loss_centernet_agn_neg: 0.0414  loss_centernet_agn_pos: 0.1851  loss_centernet_loc: 0.3947  loss_cls_stage0: 0.2062  loss_cls_stage1: 0.1867  loss_cls_stage2: 0.1439  loss_mask: 0.3913  text_decoder_loss: 0.6096  time: 0.7084  data_time: 0.0160  lr: 7.7501e-07  max_mem: 21398M
[01/16 12:04:42] grit.modeling.roi_heads.grit_roi_heads INFO: all proposals are background at stage 2
Traceback (most recent call last):
  File "train_deepspeed.py", line 263, in <module>
    launch_deepspeed(
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 67, in launch_deepspeed
    mp.spawn(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 251, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 175, in do_train
    loss_dict = model(data)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1656, in forward
    loss = self.module(*inputs, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/meta_arch/grit.py", line 59, in forward
    proposals, roihead_textdecoder_losses = self.roi_heads(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 302, in forward
    losses = self._forward_box(features, proposals, targets, task=targets_task)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 173, in _forward_box
    proposals = self.check_if_all_background(proposals, targets, k)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 142, in check_if_all_background
    proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
IndexError: index 0 is out of bounds for dimension 0 with size 0

The error seems to indicate that there are no proposals for this batch. It can be easily reproduced with single-node training at around iteration 1360.

Would you mind checking it? I'm not familiar enough with this repo.
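
The failing line indexes element 0 of a tensor whose first dimension has size 0, so either the proposal boxes or the ground-truth boxes of the first image are empty at that step. The sketch below illustrates the kind of guard that avoids the IndexError. Plain Python lists stand in for the tensors, and the function name is hypothetical; this is an illustration of the check, not the repository's fix.

```python
def inject_first_gt_box(proposal_boxes, gt_boxes):
    """Overwrite the first proposal with the first ground-truth box if both exist.

    `proposal_boxes` and `gt_boxes` are lists of [x1, y1, x2, y2] lists that
    stand in for the tensors in the traceback. Returns True if a box was
    written and False if either list was empty.
    """
    if not proposal_boxes or not gt_boxes:
        # Nothing to index: the unguarded assignment fails at this point.
        return False
    proposal_boxes[0] = list(gt_boxes[0])
    return True
```

Whether skipping the replacement is the right training behaviour for an image with no proposals or no ground truth is a separate question for the maintainers.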

third_party project gitmodules

Under third_party/CenterNet2 there is detectron2 and yet another project/CenterNet2.
The installation, however, requires installing detectron2 separately.

It is a little confusing what is needed and what is not in the git submodule.

Support for Batch-Inference

Hi,

I need to perform batch inference for my use case. I followed this thread here that extends the DefaultPredictor class to enable batched inputs, but I end up with this error:

grit/modeling/roi_heads/grit_roi_heads.py:230, in GRiTROIHeadsAndTextDecoder._forward_box(self, features, proposals, targets, task)
    227 predictor, predictions, proposals = head_outputs[-1]
    228 boxes = predictor.predict_boxes(
    229     (predictions[0], predictions[1]), proposals)
--> 230 assert len(boxes) == 1
    231 pred_instances, _ = self.fast_rcnn_inference_GRiT(
    232     boxes,
    233     scores,
   (...)
    239     self.soft_nms_enabled,
    240 )
    242 assert len(pred_instances) == 1, "Only support one image"

AssertionError:

Commenting out the assertion doesn't help either.
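
The assertion message "Only support one image" suggests that this code path handles a single image per forward pass. Until batched inference is supported, one fallback is to call a single-image predictor once per image. A minimal sketch, where `predictor` is any callable that takes one image and returns its output (the helper name is hypothetical):

```python
def predict_each(predictor, images):
    """Run a single-image predictor over several images, one call per image.

    `predictor` is any callable taking one image and returning its output,
    for example a detectron2 DefaultPredictor instance. Outputs are returned
    in the same order as the inputs.
    """
    outputs = []
    for image in images:
        outputs.append(predictor(image))
    return outputs
```

This keeps the model's one-image assumption intact at the cost of throughput.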

Can you provide the performance based on the GT boxes?

I notice that you present the results in Table 1 of the paper, but these results are based on the bounding boxes predicted by your foreground object extractor, which might lead to error propagation.

As a result, the real performance of the text decoder is unclear and possibly underestimated. Can you provide the performance based on the GT boxes?

No module named 'detectron2'

I get this error despite having cloned the detectron2 repository:

Traceback (most recent call last):
  File "D:\Python\VisionGRIT\GRiT\demo.py", line 9, in <module>
    from detectron2.config import get_cfg
ModuleNotFoundError: No module named 'detectron2'

Poor result in Densecap Evaluation

Hello,

I am trying to evaluate the results produced by the provided dense captioning checkpoint on VG. After replacing logprobs (confidence/score), box, and captions in addResult(), as well as idx_to_token and vocab_size in model, in densecap/eval_utils.lua, I got an mAP of 0.000609. The number of 'ok=1' entries is very small, meaning few ground-truth boxes are assigned to predictions, so it seems I have done something wrong.

I combined GRiT's boxes, descriptions, and scores for each image and fed them into addResult() per image in densecap, but I got a relatively low mAP, and the IoU between the ground-truth and predicted boxes was very small. Could you please tell me what I am doing wrong? Thank you!

Here is the process of replacement:

while true do
    ------- single image ------
    counter = counter + 1

    -- Grab a batch of data and convert it to the right dtype (batch_size = 1)
    local loader_kwargs = {split=split, iterate=true}
    local img, gt_boxes, gt_labels, info, _ = loader:getBatch(loader_kwargs)
    info = info[1]

    -- Find the index of the corresponding predictions; image_id, box, score
    -- and descriptions share the same index for a given image
    for index, v in ipairs(my_results.image_id) do
        if tostring(v) == string.gsub(info.filename, '.jpg', '') then
            index_ = index
            print(index_)
        end
    end

    assert(string.gsub(info.filename, '.jpg', '') == tostring(my_results.image_id[index_]) )

    -- Replace these with the predictions for the corresponding image from GRiT
    local boxes, logprobs, captions = my_results.box[index_], my_results.score[index_], my_results.descriptions[index_]
    local boxes, logprobs = torch.Tensor(boxes), torch.Tensor(logprobs)
    local gt_captions = model.nets.language_model:decodeSequence(gt_labels[1])   -- seq: tensor of shape N x T; id_to_tokens; bs = 1

    evaluator:addResult(logprobs, boxes, captions, gt_boxes[1], gt_captions)
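
A very small IoU between predicted and ground-truth boxes is often a sign of a box-format or image-scale mismatch rather than of bad predictions. The thread does not state which format or scale addResult() expects, so the sketch below only provides small helpers for checking candidates: conversions between three common box formats, a uniform rescale, and an IoU function. All function names are illustrative.

```python
def xywh_to_xyxy(box):
    """[x, y, width, height] (top-left corner) -> [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_xcycwh(box):
    """[x1, y1, x2, y2] -> [x_center, y_center, width, height]."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1]


def scale_box(box, factor):
    """Scale all four coordinates, e.g. when the evaluator uses resized images."""
    return [v * factor for v in box]


def iou_xyxy(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A quick check is to take one image, convert a few predicted boxes under each candidate format and scale, and see which combination gives a plausible IoU against the ground-truth boxes for that image.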

Question about the training time

Hey,

Thanks for your inspiring work.
How long will the training take with 8 x A100 GPUs?

Question about training on custom data

Hi.

I have a question about how to fine-tune the model on a custom dataset.
How do we prepare annotations for a custom dataset?
If you know of a flowchart or blog post that describes this, it would be helpful.

Batch size Configuration

Hello, Jialian.

I am currently training the model on four 3090 GPUs and I find that the batch size is small.

I changed SOLVER: IMS_PER_BATCH from 64 to 128 in config/base.yaml, but the memory consumption does not seem to increase.

Could you please tell me how I can increase it?

Thanks a lot.

eval code

Hello, could you please provide a tutorial on evaluating the model and explain which evaluation metrics are available?

Questions about multi-node deepspeed launcher

Hi, @JialianW! Thanks for your wonderful work!
I tried to run GRiT on 4 nodes and 32 GPUs with the following command:

python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet

However, I notice that only one GPU is used on each node.
In the implementation, there is no mp.spawn in the multi-node DeepSpeed launcher. Is this the reason, and is there any plan to fix it?

Dense Captioning Evaluation on VG Dataset

Hello,

I am currently trying to reproduce GRiT's dense captioning results. I trained the model with the default settings and obtained a checkpoint. Then I ran inference on the VG test set and got the JSON results with:

python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth

However, when setting up the DenseCap environment, I got stuck installing torch on my GPU machine, which has CUDA 12.0. I always hit this error:

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cublas_device_LIBRARY (ADVANCED)
linked by target "THC" in directory /root/torch/extra/cutorch/lib/THC

Could you tell me what platform you use to install DenseCap and perform evaluation?

Thanks a lot!