
Comments (24)

Solacex avatar Solacex commented on July 24, 2024 1

The model has been trained for 10k iterations and training proceeds smoothly, so I believe this bug has been fixed.

from grit.

JialianW avatar JialianW commented on July 24, 2024 1

The model has been trained for 10k iterations and training proceeds smoothly, so I believe this bug has been fixed.

Thanks for the update. I didn't add the suggested code above when I trained the model, so I'm not sure why this became an issue for you. I would appreciate an update once you complete the training.

from grit.

JialianW avatar JialianW commented on July 24, 2024 1

Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.

from grit.

JialianW avatar JialianW commented on July 24, 2024

Thanks for your interest in GRiT and for re-training it on VG.

Do you know whether this error comes from "proposals[0].proposal_boxes.tensor[0, :]" or "targets[0].gt_boxes.tensor[0, :]"? If it is the former, I haven't encountered a case where there are no proposals; there should always be some. Can you check whether it is because there isn't any ground truth?

It would be great if you could print the values at that line of code to determine whether the issue comes from the proposals or from the ground truth.
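
For reference, a minimal debugging sketch along those lines (the variable names follow the expressions quoted above and are assumed to be lists of detectron2 Instances; the exact call site inside the ROI head may differ):

def debug_empty_instances(proposals, targets):
    # Print how many proposals and ground-truth boxes the first image has,
    # and only index element 0 when it actually exists.
    print("num proposals in image 0:", len(proposals[0]))
    print("num gt boxes in image 0:", len(targets[0].gt_boxes))
    if len(proposals[0]) > 0:
        print("first proposal box:", proposals[0].proposal_boxes.tensor[0, :])
    if len(targets[0].gt_boxes) > 0:
        print("first gt box:", targets[0].gt_boxes.tensor[0, :])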

from grit.

Solacex avatar Solacex commented on July 24, 2024

Hello, I ran into the same issue here. Is there any workaround yet?

from grit.

Solacex avatar Solacex commented on July 24, 2024

Hello, I think the problem is on the proposal side.

As shown in the code, the function "check_if_all_background" is called twice, and the error occurs on the second call. Since "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e., no proposals are generated.

from grit.

JialianW avatar JialianW commented on July 24, 2024

Hello, I think the problem is on the proposal side.

As shown in the code, the function "check_if_all_background" is called twice, and the error occurs on the second call. Since "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e., no proposals are generated.

Do you mean that "check_if_all_background" works fine at the beginning of the "_forward_box" function? Once it enters the ROI head, the number of proposals should not change regardless of which cascade stage it is in.

from grit.

Solacex avatar Solacex commented on July 24, 2024

Hello, I think the problem is on the proposal side.
As shown in the code, the function "check_if_all_background" is called twice, and the error occurs on the second call. Since "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e., no proposals are generated.

Do you mean that "check_if_all_background" works fine at the beginning of the "_forward_box" function? Once it enters the ROI head, the number of proposals should not change regardless of which cascade stage it is in.

Yes, I think the problem arises on the proposal side because the first call seems fine. Or do you mean the problem is caused by incorrect ground truth?

This still looks strange because the ground truth is not modified in this function. Do you have any idea how to solve it? This seems to be a common issue when running the object detection task: #5 (comment)

from grit.

Solacex avatar Solacex commented on July 24, 2024

The error comes from the ground truth being an empty Instances object:
Instances(num_instances=0, image_height=1006, image_width=1024, fields=[gt_boxes: Boxes(tensor([], device='cuda:1', size=(0, 4))), gt_classes: tensor([], device='cuda:1', dtype=torch.int64), gt_masks: PolygonMasks(num_instances=0), gt_object_descriptions: ObjDescription([])])

So could you share the COCO JSON file that you used with us? @JialianW

from grit.

JialianW avatar JialianW commented on July 24, 2024

The error comes from the ground truth being an empty Instances object: Instances(num_instances=0, image_height=1006, image_width=1024, fields=[gt_boxes: Boxes(tensor([], device='cuda:1', size=(0, 4))), gt_classes: tensor([], device='cuda:1', dtype=torch.int64), gt_masks: PolygonMasks(num_instances=0), gt_object_descriptions: ObjDescription([])])

So could you share the COCO JSON file that you used with us? @JialianW

We used the official annotations from the COCO website. Images without ground truth should already be discarded, as shown at

if len(record["annotations"]) == 0:
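
For context, a minimal sketch of that filtering pattern (the record structure is assumed here, not copied from the repository):

def drop_images_without_gt(dataset_dicts):
    # Hypothetical helper: keep only detectron2-style record dicts that carry
    # at least one annotation, mirroring the check quoted above.
    kept = []
    for record in dataset_dicts:
        if len(record["annotations"]) == 0:
            continue  # skip images with no ground truth
        kept.append(record)
    return kept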

Can you post your config file?

from grit.

Solacex avatar Solacex commented on July 24, 2024

I also use the official JSON files from COCO and run this code without any modifications.
This error looks really weird, given that null instances are already excluded as you pointed out.

from grit.

JialianW avatar JialianW commented on July 24, 2024

The reason the first call of "check_if_all_background" doesn't raise an error may be that it never entered "if all_background:". The ground truth is probably empty from the beginning. In that case, the ground truth may have been removed while the images were being augmented, so that a background-only crop was fed into the model. Did you use our provided config file without any changes?
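
To illustrate that point (a toy sketch, not GRiT's actual augmentation pipeline): a random crop can leave zero ground-truth boxes when no box overlaps the crop region.

import numpy as np

def boxes_inside_crop(boxes, crop):
    # Clip each (x1, y1, x2, y2) box to the crop window and keep only boxes
    # that still have positive area afterwards.
    cx1, cy1, cx2, cy2 = crop
    clipped = boxes.astype(float)
    clipped[:, [0, 2]] = clipped[:, [0, 2]].clip(cx1, cx2)
    clipped[:, [1, 3]] = clipped[:, [1, 3]].clip(cy1, cy2)
    areas = (clipped[:, 2] - clipped[:, 0]) * (clipped[:, 3] - clipped[:, 1])
    return boxes[areas > 0]

boxes = np.array([[10, 10, 50, 50]])                        # one ground-truth box
print(len(boxes_inside_crop(boxes, (0, 0, 512, 512))))      # 1: the box survives the crop
print(len(boxes_inside_crop(boxes, (200, 200, 712, 712))))  # 0: background-only crop, empty ground truth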

from grit.

Solacex avatar Solacex commented on July 24, 2024

Yes, without any changes. And this error also shows up when other people run it.

from grit.

JialianW avatar JialianW commented on July 24, 2024

Can you make a change here to make sure the input image actually has ground truth:

dataset_dict_out = self.prepare_data(dataset_dict)

Can you add some code after that line, like:
while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0:
    dataset_dict_out = self.prepare_data(dataset_dict)

This is to ensure that "self.prepare_data" does not produce empty ground truth when preparing the data.
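
If it helps, here is an expanded sketch of that change with a retry cap (a hypothetical wrapper, not the repository's exact code; "prepare_data" and the "instances" field are taken from the snippet above):

def prepare_data_with_gt(prepare_data, dataset_dict, max_retries=50):
    # Re-run the mapper's prepare_data until the augmented sample keeps at
    # least one ground-truth box; the retry cap avoids an infinite loop on
    # images whose boxes are always cropped away.
    dataset_dict_out = prepare_data(dataset_dict)
    retries = 0
    while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0 and retries < max_retries:
        dataset_dict_out = prepare_data(dataset_dict)
        retries += 1
    return dataset_dict_out

# assumed call site inside the mapper:
# dataset_dict_out = prepare_data_with_gt(self.prepare_data, dataset_dict)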

from grit.

Solacex avatar Solacex commented on July 24, 2024

Okay, I will try it as you suggested.

from grit.

Solacex avatar Solacex commented on July 24, 2024

It looks fine so far. I will let you know later whether it is fixed.

from grit.

Solacex avatar Solacex commented on July 24, 2024

At around 22k iterations, an OOM (out of memory) error occurs. GPU memory usage increases as training progresses, as shown below:

[03/16 10:20:04 d2.utils.events]: eta: 1 day, 23:52:09 iter: 80 total_loss: 7.998 loss_box_reg_stage0: 0.06707 loss_box_reg_stage1: 0.06987 loss_box_reg_stage2: 0.02389 loss_centernet_agn_neg: 0.05723 loss_centernet_agn_pos: 0.3456 loss_centernet_loc: 0.7329 loss_cls_stage0: 0.2401 loss_cls_stage1: 0.2133 loss_cls_stage2: 0.1666 loss_mask: 0.6922 text_decoder_loss: 5.463 time: 0.9160 last_time: 1.0366 data_time: 0.0180 last_data_time: 0.0144 lr: 4.878e-08 max_mem: 4452M
[03/16 10:20:24 d2.utils.events]: eta: 2 days, 0:05:33 iter: 100 total_loss: 6.563 loss_box_reg_stage0: 0.09858 loss_box_reg_stage1: 0.1065 loss_box_reg_stage2: 0.03283 loss_centernet_agn_neg: 0.03842 loss_centernet_agn_pos: 0.3317 loss_centernet_loc: 0.7042 loss_cls_stage0: 0.1995 loss_cls_stage1: 0.1513 loss_cls_stage2: 0.1101 loss_mask: 0.6907 text_decoder_loss: 4.111 time: 0.9317 last_time: 1.0245 data_time: 0.0176 last_data_time: 0.0070 lr: 6.4266e-08 max_mem: 4476M

[03/16 13:20:32 d2.utils.events]: eta: 2 days, 1:20:37 iter: 10460 total_loss: 2.685 loss_box_reg_stage0: 0.1964 loss_box_reg_stage1: 0.2311 loss_box_reg_stage2: 0.1316 loss_centernet_agn_neg: 0.04058 loss_centernet_agn_pos: 0.2017 loss_centernet_loc: 0.4001 loss_cls_stage0: 0.179 loss_cls_stage1: 0.159 loss_cls_stage2: 0.113 loss_mask: 0.4384 text_decoder_loss: 0.6743 time: 1.0258 last_time: 1.0547 data_time: 0.0179 last_data_time: 0.0608 lr: 7.687e-07 max_mem: 21002M
[03/16 13:20:53 d2.utils.events]: eta: 2 days, 1:20:39 iter: 10480 total_loss: 2.784 loss_box_reg_stage0: 0.2283 loss_box_reg_stage1: 0.2343 loss_box_reg_stage2: 0.1302 loss_centernet_agn_neg: 0.04495 loss_centernet_agn_pos: 0.2133 loss_centernet_loc: 0.3937 loss_cls_stage0: 0.1948 loss_cls_stage1: 0.1652 loss_cls_stage2: 0.1137 loss_mask: 0.4339 text_decoder_loss: 0.6384 time: 1.0259 last_time: 1.0150 data_time: 0.0162 last_data_time: 0.0061 lr: 7.6868e-07 max_mem: 21002M

[03/16 16:58:19 d2.utils.events]: eta: 1 day, 23:35:12 iter: 22300 total_loss: 2.63 loss_box_reg_stage0: 0.2285 loss_box_reg_stage1: 0.2795 loss_box_reg_stage2: 0.1659 loss_centernet_agn_neg: 0.04165 loss_centernet_agn_pos: 0.1809 loss_centernet_loc: 0.3561 loss_cls_stage0: 0.1952 loss_cls_stage1: 0.1732 loss_cls_stage2: 0.1347 loss_mask: 0.395 text_decoder_loss: 0.4412 time: 1.0581 last_time: 1.2151 data_time: 0.0211 last_data_time: 0.0035 lr: 7.4622e-07 max_mem: 37269M
[03/16 16:58:41 d2.utils.events]: eta: 1 day, 23:34:02 iter: 22320 total_loss: 2.535 loss_box_reg_stage0: 0.2358 loss_box_reg_stage1: 0.2703 loss_box_reg_stage2: 0.1736 loss_centernet_agn_neg: 0.044 loss_centernet_agn_pos: 0.1872 loss_centernet_loc: 0.3547 loss_cls_stage0: 0.1911 loss_cls_stage1: 0.1689 loss_cls_stage2: 0.131 loss_mask: 0.3955 text_decoder_loss: 0.4017 time: 1.0581 last_time: 1.2096 data_time: 0.0244 last_data_time: 0.0601 lr: 7.4617e-07 max_mem: 37269M

My experiments are run on 8xA100 GPUs.

How many GPUs do you use for training? Or have you encountered this before?
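
In case it helps track the growth between logging intervals, a small sketch for logging allocator stats inside the training loop (assumes a CUDA build of PyTorch; this is not part of GRiT's trainer):

import torch

def log_gpu_memory(iteration):
    # Peak allocation since the last reset, plus currently reserved memory;
    # resetting the peak each call makes a slow leak easier to spot than the
    # cumulative max_mem shown in the log above.
    peak_mb = torch.cuda.max_memory_allocated() / 1024 / 1024
    reserved_mb = torch.cuda.memory_reserved() / 1024 / 1024
    print(f"iter {iteration}: peak_alloc {peak_mb:.0f}M, reserved {reserved_mb:.0f}M")
    torch.cuda.reset_peak_memory_stats()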

from grit.

Solacex avatar Solacex commented on July 24, 2024

The above bug seems to be fixed. The results at the 20000th iteration are:
copypaste: AP,AP50,AP75,APs,APm,APl
[03/17 09:13:19 d2.evaluation.testing]: 11.6693,20.4049,11.5888,4.6322,12.3034,17.4644

These results were obtained with 8 x A100 cards. Could you share your results for the same checkpoint, so that I can verify the bug is fixed?

Besides, the training breaks at the 29980th iteration with the following error:

Traceback (most recent call last):
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/xxx//guangrui/gDeco/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/xxx//guangrui/gDeco/train_deepspeed.py", line 252, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/xxx//guangrui/gDeco/train_deepspeed.py", line 209, in do_train
    periodic_checkpointer.step(iteration)
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 416, in step
    self.checkpointer.save(
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 106, in save
    data[key] = obj.state_dict()
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in state_dict
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
  File "/xxx/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in <dictcomp>
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
KeyError: 139902493578720

Have you encountered this before?

from grit.

JialianW avatar JialianW commented on July 24, 2024

I haven't encountered this error before. It looks like the error comes from saving the checkpoint. Was your previous checkpoint saved successfully?

from grit.

Solacex avatar Solacex commented on July 24, 2024

Yes, it saved successfully at both the 10000th and 20000th iterations, so this looks really weird.

I could only find a similar issue here: pytorch/pytorch#42428
It seems to be an issue with the PyTorch version, so is the torch version you used < 1.6.0?

from grit.

JialianW avatar JialianW commented on July 24, 2024

Yes, it saved successfully at both the 10000th and 20000th iterations, so this looks really weird.

I could only find a similar issue here: pytorch/pytorch#42428 It seems to be an issue with the PyTorch version, so is the torch version you used < 1.6.0?

Please refer to the Installation instructions for our PyTorch version.

from grit.

Wykay avatar Wykay commented on July 24, 2024

I have successfully trained the model for the description task on Visual Genome.
My environment setup procedure follows INSTALL.md.

from grit.

hellowordo avatar hellowordo commented on July 24, 2024

@Evenyyy Hello, I'm sorry to bother you. Could you please share more details about evaluating the vg_instances_results.json file, or share your code? Thank you very much!

from grit.

yubo97 avatar yubo97 commented on July 24, 2024

The model has been trained for 10k iterations and training proceeds smoothly, so I believe this bug has been fixed.

Thanks for the update. I didn't add the suggested code above when I trained the model, so I'm not sure why this became an issue for you. I would appreciate an update once you complete the training.

Can you make a change here to make sure the input image actually has ground truth:

dataset_dict_out = self.prepare_data(dataset_dict)

Can you add some code after that line, like:
while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0:
    dataset_dict_out = self.prepare_data(dataset_dict)

This is to ensure that "self.prepare_data" does not produce empty ground truth when preparing the data.

Thank you for your suggestion. I also encountered this issue. This problem has now been solved.

from grit.
