Comments (24)
The model has been trained for 10k iterations and is progressing smoothly, so I believe this bug has been fixed.
Thanks for the update. I didn't add the suggested code above when I trained the model, so I'm not sure why this became an issue for you. I would appreciate an update when you complete the training.
Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.
Thanks for your interest in GRiT and for re-training it on VG.
Do you know whether this error comes from "proposals[0].proposal_boxes.tensor[0, :]" or "targets[0].gt_boxes.tensor[0, :]"? If it is from the former, I haven't encountered a case where there are no proposals; I think there should always be some proposals. Can you check whether it is because there isn't any ground truth?
It would be great if you could print out this line of code to determine whether the issue comes from the proposals or the ground truth.
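A minimal debugging sketch along those lines (assuming detectron2's Instances/Boxes objects, as in the error above, and not part of the GRiT code itself) could be:
# Hypothetical debug print: check both sides before indexing.
# len() on a detectron2 Boxes object gives the number of boxes it holds.
print("num proposals:", len(proposals[0].proposal_boxes))
print("num gt boxes:", len(targets[0].gt_boxes))
# Whichever count is 0 is the side that makes tensor[0, :] raise an IndexError.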
Hello, I'm running into the same issue. Is there any workaround yet?
Hello, I think the problem is on the proposal side.
As shown in the code, the function "check_if_all_background" is used twice and the error occurs the second time. Because "targets" doesn't change and the first call works fine, I think the issue arises on the proposal side, where no proposals are generated.
Do you mean that at the beginning of the "_forward_box" function, "check_if_all_background" works fine? Once it enters the ROI head, the number of proposals shouldn't change regardless of which cascade stage it is at.
Yes, I think the problem arises on the proposal side because the first call seems fine. Do you mean the problem is caused by incorrect ground truth?
This still looks strange because the ground truth is not modified in this function. Do you have any idea how to solve this? It seems to be a common issue when running the object detection task: #5 (comment)
The error comes from the ground truth being an empty Instances:
Instances(num_instances=0, image_height=1006, image_width=1024, fields=[gt_boxes: Boxes(tensor([], device='cuda:1', size=(0, 4))), gt_classes: tensor([], device='cuda:1', dtype=torch.int64), gt_masks: PolygonMasks(num_instances=0), gt_object_descriptions: ObjDescription([])])
So can you share the COCO JSON that you used with us? @JialianW
We used the official annotations from the COCO website. Images without ground truth should already be discarded, as shown at
GRiT/grit/data/datasets/grit_coco.py
Line 93 in 39b33db
Can you post your config file?
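For context, the filtering referenced at that line is roughly of the following shape (a hedged sketch only, not the actual grit_coco.py code; the field names follow detectron2's dataset-dict convention):
# Illustrative sketch: drop records that carry no annotations when building
# the dataset dicts, so images without ground truth never reach training.
def filter_empty_records(dataset_dicts):
    kept = [d for d in dataset_dicts if len(d.get("annotations", [])) > 0]
    print(f"kept {len(kept)} of {len(dataset_dicts)} images with at least one annotation")
    return kept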
I also use the official JSON files from COCO and run this code without any modifications.
This error looks very weird given that empty instances are excluded, as you pointed out.
The reason the first call of "check_if_all_background" doesn't raise an error may be that it didn't enter "if all_background:". Probably the ground truth is empty from the beginning. In that case, maybe the ground truth was removed while the images were being augmented, so a background-only crop was fed into the model. Did you use our provided config file without any changes?
Yes, without any changes. And this error also shows up when other people run it.
Can you make a change here to make sure the input image does have ground truth:
GRiT/grit/data/custom_dataset_mapper.py
Line 53 in 62ee07f
Can you add some code after that line, like:
while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0:
    dataset_dict_out = self.prepare_data(dataset_dict)
This is to ensure "self.prepare_data" does not produce empty ground truth when preparing the data.
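For illustration, the suggested retry would sit roughly like this inside the mapper (a sketch assuming the surrounding method is the mapper's __call__ and that prepare_data re-runs the random augmentation; names are taken from the thread, not verified against the repo):
# Sketch of where the suggested loop goes inside the dataset mapper.
def __call__(self, dataset_dict):
    dataset_dict_out = self.prepare_data(dataset_dict)
    # Re-sample the augmentation until at least one ground-truth box survives,
    # so an all-background crop is never fed to the model.
    while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0:
        dataset_dict_out = self.prepare_data(dataset_dict)
    return dataset_dict_out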
Okay, I will try it as you suggested.
It looks fine so far. I will let you know later whether it is fixed.
When it gets to around 22k iters, an OOM (out of memory) error occurs. The GPU memory occupation increases as training progresses, as shown below:
[03/16 10:20:04 d2.utils.events]: eta: 1 day, 23:52:09 iter: 80 total_loss: 7.998 loss_box_reg_stage0: 0.06707 loss_box_reg_stage1: 0.06987 loss_box_reg_stage2: 0.02389 loss_centernet_agn_neg: 0.05723 loss_centernet_agn_pos: 0.3456 loss_centernet_loc: 0.7329 loss_cls_stage0: 0.2401 loss_cls_stage1: 0.2133 loss_cls_stage2: 0.1666 loss_mask: 0.6922 text_decoder_loss: 5.463 time: 0.9160 last_time: 1.0366 data_time: 0.0180 last_data_time: 0.0144 lr: 4.878e-08 max_mem: 4452M
[03/16 10:20:24 d2.utils.events]: eta: 2 days, 0:05:33 iter: 100 total_loss: 6.563 loss_box_reg_stage0: 0.09858 loss_box_reg_stage1: 0.1065 loss_box_reg_stage2: 0.03283 loss_centernet_agn_neg: 0.03842 loss_centernet_agn_pos: 0.3317 loss_centernet_loc: 0.7042 loss_cls_stage0: 0.1995 loss_cls_stage1: 0.1513 loss_cls_stage2: 0.1101 loss_mask: 0.6907 text_decoder_loss: 4.111 time: 0.9317 last_time: 1.0245 data_time: 0.0176 last_data_time: 0.0070 lr: 6.4266e-08 max_mem: 4476M
[03/16 13:20:32 d2.utils.events]: eta: 2 days, 1:20:37 iter: 10460 total_loss: 2.685 loss_box_reg_stage0: 0.1964 loss_box_reg_stage1: 0.2311 loss_box_reg_stage2: 0.1316 loss_centernet_agn_neg: 0.04058 loss_centernet_agn_pos: 0.2017 loss_centernet_loc: 0.4001 loss_cls_stage0: 0.179 loss_cls_stage1: 0.159 loss_cls_stage2: 0.113 loss_mask: 0.4384 text_decoder_loss: 0.6743 time: 1.0258 last_time: 1.0547 data_time: 0.0179 last_data_time: 0.0608 lr: 7.687e-07 max_mem: 21002M
[03/16 13:20:53 d2.utils.events]: eta: 2 days, 1:20:39 iter: 10480 total_loss: 2.784 loss_box_reg_stage0: 0.2283 loss_box_reg_stage1: 0.2343 loss_box_reg_stage2: 0.1302 loss_centernet_agn_neg: 0.04495 loss_centernet_agn_pos: 0.2133 loss_centernet_loc: 0.3937 loss_cls_stage0: 0.1948 loss_cls_stage1: 0.1652 loss_cls_stage2: 0.1137 loss_mask: 0.4339 text_decoder_loss: 0.6384 time: 1.0259 last_time: 1.0150 data_time: 0.0162 last_data_time: 0.0061 lr: 7.6868e-07 max_mem: 21002M
[03/16 16:58:19 d2.utils.events]: eta: 1 day, 23:35:12 iter: 22300 total_loss: 2.63 loss_box_reg_stage0: 0.2285 loss_box_reg_stage1: 0.2795 loss_box_reg_stage2: 0.1659 loss_centernet_agn_neg: 0.04165 loss_centernet_agn_pos: 0.1809 loss_centernet_loc: 0.3561 loss_cls_stage0: 0.1952 loss_cls_stage1: 0.1732 loss_cls_stage2: 0.1347 loss_mask: 0.395 text_decoder_loss: 0.4412 time: 1.0581 last_time: 1.2151 data_time: 0.0211 last_data_time: 0.0035 lr: 7.4622e-07 max_mem: 37269M
[03/16 16:58:41 d2.utils.events]: eta: 1 day, 23:34:02 iter: 22320 total_loss: 2.535 loss_box_reg_stage0: 0.2358 loss_box_reg_stage1: 0.2703 loss_box_reg_stage2: 0.1736 loss_centernet_agn_neg: 0.044 loss_centernet_agn_pos: 0.1872 loss_centernet_loc: 0.3547 loss_cls_stage0: 0.1911 loss_cls_stage1: 0.1689 loss_cls_stage2: 0.131 loss_mask: 0.3955 text_decoder_loss: 0.4017 time: 1.0581 last_time: 1.2096 data_time: 0.0244 last_data_time: 0.0601 lr: 7.4617e-07 max_mem: 37269M
My experiments are run on 8xA100 GPUs.
How many GPUs do you use for training? Or have you encountered this before?
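In case it helps localize the growth, a small memory probe (plain PyTorch calls, not something from the GRiT codebase) can be logged every few hundred iterations; a steadily rising "allocated" value usually points at tensors kept alive across iterations rather than a single oversized batch:
import torch

# Report current vs. peak CUDA memory; the peak corresponds to the max_mem
# column printed by detectron2 above.
def log_gpu_memory(iteration):
    allocated = torch.cuda.memory_allocated() / 2**20       # MiB currently in use
    peak = torch.cuda.max_memory_allocated() / 2**20        # MiB peak since startup
    print(f"iter {iteration}: allocated {allocated:.0f}M, peak {peak:.0f}M")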
The above bug seems to be fixed. The results for the 20000th iter are:
copypaste: AP,AP50,AP75,APs,APm,APl
[03/17 09:13:19 d2.evaluation.testing]: 11.6693,20.4049,11.5888,4.6322,12.3034,17.4644
These results were trained with 8 x A100 cards. Can you share the results for the same checkpoint, so as to verify that the bug is fixed?
Besides, the training breaks at the 29980th iter with the following error:
Traceback (most recent call last):
File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/xxx//guangrui/gDeco/lauch_deepspeed.py", line 133, in _distributed_worker
main_func(*args)
File "/xxx//guangrui/gDeco/train_deepspeed.py", line 252, in main
do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
File "/xxx//guangrui/gDeco/train_deepspeed.py", line 209, in do_train
periodic_checkpointer.step(iteration)
File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 416, in step
self.checkpointer.save(
File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 106, in save
data[key] = obj.state_dict()
File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in state_dict
packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
File "/xxx/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in
packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
KeyError: 139902493578720
Have you encountered this before?
I haven't encountered this error before. It looks like the error is from saving the checkpoint. Was your previous checkpoint saved successfully?
Yes, it saved successfully at both the 10000th and 20000th iters, so it looks very weird.
I could only find a similar issue here: pytorch/pytorch#42428
It seems to be an issue with the PyTorch version, so is the torch you used < 1.6.0?
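To rule that out quickly, a one-line check of the installed version (standard PyTorch, nothing GRiT-specific) is enough:
import torch
# The linked pytorch/pytorch#42428 discussion ties this optimizer.state_dict()
# KeyError to older torch releases, so confirming the version is the first step.
print(torch.__version__)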
Please refer to the Installation instructions for our PyTorch version.
I have successfully trained the model for the description task on Visual Genome.
My environment-building procedure follows INSTALL.md.
@Evenyyy Hello, I'm sorry to bother you. Could you please share more details about evaluating the vg_instances_results.json file, or share your code? Thank you very much!
Thank you for your suggestion. I also encountered this issue. This problem has now been solved.