Comments (11)
@zhaohui-yang @adityaarun1,
I have just added data parallelism for IRNet training, and confirmed that it can still reproduce the results in the paper.
step.train_cam: Tue Sep 17 14:26:39 2019
Epoch 1/5
step: 0/ 3305 loss:0.6661 imps:0.3 lr: 0.1000 etc:Thu Sep 19 09:26:40 2019
step: 100/ 3305 loss:0.1896 imps:25.8 lr: 0.0973 etc:Tue Sep 17 15:01:01 2019
step: 200/ 3305 loss:0.1140 imps:41.0 lr: 0.0945 etc:Tue Sep 17 14:48:17 2019
step: 300/ 3305 loss:0.0935 imps:51.2 lr: 0.0918 etc:Tue Sep 17 14:44:01 2019
step: 400/ 3305 loss:0.0898 imps:58.5 lr: 0.0890 etc:Tue Sep 17 14:41:53 2019
step: 500/ 3305 loss:0.0826 imps:63.9 lr: 0.0863 etc:Tue Sep 17 14:40:36 2019
step: 600/ 3305 loss:0.0831 imps:68.2 lr: 0.0835 etc:Tue Sep 17 14:39:45 2019
validating ... loss: 0.0757
Epoch 2/5
step: 700/ 3305 loss:0.0773 imps:40.1 lr: 0.0807 etc:Tue Sep 17 14:41:25 2019
step: 800/ 3305 loss:0.0720 imps:70.7 lr: 0.0779 etc:Tue Sep 17 14:40:40 2019
step: 900/ 3305 loss:0.0706 imps:81.0 lr: 0.0751 etc:Tue Sep 17 14:40:06 2019
step: 1000/ 3305 loss:0.0715 imps:86.0 lr: 0.0723 etc:Tue Sep 17 14:39:38 2019
step: 1100/ 3305 loss:0.0708 imps:89.1 lr: 0.0695 etc:Tue Sep 17 14:39:16 2019
step: 1200/ 3305 loss:0.0646 imps:91.2 lr: 0.0666 etc:Tue Sep 17 14:38:57 2019
step: 1300/ 3305 loss:0.0659 imps:92.7 lr: 0.0638 etc:Tue Sep 17 14:38:41 2019
validating ... loss: 0.0647
Epoch 3/5
step: 1400/ 3305 loss:0.0609 imps:54.5 lr: 0.0609 etc:Tue Sep 17 14:39:42 2019
step: 1500/ 3305 loss:0.0570 imps:73.4 lr: 0.0580 etc:Tue Sep 17 14:39:26 2019
step: 1600/ 3305 loss:0.0582 imps:81.5 lr: 0.0551 etc:Tue Sep 17 14:39:11 2019
step: 1700/ 3305 loss:0.0575 imps:86.1 lr: 0.0522 etc:Tue Sep 17 14:38:58 2019
step: 1800/ 3305 loss:0.0576 imps:88.9 lr: 0.0493 etc:Tue Sep 17 14:38:46 2019
step: 1900/ 3305 loss:0.0532 imps:91.0 lr: 0.0463 etc:Tue Sep 17 14:38:36 2019
validating ... loss: 0.0548
Epoch 4/5
step: 2000/ 3305 loss:0.0531 imps:22.7 lr: 0.0433 etc:Tue Sep 17 14:39:16 2019
step: 2100/ 3305 loss:0.0481 imps:66.3 lr: 0.0403 etc:Tue Sep 17 14:39:05 2019
step: 2200/ 3305 loss:0.0457 imps:79.0 lr: 0.0373 etc:Tue Sep 17 14:38:55 2019
step: 2300/ 3305 loss:0.0467 imps:85.0 lr: 0.0343 etc:Tue Sep 17 14:38:46 2019
step: 2400/ 3305 loss:0.0503 imps:88.4 lr: 0.0312 etc:Tue Sep 17 14:38:38 2019
step: 2500/ 3305 loss:0.0480 imps:90.7 lr: 0.0281 etc:Tue Sep 17 14:38:30 2019
step: 2600/ 3305 loss:0.0448 imps:92.4 lr: 0.0249 etc:Tue Sep 17 14:38:23 2019
validating ... loss: 0.0515
Epoch 5/5
step: 2700/ 3305 loss:0.0442 imps:48.7 lr: 0.0217 etc:Tue Sep 17 14:38:53 2019
step: 2800/ 3305 loss:0.0375 imps:72.9 lr: 0.0184 etc:Tue Sep 17 14:38:46 2019
step: 2900/ 3305 loss:0.0418 imps:81.9 lr: 0.0151 etc:Tue Sep 17 14:38:39 2019
step: 3000/ 3305 loss:0.0417 imps:86.7 lr: 0.0117 etc:Tue Sep 17 14:38:33 2019
step: 3100/ 3305 loss:0.0386 imps:89.6 lr: 0.0082 etc:Tue Sep 17 14:38:27 2019
step: 3200/ 3305 loss:0.0384 imps:91.5 lr: 0.0045 etc:Tue Sep 17 14:38:21 2019
step: 3300/ 3305 loss:0.0362 imps:93.0 lr: 0.0003 etc:Tue Sep 17 14:38:16 2019
validating ... loss: 0.0493
step.make_cam: Tue Sep 17 14:38:36 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_cam: Tue Sep 17 14:43:39 2019
{'iou': array([0.79388312, 0.43113067, 0.28864309, 0.444585 , 0.36172684,
0.46973761, 0.61380454, 0.54396673, 0.48715231, 0.28845281,
0.57491489, 0.40641602, 0.458491 , 0.49758201, 0.61701023,
0.52529238, 0.42383762, 0.61912264, 0.44892162, 0.49405836,
0.46604461]), 'miou': 0.4883225761592359}
step.cam_to_ir_label: Tue Sep 17 14:44:05 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 ]
step.train_irn: Tue Sep 17 14:59:55 2019
Epoch 1/5
step: 0/ 3305 loss:1.2434 0.2856 4.0134 0.1892 imps:0.7 lr: 0.1000 etc:Wed Sep 18 13:19:38 2019
step: 100/ 3305 loss:0.4934 0.4499 3.8754 0.1043 imps:17.6 lr: 0.0973 etc:Tue Sep 17 15:49:54 2019
step: 200/ 3305 loss:0.4166 0.3639 3.6527 0.1442 imps:22.1 lr: 0.0945 etc:Tue Sep 17 15:39:42 2019
step: 300/ 3305 loss:0.4041 0.3523 3.5186 0.1616 imps:24.2 lr: 0.0918 etc:Tue Sep 17 15:36:20 2019
step: 400/ 3305 loss:0.3994 0.3596 3.3185 0.2086 imps:25.4 lr: 0.0890 etc:Tue Sep 17 15:34:38 2019
step: 500/ 3305 loss:0.3839 0.3394 3.1721 0.2243 imps:26.2 lr: 0.0863 etc:Tue Sep 17 15:33:31 2019
step: 600/ 3305 loss:0.3940 0.3481 3.1202 0.2139 imps:26.8 lr: 0.0835 etc:Tue Sep 17 15:32:51 2019
Epoch 2/5
step: 700/ 3305 loss:0.3802 0.3347 3.0618 0.2176 imps:15.4 lr: 0.0807 etc:Tue Sep 17 15:33:59 2019
step: 800/ 3305 loss:0.3771 0.3315 3.0399 0.2230 imps:23.6 lr: 0.0779 etc:Tue Sep 17 15:33:23 2019
step: 900/ 3305 loss:0.3779 0.3291 3.0268 0.2232 imps:25.9 lr: 0.0751 etc:Tue Sep 17 15:32:57 2019
step: 1000/ 3305 loss:0.3765 0.3284 3.0080 0.2230 imps:27.0 lr: 0.0723 etc:Tue Sep 17 15:32:35 2019
step: 1100/ 3305 loss:0.3739 0.3317 2.9594 0.2122 imps:27.7 lr: 0.0695 etc:Tue Sep 17 15:32:15 2019
step: 1200/ 3305 loss:0.3791 0.3385 2.9742 0.2159 imps:28.1 lr: 0.0666 etc:Tue Sep 17 15:31:59 2019
step: 1300/ 3305 loss:0.3731 0.3293 2.9441 0.2143 imps:28.5 lr: 0.0638 etc:Tue Sep 17 15:31:45 2019
Epoch 3/5
step: 1400/ 3305 loss:0.3746 0.3318 2.9068 0.2132 imps:20.2 lr: 0.0609 etc:Tue Sep 17 15:32:24 2019
step: 1500/ 3305 loss:0.3722 0.3240 2.9324 0.2056 imps:24.6 lr: 0.0580 etc:Tue Sep 17 15:32:13 2019
step: 1600/ 3305 loss:0.3630 0.3207 2.9220 0.2053 imps:26.2 lr: 0.0551 etc:Tue Sep 17 15:32:03 2019
step: 1700/ 3305 loss:0.3734 0.3279 2.8887 0.2145 imps:27.2 lr: 0.0522 etc:Tue Sep 17 15:31:53 2019
step: 1800/ 3305 loss:0.3639 0.3170 2.8827 0.2084 imps:27.8 lr: 0.0493 etc:Tue Sep 17 15:31:42 2019
step: 1900/ 3305 loss:0.3662 0.3194 2.8690 0.2112 imps:28.2 lr: 0.0463 etc:Tue Sep 17 15:31:33 2019
Epoch 4/5
step: 2000/ 3305 loss:0.3596 0.3165 2.8963 0.2054 imps:9.7 lr: 0.0433 etc:Tue Sep 17 15:32:00 2019
step: 2100/ 3305 loss:0.3636 0.3154 2.8559 0.2216 imps:22.8 lr: 0.0403 etc:Tue Sep 17 15:31:52 2019
step: 2200/ 3305 loss:0.3569 0.3119 2.8304 0.2085 imps:25.5 lr: 0.0373 etc:Tue Sep 17 15:31:47 2019
step: 2300/ 3305 loss:0.3651 0.3224 2.8433 0.2046 imps:26.6 lr: 0.0343 etc:Tue Sep 17 15:31:41 2019
step: 2400/ 3305 loss:0.3563 0.3121 2.8420 0.2105 imps:27.3 lr: 0.0312 etc:Tue Sep 17 15:31:35 2019
step: 2500/ 3305 loss:0.3537 0.3078 2.8178 0.2024 imps:27.8 lr: 0.0281 etc:Tue Sep 17 15:31:30 2019
step: 2600/ 3305 loss:0.3619 0.3137 2.8092 0.2042 imps:28.1 lr: 0.0249 etc:Tue Sep 17 15:31:25 2019
Epoch 5/5
step: 2700/ 3305 loss:0.3569 0.3068 2.8103 0.1992 imps:18.2 lr: 0.0217 etc:Tue Sep 17 15:31:45 2019
step: 2800/ 3305 loss:0.3515 0.3021 2.7876 0.2031 imps:24.1 lr: 0.0184 etc:Tue Sep 17 15:31:41 2019
step: 2900/ 3305 loss:0.3555 0.3133 2.7876 0.1998 imps:26.1 lr: 0.0151 etc:Tue Sep 17 15:31:36 2019
step: 3000/ 3305 loss:0.3538 0.3070 2.7692 0.1936 imps:27.1 lr: 0.0117 etc:Tue Sep 17 15:31:31 2019
step: 3100/ 3305 loss:0.3605 0.3197 2.7654 0.1988 imps:27.6 lr: 0.0082 etc:Tue Sep 17 15:31:28 2019
step: 3200/ 3305 loss:0.3464 0.3009 2.7539 0.1878 imps:28.1 lr: 0.0045 etc:Tue Sep 17 15:31:23 2019
step: 3300/ 3305 loss:0.3555 0.3044 2.7294 0.1937 imps:28.4 lr: 0.0003 etc:Tue Sep 17 15:31:19 2019
Analyzing displacements mean ... done.
step.make_ins_seg_labels: Tue Sep 17 15:32:01 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_ins_seg: Tue Sep 17 15:43:17 2019
0.5iou: {'ap': array([0.36661056, 0.00547154, 0.574922 , 0.30045993, 0.15886656,
0.5995807 , 0.37568754, 0.67267337, 0.05372927, 0.51445742,
0.17437114, 0.56820096, 0.59092809, 0.47832304, 0.2252956 ,
0.07792715, 0.35047058, 0.36736559, 0.4721443 , 0.57766263]), 'map': 0.37525739839692324}
step.make_sem_seg_labels: Tue Sep 17 15:43:59 2019
[0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_sem_seg: Tue Sep 17 15:53:27 2019
0.07191166029597273 0.046077651513210416
0.14463595416636954 0.20396592048661016
{'iou': array([0.88201069, 0.66970496, 0.35053706, 0.77762538, 0.60824307,
0.61367099, 0.80644066, 0.71936881, 0.75847351, 0.35081316,
0.79498655, 0.42315924, 0.7327815 , 0.77665314, 0.76811858,
0.68669326, 0.53031597, 0.81550075, 0.58572881, 0.65244931,
0.60669782]), 'miou': 0.6623796759586297}
completed train process
from irn.
The MeanShift layer is somewhat dependent on the per-GPU batch size, which might cause the problem.
https://github.com/jiwoon-ahn/irn/blob/master/net/resnet50_irn.py#L99
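To make the batch-size dependence concrete, here is a minimal pure-Python sketch (the `mean_shift` function below is our toy stand-in, not the actual `MeanShift` layer in `resnet50_irn.py`) of why a layer that subtracts the batch mean gives different outputs once the batch is split across GPUs:

```python
import statistics

# Toy mean-shift step: subtract the batch mean from every sample.
# (Hypothetical illustration; not the code from resnet50_irn.py.)
def mean_shift(batch):
    m = statistics.fmean(batch)
    return [x - m for x in batch]

batch = [1.0, 2.0, 3.0, 10.0]

# One GPU sees the whole batch; nn.DataParallel splits the batch,
# so each replica subtracts its *own* chunk mean instead.
single_gpu = mean_shift(batch)
two_gpus = mean_shift(batch[:2]) + mean_shift(batch[2:])

print(single_gpu)  # [-3.0, -2.0, -1.0, 6.0]  (mean over all 4 samples)
print(two_gpus)    # [-0.5, 0.5, -3.5, 3.5]   (mean per chunk of 2)
```

The two outputs disagree even though the data and weights are identical, which is exactly the kind of silent divergence a per-GPU batch-statistics layer introduces.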
I figured out the issue. When I train IRN on a single GPU it works fine, but when training on multiple GPUs the results are bad. I am closing this issue. Also, it would be nice if someone could figure out what goes wrong when adding torch.nn.DataParallel
to IRN [here].
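For reference, what torch.nn.DataParallel does per forward pass can be mimicked in a few lines of plain Python (a sketch with made-up helper names, no real GPUs involved): scatter the batch into per-device chunks, run a replica on each chunk, then gather the outputs.

```python
# Minimal pure-Python mock of DataParallel's scatter/forward/gather.
# (Illustrative names; not the torch implementation.)
def scatter(batch, n_devices):
    k = (len(batch) + n_devices - 1) // n_devices  # chunk size
    return [batch[i:i + k] for i in range(0, len(batch), k)]

def data_parallel(module, batch, n_devices):
    chunks = scatter(batch, n_devices)
    outputs = [module(chunk) for chunk in chunks]  # one replica per chunk
    return [y for out in outputs for y in out]     # gather

# Per-sample modules are unaffected by the split...
double = lambda chunk: [2 * x for x in chunk]
print(data_parallel(double, [1, 2, 3, 4], 2))  # [2, 4, 6, 8]
```

The catch: any submodule whose output depends on the whole batch (BatchNorm in train() mode, the MeanShift layer) now sees only `len(batch) / n_devices` samples at a time, which is where results can diverge from single-GPU training.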
@adityaarun1 Have you solved this problem? I used 4 GPUs to run train_irn.py in parallel and encountered the same problem as you. However, my Titan with 12 GB could only train IRN with batch_size = 16, which achieved mAP = 35.7. I wonder how you solved this problem, thanks!
@zhaohui-yang No, I haven't been able to solve this. Adding torch.nn.DataParallel
works fine while training, but I am unable to replicate the results with it.
I also tried running on a larger GPU (V100) with the default hyperparameters. Across various runs, the best accuracy I have achieved is 36.7 mAP, so there still seems to be a ~1% gap.
Yes, I observed the loss decreasing as usual, and I'm not sure of the reason. I used multiple GPUs for parallel training and a single GPU for evaluation, and the problem still exists. I think several factors may be involved:
- The scatter and gather operations. For parallel training, the data and targets are automatically split into n_gpus chunks and processed separately. I will check the data shapes and the data itself.
- An incorrect data-target pairing. Training would still converge, but against the wrong targets. (Personally, I don't think this is the reason.)
- An incorrect forward mode. The forward function runs resnet50.forward() in eval() mode while edge_model and dp_model are in train() mode. If you are familiar with classification tasks: when training behaves correctly but evaluation looks strange, the problem is usually the BN parameters. I will try splitting resnet50_irn into three networks: resnet50 + edge_model + dp_model.
Thank you for your advice!
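The BN concern in point 3 can be illustrated with a toy 1-D batch norm (a sketch with no affine parameters or variance term, not torch's BatchNorm): in train() mode it normalizes with the statistics of the current (per-GPU) sub-batch and updates its running statistics; in eval() mode it uses the frozen running statistics, so a submodule left in the wrong mode normalizes differently.

```python
import statistics

# Toy mean-only batch norm, to show the train()/eval() asymmetry.
# (Illustrative; real BN also tracks variance and has affine params.)
class ToyBatchNorm:
    def __init__(self, momentum=0.1):
        self.training = True
        self.momentum = momentum
        self.running_mean = 0.0

    def __call__(self, batch):
        if self.training:
            m = statistics.fmean(batch)  # current (sub-)batch mean
            self.running_mean += self.momentum * (m - self.running_mean)
        else:
            m = self.running_mean        # frozen running statistics
        return [x - m for x in batch]

bn = ToyBatchNorm()
print(bn([2.0, 4.0]))   # train mode: normalizes with batch mean 3.0
bn.training = False
print(bn([2.0, 4.0]))   # eval mode: uses running_mean (0.3), not 3.0
```

Under DataParallel each replica updates its BN statistics from its own sub-batch, so the running statistics used later at eval time can drift from what full-batch training would have produced.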
I have checked point 2 and it is fine. Point 3 could be an issue, but everything seems to work on a single GPU, so I am not sure what goes wrong when you train on multiple GPUs but test on one.
I observed the loss curves. Both single-GPU and multi-GPU training converge, but the single-GPU loss converges to ~0.37 while the multi-GPU loss only converges to ~0.44. I think something is incorrect in the training stage, maybe point 3 or the optimizer. Not sure.
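One quick way to narrow down the optimizer hypothesis: for a loss that is a *mean* over the batch, averaging equal-sized per-chunk gradients gives exactly the full-batch gradient, so plain gradient averaging by itself should not change the optimization. A toy numeric check (our own example, not IRNet's loss):

```python
# Gradient of mean((w*x - y)^2) w.r.t. w, for a toy linear model.
def grad(w, xs, ys):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad(w, xs, ys)                                      # full batch of 4
chunked = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2  # 2 "GPUs"

print(full, chunked)  # identical for equal-sized chunks
```

If gradient averaging is exact, the remaining suspects are the layers whose *forward pass* depends on the per-GPU batch (point 3's BN statistics and the MeanShift layer), which fits the loss-gap observation above.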
Yes, I have observed the same. But in my opinion that still does not explain the big difference in the results.
@jiwoon-ahn Thanks. This helps. 😃
@jiwoon-ahn Thanks! Everything's fine!