Comments (11)
@zhaohui-yang @adityaarun1,
I have just added data parallelism for IRNet training, and confirmed that it can still reproduce the results in the paper.
step.train_cam: Tue Sep 17 14:26:39 2019
Epoch 1/5
step: 0/ 3305 loss:0.6661 imps:0.3 lr: 0.1000 etc:Thu Sep 19 09:26:40 2019
step: 100/ 3305 loss:0.1896 imps:25.8 lr: 0.0973 etc:Tue Sep 17 15:01:01 2019
step: 200/ 3305 loss:0.1140 imps:41.0 lr: 0.0945 etc:Tue Sep 17 14:48:17 2019
step: 300/ 3305 loss:0.0935 imps:51.2 lr: 0.0918 etc:Tue Sep 17 14:44:01 2019
step: 400/ 3305 loss:0.0898 imps:58.5 lr: 0.0890 etc:Tue Sep 17 14:41:53 2019
step: 500/ 3305 loss:0.0826 imps:63.9 lr: 0.0863 etc:Tue Sep 17 14:40:36 2019
step: 600/ 3305 loss:0.0831 imps:68.2 lr: 0.0835 etc:Tue Sep 17 14:39:45 2019
validating ... loss: 0.0757
Epoch 2/5
step: 700/ 3305 loss:0.0773 imps:40.1 lr: 0.0807 etc:Tue Sep 17 14:41:25 2019
step: 800/ 3305 loss:0.0720 imps:70.7 lr: 0.0779 etc:Tue Sep 17 14:40:40 2019
step: 900/ 3305 loss:0.0706 imps:81.0 lr: 0.0751 etc:Tue Sep 17 14:40:06 2019
step: 1000/ 3305 loss:0.0715 imps:86.0 lr: 0.0723 etc:Tue Sep 17 14:39:38 2019
step: 1100/ 3305 loss:0.0708 imps:89.1 lr: 0.0695 etc:Tue Sep 17 14:39:16 2019
step: 1200/ 3305 loss:0.0646 imps:91.2 lr: 0.0666 etc:Tue Sep 17 14:38:57 2019
step: 1300/ 3305 loss:0.0659 imps:92.7 lr: 0.0638 etc:Tue Sep 17 14:38:41 2019
validating ... loss: 0.0647
Epoch 3/5
step: 1400/ 3305 loss:0.0609 imps:54.5 lr: 0.0609 etc:Tue Sep 17 14:39:42 2019
step: 1500/ 3305 loss:0.0570 imps:73.4 lr: 0.0580 etc:Tue Sep 17 14:39:26 2019
step: 1600/ 3305 loss:0.0582 imps:81.5 lr: 0.0551 etc:Tue Sep 17 14:39:11 2019
step: 1700/ 3305 loss:0.0575 imps:86.1 lr: 0.0522 etc:Tue Sep 17 14:38:58 2019
step: 1800/ 3305 loss:0.0576 imps:88.9 lr: 0.0493 etc:Tue Sep 17 14:38:46 2019
step: 1900/ 3305 loss:0.0532 imps:91.0 lr: 0.0463 etc:Tue Sep 17 14:38:36 2019
validating ... loss: 0.0548
Epoch 4/5
step: 2000/ 3305 loss:0.0531 imps:22.7 lr: 0.0433 etc:Tue Sep 17 14:39:16 2019
step: 2100/ 3305 loss:0.0481 imps:66.3 lr: 0.0403 etc:Tue Sep 17 14:39:05 2019
step: 2200/ 3305 loss:0.0457 imps:79.0 lr: 0.0373 etc:Tue Sep 17 14:38:55 2019
step: 2300/ 3305 loss:0.0467 imps:85.0 lr: 0.0343 etc:Tue Sep 17 14:38:46 2019
step: 2400/ 3305 loss:0.0503 imps:88.4 lr: 0.0312 etc:Tue Sep 17 14:38:38 2019
step: 2500/ 3305 loss:0.0480 imps:90.7 lr: 0.0281 etc:Tue Sep 17 14:38:30 2019
step: 2600/ 3305 loss:0.0448 imps:92.4 lr: 0.0249 etc:Tue Sep 17 14:38:23 2019
validating ... loss: 0.0515
Epoch 5/5
step: 2700/ 3305 loss:0.0442 imps:48.7 lr: 0.0217 etc:Tue Sep 17 14:38:53 2019
step: 2800/ 3305 loss:0.0375 imps:72.9 lr: 0.0184 etc:Tue Sep 17 14:38:46 2019
step: 2900/ 3305 loss:0.0418 imps:81.9 lr: 0.0151 etc:Tue Sep 17 14:38:39 2019
step: 3000/ 3305 loss:0.0417 imps:86.7 lr: 0.0117 etc:Tue Sep 17 14:38:33 2019
step: 3100/ 3305 loss:0.0386 imps:89.6 lr: 0.0082 etc:Tue Sep 17 14:38:27 2019
step: 3200/ 3305 loss:0.0384 imps:91.5 lr: 0.0045 etc:Tue Sep 17 14:38:21 2019
step: 3300/ 3305 loss:0.0362 imps:93.0 lr: 0.0003 etc:Tue Sep 17 14:38:16 2019
validating ... loss: 0.0493
step.make_cam: Tue Sep 17 14:38:36 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_cam: Tue Sep 17 14:43:39 2019
{'iou': array([0.79388312, 0.43113067, 0.28864309, 0.444585 , 0.36172684,
0.46973761, 0.61380454, 0.54396673, 0.48715231, 0.28845281,
0.57491489, 0.40641602, 0.458491 , 0.49758201, 0.61701023,
0.52529238, 0.42383762, 0.61912264, 0.44892162, 0.49405836,
0.46604461]), 'miou': 0.4883225761592359}
step.cam_to_ir_label: Tue Sep 17 14:44:05 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 ]
step.train_irn: Tue Sep 17 14:59:55 2019
Epoch 1/5
step: 0/ 3305 loss:1.2434 0.2856 4.0134 0.1892 imps:0.7 lr: 0.1000 etc:Wed Sep 18 13:19:38 2019
step: 100/ 3305 loss:0.4934 0.4499 3.8754 0.1043 imps:17.6 lr: 0.0973 etc:Tue Sep 17 15:49:54 2019
step: 200/ 3305 loss:0.4166 0.3639 3.6527 0.1442 imps:22.1 lr: 0.0945 etc:Tue Sep 17 15:39:42 2019
step: 300/ 3305 loss:0.4041 0.3523 3.5186 0.1616 imps:24.2 lr: 0.0918 etc:Tue Sep 17 15:36:20 2019
step: 400/ 3305 loss:0.3994 0.3596 3.3185 0.2086 imps:25.4 lr: 0.0890 etc:Tue Sep 17 15:34:38 2019
step: 500/ 3305 loss:0.3839 0.3394 3.1721 0.2243 imps:26.2 lr: 0.0863 etc:Tue Sep 17 15:33:31 2019
step: 600/ 3305 loss:0.3940 0.3481 3.1202 0.2139 imps:26.8 lr: 0.0835 etc:Tue Sep 17 15:32:51 2019
Epoch 2/5
step: 700/ 3305 loss:0.3802 0.3347 3.0618 0.2176 imps:15.4 lr: 0.0807 etc:Tue Sep 17 15:33:59 2019
step: 800/ 3305 loss:0.3771 0.3315 3.0399 0.2230 imps:23.6 lr: 0.0779 etc:Tue Sep 17 15:33:23 2019
step: 900/ 3305 loss:0.3779 0.3291 3.0268 0.2232 imps:25.9 lr: 0.0751 etc:Tue Sep 17 15:32:57 2019
step: 1000/ 3305 loss:0.3765 0.3284 3.0080 0.2230 imps:27.0 lr: 0.0723 etc:Tue Sep 17 15:32:35 2019
step: 1100/ 3305 loss:0.3739 0.3317 2.9594 0.2122 imps:27.7 lr: 0.0695 etc:Tue Sep 17 15:32:15 2019
step: 1200/ 3305 loss:0.3791 0.3385 2.9742 0.2159 imps:28.1 lr: 0.0666 etc:Tue Sep 17 15:31:59 2019
step: 1300/ 3305 loss:0.3731 0.3293 2.9441 0.2143 imps:28.5 lr: 0.0638 etc:Tue Sep 17 15:31:45 2019
Epoch 3/5
step: 1400/ 3305 loss:0.3746 0.3318 2.9068 0.2132 imps:20.2 lr: 0.0609 etc:Tue Sep 17 15:32:24 2019
step: 1500/ 3305 loss:0.3722 0.3240 2.9324 0.2056 imps:24.6 lr: 0.0580 etc:Tue Sep 17 15:32:13 2019
step: 1600/ 3305 loss:0.3630 0.3207 2.9220 0.2053 imps:26.2 lr: 0.0551 etc:Tue Sep 17 15:32:03 2019
step: 1700/ 3305 loss:0.3734 0.3279 2.8887 0.2145 imps:27.2 lr: 0.0522 etc:Tue Sep 17 15:31:53 2019
step: 1800/ 3305 loss:0.3639 0.3170 2.8827 0.2084 imps:27.8 lr: 0.0493 etc:Tue Sep 17 15:31:42 2019
step: 1900/ 3305 loss:0.3662 0.3194 2.8690 0.2112 imps:28.2 lr: 0.0463 etc:Tue Sep 17 15:31:33 2019
Epoch 4/5
step: 2000/ 3305 loss:0.3596 0.3165 2.8963 0.2054 imps:9.7 lr: 0.0433 etc:Tue Sep 17 15:32:00 2019
step: 2100/ 3305 loss:0.3636 0.3154 2.8559 0.2216 imps:22.8 lr: 0.0403 etc:Tue Sep 17 15:31:52 2019
step: 2200/ 3305 loss:0.3569 0.3119 2.8304 0.2085 imps:25.5 lr: 0.0373 etc:Tue Sep 17 15:31:47 2019
step: 2300/ 3305 loss:0.3651 0.3224 2.8433 0.2046 imps:26.6 lr: 0.0343 etc:Tue Sep 17 15:31:41 2019
step: 2400/ 3305 loss:0.3563 0.3121 2.8420 0.2105 imps:27.3 lr: 0.0312 etc:Tue Sep 17 15:31:35 2019
step: 2500/ 3305 loss:0.3537 0.3078 2.8178 0.2024 imps:27.8 lr: 0.0281 etc:Tue Sep 17 15:31:30 2019
step: 2600/ 3305 loss:0.3619 0.3137 2.8092 0.2042 imps:28.1 lr: 0.0249 etc:Tue Sep 17 15:31:25 2019
Epoch 5/5
step: 2700/ 3305 loss:0.3569 0.3068 2.8103 0.1992 imps:18.2 lr: 0.0217 etc:Tue Sep 17 15:31:45 2019
step: 2800/ 3305 loss:0.3515 0.3021 2.7876 0.2031 imps:24.1 lr: 0.0184 etc:Tue Sep 17 15:31:41 2019
step: 2900/ 3305 loss:0.3555 0.3133 2.7876 0.1998 imps:26.1 lr: 0.0151 etc:Tue Sep 17 15:31:36 2019
step: 3000/ 3305 loss:0.3538 0.3070 2.7692 0.1936 imps:27.1 lr: 0.0117 etc:Tue Sep 17 15:31:31 2019
step: 3100/ 3305 loss:0.3605 0.3197 2.7654 0.1988 imps:27.6 lr: 0.0082 etc:Tue Sep 17 15:31:28 2019
step: 3200/ 3305 loss:0.3464 0.3009 2.7539 0.1878 imps:28.1 lr: 0.0045 etc:Tue Sep 17 15:31:23 2019
step: 3300/ 3305 loss:0.3555 0.3044 2.7294 0.1937 imps:28.4 lr: 0.0003 etc:Tue Sep 17 15:31:19 2019
Analyzing displacements mean ... done.
step.make_ins_seg_labels: Tue Sep 17 15:32:01 2019
[ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_ins_seg: Tue Sep 17 15:43:17 2019
0.5iou: {'ap': array([0.36661056, 0.00547154, 0.574922 , 0.30045993, 0.15886656,
0.5995807 , 0.37568754, 0.67267337, 0.05372927, 0.51445742,
0.17437114, 0.56820096, 0.59092809, 0.47832304, 0.2252956 ,
0.07792715, 0.35047058, 0.36736559, 0.4721443 , 0.57766263]), 'map': 0.37525739839692324}
step.make_sem_seg_labels: Tue Sep 17 15:43:59 2019
[0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_sem_seg: Tue Sep 17 15:53:27 2019
0.07191166029597273 0.046077651513210416
0.14463595416636954 0.20396592048661016
{'iou': array([0.88201069, 0.66970496, 0.35053706, 0.77762538, 0.60824307,
0.61367099, 0.80644066, 0.71936881, 0.75847351, 0.35081316,
0.79498655, 0.42315924, 0.7327815 , 0.77665314, 0.76811858,
0.68669326, 0.53031597, 0.81550075, 0.58572881, 0.65244931,
0.60669782]), 'miou': 0.6623796759586297}
completed train process
from irn.
The MeanShift layer is somewhat dependent on the per-GPU batch size, which might cause the problem.
https://github.com/jiwoon-ahn/irn/blob/master/net/resnet50_irn.py#L99
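To make the batch-size dependence concrete, here is a minimal pure-Python sketch (the `mean_shift` function below is our toy stand-in, not the actual `MeanShift` layer in `resnet50_irn.py`) of why a layer that subtracts the batch mean gives different outputs once the batch is split across GPUs:

```python
import statistics

# Toy mean-shift step: subtract the batch mean from every sample.
# (Hypothetical illustration; not the code from resnet50_irn.py.)
def mean_shift(batch):
    m = statistics.fmean(batch)
    return [x - m for x in batch]

batch = [1.0, 2.0, 3.0, 10.0]

# One GPU sees the whole batch; nn.DataParallel splits the batch,
# so each replica subtracts its *own* chunk mean instead.
single_gpu = mean_shift(batch)
two_gpus = mean_shift(batch[:2]) + mean_shift(batch[2:])

print(single_gpu)  # [-3.0, -2.0, -1.0, 6.0]  (mean over all 4 samples)
print(two_gpus)    # [-0.5, 0.5, -3.5, 3.5]   (mean per chunk of 2)
```

The two outputs disagree even though the data and weights are identical, which is exactly the kind of silent divergence a per-GPU batch-statistics layer introduces.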
I figured out the issue. When I train IRN on a single GPU it works fine, but when training on multiple GPUs the results are bad. I am closing this issue. Also, it would be nice if someone could figure out what goes wrong when adding torch.nn.DataParallel
to IRN [here].
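For reference, what torch.nn.DataParallel does per forward pass can be mimicked in a few lines of plain Python (a sketch with made-up helper names, no real GPUs involved): scatter the batch into per-device chunks, run a replica on each chunk, then gather the outputs.

```python
# Minimal pure-Python mock of DataParallel's scatter/forward/gather.
# (Illustrative names; not the torch implementation.)
def scatter(batch, n_devices):
    k = (len(batch) + n_devices - 1) // n_devices  # chunk size
    return [batch[i:i + k] for i in range(0, len(batch), k)]

def data_parallel(module, batch, n_devices):
    chunks = scatter(batch, n_devices)
    outputs = [module(chunk) for chunk in chunks]  # one replica per chunk
    return [y for out in outputs for y in out]     # gather

# Per-sample modules are unaffected by the split...
double = lambda chunk: [2 * x for x in chunk]
print(data_parallel(double, [1, 2, 3, 4], 2))  # [2, 4, 6, 8]
```

The catch: any submodule whose output depends on the whole batch (BatchNorm in train() mode, the MeanShift layer) now sees only `len(batch) / n_devices` samples at a time, which is where results can diverge from single-GPU training.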
@adityaarun1 Have you solved this problem? I used 4 GPUs to run train_irn.py in parallel and encountered the same problem as you. However, my Titan with 12 GB could only train IRN with batch_size = 16, which achieved mAP = 35.7. I wonder how you solved this problem, thanks!
@zhaohui-yang No, I haven't been able to solve this. Adding torch.nn.DataParallel
works fine while training, but I am unable to replicate the results with it.
I also tried running on a larger GPU (V100) with the default hyperparameters. Across various runs, the best accuracy I have achieved is 36.7 mAP, so there still seems to be a ~1% gap.
Yes, I observed the loss decreasing as usual, and I'm not sure of the reason. I used multiple GPUs for parallel training and a single GPU for evaluation, and the problem still exists. I think several factors may be involved:
- The scatter and gather operations. For parallel training, the data and targets are automatically split into n_gpus chunks and processed separately. I will check the data shapes and the data itself.
- An incorrect data-target pairing. Training would still converge, but against the wrong targets. (Personally, I don't think this is the reason.)
- An incorrect forward mode. The forward function runs resnet50.forward() in eval() mode while edge_model and dp_model are in train() mode. If you are familiar with classification tasks: when training behaves correctly but evaluation looks strange, the problem is usually the BN parameters. I will try splitting resnet50_irn into three networks: resnet50 + edge_model + dp_model.
Thank you for your advice!
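The BN concern in point 3 can be illustrated with a toy 1-D batch norm (a sketch with no affine parameters or variance term, not torch's BatchNorm): in train() mode it normalizes with the statistics of the current (per-GPU) sub-batch and updates its running statistics; in eval() mode it uses the frozen running statistics, so a submodule left in the wrong mode normalizes differently.

```python
import statistics

# Toy mean-only batch norm, to show the train()/eval() asymmetry.
# (Illustrative; real BN also tracks variance and has affine params.)
class ToyBatchNorm:
    def __init__(self, momentum=0.1):
        self.training = True
        self.momentum = momentum
        self.running_mean = 0.0

    def __call__(self, batch):
        if self.training:
            m = statistics.fmean(batch)  # current (sub-)batch mean
            self.running_mean += self.momentum * (m - self.running_mean)
        else:
            m = self.running_mean        # frozen running statistics
        return [x - m for x in batch]

bn = ToyBatchNorm()
print(bn([2.0, 4.0]))   # train mode: normalizes with batch mean 3.0
bn.training = False
print(bn([2.0, 4.0]))   # eval mode: uses running_mean (0.3), not 3.0
```

Under DataParallel each replica updates its BN statistics from its own sub-batch, so the running statistics used later at eval time can drift from what full-batch training would have produced.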
I have checked point 2 and it is fine. Point 3 could be an issue, but everything seems to work on a single GPU, so I am not sure what goes wrong when you train on multiple GPUs but test on one.
I observed the loss curves. Both single-GPU and multi-GPU training converge, but the single-GPU loss converges to ~0.37 while the multi-GPU loss only converges to ~0.44. I think something is incorrect in the training stage, maybe point 3 or the optimizer. Not sure.
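One quick way to narrow down the optimizer hypothesis: for a loss that is a *mean* over the batch, averaging equal-sized per-chunk gradients gives exactly the full-batch gradient, so plain gradient averaging by itself should not change the optimization. A toy numeric check (our own example, not IRNet's loss):

```python
# Gradient of mean((w*x - y)^2) w.r.t. w, for a toy linear model.
def grad(w, xs, ys):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad(w, xs, ys)                                      # full batch of 4
chunked = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2  # 2 "GPUs"

print(full, chunked)  # identical for equal-sized chunks
```

If gradient averaging is exact, the remaining suspects are the layers whose *forward pass* depends on the per-GPU batch (point 3's BN statistics and the MeanShift layer), which fits the loss-gap observation above.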
Yes, I have observed the same. But in my opinion that still does not explain the big difference in the results.
@jiwoon-ahn Thanks. This helps. 😃
@jiwoon-ahn Thanks! Everything's fine!