catalyst-team / catalyst
Accelerated deep learning R&D
Home Page: https://catalyst-team.com
License: Apache License 2.0
Hi guys, thank you for a great project.
I tried to run the notebook example in my Anaconda environment and hit a "float division by zero" exception.
Here is the code that causes the exception: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9 (basically the notebook example copy-pasted into one .py file).
This is the stacktrace:
C:\Users\Daiver\Anaconda3\python.exe C:/Users/Daiver/PycharmProjects/untitled/main.py
Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
0 * Epoch (train): 0% 1/1563 [00:00<15:06, 1.72it/s, base/batch_time=0.01562, base/data_time=0.01562, base/sample_per_second=2048.43759, loss=2.32288, lr=0.00100, momentum=0.90000, precision01=3.12500, precision03=18.75000, precision05=56.25000]Traceback (most recent call last):
File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 113, in <module>
main()
File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 109, in main
epochs=n_epochs, verbose=True)
File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 210, in train
verbose=verbose
File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 159, in run
self.run_event(callbacks=callbacks, event="on_batch_end")
File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 92, in run_event
getattr(self.state, f"{event}_pre")(state=self.state)
File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\state.py", line 203, in on_batch_end_pre
state.batch_size / elapsed_time
ZeroDivisionError: float division by zero
Process finished with exit code 1
It can be fixed by adding a zero check on elapsed_time, but I have no idea why elapsed_time is zero.
My Python/Catalyst versions:
Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6
Catalyst was installed by cloning the current repo (master branch, last commit 892d5e5 "Merge pull request #56 from dbrainio/master").
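A plausible cause (an assumption, not confirmed by the traceback alone): on Windows, `time.time()` ticks in roughly 15.6 ms steps; note that `batch_time=0.01562` in the log is exactly one tick, so a fast batch can finish between two identical timer readings and produce `elapsed_time == 0`. A defensive sketch of the division, with a hypothetical `samples_per_second` helper:

```python
def samples_per_second(batch_size, elapsed_time, eps=1e-8):
    # Clamp elapsed_time so that a zero reading (low-resolution
    # time.time() on Windows) does not raise ZeroDivisionError.
    # Using time.perf_counter() for the measurement itself would
    # also avoid the coarse timer granularity.
    return batch_size / max(elapsed_time, eps)
```

With a non-zero elapsed time the result is unchanged; with a zero reading it degrades to a large-but-finite value instead of crashing.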
Add PyTorch SWA to catalyst optimizers
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "catalyst/rl/offpolicy/scripts/run_samplers.py", line 162, in run_sampler
sampler.run()
File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 317, in run
self.buffer.push_transition(transition)
File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 107, in push_transition
self.observations[self.pointer + 1] = s_tp1
IndexError: index 5000 is out of bounds for axis 0 with size 5000
From the code it seems like there is nothing preventing buffer overflow. I can try to fix it if someone confirms that it is really as easy as it seems, or tells me what I am missing. I would just increase the self.observations size by one.
UPD: Oh gosh, it keeps iterating further; it seems like the buffer size is not the limit.
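Since the sampler keeps iterating past the capacity, a wrap-around write pointer (a standard ring buffer) may be the intended behavior rather than growing the array. A minimal sketch, assuming a fixed-capacity buffer; this is not the actual catalyst implementation:

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size replay buffer that wraps the write pointer
    instead of indexing past the end of the array."""

    def __init__(self, capacity, obs_shape):
        self.capacity = capacity
        self.observations = np.zeros((capacity, *obs_shape), dtype=np.float32)
        self.pointer = 0
        self.size = 0

    def push_transition(self, observation):
        # Write at the current slot, then advance modulo capacity,
        # so old transitions are overwritten once the buffer is full.
        self.observations[self.pointer] = observation
        self.pointer = (self.pointer + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
```

This avoids both the IndexError and unbounded growth; whether overwriting old transitions is acceptable depends on the algorithm.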
I have been using IouCallback with default parameters to train a Unet model:
runner.train(
    model=model,
    main_metric="iou",
    minimize_metric=False,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    logdir=logdir,
    scheduler=scheduler,
    callbacks=[IouCallback()],
    num_epochs=num_epochs,
    verbose=True,
)
As a result, the metric log looks this way:
[2019-05-26 13:25:18,540]
0/10 * Epoch 0 (train): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=14.2914 | _timers/batch_time=0.4415 | _timers/data_time=0.3346 | _timers/model_time=0.1068 | iou=0.2843 | loss=1.2827
0/10 * Epoch 0 (valid): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=49.2591 | _timers/batch_time=0.1080 | _timers/data_time=0.0271 | _timers/model_time=0.0809 | iou=0.6478 | loss=0.5063
[2019-05-26 13:25:50,207]
1/10 * Epoch 1 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3666 | _timers/batch_time=0.3442 | _timers/data_time=0.3376 | _timers/model_time=0.0065 | iou=0.5010 | loss=0.4697
1/10 * Epoch 1 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=46.6509 | _timers/batch_time=0.0335 | _timers/data_time=0.0289 | _timers/model_time=0.0045 | iou=0.9374 | loss=-1.6844
[2019-05-26 13:26:21,913]
2/10 * Epoch 2 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.1218 | _timers/batch_time=0.3469 | _timers/data_time=0.3409 | _timers/model_time=0.0059 | iou=0.5980 | loss=0.0623
2/10 * Epoch 2 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.7021 | _timers/batch_time=0.0343 | _timers/data_time=0.0293 | _timers/model_time=0.0050 | iou=1.0979 | loss=-2.1567
[2019-05-26 13:26:52,914]
3/10 * Epoch 3 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3602 | _timers/batch_time=0.3504 | _timers/data_time=0.3443 | _timers/model_time=0.0061 | iou=0.5408 | loss=0.0550
3/10 * Epoch 3 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=48.0609 | _timers/batch_time=0.0321 | _timers/data_time=0.0276 | _timers/model_time=0.0044 | iou=0.9644 | loss=-1.6202
[2019-05-26 13:27:24,687]
4/10 * Epoch 4 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.0270 | _timers/batch_time=0.3544 | _timers/data_time=0.3484 | _timers/model_time=0.0059 | iou=0.7157 | loss=-0.8354
4/10 * Epoch 4 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.5536 | _timers/batch_time=0.0348 | _timers/data_time=0.0301 | _timers/model_time=0.0046 | iou=0.9381 | loss=-1.9926
[2019-05-26 13:27:57,148]
5/10 * Epoch 5 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=11.9987 | _timers/batch_time=0.3576 | _timers/data_time=0.3516 | _timers/model_time=0.0059 | iou=0.6653 | loss=-0.7207
5/10 * Epoch 5 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=47.2183 | _timers/batch_time=0.0330 | _timers/data_time=0.0285 | _timers/model_time=0.0044 | iou=1.0183 | loss=-2.6983
[2019-05-26 13:28:28,836]
6/10 * Epoch 6 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.2940 | _timers/batch_time=0.3451 | _timers/data_time=0.3389 | _timers/model_time=0.0061 | iou=0.7951 | loss=-1.6150
6/10 * Epoch 6 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.1915 | _timers/batch_time=0.0339 | _timers/data_time=0.0290 | _timers/model_time=0.0048 | iou=1.3360 | loss=-5.5529
[2019-05-26 13:28:59,628]
7/10 * Epoch 7 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3576 | _timers/batch_time=0.3424 | _timers/data_time=0.3360 | _timers/model_time=0.0063 | iou=0.7155 | loss=-1.0354
7/10 * Epoch 7 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.9911 | _timers/batch_time=0.0338 | _timers/data_time=0.0288 | _timers/model_time=0.0049 | iou=1.0820 | loss=-1.8514
[2019-05-26 13:29:30,871]
8/10 * Epoch 8 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5222 | _timers/batch_time=0.3394 | _timers/data_time=0.3337 | _timers/model_time=0.0057 | iou=0.6650 | loss=-0.8128
8/10 * Epoch 8 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=49.4219 | _timers/batch_time=0.0318 | _timers/data_time=0.0272 | _timers/model_time=0.0046 | iou=1.2081 | loss=-6.1373
[2019-05-26 13:30:01,903]
9/10 * Epoch 9 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5503 | _timers/batch_time=0.3436 | _timers/data_time=0.3380 | _timers/model_time=0.0056 | iou=0.7887 | loss=-1.5298
9/10 * Epoch 9 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.4443 | _timers/batch_time=0.0339 | _timers/data_time=0.0291 | _timers/model_time=0.0047 | iou=1.2653 | loss=-3.6054
As you can see, some of them are higher than 1. Is it a sum of IoU, or am I doing something wrong?
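For reference, a correctly computed binary IoU is always in [0, 1]; values above 1 (and the negative loss values) suggest the callback is applying the ratio to raw logits rather than thresholded probabilities, or summing per-class scores. That is an assumption; the sketch below only shows the bounded reference computation:

```python
import numpy as np

def iou(pred, target, threshold=0.5, eps=1e-7):
    # Binary intersection-over-union on thresholded masks.
    # Because intersection <= union, the result is in [0, 1].
    pred = (pred > threshold).astype(np.float32)
    target = (target > threshold).astype(np.float32)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return float(intersection / (union + eps))
```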
Hi guys!
I tried to run the notebook example again (code: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9, but with verbose set to False) on an Ubuntu machine.
The script gives me the following output:
Python version 3.6.3 (default, Oct 3 2017, 21:45:48)
[GCC 7.2.0]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
[2019-01-17 11:39:03,741] 0 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00520 | base/sample_per_second: 6180.13137 | precision01: 41.76663 | precision03: 75.73776 | precision05: 89.51935 | lr: 0.00100 | momentum: 0.90000 | loss: 1.58647
[2019-01-17 11:39:03,741] 0 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00503 | base/sample_per_second: 6365.33178 | precision01: 50.07987 | precision03: 81.99880 | precision05: 93.29073 | lr: 0.00100 | momentum: 0.90000 | loss: 1.37917
[2019-01-17 11:39:03,741]
[2019-01-17 11:39:16,967] 1 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00519 | base/sample_per_second: 6168.41000 | precision01: 53.47689 | precision03: 84.46497 | precision05: 94.58573 | lr: 0.00100 | momentum: 0.90000 | loss: 1.28700
[2019-01-17 11:39:16,967] 1 * Epoch (valid) metrics: base/data_time: 0.00473 | base/batch_time: 0.00513 | base/sample_per_second: 6246.95064 | precision01: 55.16174 | precision03: 85.15375 | precision05: 94.75839 | lr: 0.00100 | momentum: 0.90000 | loss: 1.24638
[2019-01-17 11:39:16,967]
[2019-01-17 11:39:30,171] 2 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6184.96553 | precision01: 58.61124 | precision03: 86.94818 | precision05: 95.64539 | lr: 0.00100 | momentum: 0.90000 | loss: 1.15438
[2019-01-17 11:39:30,171] 2 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6210.20164 | precision01: 58.80591 | precision03: 87.11062 | precision05: 95.48722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.14073
[2019-01-17 11:39:30,171]
Epoch 3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171
[2019-01-17 11:39:43,270] 3 * Epoch (valid) metrics: base/data_time: 0.00464 | base/batch_time: 0.00503 | base/sample_per_second: 6367.18977 | precision01: 61.06230 | precision03: 87.85942 | precision05: 95.62700 | lr: 0.00100 | momentum: 0.90000 | loss: 1.10485
[2019-01-17 11:39:43,270]
[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941
[2019-01-17 11:39:56,491] 4 * Epoch (valid) metrics: base/data_time: 0.00474 | base/batch_time: 0.00513 | base/sample_per_second: 6240.70324 | precision01: 63.70807 | precision03: 89.10743 | precision05: 96.40575 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03544
[2019-01-17 11:39:56,491]
[2019-01-17 11:40:09,791] 5 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00523 | base/sample_per_second: 6126.94727 | precision01: 67.57038 | precision03: 91.11484 | precision05: 97.31086 | lr: 0.00050 | momentum: 0.90000 | loss: 0.91745
[2019-01-17 11:40:09,792] 5 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6216.97721 | precision01: 63.73802 | precision03: 89.18730 | precision05: 96.29593 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03911
[2019-01-17 11:40:09,792]
Epoch 6: reducing learning rate of group 0 to 2.5000e-04.
[2019-01-17 11:40:23,136] 6 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00522 | base/sample_per_second: 6135.11402 | precision01: 68.87796 | precision03: 91.71065 | precision05: 97.54479 | lr: 0.00050 | momentum: 0.90000 | loss: 0.88481
[2019-01-17 11:40:23,136] 6 * Epoch (valid) metrics: base/data_time: 0.00483 | base/batch_time: 0.00523 | base/sample_per_second: 6131.59957 | precision01: 63.83786 | precision03: 89.16733 | precision05: 96.18610 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03159
[2019-01-17 11:40:23,136]
[2019-01-17 11:40:36,215] 7 * Epoch (train) metrics: base/data_time: 0.00471 | base/batch_time: 0.00512 | base/sample_per_second: 6255.08350 | precision01: 70.90931 | precision03: 92.46041 | precision05: 97.84269 | lr: 0.00025 | momentum: 0.90000 | loss: 0.82662
[2019-01-17 11:40:36,215] 7 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00504 | base/sample_per_second: 6357.74667 | precision01: 64.66653 | precision03: 89.55671 | precision05: 96.43570 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01120
[2019-01-17 11:40:36,216]
[2019-01-17 11:40:49,560] 8 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6183.07309 | precision01: 71.50512 | precision03: 92.77031 | precision05: 97.91667 | lr: 0.00025 | momentum: 0.90000 | loss: 0.80761
[2019-01-17 11:40:49,560] 8 * Epoch (valid) metrics: base/data_time: 0.00497 | base/batch_time: 0.00540 | base/sample_per_second: 5983.79479 | precision01: 65.85463 | precision03: 89.98602 | precision05: 96.55551 | lr: 0.00025 | momentum: 0.90000 | loss: 0.99543
[2019-01-17 11:40:49,560]
Epoch 9: reducing learning rate of group 0 to 1.2500e-04.
[2019-01-17 11:41:03,294] 9 * Epoch (train) metrics: base/data_time: 0.00495 | base/batch_time: 0.00540 | base/sample_per_second: 5959.02559 | precision01: 72.08093 | precision03: 93.00224 | precision05: 98.04063 | lr: 0.00025 | momentum: 0.90000 | loss: 0.79061
[2019-01-17 11:41:03,294] 9 * Epoch (valid) metrics: base/data_time: 0.00479 | base/batch_time: 0.00519 | base/sample_per_second: 6174.30215 | precision01: 64.95607 | precision03: 89.87620 | precision05: 96.50559 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01265
[2019-01-17 11:41:03,294]
Top best models:
./logs/cifar_simple_notebook/checkpoint.None.8.pth.tar 0.9954
./logs/cifar_simple_notebook/checkpoint.None.7.pth.tar 1.0112
./logs/cifar_simple_notebook/checkpoint.None.9.pth.tar 1.0127
./logs/cifar_simple_notebook/checkpoint.None.6.pth.tar 1.0316
./logs/cifar_simple_notebook/checkpoint.None.4.pth.tar 1.0354
Everything looks OK except for one thing:
Epoch 3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171
The lr should be decreased at epoch 3, but the epoch 3 summary shows me the old lr.
The summary of the next epoch is OK:
[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941
The same happens at the other epochs.
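A likely explanation (an assumption about catalyst's call order, not confirmed from the source): the epoch summary reads the optimizer's lr before `scheduler.step()` runs for that epoch, so the logged value is one epoch stale. A minimal sketch of the effect in plain PyTorch:

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.5)

logged = []
for epoch in range(5):
    # Reading the lr BEFORE the scheduler steps for this epoch
    # records the value from the previous epoch's decision --
    # the same off-by-one visible in the log above.
    logged.append(opt.param_groups[0]["lr"])
    opt.step()
    sched.step()
```

Moving the lr read after `sched.step()` (or logging inside the scheduler callback) would report the value actually used going forward.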
Hey, great framework!
I can't find in the docs, and don't see from the API, how to easily change the checkpointer behavior: which metric it should monitor for checkpoints, max/min, etc.
And I'm not sure about checkpoints at all: where should I find the saved models, what is the format, are they enabled by default, and what is the current logic for saving them? Can I customize all of that?
They are all free for Open Source projects like this one.
https://github.com/marketplace/category/continuous-integration
'async' is a reserved word in Python 3.7; PyTorch has shifted to 'non_blocking' instead. pytorch/pytorch#4999
flake8 testing of https://github.com/Scitator/prometheus on Python 3.7.0
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./losses/unet.py:8:23: E999 SyntaxError: invalid syntax
return x.cuda(async=True) if torch.cuda.is_available() else x
^
1 E999 SyntaxError: invalid syntax
1
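The flagged line can be fixed by the keyword rename alone. A sketch of the corrected helper, assuming the same fallback-to-CPU behavior as the original `losses/unet.py` line:

```python
import torch

def to_cuda(x):
    # 'async' became a reserved word in Python 3.7; PyTorch renamed
    # the argument to 'non_blocking' (pytorch/pytorch#4999).
    return x.cuda(non_blocking=True) if torch.cuda.is_available() else x
```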
It looks like UtilsFactory is missing a method.
The error is in this line: https://github.com/Scitator/catalyst/blob/master/utils/factory.py#L50
Hi,
I ran my experiment, and I don't know why the best checkpoints are not saved. I tried pdb and found a bug.
valid map05 = 0 for all early epochs; then valid map05 starts increasing. Following are the console logs and a checkpoints folder screenshot.
From epoch 0->10:
....
8/100 * Epoch (train): _fps=758.1664 | base/batch_time=0.0454 | base/data_time=0.0043 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=28.4960 | map05=0.0000
8/100 * Epoch (valid): _fps=753.5475 | base/batch_time=0.0635 | base/data_time=0.0245 | base/lr=0.0001 | base/model_time=0.0390 | base/momentum=0.9000 | loss=30.4348 | map05=0.0000
9/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.16it/s, _fps=733.518, loss=26.232, map05=0.000]
9/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.60it/s, _fps=787.242, loss=28.358, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
9
(Pdb) c
[2019-03-21 10:20:30,871]
9/100 * Epoch (train): _fps=765.4649 | base/batch_time=0.0453 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0408 | base/momentum=0.9000 | loss=27.7296 | map05=0.0000
9/100 * Epoch (valid): _fps=752.4676 | base/batch_time=0.0630 | base/data_time=0.0239 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.9424 | map05=0.0000
10/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=779.565, loss=24.555, map05=0.000]
10/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.95it/s, _fps=822.962, loss=27.833, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:20:46,505]
10/100 * Epoch (train): _fps=758.8756 | base/batch_time=0.0458 | base/data_time=0.0047 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=26.9246 | map05=0.0004
10/100 * Epoch (valid): _fps=754.0560 | base/batch_time=0.0616 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.3501 | map05=0.0000
11/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.27it/s, _fps=785.593, loss=26.380, map05=0.000]
11/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 14.73it/s, _fps=774.013, loss=27.053, map05=0.000]
...
From epoch 11:
11/100 * Epoch (train): _fps=761.4851 | base/batch_time=0.0455 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=26.1523 | map05=0.0014
11/100 * Epoch (valid): _fps=751.8519 | base/batch_time=0.0668 | base/data_time=0.0277 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=28.8740 | map05=0.0000
12/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.17it/s, _fps=765.323, loss=26.579, map05=0.000]
12/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.92it/s, _fps=824.337, loss=26.422, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
0.0011160714285714285
(Pdb) c
[2019-03-21 10:21:46,089]
12/100 * Epoch (train): _fps=758.7695 | base/batch_time=0.0456 | base/data_time=0.0046 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=25.5722 | map05=0.0015
12/100 * Epoch (valid): _fps=762.2009 | base/batch_time=0.0616 | base/data_time=0.0229 | base/lr=0.0001 | base/model_time=0.0387 | base/momentum=0.9000 | loss=28.3761 | map05=0.0011
13/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.99it/s, _fps=783.159, loss=23.634, map05=0.000]
13/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.25it/s, _fps=807.500, loss=26.013, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:21:55,990]
13/100 * Epoch (train): _fps=760.6421 | base/batch_time=0.0461 | base/data_time=0.0051 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.9090 | map05=0.0021
13/100 * Epoch (valid): _fps=758.2683 | base/batch_time=0.0604 | base/data_time=0.0215 | base/lr=0.0001 | base/model_time=0.0389 | base/momentum=0.9000 | loss=27.9700 | map05=0.0011
14/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.07it/s, _fps=770.857, loss=26.029, map05=0.000]
14/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.66it/s, _fps=811.955, loss=25.584, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:08,043]
14/100 * Epoch (train): _fps=762.4881 | base/batch_time=0.0452 | base/data_time=0.0042 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.1805 | map05=0.0037
14/100 * Epoch (valid): _fps=752.5557 | base/batch_time=0.0627 | base/data_time=0.0236 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=27.5076 | map05=0.0022
15/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=760.488, loss=22.381, map05=0.000]
15/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.09it/s, _fps=787.039, loss=25.456, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:20,274]
15/100 * Epoch (train): _fps=754.6090 | base/batch_time=0.0455 | base/data_time=0.0044 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=23.6162 | map05=0.0060
15/100 * Epoch (valid): _fps=739.9028 | base/batch_time=0.0610 | base/data_time=0.0212 | base/lr=0.0001 | base/model_time=0.0398 | base/momentum=0.9000 | loss=27.1939 | map05=0.0033
16/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.69it/s, _fps=778.896, loss=24.231, map05=0.000]
16/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.89it/s, _fps=807.296, loss=24.941, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:45,175]
16/100 * Epoch (train): _fps=756.6095 | base/batch_time=0.0466 | base/data_time=0.0054 | base/lr=0.0001 | base/model_time=0.0412 | base/momentum=0.9000 | loss=23.0421 | map05=0.0070
16/100 * Epoch (valid): _fps=749.7147 | base/batch_time=0.0618 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0393 | base/momentum=0.9000 | loss=26.7421 | map05=0.0045
17/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.92it/s, _fps=771.987, loss=23.759, map05=0.000]
17/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.72it/s, _fps=805.377, loss=24.638, map05=0.062]
Checkpoints folder: (screenshot)
I found the problem is here:
epoch 0->11: checkpoint_metric = 0
checkpoint_metric = checkpoint_metric or epoch
=> checkpoint_metric = epoch
checkpoint_metric is now a number greater than 1.
epoch > 11: checkpoint_metric = 0.xxxx
checkpoint_metric = checkpoint_metric or epoch
=> checkpoint_metric = 0.xxxx
checkpoint_metric is now a float number less than 1.
=> The next best checkpoints will be removed after sorting.
Following is the log of self.top_best_metrics
[('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.11.pth', 11), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.10.pth', 10), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.9.pth', 9), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.8.pth', 8), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.7.pth', 7), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.18.pth', 0.006324404761904762)]
=> The last checkpoint is removed after
last_item = self.top_best_metrics.pop(-1)
last_filepath = last_item[0]
os.remove(last_filepath)
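The root cause is that `or` treats a metric value of 0 (or 0.0) as falsy and silently substitutes the epoch number, so epoch numbers and metric values get mixed in `top_best_metrics`. A sketch of a fix using an explicit None check, with a hypothetical helper name:

```python
def resolve_checkpoint_metric(checkpoint_metric, epoch):
    # `checkpoint_metric or epoch` falls back to the epoch number
    # whenever the metric equals 0, which is a perfectly valid
    # map05 value early in training. Only substitute the epoch
    # when the metric is genuinely missing.
    return epoch if checkpoint_metric is None else checkpoint_metric
```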
In case of the default value for the LRFinder parameter n_steps=None, training terminates with:
File "/home/ivb/Repos/contrib/catalyst/catalyst/dl/callbacks/schedulers.py", line 188, in on_loader_start
self.n_steps = self.n_steps or len(state.loader)
AttributeError: 'RunnerState' object has no attribute 'loader'
To reproduce, just replace the scheduler section in examples/cifar_simple/config.yml with:
scheduler:
  callback: LRFinder
  final_lr: 10
Also, the doc for LRFinder contains an unused parameter init_lr.
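A defensive sketch of the fallback logic, assuming the callback should tolerate `state.loader` not yet being set (hypothetical helper, not catalyst's actual code):

```python
def resolve_n_steps(n_steps, state):
    # `self.n_steps or len(state.loader)` crashes with AttributeError
    # when the runner has not attached a loader to the state yet.
    # Check the attribute first and fail with a clear message.
    loader = getattr(state, "loader", None)
    if n_steps is None:
        if loader is None:
            raise ValueError(
                "n_steps is None and state.loader is not set; "
                "pass n_steps explicitly"
            )
        n_steps = len(loader)
    return n_steps
```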
Seems like numpy needs an update to >=1.16.0, but in requirements https://github.com/catalyst-team/catalyst/blob/master/requirements.txt#L2 I see 1.14.6.
The problem was on version 1.15.3, and the update resolves the problem (check the PR).
I just want to inform you that I am encountering a memory leak issue.
I am not 100% sure that it comes from catalyst, but I am sure that when I use my own code, written without catalyst, there is no leak. I tried to reproduce this issue with your examples, but they run very fast and the leak is not visible.
Following is my leak description:
Image classification task with ~17M images.
The memory leak occurs every epoch, and it happens again at the next epoch.
EX:
At the beginning of epoch 0, my memory is: (screenshot)
At the end of epoch 0, my memory is: (screenshot)
When epoch 0 ends, the memory (RAM + swap) is released (RAM: 13G/16G, swap: 0/50G). The same phenomenon occurs at the next epoch 1.
I am debugging to see where the issue comes from. Could you please double-check with your bigger data?
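For narrowing down leaks like this, the stdlib `tracemalloc` module can diff allocations between two points in the loop; a sketch (the real epoch code goes where the comments are):

```python
import tracemalloc

tracemalloc.start()
# ... run epoch N ...
before = tracemalloc.take_snapshot()
# ... run epoch N + 1 ...
after = tracemalloc.take_snapshot()

# Report the file:line locations whose allocations grew the most
# between the two snapshots -- a growing entry that reappears every
# epoch is a leak candidate.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```

Note that `tracemalloc` only sees Python-level allocations; growth inside native tensor storage would need a different tool.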
Proposal by @BloodAxe: use task-specific callbacks, with internal metrics definitions like:
SemanticSegmentationMetricsCallback(
    need_confusion_matrix=True,
    need_mAP=True,
    need_IoU=False)
The metrics in the function precision are not actually precision, but rather accuracy. The same holds for average_precision and mean_average_precision. More details here.
Suggestion: rename them to accuracy, average_accuracy, and mean_average_accuracy, and add real precision and recall metrics.

As the typical use case for LRFinder is to just set some large value for final_lr (say, 10), it would be convenient to stop iterating in case of divergence, and probably to add a default value for final_lr.
If this sounds good, I'll contribute.
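On the precision-vs-accuracy naming point above: what a top-k "precisionNN" metric computed over all samples actually measures is top-k accuracy. A minimal sketch of that computation:

```python
import numpy as np

def topk_accuracy(logits, targets, k=1):
    # Fraction of samples whose true class appears among the k
    # highest-scoring predictions. This is accuracy, not precision:
    # precision would condition on the predicted class, not average
    # hits over all samples.
    topk = np.argsort(logits, axis=1)[:, -k:]
    hits = [t in row for t, row in zip(targets, topk)]
    return float(np.mean(hits))
```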
Global pooling layers are already easy to implement using adaptive pooling in pytorch: https://pytorch.org/docs/stable/nn.html#pooling-layers, so they can be removed from e.g.
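A minimal example of expressing global average pooling with the built-in adaptive pooling layer (a sketch, not catalyst code):

```python
import torch
import torch.nn as nn

# AdaptiveAvgPool2d(1) reduces any spatial resolution to 1x1,
# which is exactly global average pooling; Flatten drops the
# singleton spatial dimensions.
global_avg_pool = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(8, 64, 13, 17)
y = global_avg_pool(x)
```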
import json; data = json.load(open(f'{logdir}/metrics.json')); print(data[-1])

Use Ax with Catalyst.Experiments for automatic hyperparameters search.
For better experiment reproducibility: run pip/conda freeze at the beginning of catalyst-dl run; sometimes package versions matter.
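A sketch of how the run command could snapshot the environment into the log directory (hypothetical helper and file name, not catalyst's actual layout):

```python
import subprocess
import sys
from pathlib import Path

def dump_environment(logdir):
    # Record the exact installed package versions at the start of a
    # run, so the experiment can be reproduced later even after the
    # environment changes.
    logdir = Path(logdir)
    logdir.mkdir(parents=True, exist_ok=True)
    freeze = subprocess.check_output(
        [sys.executable, "-m", "pip", "freeze"], text=True
    )
    (logdir / "pip-freeze.txt").write_text(freeze)
```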
Does it support multiple learning rates and TTA?
__init__.py and model.py, for example.

Hi.
I've got some weird unhandled exception during a simple train run:
File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 127, in <module>
# metrics=["loss", "precision01", "precision03", "base/lr"])
File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 117, in main
],
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 271, in train
self.run_experiment(experiment, check=check)
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 194, in run_experiment
self._run_stage(stage)
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 175, in _run_stage
self._run_epoch(loaders)
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 160, in _run_epoch
self._run_loader(loaders[loader_name])
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 132, in _run_loader
self._run_event("batch_end")
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 97, in _run_event
getattr(self.state, f"on_{event}_post")()
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 153, in on_batch_end_post
self._handle_runner_metrics()
File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 114, in _handle_runner_metrics
self.batch_size / self.timer.elapsed["base/batch_time"]
ZeroDivisionError: float division by zero
Before it happened, training logs looked totally fine:
C:\Anaconda3\envs\kaggle\python.exe C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py
0/100 * Epoch (train): 100% 87/87 [01:37<00:00, 2.28it/s, _fps=16383.000, jaccard=0.000, loss=0.068]
0/100 * Epoch (valid): 100% 10/10 [00:39<00:00, 4.53s/it, _fps=240.926, jaccard=0.000, loss=0.085]
[2019-04-03 14:35:00,371]
0/100 * Epoch (train): _fps=13657.5198 | base/batch_time=0.7826 | base/data_time=0.7441 | base/lr=0.0010 | base/model_time=0.0385 | base/momentum=0.9000 | jaccard=0.0074 | loss=0.0802
0/100 * Epoch (valid): _fps=8236.0160 | base/batch_time=3.8731 | base/data_time=3.7668 | base/lr=0.0010 | base/model_time=0.1063 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0837
1/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.72it/s, _fps=16395.258, jaccard=0.001, loss=0.061]
1/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 3.78s/it, _fps=16383.000, jaccard=0.004, loss=0.077]
[2019-04-03 14:37:15,192]
1/100 * Epoch (train): _fps=13985.1989 | base/batch_time=0.8083 | base/data_time=0.8061 | base/lr=0.0010 | base/model_time=0.0022 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0642
1/100 * Epoch (valid): _fps=6082.4207 | base/batch_time=3.6415 | base/data_time=3.6382 | base/lr=0.0010 | base/model_time=0.0032 | base/momentum=0.9000 | jaccard=0.0033 | loss=0.0761
2/100 * Epoch (train): 100% 87/87 [01:34<00:00, 2.51it/s, _fps=16382.500, jaccard=0.186, loss=0.049]
2/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 6.22s/it, _fps=16382.000, jaccard=0.188, loss=0.064]
[2019-04-03 14:39:29,013]
2/100 * Epoch (train): _fps=13115.4018 | base/batch_time=0.7891 | base/data_time=0.7886 | base/lr=0.0010 | base/model_time=0.0005 | base/momentum=0.9000 | jaccard=0.0762 | loss=0.0538
2/100 * Epoch (valid): _fps=12309.1793 | base/batch_time=3.6892 | base/data_time=3.6892 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.1834 | loss=0.0630
3/100 * Epoch (train): 100% 87/87 [01:35<00:00, 2.46it/s, _fps=16387.001, jaccard=0.267, loss=0.043]
3/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 3.76s/it, _fps=16382.750, jaccard=0.273, loss=0.056]
[2019-04-03 14:41:42,811]
3/100 * Epoch (train): _fps=13093.6248 | base/batch_time=0.7970 | base/data_time=0.7961 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.2246 | loss=0.0447
3/100 * Epoch (valid): _fps=13118.9414 | base/batch_time=3.6233 | base/data_time=3.6233 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.2659 | loss=0.0549
4/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.62it/s, _fps=16249.120, jaccard=0.354, loss=0.038]
4/100 * Epoch (valid): 100% 10/10 [00:39<00:00, 3.93s/it, _fps=16383.000, jaccard=0.356, loss=0.049]
[2019-04-03 14:43:58,902]
4/100 * Epoch (train): _fps=13008.1172 | base/batch_time=0.8044 | base/data_time=0.8031 | base/lr=0.0010 | base/model_time=0.0014 | base/momentum=0.9000 | jaccard=0.3206 | loss=0.0382
4/100 * Epoch (valid): _fps=13161.2595 | base/batch_time=3.7907 | base/data_time=3.7907 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.3512 | loss=0.0484
5/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.57it/s, _fps=16387.001, jaccard=0.437, loss=0.032]
5/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 3.78s/it, _fps=16384.000, jaccard=0.431, loss=0.044]
[2019-04-03 14:46:13,698]
5/100 * Epoch (train): _fps=14208.6617 | base/batch_time=0.8076 | base/data_time=0.8067 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.4012 | loss=0.0333
5/100 * Epoch (valid): _fps=11602.1947 | base/batch_time=3.6362 | base/data_time=3.6362 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.4264 | loss=0.0436
6/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.82it/s, _fps=16382.750, jaccard=0.477, loss=0.030]
6/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 4.33s/it, _fps=19348.791, jaccard=0.481, loss=0.041]
[2019-04-03 14:48:28,650]
6/100 * Epoch (train): _fps=14899.4017 | base/batch_time=0.8033 | base/data_time=0.8029 | base/lr=0.0010 | base/model_time=0.0004 | base/momentum=0.9000 | jaccard=0.4614 | loss=0.0299
6/100 * Epoch (valid): _fps=10905.9308 | base/batch_time=3.6788 | base/data_time=3.6784 | base/lr=0.0010 | base/model_time=0.0003 | base/momentum=0.9000 | jaccard=0.4788 | loss=0.0399
7/100 * Epoch (train): 100% 87/87 [01:37<00:00, 2.58it/s, _fps=16387.001, jaccard=0.509, loss=0.030]
7/100 * Epoch (valid): 100% 10/10 [00:38<00:00, 4.41s/it, _fps=16382.500, jaccard=0.506, loss=0.039]
[2019-04-03 14:50:45,999]
7/100 * Epoch (train): _fps=14034.5155 | base/batch_time=0.8213 | base/data_time=0.8204 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.5000 | loss=0.0278
7/100 * Epoch (valid): _fps=10034.7244 | base/batch_time=3.7788 | base/data_time=3.7787 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5043 | loss=0.0382
8/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.66it/s, _fps=16351.813, jaccard=0.533, loss=0.028]
8/100 * Epoch (valid): 100% 10/10 [00:40<00:00, 4.00s/it, _fps=16210.115, jaccard=0.527, loss=0.038]
[2019-04-03 14:53:03,136]
8/100 * Epoch (train): _fps=13962.9568 | base/batch_time=0.8072 | base/data_time=0.8066 | base/lr=0.0010 | base/model_time=0.0006 | base/momentum=0.9000 | jaccard=0.5228 | loss=0.0264
8/100 * Epoch (valid): _fps=11596.8021 | base/batch_time=3.8571 | base/data_time=3.8570 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5253 | loss=0.0369
9/100 * Epoch (train): 100% 87/87 [01:36<00:00, 2.50it/s, _fps=16382.750, jaccard=0.535, loss=0.025]
9/100 * Epoch (valid): 100% 10/10 [00:37<00:00, 3.08s/it, _fps=16382.750, jaccard=0.541, loss=0.037]
[2019-04-03 14:55:18,959]
9/100 * Epoch (train): _fps=13944.8145 | base/batch_time=0.8145 | base/data_time=0.8135 | base/lr=0.0010 | base/model_time=0.0008 | base/momentum=0.9000 | jaccard=0.5375 | loss=0.0256
9/100 * Epoch (valid): _fps=9995.8059 | base/batch_time=3.6704 | base/data_time=3.6704 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.5390 | loss=0.0359
10/100 * Epoch (train): 100% 87/87 [01:35<00:00, 2.55it/s, _fps=16382.750, jaccard=0.560, loss=0.024]
10/100 * Epoch (valid): 80% 8/10 [00:37<00:12, 6.05s/it, _fps=16383.000, jaccard=0.552, loss=0.036]
For reference, train script: https://github.com/BloodAxe/pytorch-toolbelt/blob/feature/example-canny-cnn/examples/canny-edge-detector-in-cnn/example_canny_cnn.py
Environment:
OS: Windows 10
Python: 3.6
Catalyst: 19.3
PyTorch: 1.0.1
This exception is raised after the first epoch.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-31-eecc50de0067> in <module>
3 callbacks=callbacks,
4 logdir=logdir,
----> 5 epochs=n_epochs, verbose=True)
~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in train(self, loaders, callbacks, state_params, epochs, start_epoch, verbose, logdir)
214 mode="train",
215 verbose=verbose,
--> 216 logdir=logdir,
217 )
218
~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run(self, loaders, callbacks, state_params, epochs, start_epoch, mode, verbose, logdir)
180 self.run_event(callbacks=callbacks, event="on_loader_end")
181
--> 182 self.run_event(callbacks=callbacks, event="on_epoch_end")
183
184 self.run_event(callbacks=callbacks, event=f"on_{mode}_end")
~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run_event(self, callbacks, event)
93 :param event:
94 """
---> 95 getattr(self.state, f"{event}_pre")(state=self.state)
96 for callback in callbacks.values():
97 getattr(callback, event)(state=self.state)
~/miniconda3/lib/python3.6/site-packages/catalyst/dl/state.py in on_epoch_end_pre(state)
146 valid_loader=state.valid_loader,
147 main_metric=state.main_metric,
--> 148 minimize=state.minimize_metric)
149 valid_metrics = {
150 key: value
~/miniconda3/lib/python3.6/site-packages/catalyst/dl/callbacks/utils.py in process_epoch_metrics(epoch_metrics, best_metrics, valid_loader, main_metric, minimize)
27 if best_metrics is None \
28 else (minimize != (
---> 29 valid_metrics[main_metric] > best_metrics[main_metric]))
30 best_metrics = valid_metrics if is_best else best_metrics
31 return best_metrics, valid_metrics, is_best
KeyError: 'loss'
I'm using these callbacks:
callbacks["loss"] = LossCallback()
callbacks["optimizer"] = OptimizerCallback()
callbacks["precision"] = PrecisionCallback(precision_args=[1])
callbacks["scheduler"] = SchedulerCallback(reduce_metric="precision01")
callbacks["saver"] = CheckpointCallback()
callbacks["logger"] = Logger()
callbacks["tflogger"] = TensorboardLogger()
Data and processing are taken from here: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
Now:
P.S. pipenv repo - https://github.com/pypa/pipenv
Currently RunnerState owns all metric management. Move it to MetricState.
It would be nice if I could choose more than one input dir here: https://github.com/catalyst-team/catalyst/blob/master/catalyst/contrib/scripts/tag2label.py
For example, it could look like:
catalyst-contrib tag2label --in-dir=dataset1,dataset2
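A minimal sketch of how the comma-separated form could be parsed; the helper name and flag handling are assumptions for illustration, not the actual tag2label code:

```python
# Hypothetical parsing of a comma-separated --in-dir flag, so that
# "dataset1,dataset2" yields two dataset roots.
import argparse


def parse_in_dirs(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-dir", type=str, required=True)
    args = parser.parse_args(argv)
    # split on commas and drop empty entries
    return [d for d in args.in_dir.split(",") if d]
```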
Idea: LR scheduling that goes from init_lr to max_lr, then from max_lr to min_lr, stepped per batch or per epoch; the same thing with momentum.
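The idea above can be sketched as a simple piecewise schedule; the linear ramps and the warmup fraction are assumptions, and the same shape (reversed) could drive momentum:

```python
# One-cycle-style schedule sketch: warm up init_lr -> max_lr,
# then anneal max_lr -> min_lr. "step" can count batches or epochs.
def one_cycle_lr(step, total_steps, init_lr, max_lr, min_lr, warmup_frac=0.3):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warmup phase
        t = step / max(warmup_steps, 1)
        return init_lr + t * (max_lr - init_lr)
    # linear annealing phase
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max_lr + t * (min_lr - max_lr)
```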
Passing frozen=True to catalyst.contrib.models.ResnetEncoder does not actually make the encoder non-trainable. Simple code to check that a tensor value changes during training:
import torch
from catalyst.contrib.models import ResnetEncoder

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = ResnetEncoder(
            arch="resnet18",
            pooling="GlobalAvgPool2d",
            frozen=True,
        )
        self.logits = torch.nn.Linear(self.enc.out_features, 1)

    def forward(self, x):
        return self.logits(self.enc(x))

model = Net()
old_value = model.enc.feature_net[0].weight[0, 0, 0, 0].item()

inputs = torch.randn(8, 3, 224, 224)
targets = torch.ones(8, 1)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.BCEWithLogitsLoss()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

new_value = model.enc.feature_net[0].weight[0, 0, 0, 0].item()
print(old_value, new_value)
It prints different values. This is because nn.Module has no 'requires_grad' attribute; only nn.Parameter does. I fixed it in my fork, but I don't see a 'dev' branch to open a PR against.
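For reference, a minimal sketch of a freeze that does work, assuming the fix is to toggle requires_grad on each parameter rather than on the module (a plain Linear stands in for the encoder here):

```python
import torch


def freeze(module: torch.nn.Module) -> torch.nn.Module:
    # nn.Module has no requires_grad attribute, so iterate over the
    # parameters and disable gradients on each one individually
    for param in module.parameters():
        param.requires_grad = False
    return module


encoder = freeze(torch.nn.Linear(4, 2))  # placeholder for ResnetEncoder
```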
For example: a pointer-generator model combined with an Actor-Critic model.
Add automated model tracing & packing.
Thanks for your awesome library.
I am wondering whether this library supports training with K-fold cross-validation?
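As far as I can tell this is not built in, but K-fold can be driven from outside the library: split the indices once, then launch one training run per fold. The loader and runner names in the comment are assumptions based on the examples:

```python
# Sketch of K-fold index splitting with sklearn; each fold would get its
# own loaders and its own logdir.
import numpy as np
from sklearn.model_selection import KFold


def fold_indices(n_samples, n_splits=5, seed=42):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(kf.split(np.arange(n_samples)))


# for fold, (train_idx, valid_idx) in enumerate(fold_indices(len(dataset))):
#     build loaders from Subset(dataset, train_idx) / Subset(dataset, valid_idx)
#     and call runner.train(..., logdir=f"logs/fold_{fold}")
```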
A few issues to address: https://travis-ci.com/Scitator/prometheus/jobs/149428974#L465
Use torch.utils.tensorboard instead of tensorboardX.
I have several png images with corresponding masks:
Getting data:
import os
import numpy as np
from skimage.io import imread
from skimage.transform import resize

data_dir = './data_objects'

def load_data(root_dir):
    data = []
    for stage in ['train']:
        for content in ['images', 'segmentation']:
            # construct the path to each image
            directory = os.path.join(root_dir, content)
            fps = [os.path.join(directory, filename) for filename in os.listdir(directory)]
            # read the images
            images = [imread(filepath) for filepath in fps]
            # if the images have different sizes, resize them first
            resized_images = [resize(image, (64, 64)) for image in images]
            # stack into one np.array
            np_images = np.stack(resized_images, axis=0)
            data.append(np_images)
    return data

x_train, y_train = load_data(data_dir)
y_train = y_train.reshape(19, 64, 64, 1)
Making the data look like in the example notebook:
from sklearn.model_selection import train_test_split

x_train, X_test, y_train, y_test = train_test_split(
    x_train, y_train, test_size=0.33, random_state=42)
train_data = list(zip(x_train, y_train))
valid_data = list(zip(X_test, y_test))
train_data[0][0].shape, train_data[0][1].shape, len(train_data)
(64, 64, 3), (64, 64, 1), 12
Calling train raises an error:
0/10 * Epoch (train): 0% 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-69a57e657d9d> in <module>()
26 logdir=logdir,
27 num_epochs=num_epochs,
---> 28 verbose=True
29 )
~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in train(self, model, criterion, optimizer, loaders, logdir, callbacks, scheduler, num_epochs, valid_loader, main_metric, minimize_metric, verbose, state_kwargs, check)
269 state_kwargs=state_kwargs
270 )
--> 271 self.run_experiment(experiment, check=check)
272
273 def infer(
~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in run_experiment(self, experiment, check)
192 self.experiment = experiment
193 for stage in self.experiment.stages:
--> 194 self._run_stage(stage)
195 return self
196
~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_stage(self, stage)
173
174 self._run_event("epoch_start")
--> 175 self._run_epoch(loaders)
176 self._run_event("epoch_end")
177
~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_epoch(self, loaders)
158 self._run_event("loader_start")
159 with torch.set_grad_enabled(self.state.need_backward):
--> 160 self._run_loader(loaders[loader_name])
161 self._run_event("loader_end")
162
~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_loader(self, loader)
119 self.state.timer.start("base/data_time")
120
--> 121 for i, batch in enumerate(loader):
122 batch = self._batch2device(batch, self.device)
123 self.state.timer.stop("base/data_time")
~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
635 self.reorder_dict[idx] = batch
636 continue
--> 637 return self._process_next_batch(batch)
638
639 next = __next__ # Python 2 compatibility
~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
656 self._put_indices()
657 if isinstance(batch, ExceptionWrapper):
--> 658 raise batch.exc_type(batch.exc_msg)
659 return batch
660
TypeError: Traceback (most recent call last):
File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/dataset.py", line 74, in __getitem__
dict_ = self.dict_transform(dict_)
File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 42, in __call__
img = t(img)
File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/augmentor.py", line 25, in __call__
] = self.augment_fn(dict_[self.dict_key], **self.default_kwargs)
File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 118, in __call__
return F.normalize(tensor, self.mean, self.std)
File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/functional.py", line 158, in normalize
raise TypeError('tensor is not a torch image.')
TypeError: tensor is not a torch image.
Then I tried adding more transforms but failed to fix the problem; for example, something like this:
data_transform = transforms.Compose([
Augmentor(
dict_key="features",
augment_fn=lambda x: \
torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
Augmentor(
dict_key="features",
augment_fn=transforms.ToPILImage()), #transforms.ToTensor()),
Augmentor(
dict_key="features",
augment_fn=transforms.Normalize(
(0.5, 0.5, 0.5),
(0.5, 0.5, 0.5))),
Augmentor(
dict_key="targets",
augment_fn=lambda x: \
torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
Augmentor(
dict_key="targets",
augment_fn= transforms.ToPILImage())#transforms.ToTensor())
])
What is wrong here?
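For reference, a sketch of why the error occurs, assuming torchvision semantics: transforms.Normalize expects a CHW float tensor, so it must run after the numpy-to-tensor conversion, not after ToPILImage. A tensor-only equivalent of the intended pipeline:

```python
import numpy as np
import torch


def to_tensor(img: np.ndarray) -> torch.Tensor:
    # HWC uint8 -> CHW float in [0, 1], like transforms.ToTensor
    return torch.from_numpy(img.astype(np.float32) / 255.0).permute(2, 0, 1)


def normalize(t: torch.Tensor, mean, std) -> torch.Tensor:
    # per-channel normalization, like transforms.Normalize
    mean = torch.tensor(mean).view(-1, 1, 1)
    std = torch.tensor(std).view(-1, 1, 1)
    return (t - mean) / std


img = np.zeros((64, 64, 3), dtype=np.uint8)
out = normalize(to_tensor(img), (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
```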
I got an OOM error during the validation step. Here is the log:
0/10 * Epoch (train): 100% 32/32 [00:41<00:00, 2.39s/it, _fps=11.486, loss=0.898]
0/10 * Epoch (valid): 3% 1/32 [00:01<00:46, 1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory
My model uses only 80% of the GPU during training. However, the validation step runs out of memory. That is weird, since I thought validation consumes less memory than training.
I am not sure whether this is normal or not. But I guess the GPU does not have time to release memory before moving on to the validation step.
I also tried to add a callback to free GPU memory:
class FreeGPU(Callback):
    def on_stage_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_end(self, state):
        torch.cuda.empty_cache()

    def on_stage_end(self, state):
        torch.cuda.empty_cache()

    def on_epoch_start(self, state):
        torch.cuda.empty_cache()

    def on_epoch_end(self, state):
        torch.cuda.empty_cache()
It does not help at all.
Do you have any ideas?
P.S.: This is not the first time I have faced this problem. The only way I can prevent it is to reduce the batch size, but that hurts performance.
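One common cause of validation-time OOM is running the forward pass with autograd enabled, which keeps every intermediate activation alive. A sketch of a gradient-free validation loop worth checking against (model and loader names are placeholders, not the library's API):

```python
import torch


def validate(model, batches, criterion):
    model.eval()
    total = 0.0
    # no graph is built inside no_grad, so activations are freed
    # as soon as each forward pass finishes
    with torch.no_grad():
        for x, y in batches:
            total += criterion(model(x), y).item()
    return total / max(len(batches), 1)
```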