catalyst's People

Contributors

alekseysh, alexgrinch, and-kul, andreysheka, arquestro, asmekal, asteyo, bagxi, belskikh, bloodaxe, crafterkolyan, ditwoo, dokholyan, elephantmipt, gazay, hexfaker, ivbelkin, julia-shenshina, lightforever, lx-ykachan, ngxbac, nimrais, pokidyshev, scitator, sergunya17, smivv, tezromach, vvelicodnii-sc, y-ksenia, zkid18

catalyst's Issues

"float division by zero" exception during notebook-example on anaconda/windows

Hi guys, thank you for the great project.

I tried to run the notebook example in my Anaconda environment and hit a "float division by zero" exception.
This is the code that causes the exception: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9 (basically the notebook example copy-pasted into a single .py file)

This is the stack trace:

C:\Users\Daiver\Anaconda3\python.exe C:/Users/Daiver/PycharmProjects/untitled/main.py
Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
0 * Epoch (train):   0% 1/1563 [00:00<15:06,  1.72it/s, base/batch_time=0.01562, base/data_time=0.01562, base/sample_per_second=2048.43759, loss=2.32288, lr=0.00100, momentum=0.90000, precision01=3.12500, precision03=18.75000, precision05=56.25000]
Traceback (most recent call last):
  File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 113, in <module>
    main()
  File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 109, in main
    epochs=n_epochs, verbose=True)
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 210, in train
    verbose=verbose
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 159, in run
    self.run_event(callbacks=callbacks, event="on_batch_end")
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 92, in run_event
    getattr(self.state, f"{event}_pre")(state=self.state)
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\state.py", line 203, in on_batch_end_pre
    state.batch_size / elapsed_time
ZeroDivisionError: float division by zero


Process finished with exit code 1

It can be fixed by adding a zero check on elapsed_time, but I have no idea why elapsed_time is zero.
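
One likely explanation, offered as an assumption rather than a confirmed diagnosis: on Windows, `time.time()` often advances in steps of about 15.6 ms (note `base/batch_time=0.01562` in the log above), so two reads taken during a fast batch can return the same value and their difference is exactly zero. A minimal sketch of the zero check mentioned above follows; `samples_per_second` and its `eps` default are hypothetical, not Catalyst code.

```python
import time


def samples_per_second(batch_size, elapsed_time, eps=1e-8):
    """Throughput that stays finite even when elapsed_time is 0.0."""
    # eps is an illustrative lower bound, not a value taken from Catalyst.
    return batch_size / max(elapsed_time, eps)


# perf_counter() is a monotonic high-resolution clock, so it is usually
# a better choice than time.time() for timing short intervals.
start = time.perf_counter()
elapsed = time.perf_counter() - start

print(samples_per_second(32, elapsed))
print(samples_per_second(32, 0.0))  # no ZeroDivisionError
```

Clamping the denominator keeps the metric defined; switching the timer to `time.perf_counter()` addresses the coarse clock itself.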

My python/catalyst versions

Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6

Catalyst was installed by cloning the current repo (master branch, last commit 892d5e5 "Merge pull request #56 from dbrainio/master").

[BUG] RL: Observation Buffer IndexError

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "catalyst/rl/offpolicy/scripts/run_samplers.py", line 162, in run_sampler
    sampler.run()
  File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 317, in run
    self.buffer.push_transition(transition)
  File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 107, in push_transition
    self.observations[self.pointer + 1] = s_tp1
IndexError: index 5000 is out of bounds for axis 0 with size 5000

From the code, it seems that nothing prevents the buffer from overflowing. I can try to fix it if someone confirms that it is really as easy as it seems, or tells me what I am missing. I would just increase the size of self.observations by one.

UPD: Oh gosh, it keeps iterating further; it seems the buffer size is not the limit.
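
For reference, a fixed-size buffer usually avoids this kind of `IndexError` by wrapping its write index with a modulo. The sketch below is a generic ring buffer, not the Catalyst sampler: the class name `RingBuffer` and its fields are hypothetical, and it does not model the extra slot that writing `s_tp1` at `pointer + 1` would need.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites the oldest item when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.pointer = 0  # next write position
        self.size = 0     # number of valid items

    def push(self, item):
        self.data[self.pointer] = item
        # Wrap around instead of running past the end of the storage.
        self.pointer = (self.pointer + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)


buf = RingBuffer(capacity=3)
for step in range(5):
    buf.push(step)
print(buf.data)  # [3, 4, 2]
```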

[solved] IOU metric (IouCallback) is bigger than 1

I have been using IouCallback with default parameters to train a U-Net model:

runner.train(
    model=model,
    main_metric='iou',
    minimize_metric=False,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    logdir=logdir,
    scheduler=scheduler,
    callbacks=[
        IouCallback(),
    ],
    num_epochs=num_epochs,
    verbose=True
)

As a result, the metric log looks like this:

[2019-05-26 13:25:18,540] 
0/10 * Epoch 0 (train): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=14.2914 | _timers/batch_time=0.4415 | _timers/data_time=0.3346 | _timers/model_time=0.1068 | iou=0.2843 | loss=1.2827
0/10 * Epoch 0 (valid): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=49.2591 | _timers/batch_time=0.1080 | _timers/data_time=0.0271 | _timers/model_time=0.0809 | iou=0.6478 | loss=0.5063
[2019-05-26 13:25:50,207] 
1/10 * Epoch 1 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3666 | _timers/batch_time=0.3442 | _timers/data_time=0.3376 | _timers/model_time=0.0065 | iou=0.5010 | loss=0.4697
1/10 * Epoch 1 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=46.6509 | _timers/batch_time=0.0335 | _timers/data_time=0.0289 | _timers/model_time=0.0045 | iou=0.9374 | loss=-1.6844
[2019-05-26 13:26:21,913] 
2/10 * Epoch 2 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.1218 | _timers/batch_time=0.3469 | _timers/data_time=0.3409 | _timers/model_time=0.0059 | iou=0.5980 | loss=0.0623
2/10 * Epoch 2 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.7021 | _timers/batch_time=0.0343 | _timers/data_time=0.0293 | _timers/model_time=0.0050 | iou=1.0979 | loss=-2.1567
[2019-05-26 13:26:52,914] 
3/10 * Epoch 3 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3602 | _timers/batch_time=0.3504 | _timers/data_time=0.3443 | _timers/model_time=0.0061 | iou=0.5408 | loss=0.0550
3/10 * Epoch 3 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=48.0609 | _timers/batch_time=0.0321 | _timers/data_time=0.0276 | _timers/model_time=0.0044 | iou=0.9644 | loss=-1.6202
[2019-05-26 13:27:24,687] 
4/10 * Epoch 4 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.0270 | _timers/batch_time=0.3544 | _timers/data_time=0.3484 | _timers/model_time=0.0059 | iou=0.7157 | loss=-0.8354
4/10 * Epoch 4 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.5536 | _timers/batch_time=0.0348 | _timers/data_time=0.0301 | _timers/model_time=0.0046 | iou=0.9381 | loss=-1.9926
[2019-05-26 13:27:57,148] 
5/10 * Epoch 5 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=11.9987 | _timers/batch_time=0.3576 | _timers/data_time=0.3516 | _timers/model_time=0.0059 | iou=0.6653 | loss=-0.7207
5/10 * Epoch 5 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=47.2183 | _timers/batch_time=0.0330 | _timers/data_time=0.0285 | _timers/model_time=0.0044 | iou=1.0183 | loss=-2.6983
[2019-05-26 13:28:28,836] 
6/10 * Epoch 6 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.2940 | _timers/batch_time=0.3451 | _timers/data_time=0.3389 | _timers/model_time=0.0061 | iou=0.7951 | loss=-1.6150
6/10 * Epoch 6 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.1915 | _timers/batch_time=0.0339 | _timers/data_time=0.0290 | _timers/model_time=0.0048 | iou=1.3360 | loss=-5.5529
[2019-05-26 13:28:59,628] 
7/10 * Epoch 7 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3576 | _timers/batch_time=0.3424 | _timers/data_time=0.3360 | _timers/model_time=0.0063 | iou=0.7155 | loss=-1.0354
7/10 * Epoch 7 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.9911 | _timers/batch_time=0.0338 | _timers/data_time=0.0288 | _timers/model_time=0.0049 | iou=1.0820 | loss=-1.8514
[2019-05-26 13:29:30,871] 
8/10 * Epoch 8 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5222 | _timers/batch_time=0.3394 | _timers/data_time=0.3337 | _timers/model_time=0.0057 | iou=0.6650 | loss=-0.8128
8/10 * Epoch 8 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=49.4219 | _timers/batch_time=0.0318 | _timers/data_time=0.0272 | _timers/model_time=0.0046 | iou=1.2081 | loss=-6.1373
[2019-05-26 13:30:01,903] 
9/10 * Epoch 9 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5503 | _timers/batch_time=0.3436 | _timers/data_time=0.3380 | _timers/model_time=0.0056 | iou=0.7887 | loss=-1.5298
9/10 * Epoch 9 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.4443 | _timers/batch_time=0.0339 | _timers/data_time=0.0291 | _timers/model_time=0.0047 | iou=1.2653 | loss=-3.6054

As you can see, some of the validation IoU values are higher than 1. Is this a sum of IoUs, or am I doing something wrong?
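
The thread does not record the cause, so the following is only one possible explanation, stated as an assumption: a soft IoU computed as intersection / (sum(pred) + sum(target) - intersection) is bounded by 1 only when both inputs lie in [0, 1]. If the masks are stored as 0/255 instead of 0/1, the value can exceed 1, and a binary cross-entropy loss (if that is the criterion in use) can also go negative, as in the log above. `soft_iou` is a hypothetical helper, not the `IouCallback` implementation.

```python
def soft_iou(pred, target, eps=1e-7):
    """Soft IoU over flat sequences of per-pixel values."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return intersection / (union + eps)


pred = [0.9, 0.8, 0.1, 0.0]          # probabilities in [0, 1]
mask_01 = [1.0, 1.0, 0.0, 0.0]       # mask scaled to {0, 1}
mask_255 = [255.0, 255.0, 0.0, 0.0]  # same mask left as {0, 255}

print(soft_iou(pred, mask_01))   # stays within [0, 1]
print(soft_iou(pred, mask_255))  # exceeds 1
```

If this is the cause, checking the minimum and maximum of a target batch before training is a quick way to confirm it.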

[solved] inconsistent learning rate print

Hi guys!
I tried to run the notebook example again (code: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9, but with verbose set to False) on an Ubuntu machine.

The script gives me the following output:

Python version 3.6.3 (default, Oct  3 2017, 21:45:48) 
[GCC 7.2.0]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
[2019-01-17 11:39:03,741] 0 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00520 | base/sample_per_second: 6180.13137 | precision01: 41.76663 | precision03: 75.73776 | precision05: 89.51935 | lr: 0.00100 | momentum: 0.90000 | loss: 1.58647
[2019-01-17 11:39:03,741] 0 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00503 | base/sample_per_second: 6365.33178 | precision01: 50.07987 | precision03: 81.99880 | precision05: 93.29073 | lr: 0.00100 | momentum: 0.90000 | loss: 1.37917
[2019-01-17 11:39:03,741] 

[2019-01-17 11:39:16,967] 1 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00519 | base/sample_per_second: 6168.41000 | precision01: 53.47689 | precision03: 84.46497 | precision05: 94.58573 | lr: 0.00100 | momentum: 0.90000 | loss: 1.28700
[2019-01-17 11:39:16,967] 1 * Epoch (valid) metrics: base/data_time: 0.00473 | base/batch_time: 0.00513 | base/sample_per_second: 6246.95064 | precision01: 55.16174 | precision03: 85.15375 | precision05: 94.75839 | lr: 0.00100 | momentum: 0.90000 | loss: 1.24638
[2019-01-17 11:39:16,967] 

[2019-01-17 11:39:30,171] 2 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6184.96553 | precision01: 58.61124 | precision03: 86.94818 | precision05: 95.64539 | lr: 0.00100 | momentum: 0.90000 | loss: 1.15438
[2019-01-17 11:39:30,171] 2 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6210.20164 | precision01: 58.80591 | precision03: 87.11062 | precision05: 95.48722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.14073
[2019-01-17 11:39:30,171] 

Epoch     3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171
[2019-01-17 11:39:43,270] 3 * Epoch (valid) metrics: base/data_time: 0.00464 | base/batch_time: 0.00503 | base/sample_per_second: 6367.18977 | precision01: 61.06230 | precision03: 87.85942 | precision05: 95.62700 | lr: 0.00100 | momentum: 0.90000 | loss: 1.10485
[2019-01-17 11:39:43,270] 

[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941
[2019-01-17 11:39:56,491] 4 * Epoch (valid) metrics: base/data_time: 0.00474 | base/batch_time: 0.00513 | base/sample_per_second: 6240.70324 | precision01: 63.70807 | precision03: 89.10743 | precision05: 96.40575 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03544
[2019-01-17 11:39:56,491] 

[2019-01-17 11:40:09,791] 5 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00523 | base/sample_per_second: 6126.94727 | precision01: 67.57038 | precision03: 91.11484 | precision05: 97.31086 | lr: 0.00050 | momentum: 0.90000 | loss: 0.91745
[2019-01-17 11:40:09,792] 5 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6216.97721 | precision01: 63.73802 | precision03: 89.18730 | precision05: 96.29593 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03911
[2019-01-17 11:40:09,792] 

Epoch     6: reducing learning rate of group 0 to 2.5000e-04.
[2019-01-17 11:40:23,136] 6 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00522 | base/sample_per_second: 6135.11402 | precision01: 68.87796 | precision03: 91.71065 | precision05: 97.54479 | lr: 0.00050 | momentum: 0.90000 | loss: 0.88481
[2019-01-17 11:40:23,136] 6 * Epoch (valid) metrics: base/data_time: 0.00483 | base/batch_time: 0.00523 | base/sample_per_second: 6131.59957 | precision01: 63.83786 | precision03: 89.16733 | precision05: 96.18610 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03159
[2019-01-17 11:40:23,136] 

[2019-01-17 11:40:36,215] 7 * Epoch (train) metrics: base/data_time: 0.00471 | base/batch_time: 0.00512 | base/sample_per_second: 6255.08350 | precision01: 70.90931 | precision03: 92.46041 | precision05: 97.84269 | lr: 0.00025 | momentum: 0.90000 | loss: 0.82662
[2019-01-17 11:40:36,215] 7 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00504 | base/sample_per_second: 6357.74667 | precision01: 64.66653 | precision03: 89.55671 | precision05: 96.43570 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01120
[2019-01-17 11:40:36,216] 

[2019-01-17 11:40:49,560] 8 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6183.07309 | precision01: 71.50512 | precision03: 92.77031 | precision05: 97.91667 | lr: 0.00025 | momentum: 0.90000 | loss: 0.80761
[2019-01-17 11:40:49,560] 8 * Epoch (valid) metrics: base/data_time: 0.00497 | base/batch_time: 0.00540 | base/sample_per_second: 5983.79479 | precision01: 65.85463 | precision03: 89.98602 | precision05: 96.55551 | lr: 0.00025 | momentum: 0.90000 | loss: 0.99543
[2019-01-17 11:40:49,560] 

Epoch     9: reducing learning rate of group 0 to 1.2500e-04.
[2019-01-17 11:41:03,294] 9 * Epoch (train) metrics: base/data_time: 0.00495 | base/batch_time: 0.00540 | base/sample_per_second: 5959.02559 | precision01: 72.08093 | precision03: 93.00224 | precision05: 98.04063 | lr: 0.00025 | momentum: 0.90000 | loss: 0.79061
[2019-01-17 11:41:03,294] 9 * Epoch (valid) metrics: base/data_time: 0.00479 | base/batch_time: 0.00519 | base/sample_per_second: 6174.30215 | precision01: 64.95607 | precision03: 89.87620 | precision05: 96.50559 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01265
[2019-01-17 11:41:03,294] 

Top best models:
./logs/cifar_simple_notebook/checkpoint.None.8.pth.tar	0.9954
./logs/cifar_simple_notebook/checkpoint.None.7.pth.tar	1.0112
./logs/cifar_simple_notebook/checkpoint.None.9.pth.tar	1.0127
./logs/cifar_simple_notebook/checkpoint.None.6.pth.tar	1.0316
./logs/cifar_simple_notebook/checkpoint.None.4.pth.tar	1.0354

Everything looks OK except one thing:

Epoch     3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171

The lr should have been decreased at epoch 3, but the epoch 3 summary still shows the old lr.
The summary of the next epoch is correct:

[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941

The same happens for the other epochs.
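
A plausible reading of this output, given as an assumption rather than a confirmed diagnosis: the `lr` in an epoch summary is the rate the optimizer actually used during that epoch, while the scheduler message is printed when the scheduler steps at the end of the epoch, just before the summary is logged. The toy loop below only mimics that ordering. It uses no Catalyst or PyTorch code, and the halving schedule is hard-coded to match the log.

```python
lr = 1e-3
reduce_at = {3, 6, 9}  # epochs after which the toy scheduler halves lr
history = []

for epoch in range(10):
    lr_used = lr  # value the optimizer uses for every batch of this epoch
    if epoch in reduce_at:
        lr *= 0.5  # scheduler steps at epoch end, before the summary
        print(f"Epoch {epoch:5d}: reducing learning rate to {lr:.4e}.")
    history.append(lr_used)
    print(f"{epoch} * Epoch (train) metrics: lr: {lr_used:.5f}")
```

Under this reading the printed value is correct for the epoch it describes, and the new rate first shows up in the following summary.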

Docs for checkpoints logic

Hey, great framework!

I can't find in the docs, and can't see from the API, how to easily change the checkpointer behavior: which metric it should monitor for checkpoints, max/min, etc.

And I'm not sure about checkpoints at all: where do I find the saved models, what is the format, are they enabled by default, and what is the current logic for saving them? And can I customize all of that?

[BUG] CheckpointCallback is wrong

Hi,
I ran my experiment and I don't know why the best checkpoints are not saved. I stepped through with pdb and found a bug.

Settings

  • metrics: map05
  • minimize_metric: False

Scenario

  • From epoch 0->10, valid map05 = 0 for all
  • From epoch 11, valid map05 starts increasing.
  • From epoch 11, the best checkpoints are not saved

The following are the console logs and a screenshot of the checkpoints folder.
From epoch 0->10:

....
8/100 * Epoch (train): _fps=758.1664 | base/batch_time=0.0454 | base/data_time=0.0043 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=28.4960 | map05=0.0000
8/100 * Epoch (valid): _fps=753.5475 | base/batch_time=0.0635 | base/data_time=0.0245 | base/lr=0.0001 | base/model_time=0.0390 | base/momentum=0.9000 | loss=30.4348 | map05=0.0000
9/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.16it/s, _fps=733.518, loss=26.232, map05=0.000]
9/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.60it/s, _fps=787.242, loss=28.358, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
9
(Pdb) c
[2019-03-21 10:20:30,871] 
9/100 * Epoch (train): _fps=765.4649 | base/batch_time=0.0453 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0408 | base/momentum=0.9000 | loss=27.7296 | map05=0.0000
9/100 * Epoch (valid): _fps=752.4676 | base/batch_time=0.0630 | base/data_time=0.0239 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.9424 | map05=0.0000
10/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=779.565, loss=24.555, map05=0.000]
10/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.95it/s, _fps=822.962, loss=27.833, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:20:46,505] 
10/100 * Epoch (train): _fps=758.8756 | base/batch_time=0.0458 | base/data_time=0.0047 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=26.9246 | map05=0.0004
10/100 * Epoch (valid): _fps=754.0560 | base/batch_time=0.0616 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.3501 | map05=0.0000
11/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.27it/s, _fps=785.593, loss=26.380, map05=0.000]
11/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 14.73it/s, _fps=774.013, loss=27.053, map05=0.000]
...

From epoch 11:

11/100 * Epoch (train): _fps=761.4851 | base/batch_time=0.0455 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=26.1523 | map05=0.0014
11/100 * Epoch (valid): _fps=751.8519 | base/batch_time=0.0668 | base/data_time=0.0277 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=28.8740 | map05=0.0000
12/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.17it/s, _fps=765.323, loss=26.579, map05=0.000]
12/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.92it/s, _fps=824.337, loss=26.422, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
0.0011160714285714285
(Pdb) c
[2019-03-21 10:21:46,089] 
12/100 * Epoch (train): _fps=758.7695 | base/batch_time=0.0456 | base/data_time=0.0046 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=25.5722 | map05=0.0015
12/100 * Epoch (valid): _fps=762.2009 | base/batch_time=0.0616 | base/data_time=0.0229 | base/lr=0.0001 | base/model_time=0.0387 | base/momentum=0.9000 | loss=28.3761 | map05=0.0011
13/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.99it/s, _fps=783.159, loss=23.634, map05=0.000]
13/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.25it/s, _fps=807.500, loss=26.013, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:21:55,990] 
13/100 * Epoch (train): _fps=760.6421 | base/batch_time=0.0461 | base/data_time=0.0051 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.9090 | map05=0.0021
13/100 * Epoch (valid): _fps=758.2683 | base/batch_time=0.0604 | base/data_time=0.0215 | base/lr=0.0001 | base/model_time=0.0389 | base/momentum=0.9000 | loss=27.9700 | map05=0.0011
14/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.07it/s, _fps=770.857, loss=26.029, map05=0.000]
14/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.66it/s, _fps=811.955, loss=25.584, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:08,043] 
14/100 * Epoch (train): _fps=762.4881 | base/batch_time=0.0452 | base/data_time=0.0042 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.1805 | map05=0.0037
14/100 * Epoch (valid): _fps=752.5557 | base/batch_time=0.0627 | base/data_time=0.0236 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=27.5076 | map05=0.0022
15/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=760.488, loss=22.381, map05=0.000]
15/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.09it/s, _fps=787.039, loss=25.456, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:20,274] 
15/100 * Epoch (train): _fps=754.6090 | base/batch_time=0.0455 | base/data_time=0.0044 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=23.6162 | map05=0.0060
15/100 * Epoch (valid): _fps=739.9028 | base/batch_time=0.0610 | base/data_time=0.0212 | base/lr=0.0001 | base/model_time=0.0398 | base/momentum=0.9000 | loss=27.1939 | map05=0.0033
16/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.69it/s, _fps=778.896, loss=24.231, map05=0.000]
16/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.89it/s, _fps=807.296, loss=24.941, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:45,175] 
16/100 * Epoch (train): _fps=756.6095 | base/batch_time=0.0466 | base/data_time=0.0054 | base/lr=0.0001 | base/model_time=0.0412 | base/momentum=0.9000 | loss=23.0421 | map05=0.0070
16/100 * Epoch (valid): _fps=749.7147 | base/batch_time=0.0618 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0393 | base/momentum=0.9000 | loss=26.7421 | map05=0.0045
17/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.92it/s, _fps=771.987, loss=23.759, map05=0.000]
17/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.72it/s, _fps=805.377, loss=24.638, map05=0.062]

Checkpoints folder: [screenshot on Imgur]

Problem

I found that the problem is here:

  • From epoch 0->11:
checkpoint_metric = 0
checkpoint_metric = checkpoint_metric or epoch
=> checkpoint_metric = epoch

Because 0 is falsy, the `or` falls through to the epoch, so checkpoint_metric is now a number greater than 1.

  • From epoch > 11:
checkpoint_metric = 0.xxxx
checkpoint_metric = checkpoint_metric or epoch
=> checkpoint_metric = 0.xxxx

checkpoint_metric is now a float less than 1
=> The next best checkpoints will be removed after sorting.

The following is the log of self.top_best_metrics:

[('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.11.pth', 11), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.10.pth', 10), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.9.pth', 9), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.8.pth', 8), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.7.pth', 7), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.18.pth', 0.006324404761904762)]

=> The last checkpoint is then removed by:

            last_item = self.top_best_metrics.pop(-1)
            last_filepath = last_item[0]
            os.remove(last_filepath)
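
The pitfall described above is that `or` treats a legitimate metric value of `0` as missing. A minimal sketch of the difference, using hypothetical helpers rather than the actual `CheckpointCallback` code; falling back only when the metric is `None` keeps `0.0` intact:

```python
def pick_metric_buggy(checkpoint_metric, epoch):
    # 0 and 0.0 are falsy, so a real metric of 0 is replaced by the epoch.
    return checkpoint_metric or epoch


def pick_metric(checkpoint_metric, epoch):
    # Fall back to the epoch only when no metric was provided at all.
    return epoch if checkpoint_metric is None else checkpoint_metric


print(pick_metric_buggy(0.0, 9))  # 9
print(pick_metric(0.0, 9))        # 0.0
print(pick_metric(None, 9))       # 9
```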

some bugs in LRFinder

With the default value n_steps=None for LRFinder, training terminates with:

File "/home/ivb/Repos/contrib/catalyst/catalyst/dl/callbacks/schedulers.py", line 188, in on_loader_start
    self.n_steps = self.n_steps or len(state.loader)
AttributeError: 'RunnerState' object has no attribute 'loader'

To reproduce, just replace the scheduler section in examples/cifar_simple/config.yml with:

scheduler:
  callback: LRFinder
  final_lr: 10

Also, the docs for LRFinder mention an unused parameter, init_lr.

Memory Leak Issue

I just want to report that I am encountering a memory leak issue.
I am not 100% sure that it comes from Catalyst, but I am sure that when I use my own code (without Catalyst), there is no leak. I tried to reproduce this issue with your examples, but they run very fast and the leak cannot be seen.

Here is a description of the leak:

  • Image classification task with ~17M images

  • Memory usage grows during every epoch, and the same thing happens again in the next epoch.
    Example:
    At the beginning of epoch 0, my memory is:

    • RAM: 13G/16G
    • Swap: 0/50G

    At the end of epoch 0, my memory is:

    • RAM: 16G/16G
    • Swap: 25G/50G

    When epoch 0 ends, the memory (RAM + swap) is released (RAM: 13G/16G, Swap: 0/50G). The same phenomenon occurs in the next epoch, epoch 1.

I am debugging to see where the issue comes from. Could you please double-check with a bigger dataset of your own?

[feature] task-specific callbacks

Proposal by @BloodAxe: use task-specific callbacks, such as:

  1. BinaryClassificationMetricsCallback
  2. MulticlassClassificationMetricsCallback
  3. MultiLabelClassificationMetricsCallback
  4. BinarySegmentationMetricsCallback
  5. SemanticSegmentationMetricsCallback
  6. ObjectDetectionMetricsCallback

With internal metric definitions like:

SemanticSegmentationMetricsCallback(
    need_confusion_matrix=True, 
    need_mAP=True, 
    need_IoU=False)

Some improvements of LRFinder

A typical use case for LRFinder is to just set some large value for final_lr, say 10, so it would be convenient to stop iterating in case of divergence. It would probably also help to add a default value for final_lr.
If this sounds good, I'll contribute.
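
A rough sketch of what such a stopping rule could look like, assuming an exponential sweep from a small starting rate up to `final_lr`. The function names, the `start_lr` default, and the "loss above 4x the best loss" divergence threshold are all illustrative assumptions, not the existing `LRFinder` behaviour.

```python
import math


def lr_sweep(start_lr, final_lr, n_steps):
    """Yield n_steps learning rates spaced evenly on a log scale."""
    ratio = (final_lr / start_lr) ** (1.0 / max(n_steps - 1, 1))
    for step in range(n_steps):
        yield start_lr * ratio ** step


def find_lr(loss_at, start_lr=1e-6, final_lr=10.0, n_steps=50, factor=4.0):
    """Sweep learning rates; stop early once the loss clearly diverges."""
    best_loss = math.inf
    tried = []
    for lr in lr_sweep(start_lr, final_lr, n_steps):
        loss = loss_at(lr)
        tried.append((lr, loss))
        # Divergence check: NaN/inf, or far above the best loss seen so far.
        if not math.isfinite(loss) or loss > factor * best_loss:
            break
        best_loss = min(best_loss, loss)
    return tried


# Toy loss curve standing in for a real training step.
def toy_loss(lr):
    return 1.0 / (1.0 + 100.0 * lr) + 50.0 * lr ** 2


tried = find_lr(toy_loss)
print(f"stopped after {len(tried)} of 50 steps, last lr={tried[-1][0]:.4g}")
```

With a rule like this, a generous default for `final_lr` costs little, because the sweep ends as soon as the loss blows up.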

[fix] fix json metric logger

@todo:

  • add json formatter to Logger
  • check (import json; data=json.load(open(f'{logdir}/metrics.json')); print(data[-1]))
  • add this check to travis CI
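
The check in the second item could be turned into a small test. The sketch below writes a stand-in `metrics.json` to a temporary directory so that it runs on its own; the list-of-dicts layout is an assumption inferred from `data[-1]`, and `last_metrics` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def last_metrics(logdir):
    """Load metrics.json from logdir and return its final entry."""
    with open(Path(logdir) / "metrics.json") as fin:
        data = json.load(fin)
    return data[-1]


with tempfile.TemporaryDirectory() as logdir:
    # Stand-in for what a training run would write.
    records = [{"epoch": 0, "loss": 1.58}, {"epoch": 1, "loss": 1.28}]
    (Path(logdir) / "metrics.json").write_text(json.dumps(records))

    last = last_metrics(logdir)
    print(last)
```

In CI, `logdir` would point at the output of a short training run, so the test fails whenever the logger writes something `json.load` cannot parse.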

[feature] python dependencies logging

For better experiment reproducibility:

  • add something like pip/conda freeze at the beginning of a catalyst-dl run
  • or just after the logdir creation

Sometimes package versions matter.
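
One way to do this without shelling out to `pip freeze` is the standard-library `importlib.metadata` module (Python 3.8+). This is a sketch: `dump_requirements` is a hypothetical helper and the file name `requirements.txt` is chosen only for illustration. It sees only packages visible to the running interpreter, so it would not capture non-Python dependencies the way `conda list` does.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def dump_requirements(logdir):
    """Write 'name==version' lines for installed packages into logdir."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken metadata
            lines.add(f"{name}=={dist.version}")
    path = Path(logdir) / "requirements.txt"
    path.write_text("\n".join(sorted(lines, key=str.lower)) + "\n")
    return path


with tempfile.TemporaryDirectory() as logdir:
    path = dump_requirements(logdir)
    print(path.read_text()[:200])
    written = path.exists()
```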

Bug: ZeroDivisionError while computing FPS counter

Hi.

I've got a weird unhandled exception during a simple training run:

  File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 127, in <module>
    #     metrics=["loss", "precision01", "precision03", "base/lr"])
  File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 117, in main
    ],
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 271, in train
    self.run_experiment(experiment, check=check)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 194, in run_experiment
    self._run_stage(stage)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 175, in _run_stage
    self._run_epoch(loaders)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 160, in _run_epoch
    self._run_loader(loaders[loader_name])
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 132, in _run_loader
    self._run_event("batch_end")
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 97, in _run_event
    getattr(self.state, f"on_{event}_post")()
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 153, in on_batch_end_post
    self._handle_runner_metrics()
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 114, in _handle_runner_metrics
    self.batch_size / self.timer.elapsed["base/batch_time"]
ZeroDivisionError: float division by zero

Before it happened, the training logs looked totally fine:

C:\Anaconda3\envs\kaggle\python.exe C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py
0/100 * Epoch (train): 100% 87/87 [01:37<00:00,  2.28it/s, _fps=16383.000, jaccard=0.000, loss=0.068]
0/100 * Epoch (valid): 100% 10/10 [00:39<00:00,  4.53s/it, _fps=240.926, jaccard=0.000, loss=0.085]
[2019-04-03 14:35:00,371] 
0/100 * Epoch (train): _fps=13657.5198 | base/batch_time=0.7826 | base/data_time=0.7441 | base/lr=0.0010 | base/model_time=0.0385 | base/momentum=0.9000 | jaccard=0.0074 | loss=0.0802
0/100 * Epoch (valid): _fps=8236.0160 | base/batch_time=3.8731 | base/data_time=3.7668 | base/lr=0.0010 | base/model_time=0.1063 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0837
1/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.72it/s, _fps=16395.258, jaccard=0.001, loss=0.061]
1/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.78s/it, _fps=16383.000, jaccard=0.004, loss=0.077]
[2019-04-03 14:37:15,192] 
1/100 * Epoch (train): _fps=13985.1989 | base/batch_time=0.8083 | base/data_time=0.8061 | base/lr=0.0010 | base/model_time=0.0022 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0642
1/100 * Epoch (valid): _fps=6082.4207 | base/batch_time=3.6415 | base/data_time=3.6382 | base/lr=0.0010 | base/model_time=0.0032 | base/momentum=0.9000 | jaccard=0.0033 | loss=0.0761
2/100 * Epoch (train): 100% 87/87 [01:34<00:00,  2.51it/s, _fps=16382.500, jaccard=0.186, loss=0.049]
2/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  6.22s/it, _fps=16382.000, jaccard=0.188, loss=0.064]
[2019-04-03 14:39:29,013] 
2/100 * Epoch (train): _fps=13115.4018 | base/batch_time=0.7891 | base/data_time=0.7886 | base/lr=0.0010 | base/model_time=0.0005 | base/momentum=0.9000 | jaccard=0.0762 | loss=0.0538
2/100 * Epoch (valid): _fps=12309.1793 | base/batch_time=3.6892 | base/data_time=3.6892 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.1834 | loss=0.0630
3/100 * Epoch (train): 100% 87/87 [01:35<00:00,  2.46it/s, _fps=16387.001, jaccard=0.267, loss=0.043]
3/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.76s/it, _fps=16382.750, jaccard=0.273, loss=0.056]
[2019-04-03 14:41:42,811] 
3/100 * Epoch (train): _fps=13093.6248 | base/batch_time=0.7970 | base/data_time=0.7961 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.2246 | loss=0.0447
3/100 * Epoch (valid): _fps=13118.9414 | base/batch_time=3.6233 | base/data_time=3.6233 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.2659 | loss=0.0549
4/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.62it/s, _fps=16249.120, jaccard=0.354, loss=0.038]
4/100 * Epoch (valid): 100% 10/10 [00:39<00:00,  3.93s/it, _fps=16383.000, jaccard=0.356, loss=0.049]
[2019-04-03 14:43:58,902] 
4/100 * Epoch (train): _fps=13008.1172 | base/batch_time=0.8044 | base/data_time=0.8031 | base/lr=0.0010 | base/model_time=0.0014 | base/momentum=0.9000 | jaccard=0.3206 | loss=0.0382
4/100 * Epoch (valid): _fps=13161.2595 | base/batch_time=3.7907 | base/data_time=3.7907 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.3512 | loss=0.0484
5/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.57it/s, _fps=16387.001, jaccard=0.437, loss=0.032]
5/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.78s/it, _fps=16384.000, jaccard=0.431, loss=0.044]
[2019-04-03 14:46:13,698] 
5/100 * Epoch (train): _fps=14208.6617 | base/batch_time=0.8076 | base/data_time=0.8067 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.4012 | loss=0.0333
5/100 * Epoch (valid): _fps=11602.1947 | base/batch_time=3.6362 | base/data_time=3.6362 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.4264 | loss=0.0436
6/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.82it/s, _fps=16382.750, jaccard=0.477, loss=0.030]
6/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  4.33s/it, _fps=19348.791, jaccard=0.481, loss=0.041]
[2019-04-03 14:48:28,650] 
6/100 * Epoch (train): _fps=14899.4017 | base/batch_time=0.8033 | base/data_time=0.8029 | base/lr=0.0010 | base/model_time=0.0004 | base/momentum=0.9000 | jaccard=0.4614 | loss=0.0299
6/100 * Epoch (valid): _fps=10905.9308 | base/batch_time=3.6788 | base/data_time=3.6784 | base/lr=0.0010 | base/model_time=0.0003 | base/momentum=0.9000 | jaccard=0.4788 | loss=0.0399
7/100 * Epoch (train): 100% 87/87 [01:37<00:00,  2.58it/s, _fps=16387.001, jaccard=0.509, loss=0.030]
7/100 * Epoch (valid): 100% 10/10 [00:38<00:00,  4.41s/it, _fps=16382.500, jaccard=0.506, loss=0.039]
[2019-04-03 14:50:45,999] 
7/100 * Epoch (train): _fps=14034.5155 | base/batch_time=0.8213 | base/data_time=0.8204 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.5000 | loss=0.0278
7/100 * Epoch (valid): _fps=10034.7244 | base/batch_time=3.7788 | base/data_time=3.7787 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5043 | loss=0.0382
8/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.66it/s, _fps=16351.813, jaccard=0.533, loss=0.028]
8/100 * Epoch (valid): 100% 10/10 [00:40<00:00,  4.00s/it, _fps=16210.115, jaccard=0.527, loss=0.038]
[2019-04-03 14:53:03,136] 
8/100 * Epoch (train): _fps=13962.9568 | base/batch_time=0.8072 | base/data_time=0.8066 | base/lr=0.0010 | base/model_time=0.0006 | base/momentum=0.9000 | jaccard=0.5228 | loss=0.0264
8/100 * Epoch (valid): _fps=11596.8021 | base/batch_time=3.8571 | base/data_time=3.8570 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5253 | loss=0.0369
9/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.50it/s, _fps=16382.750, jaccard=0.535, loss=0.025]
9/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.08s/it, _fps=16382.750, jaccard=0.541, loss=0.037]
[2019-04-03 14:55:18,959] 
9/100 * Epoch (train): _fps=13944.8145 | base/batch_time=0.8145 | base/data_time=0.8135 | base/lr=0.0010 | base/model_time=0.0008 | base/momentum=0.9000 | jaccard=0.5375 | loss=0.0256
9/100 * Epoch (valid): _fps=9995.8059 | base/batch_time=3.6704 | base/data_time=3.6704 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.5390 | loss=0.0359
10/100 * Epoch (train): 100% 87/87 [01:35<00:00,  2.55it/s, _fps=16382.750, jaccard=0.560, loss=0.024]
10/100 * Epoch (valid):  80% 8/10 [00:37<00:12,  6.05s/it, _fps=16383.000, jaccard=0.552, loss=0.036]

For reference, train script: https://github.com/BloodAxe/pytorch-toolbelt/blob/feature/example-canny-cnn/examples/canny-edge-detector-in-cnn/example_canny_cnn.py

Environment:
OS: Windows 10
Python: 3.6
Catalyst: 19.3
PyTorch: 1.0.1

KeyError: 'loss'

This exception is raised after the first epoch.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-eecc50de0067> in <module>
      3     callbacks=callbacks,
      4     logdir=logdir,
----> 5     epochs=n_epochs, verbose=True)

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in train(self, loaders, callbacks, state_params, epochs, start_epoch, verbose, logdir)
    214             mode="train",
    215             verbose=verbose,
--> 216             logdir=logdir,
    217         )
    218 

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run(self, loaders, callbacks, state_params, epochs, start_epoch, mode, verbose, logdir)
    180                 self.run_event(callbacks=callbacks, event="on_loader_end")
    181 
--> 182             self.run_event(callbacks=callbacks, event="on_epoch_end")
    183 
    184         self.run_event(callbacks=callbacks, event=f"on_{mode}_end")

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run_event(self, callbacks, event)
     93         :param event:
     94         """
---> 95         getattr(self.state, f"{event}_pre")(state=self.state)
     96         for callback in callbacks.values():
     97             getattr(callback, event)(state=self.state)

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/state.py in on_epoch_end_pre(state)
    146                 valid_loader=state.valid_loader,
    147                 main_metric=state.main_metric,
--> 148                 minimize=state.minimize_metric)
    149         valid_metrics = {
    150             key: value

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/callbacks/utils.py in process_epoch_metrics(epoch_metrics, best_metrics, valid_loader, main_metric, minimize)
     27         if best_metrics is None \
     28         else (minimize != (
---> 29             valid_metrics[main_metric] > best_metrics[main_metric]))
     30     best_metrics = valid_metrics if is_best else best_metrics
     31     return best_metrics, valid_metrics, is_best

KeyError: 'loss'

I'm using these callbacks:

callbacks["loss"] = LossCallback()
callbacks["optimizer"] = OptimizerCallback()
callbacks["precision"] = PrecisionCallback(
    precision_args=[1])

callbacks["scheduler"] = SchedulerCallback(
    reduce_metric="precision01")

callbacks["saver"] = CheckpointCallback()
callbacks["logger"] = Logger()
callbacks["tflogger"] = TensorboardLogger()

Data and processing are taken from here: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

Refactor Algorithms

  • default/softmax/quantile as parameter
  • prepare_for trainer/sampler to algorithm class method
  • ALGO -> ALGORITHM (let's use full name)
  • ddpg/sac/td3 -> rl/offpolicy/algorithms
  • example

[feature] OneCycle general scheduler

idea:

  • assume we have training stage process 0.0 -> 1.0
  • on 0 -> {warm_fraction}, increase lr from init_lr to max_lr
  • on {warm_fraction} -> {cool_fraction}, use linear/cosine decay from max_lr to min_lr
  • on {cool_fraction} -> 1.0, use min_lr
  • batch or epoch lr scheduling

the same thing with momentum
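
A minimal sketch of the schedule described above, written as a plain function of training progress. The function name, the default fractions, and the `decay` switch are assumptions made for illustration; this is not an existing catalyst API.

```python
import math


def one_cycle_value(progress, init_value, max_value, min_value,
                    warm_fraction=0.3, cool_fraction=0.9, decay="cosine"):
    """Return the scheduled value (lr or momentum) at `progress` in [0, 1].

    Hypothetical helper: names and defaults are illustrative only.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if not 0.0 < warm_fraction < cool_fraction <= 1.0:
        raise ValueError("need 0 < warm_fraction < cool_fraction <= 1")

    if progress < warm_fraction:
        # Warm-up phase: linear ramp from init_value to max_value.
        t = progress / warm_fraction
        return init_value + t * (max_value - init_value)

    if progress < cool_fraction:
        # Decay phase: go from max_value to min_value, linearly or on a cosine.
        t = (progress - warm_fraction) / (cool_fraction - warm_fraction)
        if decay == "cosine":
            t = 0.5 * (1.0 - math.cos(math.pi * t))
        return max_value + t * (min_value - max_value)

    # Final phase: hold min_value until the end of training.
    return min_value
```

For batch-level scheduling, `progress` could be computed as `step / total_steps`; for epoch-level scheduling, as `epoch / num_epochs`. The same function can drive momentum by passing momentum values instead of learning rates.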

[solved] ResnetEncoder 'frozen' argument doesn't work

Passing frozen=True to catalyst.contrib.models.ResnetEncoder does not actually freeze the encoder. A simple snippet to check that a weight value changes during training:

import torch
from catalyst.contrib.models import ResnetEncoder

class Net(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        self.enc = ResnetEncoder(
            arch="resnet18",
            frozen=True,  # the argument under test
            pooling="GlobalAvgPool2d"
        )
        self.logits = torch.nn.Linear(
            self.enc.out_features, 1)
    
    def forward(self, x):
        return self.logits(self.enc(x))
    
    
model = Net()

old_value = model.enc.feature_net[0].weight[0,0,0,0].item()

inputs = torch.randn(8, 3, 224, 224)
targets = torch.ones(8, 1)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.BCEWithLogitsLoss()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

new_value = model.enc.feature_net[0].weight[0,0,0,0].item()

print(old_value, new_value)

It prints different values. This happens because nn.Module has no 'requires_grad' attribute; only nn.Parameter has it, so setting it on a module has no effect on training. I fixed it in my fork, but I don't see a 'dev' branch to open a PR against.
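
A minimal sketch of one way to freeze a module, assuming the goal is simply to stop its parameters from being updated. The `freeze` helper and the small `Conv2d` stand-in for the encoder are hypothetical; this is not the fix from the fork mentioned above.

```python
import torch


def freeze(module: torch.nn.Module) -> None:
    """Disable gradients for every parameter of `module` (hypothetical helper)."""
    for param in module.parameters():
        param.requires_grad = False


encoder = torch.nn.Conv2d(3, 4, kernel_size=3)  # stand-in for the real encoder
head = torch.nn.Linear(4, 1)

# Setting the flag on the module only creates a plain Python attribute;
# the parameters themselves are left untouched.
encoder.requires_grad = False
flags_after_module_attr = [p.requires_grad for p in encoder.parameters()]

# Setting the flag on each parameter is what actually freezes the module.
freeze(encoder)
flags_after_freeze = [p.requires_grad for p in encoder.parameters()]

old_value = encoder.weight[0, 0, 0, 0].item()

# Hand only the still-trainable parameters to the optimizer.
params = [p for m in (encoder, head) for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(params)

inputs = torch.randn(2, 3, 8, 8)
targets = torch.ones(2, 1)
features = encoder(inputs).mean(dim=(2, 3))  # global average pooling
loss = torch.nn.BCEWithLogitsLoss()(head(features), targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()

new_value = encoder.weight[0, 0, 0, 0].item()
print(old_value, new_value)
```

With the parameters frozen this way, the two printed values should be identical, in contrast to the check above.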

Segmentation quickstart notebook: TypeError: tensor is not a torch image.

I have several PNG images with corresponding masks.
Loading the data:

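import os

import numpy as np

# imread and resize are not shown in this snippet; they are assumed to be
# imported elsewhere (for example from scikit-image).
data_dir = './data_objects'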

def load_data(root_dir):
    data = []

    for stage in ['train']:
        for content in ['images', 'segmentation']:
            
            # construct path to each image
            directory = os.path.join(root_dir,  content)
            fps = [os.path.join(directory, filename) for filename in os.listdir(directory)]

            # read images

            images = [imread(filepath) for filepath in fps]
 
            # if images have different sizes you have to resize them before:
            resized_images = [resize(image, (64, 64)) for image in images]
            
            # stack to one np.array 
            np_images = np.stack(resized_images, axis=0)
            
            data.append(np_images)
            
    return data
x_train, y_train  = load_data(data_dir)
y_train = y_train.reshape(19, 64, 64, 1)

Making the data look like the data in the example notebook:

x_train, X_test, y_train, y_test = train_test_split(
         x_train, y_train, test_size=0.33, random_state=42)

train_data = list(zip(x_train, y_train))
valid_data = list(zip(X_test, y_test))

train_data[0][0].shape, train_data[0][1].shape, len(train_data)
(64, 64, 3), (64, 64, 1), 12

Calling train raises an error:

0/10 * Epoch (train):   0% 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-69a57e657d9d> in <module>()
     26     logdir=logdir,
     27     num_epochs=num_epochs,
---> 28     verbose=True
     29 )

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in train(self, model, criterion, optimizer, loaders, logdir, callbacks, scheduler, num_epochs, valid_loader, main_metric, minimize_metric, verbose, state_kwargs, check)
    269             state_kwargs=state_kwargs
    270         )
--> 271         self.run_experiment(experiment, check=check)
    272 
    273     def infer(

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in run_experiment(self, experiment, check)
    192         self.experiment = experiment
    193         for stage in self.experiment.stages:
--> 194             self._run_stage(stage)
    195         return self
    196 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_stage(self, stage)
    173 
    174             self._run_event("epoch_start")
--> 175             self._run_epoch(loaders)
    176             self._run_event("epoch_end")
    177 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_epoch(self, loaders)
    158             self._run_event("loader_start")
    159             with torch.set_grad_enabled(self.state.need_backward):
--> 160                 self._run_loader(loaders[loader_name])
    161             self._run_event("loader_end")
    162 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_loader(self, loader)
    119         self.state.timer.start("base/data_time")
    120 
--> 121         for i, batch in enumerate(loader):
    122             batch = self._batch2device(batch, self.device)
    123             self.state.timer.stop("base/data_time")

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638 
    639     next = __next__  # Python 2 compatibility

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660 

TypeError: Traceback (most recent call last):
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/dataset.py", line 74, in __getitem__
    dict_ = self.dict_transform(dict_)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 42, in __call__
    img = t(img)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/augmentor.py", line 25, in __call__
    ] = self.augment_fn(dict_[self.dict_key], **self.default_kwargs)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 118, in __call__
    return F.normalize(tensor, self.mean, self.std)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/functional.py", line 158, in normalize
    raise TypeError('tensor is not a torch image.')
TypeError: tensor is not a torch image.

I then tried adding more transforms, for example something like the following, but that did not fix the problem:

data_transform = transforms.Compose([
    Augmentor(
        dict_key="features",
        augment_fn=lambda x: \
            torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
    Augmentor(
        dict_key="features",
        augment_fn=transforms.ToPILImage()),  #transforms.ToTensor()),
    Augmentor(
        dict_key="features",
        augment_fn=transforms.Normalize(
            (0.5, 0.5, 0.5),
            (0.5, 0.5, 0.5))),
    Augmentor(
        dict_key="targets",
        augment_fn=lambda x: \
            torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
    Augmentor(
        dict_key="targets",
        augment_fn= transforms.ToPILImage())#transforms.ToTensor())
])

What is wrong here?
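
One possible cause, offered as a guess rather than a confirmed diagnosis: in torchvision 0.2.0, Normalize accepts only a 3-dimensional torch tensor in (C, H, W) layout. Calling unsqueeze_(0) on an (H, W, C) array yields a 4-dimensional tensor, and ToPILImage yields a PIL image, so neither would pass that check. The sketch below converts the arrays to (C, H, W) tensors by hand; both helper functions are hypothetical, and the normalization is written out in plain torch instead of relying on any particular torchvision version.

```python
import numpy as np
import torch


def hwc_to_chw_tensor(image: np.ndarray) -> torch.Tensor:
    """Convert an (H, W, C) array to a float32 (C, H, W) tensor (hypothetical helper)."""
    array = np.ascontiguousarray(image, dtype=np.float32)
    return torch.from_numpy(array).permute(2, 0, 1).contiguous()


def normalize(tensor: torch.Tensor, mean, std) -> torch.Tensor:
    """Per-channel normalization of a (C, H, W) tensor, written out by hand."""
    mean = torch.tensor(mean, dtype=tensor.dtype).view(-1, 1, 1)
    std = torch.tensor(std, dtype=tensor.dtype).view(-1, 1, 1)
    return (tensor - mean) / std


image = np.random.rand(64, 64, 3)  # same layout as train_data[0][0]
mask = np.random.rand(64, 64, 1)   # same layout as train_data[0][1]

features = normalize(hwc_to_chw_tensor(image), (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
targets = hwc_to_chw_tensor(mask)

# For comparison: the conversion used in the transforms above gives a 4-D tensor.
wrong = torch.from_numpy(image.astype(np.float32)).unsqueeze_(0)

print(features.shape, targets.shape, wrong.shape)
```

Whether the values also need dividing by 255 depends on what the resize step returns, which is not shown here.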

Out of memory during validation step

I got an OOM error during the validation step. Here is the log:

0/10 * Epoch (train): 100% 32/32 [00:41<00:00,  2.39s/it, _fps=11.486, loss=0.898] 
0/10 * Epoch (valid):   3% 1/32 [00:01<00:46,  1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory

My model uses only 80% of the GPU memory during training, yet it runs out of memory in the validation step. That is weird, since I thought validation consumes less memory than training.
I am not sure whether this is normal. My guess is that the GPU memory is not released before the validation step starts.

I also tried adding a callback to free GPU memory:

class FreeGPU(Callback):

    def on_stage_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_end(self, state):
        torch.cuda.empty_cache()

    def on_stage_end(self, state):
        torch.cuda.empty_cache()

    def on_epoch_start(self, state):
        torch.cuda.empty_cache()

    def on_epoch_end(self, state):
        torch.cuda.empty_cache()

It does not help at all.
Do you have any ideas?
P.S. This is not the first time I have faced this problem. The only way I can prevent it is by reducing the batch size, but that hurts performance.
