There seems to be a problem with the Hyperband promotion logic. How

Promotion Logic Bug about syne-tune HOT 16 CLOSED

awslabs commented on May 18, 2024

Promotion Logic Bug

from syne-tune.

Comments (16)

mseeger commented on May 18, 2024

I can give this a try

mseeger commented on May 18, 2024

Hi Martin,

when testing DEHB, I also ran into an issue like this, for FCNet. It does not happen when you use max_resource_attr and pass it to the simulator backend as well, but this is of course not a proper fix.

Can we sync, and you tell me what you observed so far?

mseeger commented on May 18, 2024

INFO:syne_tune.optimizer.schedulers.searchers.bayesopt.utils.debug_log:[0: random]
num_layers: 3
max_units: 181
batch_size: 91
learning_rate: 0.003162277660168382
weight_decay: 0.050005
momentum: 0.545
max_dropout: 0.5

INFO:syne_tune.optimizer.schedulers.searchers.bayesopt.utils.debug_log:[1: random]
num_layers: 1
max_units: 144
batch_size: 411
learning_rate: 0.009322270368292423
weight_decay: 0.05591940112979374
momentum: 0.7192215979996427
max_dropout: 0.11133237466616985

DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.93, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.93, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.94, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 65.91, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 88.05, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 110.15, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 132.23, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 154.31, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 176.51, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 198.60, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push CompleteEvent:time = 1063.64, trial_id = 0]

DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 30.35, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 40.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 50.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 60.40, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 70.41, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 80.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 90.36, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push CompleteEvent:time = 480.27, trial_id = 1]

These cumulative times for lcbench-airlines seem odd. They are virtually the same for epochs 1, 2, 3, as if epochs 2, 3 took no time at all, and epoch 1 quite a lot.

Note: The jump from the last OnTrialResultEvent to the CompleteEvent is there because I only print the first 10 OnTrialResultEvent entries per trial.

This should still not imply that any of them are skipped. But clearly, there seems to be some data issue.
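
One way to make the oddity concrete is to turn the cumulative times into per-epoch durations. The following is a minimal sketch, not code from Syne Tune: the helper names `epoch_durations` and `suspicious_epochs` and the tolerance of 0.05 seconds are assumptions chosen for illustration, and the input values are the times logged for trial_id = 0 above.

```python
from typing import List


def epoch_durations(cumulative: List[float]) -> List[float]:
    """Convert cumulative elapsed times into per-epoch durations."""
    durations = []
    previous = 0.0
    for t in cumulative:
        # Round to the two decimals that the log prints
        durations.append(round(t - previous, 2))
        previous = t
    return durations


def suspicious_epochs(cumulative: List[float], tol: float = 0.05) -> List[int]:
    """Return 1-based epochs that took at most `tol` seconds."""
    return [
        epoch
        for epoch, duration in enumerate(epoch_durations(cumulative), start=1)
        if duration <= tol
    ]


# Times of the first 10 OnTrialResultEvent entries for trial_id = 0
trial0 = [43.93, 43.93, 43.94, 65.91, 88.05, 110.15, 132.23, 154.31, 176.51, 198.60]

print(suspicious_epochs(trial0))  # [2, 3]
```

Under these assumptions, epochs 2 and 3 are flagged, while every later epoch takes roughly 22 seconds.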

mseeger commented on May 18, 2024

OK, there are two things here.

First, there seem to be data errors in some of our blackboxes. For example, in lcbench-airlines, results for the first 3 epochs return the same value for metric_elapsed_time.

Martin, can you confirm this? I hope this is not some other bug somewhere.

Second, the simulator backend should still work correctly even if some epochs take no time at all, and it does not. Fixing this is a bit more tricky: the simulator backend simulates behaviour with checkpointing without actually storing info about the trials. In particular, we need to know from which resource to resume a trial.

mseeger commented on May 18, 2024

I should say that everything works fine if you use max_resource_attr instead of max_t with HyperbandScheduler, and also pass max_resource_attr to the simulator backend. The value of this field is often "epochs".

To me, this is anyway the preferred way of doing things. The training script usually has information about the maximum resource in its config, and both scheduler and backend should get this info from there.

If this is used, the backend knows how many resources a trial will emit when it gets started, and this prevents this bug from happening.

mseeger commented on May 18, 2024

OK, so #317 fixes this, in the sense that code will run. But the data errors in the BB tables remain.

mseeger commented on May 18, 2024

I am closing this issue, but instead we have another one, namely validating (and maybe fixing) data in lcbench, and maybe also in fcnet.

The issue is that for many configs (maybe for all?), the elapsed_time values in the first 3 epochs are identical or very close to identical.

wistuba commented on May 18, 2024

I can't find the new issue so I'll leave my comment here: I can confirm that apparently the first epoch requires much more time while the following two don't need any. I could not find anything in the raw data which would explain this behavior.

mseeger commented on May 18, 2024

OK, I keep getting this error that the elapsed_time (called "time") and also "val_accuracy" are exactly the same for the first 3 epochs.

This does not happen for all configs sampled, but for the large majority of them.

I pushed code with debug outputs to the branch debug_martin.

mseeger commented on May 18, 2024

Debug code here: https://github.com/awslabs/syne-tune/blob/debug_martin/syne_tune/blackbox_repository/simulated_tabular_backend.py#L156
And here: https://github.com/awslabs/syne-tune/blob/debug_martin/syne_tune/blackbox_repository/utils.py#L54

mseeger commented on May 18, 2024

If I then run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines

I get output like this:

Index(['val_accuracy', 'time'], dtype='object')
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 1}
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 2}
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 3}
{'val_accuracy': 55.08092, 'time': 50.357838, 'epoch': 4}
{'val_accuracy': 55.907024, 'time': 67.25511, 'epoch': 5}
{'val_accuracy': 56.68802, 'time': 84.19331, 'epoch': 6}
{'val_accuracy': 57.218803, 'time': 101.09334, 'epoch': 7}
{'val_accuracy': 57.582386, 'time': 118.04459, 'epoch': 8}
{'val_accuracy': 57.81063, 'time': 135.03284, 'epoch': 9}
{'val_accuracy': 57.95385, 'time': 152.08, 'epoch': 10}
{'val_accuracy': 58.067047, 'time': 169.16852, 'epoch': 11}
{'val_accuracy': 58.141846, 'time': 186.28845, 'epoch': 12}
{'val_accuracy': 58.18696, 'time': 203.40231, 'epoch': 13}
{'val_accuracy': 58.220665, 'time': 220.46123, 'epoch': 14}
{'val_accuracy': 58.24464, 'time': 237.45142, 'epoch': 15}
{'val_accuracy': 58.243977, 'time': 254.37263, 'epoch': 16}
{'val_accuracy': 58.246155, 'time': 271.27945, 'epoch': 17}
{'val_accuracy': 58.26594, 'time': 288.19318, 'epoch': 18}
{'val_accuracy': 58.282715, 'time': 305.1667, 'epoch': 19}
{'val_accuracy': 58.286236, 'time': 322.18546, 'epoch': 20}
{'val_accuracy': 58.30233, 'time': 339.26794, 'epoch': 21}
{'val_accuracy': 58.31726, 'time': 356.34616, 'epoch': 22}
{'val_accuracy': 58.31894, 'time': 373.41287, 'epoch': 23}
{'val_accuracy': 58.321625, 'time': 390.44696, 'epoch': 24}
{'val_accuracy': 58.326984, 'time': 407.4664, 'epoch': 25}
{'val_accuracy': 58.333534, 'time': 424.4817, 'epoch': 26}
{'val_accuracy': 58.336884, 'time': 441.51205, 'epoch': 27}
{'val_accuracy': 58.336548, 'time': 458.55728, 'epoch': 28}
{'val_accuracy': 58.336548, 'time': 475.6127, 'epoch': 29}
{'val_accuracy': 58.338898, 'time': 492.60458, 'epoch': 30}
{'val_accuracy': 58.338726, 'time': 509.54263, 'epoch': 31}
{'val_accuracy': 58.341076, 'time': 526.4424, 'epoch': 32}
{'val_accuracy': 58.34728, 'time': 543.30194, 'epoch': 33}
{'val_accuracy': 58.353485, 'time': 560.1487, 'epoch': 34}
{'val_accuracy': 58.3612, 'time': 577.0582, 'epoch': 35}
{'val_accuracy': 58.36808, 'time': 594.0279, 'epoch': 36}
{'val_accuracy': 58.373608, 'time': 611.07294, 'epoch': 37}
{'val_accuracy': 58.375286, 'time': 628.1597, 'epoch': 38}
{'val_accuracy': 58.3778, 'time': 645.2522, 'epoch': 39}
{'val_accuracy': 58.379982, 'time': 662.3246, 'epoch': 40}
{'val_accuracy': 58.383003, 'time': 679.3447, 'epoch': 41}
{'val_accuracy': 58.385017, 'time': 696.27783, 'epoch': 42}
{'val_accuracy': 58.388203, 'time': 713.1839, 'epoch': 43}
{'val_accuracy': 58.390717, 'time': 730.07983, 'epoch': 44}
{'val_accuracy': 58.39256, 'time': 746.9867, 'epoch': 45}
{'val_accuracy': 58.391724, 'time': 763.9096, 'epoch': 46}
{'val_accuracy': 58.391724, 'time': 780.819, 'epoch': 47}
{'val_accuracy': 58.391216, 'time': 797.7084, 'epoch': 48}
{'val_accuracy': 58.39038, 'time': 814.5716, 'epoch': 49}
{'val_accuracy': 58.39038, 'time': 814.57153, 'epoch': 50}
{'val_accuracy': 58.39038, 'time': 814.5716, 'epoch': 51}
INFO:syne_tune.blackbox_repository.simulated_tabular_backend:Trial 23: Fetching results:
r=1, elapsed_time = 33.525
r=2, elapsed_time = 33.525
r=3, elapsed_time = 33.525
r=4, elapsed_time = 50.358
r=5, elapsed_time = 67.255
r=6, elapsed_time = 84.193
r=7, elapsed_time = 101.093
r=8, elapsed_time = 118.045
r=9, elapsed_time = 135.033
r=10, elapsed_time = 152.080
r=11, elapsed_time = 169.169
r=12, elapsed_time = 186.288
r=13, elapsed_time = 203.402
r=14, elapsed_time = 220.461
r=15, elapsed_time = 237.451
r=16, elapsed_time = 254.373
r=17, elapsed_time = 271.279
r=18, elapsed_time = 288.193
r=19, elapsed_time = 305.167
r=20, elapsed_time = 322.185
r=21, elapsed_time = 339.268
r=22, elapsed_time = 356.346
r=23, elapsed_time = 373.413
r=24, elapsed_time = 390.447
r=25, elapsed_time = 407.466
r=26, elapsed_time = 424.482
r=27, elapsed_time = 441.512
r=28, elapsed_time = 458.557
r=29, elapsed_time = 475.613
r=30, elapsed_time = 492.605
r=31, elapsed_time = 509.543
r=32, elapsed_time = 526.442
r=33, elapsed_time = 543.302
r=34, elapsed_time = 560.149
r=35, elapsed_time = 577.058
r=36, elapsed_time = 594.028
r=37, elapsed_time = 611.073
r=38, elapsed_time = 628.160
r=39, elapsed_time = 645.252
r=40, elapsed_time = 662.325
r=41, elapsed_time = 679.345
r=42, elapsed_time = 696.278
r=43, elapsed_time = 713.184
r=44, elapsed_time = 730.080
r=45, elapsed_time = 746.987
r=46, elapsed_time = 763.910
r=47, elapsed_time = 780.819
r=48, elapsed_time = 797.708
r=49, elapsed_time = 814.572
r=50, elapsed_time = 814.572
r=51, elapsed_time = 814.572

mseeger commented on May 18, 2024

Interestingly, for this record, the values are the same for r=1,2,3 and also for r=49,50,51.

Maybe whoever created this data did some funny "imputing"?

mseeger commented on May 18, 2024

This "r=1,2,3" and "r=49,50,51" being the same happens for all sorts of other configs as well.
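
A check along these lines could be used to scan other configs for the same pattern. This is only a sketch: the function name `repeated_runs` and the rounding to 3 decimals are assumptions, and the records are an excerpt (epochs 1 to 5 and 47 to 51) of the Trial 23 output shown earlier, not the full curve.

```python
from itertools import groupby


def repeated_runs(records, keys=("val_accuracy", "time"), ndigits=3):
    """Find runs of consecutive records whose values agree after rounding.

    Returns a list of (first_epoch, last_epoch) tuples, one per run of
    length >= 2. Records are assumed to be sorted by epoch.
    """

    def signature(record):
        return tuple(round(record[key], ndigits) for key in keys)

    runs = []
    for _, group in groupby(records, key=signature):
        group = list(group)
        if len(group) > 1:
            runs.append((group[0]["epoch"], group[-1]["epoch"]))
    return runs


# Excerpt of the records printed for Trial 23
records = [
    {"val_accuracy": 54.633484, "time": 33.52457, "epoch": 1},
    {"val_accuracy": 54.633484, "time": 33.52457, "epoch": 2},
    {"val_accuracy": 54.633484, "time": 33.52457, "epoch": 3},
    {"val_accuracy": 55.08092, "time": 50.357838, "epoch": 4},
    {"val_accuracy": 55.907024, "time": 67.25511, "epoch": 5},
    {"val_accuracy": 58.391724, "time": 780.819, "epoch": 47},
    {"val_accuracy": 58.391216, "time": 797.7084, "epoch": 48},
    {"val_accuracy": 58.39038, "time": 814.5716, "epoch": 49},
    {"val_accuracy": 58.39038, "time": 814.57153, "epoch": 50},
    {"val_accuracy": 58.39038, "time": 814.5716, "epoch": 51},
]

print(repeated_runs(records))  # [(1, 3), (49, 51)]
```

Comparing both fields at once avoids flagging epochs where only val_accuracy happens to repeat while the elapsed time still advances.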

mseeger commented on May 18, 2024

OK, I confirm Martin's observation that these errors are not in the raw data.

My best guess is this has something to do with the surrogate regression. This should really only be done with the config as input, so that a record (config1, resource1) can only be interpolated with data from the same resource level.

mseeger commented on May 18, 2024

A safe way of making sure things work would be to use surrogate regression for each resource level separately.
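
To illustrate what "separately per resource level" could look like, here is a toy sketch. It is not the surrogate used in the blackbox repository: the class `PerLevelNearestNeighbour` is hypothetical, a plain 1-nearest-neighbour lookup stands in for whatever regressor is actually used, and the rows are made-up numbers rather than blackbox data. The only point is that a prediction at resource r can draw on data from level r alone.

```python
from collections import defaultdict
from math import dist


class PerLevelNearestNeighbour:
    """Hypothetical surrogate: one nearest-neighbour table per resource level."""

    def __init__(self):
        # resource level -> list of (config_vector, value)
        self._by_level = defaultdict(list)

    def fit(self, rows):
        """rows: iterable of (config_vector, resource, value)."""
        for config, resource, value in rows:
            self._by_level[resource].append((tuple(config), value))
        return self

    def predict(self, config, resource):
        """Value of the closest config seen at exactly this resource level."""
        candidates = self._by_level.get(resource)
        if not candidates:
            raise KeyError(f"no data at resource level {resource}")
        _, value = min(candidates, key=lambda item: dist(item[0], config))
        return value


# Toy rows (config_vector, resource, value); the numbers are invented
rows = [
    ((0.1, 0.5), 1, 10.0),
    ((0.1, 0.5), 2, 20.0),
    ((0.9, 0.9), 1, 11.0),
    ((0.9, 0.9), 2, 22.0),
]
surrogate = PerLevelNearestNeighbour().fit(rows)

print(surrogate.predict((0.2, 0.5), resource=2))  # 20.0
```

Because the resource never enters the distance computation, a query at level 2 cannot be answered with a value recorded at level 1 or 3, whatever the scaling of the inputs.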

mseeger commented on May 18, 2024

The bug as stated has been resolved, but the follow-up data error has been moved to #319.
