
Comments (16)

mseeger commented on May 18, 2024

I can give this a try

from syne-tune.

mseeger commented on May 18, 2024

Hi Martin,

when testing DEHB, I also ran into an issue like this, for FCNet. It does not happen when you use max_resource_attr and pass it to the simulator backend as well, but this is of course not a proper fix.

Can we sync, and you tell me what you observed so far?

mseeger commented on May 18, 2024

INFO:syne_tune.optimizer.schedulers.searchers.bayesopt.utils.debug_log:[0: random]
num_layers: 3
max_units: 181
batch_size: 91
learning_rate: 0.003162277660168382
weight_decay: 0.050005
momentum: 0.545
max_dropout: 0.5

INFO:syne_tune.optimizer.schedulers.searchers.bayesopt.utils.debug_log:[1: random]
num_layers: 1
max_units: 144
batch_size: 411
learning_rate: 0.009322270368292423
weight_decay: 0.05591940112979374
momentum: 0.7192215979996427
max_dropout: 0.11133237466616985

DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.93, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.93, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 43.94, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 65.91, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 88.05, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 110.15, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 132.23, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 154.31, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 176.51, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 198.60, trial_id = 0]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push CompleteEvent:time = 1063.64, trial_id = 0]

DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 20.28, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 30.35, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 40.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 50.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 60.40, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 70.41, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 80.39, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push OnTrialResultEvent:time = 90.36, trial_id = 1]
DEBUG:syne_tune.backend.simulator_backend.simulator_backend:[push CompleteEvent:time = 480.27, trial_id = 1]

These cumulative times for lcbench-airlines seem odd. They are virtually the same for epochs 1, 2, 3, as if epochs 2 and 3 took no time at all and epoch 1 took quite a lot.

Note: The jump from the last OnTrialResultEvent to the CompleteEvent appears because I only print the first 10 OnTrialResultEvent entries per trial.

Identical times should still not cause any results to be skipped. But clearly, there seems to be a data issue.
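
A minimal sketch of how such epochs could be flagged, assuming nothing more than a list of cumulative times. The helper names and the 0.1 second threshold are hypothetical and not part of Syne Tune; the numbers are the first five event times of trial 0 from the log above.

```python
def epoch_durations(cumulative_times):
    """Turn cumulative elapsed times into per-epoch durations."""
    previous = 0.0
    durations = []
    for t in cumulative_times:
        durations.append(t - previous)
        previous = t
    return durations


def suspicious_epochs(cumulative_times, min_duration=0.1):
    """Return 1-based epochs whose duration is below min_duration.

    min_duration is an arbitrary threshold chosen for illustration.
    """
    return [
        epoch
        for epoch, d in enumerate(epoch_durations(cumulative_times), start=1)
        if d < min_duration
    ]


# First five cumulative times of trial 0, copied from the debug log
trial_0 = [43.93, 43.93, 43.94, 65.91, 88.05]
print(suspicious_epochs(trial_0))
```

For these values, epochs 2 and 3 are flagged, which matches the observation that they seem to take no time at all.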

mseeger commented on May 18, 2024

OK, there are two things here.

First, there seem to be data errors in some of our blackboxes. For example, in lcbench-airlines, results for the first 3 epochs return the same value for metric_elapsed_time.

Martin, can you confirm this? I hope this is not some other bug somewhere.

Second, the simulator backend should still work correctly even if some epochs take no time at all, and it does not. Fixing this is a bit more tricky. It relates to the fact that the simulator backend simulates checkpointing behaviour without actually storing info about the trials. In particular, we need to know from which resource to resume a trial.
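
A toy illustration of why the resume level matters when epochs can take zero time. This is not the Syne Tune implementation; both functions are hypothetical and only assume a list of cumulative times per epoch (here the first values of trial 1 from the log above). Guessing the resume level from elapsed time alone becomes ambiguous, while passing it explicitly does not.

```python
def infer_resume_level(cumulative_times, elapsed_at_pause):
    """Fragile approach: guess the resume level from elapsed time alone.

    Returns every 1-based epoch whose cumulative time matches exactly.
    """
    return [
        r for r, t in enumerate(cumulative_times, start=1)
        if t == elapsed_at_pause
    ]


def events_after_resume(cumulative_times, resume_from, start_time):
    """Simulated (time, resource) events for a trial resumed after
    `resume_from` completed epochs, starting at simulated time `start_time`.
    """
    offset = cumulative_times[resume_from - 1] if resume_from > 0 else 0.0
    return [
        (start_time + t - offset, r)
        for r, t in enumerate(cumulative_times, start=1)
        if r > resume_from
    ]


cumulative = [20.28, 20.28, 20.28, 30.35, 40.39]

# Three epochs share the same cumulative time, so the guess is ambiguous
print(infer_resume_level(cumulative, 20.28))

# With an explicit resume level, epochs 2 and 3 are still reported
for time, resource in events_after_resume(cumulative, resume_from=1, start_time=100.0):
    print(f"r={resource}, time={time:.2f}")
```

With the explicit level, the zero-duration epochs simply produce events at the same simulated time instead of being dropped.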

mseeger commented on May 18, 2024

I should say that everything works fine if you use max_resource_attr instead of max_t with HyperbandScheduler, and also pass max_resource_attr to the simulator backend. The value of this field is often "epochs".

To me, this is the preferred way of doing things anyway. The training script usually has information about the maximum resource in its config, and both scheduler and backend should get this info from there.

If this is used, the backend knows how many resources a trial will emit when it gets started, and this prevents this bug from happening.

mseeger commented on May 18, 2024

OK, so #317 fixes this, in the sense that the code will run. But the data errors in the blackbox (BB) tables remain.

mseeger commented on May 18, 2024

I am closing this issue, but instead we have another one, namely validating (and maybe fixing) data in lcbench, and maybe also in fcnet.

The issue is that for many configs (maybe for all?), the elapsed_time values in the first 3 epochs are identical or very close to identical.

wistuba commented on May 18, 2024

I can't find the new issue, so I'll leave my comment here: I can confirm that the first epoch apparently requires much more time, while the following two don't need any. I could not find anything in the raw data that would explain this behavior.

mseeger commented on May 18, 2024

OK, I keep seeing the same problem: the elapsed_time (called "time") and also "val_accuracy" are exactly the same for the first 3 epochs.

This does not happen for all configs sampled, but for the large majority of them.

I pushed code with debug outputs to branch debug_martin.

mseeger commented on May 18, 2024

Debug code here: https://github.com/awslabs/syne-tune/blob/debug_martin/syne_tune/blackbox_repository/simulated_tabular_backend.py#L156
And here: https://github.com/awslabs/syne-tune/blob/debug_martin/syne_tune/blackbox_repository/utils.py#L54

mseeger commented on May 18, 2024

If I then run

python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines

I get output like this:

Index(['val_accuracy', 'time'], dtype='object')
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 1}
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 2}
{'val_accuracy': 54.633484, 'time': 33.52457, 'epoch': 3}
{'val_accuracy': 55.08092, 'time': 50.357838, 'epoch': 4}
{'val_accuracy': 55.907024, 'time': 67.25511, 'epoch': 5}
{'val_accuracy': 56.68802, 'time': 84.19331, 'epoch': 6}
{'val_accuracy': 57.218803, 'time': 101.09334, 'epoch': 7}
{'val_accuracy': 57.582386, 'time': 118.04459, 'epoch': 8}
{'val_accuracy': 57.81063, 'time': 135.03284, 'epoch': 9}
{'val_accuracy': 57.95385, 'time': 152.08, 'epoch': 10}
{'val_accuracy': 58.067047, 'time': 169.16852, 'epoch': 11}
{'val_accuracy': 58.141846, 'time': 186.28845, 'epoch': 12}
{'val_accuracy': 58.18696, 'time': 203.40231, 'epoch': 13}
{'val_accuracy': 58.220665, 'time': 220.46123, 'epoch': 14}
{'val_accuracy': 58.24464, 'time': 237.45142, 'epoch': 15}
{'val_accuracy': 58.243977, 'time': 254.37263, 'epoch': 16}
{'val_accuracy': 58.246155, 'time': 271.27945, 'epoch': 17}
{'val_accuracy': 58.26594, 'time': 288.19318, 'epoch': 18}
{'val_accuracy': 58.282715, 'time': 305.1667, 'epoch': 19}
{'val_accuracy': 58.286236, 'time': 322.18546, 'epoch': 20}
{'val_accuracy': 58.30233, 'time': 339.26794, 'epoch': 21}
{'val_accuracy': 58.31726, 'time': 356.34616, 'epoch': 22}
{'val_accuracy': 58.31894, 'time': 373.41287, 'epoch': 23}
{'val_accuracy': 58.321625, 'time': 390.44696, 'epoch': 24}
{'val_accuracy': 58.326984, 'time': 407.4664, 'epoch': 25}
{'val_accuracy': 58.333534, 'time': 424.4817, 'epoch': 26}
{'val_accuracy': 58.336884, 'time': 441.51205, 'epoch': 27}
{'val_accuracy': 58.336548, 'time': 458.55728, 'epoch': 28}
{'val_accuracy': 58.336548, 'time': 475.6127, 'epoch': 29}
{'val_accuracy': 58.338898, 'time': 492.60458, 'epoch': 30}
{'val_accuracy': 58.338726, 'time': 509.54263, 'epoch': 31}
{'val_accuracy': 58.341076, 'time': 526.4424, 'epoch': 32}
{'val_accuracy': 58.34728, 'time': 543.30194, 'epoch': 33}
{'val_accuracy': 58.353485, 'time': 560.1487, 'epoch': 34}
{'val_accuracy': 58.3612, 'time': 577.0582, 'epoch': 35}
{'val_accuracy': 58.36808, 'time': 594.0279, 'epoch': 36}
{'val_accuracy': 58.373608, 'time': 611.07294, 'epoch': 37}
{'val_accuracy': 58.375286, 'time': 628.1597, 'epoch': 38}
{'val_accuracy': 58.3778, 'time': 645.2522, 'epoch': 39}
{'val_accuracy': 58.379982, 'time': 662.3246, 'epoch': 40}
{'val_accuracy': 58.383003, 'time': 679.3447, 'epoch': 41}
{'val_accuracy': 58.385017, 'time': 696.27783, 'epoch': 42}
{'val_accuracy': 58.388203, 'time': 713.1839, 'epoch': 43}
{'val_accuracy': 58.390717, 'time': 730.07983, 'epoch': 44}
{'val_accuracy': 58.39256, 'time': 746.9867, 'epoch': 45}
{'val_accuracy': 58.391724, 'time': 763.9096, 'epoch': 46}
{'val_accuracy': 58.391724, 'time': 780.819, 'epoch': 47}
{'val_accuracy': 58.391216, 'time': 797.7084, 'epoch': 48}
{'val_accuracy': 58.39038, 'time': 814.5716, 'epoch': 49}
{'val_accuracy': 58.39038, 'time': 814.57153, 'epoch': 50}
{'val_accuracy': 58.39038, 'time': 814.5716, 'epoch': 51}
INFO:syne_tune.blackbox_repository.simulated_tabular_backend:Trial 23: Fetching results:
r=1, elapsed_time = 33.525
r=2, elapsed_time = 33.525
r=3, elapsed_time = 33.525
r=4, elapsed_time = 50.358
r=5, elapsed_time = 67.255
r=6, elapsed_time = 84.193
r=7, elapsed_time = 101.093
r=8, elapsed_time = 118.045
r=9, elapsed_time = 135.033
r=10, elapsed_time = 152.080
r=11, elapsed_time = 169.169
r=12, elapsed_time = 186.288
r=13, elapsed_time = 203.402
r=14, elapsed_time = 220.461
r=15, elapsed_time = 237.451
r=16, elapsed_time = 254.373
r=17, elapsed_time = 271.279
r=18, elapsed_time = 288.193
r=19, elapsed_time = 305.167
r=20, elapsed_time = 322.185
r=21, elapsed_time = 339.268
r=22, elapsed_time = 356.346
r=23, elapsed_time = 373.413
r=24, elapsed_time = 390.447
r=25, elapsed_time = 407.466
r=26, elapsed_time = 424.482
r=27, elapsed_time = 441.512
r=28, elapsed_time = 458.557
r=29, elapsed_time = 475.613
r=30, elapsed_time = 492.605
r=31, elapsed_time = 509.543
r=32, elapsed_time = 526.442
r=33, elapsed_time = 543.302
r=34, elapsed_time = 560.149
r=35, elapsed_time = 577.058
r=36, elapsed_time = 594.028
r=37, elapsed_time = 611.073
r=38, elapsed_time = 628.160
r=39, elapsed_time = 645.252
r=40, elapsed_time = 662.325
r=41, elapsed_time = 679.345
r=42, elapsed_time = 696.278
r=43, elapsed_time = 713.184
r=44, elapsed_time = 730.080
r=45, elapsed_time = 746.987
r=46, elapsed_time = 763.910
r=47, elapsed_time = 780.819
r=48, elapsed_time = 797.708
r=49, elapsed_time = 814.572
r=50, elapsed_time = 814.572
r=51, elapsed_time = 814.572

mseeger commented on May 18, 2024

Interestingly, for this record, the values are the same for r=1,2,3 and also for r=49,50,51.

Maybe whoever created this data did some funny "imputing"?

mseeger commented on May 18, 2024

This "r=1,2,3" and "r=49,50,51" being the same happens for all sorts of other configs as well.
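
A rough sketch of how such flat runs could be found automatically. The function name and the tolerance are hypothetical; the records are (epoch, time) pairs for epochs 1 to 5 and 47 to 51 of the record printed above. A small absolute tolerance is used because the printed values for epochs 49 to 51 differ in the last digits.

```python
import math


def flat_runs(records, abs_tol=1e-3):
    """Find runs of at least two consecutive records whose time is unchanged.

    records: list of (epoch, time) pairs, in order.
    Returns a list of (first_epoch, last_epoch) pairs.
    """
    runs = []
    start = 0
    for i in range(1, len(records) + 1):
        same = i < len(records) and math.isclose(
            records[i][1], records[i - 1][1], rel_tol=0.0, abs_tol=abs_tol
        )
        if not same:
            # The run records[start:i] ends here; keep it if it has 2+ entries
            if i - start >= 2:
                runs.append((records[start][0], records[i - 1][0]))
            start = i
    return runs


records = [
    (1, 33.52457), (2, 33.52457), (3, 33.52457), (4, 50.357838), (5, 67.25511),
    (47, 780.819), (48, 797.7084), (49, 814.5716), (50, 814.57153), (51, 814.5716),
]
print(flat_runs(records))  # [(1, 3), (49, 51)]
```

Running the same check over many configs would show how widespread the pattern is.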

mseeger commented on May 18, 2024

OK, I confirm Martin's observation that these errors are not in the raw data.

My best guess is this has something to do with the surrogate regression. This should really only be done with the config as input, so that a record (config1, resource1) can only be interpolated with data from the same resource level.

mseeger commented on May 18, 2024

A safe way of making sure things work would be to use surrogate regression for each resource level separately.
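
A toy sketch of the per-resource-level idea, under loud assumptions: a 1-nearest-neighbour lookup stands in for whatever surrogate regressor is really used, the function names are hypothetical, and the data values are made up. The point is only that a prediction for (config, resource) never sees rows from another resource level.

```python
from collections import defaultdict


def fit_per_resource(rows):
    """Group (config, resource, value) rows by resource level.

    One independent "model" (here just the stored rows) per level.
    """
    by_resource = defaultdict(list)
    for config, resource, value in rows:
        by_resource[resource].append((config, value))
    return dict(by_resource)


def predict(model, config, resource):
    """1-nearest-neighbour prediction using only rows of the same level."""
    def squared_distance(other):
        return sum((a - b) ** 2 for a, b in zip(config, other))

    _, nearest_value = min(
        model[resource], key=lambda pair: squared_distance(pair[0])
    )
    return nearest_value


# Made-up elapsed times for two configs at resource levels 1 and 2
rows = [
    ((0.1, 3.0), 1, 10.0), ((0.1, 3.0), 2, 20.0),
    ((0.9, 1.0), 1, 5.0), ((0.9, 1.0), 2, 10.0),
]
model = fit_per_resource(rows)
print(predict(model, (0.2, 3.0), 1))  # 10.0
print(predict(model, (0.2, 3.0), 2))  # 20.0
```

Because each level is fitted on its own, identical values across neighbouring resource levels cannot be introduced by interpolating over the resource axis.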

mseeger commented on May 18, 2024

The bug as stated has been resolved, but the follow-up data error has moved to #319.
