Comments (7)

geoalgo commented on June 9, 2024

Hi Andreas, thanks, this is a great suggestion and I agree it makes a lot of sense.
There are many use cases for this: as you say, runtimes can be very heterogeneous between epochs (even worse if you are searching across multiple model families), so it would be a valuable feature to have.

amueller commented on June 9, 2024

Hm, so even trying to implement this manually seems a bit trickier than I had anticipated.

The issue is that not all milestones will be covered by something time-based. I wrote a little simulation like this:

import os
import pickle
import time
from argparse import ArgumentParser
from pathlib import Path

import numpy as np

from syne_tune import Reporter

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--parameter', type=int)
    parser.add_argument('--st_checkpoint_dir', type=str, default=None)
    parser.add_argument('--epochs', type=int)  # ignored

    args = parser.parse_args()
    report = Reporter()
    start = 1
    extra_time = 0
    checkpoint_dir = args.st_checkpoint_dir
    if checkpoint_dir is not None:
        os.makedirs(checkpoint_dir, exist_ok=True)
        checkpoint_path = Path(checkpoint_dir) / "checkpoint.iter"
        if checkpoint_path.exists():
            # Resume: restore the last epoch and the accumulated training time
            with open(checkpoint_path, "rb") as f:
                state = pickle.load(f)
            start = state['epoch'] + 1
            extra_time = state['time']
    tick = time.time()

    for i in range(start, args.epochs):
        # Epoch duration grows with the parameter ...
        time.sleep(0.1 * (args.parameter + 1))
        current_time = time.time() - tick + extra_time
        # ... while per-epoch learning progress also grows with the parameter
        report(epoch=i, loss=np.exp(-1 / 1000 * (np.sqrt(args.parameter + 1) * i)), time=max(1, int(current_time)))
        if checkpoint_dir is not None:
            with open(checkpoint_path, "wb") as f:
                pickle.dump({'epoch': i, 'time': current_time}, f)

and a tuner like this:

import logging

from syne_tune import Tuner, StoppingCriterion
from syne_tune.backend import LocalBackend
from syne_tune.config_space import randint
from syne_tune.optimizer.baselines import MOBSTER

root = logging.getLogger()
root.setLevel(logging.INFO)

tuner_name = "test-timing-tune-time-7"

# hyperparameter search space to consider
config_space = {
    'parameter': randint(0, 100),
    'epochs': 1000,
}

tuner = Tuner(
    trial_backend=LocalBackend(entry_point='train_test_timing.py'),
    scheduler=MOBSTER(
        config_space,
        metric='loss',
        resource_attr='time',
        max_resource_attr="time",
        search_options={'debug_log': False},
        mode='min',
        type="promotion",
        grace_period=2,
    ),
    max_failures=1000,
    results_update_interval=60,
    print_update_interval=120,
    # stop_criterion=StoppingCriterion(max_wallclock_time=60 * 60 * 60),
    stop_criterion=StoppingCriterion(max_num_trials_started=5000),
    n_workers=4,  # how many trials are evaluated in parallel
    tuner_name=tuner_name,
    trial_backend_path=f"/synetune_checkpoints/{tuner_name}/",
)
tuner.run()

The problem is constructed so that the relationship between parameter and speed measured in epochs is the inverse of the relationship between parameter and speed measured in time: the best parameter measured in epochs is 100, while the best measured in wall-clock time is 0.
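A quick check of this construction, using the loss and sleep formulas from the script above: at a fixed epoch budget a larger parameter gives lower loss, while at a fixed wall-clock budget a smaller parameter wins, since the number of epochs completed in t seconds scales as t / (0.1 * (parameter + 1)). The budget values below are arbitrary illustrations:

import numpy as np

def loss_at_epoch(p, i):
    # Loss after i epochs with parameter p, as in the training script
    return np.exp(-np.sqrt(p + 1) * i / 1000)

def loss_at_time(p, t):
    i = t / (0.1 * (p + 1))  # epochs completed in t seconds
    return loss_at_epoch(p, i)

print(loss_at_epoch(100, 50) < loss_at_epoch(0, 50))  # True: p=100 is best per epoch
print(loss_at_time(0, 50) < loss_at_time(100, 50))    # True: p=0 is best per second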

Running this, I hit issues with some resource levels not receiving any data, and with a particular resource level potentially receiving more than one observation. If I comment out the two asserts, it seems to work, but I'm doubtful that's the right thing to do?

mseeger commented on June 9, 2024

Hello Andreas,

we decided at some point that resource levels, the entities our multi-fidelity schedulers use, should be discrete (int), because, as you noticed, we would like to compare them for equality ("this trial has reached that level").

Of course, we could have used inequality as well ("this trial has surpassed this level"), but we felt this would complicate things.

Many schedulers also rely on resource levels being equispaced (so moving from 1 to 3 takes as many resources/time as moving from 3 to 5), and ideally they should be comparable across trials.

Having said that, one could support something like a binning system that sits between the training function reporting time (float values) and the scheduler receiving int values (which bin does the time fall into).
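A minimal sketch of that binning idea, mapping the float time reported by the training function to a discrete resource level; the function name and the bin width are hypothetical, not existing Syne Tune API:

def time_to_resource(elapsed_seconds: float, bin_width: float = 60.0) -> int:
    # Map elapsed training time to a discrete, equispaced resource level
    return int(elapsed_seconds // bin_width) + 1  # levels start at 1

# The training script would then report the binned value, e.g.
# report(resource=time_to_resource(current_time), loss=...)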

mseeger commented on June 9, 2024

Having time as a resource also complicates pause-and-resume scheduling to some extent. Right now, for example, we support running ASHA promotion even if your script does not do checkpointing. In that case, say you resume from epoch 3: the script simply re-trains until epoch 3. If time were the resource unit, the script would have to train until a certain time and then resume. This is hard, while looping over epochs is easy in most scripts.
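A toy sketch (names hypothetical, not Syne Tune API) of why epoch-based resume works without checkpointing: the script re-trains from scratch and only reports epochs past the pause point, which has no exact analogue when the resource is elapsed time:

def run_trial(max_epochs: int, resume_from: int = 0) -> None:
    loss = 1.0
    for epoch in range(1, max_epochs + 1):
        loss *= 0.9  # stand-in for one epoch of training
        if epoch > resume_from:
            print(f"report(epoch={epoch}, loss={loss:.4f})")

run_trial(max_epochs=5, resume_from=3)  # silently re-trains epochs 1-3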

mseeger commented on June 9, 2024

I hope you are OK with me closing this issue for now, but feel free to re-open it as a feature request. It's just that this is not trivial, so we'd like to know that you really need it.

amueller commented on June 9, 2024

Yes, this is the main feature that I really need; without it, I'm not sure Syne Tune will be very useful to me. My main constraints are compute budget and time-to-results. Things like model size cannot really be searched over if you don't support time as a resource. I'm using MOBSTER with promotion, which, to my understanding, should be the most effective way to run it with multiple workers (I'm using 4). Please let me know if you have any other suggestions.

Without time as a resource, the tuner could start a process with a really deep or wide network and basically hog the GPU until it reached the end of the grace period, which could be... anything... a couple of days or weeks.
A priori, I don't really know how long each architecture will take per epoch, or how fast learning progress will be in terms of epochs. Constraining the search space to exclude very slow architectures would mean I can no longer use product search spaces and would have to put much more effort into search-space design, in a way that's not really possible a priori, since I don't know the learning speed of any of the architectures.

So with a promotion-based schedule, it's likely that after a while (say, two weeks) all the workers (I only have 4) will be stuck on something very long-running, and I'll have to kill them manually if I want to make any progress.

Maybe there's an alternative that resolves this problem without time as a resource?

I don't seem to be able to re-open the issue btw.

mseeger commented on June 9, 2024

Well, what I want to say is that supporting time as a resource is not that hard on our end (the schedulers); we'd just somehow bin time, and that would be easy.

But it would likely be a bit harder to support when writing the training script. First, you need to report in regular time intervals: say, you check a timer after every mini-batch update and report when it falls into a new bin. OK, that is still doable. And if the checkpoint stores the time spent training until it was saved, one can make it work. But the hard part all happens inside the training script.
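A minimal sketch of what that training loop might look like, assuming the binning scheme sketched above; BIN_WIDTH and the reporting details are hypothetical, not Syne Tune API:

import time

BIN_WIDTH = 60.0   # hypothetical: seconds per resource level
extra_time = 0.0   # on resume, restore this from the checkpoint
last_bin = int(extra_time // BIN_WIDTH)
tick = time.time()

for step in range(1000):
    time.sleep(0.01)  # stand-in for one mini-batch update
    elapsed = time.time() - tick + extra_time
    current_bin = int(elapsed // BIN_WIDTH)
    if current_bin > last_bin:
        last_bin = current_bin
        # report(resource=current_bin, loss=...), and save a checkpoint
        # that stores `elapsed` so a resumed trial continues the clock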

If you put in the work, we can certainly support binning time as a resource attribute.
