dehb's Issues

Updating and populating the documentation

Currently our documentation consists solely of content extracted from docstrings. A landing page with a project overview, installation instructions, and perhaps a quick usage example would greatly improve it.

Handling min_budget = max_budget situation

When specifying the minimum budget equal to the maximum budget, DEHB performs plain DE. There are two potential ways of dealing with this:

  1. Raise an error telling the user that this setup would result in plain DE and therefore will not be run, or
  2. Issue a warning and perform DE with a population consisting of one individual on the max_budget. However, only random configs are used for mutation.

Personally, I would prefer option 2, since I do not feel the need to restrict the user. However, we should communicate that this setup does not perform DEHB.

Edit: Currently we break and throw an error. However if we skip the assertion it runs through normally --> TODO: Check what happens under the hood
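
Option 2 could look roughly like the sketch below. The helper name `check_budgets` and its boolean return value are hypothetical and not part of DEHB; the sketch only shows a warning taking the place of a hard failure.

```python
import warnings


def check_budgets(min_budget: float, max_budget: float) -> bool:
    """Return True if more than one fidelity level is available.

    Hypothetical helper, not part of the DEHB API.
    """
    if min_budget > max_budget:
        raise ValueError("min_budget must not exceed max_budget")
    if min_budget == max_budget:
        # Warn instead of failing: the run degenerates to plain DE.
        warnings.warn(
            "min_budget == max_budget: only one fidelity is available, "
            "so the run falls back to plain DE on max_budget.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

The caller could use the return value to decide whether to build brackets or to run a single DE population.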

Adhering to computation cost budget better

The current implementation waits for all started jobs when the runtime budget is exhausted. This does make sense when using function evaluations or number of iterations as budget, but not when specifying the maximum computation cost in seconds.

Toy failure mode:
The computational budget is 1h, but a new job that would take, say, 30 mins is submitted after 59 mins of optimization. The optimizer would then wait for this job to finish and therefore overshoot the maximum computational budget of 1h.

For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted. However, this could waste the compute time already spent on unfinished jobs. It might therefore also be interesting to think of a way to checkpoint the optimizer's state in order to resume training.
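
The quick fix can be illustrated with the standard library alone. The sketch below uses `concurrent.futures` instead of Dask, and `run_with_deadline`, `quick_job` and `slow_job` are made-up names; it only shows the idea of waiting for the remaining time rather than for every started job.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_with_deadline(jobs, budget_seconds, n_workers=2):
    """Run callables in parallel, but stop collecting once the budget is spent."""
    deadline = time.monotonic() + budget_seconds
    executor = ThreadPoolExecutor(max_workers=n_workers)
    futures = [executor.submit(job) for job in jobs]
    # Wait only for the time that is left, not for every started job.
    remaining = max(0.0, deadline - time.monotonic())
    done, not_done = wait(futures, timeout=remaining)
    # Drop queued jobs; threads that are already running cannot be killed.
    executor.shutdown(wait=False, cancel_futures=True)
    return sorted(f.result() for f in done), len(not_done)


def quick_job():
    time.sleep(0.01)
    return "quick"


def slow_job():
    time.sleep(1.0)  # stands in for the 30-minute job from the example
    return "slow"


finished, abandoned = run_with_deadline([quick_job, slow_job], budget_seconds=0.2)
```

The result of the abandoned job is simply lost here, which is exactly the wasted compute mentioned above.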

Using Dask ```Client``` as context manager

As described by @eddiebergman in #45:

Relying on __del__ for cleanup is usually bad practice and can lead to problems if DEHB is the last thing to be cleaned up.
The correct way is to use Dask Clients as context managers, which is likely what you want around your hot loop. In the case of a user-supplied Dask client, I'm not sure what you want; it might be bad form to automatically shut down their client unless they've specified it somehow.

Before implementing this, we should definitely discuss how we should handle the situation where the user supplies their client. I would agree with Eddie, that simply shutting it down might be bad form.
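
One possible ownership rule is "close only what you created". The sketch below uses a stub class instead of a real Dask client so that it runs without Dask; `StubClient` and `managed_client` are hypothetical names, and the only behaviour assumed of the real client is that it has a `close()` method.

```python
from contextlib import contextmanager


class StubClient:
    """Stand-in for a Dask client; only tracks whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def managed_client(user_client=None, factory=StubClient):
    """Yield a client and close it on exit only if it was created here."""
    owns_client = user_client is None
    client = factory() if owns_client else user_client
    try:
        yield client
    finally:
        if owns_client:
            client.close()  # a user-supplied client is left running
```

Wrapping the optimization loop in `with managed_client(user_client) as client:` would then remove the need for `__del__`.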

[Question] Understanding how the parent pool is generated

In the part of mutation when the parent pool is filled with candidates generated using the global population, it appears to me this is carried out in the following steps:

  1. Promote the top K individuals from the lower budget
  2. Set M equal to the number of additional candidates that need to be generated for the chosen mutation strategy (e.g. 3 - K in the case of rand1)
  3. Construct the "global population" by taking the union of all budget subpopulations
  4. Distil the global population to a smaller size M by performing DE mutation M times using the global population, and then append each mutant vector to the parent pool
  5. Perform mutation once using the parent pool

4.2 of the paper gave me the impression that "sampled from the global population pool" meant chosen at random, but the code seems to perform a prior round of mutation to generate parents, particularly at the highest budgets where very few configurations are promoted.
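
To make the question concrete, the five steps as described above can be written down as follows. This is a sketch of the reading given in this issue, not a copy of the DEHB code; the function names are made up and boundary handling is left out.

```python
import random


def rand1(pool, F, rng):
    """DE rand1 mutation: a + F * (b - c) with three distinct pool members."""
    a, b, c = rng.sample(pool, 3)
    return [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]


def mutate_with_global_fill(promoted, subpopulations, F=0.5, seed=0):
    rng = random.Random(seed)
    parent_pool = list(promoted)                 # step 1: top K from lower budget
    missing = max(0, 3 - len(parent_pool))       # step 2: rand1 needs 3 parents
    global_pop = [ind for sub in subpopulations for ind in sub]  # step 3
    for _ in range(missing):                     # step 4: mutants fill the pool
        parent_pool.append(rand1(global_pop, F, rng))
    mutant = rand1(parent_pool, F, rng)          # step 5: the actual mutation
    return parent_pool, mutant


subpops = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
pool, mutant = mutate_with_global_fill(promoted=[[0.1, 0.2]], subpopulations=subpops)
```

Under the other reading of the paper, step 4 would instead be a plain `rng.sample(global_pop, missing)`.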

Always training from scratch?

Hello! I'm interested in using DEHB for HPO of neural networks.
But I couldn't find any code related to model checkpointing. Does training for every budget start from scratch?

Support for python 3.7

Hello!

Is there anything in the current implementation that prevents supporting Python 3.7? DEHB used to work with Oríon on 3.7 until last weekend. For some reason pip had not been enforcing the '>=3.8' requirement since February and started doing so last weekend. Would it be possible to change the Python requirement to '>=3.7', or would that require changes to DEHB's code?

Thank you!

Implement parallel DE

Currently we only support running DEHB in parallel, while DE can only be run with a single worker.

It would be a nice addition to also implement the parallel setup for DE, so that our library supports both parallel DEHB and DE.

Checkpointing optimization run

A nice addition to the package could be checkpointing the state of the optimizer in order to restart the optimization e.g. when an optimization run has finished or crashed.

This could also be interesting for issue #30 in order to restart the optimization with a higher computational budget.

PR #6 has already touched on this topic, and it might make sense to take some inspiration from there.

We could then also adjust the pytorch example to feature this.
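
As a starting point for discussion, a checkpoint could be as simple as pickling a dictionary. The fields in `state` below are hypothetical; which parts of the optimizer actually need to be stored is the open question.

```python
import pickle
import random
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, path: Path) -> None:
    """Write to a temporary file first, then rename it into place."""
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(state, fh)
    tmp.replace(path)  # a crash mid-write cannot corrupt the last good file


def load_checkpoint(path: Path) -> dict:
    with path.open("rb") as fh:
        return pickle.load(fh)


# Hypothetical optimizer state; the RNG state is included so that a resumed
# run continues the same random sequence.
state = {
    "population": [[0.1, 0.7], [0.4, 0.2]],
    "fitness": [0.9, 0.3],
    "bracket_counter": 3,
    "rng_state": random.Random(1).getstate(),
}
with tempfile.TemporaryDirectory() as tmpdir:
    checkpoint = Path(tmpdir) / "dehb_state.pkl"
    save_checkpoint(state, checkpoint)
    restored = load_checkpoint(checkpoint)
```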

Continuing configuration evolution when run budget is exhausted.

Note: This issue is only relevant once the Config ID changes have been merged.

While waiting to fetch all results when the run budget is exhausted, we keep sampling/evolving new configurations:

if self.is_worker_available():
    job_info = self._get_next_job()
    if brackets is not None and job_info["bracket_id"] >= brackets:
        # ignore submission and only collect results

While this was not an issue before, we would end up filling the ConfigRepository with unnecessary configurations, which will never be evaluated. We should think about how to redesign this.

No support for Constant type in vector_to_configspace() and configspace_to_vector()

Hello,

I've been working with DEHB on my dataset and came across an issue with the processing of ConfigSpace in the de.py file. Specifically, in the functions vector_to_configspace() and configspace_to_vector(), there seems to be no option or support for handling the ConfigSpace Constant type.

Is this an oversight or intentional? If it's the latter, should I avoid including the Constant type in my ConfigSpace or is there a recommended workaround?

Thanks for your assistance!
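
Until this is clarified, one possible workaround is to keep constants out of the optimized vector and re-attach them afterwards. The sketch below uses plain dictionaries instead of ConfigSpace objects; the search-space layout and both function names are assumptions made for illustration only.

```python
def split_constants(space: dict) -> tuple[dict, dict]:
    """Separate fixed values from tunable ranges.

    `space` maps a name either to a (low, high) tuple (tunable) or to any
    other value (treated as a constant). This layout is hypothetical.
    """
    tunable = {k: v for k, v in space.items() if isinstance(v, tuple)}
    constants = {k: v for k, v in space.items() if not isinstance(v, tuple)}
    return tunable, constants


def vector_to_config(vector, tunable: dict, constants: dict) -> dict:
    """Scale a unit-cube vector to the tunable ranges and re-attach constants."""
    config = {
        name: low + x * (high - low)
        for x, (name, (low, high)) in zip(vector, tunable.items())
    }
    config.update(constants)
    return config


tunable, constants = split_constants({"lr": (0.0, 1.0), "n_layers": 3})
config = vector_to_config([0.5], tunable, constants)
```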

DE Optimiser return syntax is deprecated

Dear @Neeratyoy, please see the log from running the DE optimiser:

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/dehb/optimizers/de.py:810: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
return np.array(self.traj), np.array(self.runtime), np.array(self.history)
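
The warning text itself names the fix: pass `dtype=object` when the nested sequences have different lengths. A minimal reproduction with made-up data, independent of DEHB:

```python
import numpy as np

# Rows of different lengths, standing in for a ragged history list.
history = [[0.5, 1.0], [0.4]]

# dtype=object stores each row as a Python object instead of trying to
# build a rectangular array; newer NumPy releases otherwise raise an error.
history_arr = np.array(history, dtype=object)
```

Note that if all rows happen to have the same length, `np.array(..., dtype=object)` builds a 2-D object array instead, so the shape of the result depends on the data.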

Upgrade ConfigSpace

The current version of DEHB uses ConfigSpace 0.4.16, which does not ship any wheels. This could cause issues, as ConfigSpace then has to be built on the user's end when running pip install dehb.

The newest versions of ConfigSpace, namely v0.5.x and v0.6.0, are compatible with the old one and now ship wheels, which should make installation easier. The wheels come with v0.5.0, and v0.6.0 provides a much easier interface for defining search spaces.

Implement versioning for documentation

Giving users the opportunity to select the documentation for the specific package version they are using would increase the package's quality and usability.

Example on how to train an RF leaks data

Specifically

train_X, test_X, train_y, test_y = train_test_split(
    _data.get("data"), 
    _data.get("target"), 
    test_size=0.1, 
    shuffle=True, 
    random_state=seed
)
train_X, valid_X, train_y, valid_y = train_test_split(
    _data.get("data"), 
    _data.get("target"), 
    test_size=0.3, 
    shuffle=True, 
    random_state=seed
)

Should be:

train_x, rest_x, train_y, rest_y = train_test_split(
  _data.get("data"), 
  _data.get("target"), 
  train_size=0.6, 
  shuffle=True, 
  random_state=seed
)

# 50% of the 40% of the data left over from above (20%) 
valid_x, test_x, valid_y, test_y = train_test_split(
  rest_x, rest_y,
  test_size=0.5, 
  shuffle=True, 
  random_state=seed
)

... or however you like, as long as the same data is not re-split multiple times.

ERROR - Failed to communicate with scheduler during heartbeat. followed by TimeoutError: No valid workers found

Hi,

I'm using DEHB to find parameters for several models running in a loop. During execution I get the following error which effectively terminates the program.

TimeoutError: No valid workers found

Monitoring CPU usage, I see that utilization for the python process gradually climbs to 100%, even though each run only uses 2 workers, and the host computer has 8 cores.

The code instantiating the DEHB instance is this

dehb = DEHB(
            f  = self.objective,
            cs  = self.conf_space,
            dimensions = len(self.conf_space.get_hyperparameters()),
            min_budget = 2,
            max_budget = 10,
            n_workers  = 2,
            output_path="./temp"
        )
trajectory, runtime, history = dehb.run(
                total_cost        = self.total_cost,
                verbose           = False,
                save_intermediate = False,
                seed              = self.seed,
                train_X           = train_X,
                train_y           = train_y,
                valid_X           = valid_X,
                valid_y           = valid_y,
                max_budget  = dehb.max_budget, 
                save_history  = False,
            )

The rest of the code is roughly taken from the random forest example found here: https://automl.github.io/DEHB/latest/examples/01.1_Optimizing_RandomForest_using_DEHB/

Here is the full error right before the No valid worker found error:

2024-05-13 16:53:36,517 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/worker.py", line 1252, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/utils_comm.py", line 455, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/utils_comm.py", line 434, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/core.py", line 1394, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/core.py", line 1153, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/comm/tcp.py", line 237, in read
    convert_stream_closed_error(self, e)
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:54877 remote=tcp://127.0.0.1:54831>: Stream is closed
2024-05-13 16:53:37,512 - distributed.nanny - WRN - Restarting worker
2024-05-13 16:53:37,523 - distributed.nanny - WRN - Restarting worker
2024-05-13 16:54:11,726 - distributed.core - ERR - Exception while handling op scatter
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/core.py", line 969, in _handle_comm
    result = await result
             ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/distributed/scheduler.py", line 6022, in scatter
    raise TimeoutError("No valid workers found")
TimeoutError: No valid workers found

Improve doc strings and add type annotations

Improve the docstrings of DEHB to follow the Google Python Style Guide. When adjusting the docstrings, it would make sense to also add type annotations to the signatures. These changes would improve the general readability of the code.

Continuously develop unit tests

To ensure that the mechanics of DEHB work correctly (especially after changes or refactorings), it would be helpful to develop unit tests.

These could be divided into three major categories:

  1. DEHB
  2. Bracket Manager
  3. DE

I would prioritize these tests in the order of the list above, since DEHB itself and the bracket manager are more likely to be adjusted in the future. The unit tests would then act as a safety net to ensure the components still work as intended.

Depending on which DE implementation we used, there might even be some unit tests in the original repo or did you write the DE code yourself? @Neeratyoy

Introduce IDs for evaluated configurations

Some advantages of assigning a unique ID to each configuration could be:

  • Better logging (e.g. better readability, result dicts for each configuration)
  • Checkpointing (and restarting) an optimization run

Regarding the implementation, I would suggest using incrementing IDs. However, we would need to check every time we sample or generate a new configuration whether it has been used before. For this check I suggest using the vector representation of the configuration, so that running both with and without ConfigSpace is supported out of the box.

Using an element-wise comparison between each hyperparameter specified in the config could potentially be costly, namely O(n*d) if n is the number of sampled configs and d is the dimensionality of the configuration (number of hyperparameters). However since d tends to be rather small, I think this will not be a drastic computational overhead.

An alternative could be hashing the configurations and then using this hash as an ID directly, which would result in a faster comparison. However hash collisions could map different configs to the same ID.
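
A middle ground between the two options is a dictionary keyed by the configuration vector. The sketch below is hypothetical (class name, method name and the rounding are assumptions), but it shows how incrementing IDs and fast lookup can be combined.

```python
class ConfigRegistry:
    """Assign incrementing IDs to configuration vectors (hypothetical sketch)."""

    def __init__(self, decimals: int = 12):
        self._ids: dict[tuple, int] = {}
        self._decimals = decimals

    def _key(self, vector) -> tuple:
        # Round so that tiny floating-point noise does not create new IDs.
        return tuple(round(float(x), self._decimals) for x in vector)

    def get_or_create_id(self, vector) -> int:
        key = self._key(vector)
        if key not in self._ids:
            self._ids[key] = len(self._ids)  # next free incrementing ID
        return self._ids[key]


registry = ConfigRegistry()
first = registry.get_or_create_id([0.1, 0.2])
second = registry.get_or_create_id([0.3, 0.4])
repeat = registry.get_or_create_id([0.1, 0.2])
```

A Python `dict` hashes the tuple for the lookup and falls back to an equality comparison when hashes collide, so two different vectors never end up with the same ID.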

Improve readability of DEHB examples and docstrings

It will be useful to have more elaborate filler text in the notebooks serving as examples for using DEHB.

Also, it will be useful to have more docstrings that help explain DEHB function arguments. The example notebooks should also highlight how information on the arguments can be retrieved, for example as ?dehb.run.

Test backward compatibility with NumPy2.0

For now we should limit the NumPy dependency to <2.0.
Given that we manipulate arrays and they form the crucial data structures of the algorithm, we should explicitly test what breaks and how significant the required changes are for the new NumPy version.

[Bug] Seeding is not applied to configspace

Please see this reproducible example and output. I believe it's because the configspace is never seeded:

from __future__ import annotations

from ConfigSpace import ConfigurationSpace

from dehb import DEHB

cs = ConfigurationSpace({"a": (1.0, 10.0)})


def f(config, fidelity):
    print(config, fidelity)
    return {"fitness": config["a"] ** 2, "cost": 1.0, "info": {}}


D = DEHB(cs, f=f, seed=1, min_fidelity=1, max_fidelity=100, n_workers=1)
D.run(1)

D = DEHB(cs, f=f, seed=1, min_fidelity=1, max_fidelity=100, n_workers=1)
D.run(1)

Is there any way to hide non-critical logs to stdout?

Hi,

The dehb package is producing a large amount of logs into stdout, and they're not contained within the dehb namespace, but rather polluting the global namespace, which makes it difficult to read logs coming from any host program.

Is there any way to turn off this behavior and only have it log errors like other programs?

Thanks,

Configuration and budgets in each brackets

Hi, would it be possible to report how many configurations and which budgets are evaluated in each bracket?
That would make the results easier to inspect and work with.

I also have a question: if I wanted to run 100 models with a resource of 100 epochs, how should I set the parameters?
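
For orientation, the per-bracket numbers of a Hyperband-style schedule can be computed by hand. The sketch below uses one common allocation rule; it is an assumption that DEHB follows the same rule, and its exact rounding may differ.

```python
def hyperband_schedule(min_budget: float, max_budget: float, eta: int = 3):
    """List (configs per rung, budget per rung) for each bracket.

    Uses a common Hyperband-style allocation; treat it as an approximation
    of what DEHB does, not as its exact code.
    """
    # Budgets spaced geometrically from max_budget downwards.
    budgets = []
    budget = float(max_budget)
    while budget >= min_budget:
        budgets.append(budget)
        budget /= eta
    budgets.reverse()
    n_rungs = len(budgets)

    schedule = []
    for s in range(n_rungs - 1, -1, -1):
        n0 = (n_rungs // (s + 1)) * eta**s  # configs on the lowest rung
        n_configs = [max(n0 // eta**i, 1) for i in range(s + 1)]
        schedule.append((n_configs, budgets[-(s + 1):]))
    return schedule


schedule = hyperband_schedule(1, 27, eta=3)
```

Calling the function with `max_budget=100` and different values for `min_budget` and `eta` shows how these settings change the number of configurations per bracket.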

Keep active brackets in a dictionary

Changing the list of active brackets to a dictionary with the bracket_id as keys would make the code on some occasions more readable. For example this:

# pass information of job submission to Bracket Manager
for bracket in self.active_brackets:
    if bracket.bracket_id == job_info['bracket_id']:
        # registering is IMPORTANT for Bracket Manager to perform SH
        bracket.register_job(job_info['budget'])
        break

Could then be rewritten as:

bracket_id = job_info['bracket_id']
self.active_brackets[bracket_id].register_job(job_info['budget'])

This would also have runtime advantages, however these are rather small, given that our active_brackets list is mostly small.

To clean up the dictionary, we can then simply iterate over the key-value pairs and delete the brackets that are already done.
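
One detail to watch for in the cleanup: deleting keys while iterating over the same dictionary raises a RuntimeError in Python. The sketch below uses a stub bracket class with a made-up `is_done()` method to show a safe two-step cleanup.

```python
class StubBracket:
    """Stand-in for a bracket; only knows its ID and whether it is finished."""

    def __init__(self, bracket_id: int, done: bool):
        self.bracket_id = bracket_id
        self.done = done

    def is_done(self) -> bool:
        return self.done


active_brackets = {
    0: StubBracket(0, done=True),
    1: StubBracket(1, done=False),
    2: StubBracket(2, done=True),
}

# Collect the finished IDs first, then delete, so the dictionary is never
# modified while it is being iterated.
finished_ids = [bid for bid, bracket in active_brackets.items() if bracket.is_done()]
for bid in finished_ids:
    del active_brackets[bid]
```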

DEHB and non-async DE optimiser update design

Dear @Neeratyoy and @noorawad, according to the paper "DEHB: Evolutionary Hyperband for Scalable, Robust, and Efficient Hyperparameter Optimization," DEHB employs an immediate update design for DE by utilising the AsyncDE class. However, despite passing the async strategy argument in DEHB, it does not change the async strategy of the AsyncDE class, which by default is 'deferred.'

As I understand, the DE class currently only uses deferred updates. If one would like to compare DEHB and its DE component as separate optimisers for optimising a model's validation score, would the update design of deferred vs immediate have a significant impact on the final validation score or simply the wall-clock time? I'd like to compare DE and DEHB using the same update design, and the paper suggests that the immediate update design outperforms deferred DE.

Thank you.
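
For readers unfamiliar with the two terms, the difference can be shown on a toy problem. The sketch below is a plain rand1/bin DE generation on the sphere function, written for illustration; it is not DEHB's AsyncDE code, and all names and defaults are assumptions.

```python
import random


def sphere(x):
    return sum(v * v for v in x)


def de_generation(population, fitness, rng, immediate, F=0.5, CR=0.5):
    """One DE (rand1/bin) generation; `population` is modified in place.

    immediate=True: a winning trial replaces its parent right away and can be
    picked as a donor later in the same generation.
    immediate=False (deferred): winners are swapped in only after every
    individual has been processed.
    """
    winners = {}
    for i, parent in enumerate(population):
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)
        mutant = [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]
        j_rand = rng.randrange(len(parent))  # at least one gene from the mutant
        trial = [
            m if (rng.random() < CR or j == j_rand) else p
            for j, (m, p) in enumerate(zip(mutant, parent))
        ]
        trial_fitness = sphere(trial)
        if trial_fitness <= fitness[i]:
            if immediate:
                population[i], fitness[i] = trial, trial_fitness
            else:
                winners[i] = (trial, trial_fitness)
    for i, (trial, trial_fitness) in winners.items():
        population[i], fitness[i] = trial, trial_fitness


rng = random.Random(0)
population = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
fitness = [sphere(p) for p in population]
best_before = min(fitness)
de_generation(population, fitness, rng, immediate=True)
de_generation(population, fitness, rng, immediate=False)
```

Both variants spend the same number of function evaluations per generation, so a fair comparison would look at the score per evaluation as well as at wall-clock time.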
