Giter Club home page Giter Club logo

Comments (3)

jsfreischuetz avatar jsfreischuetz commented on August 21, 2024

@bpkroth @motus

from mlos.

bpkroth avatar bpkroth commented on August 21, 2024

Looks to me like we are using this functionality already:

self._opt.register(configs=df_configs, scores=df_scores[opt_targets].astype(float))

It's especially useful for resuming an experiment and re-warming the optimizer with prior results.
I think the rough idea was that for backend optimizers that supported it, we could amortize the cost of retraining the model.
That said, I don't think this is being done well atm, but let's leave that topic for a future PR.
(for instance, for SMAC, I think we probably want to throw out the existing backend optimizer and bulk register everything in one shot at its initialization whenever we do that)

Back to this change, which was really only meant for code readability improvements,
in part what you're proposing here is to change the API to return NamedTuples of individual (config, score, context, metadata) instances instead of DataFrames of the same. That is a much wider reaching change than I was originally envisioning, I think.

Your proposal of having a class called an Observation is nice, but what I might do instead is something more like the following (quickly hacked together with some copilot help).

Note that we could pretty easily extend some of these helper functions for conversion to other data structure types in the future as well.

Some caveats:

The __iter__ method returns the wrong type info for scores and configs due to needing to support optional values for contexts and metadatas.

An alternative would be to explicitly change the received return type in all callsites, but that might be a bunch of work too.

There's probably some other refinements that could happen as well. As I said, this is just a quick stub idea.

"""Simple dataclass for storing observations from a hyperparameter optimization run."""

import itertools
from dataclasses import dataclass
from typing import List, Optional, Iterator

import pandas as pd


@dataclass(frozen=True)
class Observation:
    """Simple dataclass for storing a single Observation."""

    config: pd.Series
    score: pd.Series
    context: Optional[pd.Series] = None
    metadata: Optional[pd.Series] = None

    def __iter__(self) -> Iterator[Optional[pd.Series]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more effecient return type than using astuple()
        # which makes deepcopies.
        return iter((self.config, self.score, self.context, self.metadata))


@dataclass(frozen=True)
class Observations:
    """Simple dataclass for storing observations from a hyperparameter optimization
    run.
    """

    configs: pd.DataFrame
    scores: pd.DataFrame
    context: Optional[pd.DataFrame] = None
    metadata: Optional[pd.DataFrame] = None

    def __post_init__(self) -> None:
        assert len(self.configs) == len(self.scores)
        if self.context is not None:
            assert len(self.configs) == len(self.context)
        if self.metadata is not None:
            assert len(self.configs) == len(self.metadata)

    def __iter__(self) -> Iterator[Optional[pd.DataFrame]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more effecient return type than using astuple()
        # which makes deepcopies.
        return iter((self.configs, self.scores, self.context, self.metadata))

    def to_observation_list(self) -> List[Observation]:
        """Convert the Observations object to a list of Observation objects."""
        return [
            Observation(
                config=config,
                score=score,
                context=context,
                metadata=metadata,
            )
            for config, score, context, metadata in zip(
                self.configs.iterrows(),
                self.scores.iterrows(),
                self.context.iterrows() if self.context is not None else itertools.repeat(None),
                self.metadata.iterrows() if self.metadata is not None else itertools.repeat(None),
            )
        ]


def get_observations() -> Observations:
    """Get some dummy observations."""
    # Create some dummy data
    configs = pd.DataFrame(
        {
            "x": [1, 2, 3],
            "y": [4, 5, 6],
        }
    )
    scores = pd.DataFrame(
        {
            "score": [0.1, 0.2, 0.3],
        }
    )

    # Create an Observations object
    return Observations(configs=configs, scores=scores)


def test_observations() -> None:
    """Test the Observations dataclass."""
    # Create an Observations object
    observations = get_observations()
    observation = observations.to_observation_list()[0]

    # Print the Observations object
    print(observations)
    print(observations.configs, observations.scores)
    print(observation)
    print(observation.config, observation.score)

    # Or in tuple form using the __iter__ method:
    configs, scores, contexts, metadatas = get_observations()
    print(configs)
    print(scores)
    print(contexts)
    print(metadatas)

    config, score, context, metadata = get_observations().to_observation_list()[0]
    print(config)
    print(score)
    print(context)
    print(metadata)


if __name__ == "__main__":
    test_observations()

from mlos.

bpkroth avatar bpkroth commented on August 21, 2024

And tbh, that __iter__ hack kinda defeats the purpose a bit of what we're trying to do with this change - avoid mistakes of reordering metadata, context in return(ed) values by making them more explicit.

from mlos.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.