neuraxio / neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.

Home Page: https://www.neuraxle.org/

License: Apache License 2.0

Topics: pipeline, pipeline-framework, machine-learning, deep-learning, framework, python-library, hyperparameter-optimization, hyperparameter-tuning, hyperparameter-search, hyperparameters

neuraxle's Introduction

Neuraxle Pipelines

Code Machine Learning Pipelines - The Right Way.


Neuraxle is a Machine Learning (ML) library for building clean machine learning pipelines using the right abstractions.

  • Component-Based: Build encapsulated steps, then compose them to build complex pipelines.
  • Evolving State: Each pipeline step can fit and evolve through the learning process.
  • Hyperparameter Tuning: Optimize your pipelines using AutoML, where each pipeline step has its own hyperparameter space.
  • Compatible: Use your favorite machine learning libraries inside and outside Neuraxle pipelines.
  • Production Ready: Pipeline steps can manage how they are saved by themselves, and the lifecycle of the objects allows for train and test modes.
  • Streaming Pipeline: Transform data in many pipeline steps at the same time, in parallel, using multiprocessing Queues.

Documentation

You can find the Neuraxle documentation on the website. It also contains multiple examples demonstrating some of its features.

Installation

Simply do:

pip install neuraxle

Examples

We have several examples on the website.

For example, you can build a time series processing pipeline like this:

p = Pipeline([
    TrainOnlyWrapper(DataShuffler()),
    WindowTimeSeries(),
])

# Load data
X_train, y_train, X_test, y_test = generate_classification_data()

# The pipeline will learn on the data and acquire state.
p = p.fit(X_train, y_train)

# Once it learned, the pipeline can process new and
# unseen data for making predictions.
y_test_predicted = p.predict(X_test)

You can also tune your hyperparameters using AutoML algorithms such as the TPE:

# Define classification models with hyperparams.

# All SKLearn models can be used and compared to each other.
# Give each of them a hyperparameter space like this:
decision_tree_classifier = SKLearnWrapper(
    DecisionTreeClassifier(),
    HyperparameterSpace({
        'criterion': Choice(['gini', 'entropy']),
        'splitter': Choice(['best', 'random']),
        'min_samples_leaf': RandInt(2, 5),
        'min_samples_split': RandInt(2, 4)
    }))

# More SKLearn models can be added (code details skipped):
random_forest_classifier = ...
logistic_regression = ...

# It's possible to mix TensorFlow models into Neuraxle as well,
# using Neuraxle-TensorFlow's Tensorflow2ModelStep class and passing in
# the TensorFlow functions like create_model and create_optimizer:
minibatched_tensorflow_classifier = EpochRepeater(MiniBatchSequentialPipeline([
        Tensorflow2ModelStep(
            create_model=create_linear_model,
            create_optimizer=create_adam_optimizer,
            create_loss=create_mse_loss_with_regularization
        ).set_hyperparams_space(HyperparameterSpace({
            'hidden_dim': RandInt(6, 750),
            'layers_stacked_count': RandInt(1, 4),
            'lambda_loss_amount': Uniform(0.0003, 0.001),
            'learning_rate': Uniform(0.001, 0.01),
            'window_size_future': FixedHyperparameter(sequence_length),
            'output_dim': FixedHyperparameter(output_dim),
            'input_dim': FixedHyperparameter(input_dim)
        }))
    ]), epochs=42)

# Define a classification pipeline that lets the AutoML loop choose one of the classifiers.
# See also ChooseOneStepOf documentation: https://www.neuraxle.org/stable/api/steps/neuraxle.steps.flow.html#neuraxle.steps.flow.ChooseOneStepOf
pipeline = Pipeline([
    ChooseOneStepOf([
        decision_tree_classifier,
        random_forest_classifier,
        logistic_regression,
        minibatched_tensorflow_classifier,
    ])
])

# Create the AutoML loop object.
# See also AutoML documentation: https://www.neuraxle.org/stable/api/metaopt/neuraxle.metaopt.auto_ml.html#neuraxle.metaopt.auto_ml.AutoML
auto_ml = AutoML(
    pipeline=pipeline,
    hyperparams_optimizer=TreeParzenEstimator(
        # This is the TPE as in Hyperopt.
        number_of_initial_random_step=20,
    ),
    validation_splitter=ValidationSplitter(validation_size=0.20),
    scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),
    n_trials=40,
    epochs=1,  # Could be higher if only TF models were used.
    hyperparams_repository=HyperparamsOnDiskRepository(cache_folder=neuraxle_dashboard),
    refit_best_trial=True,
    continue_loop_on_error=False
)

# Load data, and launch AutoML loop!
X_train, y_train, X_test, y_test = generate_classification_data()
auto_ml = auto_ml.fit(X_train, y_train)

# Get the model from the best trial, and make predictions using predict, as per the `refit_best_trial=True` argument to AutoML.
y_pred = auto_ml.predict(X_test)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Test accuracy score:", accuracy)

Why Neuraxle?

Most research projects don't ever get to production. However, you want your project to be production-ready and already adaptable (clean) by the time you finish it. You also want things to be simple so that you can get started quickly. Read more about the why of Neuraxle here.

Community

For technical questions, please post them on StackOverflow using the neuraxle tag. The StackOverflow question will automatically be posted in Neuraxio's Slack workspace and our Gitter in the #Neuraxle channel.

For suggestions, feature requests, and error reports, please open an issue.

For contributors, we recommend using the PyCharm code editor and letting it manage the virtual environment, with the default code auto-formatter, and using pytest as a test runner. To contribute, first fork the project, make your changes, and then open a pull request in the main repository. Please make your pull requests editable so that we can, for example, add you to the list of contributors if you didn't add the entry yourself. Ensure that all tests pass before opening a pull request. You also agree that your contributions will be licensed under the Apache 2.0 License, which is required for everyone to be able to use your open-source contributions.

Finally, you can also join our Slack workspace and our Gitter to collaborate with us. We <3 collaborators. You can also subscribe to our mailing list, where we will post updates and news.

License

Neuraxle is licensed under the Apache License, Version 2.0.

Citation

You may cite our extended abstract that was presented at the Montreal Artificial Intelligence Symposium (MAIS) 2019. Here is the BibTeX entry:

@misc{neuraxle,
  author = {Chevalier, Guillaume and Brillant, Alexandre and Hamel, Eric},
  year = {2019},
  month = {09},
  title = {Neuraxle - A Python Framework for Neat Machine Learning Pipelines},
  doi = {10.13140/RG.2.2.33135.59043}
}

Contributors

Thanks to everyone who contributed to the project.

Supported By

We thank these organisations for generously supporting the project.


neuraxle's Issues

BaseStep must have a custom saver.

There is a problem in the following code:

class ResumablePipeline(Pipeline, ResumableStepMixin):
    """
    Fits and transform steps after latest checkpoint
    """

    def __init__(self, steps: NamedTupleList, pipeline_saver: PipelineSaver = None):
        Pipeline.__init__(self, steps=steps)

        if pipeline_saver is None:
            self.pipeline_saver = JoblibPipelineSaver(DEFAULT_CACHE_FOLDER)
        else:
            self.pipeline_saver = pipeline_saver

    # ...

The problem is that the pipeline decides on the saver, which is wrong (invalid): the pipeline should allow the steps to use a custom saver. For instance, a TensorFlow, Keras, or PyTorch model will need special serialization using its own methods.

Suggested fix

Have a class like the hasher that allows the objects to change how they are saved.

It is okay for a resumable pipeline to be able to pass in a default saver, but only when the pipeline steps don't have a saver of their own. The pipeline can't force the saving.
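As a rough sketch of that idea (the class and method names here, like BaseSaver and JoblibSaver, are hypothetical illustrations rather than actual Neuraxle API), the saver could be a strategy object attached to each step, with the pipeline's default saver used only as a fallback:

class BaseSaver:
    """Sketch of a strategy object that lets each step decide how it is serialized."""

    def save_step(self, step: 'BaseStep', path: str):
        raise NotImplementedError()

    def load_step(self, path: str) -> 'BaseStep':
        raise NotImplementedError()

class JoblibSaver(BaseSaver):
    """Default fallback saver, used only when a step has no saver of its own."""

    def save_step(self, step, path):
        import joblib
        joblib.dump(step, path)

    def load_step(self, path):
        import joblib
        return joblib.load(path)

A Keras step, for instance, could then carry its own saver that calls the model's native saving method instead of pickling.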

What it will impact

The pipeline steps might not need a reference to the parent anymore to be able to save themselves. They should save themselves in a directory passed to them in the context.

Suggested fix to do in _fit_transform_core and other core methods:

for step_name, step in steps_left_to_do:
    step, data_container = step.handle_fit_transform(data_container, context)

The context class:

class Context:
    current_tmp_path: str  # Path for the current object.
    stack_of_tmp_paths_of_parents: List[str]
    stack_of_parents: List[BaseStep]
    stack_is_parent_saved: List[bool]  # Useful to avoid overwriting too many times.

    def pop(self):
        return Context(
            self.stack_of_tmp_paths_of_parents[-1],
            self.stack_of_tmp_paths_of_parents[:-1],
            self.stack_of_parents[:-1],
            self.stack_is_parent_saved[:-1]
        )

    def push(self, tmp_path, parent):
        # The inverse of pop: add an entry on each stack instead of removing one.
        ...

Remove all the `_one` methods?

Could the streaming pipelines or minibatch streaming pipelines just use the regular methods and let the implementer of the step choose how to handle many successive minibatches or single items?

StepClonerForEachDataInput doesn't propagate hyperparams.

This is the same problem as described in #28, but here it is about the StepClonerForEachDataInput: it propagates neither get_hyperparams nor set_hyperparams to its contained pipelines. It'd be good here not to return many duplicates of the same params, and instead have something that, upon a get, gets only the hyperparams of one instance, and, upon a set, sets the same hyperparams for each instance.

Same goes for spaces with set_hyperparams_space.
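A hedged sketch of that behavior (assuming the cloned instances live in a list attribute called self.steps, which is an assumption for illustration):

class StepClonerForEachDataInput:
    def get_hyperparams(self):
        # All clones share the same hyperparams: return those of one
        # instance instead of many duplicates.
        return self.steps[0].get_hyperparams()

    def set_hyperparams(self, hyperparams):
        # Upon a set, apply the same hyperparams to every cloned instance.
        for step in self.steps:
            step.set_hyperparams(hyperparams)
        return self

get_hyperparams_space and set_hyperparams_space could follow the exact same pattern.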

ReversiblePreprocessingWrapper

We'll need something like this:

class ReversiblePreprocessingWrapper(MetaStepMixin, BaseStep): 
    def __init__(
        self, 
        wrapped_step: BaseStep, 
        reversible_preprocessing_pipeline: BaseStep
    ):
        pass  # ...

    def transform(self, data_inputs): 
        data_inputs = self.reversible_preprocessing_pipeline.transform(data_inputs)
        data_inputs = self.wrapped_step.transform(data_inputs)
        data_inputs = self.reversible_preprocessing_pipeline.inverse_transform(data_inputs)
        return data_inputs

Pipeline checkpoints needs to hash data and hyperparams

A pipeline can be run on many datasets and it can be re-trained with many different hyperparameters on all of those datasets. Thus, we need a way to make the difference between each checkpoint. This will allow hyperparameter tuning when the hyperparameters of the steps change, and this will allow not reusing the same checkpoints between train data and test data (and other data) if checkpoints are enabled.

Dynamically create subfolders of pipeline cache according to:

  • data_hash: The hash of the input data to the pipeline. We don't want to load checkpoints for new data by mistake.
  • hyperparams_hash: The hyperparameter samples of the pipeline, e.g.: hash(p.get_hyperparams()).

This means that for each pipeline, the subfolders tree could look like this, for example:

./cache
    {data_hash}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        subpipeline_c_{hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{also another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
    {data_hash for another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...
    {data_hash for still another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...

Interesting facts and discussion points, assuming each step (or most steps) is checkpointed:

  • Sometimes, the hyperparameters of a few pipeline steps will be the same, and only the final pipeline step will change. This means it's possible to reuse the same checkpoints for the first steps given a dataset, and only the last step will need two different checkpoints.
  • If a pipeline step has a hyperparameter that changes, but is executed on the same data, the checkpoint name (suffix past a final underscore delimiter "__") will be different. (Or, if the hash is fast to compute, check whether the new checkpoint is the same as the old one, and if so avoid re-writing to disk?)
  • Hashing huge numpy arrays can be a lengthy process, so perhaps we could add hashers such as one that just takes the shape of the input array when the input is an np array, and so forth (see the sketch after this list).
  • The hasher could be sent as an argument of the PipelineRunner or Pipeline, and could be deactivated by sending a hasher that always returns the same value, such that every checkpoint always triggers (?). In fact, there should also be a way to deactivate checkpoints completely (e.g.: for sending models to production).
  • For AutoML, we need to reuse the same checkpoints only if the hyperparameters of the previous steps AND the current step are unchanged (hashes need to cover ranges of steps before the checkpoint, not just the hyperparams of the checkpointed step itself).
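As a hedged sketch of such a pluggable hasher (names like BaseHasher and NumpyShapeHasher are made up for illustration, not an existing API):

import hashlib

import numpy as np

class BaseHasher:
    def hash(self, hyperparams: dict, data_inputs) -> str:
        m = hashlib.md5()
        m.update(str(sorted(hyperparams.items())).encode('utf-8'))
        m.update(self.summarize(data_inputs).encode('utf-8'))
        return m.hexdigest()

    def summarize(self, data_inputs) -> str:
        # Default: hash the full data representation (can be slow on big arrays).
        return str(data_inputs)

class NumpyShapeHasher(BaseHasher):
    def summarize(self, data_inputs) -> str:
        # Cheap variant: only the shape of a numpy array feeds the hash.
        if isinstance(data_inputs, np.ndarray):
            return str(data_inputs.shape)
        return super().summarize(data_inputs)

A "deactivated" hasher would simply return a constant string from hash(), making every checkpoint match.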

AutoMLSequentialWrapper

Do something like this for meta_fit:

class AutoMLSequentialWrapper:

    def __init__(self, wrapped_pipeline, auto_ml_strategy, validation_technique, score_function, hyperparams_repository, n_iters):
        self.wrapped_pipeline = wrapped_pipeline
        self.auto_ml_strategy = auto_ml_strategy
        self.validation_technique = validation_technique
        self.score_function = score_function
        self.hyperparams_repository = hyperparams_repository
        self.n_iters = n_iters

    def fit(self, di, eo):
        for i in range(self.n_iters):
            # hps: List[HyperparameterSamples], scores: List[float]
            hps, scores = self.hyperparams_repository.load_all()

            auto_ml_strategy = self.auto_ml_strategy.fit(hps, scores)

            next_model_to_try_hps = auto_ml_strategy.guess_next_best_params(
                i, self.n_iters, self.wrapped_pipeline.get_hyperparams_space())
            self.hyperparams_repository.register_new_untrained_trial(next_model_to_try_hps)

            validation_wrapper = self.validation_technique(
                copy(self.wrapped_pipeline).set_hyperparams(next_model_to_try_hps))
            validation_wrapper, predicted_eo = validation_wrapper.fit_transform(di, eo)

            score = self.score_function(predicted_eo, eo)  # TODO: review order of arguments here.

            self.hyperparams_repository.set_score_for_trial(next_model_to_try_hps, score)
        return self

I'd like to validate the OOP object structure. For instance, what will we do when we run trials in parallel? This for loop is not enough: it'd be more like a pool of workers that tries the N next best samples.

We also need a way to indicate that the trial crashed so that the auto_ml_strategy doesn't try that point again.

Any comments/suggestions on that @mlevesquedion @alexbrillant @Eric2Hamel?

fix `fit_transform` in sklearnwrapper

Consider using if hasattr(self.wrapped_sklearn_predictor, 'fit_transform'):, which is important to save time (e.g.: it avoids doing fit then transform, which might duplicate some computations and can cause pipelines to take 2x the time to compute).
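A hedged sketch of what the fixed method could look like (the wrapped_sklearn_predictor attribute name is taken from this issue; the exact Neuraxle method signature may differ):

def fit_transform(self, data_inputs, expected_outputs=None):
    # Prefer the predictor's native fit_transform: it can share computations
    # between fitting and transforming instead of doing both separately.
    if hasattr(self.wrapped_sklearn_predictor, 'fit_transform'):
        outputs = self.wrapped_sklearn_predictor.fit_transform(data_inputs, expected_outputs)
        return self, outputs
    # Fallback: fit then transform, which may compute some things twice.
    self.wrapped_sklearn_predictor.fit(data_inputs, expected_outputs)
    return self, self.wrapped_sklearn_predictor.transform(data_inputs)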

Add Neuraxle to Awesome Python

The pull request here needs 20 thumbs up (+1 👍) for it to be merged. Please leave your thumbs up at the pull request. If you see that it already has 20 thumbs up, perhaps bump the bot again.

Implement Pipeline Setup And Teardown Methods

Add setup and teardown methods to the base step.
Setup: recursively call the setup method of each pipeline step at the beginning of the execution.
Teardown: will be used to close sessions, connections, etc. at the end of the execution.
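A rough sketch of the recursion (assuming a TruncableSteps-like container that holds (name, step) pairs in steps_as_tuple):

class BaseStep:
    def setup(self):
        # Open sessions, connections, allocate resources...
        return self

    def teardown(self):
        # Close sessions, connections, free resources...
        return self

class TruncableSteps(BaseStep):
    def setup(self):
        for _, step in self.steps_as_tuple:
            step.setup()  # Recurses into nested pipelines as well.
        return self

    def teardown(self):
        for _, step in self.steps_as_tuple:
            step.teardown()
        return self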

Add DataLoader class: a way to lazy-load iterable datasets.

Using generators lazily, for instance by overloading the __iter__ and __len__ methods. To be used with Streaming Pipelines. DataLoaders should be able to be nested.

class DataLoader:

    def __iter__(self):
        ...

    def __getitem__(self, index):
        ...

    def __len__(self):
        # __len__ must be defined so that just checking the length doesn't
        # empty or fully iterate the loader.
        ...

Question: could they be without length / infinite?

I'd also like to think about how we could have things that enable duplicating the data (e.g.: introducing a local shuffle with a window size, or the concept of epoch loops to train NNs).

Missing features

One missing feature is feature union and subset feature processing.

Imagine you have text and numerics in the same dataframe...

FlattenForEach step

We already have a ForEachDataInputs and a CloneStepForEachDataInputs. I think we also need a Flatten3Dto2DForEachDataInputs, which is the same concept as a ForEachDataInputs but which reduces a dimension instead of manually looping on it.

Example: instead of looping on the data like for di in data_inputs: self.wrapped.transform(di), the Flatten3Dto2DForEachDataInputs step instead does this:

reduced_data_inputs = sum(list(data_inputs), [])  # Converts data inputs from 3D to 2D.
outs = self.wrapped.transform(reduced_data_inputs)

# Bear with me: the following is like doing
#     outs.reshape(len(data_inputs), outs.shape[0] / len(data_inputs), *outs.shape[1:])
# but we can't call `.shape` nor `.reshape` because the data type might not be
# an np.array or might not be a list: we want to keep things generic.
reshaped_outs_to_reaugmented = self._re_augment_data(outs)

return reshaped_outs_to_reaugmented

Note: what self._re_augment_data does is re-create the missing dimension, so that the array we return has the same number of dimensions as the data inputs had.
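A hedged sketch of what _re_augment_data could do, sticking to plain list slicing so that it stays type-generic (the extra original_data_inputs argument is added here for clarity and is not in the snippet above):

def _re_augment_data(self, outs, original_data_inputs):
    # Split the flat 2D outputs back into len(original_data_inputs) chunks,
    # re-creating the dimension that was flattened away, without calling
    # .reshape, so that plain lists keep working.
    n = len(original_data_inputs)
    chunk_size = len(outs) // n
    return [outs[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]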

Pipeline Runners should be able to transform x AND y at the same time provided a new OutputTransformerMixin step.

See the Autoregress in this slide of the talk: https://youtu.be/WXWDDEkuSaE?t=513

Autoregress takes an input X and returns not just an X upon transform, but also creates a Y. Example:
X_subset, Y_subset = Autoregress().transform(X)

We could perhaps have a Mixin class that is an OutputTransformerMixin. The PipelineRunner, upon seeing this class, would know that the class changes the X and the Y at the same time. E.g.:

if isinstance(step, OutputTransformerMixin): 
    X, y = step.transform(X, y)

This is to be done within the transform loop and the fit_transform loop. So, for example, a fit_transform would be unpackable like this:

if isinstance(step, OutputTransformerMixin): 
    step, (X, y) = step.fit_transform(X, y)

Broken Pipeline Runner causes infinite recursion.

The default pipeline runner, an argument of the Pipeline class' constructor, is reused across different Pipeline instances because the default argument is created only once. This sometimes causes a recursion error when the pipeline runner gets its steps set everywhere at once with set_steps and loops on itself.

Quick fix: copy the pipeline runner in the constructor of the Pipeline, like this:
self.pipeline_runner: BasePipelineRunner = copy(pipeline_runner)

Better ways to fix that may be possible.
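To illustrate the pitfall and the quick fix together (a sketch, not the actual constructor):

from copy import copy

class BasePipelineRunner:
    ...  # Stub for illustration; the real class lives in Neuraxle.

class Pipeline:
    # The default argument below is evaluated only once, when the function is
    # defined, so every Pipeline() would otherwise share the same runner object.
    def __init__(self, steps, pipeline_runner: BasePipelineRunner = BasePipelineRunner()):
        self.steps = steps
        # The quick fix: each pipeline gets its own copy of the runner.
        self.pipeline_runner: BasePipelineRunner = copy(pipeline_runner)

A cleaner long-term fix would be a pipeline_runner=None default, with the runner instantiated inside the constructor.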

Add BaseStep.config, BaseStep.get_config() and BaseStep.set_config()

The concept is the same as for the Hyperparameter Samples and the Hyperparameter Spaces. But a config shouldn't change what the step outputs, just how it is computed (e.g.: number of cores).

It's interesting to move some parameters to a config for when those parameters are system-related or misc. We don't want some of those parameters to alter the hashes (e.g.: n_jobs in FeatureUnion shouldn't change the outcome and should be modified to be such a config parameter).
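A minimal sketch of those accessors, mirroring the hyperparams getters and setters (hypothetical code, not the existing API):

class BaseStep:
    def __init__(self):
        self.config = {}

    def set_config(self, config: dict) -> 'BaseStep':
        # Config tweaks how the step runs (e.g.: n_jobs), never what it
        # outputs, so it must not feed the hashes used for checkpoint caching.
        self.config = dict(config)
        return self

    def get_config(self) -> dict:
        return dict(self.config)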

Need a better common base class for meta steps (that handles `get_hyperparams` and `set_hyperparams`)

Problem:

  • MetaStep and MetaSteps don't implement get_hyperparams nor set_hyperparams.
  • TruncableSteps does.

What should be done about it:

  • Move some logic from TruncableSteps to MetaSteps
  • Have MetaStep act the same, probably by inheriting from MetaSteps but containing only one such meta step.

Other better solutions could probably be possible. Basically, we need not only to have nested (recursive) pipelines to be able to return their hyperparams, but also nested objects that are MetaStep(s). MetaStep(s) do contain other step(s) and should be able to get and set their hyperparams recursively as done in TruncableSteps.

@alexbrillant I'd like your thoughts on this.

Perhaps use an `apply` method to avoid duplicate code.

TODO: read and understand all the code contained in PyTorch's nn.Module class:
https://github.com/pytorch/pytorch/blob/d3e90bc47d21149545992f183ee4130a79934cca/torch/nn/modules/module.py#L31

This nn.Module class works somehow like our TruncableSteps or somehow like our BaseStep, which makes it interesting code to read to get inspiration.

Especially look at Module.apply, Module._apply, Module.cuda, Module.float, Module.to, the "hook" methods, Module.parameters, Module.named_parameters, Module.children, Module.modules, Module.named_modules, Module.train, Module.eval, and so forth.

Perhaps this could be useful to avoid duplicate code in the TruncableSteps and in the BaseStep. For example, it seems to me that all those classes could use the same "apply" logic and thus avoid duplicating code as is done currently in get_hyperparams, set_hyperparams, get_hyperparams_space, set_hyperparams_space, setup, teardown, and so forth. We might want to think of pipelines as trees to which we can apply functions. I'd like to validate this idea.
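A hedged sketch of what such a shared apply could look like on a tree of steps (using steps_as_tuple as the container of (name, step) pairs is an assumption):

class BaseStep:
    def apply(self, fn):
        fn(self)  # A leaf has no children to recurse into.
        return self

class TruncableSteps(BaseStep):
    def apply(self, fn):
        fn(self)
        for _, step in self.steps_as_tuple:
            step.apply(fn)  # Recurse through the whole pipeline tree.
        return self

# Usage sketch: one generic traversal replaces a duplicated recursive method.
# pipeline.apply(lambda step: step.setup())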

Errors related to the `HyperparameterSamples` and to `HyperparameterSpace` types.

A few things to fix:

  • The constructor of the BaseStep should parse the provided hyperparameters to the HyperparameterSamples and HyperparameterSpace types by using the setter methods, to ensure that their type is converted if a simple dict was provided by mistake (see the sketch after the optional points below).
  • The get_hyperparams_space and get_hyperparams of the truncable steps should not return a dict, but instead should return something of the right type (HyperparameterSamples or HyperparameterSpace).
  • The HyperparameterSpace.rvs() method should perhaps return a HyperparameterSamples object instead of a HyperparameterSpace object, since the distribution collapses from a space to a sample of the space upon calling rvs (random variable sample).

Optional points:

  • See if by default we want the spaces and the samples to be flat or nested. After usage, it seems to me that it might be good to have them flat by default instead of nested by default.
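A sketch of the coercion through the setters (hedged; the import path and setter bodies are assumptions for illustration):

from neuraxle.hyperparams.space import HyperparameterSamples, HyperparameterSpace

class BaseStep:
    def __init__(self, hyperparams=None, hyperparams_space=None):
        # Go through the setters so that plain dicts passed by mistake are
        # converted to the proper wrapper types.
        self.set_hyperparams(hyperparams or {})
        self.set_hyperparams_space(hyperparams_space or {})

    def set_hyperparams(self, hyperparams) -> 'BaseStep':
        if not isinstance(hyperparams, HyperparameterSamples):
            hyperparams = HyperparameterSamples(hyperparams)
        self.hyperparams = hyperparams
        return self

    def set_hyperparams_space(self, hyperparams_space) -> 'BaseStep':
        if not isinstance(hyperparams_space, HyperparameterSpace):
            hyperparams_space = HyperparameterSpace(hyperparams_space)
        self.hyperparams_space = hyperparams_space
        return self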

Checkpoints Should Only Save Data Inputs

Output Transformers will perform a list zip operation to be able to save the transformed expected outputs on disk as well.

class OutputTransformerWrapper(MetaStepMixin, BaseStep):
    def __init__(self, wrapped: BaseStep):
        MetaStepMixin.__init__(self, wrapped)

    def transform(self, data_inputs):
        data_inputs, expected_outputs = data_inputs
        return self.wrapped.transform(list(zip(data_inputs, expected_outputs)))
