
What is Goldilox?

Goldilox is a tool that empowers data scientists to ship machine learning solutions to production.

  • This project is under active development; please wait for the first stable version.

For more details, see the documentation.

Key features

  • One line from POC to production
  • Flexible and yet simple
  • Technology agnostic
  • Things you didn't know you wanted:
    • Serialization validation
    • Missing values validation
    • Output validation
    • I/O examples
    • Variables and description queries

Installing

With pip:

$ pip install goldilox

Pandas + Sklearn support

Any Sklearn + Pandas pipeline/transformer/estimator can be turned into a Goldilox pipeline with one line of code, which you can then save and run as a server with the CLI.

Vaex native

Vaex is an open-source big data technology with similar APIs to Pandas.
We use some of Vaex's special sauce to allow extreme flexibility for advanced pipeline solutions while ensuring the tool works on big data.

  • Documentation
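
The "special sauce" is Vaex's expression state: every transformation on a DataFrame is recorded lazily and can be replayed on new data. A minimal sketch of the mechanism (plain Vaex, independent of Goldilox):

import vaex

df = vaex.from_arrays(x=[1, 2, 3])
df["y"] = df["x"] * 2       # a lazy virtual column, nothing is materialized
state = df.state_get()      # capture the recorded transformations

fresh = vaex.from_arrays(x=[10, 20])
fresh.state_set(state)      # replay the same transformations on new data
print(fresh["y"].tolist())  # [20, 40]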

Examples

1. Data science

Sklearn

import pandas as pd
from xgboost.sklearn import XGBClassifier
from goldilox.datasets import load_iris

# Get the data
df, features, target = load_iris()

# modeling
model = XGBClassifier().fit(df[features], df[target])

Vaex

import vaex
from goldilox.datasets import load_iris
from vaex.ml.xgboost import XGBoostModel
import numpy as np

df, features, target = load_iris()
df = vaex.from_pandas(df)

# feature engineering example
df["petal_ratio"] = df["petal_length"] / df["petal_width"]

features.append('petal_ratio')
# modeling
booster = XGBoostModel(
    features=features,
    target=target,
    prediction_name="prediction",
    num_boost_round=500,
)
booster.fit(df)
df = booster.transform(df)

# post modeling processing example 
df["prediction"] = np.around(df["prediction"])
df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

2. Build a production-ready pipeline

  • In one line (-:
from goldilox import Pipeline

# sklearn - When using sklearn, we want to have an example of the raw production query data
pipeline = Pipeline.from_sklearn(model, raw=Pipeline.to_raw(df[features]))

# vaex
pipeline = Pipeline.from_vaex(df)

# Save and load
pipeline.save(<path>)
pipeline = Pipeline.from_file(<path>)
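
With a pipeline in hand, you can run predictions on raw, JSON-like records. A minimal sketch - `inference` is the prediction entry point (see the FAQ below), while exposing the stored I/O example as a `raw` attribute is an assumption here:

raw = pipeline.raw              # assumption: the saved I/O example is exposed as `raw`
print(pipeline.inference(raw))  # predict on a raw production-style record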

3. Deploy

glx serve <path>

[2021-11-16 18:54:44 +0100] [74906] [INFO] Starting gunicorn 20.1.0
[2021-11-16 18:54:44 +0100] [74906] [INFO] Listening at: http://127.0.0.1:5000 (74906)
[2021-11-16 18:54:44 +0100] [74906] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 18:54:44 +0100] [74911] [INFO] Booting worker with pid: 74911
[2021-11-16 18:54:44 +0100] [74911] [INFO] Started server process [74911]
[2021-11-16 18:54:44 +0100] [74911] [INFO] Waiting for application startup.
[2021-11-16 18:54:44 +0100] [74911] [INFO] Application startup complete.
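
Once the server is up, you can query it over HTTP. A sketch - the endpoint path and payload shape below are assumptions, so check the served API docs for the actual route:

# Query the running pipeline server (endpoint path is an assumption).
import requests

record = {"sepal_length": 5.9, "sepal_width": 3.0,
          "petal_length": 5.1, "petal_width": 1.8}
response = requests.post("http://127.0.0.1:5000/inference", json=[record])
print(response.json())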


4. Training: for experiments, cloud training, automations, etc.

With Vaex, you put everything you want to do into a function which receives and returns a Vaex DataFrame:

from vaex.ml.datasets import load_iris
from goldilox import Pipeline


def fit(df):
    from vaex.ml.xgboost import XGBoostModel
    import numpy as np

    # feature engineering example
    df["petal_ratio"] = df["petal_length"] / df["petal_width"]

    # modeling
    booster = XGBoostModel(
        features=['petal_length', 'petal_width', 'sepal_length', 'sepal_width', 'petal_ratio'],
        target='class_',
        prediction_name="prediction",
        num_boost_round=500,
    )
    booster.fit(df)
    df = booster.transform(df)

    # post-modeling processing example
    df['prediction'] = np.around(df['prediction'])
    df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
    return df


df = load_iris()
pipeline = Pipeline.from_vaex(df, fit=fit).fit(df)

With Sklearn, the fit uses the standard X and y.

import pandas as pd
from sklearn.datasets import load_iris
from xgboost.sklearn import XGBClassifier

iris = load_iris()
features = iris.feature_names
df = pd.DataFrame(iris.data, columns=features)
df['target'] = iris.target

# We don't need to provide a raw example if we train via the Goldilox Pipeline - it is taken automatically from the first row.
classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)
pipeline = Pipeline.from_sklearn(classifier).fit(df[features], df['target'])
WARNING: Pipeline doesn't handle na for sepal_length
WARNING: Pipeline doesn't handle na for sepal_width
WARNING: Pipeline doesn't handle na for petal_length
WARNING: Pipeline doesn't handle na for petal_width

We do not handle missing values? Let's fix that!

import sklearn.pipeline
from goldilox.sklearn.transformers import Imputer

classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)

sk_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                         ('classifier', classifier)])

pipeline = Pipeline.from_sklearn(sk_pipeline).fit(df[features], df['target'])
  • We can still deploy a pipeline that doesn't deal with missing values if we want. Other validations, such as serialization and prediction-on-raw, must pass.

Some tools

# Serve model
glx serve <pipeline-path>

# Get the variables straight from the file
glx variables <pipeline-path>

# Get the description straight from the file
glx description <pipeline-path>

# Get the raw data example from the file
glx raw <pipeline-path>

# Get the pipeline requirements
glx freeze <pipeline-path> <path-to-requirements-file-output.txt>

# Update a pipeline file's metadata or variables
glx update <pipeline-path> key value --file --variable

Docker

You can build a docker image from a pipeline.

glx build <pipeline-path> --platform=linux/amd64

MLOps

Export to MLFlow

pipeline.export_mlflow(path, **kwargs)
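
The exported directory can then be loaded like any other MLFlow model. A sketch, assuming export_mlflow writes a standard pyfunc-compatible model:

# Load the exported pipeline as a generic MLFlow pyfunc model and predict.
import mlflow.pyfunc

model = mlflow.pyfunc.load_model(path)  # same `path` passed to export_mlflow
predictions = model.predict(raw_df)     # raw_df: a DataFrame of raw records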

Export to Gunicorn

pipeline.export_gunicorn(path, **kwargs)

FAQ

  1. Why the name "Goldilox"?
    Because with most solutions out there, either you need to do everything from scratch per solution, or you have to take it as it is. We consider ourselves in between: you can do most things, with minimal adjustments.
  2. Why do you work with Vaex and not just Pandas? Vaex handles big data on normal computers, which is our target audience, and we rely heavily on its lazy evaluation, which Pandas doesn't have.
  3. Why do you use "inference" for predictions and not "predict" or "transform"? Sklearn has a standard: "transform" returns a dataframe and "predict" a numpy array. We wanted a separate word for inference while keeping the pipeline compatible with the sklearn standard of fit, transform, and predict.
  4. M1 mac with docker?
    You probably want to use --platform=linux/amd64
  5. How to send arguments to the docker serve?
    Pass them after the pipeline path: docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH <args>
  • Example: docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH --host=0.0.0.0:5000

Contributing

See contributing page.

  • Notebooks can be a great contribution too!

Roadmap

See roadmap page.


goldilox's Issues

Add inference_steps

from_sklearn(..., inference_steps=None)

This will allow a pipeline to use all steps for training while removing some of them for inference.
The main use case is data cleaning during re-fit.

pipeline = Pipeline.from_sklearn(sk_pipeline, inference_steps=[1, 2, 5])

pipeline.fit(X, y)     # uses all steps
pipeline.inference(X)  # uses only steps [1, 2, 5]
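
A hypothetical sketch of the mechanics - `subset_pipeline` below is illustrative only, not a Goldilox API: train the full sklearn pipeline, then build a slimmer pipeline from the selected steps for inference.

from sklearn.pipeline import Pipeline as SkPipeline

def subset_pipeline(fitted_pipeline, step_indices):
    """Return a new sklearn Pipeline containing only the chosen (fitted) steps."""
    return SkPipeline([fitted_pipeline.steps[i] for i in step_indices])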

Add Polars Pipeline

Polars lazy dataframes can be serialised.

Pseudo idea

  1. We need to serialize a lazy frame into a "state".
  2. Remove the "input" and replace it with some special token.
  3. Load new data, take its input, and insert it into the "state".
  • Unclear how to find the exact location, as it can be adjusted.
  • Might deal with selections - remove by default or keep.
  4. result = pl.LazyFrame.read_json(io.BytesIO(json.dumps(state).encode())).lazy()

It could work in theory.
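
A hedged sketch of the round trip - Polars' plan-serialization API has changed across versions, and the serialize/deserialize calls below follow recent Polars rather than the read_json approach above:

import io
import polars as pl

lf = pl.DataFrame({"a": [1, 2, 3]}).lazy().with_columns((pl.col("a") * 2).alias("b"))
state = lf.serialize()                                  # logical plan -> bytes ("state")
restored = pl.LazyFrame.deserialize(io.BytesIO(state))  # bytes -> lazy plan
print(restored.collect())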

Add glx build

Add a glx build command which builds a Docker image for you:

glx build <pipeline path> <image_name>

Typos

Vaex First

Typo: "Vaex is an open-soruce..."
Fix: "Vaex is an open-source"

Typo: "...to allow the extreme flexibility for advance pipeline solutions..."
Fix: "...to allow the extreme flexibility for advanced pipeline solutions..."

Best Practices

Add columns!

Typo: "...for every value tou would want..."
Fix: "...for every value you would want..."

Typo: "...which explain the XGBoost prediction."
Fix: "...which explains the XGBoost prediction."

Typo: "..., prediciton with distance (for confidance) etc,."
Fix: "..., prediction with distance (for confidence) etc,."

DataFrames

Typo: "In production, this allow you do make sure you... and passthrough elements you..."
Fix: "In production, this allows you to make sure you... and pass through elements you..."

Big Data -> Vaex

Typo: "Vaex is excellent for big data - lazy evlaution...
Fix: "Vaex is excellent for big data -- lazy evaluation..."

Variables and description

Typo: "... - any constant you what the backend/frontend could query."
Fix: "... - any constant you want, the backend/frontend could query." -> assuming this is the sentence you were going for

Advance -> Fix: "Advanced"

Complicated pipelines

sklearn_vs_vaex_vs_pyspark.ipynb

Typo: "..., you should give her a rise!"
Fix: "..., you should give her a raise!"

Ensembles with LightGBM, XGBoost, and CatBoost

ensemble_example.ipynb

Typo: "Crazy ensmble logic example"
Fix: "Crazy ensemble logic example"

Data science examples

Vaex Skleran Predictor -> Vaex Sklearn Prediction

Typo: "The predictor can apply any skleran..."
Fix: "The predictor can applly any sklearn..."

LightGBM

lightgbm.ipynb

Typo: "Variebels and description"
Fix: "Variables and description"

Typo: "...which want to assosiate..."
Fix: "...which want to associate..."

Typo: "A greate place..."
Fix: "A great place..."

Add from_onnx

Implement an OnnxPipeline with from_onnx.

  • Might be a lot of work for no value.
  • Great for completeness
  • Would allow pyspark pipelines.

Add MLFlowPipeline

Implement a general MLFlowPipeline and from_mlflow.

  • Much work.
  • Good for completeness.
  • Might be very complicated.
