
What is Goldilox?

Goldilox is a tool that empowers data scientists to ship machine learning solutions to production.

  • This project is under active development; please wait for the first stable version.

For more details, see the documentation.

Key features

  • One line from POC to production
  • Flexible and yet simple
  • Technology agnostic
  • Things you didn't know you wanted:
    • Serialization validation
    • Missing values validation
    • Output validation
    • I/O examples
    • Variables and description queries

Installing

With pip:

$ pip install goldilox

Pandas + Sklearn support

Any Sklearn + Pandas pipeline/transformer/estimator can be turned into a Goldilox pipeline with one line of code, which you can then save and run as a server with the CLI.

Vaex native

Vaex is an open-source big data technology with similar APIs to Pandas.
We use some of Vaex's special sauce to allow extreme flexibility for advanced pipeline solutions while ensuring the tool works on big data.

  • Documentation
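
The "special sauce" is Vaex's expression state: every transformation on a DataFrame is recorded lazily and can be replayed on new data. A minimal sketch of the mechanism (plain Vaex, independent of Goldilox):

import vaex

df = vaex.from_arrays(x=[1, 2, 3])
df["y"] = df["x"] * 2       # a lazy virtual column, nothing is materialized
state = df.state_get()      # capture the recorded transformations

fresh = vaex.from_arrays(x=[10, 20])
fresh.state_set(state)      # replay the same transformations on new data
print(fresh["y"].tolist())  # [20, 40]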

Examples

1. Data science

Sklearn

import pandas as pd
from xgboost.sklearn import XGBClassifier
from goldilox.datasets import load_iris

# Get the data
df, features, target = load_iris()

# modeling
model = XGBClassifier().fit(df[features], df[target])

Vaex

import vaex
from goldilox.datasets import load_iris
from vaex.ml.xgboost import XGBoostModel
import numpy as np

df, features, target = load_iris()
df = vaex.from_pandas(df)

# feature engineering example
df["petal_ratio"] = df["petal_length"] / df["petal_width"]

features.append('petal_ratio')
# modeling
booster = XGBoostModel(
    features=features,
    target=target,
    prediction_name="prediction",
    num_boost_round=500,
)
booster.fit(df)
df = booster.transform(df)

# post modeling processing example 
df["prediction"] = np.around(df["prediction"])
df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

2. Build a production-ready pipeline

  • In one line (-:
from goldilox import Pipeline

# sklearn - When using sklearn, we want to have an example of the raw production query data
pipeline = Pipeline.from_sklearn(model, raw=Pipeline.to_raw(df[features]))

# vaex
pipeline = Pipeline.from_vaex(df)

# Save and load
pipeline.save(<path>)
pipeline = Pipeline.from_file(<path>)
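
With a pipeline in hand, you can run predictions on raw, JSON-like records. A minimal sketch - `inference` is the prediction entry point (see the FAQ below), while exposing the stored I/O example as a `raw` attribute is an assumption here:

raw = pipeline.raw              # assumption: the saved I/O example is exposed as `raw`
print(pipeline.inference(raw))  # predict on a raw production-style record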

3. Deploy

glx serve <path>

[2021-11-16 18:54:44 +0100] [74906] [INFO] Starting gunicorn 20.1.0
[2021-11-16 18:54:44 +0100] [74906] [INFO] Listening at: http://127.0.0.1:5000 (74906)
[2021-11-16 18:54:44 +0100] [74906] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 18:54:44 +0100] [74911] [INFO] Booting worker with pid: 74911
[2021-11-16 18:54:44 +0100] [74911] [INFO] Started server process [74911]
[2021-11-16 18:54:44 +0100] [74911] [INFO] Waiting for application startup.
[2021-11-16 18:54:44 +0100] [74911] [INFO] Application startup complete.
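
Once the server is up, you can query it over HTTP. A sketch - the endpoint path and payload shape below are assumptions, so check the served API docs for the actual route:

# Query the running pipeline server (endpoint path is an assumption).
import requests

record = {"sepal_length": 5.9, "sepal_width": 3.0,
          "petal_length": 5.1, "petal_width": 1.8}
response = requests.post("http://127.0.0.1:5000/inference", json=[record])
print(response.json())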


4. Training: for experiments, cloud training, automations, etc.

With Vaex, you put everything you want to do into a function which receives and returns a Vaex DataFrame:

from vaex.ml.datasets import load_iris
from goldilox import Pipeline


def fit(df):
    from vaex.ml.xgboost import XGBoostModel
    import numpy as np

    # feature engineering example
    df["petal_ratio"] = df["petal_length"] / df["petal_width"]

    # modeling
    booster = XGBoostModel(
        features=['petal_length', 'petal_width', 'sepal_length', 'sepal_width', 'petal_ratio'],
        target='class_',
        prediction_name="prediction",
        num_boost_round=500,
    )
    booster.fit(df)
    df = booster.transform(df)

    # post-modeling processing example
    df['prediction'] = np.around(df['prediction'])
    df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
    return df


df = load_iris()
pipeline = Pipeline.from_vaex(df, fit=fit).fit(df)

With Sklearn, the fit uses the standard X and y.

import pandas as pd
from sklearn.datasets import load_iris
from xgboost.sklearn import XGBClassifier

iris = load_iris()
features = iris.feature_names
df = pd.DataFrame(iris.data, columns=features)
df['target'] = iris.target

# We don't need to provide a raw example if we train via the Goldilox Pipeline - it is taken automatically from the first row.
classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)
pipeline = Pipeline.from_sklearn(classifier).fit(df[features], df['target'])
WARNING: Pipeline doesn't handle na for sepal_length
WARNING: Pipeline doesn't handle na for sepal_width
WARNING: Pipeline doesn't handle na for petal_length
WARNING: Pipeline doesn't handle na for petal_width

We do not handle missing values? Let's fix that!

import sklearn.pipeline
from goldilox.sklearn.transformers import Imputer

classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)

sk_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                         ('classifier', classifier)])

pipeline = Pipeline.from_sklearn(sk_pipeline).fit(df[features], df['target'])
  • We can still deploy a pipeline that doesn't deal with missing values if we want. Other validations, such as serialization and prediction-on-raw, must pass.

Some tools

# Serve model
glx serve <pipeline-path>

# Get the variables straight from the file
glx variables <pipeline-path>

# Get the description straight from the file
glx description <pipeline-path>

# Get the raw data example from the file
glx raw <pipeline-path>

# Get the pipeline requirements
glx freeze <pipeline-path> <path-to-requirements-file-output.txt>

# Update a pipeline file's metadata or variables
glx update <pipeline-path> key value --file --variable

Docker

You can build a docker image from a pipeline.

glx build <pipeline-path> --platform=linux/amd64

MLOps

Export to MLFlow

pipeline.export_mlflow(path, **kwargs)
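
The exported directory can then be loaded like any other MLFlow model. A sketch, assuming export_mlflow writes a standard pyfunc-compatible model:

# Load the exported pipeline as a generic MLFlow pyfunc model and predict.
import mlflow.pyfunc

model = mlflow.pyfunc.load_model(path)  # same `path` passed to export_mlflow
predictions = model.predict(raw_df)     # raw_df: a DataFrame of raw records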

Export to Gunicorn

pipeline.export_gunicorn(path, **kwargs)

FAQ

  1. Why the name "Goldilox"?
    Because with most solutions out there, either you need to do everything from scratch per solution, or you have to take it as it is. We consider ourselves in between: you can do most things, with minimal adjustments.
  2. Why do you work with Vaex and not just Pandas? Vaex handles big data on normal computers, which is our target audience, and we rely heavily on its lazy evaluation, which Pandas doesn't have.
  3. Why do you use "inference" for predictions and not "predict" or "transform"? Sklearn has a standard: "transform" returns a dataframe and "predict" a numpy array. We wanted a separate word for inference while keeping the pipeline compatible with the sklearn standard of fit, transform, and predict.
  4. M1 mac with docker?
    You probably want to use --platform=linux/amd64
  5. How to send arguments to the docker serve?
    Pass them after the pipeline path: docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH <args>
  • Example: docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH --host=0.0.0.0:5000

Contributing

See contributing page.

  • Notebooks can be a great contribution too!

Roadmap

See roadmap page.


goldilox's Issues

Add inference_steps

from_sklearn(..., inference_steps=None)

This will allow a pipeline to use all steps for training while removing some of them for inference.
The main use case is data cleaning during re-fit.

pipeline = Pipeline.from_sklearn(sk_pipeline, inference_steps=[1, 2, 5])

pipeline.fit(X, y)     # uses all steps
pipeline.inference(X)  # uses only steps [1, 2, 5]
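
A hypothetical sketch of the mechanics - `subset_pipeline` below is illustrative only, not a Goldilox API: train the full sklearn pipeline, then build a slimmer pipeline from the selected steps for inference.

from sklearn.pipeline import Pipeline as SkPipeline

def subset_pipeline(fitted_pipeline, step_indices):
    """Return a new sklearn Pipeline containing only the chosen (fitted) steps."""
    return SkPipeline([fitted_pipeline.steps[i] for i in step_indices])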

Add Polars Pipeline

Polars lazy dataframes can be serialised.

Pseudo idea

  1. We need to serialize a lazy frame into a "state".
  2. Remove the "input" and replace it with some special token.
  3. Load new data, take its input, and insert it into the "state".
  • Unclear how to find the exact location, as it can be adjusted.
  • Might deal with selections - remove by default or keep.
  4. result = pl.LazyFrame.read_json(io.BytesIO(json.dumps(state).encode())).lazy()

It could work in theory.
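
A hedged sketch of the round trip - Polars' plan-serialization API has changed across versions, and the serialize/deserialize calls below follow recent Polars rather than the read_json approach above:

import io
import polars as pl

lf = pl.DataFrame({"a": [1, 2, 3]}).lazy().with_columns((pl.col("a") * 2).alias("b"))
state = lf.serialize()                                  # logical plan -> bytes ("state")
restored = pl.LazyFrame.deserialize(io.BytesIO(state))  # bytes -> lazy plan
print(restored.collect())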

Add glx build

Add a glx build command which builds a Docker image for you:

glx build <pipeline path> <image_name>

Typos

Vaex First

Typo: "Vaex is an open-soruce..."
Fix: "Vaex is an open-source"

Typo: "...to allow the extreme flexibility for advance pipeline solutions..."
Fix: "...to allow the extreme flexibility for advanced pipeline solutions..."

Best Practices

Add columns!

Typo: "...for every value tou would want..."
Fix: "...for every value you would want..."

Typo: "...which explain the XGBoost prediction."
Fix: "...which explains the XGBoost prediction."

Typo: "..., prediciton with distance (for confidance) etc,."
Fix: "..., prediction with distance (for confidence) etc,."

DataFrames

Typo: "In production, this allow you do make sure you... and passthrough elements you..."
Fix: "In production, this allows you to make sure you... and pass through elements you..."

Big Data -> Vaex

Typo: "Vaex is excellent for big data - lazy evlaution...
Fix: "Vaex is excellent for big data -- lazy evaluation..."

Variables and description

Typo: "... - any constant you what the backend/frontend could query."
Fix: "... - any constant you want, the backend/frontend could query." -> assuming this is the sentence you were going for

Advance -> Fix: "Advanced"

Complicated pipelines

sklearn_vs_vaex_vs_pyspark.ipynb

Typo: "..., you should give her a rise!"
Fix: "..., you should give her a raise!"

Ensembles with LightGBM, XGBoost, and CatBoost

ensemble_example.ipynb

Typo: "Crazy ensmble logic example"
Fix: "Crazy ensemble logic example"

Data science examples

Vaex Skleran Predictor -> Vaex Sklearn Prediction

Typo: "The predictor can apply any skleran..."
Fix: "The predictor can applly any sklearn..."

LightGBM

lightgbm.ipynb

Typo: "Variebels and description"
Fix: "Variables and description"

Typo: "...which want to assosiate..."
Fix: "...which want to associate..."

Typo: "A greate place..."
Fix: "A great place..."

Add from_onnx

Implement an OnnxPipeline with from_onnx.

  • Might be a lot of work for no value.
  • Great for completeness
  • Would allow pyspark pipelines.

Add MLFlowPipeline

Implement a general MLFlowPipeline and from_mlflow.

  • Much work.
  • Good for completeness.
  • Might be very complicated.
