
ibis-ml's Introduction

IbisML


What is IbisML?

IbisML is a library for building scalable ML pipelines using Ibis.

How do I install IbisML?

pip install ibis-ml

How do I use IbisML?

With recipes, you can define sequences of feature engineering steps to get your data ready for modeling. For example, create a recipe to replace missing values using the mean of each numeric column and then normalize numeric data to have a standard deviation of one and a mean of zero.

import ibis_ml as ml

imputer = ml.ImputeMean(ml.numeric())
scaler = ml.ScaleStandard(ml.numeric())
rec = ml.Recipe(imputer, scaler)

A recipe can be chained in a Pipeline like any other transformer.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("rec", rec), ("svc", SVC())])

The pipeline can be used like any other estimator, and avoids leaking the test set into the train set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train).score(X_test, y_test)

ibis-ml's People

Contributors

deepyaman, gforsyth, indexseek, jcrist, jitingxu1, sfc-gh-twhite, toohsk


ibis-ml's Issues

IbisML - Interactive Mode

After performing a fit and a transform with a recipe, the return value looks like this, even in interactive mode.

In [1]: import ibisml as ml
   ...: from ibis.interactive import *
   ...: 
   ...: con = ibis.duckdb.connect()
   ...: 
   ...: diamonds = con.read_csv(
   ...:     "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv"
   ...: )
   ...: 
   ...: rec = ml.Recipe(ml.CategoricalEncode("cut"))
   ...: rec.fit(diamonds).transform(diamonds)
Out[1]: 
TransformResult:
- Features {
    carat    float64
    color    string
    clarity  string
    depth    float64
    table    float64
    price    int64
    x        float64
    y        float64
    z        float64
    cut      int64
}

It's possible to do rec.fit(diamonds).transform(diamonds).table to see the result, but I wasn't sure whether this is the expected behavior in interactive mode.

Error in `README` snippet. `RecipeTransform` vs `TransformResult`

The README includes the following snippet, which has errors (I added type hints for readability):

import ibis
import ibisml

train = ibis.read_csv(...)

recipe: ibisml.Recipe = ibisml.Recipe()
transform: ibisml.RecipeTransform = recipe.fit(train, outcomes=["outcome_col"])

df_train: pd.DataFrame = transform(train).to_pandas()
X = df_train[transform.features]  # <- ibisml.RecipeTransform doesn't have `.features` or `.outcomes` attributes
y = df_train[transform.outcomes]

The code needs to be refactored to:

recipe: ibisml.Recipe = ibisml.Recipe()
transform: ibisml.RecipeTransform = recipe.fit(train, outcomes=["outcome_col"])

train_fitted: ibisml.TransformResult = transform(train)  # <- ibisml.TransformResult has `.features` and `.outcomes` attributes
df_train: pd.DataFrame = train_fitted.to_pandas()
X = df_train[train_fitted.features]  # <- ibisml.TransformResult has `.features` and `.outcomes` attributes
y = df_train[train_fitted.outcomes]

I don't know the motivations behind separating Recipe, RecipeTransform, and TransformResult. What raises questions is that both the transform and the train_fitted objects need access to train, but only the latter has .features and .outcomes set. Users coming from scikit-learn would expect two "layers" instead of three: a Recipe (with a fitted: bool flag) and some sort of result.

Maybe a relevant distinction for users is that you should store a RecipeTransform or TransformResult when you need matching training/inference pipelines, and a Recipe when you intend to reuse a pipeline across datasets/feature sets/model versions.

I'm familiar with the scikit-learn code conventions (which are not perfect), but the following sections on instantiation, fitting, and estimated attributes might be relevant. Transformer/estimator parameters are only set by __init__ and set_params (essentially the ibisml.Recipe part of the API), and attributes derived from the data all have a trailing underscore. For instance, we would have .features_ derived from train, and .outcomes_ would be received during .fit() instead of __init__.
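
For reference, here is a small self-contained example of the scikit-learn convention described above (parameters set only via __init__/set_params, data-derived attributes with a trailing underscore), using StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)  # parameters are set only via __init__/set_params
X = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler.fit(X)                            # attributes estimated from the data get a trailing underscore
print(scaler.mean_)                      # [2. 3.]
print(scaler.scale_)                     # [1. 1.]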

feat: enable not fitted check for transform_table() and throw proper exception

The current transform_table throws an AttributeError when the step is not fitted.

import ibis
import ibisml as ml
t = ibis.memtable({"a": [1, 2, 3], "b": [2,3,4], "c": [3,4,5]})
step = ml.CountEncode(ml.string())
step.transform_table(t)
----> 1 step.transform_table(t)

File /Users/voltrondata/repos/ibisml/ibisml/steps/encode.py:281, in CountEncode.transform_table(self, table)
    280 def transform_table(self, table: ir.Table) -> ir.Table:
--> 281     for c, value_counts in self.value_counts_.items():
    282         joined = table.left_join(
    283             value_counts, table[c] == value_counts[0], lname="left_{name}", rname=""
    284         )
    285         table = joined.drop(value_counts.columns[0], f"left_{c}").rename(
    286             {c: f"{c}_count"}
    287         )

AttributeError: 'CountEncode' object has no attribute 'value_counts_'

Desired output

Ensure the step is fitted before transform_table runs; otherwise, throw a proper error, such as a NotFittedError.
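
A minimal sketch of what such a guard could look like, using a toy stand-in class rather than the actual ibisml internals (NotFittedError and the attribute check are illustrative assumptions):

class NotFittedError(Exception):
    """Raised when a step is used before being fitted."""

class CountEncodeSketch:
    """Toy stand-in for an ibisml step, showing only the fitted-state guard."""

    def fit_table(self, table):
        self.value_counts_ = {}  # real fitting logic omitted
        return self

    def transform_table(self, table):
        if not hasattr(self, "value_counts_"):
            raise NotFittedError(
                "CountEncode is not fitted yet; call fit_table() before transform_table()."
            )
        return table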

ExpandDate + ExpandTime = ExpandDateTime

I love using the ExpandDate and ExpandTime methods to extract helpful columns quickly. It would be neat if there were an "all-in-one" method to do both date and time attributes.

import ibisml as ml
from ibis.interactive import *

wowah_data = ex.wowah_data_raw.fetch()

recipe = ml.Recipe(ml.ExpandDate("timestamp"), ml.ExpandTime("timestamp"))

recipe.fit(wowah_data).transform(wowah_data).table

feat: Polynomial features transformation

Definition

Create a feature matrix by including polynomial combinations of the input features up to the specified degree. For instance, for a two-dimensional input [x1, x2], the degree-2 polynomial features would be [1, x1, x2, x1^2, x1*x2, x2^2]. It could be split into two steps: polynomial features and interaction features.
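
A minimal sketch of the degree-2 case expressed directly with Ibis, assuming two numeric columns x1 and x2 (this is only an illustration of the transformation, not an existing ibis-ml step):

import ibis

t = ibis.memtable({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Degree-2 polynomial terms for [x1, x2]: x1^2, x1*x2 (interaction), x2^2
poly = t.mutate(
    x1_sq=t.x1**2,
    x1_x2=t.x1 * t.x2,
    x2_sq=t.x2**2,
)
print(poly.execute())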


chore: rename IbisML/`ibisml` to Ibis-ML/`ibis-ml`

Rename the library ahead of release, to avoid it being autocorrected to (or mispronounced as) "abysmal." πŸ˜‚

Will require removing the existing package, as the package name ibis-ml is too similar to the existing one to be published on PyPI.

Push release 0.2.0 to PyPI

I was hitting an error in ibisml.steps with the import of ibis.expr.deferred.Deferred.

I see that this was fixed in the GitHub release 0.2.0 from November 2023. However, the latest version on PyPI is 0.1.0 from September 2023.

bug(steps): handle col with all nulls in impute

All of the impute steps need to handle columns that are entirely null. We need to tell the user the column is all nulls, either by failing the impute or by throwing a warning.

I prefer failing the impute.

import ibis
import ibisml as ml
import numpy as np

a = ibis.memtable(
    {"all_null": np.array([np.nan, np.nan, np.nan], dtype="float64")}
)
step = ml.ImputeMean(ml.numeric())
step.fit_table(a, ml.core.Metadata())
step.transform_table(a)
┏━━━━━━━━━━┓
┃ all_null ┃
┑━━━━━━━━━━┩
β”‚ float64  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚      nan β”‚
β”‚      nan β”‚
β”‚      nan β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
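
A minimal sketch of the kind of check that could run at fit time, written in plain Ibis rather than the actual ibis-ml internals, assuming we prefer to fail on an all-null column:

import ibis

t = ibis.memtable(
    {"all_null": [None, None, None]},
    schema=ibis.schema({"all_null": "float64"}),
)

# Fail the impute if the column has no non-null values to compute a mean from.
if not t.all_null.notnull().any().execute():
    raise ValueError("Cannot impute mean for column 'all_null': all values are null.")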

feat: preprocessing transformation priorities

Building upon the deliverables outlined in issue #19, the objective is to enhance the coverage of ibisml machine learning preprocessing transformations, prioritizing key areas for improvement.

Please share your favorite ML transformation for your daily ML tasks and provide additional context as to why you find it particularly useful.

Assumption

  • Raw feature creation is done using ibis
  • tabular data

Priority definition:

  • P0: Essential tasks vital for model development; required before our initial release.

  • P1: Desirable tasks that can enhance the model, based on feedback and further optimization.

  • P2: Additional tasks aimed at improving the model, based on feedback and further optimization.

Priorities

| Preprocessing Module | Ibis-ML Step | sklearn | Priority | Status | Note | Model Needed |
| --- | --- | --- | --- | --- | --- | --- |
| Encoding | CategoricalEncode | OrdinalEncoder | P0 | Done | | |
| Encoding | CountEncode | | P1 | Done | | |
| Feature Engineering | CreatePolynomialFeatures | PolynomialFeatures | P0 | Done | | |
| Non-linear Transformation | Math transformation (log, sqrt, ...) | | P1 | Done | ibis | |
| Standardization and Scaling | ScaleStandard | StandardScaler | P0 | Done | | KNN, MLP-based, SVM |
| Encoding | TargetEncode | TargetEncoder | P0 | Done | | |
| Feature Reduction | DropZeroVariance | VarianceThreshold | P0 | Done | | |
| Imputing | HandleUnivariateOutliers | SimpleImputer | P0 | Done | | |
| Feature Engineering | Ratio variable creation | | P0 | Done | ibis | |
| Discretization | DiscretizeKBins | KBinsDiscretizer | P0 | Done | | |
| Discretization | Feature binarization | Binarizer | P1 | Done | | |
| Standardization and Scaling | ScaleMinMax | MinMaxScaler | P0 | Done | | KNN, MLP-based, SVM |
| Custom Transformer | Custom transform | FunctionTransformer | P0 | Done | | |
| Encoding | OneHotEncode | OneHotEncoder | P0 | Done | | |
| Imputing | Outlier - impute and capping | | P0 | Done | | Log/linear reg |
| Feature Reduction | Continuous target mutual info | | P1 | Not started | | |
| Feature Reduction | Discrete target mutual information | | P1 | Not started | | |
| Feature Engineering - Text | Count transformer | CountVectorizer | P2 | Not started | | |
| Feature Engineering - Text | TFIDF transformer | TfidfTransformer | P2 | Not started | | |
| Encoding | Label binarizer | LabelBinarizer | P2 | Not started | | |
| Encoding | Label encode | LabelEncoder | P2 | Not started | | |
| Standardization and Scaling | MaxAbsScaler | MaxAbsScaler | P2 | Not started | | |
| Standardization and Scaling | RobustScaler | RobustScaler | P1 | Not started | | KNN, MLP-based, SVM |
| Imputing | Missing value - nearest neighbor | KNNImputer | P1 | Not started | Doable | |
| Non-linear Transformation | QuantileTransformer | QuantileTransformer | P1 | Not started | | |
| Non-linear Transformation | Inverse and logit transformation | | P2 | Not started | | |
| Imputing | Missing value - linear reg | | P1 | Not started | Not supported | |
| Imputing | Missing value - bagged trees | | P1 | Not started | Not supported | |
| Feature Reduction | Filter columns with missing rate threshold | | P1 | Not started | | |
| Feature Reduction | Filter features by high correlation | | P2 | Not started | Doable | |
| Non-linear Transformation | PowerTransformer | PowerTransformer | P1 | Not started | | MLP-based, SVM |
| Feature Reduction | PCA | | P1 | Not started | Not supported | |
| Imputing | Missing value - rolling window imputing | | P2 | Not started | | |
| Feature Engineering | Spline transformer | SplineTransformer | P1 | Not started | | |


feat(steps): handle unknown category for all encoders

Unknown categories are currently ignored in the encoding implementations. While we should consider adding an option to handle them in the future, it's not a high priority at the moment.

Open an issue to record this for future consideration.

The current implementations:

  • CategoricalEncode will convert unknown category to None
  • OneHotEncode will convert all encoded columns to 0; see the following example.
  • CountEncode will convert unknown category to 0

For example:

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res

AMZN in the fifth row is unknown, so it is translated to all 0s:

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
β”‚ timestamp               β”‚ int8        β”‚ int8        β”‚ int8        β”‚ int8        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚           0 β”‚           0 β”‚           1 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.038 β”‚           0 β”‚           0 β”‚           1 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚           0 β”‚           1 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚           0 β”‚           1 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.050 β”‚           0 β”‚           0 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.051 β”‚           0 β”‚           0 β”‚           0 β”‚           1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

feat: alternative OHE using vectors

Background:

Instead of a full SQL translation of the one-hot encoding algorithm, he was envisioning it more as a backend-registered function, which would probably be more performant.

That requires SQL metaprogramming to first enumerate the list of unique values in the columns and then craft the CREATE statement.

Additionally, it pigeonholes one-hot encoding into returning separate columns per value, whereas many frameworks (e.g. Spark MLlib, if I remember correctly) return something like a 2D vector in lieu of those columns, which can be handled much more efficiently from a memory/hardware perspective.

Some responses:

When fitting a one-hot-encoder we already have to collect all the cases so they're consistent across all applications of transform.
The only difference here would be whether a one-hot-encoder should return a column-per-case or a column of an array of cases. I'd argue that since the consuming tooling will want a flat array, not special casing one-hot-encoding (for now) and returning a column-per-case is the correct approach.
Also note - ibisml already has a OneHotEncode step that does all this.
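
For comparison, a minimal sketch of the array-valued alternative discussed above, written in plain Ibis (not an existing ibis-ml step) and assuming the category list was already collected at fit time:

import ibis

t = ibis.memtable({"ticker": ["GOOG", "MSFT", "AAPL", "MSFT"]})
categories = ["AAPL", "GOOG", "MSFT"]  # collected during fit

# One column holding a dense 0/1 vector instead of one column per category.
encoded = t.mutate(
    ticker_vec=ibis.array([(t.ticker == c).cast("int8") for c in categories])
)
print(encoded.execute())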

doc: ibisml for sklearn/tidymodels users tutorial sessions

It's really important to make sure that ibisml is user-friendly for people who are already familiar with scikit-learn and tidymodels. Adding coverage of scikit-learn and tidymodels preprocessing in the documentation and tutorials will make it easier for Python and R users to start using ibisml.

In this tutorial or documentation, we should cover the following points:

  • Mapping scikit-learn and tidymodels preprocessing techniques directly to ibisml equivalents.
  • Providing examples of ibisml preprocessing, which can be taken from the API reference documentation.
  • [Optional] Demonstrating how to integrate ibisml preprocessing with scikit-learn and tidymodels. This was partly covered in the main tutorial, but we can expand on it if needed.

Also, we need to combine this with #30.

bug(bigquery): invalid field name Cast(y, int64) in bigquery

See this Ibis issue for details.
I'm not 100% sure whether this will cause other errors, but one potential fix is to remove the ibis.memtable() here. I do see that we use memtable in other places; we need to double-check and test that usage elsewhere.

To reproduce the error in ibis-ml:

import ibis
from ibis import _
con = ibis.bigquery.connect(project_id="xxx", dataset_id="xxx")
t = ibis.memtable({
    "x": ["a", "b"],
    "y": ["0", "1"]
})
con.create_table(
    "t", t.to_pyarrow(), overwrite=True
)

t = con.table("t")
t = t.mutate(
    x1=ibis.literal("c"),
    y=_.y.cast("int32"),
)

x = t.drop("y")
y = t.y.cast("int64")

import ibis_ml as ml

step = ml.TargetEncode(["x"])
table, targets, index = ml.core.normalize_table(x, y)
print(targets)
metadata = ml.core.Metadata(targets=targets)
step.fit_table(table, metadata)
step.transform_table(x)

error:

BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/voltrondata-demo/jobs?uploadType=multipart:
 Invalid field name "Cast(y, int64)_194474".
 Fields must contain the allowed characters, 
and be at most 300 characters long. 
For allowed characters, please refer to https://cloud.google.com/bigquery/docs/schemas#column_names
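
Not a confirmed fix, but one possible user-side workaround is to give the cast expression an explicit, valid name before handing it to ibis-ml, so the generated field name isn't "Cast(y, int64)":

# Possible workaround (untested against BigQuery): name the cast explicitly.
y = t.y.cast("int64").name("y")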

Repr for `Recipe` and `Step`

We used to have nicer custom reprs for each Step and Recipe, but these were dropped in the refactor. Now that we're going for scikit-learn compatibility, it would be nice if our repr matched their conventions. We could accomplish this by subclassing Recipe from BaseEstimator, but currently we don't have sklearn as a required dependency (since we want to support non-sklearn workflows as well). My guess is there's a way we can have a repr that follows sklearn's conventions without relying on sklearn itself for the implementation.
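
A rough sketch of one way to get an sklearn-flavored repr without depending on sklearn, assuming each Step can report its constructor arguments (the class and attribute names here are illustrative, not the actual ibis-ml internals):

class Step:
    """Hypothetical base: subclasses record their constructor args in self._args."""

    def __repr__(self):
        args = ", ".join(repr(a) for a in getattr(self, "_args", ()))
        return f"{type(self).__name__}({args})"

class ImputeMean(Step):
    def __init__(self, inputs):
        self._args = (inputs,)

class Recipe:
    def __init__(self, *steps):
        self.steps = steps

    def __repr__(self):
        inner = ", ".join(repr(s) for s in self.steps)
        return f"Recipe({inner})"

print(Recipe(ImputeMean("numeric()")))  # Recipe(ImputeMean('numeric()'))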

Docstrings + Examples + Help Get Started

There are some pretty well-documented functions among the step functions. Is there another area where you would like to add some docstrings? I'm unsure how many people would need to reference any of the inner functions, or whether they would need to see docstrings for the transform functions.

We could add a small "getting started" section, a "why IbisML?" page, or an "IbisML for scikit-learn users" guide.

Happy to mock up a few small examples. :)

bug(steps): CategoricalEncode - rows containing NULL value in the encoded column will be relocated to the bottom

Rows with null values are shuffled to the bottom by the CategoricalEncode step. This might pose an issue, since certain rows end up grouped together when building the model, especially if the user forgets to shuffle them again.

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", None, "MSFT", "MSFT", None, "AAPL", "GOOG", None],
...         }
...     )

>>> step = ml.CategoricalEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_train)

Original table:

In [18]: t_train
Out[18]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ time                    ┃ ticker ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
β”‚ timestamp               β”‚ string β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚ GOOG   β”‚
β”‚ 2016-05-25 13:30:00.023 β”‚ NULL   β”‚
β”‚ 2016-05-25 13:30:00.030 β”‚ MSFT   β”‚
β”‚ 2016-05-25 13:30:00.041 β”‚ MSFT   β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚ NULL   β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚ AAPL   β”‚
β”‚ 2016-05-25 13:30:00.072 β”‚ GOOG   β”‚
β”‚ 2016-05-25 13:30:00.075 β”‚ NULL   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Transformed table; rows containing NULL values in the encoded column are relocated to the bottom:

In [17]: res
Out[17]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ time                    ┃ ticker ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
β”‚ timestamp               β”‚ int64  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚      1 β”‚
β”‚ 2016-05-25 13:30:00.030 β”‚      2 β”‚
β”‚ 2016-05-25 13:30:00.041 β”‚      2 β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚      0 β”‚
β”‚ 2016-05-25 13:30:00.072 β”‚      1 β”‚
β”‚ 2016-05-25 13:30:00.023 β”‚   NULL β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚   NULL β”‚
β”‚ 2016-05-25 13:30:00.075 β”‚   NULL β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

feat: Target Encoder implementation

Definition

Target encoding involves replacing categorical feature values with a numeric representation derived from the target variable. This method aims to capture the relationship between categorical features and the target variable by encoding categories with their respective impact on the target.

Use case

  • High-cardinality features

Requirements

  • Regression, binary, and multiclass classification.
  • Handle overfitting.
  • Handle unknown categories, i.e. new categories not present in the training dataset.
  • Handle missing values.

Implementation

Fit

  • Treat missing values
  • Use the mean value of the target variable for that category (regression tasks)
  • Use the conditional probability of the target given that category (classification tasks)
  • Handle overfitting
    • [recommended] Smoothing (see the sketch after the Issues list below)
    • K-fold target encoder
    • Leave-one-out
    • Adding Gaussian noise

Transform

  • Unknown category

Issues

  • Overfitting
  • Target leakage
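
A minimal sketch of the smoothing option recommended in the Fit section, written with plain Ibis group-by aggregation (illustrative only; m is a hypothetical smoothing weight, not an ibis-ml parameter):

import ibis

t = ibis.memtable({
    "color": ["red", "red", "blue", "blue", "blue"],
    "target": [1, 0, 1, 1, 0],
})

m = 2.0  # smoothing weight: larger values pull category means toward the global mean
global_mean = float(t.target.mean().execute())

# Smoothed encoding: (n * category_mean + m * global_mean) / (n + m)
enc = t.group_by("color").aggregate(
    n=t.target.count(),
    cat_mean=t.target.mean(),
)
enc = enc.mutate(color_encoded=(enc.n * enc.cat_mean + m * global_mean) / (enc.n + m))
print(enc.execute())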


docs: make website launch-ready

https://ibis-project.github.io/ibis-ml/

  • [P0] Installation
  • [P0] Have a getting started example in the documentation (nycflights13 demo)
  • [P0] Why IbisML?
  • [P1] Concepts
  • [P1] How-to guides (e.g. how to write your own Step, how to use IbisML without scikit-learn, inverse transform (when available?), how to use arbitrary Ibis expressions, etc.?) (see #92)

feat: Outlier - Impute and capping

Definition

Impute or cap/floor the outliers of numeric features by percentile or a user-defined threshold.

Examples:

Apply caps and floors to each column; if a value is greater than or equal to the 99th percentile, replace it with the value at the 99th percentile, and if it is less than or equal to the 1st percentile, replace it with the value at the 1st percentile.
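
A minimal sketch of that capping logic in plain Ibis, assuming the percentile bounds are computed at fit time (illustrative only, not the actual ibis-ml implementation):

import ibis

t = ibis.memtable({"amount": [1.0, 2.0, 3.0, 4.0, 1000.0]})

# Fit: compute the capping bounds once from the training data.
lo = float(t.amount.quantile(0.01).execute())
hi = float(t.amount.quantile(0.99).execute())

# Transform: floor values below the 1st percentile and cap values above the 99th.
capped = t.mutate(amount=t.amount.clip(lower=lo, upper=hi))
print(capped.execute())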


bug(steps): `OneHotEncode` treats NaNs differently

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res.to_pandas()
                     time  ticker_AAPL  ticker_GOOG  ticker_MSFT  ticker_None
0 2016-05-25 13:30:00.023          0.0          0.0          1.0            0
1 2016-05-25 13:30:00.038          0.0          0.0          1.0            0
2 2016-05-25 13:30:00.048          0.0          1.0          0.0            0
3 2016-05-25 13:30:00.049          0.0          1.0          0.0            0
4 2016-05-25 13:30:00.050          0.0          0.0          0.0            0
5 2016-05-25 13:30:00.051          NaN          NaN          NaN            1

bug(steps): treat null values as proper categories

Any null value is replaced with the target mean, whereas the docstring says that nulls are treated as a separate category. More from my conversation with @jitingxu1:

[screenshot of the conversation]

Adding a test like:

    def test_target_encode_null():
        t = ibis.memtable(
            {
                "Color": ["Red"] * 5 + [None] * 3,
                "Target": [1, 1, 0, 0, 0, 1, 0, 0],
            }
        )
        expected = pd.DataFrame(
            {
                "Color": [2 / 5] * 5 + [1 / 3] * 3,
            }
        )
    
        step = ml.TargetEncode("Color")
        step.fit_table(t, ml.core.Metadata(targets=("Target",)))
        res = step.transform_table(t)
        tm.assert_frame_equal(res.to_pandas()[expected.columns], expected)

Will result in a failure like:

E   AssertionError: DataFrame.iloc[:, 0] (column name="Color") are different
E   
E   DataFrame.iloc[:, 0] (column name="Color") values are different (37.5 %)
E   [index]: [0, 1, 2, 3, 4, 5, 6, 7]
E   [left]:  [0.4, 0.4, 0.4, 0.4, 0.4, 0.375, 0.375, 0.375]
E   [right]: [0.4, 0.4, 0.4, 0.4, 0.4, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
E   At positional index 5, first diff: 0.375 != 0.3333333333333333

May be somewhat related to #73.

Possible implementation

Filling null values with a special indicator value first could work. It's usually more efficient to just check for nulls in the join condition (ref: https://bertwagner.com/posts/joining-on-nulls/).
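
A minimal sketch of the null-aware join condition idea, in plain Ibis (illustrative only, not the actual ibis-ml code):

import ibis

t = ibis.memtable({"color": ["red", None, "blue"]})
lookup = ibis.memtable({"color": ["red", "blue", None], "encoded": [0.4, 0.5, 0.33]})

# Match on equality OR on both sides being null, so nulls keep their own encoding.
joined = t.left_join(
    lookup,
    (t.color == lookup.color) | (t.color.isnull() & lookup.color.isnull()),
)
print(joined.execute())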

Is it (or will it be) possible to use ibisml within sklearn pipelines?

Very happy to find this project, thanks for creating this library! Basically, I'm looking for a way to write something resembling a scikit-learn pipeline to train an ML model which can be executed on BigQuery (even if only partly). I'm also using BQML for some use cases, but it has some limitations that I'm currently jumping back into Python for, which has some barriers/constraints in my current context.

From the ibisml readme, it looks to me like the difference between an ibisml recipe and a sklearn pipeline is that the ibisml recipe only encompasses feature transformations, while an sklearn pipeline includes feature transformations and a model at the end. Is that a fair characterization?

From my perspective, what makes sklearn pipelines so useful is that they encompass both the feature transformations and the model, which can be jointly fit on a training dataset, cross-validated on an eval dataset, then refit on a full dataset and used to predict on new data. That whole workflow seems to get more complicated if I have to manage separate objects for my feature-transform recipe and model.

But maybe it should be possible to write an sklearn pipeline which contains both ibisml feature transformations and Python models at the end? In that case, if the compute for the feature transformations is still farmed out to BigQuery and the underlying Python environment/compute resource only has to run the clf.fit or clf.predict part, that would still be very exciting to me.

Thanks again!

Expand test coverage

Right now the test coverage is pretty poor, mostly covering the core operations and a smattering of simple Step bits. We should:

  • Enable test coverage metrics (possibly using codecov, no strong thoughts)
  • Expand test coverage to at least cover the fit/transform bits of all steps

feat: support direct numpy output

The current to_numpy() in ibisml goes through a pandas DataFrame: it converts the Ibis table to a pandas DataFrame, then to NumPy. This is not efficient.

Some backends, like DuckDB, can convert a query result directly to a NumPy array.


We had some initial discussion in Ibis triage; we decided to hold off until we encounter a specific use case before proceeding further.
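
For reference, a rough sketch of what a DuckDB-specific fast path could look like, assuming access to the backend's raw DuckDB connection (exposed here as con.con, an internal attribute) and DuckDB's fetchnumpy():

import ibis

con = ibis.duckdb.connect()
t = con.create_table("t", ibis.memtable({"a": [1.0, 2.0, 3.0]}))

# Bypass pandas: hand the compiled SQL to DuckDB and fetch NumPy arrays directly.
sql = str(ibis.to_sql(t))
arrays = con.con.execute(sql).fetchnumpy()  # con.con is the raw DuckDB connection (internal)
print(arrays["a"])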

Applying custom transforms

What is the plan for allowing custom transforms for feature engineering to be applied?

These can range from something like:

def is_high_roller(order_value: float) -> bool:
    return order_value > 1000.0

to more complex cases where multiple successive transforms are required:

import json

def survey_features(raw_json: str) -> dict:
    return json.loads(raw_json)["survey"]

def one_hot_encode(survey_features: dict) -> list:
    return ...  # takes in survey features and one-hot encodes them...

Invariably not everything is translatable to SQL, so you need some runtime to execute things with...

Note: I drive Hamilton, which could have an integration with Ibis...

interest in vector abstractions?

I have these existing vector abstractions in ibis:

"""Vector operations"""
from __future__ import annotations

from typing import Literal, TypeVar

import ibis
import ibis.expr.types as ir


@ibis.udf.scalar.builtin(name="array_sum")
def _array_sum(a) -> float:
    ...


# for duckdb
@ibis.udf.scalar.builtin(name="list_dot_product")
def _array_dot_product(a, b) -> float:
    ...


T = TypeVar("T", ir.MapValue, ir.ArrayValue)


def dot(a: T, b: T) -> ir.FloatingValue:
    """Compute the dot product of two vectors

    The vectors can either be dense vectors, represented as array<numeric>,
    or sparse vectors, represented as map<any_type, numeric>.
    Both vectors must be of the same type though.

    Parameters
    ----------
    a :
        The first vector.
    b :
        The second vector.

    Returns
    -------
    FloatingValue
        The dot product of the two vectors.

    Examples
    --------
    >>> import ibis
    >>> from mismo._compute import dot
    >>> v1 = ibis.array([1, 2])
    >>> v2 = ibis.array([4, 5])
    >>> dot(v1, v2)
    14.0  # 1*4 + 2*5
    >>> m1 = ibis.map({"a": 1, "b": 2})
    >>> m2 = ibis.map({"b": 3, "c": 4})
    >>> dot(m1, m2)
    6.0  # 2*3
    """
    if isinstance(a, ir.ArrayValue) and isinstance(b, ir.ArrayValue):
        a_vals = a
        b_vals = b
    elif isinstance(a, ir.MapValue) and isinstance(b, ir.MapValue):
        keys = a.keys().intersect(b.keys())
        a_vals = keys.map(lambda k: a[k])
        b_vals = keys.map(lambda k: b[k])
    else:
        raise ValueError(f"Unsupported types {type(a)} and {type(b)}")
    return _array_dot_product(a_vals, b_vals)


def norm(arr: T, metric: Literal["l1", "l2"] = "l2") -> T:
    """Normalize a vector to have unit length.

    The vector can either be a dense vector, represented as array<numeric>,
    or a sparse vector, represented as map<any_type, numeric>.
    The returned vector will have the same type as the input vector.

    Parameters
    ----------
    arr :
        The vector to normalize.
    metric : {"l1", "l2"}, default "l2"
        The metric to use. "l1" for Manhattan distance, "l2" for Euclidean distance.

    Returns
    -------
    ArrayValue
        The normalized vector.

    Examples
    --------
    >>> import ibis
    >>> ibis.options.interactive = True
    >>> from mismo._compute import norm
    >>> norm(ibis.array([1, 2]))
    [0.4472135954999579, 0.8944271909999159]
    >>> norm(ibis.array([1, 2]), metric="l1")
    [0.3333333333333333, 0.6666666666666666]
    >>> norm(ibis.map({"a": 1, "b": 2}))
    {"a": 0.4472135954999579, "b": 0.8944271909999159}
    """
    if isinstance(arr, ir.ArrayValue):
        vals = arr
    elif isinstance(arr, ir.MapValue):
        vals = arr.values()
    else:
        raise ValueError(f"Unsupported type {type(arr)}")

    if metric == "l1":
        denom = _array_sum(vals)
    elif metric == "l2":
        denom = _array_sum(vals.map(lambda x: x**2)).sqrt()
    else:
        raise ValueError(f"Unsupported norm {metric}")
    normed_vals = vals.map(lambda x: x / denom)

    if isinstance(arr, ir.ArrayValue):
        return normed_vals
    else:
        return ibis.map(arr.keys(), normed_vals)

I also have tests for these. Are you interested in these abstractions? Should we try to use them more throughout this lib? I can submit a PR if so.

feat: support more parts of end-to-end ML workflow

Objectives

TL;DR

Start at the "Alternatives considered" section.

Constraints

  • Ibis-ML will focus on enabling data processing workloads for ML on tabular data
  • Ibis-ML will be a standalone extension lib that depends on Ibis
  • Excludes domain-specific preprocessing like NLP, computer vision, and large language models
  • Does not address exploratory data analysis (EDA) or model training-related procedures

Mapping the landscape

Data processing for ML is a broad area. We need a strategy to differentiate our value and narrow the scope down to where we can provide immediate value.

Breaking down an end-to-end ML pipeline

Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.

[diagram: high-level depiction of an end-to-end ML pipeline]
Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline

The article also describes each step of the pipeline. Based on the previously-established constraints, we will limit ourselves to the data preparation and model training components.

The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:

  • Feature creation: Creating new features from existing ones or combining different features to create a new one.
  • Feature publishing: Pushing to a feature store to be used for training and inference by the entire organization.
  • Training dataset generation: Constructing training data by (if necessary, retrieving, and) joining features.
  • Data segregation: Splitting data into training, testing, and validation sets.
  • Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html
  • Hyperparameter tuning: https://scikit-learn.org/stable/modules/grid_search.html
  • Feature preprocessing
    • Feature standardization/normalization: Converting the feature values to a similar scale and distribution. Usually falls under model preprocessing.
    • Feature cleaning: Treating missing feature values and removing outliers by capping/flooring them based on code implementation.
  • Feature selection: Select the most appropriate features to be cleaned and engineered. A number of automated algorithms exist.

Note

The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.

Feature comparison (WIP)

| | Tecton | Scikit-learn | BigQuery ML | NVTabular | Dask-ML | Ray |
| --- | --- | --- | --- | --- | --- | --- |
| Feature creation | βœ… | ❌ | ❌ | Partial | ❌ | |
| Feature publishing | βœ… | ❌ | Partial | ❌ | ❌ | |
| Training dataset generation | βœ… | ❌ | βœ… | ❌ | ❌ | |
| Data segregation | ❌ | βœ… | Partial | ❌ | βœ… | |
| Cross validation | ❌ | βœ… | ❌ | ❌ | βœ… | |
| Hyperparameter tuning | ❌ | βœ… | βœ… | ❌ | βœ… | |
| Feature preprocessing | ❌ | βœ… | βœ… | βœ… | βœ… | |
| Feature selection | ❌ | βœ… | ❌ | ❌ | ❌ | |
| Model training | ❌ | βœ… | βœ… | ❌ | βœ… | |
| Feature serving | βœ… | ❌ | Partial | ❌ | ❌ | |
Details

Tecton

Scikit-learn

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: Yes
  • Model training: Yes
  • Feature serving: No

BigQuery ML

NVTabular

  • Feature creation: Partial
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: No
  • Cross validation: No
  • Hyperparameter tuning: No
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: No
  • Feature serving: No

Dask-ML

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: Yes
  • Feature serving: No

Ray

  • Feature creation:
  • Feature publishing:
  • Training dataset generation:
  • Data segregation:
  • Cross validation:
  • Hyperparameter tuning:
  • Feature preprocessing:
  • Feature selection:
  • Model training:
  • Feature serving:

Ibis-ML product hypotheses

Scope

  • A library needs to solve a sufficiently-large problem to be adopted widely. To this end, we want to provide value in multiple stages of the ML pipeline.
  • (Domain-driven) feature engineering is already handled sufficiently well by Ibis. As with other tools that are part of an ecosystem that already supports data transformation (e.g. BigQuery, Dask), we leave feature engineering to the existing tooling (i.e. Ibis).
  • Feature publishing, retrieval, and serving are orthogonal and can be left to feature platforms.
  • Model training can't be done by a database, unless it controls the underlying (cloud) infrastructure and can treat it as another distributed compute problem (e.g. in the case of BigQuery ML, Snowpark ML). Ibis doesn't control the underlying compute infrastructure.
  • In practice, hyperparameter tuning in industry/large companies is often delegated to purpose-fit tools like Optuna.
  • The remainder is (potentially) in scope.
    • Data segregation and cross validation are required in almost every tabular ML problem.
      • @jcrist "Re: test/train splits and CV things - I do think we can provide some utilities in the initial release for handling test/train splitting or CV work, but for now I suggest we mostly focus on single model pipelines and already partitioned data. We ingest an ibis.Table as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
    • Feature preprocessing is a good fit for Ibis to provide value. Technically, a user with a model pipeline (e.g. scikit-learn) may already include that in their pipeline, so they may or may not leverage this.
    • (Automated) feature selection is more case-by-case, and therefore a lower priority.

Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:

  1. Ibis-ML supports fitting data preprocessing steps (during the training process) and applying pre-trained Ibis-ML preprocessing steps (during inference).
    • Pros: Ibis-ML is used during both the training and inference process
    • Cons: Ibis-ML only supports data preprocessing, and even then a subset of steps that can be fit in database (e.g. not some very widely-used steps like PCA, that fit in the middle of the data-preprocessing pipeline)
  2. Ibis-ML supports constructing transformers from a wider range of pre-trained preprocessors and models (from other libraries, like scikit-learn), and applying them across backends (during inference).
    • Pros: Ibis-ML gives users the ability to apply a much wider range of steps in the ML process during inference time, including preprocessing steps that can be fit linearly (e.g. PCA) and even linear models (e.g. SGDRegressor, GLMClassifier). You can even showcase the end-to-end capabilities just using Ibis (from raw data to model outputs, all on your database, across streaming and batch, powered by Ibis)
    • Cons: Ibis-ML doesn't support training the preprocessors on multiple backends; the expectation is that you use a dedicated library/existing local tools for training
  3. A combination of 1 & 2, where Ibis-ML supports a wider range of pre-processing steps and models, but only a subset support a fit method (those that don't would need to be constructed via .from_sklearn() or something).
    • Pros: Support the wider range of operations, and also fitting everything on the database in simple cases.
    • Cons: Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified. @jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.

Proposal

I propose to go with option #3 of the alternatives considered. In practice, this will mean:

  • Keeping the existing structure of Ibis-ML
  • Adding the ability to construct transforms from_sklearn (and, in the future, potentially other libraries)
    • Some of the transforms may not be steps you can fit using Ibis-ML

This also means that the following will be out of scope (at least, for now):

  • Train-test split (may have value to add in the future)
  • CV (may have value to add in the future)
  • Hyperparameter tuning (less hypothesized value; probably better to integrate with existing frameworks like Optuna)

Deliverables

Guiding principles

  • At launch, we should showcase an end-to-end Ibis-ML workflow, from preprocessing to model inference.
    • The goal is to get people excited about Ibis-ML, and for them to try the example(s).
  • In future releases, we will increase the number of methods we support for each step. If we are successful in the first sub-goal (about generating excitement), people in the community will provide direction for and even contribute to this effort.
  • The library should be incrementally adoptable. The user should get benefit out of using just Ibis data segregation or feature preprocessing, and then they should be able to move on to adding another piece, and get further value.

Demo workflows

  1. Fit preprocessing on DuckDB (local experience, during experimentation)
  2. Experiment with different features
  3. Fit finalized preprocessing on larger dataset (e.g. from BigQuery)
  4. Perform inference on larger dataset

We are currently targeting the NVTabular demo on RecSys2020 Challenge as a demo workflow.

We need variants for all of:

  • scikit-learn (we already have)
  • XGBoost
  • PyTorch

With less priority:

  • LightGBM
  • Tensorflow
  • CatBoost

High-level deliverables

P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).

  • [P0] Support handoff to XGBoost (for training and inference). Update: to_dmatrix/to_dask_dmatrix are already implemented
  • [P0] Support handoff to PyTorch (for training and inference)
  • [P0] Build demo workflows
  • [P0] Make documentation ready for "initial" release
  • [P0] Increase coverage of Ibis-ML preprocessing steps w.r.t. tidymodels
  • [P1] Increase coverage of data processing transformer(s) from_sklearn
  • [P2] Increase coverage of model prediction transformer(s) from_sklearn (i.e. those with predict functions that don't require UDFs)
  • [P2] Support handoff to LightGBM (for training and inference)
  • [P2] Support handoff to Tensorflow (for training and inference)
  • [P2] Support handoff to CatBoost (for training and inference)
  • [P2] Support (demo?) inference in streaming contexts
  • [P3] Support constructing some data preprocessing transformer(s) from_sklearn (e.g. PCA, or some more frequently used step)
  • [P3] Support constructing some (linear) model prediction transformer(s) from_sklearn (e.g. SGDRegressor)

Questions for validation

  • Does being able to perform inference for certain model classes directly on the database, without UDF, provide real value? Are models with linear predict too narrow a category for people to care?

Changelog

2024-03-19

Based on discussion around the Ibis-ML use cases and vision with stakeholders, some of the priorities have shifted:

  • Ibis-ML should leverage the underlying engine during both training and inference, and speeding up the training iteration loop on big data is a key value proposition. Therefore, support for constructing transformers from_sklearn is no longer a priority; it moves from P0 to P3.
    • The associated demo of scaling inference only is also removed.
  • Increasing coverage of ML preprocessing steps w.r.t. tidymodels recipes and sklearn.preprocessing is a higher priority. We break down the relative priority of implementing steps in a separate issue.

feat: K-bins discretization

Definition

Map numeric columns into k bins.

Possible cutting strategies:

  • Uniform: All bins in each feature maintain identical widths (see the sketch after this list).

  • Quantile: Each feature's bins contain an equal number of data points.

  • K-means: Values within each bin share the same nearest center of a one-dimensional k-means cluster. [SQL backend may not support this]
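
A minimal sketch of the uniform strategy in plain Ibis, assuming the bin edges are computed from the training data at fit time (illustrative only, not the actual ibis-ml step):

import ibis

t = ibis.memtable({"x": [1.0, 2.0, 5.0, 9.0, 10.0]})
k = 4  # number of bins

# Fit: compute the bin width from the training data.
lo = float(t.x.min().execute())
hi = float(t.x.max().execute())
width = (hi - lo) / k

# Transform: uniform-width bins 0..k-1, clipping the maximum value into the last bin.
binned = t.mutate(x_bin=((t.x - lo) / width).floor().cast("int64").clip(upper=k - 1))
print(binned.execute())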


docs(website): document custom step implementation

Once the Step interface solidifies a bit more, we may want to document how users can implement their own Step subclass. We probably won't want to document it for external usage for a while (the interface is still in flux and not really user-facing), but docs on the internals may be useful before then for aiding contributions.
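
For contributors, a rough sketch of what a user-defined step might look like under the current fit_table/transform_table interface; the base class import (shown as ml.Step) and the exact hook signatures are assumptions, not documented API:

import ibis.expr.types as ir
import ibis_ml as ml

class LogTransform(ml.Step):  # assumes a public Step base class; the import path may differ
    """Hypothetical custom step: natural log of one numeric column."""

    def __init__(self, column: str):
        self.column = column

    def fit_table(self, table: ir.Table, metadata: ml.core.Metadata) -> None:
        pass  # nothing to learn from the data for a stateless transform

    def transform_table(self, table: ir.Table) -> ir.Table:
        return table.mutate(**{self.column: table[self.column].log()})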
