
ibis-ml's Introduction

IbisML


What is IbisML?

IbisML is a library for building scalable ML pipelines using Ibis.

How do I install IbisML?

pip install ibis-ml

How do I use IbisML?

With recipes, you can define sequences of feature engineering steps to get your data ready for modeling. For example, create a recipe to replace missing values using the mean of each numeric column and then normalize numeric data to have a standard deviation of one and a mean of zero.

import ibis_ml as ml

imputer = ml.ImputeMean(ml.numeric())
scaler = ml.ScaleStandard(ml.numeric())
rec = ml.Recipe(imputer, scaler)

A recipe can be chained in a Pipeline like any other transformer.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("rec", rec), ("svc", SVC())])

The pipeline can be used like any other estimator, and avoids leaking the test set into the train set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train).score(X_test, y_test)

ibis-ml's People

Contributors

deepyaman, gforsyth, indexseek, jcrist, jitingxu1, sfc-gh-twhite, toohsk


ibis-ml's Issues

IbisML - Interactive Mode

After performing a fit and a transform with a recipe, the return value looks like this, even in interactive mode.

In [1]: import ibisml as ml
   ...: from ibis.interactive import *
   ...: 
   ...: con = ibis.duckdb.connect()
   ...: 
   ...: diamonds = con.read_csv(
   ...:     "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv"
   ...: )
   ...: 
   ...: rec = ml.Recipe(ml.CategoricalEncode("cut"))
   ...: rec.fit(diamonds).transform(diamonds)
Out[1]: 
TransformResult:
- Features {
    carat    float64
    color    string
    clarity  string
    depth    float64
    table    float64
    price    int64
    x        float64
    y        float64
    z        float64
    cut      int64
}

It's possible to do rec.fit(diamonds).transform(diamonds).table to see the result, but I wasn't sure whether this is the expected behavior in interactive mode.

Error in `README` snippet. `RecipeTransform` vs `TransformResult`

The README includes the following snippet, which has errors (I added type hints for readability):

import ibis
import ibisml

train = ibis.read_csv(...)

recipe: ibisml.Recipe = ibisml.Recipe()
transform: ibisml.RecipeTransform = recipe.fit(train, outcomes=["outcome_col"])

df_train: pd.DataFrame = transform(train).to_pandas()
X = df_train[transform.features]  # <- ibisml.RecipeTransform doesn't have `.features` or `.outcomes` attributes
y = df_train[transform.outcomes]

The code needs to be refactored to:

recipe: ibisml.Recipe = ibisml.Recipe()
transform: ibisml.RecipeTransform = recipe.fit(train, outcomes=["outcome_col"])

train_fitted: ibisml.TransformResult = transform(train)  # <- ibisml.TransformResult has `.features` and `.outcomes` attributes
df_train: pd.DataFrame = train_fitted.to_pandas()
X = df_train[train_fitted.features]  # <- ibisml.TransformResult has `.features` and `.outcomes` attributes
y = df_train[train_fitted.outcomes]

I don't know the motivations behind separating Recipe, RecipeTransform, and TransformResult. What raises questions is that both the transform and the train_fitted objects need access to train, but only the latter has .features and .outcomes set. Users coming from scikit-learn would expect two "layers" instead of three: a Recipe (with a fitted: bool flag) and some sort of result.

Maybe a relevant distinction for users is that you should store a RecipeTransform or TransformResult when you need matching training/inference pipelines, and a Recipe when you intend to reuse a pipeline across datasets/feature sets/model versions.

I'm familiar with the scikit-learn code conventions (which are not perfect), but the following sections on instantiation, fitting, and estimated attributes might be relevant. Transformer/estimator parameters are only set by __init__ and set_params (essentially the ibisml.Recipe part of the API), and attributes derived from the data all have a trailing underscore. For instance, we would have .features_ derived from train, and .outcomes_ would be received during .fit() instead of __init__.
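
For reference, here is a small self-contained example of the scikit-learn convention described above (parameters set only via __init__/set_params, data-derived attributes with a trailing underscore), using StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)  # parameters are set only via __init__/set_params
X = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler.fit(X)                            # attributes estimated from the data get a trailing underscore
print(scaler.mean_)                      # [2. 3.]
print(scaler.scale_)                     # [1. 1.]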

feat: enable not fitted check for transform_table() and throw proper exception

The current transform_table throws an AttributeError when the step is not fitted.

import ibis
import ibisml as ml
t = ibis.memtable({"a": [1, 2, 3], "b": [2,3,4], "c": [3,4,5]})
step = ml.CountEncode(ml.string())
step.transform_table(t)
----> 1 step.transform_table(t)

File /Users/voltrondata/repos/ibisml/ibisml/steps/encode.py:281, in CountEncode.transform_table(self, table)
    280 def transform_table(self, table: ir.Table) -> ir.Table:
--> 281     for c, value_counts in self.value_counts_.items():
    282         joined = table.left_join(
    283             value_counts, table[c] == value_counts[0], lname="left_{name}", rname=""
    284         )
    285         table = joined.drop(value_counts.columns[0], f"left_{c}").rename(
    286             {c: f"{c}_count"}
    287         )

AttributeError: 'CountEncode' object has no attribute 'value_counts_'

Desired output

Ensure the step is fitted before transform_table runs; otherwise, throw a proper error, such as a NotFittedError.
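
A minimal sketch of what such a guard could look like, using a toy stand-in class rather than the actual ibisml internals (NotFittedError and the attribute check are illustrative assumptions):

class NotFittedError(Exception):
    """Raised when a step is used before being fitted."""

class CountEncodeSketch:
    """Toy stand-in for an ibisml step, showing only the fitted-state guard."""

    def fit_table(self, table):
        self.value_counts_ = {}  # real fitting logic omitted
        return self

    def transform_table(self, table):
        if not hasattr(self, "value_counts_"):
            raise NotFittedError(
                "CountEncode is not fitted yet; call fit_table() before transform_table()."
            )
        return table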

ExpandDate + ExpandTime = ExpandDateTime

I love using the ExpandDate and ExpandTime methods to extract helpful columns quickly. It would be neat if there were an "all-in-one" method to do both date and time attributes.

import ibisml as ml
from ibis.interactive import *

wowah_data = ex.wowah_data_raw.fetch()

recipe = ml.Recipe(ml.ExpandDate("timestamp"), ml.ExpandTime("timestamp"))

recipe.fit(wowah_data).transform(wowah_data).table

feat: Polynomial features transformation

Definition

Create a feature matrix by including polynomial combinations of the input features up to the specified degree. For instance, for a two-dimensional input [x1, x2], the degree-2 polynomial features would be [1, x1, x2, x1^2, x1*x2, x2^2]. It could be split into two steps: polynomial features and interaction features.
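
A minimal sketch of the degree-2 case expressed directly with Ibis, assuming two numeric columns x1 and x2 (this is only an illustration of the transformation, not an existing ibis-ml step):

import ibis

t = ibis.memtable({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Degree-2 polynomial terms for [x1, x2]: x1^2, x1*x2 (interaction), x2^2
poly = t.mutate(
    x1_sq=t.x1**2,
    x1_x2=t.x1 * t.x2,
    x2_sq=t.x2**2,
)
print(poly.execute())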


chore: rename IbisML/`ibisml` to Ibis-ML/`ibis-ml`

Rename the library ahead of release, to avoid it being autocorrected to (or mispronounced as) "abysmal." πŸ˜‚

Will require removing the existing package, as the package name ibis-ml is too similar to the existing one to be published on PyPI.

Push release 0.2.0 to PyPI

I was hitting an error in ibisml.steps with the import of ibis.expr.deferred.Deferred.

I see that this was fixed in the GitHub release 0.2.0 from November 2023. However, the latest version on PyPI is 0.1.0 from September 2023.

bug(steps): handle col with all nulls in impute

All of the impute steps need to handle columns that are entirely null. We need to tell the user the column is all nulls, either by failing the impute or by throwing a warning.

I prefer failing the impute.

import ibis
import ibisml as ml
import numpy as np

a = ibis.memtable(
    {"all_null": np.array([np.nan, np.nan, np.nan], dtype="float64")}
)
step = ml.ImputeMean(ml.numeric())
step.fit_table(a, ml.core.Metadata())
step.transform_table(a)
┏━━━━━━━━━━┓
┃ all_null ┃
┑━━━━━━━━━━┩
β”‚ float64  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚      nan β”‚
β”‚      nan β”‚
β”‚      nan β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
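
A minimal sketch of the kind of check that could run at fit time, written in plain Ibis rather than the actual ibis-ml internals, assuming we prefer to fail on an all-null column:

import ibis

t = ibis.memtable(
    {"all_null": [None, None, None]},
    schema=ibis.schema({"all_null": "float64"}),
)

# Fail the impute if the column has no non-null values to compute a mean from.
if not t.all_null.notnull().any().execute():
    raise ValueError("Cannot impute mean for column 'all_null': all values are null.")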

feat: preprocessing transformation priorities

Building upon the deliverables outlined in issue #19, the objective is to enhance the coverage of ibisml machine learning preprocessing transformations, prioritizing key areas for improvement.

Please share your favorite ML transformation for your daily ML tasks and provide additional context as to why you find it particularly useful.

Assumption

  • Raw feature creation is done using ibis
  • tabular data

Priority definition:

  • P0: Essential tasks vital for model development; required before our initial release.

  • P1: Desirable tasks that can enhance the model, based on feedback and further optimization.

  • P2: Additional tasks aimed at improving the model, based on feedback and further optimization.

Priorities

| Preprocessing Module | Ibis-ML Step | sklearn | Priority | Status | Note | Model Needed |
| --- | --- | --- | --- | --- | --- | --- |
| Encoding | CategoricalEncode | OrdinalEncoder | P0 | Done | | |
| Encoding | CountEncode | | P1 | Done | | |
| Feature Engineering | CreatePolynomialFeatures | PolynomialFeatures | P0 | Done | | |
| Non-linear Transformation | Math transformation (log, sqrt, ...) | | P1 | Done | ibis | |
| Standardization and Scaling | ScaleStandard | StandardScaler | P0 | Done | | KNN, MLP-based, SVM |
| Encoding | TargetEncode | TargetEncoder | P0 | Done | | |
| Feature Reduction | DropZeroVariance | VarianceThreshold | P0 | Done | | |
| Imputing | HandleUnivariateOutliers | SimpleImputer | P0 | Done | | |
| Feature Engineering | Ratio variable creation | | P0 | Done | ibis | |
| Discretization | DiscretizeKBins | KBinsDiscretizer | P0 | Done | | |
| Discretization | Feature binarization | Binarizer | P1 | Done | | |
| Standardization and Scaling | ScaleMinMax | MinMaxScaler | P0 | Done | | KNN, MLP-based, SVM |
| Custom Transformer | Custom transform | FunctionTransformer | P0 | Done | | |
| Encoding | OneHotEncode | OneHotEncoder | P0 | Done | | |
| Imputing | Outlier - impute and capping | | P0 | Done | | Log/linear reg |
| Feature Reduction | Continuous target mutual info | | P1 | Not started | | |
| Feature Reduction | Discrete target mutual information | | P1 | Not started | | |
| Feature Engineering - Text | Count transformer | CountVectorizer | P2 | Not started | | |
| Feature Engineering - Text | TFIDF transformer | TfidfTransformer | P2 | Not started | | |
| Encoding | Label binarizer | LabelBinarizer | P2 | Not started | | |
| Encoding | Label encode | LabelEncoder | P2 | Not started | | |
| Standardization and Scaling | MaxAbsScaler | MaxAbsScaler | P2 | Not started | | |
| Standardization and Scaling | RobustScaler | RobustScaler | P1 | Not started | | KNN, MLP-based, SVM |
| Imputing | Missing value - nearest neighbor | KNNImputer | P1 | Not started | Doable | |
| Non-linear Transformation | QuantileTransformer | QuantileTransformer | P1 | Not started | | |
| Non-linear Transformation | Inverse and logit transformation | | P2 | Not started | | |
| Imputing | Missing value - linear reg | | P1 | Not started | Not supported | |
| Imputing | Missing value - bagged trees | | P1 | Not started | Not supported | |
| Feature Reduction | Filter columns with missing rate threshold | | P1 | Not started | | |
| Feature Reduction | Filter features by high correlation | | P2 | Not started | Doable | |
| Non-linear Transformation | PowerTransformer | PowerTransformer | P1 | Not started | | MLP-based, SVM |
| Feature Reduction | PCA | | P1 | Not started | Not supported | |
| Imputing | Missing value - rolling window imputing | | P2 | Not started | | |
| Feature Engineering | Spline transformer | SplineTransformer | P1 | Not started | | |


feat(steps): handle unknown category for all encoders

Unknown categories are currently ignored in the encoding implementations. While we should consider adding an option to handle them in the future, it's not a high priority at the moment.

Open an issue to record this for future consideration.

The current implementations:

  • CategoricalEncode will convert unknown category to None
  • OneHotEncode will convert all encoded columns to 0; see the following example.
  • CountEncode will convert unknown category to 0

For example:

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res

AMZN in the fifth row is unknown, so it is translated to all 0s:

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
β”‚ timestamp               β”‚ int8        β”‚ int8        β”‚ int8        β”‚ int8        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚           0 β”‚           0 β”‚           1 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.038 β”‚           0 β”‚           0 β”‚           1 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚           0 β”‚           1 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚           0 β”‚           1 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.050 β”‚           0 β”‚           0 β”‚           0 β”‚           0 β”‚
β”‚ 2016-05-25 13:30:00.051 β”‚           0 β”‚           0 β”‚           0 β”‚           1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

feat: alternative OHE using vectors

Background:

Instead of a full SQL translation of the one-hot encoding algorithm, he was envisioning it more as a backend-registered function, which would probably be more performant.

That requires SQL metaprogramming to first enumerate the list of unique values in the columns and then craft the CREATE statement.

Additionally, it pigeonholes one-hot encoding into returning separate columns per value, whereas many frameworks (e.g. Spark MLlib, if I remember correctly) return something like a 2D vector in lieu of those columns, which can be handled much more efficiently from a memory/hardware perspective.

Some responses:

When fitting a one-hot-encoder we already have to collect all the cases so they're consistent across all applications of transform.
The only difference here would be whether a one-hot-encoder should return a column-per-case or a column of an array of cases. I'd argue that since the consuming tooling will want a flat array, not special casing one-hot-encoding (for now) and returning a column-per-case is the correct approach.
Also note - ibisml already has a OneHotEncode step that does all this.
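
For comparison, a minimal sketch of the array-valued alternative discussed above, written in plain Ibis (not an existing ibis-ml step) and assuming the category list was already collected at fit time:

import ibis

t = ibis.memtable({"ticker": ["GOOG", "MSFT", "AAPL", "MSFT"]})
categories = ["AAPL", "GOOG", "MSFT"]  # collected during fit

# One column holding a dense 0/1 vector instead of one column per category.
encoded = t.mutate(
    ticker_vec=ibis.array([(t.ticker == c).cast("int8") for c in categories])
)
print(encoded.execute())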

doc: ibisml for sklearn/tidymodels users tutorial sessions

It's really important to make sure that ibisml is user-friendly for people who are already familiar with scikit-learn and tidymodels. Adding coverage of scikit-learn and tidymodels preprocessing in the documentation and tutorials will make it easier for Python and R users to start using ibisml.

In this tutorial or documentation, we should cover the following points:

  • Mapping scikit-learn and tidymodels preprocessing techniques directly to ibisml equivalents.
  • Providing examples of ibisml preprocessing, which can be taken from the API reference documentation.
  • [Optional] Demonstrating how to integrate ibisml preprocessing with scikit-learn and tidymodels. This was partly covered in the main tutorial, but we can expand on it if needed.

Also, we need to combine this with #30.

bug(bigquery): invalid field name Cast(y, int64) in bigquery

See this Ibis issue for details.
I'm not 100% sure whether this will cause other errors, but one potential fix is to remove the ibis.memtable() here. I do see that we use memtable in other places; we need to double-check and test that usage elsewhere.

To reproduce the error in ibis-ml:

import ibis
from ibis import _
con = ibis.bigquery.connect(project_id="xxx", dataset_id="xxx")
t = ibis.memtable({
    "x": ["a", "b"],
    "y": ["0", "1"]
})
con.create_table(
    "t", t.to_pyarrow(), overwrite=True
)

t = con.table("t")
t = t.mutate(
    x1=ibis.literal("c"),
    y=_.y.cast("int32"),
)

x = t.drop("y")
y = t.y.cast("int64")

import ibis_ml as ml

step = ml.TargetEncode(["x"])
table, targets, index = ml.core.normalize_table(x, y)
print(targets)
metadata = ml.core.Metadata(targets=targets)
step.fit_table(table, metadata)
step.transform_table(x)

error:

BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/voltrondata-demo/jobs?uploadType=multipart:
 Invalid field name "Cast(y, int64)_194474".
 Fields must contain the allowed characters, 
and be at most 300 characters long. 
For allowed characters, please refer to https://cloud.google.com/bigquery/docs/schemas#column_names
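
Not a confirmed fix, but one possible user-side workaround is to give the cast expression an explicit, valid name before handing it to ibis-ml, so the generated field name isn't "Cast(y, int64)":

# Possible workaround (untested against BigQuery): name the cast explicitly.
y = t.y.cast("int64").name("y")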

Repr for `Recipe` and `Step`

We used to have nicer custom reprs for each Step and Recipe, but these were dropped in the refactor. Now that we're going for scikit-learn compatibility, it would be nice if our repr matched their conventions. We could accomplish this by subclassing Recipe from BaseEstimator, but currently we don't have sklearn as a required dependency (since we want to support non-sklearn workflows as well). My guess is there's a way we can have a repr that follows sklearn's conventions without relying on sklearn itself for the implementation.
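
A rough sketch of one way to get an sklearn-flavored repr without depending on sklearn, assuming each Step can report its constructor arguments (the class and attribute names here are illustrative, not the actual ibis-ml internals):

class Step:
    """Hypothetical base: subclasses record their constructor args in self._args."""

    def __repr__(self):
        args = ", ".join(repr(a) for a in getattr(self, "_args", ()))
        return f"{type(self).__name__}({args})"

class ImputeMean(Step):
    def __init__(self, inputs):
        self._args = (inputs,)

class Recipe:
    def __init__(self, *steps):
        self.steps = steps

    def __repr__(self):
        inner = ", ".join(repr(s) for s in self.steps)
        return f"Recipe({inner})"

print(Recipe(ImputeMean("numeric()")))  # Recipe(ImputeMean('numeric()'))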

Docstrings + Examples + Help Get Started

There are some pretty well-documented functions among the step functions. Is there another area where you would like to add some docstrings? I'm unsure how many people would need to reference any of the inner functions, or whether they would need to see docstrings for the transform functions.

We could add a small "getting started" section, a "why IbisML?" page, or an "IbisML for scikit-learn users" guide.

Happy to mock up a few small examples. :)

bug(steps): CategoricalEncode - rows containing NULL value in the encoded column will be relocated to the bottom

Rows with null values are shuffled to the bottom by the CategoricalEncode step. This might pose an issue, since certain rows end up grouped together when building the model, especially if the user forgets to shuffle them again.

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", None, "MSFT", "MSFT", None, "AAPL", "GOOG", None],
...         }
...     )

>>> step = ml.CategoricalEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_train)

Original table:

In [18]: t_train
Out[18]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ time                    ┃ ticker ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
β”‚ timestamp               β”‚ string β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚ GOOG   β”‚
β”‚ 2016-05-25 13:30:00.023 β”‚ NULL   β”‚
β”‚ 2016-05-25 13:30:00.030 β”‚ MSFT   β”‚
β”‚ 2016-05-25 13:30:00.041 β”‚ MSFT   β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚ NULL   β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚ AAPL   β”‚
β”‚ 2016-05-25 13:30:00.072 β”‚ GOOG   β”‚
β”‚ 2016-05-25 13:30:00.075 β”‚ NULL   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Transformed table; rows containing NULL values in the encoded column are relocated to the bottom:

In [17]: res
Out[17]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ time                    ┃ ticker ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
β”‚ timestamp               β”‚ int64  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2016-05-25 13:30:00.023 β”‚      1 β”‚
β”‚ 2016-05-25 13:30:00.030 β”‚      2 β”‚
β”‚ 2016-05-25 13:30:00.041 β”‚      2 β”‚
β”‚ 2016-05-25 13:30:00.049 β”‚      0 β”‚
β”‚ 2016-05-25 13:30:00.072 β”‚      1 β”‚
β”‚ 2016-05-25 13:30:00.023 β”‚   NULL β”‚
β”‚ 2016-05-25 13:30:00.048 β”‚   NULL β”‚
β”‚ 2016-05-25 13:30:00.075 β”‚   NULL β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

feat: Target Encoder implementation

Definition

Target encoding involves replacing categorical feature values with a numeric representation derived from the target variable. This method aims to capture the relationship between categorical features and the target variable by encoding categories with their respective impact on the target.

Use case

  • High-cardinality features

Requirements

  • Regression, binary, and multiclass classification.
  • Handle overfitting.
  • Handle unknown categories, i.e. new categories not present in the training dataset.
  • Handle missing values.

Implementation

Fit

  • Treat missing values
  • Use the mean value of the target variable for that category (regression tasks)
  • Use the conditional probability of the target given that category (classification tasks)
  • Handle overfitting
    • [recommended] Smoothing (see the sketch after the Issues list below)
    • K-fold target encoder
    • Leave-one-out
    • Adding Gaussian noise

Transform

  • Unknown category

Issues

  • Overfitting
  • Target leakage
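
A minimal sketch of the smoothing option recommended in the Fit section, written with plain Ibis group-by aggregation (illustrative only; m is a hypothetical smoothing weight, not an ibis-ml parameter):

import ibis

t = ibis.memtable({
    "color": ["red", "red", "blue", "blue", "blue"],
    "target": [1, 0, 1, 1, 0],
})

m = 2.0  # smoothing weight: larger values pull category means toward the global mean
global_mean = float(t.target.mean().execute())

# Smoothed encoding: (n * category_mean + m * global_mean) / (n + m)
enc = t.group_by("color").aggregate(
    n=t.target.count(),
    cat_mean=t.target.mean(),
)
enc = enc.mutate(color_encoded=(enc.n * enc.cat_mean + m * global_mean) / (enc.n + m))
print(enc.execute())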


docs: make website launch-ready

https://ibis-project.github.io/ibis-ml/

  • [P0] Installation
  • [P0] Have a getting started example in the documentation (nycflights13 demo)
  • [P0] Why IbisML?
  • [P1] Concepts
  • [P1] How-to guides (e.g. how to write your own Step, how to use IbisML without scikit-learn, inverse transform (when available?), how to use arbitrary Ibis expressions, etc.?) (see #92)

feat: Outlier - Impute and capping

Definition

Impute or cap/floor the outliers of numeric features by percentile or a user-defined threshold.

Examples:

Apply caps and floors to each column; if a value is greater than or equal to the 99th percentile, replace it with the value at the 99th percentile, and if it is less than or equal to the 1st percentile, replace it with the value at the 1st percentile.
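
A minimal sketch of that capping logic in plain Ibis, assuming the percentile bounds are computed at fit time (illustrative only, not the actual ibis-ml implementation):

import ibis

t = ibis.memtable({"amount": [1.0, 2.0, 3.0, 4.0, 1000.0]})

# Fit: compute the capping bounds once from the training data.
lo = float(t.amount.quantile(0.01).execute())
hi = float(t.amount.quantile(0.99).execute())

# Transform: floor values below the 1st percentile and cap values above the 99th.
capped = t.mutate(amount=t.amount.clip(lower=lo, upper=hi))
print(capped.execute())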


bug(steps): `OneHotEncode` treats NaNs differently

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res.to_pandas()
                     time  ticker_AAPL  ticker_GOOG  ticker_MSFT  ticker_None
0 2016-05-25 13:30:00.023          0.0          0.0          1.0            0
1 2016-05-25 13:30:00.038          0.0          0.0          1.0            0
2 2016-05-25 13:30:00.048          0.0          1.0          0.0            0
3 2016-05-25 13:30:00.049          0.0          1.0          0.0            0
4 2016-05-25 13:30:00.050          0.0          0.0          0.0            0
5 2016-05-25 13:30:00.051          NaN          NaN          NaN            1

bug(steps): treat null values as proper categories

Any null value is replaced with the target mean, whereas the docstring says that nulls are treated as a separate category. More from my conversation with @jitingxu1:

[screenshot of the conversation]

Adding a test like:

    def test_target_encode_null():
        t = ibis.memtable(
            {
                "Color": ["Red"] * 5 + [None] * 3,
                "Target": [1, 1, 0, 0, 0, 1, 0, 0],
            }
        )
        expected = pd.DataFrame(
            {
                "Color": [2 / 5] * 5 + [1 / 3] * 3,
            }
        )
    
        step = ml.TargetEncode("Color")
        step.fit_table(t, ml.core.Metadata(targets=("Target",)))
        res = step.transform_table(t)
        tm.assert_frame_equal(res.to_pandas()[expected.columns], expected)

Will result in a failure like:

E   AssertionError: DataFrame.iloc[:, 0] (column name="Color") are different
E   
E   DataFrame.iloc[:, 0] (column name="Color") values are different (37.5 %)
E   [index]: [0, 1, 2, 3, 4, 5, 6, 7]
E   [left]:  [0.4, 0.4, 0.4, 0.4, 0.4, 0.375, 0.375, 0.375]
E   [right]: [0.4, 0.4, 0.4, 0.4, 0.4, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
E   At positional index 5, first diff: 0.375 != 0.3333333333333333

May be somewhat related to #73.

Possible implementation

Filling null values with a special indicator value first could work. It's usually more efficient to just check for nulls in the join condition (ref: https://bertwagner.com/posts/joining-on-nulls/).
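
A minimal sketch of the null-aware join condition idea, in plain Ibis (illustrative only, not the actual ibis-ml code):

import ibis

t = ibis.memtable({"color": ["red", None, "blue"]})
lookup = ibis.memtable({"color": ["red", "blue", None], "encoded": [0.4, 0.5, 0.33]})

# Match on equality OR on both sides being null, so nulls keep their own encoding.
joined = t.left_join(
    lookup,
    (t.color == lookup.color) | (t.color.isnull() & lookup.color.isnull()),
)
print(joined.execute())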

Is it (or will it be) possible to use ibisml within sklearn pipelines?

Very happy to find this project, thanks for creating this library! Basically, I'm looking for a way to write something resembling a scikit-learn pipeline to train an ML model which can be executed on BigQuery (even if only partly). I'm also using BQML for some use cases, but it has some limitations that I'm currently jumping back into Python for, which has some barriers/constraints in my current context.

From the ibisml readme, it looks to me like the difference between an ibisml recipe and a sklearn pipeline is that the ibisml recipe only encompasses feature transformations, while an sklearn pipeline includes feature transformations and a model at the end. Is that a fair characterization?

From my perspective, what makes sklearn pipelines so useful is that they encompass both the feature transformations and the model, which can be jointly fit on a training dataset, cross-validated on an eval dataset, then refit on a full dataset and used to predict on new data. That whole workflow seems to get more complicated if I have to manage separate objects for my feature-transform recipe and model.

But maybe it should be possible to write an sklearn pipeline which contains both ibisml feature transformations and Python models at the end? In that case, if the compute for the feature transformations is still farmed out to BigQuery and the underlying Python environment/compute resource only has to run the clf.fit or clf.predict part, that would still be very exciting to me.

Thanks again!

Expand test coverage

Right now the test coverage is pretty poor, mostly covering the core operations and a smattering of simple Step bits. We should:

  • Enable test coverage metrics (possibly using codecov, no strong thoughts)
  • Expand test coverage to at least cover the fit/transform bits of all steps

feat: support direct numpy output

The current to_numpy() in ibisml goes through a pandas DataFrame: it converts the Ibis table to a pandas DataFrame, then to NumPy. This is not efficient.

Some backends, like DuckDB, can convert a query result directly to a NumPy array.


We had some initial discussion in Ibis triage; we decided to hold off until we encounter a specific use case before proceeding further.
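
For reference, a rough sketch of what a DuckDB-specific fast path could look like, assuming access to the backend's raw DuckDB connection (exposed here as con.con, an internal attribute) and DuckDB's fetchnumpy():

import ibis

con = ibis.duckdb.connect()
t = con.create_table("t", ibis.memtable({"a": [1.0, 2.0, 3.0]}))

# Bypass pandas: hand the compiled SQL to DuckDB and fetch NumPy arrays directly.
sql = str(ibis.to_sql(t))
arrays = con.con.execute(sql).fetchnumpy()  # con.con is the raw DuckDB connection (internal)
print(arrays["a"])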

Applying custom transforms

What is the plan for allowing custom transforms for feature engineering to be applied?

These can range from something like:

def is_high_roller(order_value: float) -> bool:
    return order_value > 1000.0

to more complex cases where multiple successive transforms are required:

import json

def survey_features(raw_json: str) -> dict:
    return json.loads(raw_json)["survey"]

def one_hot_encode(survey_features: dict) -> list:
    return ...  # takes in survey features and one-hot encodes them...

Invariably not everything is translatable to SQL, so you need some runtime to execute things with...

Note: I drive Hamilton, which could have an integration with Ibis...

interest in vector abstractions?

I have these existing vector abstractions in ibis:

"""Vector operations"""
from __future__ import annotations

from typing import Literal, TypeVar

import ibis
import ibis.expr.types as ir


@ibis.udf.scalar.builtin(name="array_sum")
def _array_sum(a) -> float:
    ...


# for duckdb
@ibis.udf.scalar.builtin(name="list_dot_product")
def _array_dot_product(a, b) -> float:
    ...


T = TypeVar("T", ir.MapValue, ir.ArrayValue)


def dot(a: T, b: T) -> ir.FloatingValue:
    """Compute the dot product of two vectors

    The vectors can either be dense vectors, represented as array<numeric>,
    or sparse vectors, represented as map<any_type, numeric>.
    Both vectors must be of the same type though.

    Parameters
    ----------
    a :
        The first vector.
    b :
        The second vector.

    Returns
    -------
    FloatingValue
        The dot product of the two vectors.

    Examples
    --------
    >>> import ibis
    >>> from mismo._compute import dot
    >>> v1 = ibis.array([1, 2])
    >>> v2 = ibis.array([4, 5])
    >>> dot(v1, v2)
    14.0  # 1*4 + 2*5
    >>> m1 = ibis.map({"a": 1, "b": 2})
    >>> m2 = ibis.map({"b": 3, "c": 4})
    >>> dot(m1, m2)
    6.0  # 2*3
    """
    if isinstance(a, ir.ArrayValue) and isinstance(b, ir.ArrayValue):
        a_vals = a
        b_vals = b
    elif isinstance(a, ir.MapValue) and isinstance(b, ir.MapValue):
        keys = a.keys().intersect(b.keys())
        a_vals = keys.map(lambda k: a[k])
        b_vals = keys.map(lambda k: b[k])
    else:
        raise ValueError(f"Unsupported types {type(a)} and {type(b)}")
    return _array_dot_product(a_vals, b_vals)


def norm(arr: T, metric: Literal["l1", "l2"] = "l2") -> T:
    """Normalize a vector to have unit length.

    The vector can either be a dense vector, represented as array<numeric>,
    or a sparse vector, represented as map<any_type, numeric>.
    The returned vector will have the same type as the input vector.

    Parameters
    ----------
    arr :
        The vector to normalize.
    metric : {"l1", "l2"}, default "l2"
        The metric to use. "l1" for Manhattan distance, "l2" for Euclidean distance.

    Returns
    -------
    ArrayValue
        The normalized vector.

    Examples
    --------
    >>> import ibis
    >>> ibis.options.interactive = True
    >>> from mismo._compute import norm
    >>> norm(ibis.array([1, 2]))
    [0.4472135954999579, 0.8944271909999159]
    >>> norm(ibis.array([1, 2]), metric="l1")
    [0.3333333333333333, 0.6666666666666666]
    >>> norm(ibis.map({"a": 1, "b": 2}))
    {"a": 0.4472135954999579, "b": 0.8944271909999159}
    """
    if isinstance(arr, ir.ArrayValue):
        vals = arr
    elif isinstance(arr, ir.MapValue):
        vals = arr.values()
    else:
        raise ValueError(f"Unsupported type {type(arr)}")

    if metric == "l1":
        denom = _array_sum(vals)
    elif metric == "l2":
        denom = _array_sum(vals.map(lambda x: x**2)).sqrt()
    else:
        raise ValueError(f"Unsupported norm {metric}")
    normed_vals = vals.map(lambda x: x / denom)

    if isinstance(arr, ir.ArrayValue):
        return normed_vals
    else:
        return ibis.map(arr.keys(), normed_vals)

I also have tests for these. Are you interested in these abstractions? Should we try to use them more throughout this lib? I can submit a PR if so.

feat: support more parts of end-to-end ML workflow

Objectives

TL;DR

Start at the "Alternatives considered" section.

Constraints

  • Ibis-ML will focus on enabling data processing workloads for ML on tabular data
  • Ibis-ML will be a standalone extension lib that depends on Ibis
  • Excludes domain-specific preprocessing like NLP, computer vision, and large language models
  • Does not address exploratory data analysis (EDA) or model training-related procedures

Mapping the landscape

Data processing for ML is a broad area. We need a strategy to differentiate our value and narrow the scope down to where we can provide immediate value.

Breaking down an end-to-end ML pipeline

Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.

[diagram: high-level depiction of an end-to-end ML pipeline]
Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline

The article also describes each step of the pipeline. Based on the previously-established constraints, we will limit ourselves to the data preparation and model training components.

The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:

  • Feature creation: Creating new features from existing ones or combining different features to create a new one.
  • Feature publishing: Pushing to a feature store to be used for training and inference by the entire organization.
  • Training dataset generation: Constructing training data by (if necessary, retrieving, and) joining features.
  • Data segregation: Splitting data into training, testing, and validation sets.
  • Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html
  • Hyperparameter tuning: https://scikit-learn.org/stable/modules/grid_search.html
  • Feature preprocessing
    • Feature standardization/normalization: Converting the feature values to a similar scale and distribution. Usually falls under model preprocessing.
    • Feature cleaning: Treating missing feature values and removing outliers by capping/flooring them based on code implementation.
  • Feature selection: Select the most appropriate features to be cleaned and engineered. A number of automated algorithms exist.

Note

The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.

Feature comparison (WIP)

| | Tecton | Scikit-learn | BigQuery ML | NVTabular | Dask-ML | Ray |
| --- | --- | --- | --- | --- | --- | --- |
| Feature creation | βœ… | ❌ | ❌ | Partial | ❌ | |
| Feature publishing | βœ… | ❌ | Partial | ❌ | ❌ | |
| Training dataset generation | βœ… | ❌ | βœ… | ❌ | ❌ | |
| Data segregation | ❌ | βœ… | Partial | ❌ | βœ… | |
| Cross validation | ❌ | βœ… | ❌ | ❌ | βœ… | |
| Hyperparameter tuning | ❌ | βœ… | βœ… | ❌ | βœ… | |
| Feature preprocessing | ❌ | βœ… | βœ… | βœ… | βœ… | |
| Feature selection | ❌ | βœ… | ❌ | ❌ | ❌ | |
| Model training | ❌ | βœ… | βœ… | ❌ | βœ… | |
| Feature serving | βœ… | ❌ | Partial | ❌ | ❌ | |
Details

Tecton

Scikit-learn

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: Yes
  • Model training: Yes
  • Feature serving: No

BigQuery ML

NVTabular

  • Feature creation: Partial
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: No
  • Cross validation: No
  • Hyperparameter tuning: No
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: No
  • Feature serving: No

Dask-ML

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: Yes
  • Feature serving: No

Ray

  • Feature creation:
  • Feature publishing:
  • Training dataset generation:
  • Data segregation:
  • Cross validation:
  • Hyperparameter tuning:
  • Feature preprocessing:
  • Feature selection:
  • Model training:
  • Feature serving:

Ibis-ML product hypotheses

Scope

  • A library needs to solve a sufficiently-large problem to be adopted widely. To this end, we want to provide value in multiple stages of the ML pipeline.
  • (Domain-driven) feature engineering is already handled sufficiently well by Ibis. As with other tools that are part of an ecosystem that already supports data transformation (e.g. BigQuery, Dask), we leave feature engineering to the existing tooling (i.e. Ibis).
  • Feature publishing, retrieval, and serving are orthogonal and can be left to feature platforms.
  • Model training can't be done by a database, unless it controls the underlying (cloud) infrastructure and can treat it as another distributed compute problem (e.g. in the case of BigQuery ML, Snowpark ML). Ibis doesn't control the underlying compute infrastructure.
  • In practice, hyperparameter tuning in industry/large companies is often delegated to purpose-fit tools like Optuna.
  • The remainder is (potentially) in scope.
    • Data segregation and cross validation are required in almost every tabular ML problem.
      • @jcrist "Re: test/train splits and CV things - I do think we can provide some utilities in the initial release for handling test/train splitting or CV work, but for now I suggest we mostly focus on single model pipelines and already partitioned data. We ingest an ibis.Table as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
    • Feature preprocessing is a good fit for Ibis to provide value. Technically, a user with a model pipeline (e.g. scikit-learn) may already include that in their pipeline, so they may or may not leverage this.
    • (Automated) feature selection is more case-by-case, and therefore a lower priority.

Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:

  1. Ibis-ML supports fitting data preprocessing steps (during the training process) and applying pre-trained Ibis-ML preprocessing steps (during inference).
    • Pros: Ibis-ML is used during both the training and inference process
    • Cons: Ibis-ML only supports data preprocessing, and even then a subset of steps that can be fit in database (e.g. not some very widely-used steps like PCA, that fit in the middle of the data-preprocessing pipeline)
  2. Ibis-ML supports constructing transformers from a wider range of pre-trained preprocessors and models (from other libraries, like scikit-learn), and applying them across backends (during inference).
    • Pros: Ibis-ML gives users the ability to apply a much wider range of steps in the ML process during inference time, including preprocessing steps that can be fit linearly (e.g. PCA) and even linear models (e.g. SGDRegressor, GLMClassifier). You can even showcase the end-to-end capabilities just using Ibis (from raw data to model outputs, all on your database, across streaming and batch, powered by Ibis)
    • Cons: Ibis-ML doesn't support training the preprocessors on multiple backends; the expectation is that you use a dedicated library/existing local tools for training
  3. A combination of 1 & 2, where Ibis-ML supports a wider range of pre-processing steps and models, but only a subset support a fit method (those that don't would need to be constructed via .from_sklearn() or something).
    • Pros: Support the wider range of operations, and also fitting everything on the database in simple cases.
    • Cons: Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified. @jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.

Proposal

I propose to go with option #3 of the alternatives considered. In practice, this will mean:

  • Keeping the existing structure of Ibis-ML
  • Adding the ability to construct transforms from_sklearn (and, in the future, potentially other libraries)
    • Some of the transforms may not be steps you can fit using Ibis-ML

This also means that the following will be out of scope (at least, for now):

  • Train-test split (may have value to add in the future)
  • CV (may have value to add in the future)
  • Hyperparameter tuning (less hypothesized value; probably better to integrate with existing frameworks like Optuna)

Deliverables

Guiding principles

  • At launch, we should showcase an end-to-end Ibis-ML workflow, from preprocessing to model inference.
    • The goal is to get people excited about Ibis-ML, and for them to try the example(s).
  • In future releases, we will increase the number of methods we support for each step. If we are successful in the first sub-goal (about generating excitement), people in the community will provide direction for and even contribute to this effort.
  • The library should be incrementally adoptable. The user should get benefit out of using just Ibis data segregation or feature preprocessing, and then they should be able to move on to adding another piece, and get further value.

Demo workflows

  1. Fit preprocessing on DuckDB (local experience, during experimentation)
  2. Experiment with different features
  3. Fit finalized preprocessing on larger dataset (e.g. from BigQuery)
  4. Perform inference on larger dataset

We are currently targeting the NVTabular demo on RecSys2020 Challenge as a demo workflow.

We need variants for all of:

  • scikit-learn (we already have)
  • XGBoost
  • PyTorch

With less priority:

  • LightGBM
  • Tensorflow
  • CatBoost

High-level deliverables

P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).

  • [P0] Support handoff to XGBoost (for training and inference). Update: to_dmatrix/to_dask_dmatrix are already implemented
  • [P0] Support handoff to PyTorch (for training and inference)
  • [P0] Build demo workflows
  • [P0] Make documentation ready for "initial" release
  • [P0] Increase coverage of Ibis-ML preprocessing steps w.r.t. tidymodels
  • [P1] Increase coverage of data processing transformer(s) from_sklearn
  • [P2] Increase coverage of model prediction transformer(s) from_sklearn (i.e. those with predict functions that don't require UDFs)
  • [P2] Support handoff to LightGBM (for training and inference)
  • [P2] Support handoff to Tensorflow (for training and inference)
  • [P2] Support handoff to CatBoost (for training and inference)
  • [P2] Support (demo?) inference in streaming contexts
  • [P3] Support constructing some data preprocessing transformer(s) from_sklearn (e.g. PCA, or some more frequently used step)
  • [P3] Support constructing some (linear) model prediction transformer(s) from_sklearn (e.g. SGDRegressor)

Questions for validation

  • Does being able to perform inference for certain model classes directly on the database, without UDF, provide real value? Are models with linear predict too narrow a category for people to care?

Changelog

2024-03-19

Based on discussion around the Ibis-ML use cases and vision with stakeholders, some of the priorities have shifted:

  • Ibis-ML should leverage the underlying engine during both training and inference, and speeding up the training iteration loop on big data is a key value proposition. Therefore, support for constructing transformers from_sklearn is no longer a priority; it moves from P0 to P3.
    • The associated demo of scaling inference only is also removed.
  • Increasing coverage of ML preprocessing steps w.r.t. tidymodels recipes and sklearn.preprocessing is a higher priority. We break down the relative priority of implementing steps in a separate issue.

feat: K-bins discretization

Definition

Map numeric columns into k bins.

Possible cutting strategies:

  • Uniform: All bins in each feature maintain identical widths (see the sketch after this list).

  • Quantile: Each feature's bins contain an equal number of data points.

  • K-means: Values within each bin share the same nearest center of a one-dimensional k-means cluster. [SQL backend may not support this]
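
A minimal sketch of the uniform strategy in plain Ibis, assuming the bin edges are computed from the training data at fit time (illustrative only, not the actual ibis-ml step):

import ibis

t = ibis.memtable({"x": [1.0, 2.0, 5.0, 9.0, 10.0]})
k = 4  # number of bins

# Fit: compute the bin width from the training data.
lo = float(t.x.min().execute())
hi = float(t.x.max().execute())
width = (hi - lo) / k

# Transform: uniform-width bins 0..k-1, clipping the maximum value into the last bin.
binned = t.mutate(x_bin=((t.x - lo) / width).floor().cast("int64").clip(upper=k - 1))
print(binned.execute())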


docs(website): document custom step implementation

Once the Step interface solidifies a bit more, we may want to document how users can implement their own Step subclass. We probably won't want to document it for external usage for a while (the interface is still in flux and not really user-facing), but docs on the internals may be useful before then for aiding contributions.
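
For contributors, a rough sketch of what a user-defined step might look like under the current fit_table/transform_table interface; the base class import (shown as ml.Step) and the exact hook signatures are assumptions, not documented API:

import ibis.expr.types as ir
import ibis_ml as ml

class LogTransform(ml.Step):  # assumes a public Step base class; the import path may differ
    """Hypothetical custom step: natural log of one numeric column."""

    def __init__(self, column: str):
        self.column = column

    def fit_table(self, table: ir.Table, metadata: ml.core.Metadata) -> None:
        pass  # nothing to learn from the data for a stateless transform

    def transform_table(self, table: ir.Table) -> ir.Table:
        return table.mutate(**{self.column: table[self.column].log()})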
