maif / eurybia Goto Github PK

⚓ Eurybia monitors model drift over time and secures model deployment with data validation

Home Page: https://maif.github.io/eurybia/

License: Apache License 2.0

python data-validation drift html-report machine-learning data-drift domain-classifier drift-detection model-monitoring production-machine-learning

eurybia's Introduction

View Demo · Documentation · Eurybia Quick Tour · Tutorial Article

🔍 Overview

Eurybia is a Python library that helps with:

Detecting data drift and model drift
Validating data before putting a model into production

Eurybia addresses the challenges of industrialising and maintaining machine learning models over time. It thereby contributes to better model monitoring, model auditing and, more generally, AI governance.

To do so, Eurybia generates an HTML report.

🕐 Quickstart

The 3 steps to display results:

Step 1: Declare SmartDrift Object

You need to pass at least two pandas DataFrames to instantiate the SmartDrift class: the current (or production) dataset and the baseline (or training) dataset.

from eurybia import SmartDrift

sd = SmartDrift(
    df_current=df_current,
    df_baseline=df_baseline,
    deployed_model=my_model,  # Optional: puts results in perspective with feature importance on the deployed model
    encoding=my_encoder,  # Optional: the encoder needed to use deployed_model, if any
    dataset_names={
        "df_current": "Current dataset Name",
        "df_baseline": "Baseline dataset Name",
    },  # Optional: Names for outputs
)

Step 2: Compile Model

There are different ways to compile the SmartDrift object

sd.compile(
    full_validation=True,  # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.
    date_compile_auc="01/01/2022",  # Optional: useful when computing the drift for a time that is not now
    datadrift_file="datadrift_auc.csv",  # Optional: name of the csv file that contains the performance history of data drift
)

Step 3: Generate report

The report's content is richer if you provide the deployed model and its encoder. Note that providing deployed_model and encoding only produces useful results if both datasets are usable by the model (i.e. all features are present, dtypes are correct, etc.).

sd.generate_report(
    output_file="output/my_report_name.html",
    title_story="my_report_title",
    title_description="my_report_subtitle",  # Optional: add a subtitle to describe report
    project_info_file="project_info.yml",  # Optional: add information on report
)

Report Example

🛠 Installation

Eurybia is intended to work with Python versions 3.8 to 3.10. Installation can be done with pip:

pip install eurybia

If you encounter compatibility issues, check the corresponding section in the Eurybia documentation.

🔥 Features

Display a clear, understandable and insightful report:

Data scientists can quickly explore drift thanks to dynamic reports that make it easy to navigate between drift detection and dataset features.

In a nutshell:

Monitor drift using a scheduler (like Airflow)
Evaluate the level of data drift
Facilitate collaboration between data analysts and data scientists, and easily share and discuss results with non-data users

More precisely:

Render data drift and model drift over time through:
- Feature importance: the features that best discriminate between the two datasets
- Scatter plot: feature importance relative to drift importance
- Dataset analysis: distribution comparison between variables from the baseline dataset and the newest one
- Predicted values analysis: distribution comparison between targets from the baseline dataset and the newest one
- Performance of the data drift classifier
- Feature contributions for the data drift classifier
- AUC evolution: comparison of the data drift classifier's AUC over different periods
- Model performance evolution: your model's performance over time

📢 Why we made Eurybia

Visualizing the life cycle of a machine learning model makes it easier to understand why Eurybia matters. During their life, ML models go through the following phases: model learning, model deployment, model monitoring.

Let X and Y denote the features and the target of a model, and P(X, Y) their joint probability distribution. P(X, Y) can be decomposed as P(X, Y) = P(Y|X)P(X), where P(Y|X) is the conditional probability of the output given the model features and P(X) is the probability density of the model features.

Data validation: validate that the data used for production prediction is similar to the training or test data before deployment. In formulas: P(Xtraining) is similar to P(XtoDeploy).
Data drift: evolution of the production data over time compared to the training or test data used before deployment. In formulas: compare P(Xtraining) to P(XProduction).
Model drift: evolution of model performance over time, due either to a change in the statistical properties of the target (concept drift) or to a change in the data (data drift). In formulas: a change in P(Y|XProduction) compared to P(Y|Xtraining) is concept drift, and a change in P(Y, XProduction) compared to P(Y, Xtraining) is model drift.
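The three definitions above can be read directly off the factorisation of the joint distribution. The block below only restates them in LaTeX, keeping the subscripts used in this section; the approximate-equality and inequality signs are shorthand for "similar to" and "has changed compared to".

```latex
\begin{align*}
\text{Factorisation:}\quad  & P(X, Y) = P(Y \mid X)\, P(X) \\
\text{Data validation:}\quad & P(X_{\text{training}}) \approx P(X_{\text{toDeploy}}) \\
\text{Data drift:}\quad      & P(X_{\text{production}}) \neq P(X_{\text{training}}) \\
\text{Concept drift:}\quad   & P(Y \mid X_{\text{production}}) \neq P(Y \mid X_{\text{training}}) \\
\text{Model drift:}\quad     & P(Y, X_{\text{production}}) \neq P(Y, X_{\text{training}})
\end{align*}
```

Read this way, model drift appears whenever at least one of the two factors, P(Y|X) or P(X), changes.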

Eurybia helps data analysts and data scientists collaborate through a report that lets them discuss drift monitoring and data validation before deploying a model into production.
Eurybia also contributes to data science auditing by displaying useful information about any model and data in a single report.

⚙️ How Eurybia detects data drift

Eurybia works mainly with a binary classification model (named datadrift classifier) that tries to predict whether a sample belongs to the training dataset (or baseline dataset) or to the production dataset (or current dataset).

As shown in the diagram below, there are two datasets, the baseline and the current one, which we want to compare to assess whether data drift has occurred. To the first we add a column named “target” filled only with 0; to the second we add the same column, this time filled only with 1.
Our goal is to build a binary classification model on the concatenation of those two datasets. Once trained, this model tells us whether there is any data drift: we look at its performance through the AUC metric. The greater the AUC, the greater the drift (AUC = 0.5 means no data drift; AUC close to 1 means data drift is occurring).
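To make the AUC reading concrete, here is a minimal, standard-library-only sketch. It is not Eurybia's implementation: instead of training a classifier, it uses the raw value of a single synthetic feature as the score, and it skips the train/test split. The helper names (auc, drift_auc) and the synthetic data are assumptions made for this illustration.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive (label 1) gets a higher
    score than a random negative (label 0); ties count for one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def drift_auc(baseline_values, current_values):
    """Label baseline rows 0 and current rows 1, concatenate, and score.
    The feature value itself stands in for a classifier's predicted score."""
    labels = [0] * len(baseline_values) + [1] * len(current_values)
    scores = list(baseline_values) + list(current_values)
    return auc(labels, scores)


rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(500)]
current_same = [rng.gauss(0.0, 1.0) for _ in range(500)]     # same distribution
current_shifted = [rng.gauss(3.0, 1.0) for _ in range(500)]  # mean shifted by 3

auc_no_drift = drift_auc(baseline, current_same)
auc_drift = drift_auc(baseline, current_shifted)
print(f"no drift: {auc_no_drift:.2f}, drift: {auc_drift:.2f}")
```

With identically distributed samples the two classes cannot be told apart and the AUC stays near 0.5; with a shifted feature the classes separate and the AUC approaches 1.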

The explainability of this datadrift classifier makes it possible to prioritise the features that matter most for drift and to focus on those that have the most impact on the model in production.

To monitor drift over time with Eurybia, you can use a scheduler to run the computations automatically and periodically.
One such scheduler is Apache Airflow. To use it, read the official documentation and blog posts such as this one: Getting started with Apache Airflow

🔬 Built With

This section lists the libraries used in Eurybia.

📖 Tutorials

This GitHub repository offers many tutorials to help you quickly start using Eurybia.

Overview

Overview to compile Eurybia and generate Report

Validate Data before model deployment

Measure and analyze Data drift

Measure and analyze Model drift

More details about report and plots

🔭 Roadmap

Concept Drift

Detect concept drift and provide analyses and explainability of this drift. An issue is open: Add Concept Drift

API mode

Adapt Eurybia for models consumed in API mode. An issue is open: Adapt Eurybia to API mode

If you want to contribute, you can contact us in the discussion tab

eurybia's Issues

Add Concept Drift

Description of Problem:
Data drift alone is not necessarily sufficient to explain how the performance of a deployed model evolves.
Concept drift would complete the explanation of this evolution and, in addition, help project the future behaviour of the model.

Overview of the Solution:
A first solution is to retrain the same type of model on df_baseline and on df_current, then compare the explainability of the two models. This comparison can be done with the Shapash library.

Adapt Eurybia to API mode

Description of Problem:
Eurybia is currently designed to detect drift on data built in batch mode.
If the deployed model consumes data and does the data preparation in API mode, we have not yet worked out how to use Eurybia on this data as it comes in.

Overview of the Solution:
One answer is to accumulate the data across API calls and then run Eurybia after a while.
One limitation is that the compilation may then come too late to ensure good data quality.

Allow datetime columns

Description of Problem:

You can't pass datetime columns to Eurybia:

...
sd = SmartDrift(
  df_current=df_current,   # with datetime column
  df_baseline=df_baseline  # with datetime column
)
sd.compile(full_validation=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File _catboost.pyx:1130, in _catboost._FloatOrNan()

TypeError: float() argument must be a string or a number, not 'Timestamp'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
File _catboost.pyx:2275, in _catboost.get_float_feature()

File _catboost.pyx:1132, in _catboost._FloatOrNan()

TypeError: Cannot convert obj 2022-01-01 00:00:00 to float

During handling of the above exception, another exception occurred:

CatBoostError                             Traceback (most recent call last)
Cell In [25], line 1
----> 1 sd.compile(full_validation=True)

File ~/github/eurybia/eurybia/core/smartdrift.py:305, in SmartDrift.compile(self, full_validation, ignore_cols, sampling, sample_size, datadrift_file, date_compile_auc, hyperparameter, attr_importance)
    302 x_test = test[varz]
    303 y_test = test[self._datadrift_target]
--> 305 xpl.compile(x=x_test)
    306 xpl.compute_features_import(force=True)
    308 self.xpl = xpl

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/shapash/explainer/smart_explainer.py:267, in SmartExplainer.compile(self, x, contributions, y_pred)
    264 self.x_init = inverse_transform(self.x_encoded, self.preprocessing)
    265 self.y_pred = check_ypred(self.x_init, y_pred)
--> 267 self._get_contributions_from_backend_or_user(x, contributions)
    268 self.check_contributions()
    270 self.columns_dict = {i: col for i, col in enumerate(self.x_init.columns)}

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/shapash/explainer/smart_explainer.py:288, in SmartExplainer._get_contributions_from_backend_or_user(self, x, contributions)
    285 def _get_contributions_from_backend_or_user(self, x, contributions):
    286     # Computing contributions using backend
    287     if contributions is None:
--> 288         self.explain_data = self.backend.run_explainer(x=x)
    289         self.contributions = self.backend.get_local_contributions(x=x, explain_data=self.explain_data)
    290     else:

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/shapash/backend/shap_backend.py:34, in ShapBackend.run_explainer(self, x)
     20 def run_explainer(self, x: pd.DataFrame) -> dict:
     21     """
     22     Computes and returns local contributions using Shap explainer
     23 
   (...)
     32         local contributions
     33     """
---> 34     contributions = self.explainer(x, **self.explainer_compute_args)
     35     explain_data = dict(contributions=contributions.values)
     36     return explain_data

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/shap/explainers/_tree.py:217, in Tree.__call__(self, X, y, interactions, check_additivity)
    214     feature_names = getattr(self, "data_feature_names", None)
    216 if not interactions:
--> 217     v = self.shap_values(X, y=y, from_call=True, check_additivity=check_additivity, approximate=self.approximate)
    218     if type(v) is list:
    219         v = np.stack(v, axis=-1) # put outputs at the end

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/shap/explainers/_tree.py:367, in Tree.shap_values(self, X, y, tree_limit, approximate, check_additivity, from_call)
    365     import catboost
    366     if type(X) != catboost.Pool:
--> 367         X = catboost.Pool(X, cat_features=self.model.cat_feature_indices)
    368     phi = self.model.original_model.get_feature_importance(data=X, fstr_type='ShapValues')
    370 # note we pull off the last column and keep it as our expected_value

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/catboost/core.py:790, in Pool.__init__(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count, log_cout, log_cerr)
    784         if isinstance(feature_names, PATH_TYPES):
    785             raise CatBoostError(
    786                 "feature_names must be None or have non-string type when the pool is created from "
    787                 "python objects."
    788             )
--> 790         self._init(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, weight,
    791                    group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
    792 super(Pool, self).__init__()

File ~/anaconda3/envs/eurybia/lib/python3.8/site-packages/catboost/core.py:1411, in Pool._init(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
   1409 if feature_tags is not None:
   1410     feature_tags = self._check_transform_tags(feature_tags, feature_names)
-> 1411 self._init_pool(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, weight,
   1412                 group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)

File _catboost.pyx:3941, in _catboost._PoolBase._init_pool()

File _catboost.pyx:4008, in _catboost._PoolBase._init_pool()

File _catboost.pyx:3914, in _catboost._PoolBase._init_objects_order_layout_pool()

File _catboost.pyx:3422, in _catboost._set_data()

File _catboost.pyx:3405, in _catboost._set_data_from_generic_matrix()

File _catboost.pyx:2277, in _catboost.get_float_feature()

CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=0]="2022-01-01 00:00:00": Cannot convert obj 2022-01-01 00:00:00 to float

However, in some use cases Eurybia would be useful for analysing the differences between two datasets that carry temporal information (such as seasonal information). If users only want an analysis of the difference between the two datasets, this should be possible (via AUC). But if users want to reuse a model to get feature importance, this should raise an error (and invite them to drop the datetime columns, as it can't be done).

Overview of the Solution:

If there are datetime columns in the datasets, automatically create year / month / day features from each of them and drop the original column.
If deployed_model is provided to SmartDrift, raise an error.
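The proposal above can be sketched with plain pandas. This is only an illustration of the idea, not code from Eurybia: the function name expand_datetime_columns, the generated column suffixes and the choice of ValueError are all assumptions.

```python
import pandas as pd


def expand_datetime_columns(df, deployed_model=None):
    """Replace each datetime column by year/month/day columns.

    Hypothetical helper: raises if a deployed model is given, because the
    model was not trained on the generated features.
    """
    datetime_cols = [
        col for col in df.columns
        if pd.api.types.is_datetime64_any_dtype(df[col])
    ]
    if datetime_cols and deployed_model is not None:
        raise ValueError(
            f"Datetime columns {datetime_cols} cannot be used with a deployed "
            "model; drop them before building the drift object."
        )
    out = df.copy()
    for col in datetime_cols:
        out[f"{col}_year"] = out[col].dt.year
        out[f"{col}_month"] = out[col].dt.month
        out[f"{col}_day"] = out[col].dt.day
        out = out.drop(columns=[col])
    return out


df = pd.DataFrame({
    "date": pd.date_range(start="2022-01-01", periods=3),
    "col1": [0.1, 0.2, 0.3],
})
expanded = expand_datetime_columns(df)
print(expanded.columns.tolist())
```

Applying the same function to both the baseline and the current dataset keeps their columns aligned, which the drift classifier needs.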

Examples:

import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
from eurybia import SmartDrift

# Create random dataset
date_list = pd.date_range(start='01/01/2022', end='01/30/2022')
X1 = np.random.rand(len(date_list))
X2 = np.random.rand(len(date_list))

df_current = pd.DataFrame(date_list, columns=['date'])
df_current['col1'] = X1 
df_baseline = pd.DataFrame(date_list, columns=['date'])
df_baseline['col1'] = X2

sd = SmartDrift(df_current=df_current,
                df_baseline=df_baseline)
# Datetime columns will be transformed in df_current
# Datetime columns will be transformed in df_baseline

sd.compile(full_validation=True)

# Block the user when a deployed model is provided
# Arbitrary model, only used to trigger the check
regressor = LGBMRegressor(n_estimators=2).fit(df_baseline[['col1']],
                                              df_baseline['col1'])

sd = SmartDrift(df_current=df_current,
                df_baseline=df_baseline,
                deployed_model=regressor)
sd.compile(full_validation=True)
# Expected: an error is raised

Blockers:

Definition of Done:

Some tests

The demo link in readthedocs is no longer valid

The demo link in readthedocs is no longer valid:
https://eurybia.readthedocs.io/en/latest/report.html

Univariate analysis column selector back to 1st column

In the univariate analysis of the HTML report, when you select a column to analyse, the selection box (dashed line) jumps back to the first column. This can cause misunderstanding.

Tweak feature importance sorting

Description of Problem:
I was surprised to find some features with missing values at the bottom of the feature importance list.
It was important for my use case to spot variables where some modalities never appear.

Overview of the Solution:
I would appreciate the possibility to set some sort of "feature importance policy".

Examples:
Prioritise by:

missing modalities
mean/max distribution gaps
...

Blockers:
None

Definition of Done:
Feature is available.

CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir: catboost_info

Problem: when I run SD.compile() on a Databricks cluster, I get this error: CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir: catboost_info.
Following this related issue (catboost/catboost#1891), I added allow_writing_files=False to the definition of datadrift_classifier in the SmartDrift class and the problem disappeared.

Would it be possible to add an optional parameter to set allow_writing_files=False in the definition of datadrift_classifier in the SmartDrift class?

Bug in display univariate fig

The feature is considered numeric and therefore the plot is displayed incorrectly

Bug when using the dataset_names parameter together with a model

There is a bug when the dataset_names parameter is used together with a model.
The code has not been adapted to the addition of the "dataset_names" parameter. See the error message below:
On SD.compile()

Fix this bug and consider adding a unit test.

Bad section in column type consistency analysis and improved readability

I have columns that are excluded from the analysis because their types differ (float vs. int).
In the report, these columns wrongly end up in the "Ignored columns in the report (manually excluded)" section.

What's more, the issues surrounding type differences are not clearly explained.

Another point: when the difference is between purely numeric types (such as float and int), Eurybia may not need to exclude them.

Python version: 3.10

Eurybia version: 1.1.1

Operating System: Linux

support for python 3.10

Description of Problem:

Python 3.10 is increasingly used.

Overview of the Solution:

Support Python 3.10: check dependencies, run tests, adapt the GitHub workflow to 3.10, etc.

Similar issue for Shapash: MAIF/shapash#293

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

We have this error:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error is linked to the latest version of the shapash library, 2.4.0.

To use Eurybia, you have to install shapash 2.3.7:

pip install shapash==2.3.7

We also maintain the shapash library; we will release a fix in shapash in the next few days.

Python version: 3.10

Eurybia version: 1.1.1

Operating System: Linux

Colors consistency in datadrift analysis

The baseline and current datasets colors are not always consistent from variable to variable (e.g. current dataset is blue for var 1, then brown for var 2)

Getting ModuleNotFoundError while importing eurybia

Hi,
I'm using Databricks to find data drift for a model, but when I install eurybia and try to import it, I keep getting
ModuleNotFoundError: No module named 'tkinter'.
But I think tkinter is pre-installed with Python, right?

Any help on how I can solve this issue?

Upgrade sklearn

Description of Problem:

Upgrade sklearn compatibility, as in MAIF/shapash#375

Support of pandas>=2

Description of Problem:

Due to datapane support being dropped, Eurybia is stuck with pandas<2.

Overview of the Solution:

Either get rid of datapane or integrate datapane as part of Eurybia and make it work with pandas >= 2.

Examples:

Blockers:

Definition of Done:
Eurybia works properly alongside pandas>=2 in the same virtualenv.

Detect differences in float precision

Description of Problem:
I have two datasets: in one, the floating point values have 2 decimal places of precision; in the other, they have 10.
Although these datasets are roughly the same, Eurybia outputs a 0.98 AUC.

Overview of the Solution:

Detect this kind of technical data mismatch and add a warning to the "Consistency Analysis" view.
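One way such a check could look, as a sketch only: compare the largest number of decimal places found in each dataset's column and warn when they differ. The helpers max_decimal_places and precision_mismatch are hypothetical names, not part of Eurybia, and the sample values are made up.

```python
import math
from decimal import Decimal


def max_decimal_places(values):
    """Return the largest number of digits after the decimal point in values."""
    places = 0
    for value in values:
        if not math.isfinite(value):
            continue  # skip NaN and infinities
        # repr() gives the shortest string that round-trips the float;
        # normalize() strips trailing zeros so 7.0 counts as 0 places.
        exponent = Decimal(repr(float(value))).normalize().as_tuple().exponent
        places = max(places, -exponent)
    return places


def precision_mismatch(baseline_values, current_values):
    """True when the two samples do not share the same decimal precision."""
    return max_decimal_places(baseline_values) != max_decimal_places(current_values)


baseline_values = [0.12, 3.5, 7.0]
current_values = [0.1234567891, 3.5123456789, 7.0000000001]

if precision_mismatch(baseline_values, current_values):
    print("Warning: columns differ in float precision; a high AUC may be technical.")
```

A stricter variant could compare the distribution of decimal places rather than only the maximum, at the cost of more computation.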

Examples:

Blockers:

Definition of Done:

Feature Request: Smart Drift Object Compilation with Filtered Datasets for Enhanced Visualization Clarity

Description of Problem:
In the current implementation of the Smart Drift reports, all data points are visualized without any filtering options. This leads to cluttered and sometimes overwhelming visualizations, making it difficult for users to quickly identify and analyze the most relevant data trends and outliers.

Overview of the Solution:
Introduce a feature that allows users to compile Smart Drift objects with options to filter datasets based on user-defined criteria. This would enable the generation of reports that focus on the most pertinent data, providing cleaner and more insightful visualizations.

Examples:

There should be an option to filter data by categories or values that the user considers critical for their analysis.
If we want to filter a country dataset we can pick a column to filter by e.g. "name" and pick our desired countries such as "France" and "Germany"

Blockers:
There may be technical challenges in implementing dynamic filtering that interacts seamlessly with the existing Smart Drift compilation process. We want to make sure that this addition does not make the compilation unnecessarily longer and more cluttered in our code.

Definition of Done:

Apply filters to the dataset before compiling the Smart Drift object.
Generate visualizations in the Smart Drift report that only include the filtered data.
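Until such an option exists, the same effect can be approximated by filtering both DataFrames before they are handed to SmartDrift. The sketch below shows only the filtering step with pandas; filter_rows is a hypothetical helper and the country data is invented for the example.

```python
import pandas as pd


def filter_rows(df, column, allowed_values):
    """Keep only the rows whose value in column is one of allowed_values."""
    return df[df[column].isin(allowed_values)].reset_index(drop=True)


df_baseline = pd.DataFrame({
    "name": ["France", "Germany", "Spain", "France"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
df_current = pd.DataFrame({
    "name": ["Germany", "Italy", "France"],
    "value": [1.5, 2.5, 3.5],
})

countries = ["France", "Germany"]
baseline_filtered = filter_rows(df_baseline, "name", countries)
current_filtered = filter_rows(df_current, "name", countries)

# The filtered frames can then be passed as df_baseline and df_current
# when instantiating SmartDrift, so the report only covers these rows.
print(len(baseline_filtered), len(current_filtered))
```

The same filter must be applied to both datasets; otherwise the filter itself would show up as drift.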

Dependencies are no longer correct to generate report

When generating the report, I get this message:
"ImportError: cannot import name 'appengine' from 'urllib3.contrib'"

Python version: 3.9

Eurybia version: 1.1.0

Operating System: Linux

AUC below 0.5

When using Eurybia on small dataframes, the computed AUC can be below 0.5, even when the same dataframe is used as both baseline and current. This is caused by the train/test split on the concatenated data.
A solution could be to apply the train/test split with the same seed to both the baseline and current dataframes before concatenating them.
Another quick solution could be to duplicate the data so that there is enough of it for a balanced train/test split.
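The first suggestion can be sketched as follows, using only the standard library. This illustrates the splitting strategy, not Eurybia's code, and it does not by itself show the resulting AUC; split_rows, the 25% test fraction and the seed are assumptions made for the example.

```python
import random


def split_rows(rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and split into train and test."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = [[0, 1], [0, 1], [0, 1], [0, 2], [0, 2], [0, 2], [0, 2]]

# Split each dataset separately with the same seed, then concatenate.
baseline_train, baseline_test = split_rows(rows, test_fraction=0.25, seed=42)
current_train, current_test = split_rows(rows, test_fraction=0.25, seed=42)

# Label baseline rows 0 and current rows 1, as the drift classifier expects.
train = [(row, 0) for row in baseline_train] + [(row, 1) for row in current_train]
test = [(row, 0) for row in baseline_test] + [(row, 1) for row in current_test]

print(len(train), len(test))
```

Because both splits use the same seed and the same row count, the train and test sets each contain as many baseline rows as current rows.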

To reproduce:

from eurybia import SmartDrift
import pandas as pd

df = pd.DataFrame([[0,1],[0,1],[0,1],[0,2],[0,2],[0,2],[0,2]], columns=["A","B"])

sd = SmartDrift(
    df_current=df,
    df_baseline=df,
)

sd.compile()

sd.generate_report(
    output_file="auc_test.html",
    title_story="AUC Test",
)
