Deepchecks - Test Suites for ML Models and Data

Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. This includes checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.

Key Concepts

Check

Each check enables you to inspect a specific aspect of your data and models. Checks are the basic building blocks of the deepchecks package and cover all kinds of common issues, for example PerformanceOverfit, DataSampleLeakage, SingleFeatureContribution, DataDuplicates, and many more. Each check can have two types of results:

  1. A visual result meant for display (e.g. a figure or a table).
  2. A return value that can be used for validating the expected check results (validations are typically done by adding a "condition" to the check, as explained below).

Condition

A condition is a function that can be added to a Check to validate the Check's return value. It returns a pass (✓), fail, or warning (!) result. An example of adding a condition:

from deepchecks.checks import BoostingOverfit
BoostingOverfit().add_condition_test_score_percent_decline_not_greater_than(threshold=0.05)

which will fail if there is a difference of more than 5% between the best score achieved on the test set during the boosting iterations and the score achieved in the last iteration (the model's "original" score on the test set).
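The arithmetic behind this condition can be sketched in plain Python. This is only an illustration, not the deepchecks implementation: the function names, the made-up score list, and the choice to measure the decline relative to the best score are all assumptions.

```python
def score_percent_decline(test_scores):
    """Relative decline from the best test score to the final one.

    `test_scores` is a hypothetical list with one test-set score per
    boosting iteration; higher is assumed to be better and the best
    score is assumed to be positive.
    """
    best = max(test_scores)
    last = test_scores[-1]
    return (best - last) / best


def condition_passes(test_scores, threshold=0.05):
    # Pass as long as the decline does not exceed the threshold.
    return score_percent_decline(test_scores) <= threshold


# Made-up scores: the model peaks at 0.90 and ends at 0.81.
scores = [0.70, 0.85, 0.90, 0.86, 0.81]
decline = score_percent_decline(scores)
print(f"decline: {decline:.1%}, passes: {condition_passes(scores)}")
```

With these numbers the decline is about 10% of the best score, so a 5% threshold would be exceeded and the condition would fail.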

Suite

A suite is an ordered collection of checks, each of which can have conditions added to it. The Suite displays a concluding report for all of the Checks that ran. Here you can find the predefined existing suites and a code example demonstrating how to build your own custom suite. The existing suites include default conditions for most of their checks. You can edit the preconfigured suites or build a suite of your own from a collection of checks and optional conditions.
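How checks, conditions, and suites relate to each other can be sketched with a few toy classes. Everything below is hypothetical and written only for illustration: `ToyCheck`, `ToySuite`, and `duplicate_ratio` are not part of deepchecks, and the real classes have a richer interface (including visual output).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ToyCheck:
    """Hypothetical stand-in for a check: computes one value from the data."""
    name: str
    compute: Callable[[Any], Any]
    # Each condition is a (description, predicate on the return value) pair.
    conditions: List[Tuple[str, Callable[[Any], bool]]] = field(default_factory=list)

    def add_condition(self, description, predicate):
        self.conditions.append((description, predicate))
        return self  # returning self allows chaining

    def run(self, data):
        value = self.compute(data)
        results = [(desc, pred(value)) for desc, pred in self.conditions]
        return value, results


@dataclass
class ToySuite:
    """Hypothetical stand-in for a suite: an ordered list of checks."""
    checks: List[ToyCheck]

    def run(self, data):
        report = []
        for check in self.checks:  # checks run in the order given
            value, results = check.run(data)
            report.append((check.name, value, results))
        return report


def duplicate_ratio(rows):
    # Share of rows that repeat an earlier row.
    return 1 - len(set(rows)) / len(rows)


duplicates = ToyCheck("duplicates", duplicate_ratio).add_condition(
    "ratio of duplicates is at most 10%", lambda ratio: ratio <= 0.10
)
row_count = ToyCheck("row count", len)  # a check without conditions

suite = ToySuite([duplicates, row_count])
for name, value, results in suite.run(["a", "b", "b", "c"]):
    print(name, value, results)
```

The point of the sketch is the separation of concerns: a check only computes a value, a condition only judges that value, and the suite only orders the checks and collects the results.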

Installation

Using pip

pip install deepchecks #--user

From source

First clone the repository and then install the package from inside the repository's directory:

git clone https://github.com/deepchecks/deepchecks.git
cd deepchecks
# for installing stable tag version and not the latest commit to main
git checkout tags/<version>

and then either:

pip install .

or

python setup.py install
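After either installation route, one way to confirm that the package is visible to the current interpreter is to query its metadata. This is a small sketch that assumes Python 3.8 or later; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if deepchecks is installed, otherwise None.
print(installed_version("deepchecks"))
```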

Are You Ready to Start Checking?

For the full value from Deepchecks' checking suites, we recommend working with:

  • A model compatible with the scikit-learn API that you wish to validate (e.g. RandomForest, XGBoost)

  • The model's training data with labels

  • Test data (on which the model wasn’t trained) with labels

However, many of the checks and some of the suites need only a subset of the above to run.

Usage Examples

Running a Check

To run a specific check on your pandas DataFrame, all you need to do is:

from deepchecks.checks import RareFormatDetection
import pandas as pd

df_to_check = pd.read_csv('data_to_validate.csv')
# Initialize and run desired check
RareFormatDetection().run(df_to_check)

Which might produce output of the type:

Rare Format Detection

Check whether columns have common formats (e.g. 'XX-XX-XXXX' for dates) and detects values that don't match.

Nothing found

If all was fine, or alternatively something like:

Rare Format Detection

Check whether columns have common formats (e.g. 'XX-XX-XXXX' for dates) and detects values that don't match.

Column date:

  digits and letters format (case sensitive)
  ratio of rare samples: 1.50% (3)
  common formats: ['2020-00-00']
  examples for values in common formats: ['2021-11-07']
  values in rare formats: ['2021-Nov-04', '2021-Nov-05', '2021-Nov-06']

If mismatches were detected.
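The general idea of detecting rare formats can be sketched in a few lines of plain Python. This is not how deepchecks implements RareFormatDetection; the helper names, the character mapping, and the 5% rarity threshold are assumptions chosen for illustration, and the sample column is made up to mirror the output above.

```python
from collections import Counter


def to_format(value):
    """Map a string to a coarse format: digits -> '0', letters -> 'a' or 'A'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("0")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep separators such as '-' unchanged
    return "".join(out)


def find_rare_formats(values, rarity_threshold=0.05):
    """Return the most common format and the values whose format is rare."""
    formats = [to_format(v) for v in values]
    counts = Counter(formats)
    common_format, _ = counts.most_common(1)[0]
    rare_values = [
        v for v, f in zip(values, formats)
        if counts[f] / len(values) < rarity_threshold
    ]
    return common_format, rare_values


# Made-up column: 197 ISO-style dates and 3 dates with a month abbreviation.
column = ["2021-11-07"] * 197 + ["2021-Nov-04", "2021-Nov-05", "2021-Nov-06"]
common, rare = find_rare_formats(column)
print(common)  # 0000-00-00
print(rare)    # ['2021-Nov-04', '2021-Nov-05', '2021-Nov-06']
print(f"{len(rare) / len(column):.2%}")  # 1.50%
```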

Running a Suite

Let's take the "iris" dataset as an example:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_df = load_iris(return_X_y=False, as_frame=True)['frame']
label_col = 'target'
df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)

To run an existing suite, all you need to do is import the suite and run it:

from deepchecks.suites import integrity_check_suite
integrity_suite = integrity_check_suite()
integrity_suite.run(train_dataset=df_train, test_dataset=df_test, check_datasets_policy='both')

This prints a summary of the check conditions, followed by the visual outputs of all the checks in that suite.

Example Notebooks

For usage examples, check out:

Communication

  • Join our Slack Community to connect with the maintainers and follow users and interesting discussions
  • Post a GitHub Issue to suggest improvements, report a problem, or share feedback.

License
