optbinning's People

Contributors

alexliap, bmreiniger, floriankappert, gabrielsgoncalves, guillermo-navas-palencia, jnsofini, kasdeblieck, lassilehtonen, nehaljwani, peterpanmj, philip-khor, stradivari96, tranvohuy

optbinning's Issues

MDLP algorithm

Python implementation of the Minimum Description Length Principle (MDLP) discretization algorithm.

Error while importing optbinning installed from scorecard branch

Hi Guillermo,

I'm getting this error when I try to import optbinning installed from the scorecard branch:

import optbinning
ImportError                               Traceback (most recent call last)
<ipython-input-2-3b16b6a50809> in <module>
     47 
     48 from decouple import config
---> 49 from optbinning import OptimalBinning
     50 
     51 # Define environment variables

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/__init__.py in <module>
----> 1 from .binning.binning import OptimalBinning
      2 from .binning.continuous_binning import ContinuousOptimalBinning
      3 from .binning.multiclass_binning import MulticlassOptimalBinning
      4 from .binning.binning_process import BinningProcess
      5 from .binning.mdlp import MDLP

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/binning/binning.py in <module>
     15 
     16 from ..logging import Logger
---> 17 from ..preprocessing import preprocessing_user_splits_categorical
     18 from ..preprocessing import split_data
     19 from .auto_monotonic import auto_monotonic

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/preprocessing.py in <module>
     15 from sklearn.utils import check_consistent_length
     16 from sklearn.utils import compute_class_weight
---> 17 from sklearn.utils.validation import _check_sample_weight
     18 
     19 from .outlier import ModifiedZScoreDetector

ImportError: cannot import name '_check_sample_weight' from 'sklearn.utils.validation' (/home/gabriel/miniconda3/envs/jeitto/lib/python3.7/site-packages/sklearn/utils/validation.py)

When I install optbinning from the master branch, it runs fine.
Do you know what might be causing this?

Best regards,

Gabriel
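
For what it's worth, the missing helper _check_sample_weight lives in sklearn.utils.validation and, to the best of my knowledge, was added around scikit-learn 0.22, so the failure suggests the environment has an older scikit-learn. A minimal diagnostic sketch (not optbinning code):

import sklearn

print(sklearn.__version__)

try:
    # This is the exact import that fails inside optbinning/preprocessing.py
    from sklearn.utils.validation import _check_sample_weight  # noqa: F401
    print("scikit-learn is recent enough for the scorecard branch")
except ImportError:
    print("scikit-learn is too old; try: pip install -U scikit-learn")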

Monotonic trend on BinningProcess

  • How do I force the bins to have a monotonic trend when using BinningProcess(variable_names, categorical_variables=categorical_variables, selection_criteria=selection_criteria)? (See the sketch after this list.)
  • Is there a way to round off the scores to the nearest decimal place?
  • From the scorecard example, how do I fit only the training data rather than the entire dataset, as shown in the scoring tutorial's scorecard.fit(df)?
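
On the first item, a sketch of one way this could be done, assuming a hypothetical variable name "age" (BinningProcess accepts per-variable fit options through binning_fit_params, which are forwarded to the underlying OptimalBinning):

from optbinning import BinningProcess

# "age" is a placeholder variable name; monotonic_trend is forwarded to the
# OptimalBinning instance fitted for that variable.
binning_fit_params = {"age": {"monotonic_trend": "ascending"}}

binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables,
                                 selection_criteria=selection_criteria,
                                 binning_fit_params=binning_fit_params)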

Plots for Score Card

Hi Guillermo,

I just wanted to share a few functions that I use to visualize some important metrics when working with scorecards. They basically compute and plot AUROC, Gini and KS.

I think it would be a great addition to the scorecard module.

Feel free to use (and modify if needed) the code if you think it would help users.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score


def plot_auroc(
    df,
    target_col,
    proba_col,
    title='ROC curve',
    xlabel='False positive rate',
    ylabel='True positive rate',
    roc_color='blue',
    reference_color='black',
    savefig=False,
    figname='auroc.png',
):
    """
    Plot AUROC for dataframe with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'ROC curve'): Title for the plot.
        xlabel (str, default 'False positive rate'): Label for the x-axis.
        ylabel (str, default 'True positive rate'): Label for the y-axis.
        roc_color (str, default 'blue'): Color for the ROC curve.
        reference_color (str, default 'black'): Color for the diagonal
            reference line.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'auroc.png'): Name for the figure file.

    Return:
        auroc (float): Calculated AUROC score.
    """
    # Define the arrays for plotting
    fpr, tpr, threshold = roc_curve(df[target_col], df[proba_col])

    # Define the plot settings
    plt.plot(fpr, tpr, color=roc_color)
    plt.plot(fpr, fpr, linestyle="--", color=reference_color)
    plt.title(title, fontdict={'fontsize': 16})
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})

    if savefig:
        plt.savefig(figname, dpi=300)

    # Set AUROC score
    auroc = roc_auc_score(df[target_col], df[proba_col])
    return auroc


def plot_gini(
    df,
    target_col,
    proba_col,
    title='Gini',
    xlabel='Cumulative % Population',
    ylabel='Cumulative % Bad',
    gini_color='blue',
    reference_color='black',
    savefig=False,
    figname='gini.png',
):
    """
    Plot Gini for a dataframe with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'Gini'): Title for the plot.
        xlabel (str, default 'Cumulative % Population'): Label for the x-axis.
        ylabel (str, default 'Cumulative % Bad'): Label for the y-axis.
        gini_color (str, default 'blue'): Color for the cumulative curve.
        reference_color (str, default 'black'): Color for the diagonal
            reference line.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'gini.png'): Name for the figure file.

    Return:
        gini (float): Calculated Gini index.

    """
    # Sort dataframe by probabilities and reset index
    df = df.sort_values(proba_col)
    df = df.reset_index()

    # Calculate cumulative columns
    df["cumulative_n_population"] = df.index + 1
    df["cumulative_n_good"] = df[target_col].cumsum()
    df["cumulative_n_bad"] = (
        df["cumulative_n_population"] - df[target_col].cumsum()
    )
    df["cumulative_perc_population"] = df["cumulative_n_population"] / (
        df.shape[0]
    )
    df["cumulative_perc_good"] = df["cumulative_n_good"] / df[target_col].sum()
    df["cumulative_perc_bad"] = df["cumulative_n_bad"] / (
        df.shape[0] - df[target_col].sum()
    )

    # Plot Gini
    plt.plot(
        df["cumulative_perc_population"],
        df["cumulative_perc_bad"],
        color=gini_color,
    )
    plt.plot(
        df["cumulative_perc_population"],
        df["cumulative_perc_population"],
        linestyle="--",
        color=reference_color,
    )
    plt.title(title, fontdict={'fontsize': 16})
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})

    # Save figure if requested
    if savefig:
        plt.savefig(figname, dpi=300)

    # Calculate Gini Index
    auroc = roc_auc_score(df[target_col], df[proba_col])
    gini = auroc * 2 - 1
    return gini


def plot_ks(
    df,
    target_col,
    proba_col,
    title="Kolmogorov-Smirnov",
    xlabel="Estimated Probability for being Good",
    ylabel="Cumulative %",
    good_color="blue",
    bad_color="red",
    savefig=False,
    figname="ks.png",
):
    """
    Create Kolmogorov-Smirnov plot for df with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'Kolmogorov-Smirnov'): Title for the plot.
        xlabel (str, default 'Estimated Probability for being Good'): Label
            for the x-axis.
        ylabel (str, default 'Cumulative %'): Label for the y-axis.
        good_color (str, default 'blue'): Color for the goods lineplot.
        bad_color (str, default 'red'): Color for the bads lineplot.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'ks.png'): Name for the figure file.

    Return:
        ks_score (float): Calculated Kolmogorov-Smirnov statistic.

    """
    # Sort dataframe by probabilities and reset index
    df = df.sort_values(proba_col)
    df = df.reset_index()

    # Calculate cumulative columns
    df["cumulative_n_population"] = df.index + 1
    df["cumulative_n_good"] = df[target_col].cumsum()
    df["cumulative_n_bad"] = (
        df["cumulative_n_population"] - df[target_col].cumsum()
    )
    df["cumulative_perc_population"] = df["cumulative_n_population"] / (
        df.shape[0]
    )
    df["cumulative_perc_good"] = df["cumulative_n_good"] / df[target_col].sum()
    df["cumulative_perc_bad"] = df["cumulative_n_bad"] / (
        df.shape[0] - df[target_col].sum()
    )

    # Plot KS
    plt.plot(
        df[proba_col], df["cumulative_perc_bad"], color=bad_color,
    )
    plt.plot(
        df[proba_col], df["cumulative_perc_good"], color=good_color,
    )
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})
    plt.title(title, fontdict={'fontsize': 16})

    ks_score = max(df["cumulative_perc_bad"] - df["cumulative_perc_good"])
    ks_max_idx = np.argmax(
        df["cumulative_perc_bad"] - df["cumulative_perc_good"]
    )
    # Define the KS line

    plt.vlines(
        df[proba_col].iloc[ks_max_idx],
        ymin=df["cumulative_perc_good"].iloc[ks_max_idx],
        ymax=df["cumulative_perc_bad"].iloc[ks_max_idx],
        color='black',
        linestyles='--',
    )

    plt.text(
        df[proba_col].iloc[ks_max_idx] + 0.1,
        (
            (
                df["cumulative_perc_good"].iloc[ks_max_idx]
                + df["cumulative_perc_bad"].iloc[ks_max_idx]
            )
            * 0.5
        ),
        f'KS = {round(ks_score * 100, 2)}%',
        fontsize=12,
        rotation_mode='anchor',
    )

    # Save figure if requested
    if savefig:
        plt.savefig(figname, dpi=300)

    return ks_score
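
A hypothetical usage example for the three functions above, assuming a scored dataframe df_scored with a binary column "target" and a probability column "proba":

# Each function draws on the current matplotlib figure, so open a new
# figure between calls to avoid overplotting the curves.
plt.figure()
auroc = plot_auroc(df_scored, target_col="target", proba_col="proba")

plt.figure()
gini = plot_gini(df_scored, target_col="target", proba_col="proba")

plt.figure()
ks = plot_ks(df_scored, target_col="target", proba_col="proba")

print(f"AUROC={auroc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")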

How to bin new samples with splits generated from training?

Hi,

First of all, congrats for the library!

I'm trying to use the bins generated and use it to bin data in production.
Here is my example:

from optbinning import OptimalBinning

x = df["var1"].to_numpy()
y = df["target"].to_numpy()

optb = OptimalBinning(name='var1')
optb.fit(x, y)
optb.splits

Then I get an array with the splits (for a numerical variable).
That's great, but if I want to categorize 'var1' for one row of new data (not part of training), what's the best way to do it?

One way is to use pandas.cut(), but I don't know if there is a better way using optbinning.

pd.cut(df_new["var1"], bins=optb.splits)

Also, how can I deal with categorical variables, where some bins are more than one category merged into an array (e.g., ['cat1', 'cat3', 'catn'])?

Best regards,

Gabriel
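
For reference, a fitted OptimalBinning can apply its bins to new data directly through transform; the metric argument selects the output representation (a sketch based on the documented API of recent optbinning versions):

x_new = df_new["var1"].to_numpy()

# WoE value of the bin each new sample falls into
woe_new = optb.transform(x_new, metric="woe")

# Human-readable bin labels; for a categorical variable these reflect the
# merged categories, so no manual pd.cut handling is needed
bins_new = optb.transform(x_new, metric="bins")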

Categorical binning does not reach the optimal IV

Hi @guillermo-navas-palencia,

When I use OptimalBinning with dtype='categorical', min_bin_size=0.05, max_bin_size=None on a column with a distribution like this:

Value  Count  Event rate
G-1    16631  0.011665
G10       39  0.025641
G2         2  0
G3        75  0.0133333
G4         6  0.166667
G7        21  0
G8      3536  0.0246041
G9        12  0

I get the result look like:

Bin Count Count(%) Non-event Event Eventrate WoE IV JS
0 ['G2' 'G7' 'G9' 'G-1' 'G3'] 16741 0.823787 16546 195 0.011648 0.184488589528379 0.0256645 0.00320352
1 ['G8' 'G10' 'G4'] 3581 0.176213 3492 89 0.0248534 -0.586817964430649 0.0816331 0.0100602
2 Special 0 0 0 0 0 0.0 0 0
3 Missing 0 0 0 0 0 0.0 0 0
Totals 20322 1 20038 284 0.013975 0.107298 0.0132637

But when I manually group all values except G8 and G-1 into a single value G1 and use the same binning options, I get a result like this:

Bin Count Count(%) Non-event Event Eventrate WoE IV JS
0 ['G-1'] 16631 0.712798 16437 194 0.011665 0.38550821182938577 0.0883861 0.0109804
1 ['G8'] 3536 0.151552 3449 87 0.0246041 -0.37399230503155456 0.0255081 0.00317006
2 ['G1'] 3165 0.135651 3048 117 0.0369668 -0.7938568173050311 0.127864 0.0155761
3 Special 0 0 0 0 0 0.0 0 0
4 Missing 0 0 0 0 0 0.0 0 0
Totals 23332 1 22934 398 0.0170581 0.241758 0.0297265

So my question is: if the objective is to maximize IV, why doesn't the algorithm group these values the way I did manually, given that the second IV is much better than the first?

Setting value for max_bin_size will break binning process

When I run any type of binning, especially continuous binning, and set the parameter max_bin_size expecting each bin's fraction to be smaller than this value, I usually get a result where all observations belong to just one bin. I think this may be a bug.

Release 0.7.0

  • OptimalBinningSketch introduction.
  • OptimalBinningSketch data streams tutorial.
  • OptimalBinningSketch pyspark tutorial.
  • Update to version 0.7.0
  • Update tutorials
  • Update README
  • Update Release Notes

AttributeError: 'DataFrame' object has no attribute 'dtype'

I am trying to run the scorecard model

scorecard_full = Scorecard(
    target='FLAG',
    binning_process=binning_process,
    estimator=lr,
    scaling_method=scaling_method,
    scaling_method_params=scaling_method_data,
    intercept_based=False,
    reverse_scorecard=False,
    rounding=True)
scorecard_full.fit(df_train)

but I get the error below (traceback excerpt):

--> return self._fit(df, metric_special, metric_missing, show_digits, check_input)
AttributeError: 'DataFrame' object has no attribute 'dtype'

My dataframe has some missing values which I have not imputed.

serializing the scorecard

Thank you for the great work.
Kindly advise on how to serialize the scorecard object.
I tried pickle, joblib and dill, but they all give this error: "TypeError: cannot pickle 'SwigPyObject' object".
Thanks
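
A generic diagnostic sketch (not optbinning-specific) that can help locate which attribute of the fitted object holds the SWIG wrapper that pickle rejects:

import pickle

# scorecard is the fitted Scorecard instance from your own session
for name, value in vars(scorecard).items():
    try:
        pickle.dumps(value)
    except Exception as exc:
        print(f"attribute {name!r} is not picklable: {exc}")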

Information Value whole dataset - Scorecard

How do I compute the information value for the entire dataset? (See the sketch below.)

Second, is it advisable to do the binning on the train set and later apply those bins to the test set, or to bin the entire dataset?
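
On the first question, a sketch assuming a BinningProcess fitted on the full dataset (X and y are placeholders here); to the best of my knowledge its summary() method reports per-variable statistics, including IV:

binning_process.fit(X, y)

# One row per variable; the "iv" column holds the information value
summary = binning_process.summary()
print(summary[["name", "iv"]].sort_values("iv", ascending=False))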

Support prebinning algorithm MDLP

For large datasets, the time spent in pre-binning can be reduced by replacing the CART algorithm with the MDLP algorithm, using a small number of candidates for the dynamic split method. Significant IV degradation is not expected. (A usage sketch follows.)
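
With this in place, selecting MDLP should presumably come down to the prebinning_method parameter (a sketch, assuming "mdlp" is accepted alongside the default "cart"):

from optbinning import OptimalBinning

# x and y are placeholders for the feature and the binary target
optb = OptimalBinning(name="var1", dtype="numerical",
                      prebinning_method="mdlp")
optb.fit(x, y)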

Test data: can't build and plot binning_table

In a scorecard, if we want to compare a variable's results on train and test, we cannot compute WoE or event rate on the test data; we can only do it on the train data via optb.binning_table.plot(metric='woe'). Could you please add a function for test data? (A workaround sketch follows.)
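
A workaround sketch in the meantime: map each test sample to its training bin with transform(metric="indices") and aggregate the test target per bin (optb, x_test and y_test are placeholders for a fitted binning and the test data):

import pandas as pd

# Bin index of each test sample, according to the splits learned on train
idx_test = optb.transform(x_test, metric="indices")

# Empirical event rate of the test data within each training bin
test_event_rate = pd.Series(y_test).groupby(idx_test).mean()
print(test_event_rate)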

Stochastic Optimal Binning

Implement the deterministic equivalent of the stochastic optimal binning problem.

Roadmap

  • Base class DSOptimalBinning
  • Binary target
  • Unit testing

  • Implementation for continuous variables.

Feature request: return a set of near-optimal solutions in addition to the best solution

Hi @guillermo-navas-palencia.
Thank you for this great project. It's very helpful for my work. I wonder whether the model could also return other, slightly less optimal solutions. In some circumstances a business needs additional conditions that cannot be expressed through the model's parameters, so the best approach would be to filter the set of near-optimal solutions against those conditions to find suitable ones. That's why I would like to ask for this feature in a next version.
Thanks in advance.

Population Stability Index (PSI)

Develop a function to compare two populations using PSI (a reference sketch follows the list below).

  • Scorecard monitoring (System stability report)
  • Variable monitoring (Characteristic analysis report)
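
For reference, PSI between an expected distribution (e.g. the training population) and an actual one (e.g. a recent population) over the same bins is commonly computed as the sum over bins of (a_i - e_i) * ln(a_i / e_i). A minimal sketch:

import numpy as np

def psi(expected_prop, actual_prop, eps=1e-6):
    """PSI between two vectors of bin proportions (each summing to 1)."""
    e = np.asarray(expected_prop, dtype=float) + eps  # eps guards empty bins
    a = np.asarray(actual_prop, dtype=float) + eps
    return float(np.sum((a - e) * np.log(a / e)))

# Identical distributions give PSI = 0; a common rule of thumb flags
# PSI > 0.25 as a significant population shift.
print(psi([0.3, 0.5, 0.2], [0.3, 0.5, 0.2]))
print(psi([0.3, 0.5, 0.2], [0.5, 0.3, 0.2]))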

Problem with monotonic trend

Hi @guillermo-navas-palencia,

I see this problem when I try to set the monotonic trend. With monotonic_trend='descending', OptBinning fails to bin. On the other hand, with monotonic_trend='auto' the result looks like this:
[image]

I think this problem arises when the number of splits is 1.

One-hot encoding option in BinningProcess

Issue: one-hot encoding cannot generate the same number of columns if the number of categories (bins) is not fixed. For example, if we split data into train and test, the number of columns after one-hot encoding might differ. This cannot occur using WoE transformation.

Comments @GabrielSGoncalves?

Quick questions

Nice package I came across. By the way, may I know:

  1. How can I remove the special/missing rows from the binning_table when plotting?
  2. How can I do customised binning (e.g., supply my own splits)? (See the sketch below.)

Thanks.
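
Two sketches, assuming a fitted OptimalBinning optb and the API of recent optbinning versions:

# 1. Plot without the Special and Missing bars
optb.binning_table.plot(metric="woe", add_special=False, add_missing=False)

# 2. Customised binning: supply your own split points via user_splits
optb_custom = OptimalBinning(name="var1", dtype="numerical",
                             user_splits=[10, 20, 30])
optb_custom.fit(x, y)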

Assign missing class to a bin & BinningProcess with user splits

I am running the below

user_splits = [2, 6, 14, 24, 34]
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip", user_splits=user_splits)
optb.fit(x, y)
binning_table = optb.binning_table
binning_table.build()

That gives the sample output below:
[image]
How do I force the missing bin to be in bin 0?

After running the user splits for the specific variables, how do I then use those splits when running binning_process = BinningProcess(variable_names, categorical_variables=categorical_variables, selection_criteria=selection_criteria) for the scorecard? (See the sketch below.)
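
On the second question, a sketch: per-variable user splits can be forwarded to BinningProcess through binning_fit_params (the keys are variable names, the values OptimalBinning options):

binning_fit_params = {variable: {"user_splits": [2, 6, 14, 24, 34]}}

binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables,
                                 selection_criteria=selection_criteria,
                                 binning_fit_params=binning_fit_params)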

Scorecard

Is there any way to avoid getting Special and Missing bins in the scorecard? And how can I get rid of features that have more than 50% of the population?
The last question (for this post): how can I make the bins integer? The feature has NaNs, which means the datatype is float, but the values are integers, so I get [1.5, 2.5), which means 2 is in this range.

Release 0.6.0

Prepare release 0.6.0

  • Documentation: scorecard and plots.
  • Tutorial: scorecard for binary and continuous target.
  • Release notes.
  • List of companies using OptBinning.

Sample Weights Error

An error occurs when optbinning.OptimalBinning.fit() is used with sample_weight specified and a dataset contains missing and/or special values.

Example (Kaggle Telco Customer Churn Dataset):

# %%

import optbinning
import numpy as np
import pandas as pd

# %%

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=",", engine="c")

y = df["Churn"].values

mask = y == "Yes"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)  # np.int was removed in recent NumPy versions


# %%

# Add special cases (use .loc to avoid chained-assignment pitfalls)
df.loc[0:99, 'MonthlyCharges'] = -1
df.loc[100:199, 'MonthlyCharges'] = -2

# Add missing values
df.loc[200:399, 'MonthlyCharges'] = np.nan


# %%

print('n_missing:', df['MonthlyCharges'].isna().sum())
print('n_special_cases:', sum(df['MonthlyCharges'] == -1))
print('n_special_cases:', sum(df['MonthlyCharges'] == -2))

# %%

binning = optbinning.OptimalBinning(dtype="numerical",
                                    prebinning_method="cart",
                                    special_codes=[-1, -2],
                                    verbose=True)

# Fit with random values as weights
binning.fit(x=df['MonthlyCharges'].values,
            y=y,
            sample_weight=list(np.random.rand(len(df))))

Then the error occurs:

Traceback (most recent call last):
  File "<input>", line 8, in <module>
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 528, in fit
    return self._fit(x, y, sample_weight, check_input)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 781, in _fit
    sw_special, sw_others)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 857, in _fit_prebinning
    ).fit(x, y, sample_weight)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/prebinning.py", line 117, in fit
    est.fit(x.reshape(-1, 1), y, sample_weight=sample_weight)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 894, in fit
    X_idx_sorted=X_idx_sorted)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 288, in fit
    sample_weight = _check_sample_weight(sample_weight, X, DOUBLE)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/utils/validation.py", line 1303, in _check_sample_weight
    .format(sample_weight.shape, (n_samples,)))
ValueError: sample_weight.shape == (7043,), expected (6643,)!

I assume the error can be fixed by updating the _fit function in binning.py: self._fit_prebinning should be called with sw_clean in the current position of sample_weight:

            splits, n_nonevent, n_event = self._fit_prebinning(
                x_clean, y_clean, y_missing, y_special, y_others,
                self.class_weight, sample_weight, sw_clean, sw_missing,
                sw_special, sw_others)

In this case, sw_clean will be used together with x_clean and y_clean inside _fit_prebinning, in the pre-binning initialisation:

        prebinning = PreBinning(method=self.prebinning_method,
                                n_bins=self.max_n_prebins,
                                min_bin_size=min_bin_size,
                                problem_type=self._problem_type,
                                class_weight=class_weight,
                                **self.prebinning_kwargs
                                ).fit(x, y, sample_weight)

Alternatively, the pre-binning initialisation could be made to use x_clean, y_clean and sw_clean explicitly.

Binning process sketch

Implementation of the binning process sketch to perform binning on datasets with the same structure stored on a distributed infrastructure, with no data centralization. This new class will allow binning very large datasets that do not fit in memory.

scorecard.intercept_ did not work

Hey, I'm on Ubuntu 20.04, Python 3.8.5.

binning_process = BinningProcess(variable_names,
                                 binning_fit_params={"solver": "mip"},
                                 categorical_variables=variable_names,
                                 special_codes=special_codes,
                                 selection_criteria=selection_criteria)
estimator = LogisticRegression(solver="lbfgs")
scorecard = Scorecard(target='IBM', binning_process=binning_process,
                      estimator=estimator, intercept_based=True)
scorecard.fit(data)

optbinning (Version 0.9.0)
Copyright (c) 2019-2021 Guillermo Navas-Palencia, Apache License 2.0

Begin options
  target                   IBM     * U
  binning_process          yes     * U
  estimator                yes     * U
  scaling_method           no      * d
  scaling_method_params    no      * d
  intercept_based          True    * U
  reverse_scorecard        False   * d
  rounding                 False   * d
  verbose                  False   * d
End options

Statistics
Number of records 7299
Number of variables 20
Target type binary

Number of numerical                    0
Number of categorical                 20
Number of selected                     5

print(scorecard.intercept_) returns 0.

saving OptimalBinning & some other issues

Hi,

I am dealing with a big dataset, so the scorecard module can't be used on my PC. I resort to applying OptimalBinning to each variable:

Code
optb = OptimalBinning(name=variable, dtype="numerical")
optb.fit(x, y)

import joblib
joblib.dump(optb, output+'txt.pkl')

Error
TypeError: can't pickle _thread.RLock objects

Would love to either:

  1. Obtain the bin criteria (to transform the dataset later), or
  2. Save the fitted OptimalBinning object in pickle format.

Any thoughts on the above are much appreciated.
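
On point 1, a sketch: the learned split points are plain data, so they can be persisted without pickling the whole object and reapplied later through user_splits (with user_splits_fixed to keep them unchanged); note the re-fit still needs the training data:

import json

# Persist the learned split points (a numpy array on the fitted object)
with open("var1_splits.json", "w") as f:
    json.dump(optb.splits.tolist(), f)

# Later / elsewhere: rebuild the binning with the saved splits fixed
with open("var1_splits.json") as f:
    splits = json.load(f)

optb_restored = OptimalBinning(name=variable, dtype="numerical",
                               user_splits=splits,
                               user_splits_fixed=[True] * len(splits))
optb_restored.fit(x, y)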

Question On Custom Metric

Hi,

thanks for sharing your project!

I'd like to do an optimal binning fit with a binary target and therefore have two rather specific questions:

  1. Is it possible to optimize a custom metric?
    Quick example: if I declare S to be what you refer to as "Event" and B to be what you refer to as "Non-event", I am looking for a binning that optimizes/maximizes e.g. S/(S+B) at a certain threshold.

  2. If my values range from 0 to 1, I'd like to have the binning within [0,1] as well, rather than having the first bin starting from -inf and the last bin ranging to +inf. Is there a way to achieve this?

Cheers!

Missing values considered as Special on transform method

Hi, @guillermo-navas-palencia ! Again, thanks for sharing your work with the community 🥇

I was writing a class using BinningProcess to create a binning table on a test dataset. So, I have a df_train and a df_test; I fitted the BinningProcess on df_train and transformed both df_train and df_test, then checked the binning table. So far so good: the binning_table showed that I had some missing values in the Missing bin, which is expected.
After that I did a sanity check: I joined df_train_transformed with the binning_table by WoE and re-grouped it by Bin. Then I realized that in the transformation, BinningProcess was treating the missing values as Special values and transformed them to WoE = 0. Am I doing something wrong in this process?

Steps to reproduce:

# Get any DataFrame with missing values in column "feature_na"

binning_process = BinningProcess(variable_names=features,)

X_train, y_train = df_train.loc[:, features].copy(), df_train[target]

X_train_binned = binning_process.fit_transform(X_train, y_train)

# Check the WoE of each bin
binning_process.get_binned_variable(name='feature_na').binning_table.build()

# Check if these WoE are in the binned dataframe
X_train_binned.loc[:, 'feature_na'].unique()

# Or, for a more complex check, merge the X_train_binned with the binning_table on WoE and groupby Bin

Setting 'user_splits_fixed' in categorical binning

Hi @guillermo-navas-palencia,

There is a problem when I try to set a value for user_splits_fixed. Suppose I have a column with the event rate for each value like this:

Value  Event rate
-1     0.011665
2      0
3      0.0133333
4      0.166667
7      0
8      0.0246041
9      0
10     0.025641

Then when I set user_splits = [[2., 7., 9., 3., 10., 4.], [8], [-1]], user_splits_fixed=[True, True, True], monotonic_trend=None, dtype='categorical',
the program raises ValueError: Fixed user_splits [list([2.0, 7.0, 9.0, 3.0, 10.0, 4.0])] are removed because produce pure prebins. Provide different splits to be fixed. What is wrong here?

optbinning.binning module not found

The import fails and throws a ModuleNotFoundError:

from optbinning import OptimalBinning

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/surya/miniconda3/envs/ml/lib/python3.7/site-packages/optbinning/__init__.py", line 1, in <module>
    from .binning.binning import OptimalBinning
ModuleNotFoundError: No module named 'optbinning.binning'
