optbinning's People

Contributors

alexliap, bmreiniger, floriankappert, gabrielsgoncalves, guillermo-navas-palencia, jnsofini, kasdeblieck, lassilehtonen, nehaljwani, peterpanmj, philip-khor, stradivari96, tranvohuy

optbinning's Issues

MDLP algorithm

Python implementation of the Minimum Description Length Principle (MDLP) discretization algorithm.

Error while importing optbinning installed from scorecard branch

Hi Guillermo,

I'm getting this error when I try to import optbinning installed from the scorecard branch:

import optbinning
ImportError                               Traceback (most recent call last)
<ipython-input-2-3b16b6a50809> in <module>
     47 
     48 from decouple import config
---> 49 from optbinning import OptimalBinning
     50 
     51 # Define environment variables

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/__init__.py in <module>
----> 1 from .binning.binning import OptimalBinning
      2 from .binning.continuous_binning import ContinuousOptimalBinning
      3 from .binning.multiclass_binning import MulticlassOptimalBinning
      4 from .binning.binning_process import BinningProcess
      5 from .binning.mdlp import MDLP

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/binning/binning.py in <module>
     15 
     16 from ..logging import Logger
---> 17 from ..preprocessing import preprocessing_user_splits_categorical
     18 from ..preprocessing import split_data
     19 from .auto_monotonic import auto_monotonic

~/miniconda3/envs/jeitto/lib/python3.7/site-packages/optbinning-0.5.0-py3.7.egg/optbinning/preprocessing.py in <module>
     15 from sklearn.utils import check_consistent_length
     16 from sklearn.utils import compute_class_weight
---> 17 from sklearn.utils.validation import _check_sample_weight
     18 
     19 from .outlier import ModifiedZScoreDetector

ImportError: cannot import name '_check_sample_weight' from 'sklearn.utils.validation' (/home/gabriel/miniconda3/envs/jeitto/lib/python3.7/site-packages/sklearn/utils/validation.py)

When I install optbinning from the master branch, it runs fine.
Do you know what might be causing this?

Best regards,

Gabriel
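
For what it's worth, the missing helper _check_sample_weight lives in sklearn.utils.validation and, to the best of my knowledge, was added around scikit-learn 0.22, so the failure suggests the environment has an older scikit-learn. A minimal diagnostic sketch (not optbinning code):

import sklearn

print(sklearn.__version__)

try:
    # This is the exact import that fails inside optbinning/preprocessing.py
    from sklearn.utils.validation import _check_sample_weight  # noqa: F401
    print("scikit-learn is recent enough for the scorecard branch")
except ImportError:
    print("scikit-learn is too old; try: pip install -U scikit-learn")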

Monotonic trend on BinningProcess

  • How do I force the bins to have a monotonic trend when using BinningProcess(variable_names, categorical_variables=categorical_variables, selection_criteria=selection_criteria)? (See the sketch after this list.)
  • Is there a way to round off the scores to the nearest decimal place?
  • From the scorecard example, how do I fit only the training data rather than the entire dataset, as shown in the scoring tutorial's scorecard.fit(df)?
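
On the first item, a sketch of one way this could be done, assuming a hypothetical variable name "age" (BinningProcess accepts per-variable fit options through binning_fit_params, which are forwarded to the underlying OptimalBinning):

from optbinning import BinningProcess

# "age" is a placeholder variable name; monotonic_trend is forwarded to the
# OptimalBinning instance fitted for that variable.
binning_fit_params = {"age": {"monotonic_trend": "ascending"}}

binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables,
                                 selection_criteria=selection_criteria,
                                 binning_fit_params=binning_fit_params)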

Plots for Score Card

Hi Guillermo,

I just wanted to share a few functions that I use to visualize some important metrics when working with scorecards. They basically compute and plot AUROC, Gini and KS.

I think it would be a great addition to the scorecard module.

Feel free to use (and modify if needed) the code if you think it would help users.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score


def plot_auroc(
    df,
    target_col,
    proba_col,
    title='ROC curve',
    xlabel='False positive rate',
    ylabel='True positive rate',
    roc_color='blue',
    reference_color='black',
    savefig=False,
    figname='auroc.png',
):
    """
    Plot AUROC for dataframe with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'ROC curve'): Title for the plot.
        xlabel (str, default 'False positive rate'): Label for the x-axis.
        ylabel (str, default 'True positive rate'): Label for the y-axis.
        roc_color (str, default 'blue'): Color for the ROC curve.
        reference_color (str, default 'black'): Color for the diagonal
            reference line.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'auroc.png'): Name for the figure file.

    Return:
        auroc (float): Calculated AUROC score.
    """
    # Define the arrays for plotting
    fpr, tpr, threshold = roc_curve(df[target_col], df[proba_col])

    # Define the plot settings
    plt.plot(fpr, tpr, color=roc_color)
    plt.plot(fpr, fpr, linestyle="--", color=reference_color)
    plt.title(title, fontdict={'fontsize': 16})
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})

    if savefig:
        plt.savefig(figname, dpi=300)

    # Set AUROC score
    auroc = roc_auc_score(df[target_col], df[proba_col])
    return auroc


def plot_gini(
    df,
    target_col,
    proba_col,
    title='Gini',
    xlabel='Cumulative % Population',
    ylabel='Cumulative % Bad',
    gini_color='blue',
    reference_color='black',
    savefig=False,
    figname='gini.png',
):
    """
    Plot Gini for a dataframe with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'Gini'): Title for the plot.
        xlabel (str, default 'Cumulative % Population'): Label for the x-axis.
        ylabel (str, default 'Cumulative % Bad'): Label for the y-axis.
        gini_color (str, default 'blue'): Color for the cumulative curve.
        reference_color (str, default 'black'): Color for the diagonal
            reference line.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'gini.png'): Name for the figure file.

    Return:
        gini (float): Calculated Gini index.

    """
    # Sort dataframe by probabilities and reset index
    df = df.sort_values(proba_col)
    df = df.reset_index()

    # Calculate cumulative columns
    df["cumulative_n_population"] = df.index + 1
    df["cumulative_n_good"] = df[target_col].cumsum()
    df["cumulative_n_bad"] = (
        df["cumulative_n_population"] - df[target_col].cumsum()
    )
    df["cumulative_perc_population"] = df["cumulative_n_population"] / (
        df.shape[0]
    )
    df["cumulative_perc_good"] = df["cumulative_n_good"] / df[target_col].sum()
    df["cumulative_perc_bad"] = df["cumulative_n_bad"] / (
        df.shape[0] - df[target_col].sum()
    )

    # Plot Gini
    plt.plot(
        df["cumulative_perc_population"],
        df["cumulative_perc_bad"],
        color=gini_color,
    )
    plt.plot(
        df["cumulative_perc_population"],
        df["cumulative_perc_population"],
        linestyle="--",
        color=reference_color,
    )
    plt.title(title, fontdict={'fontsize': 16})
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})

    # Save figure if requested
    if savefig:
        plt.savefig(figname, dpi=300)

    # Calculate Gini Index
    auroc = roc_auc_score(df[target_col], df[proba_col])
    gini = auroc * 2 - 1
    return gini


def plot_ks(
    df,
    target_col,
    proba_col,
    title="Kolmogorov-Smirnov",
    xlabel="Estimated Probability for being Good",
    ylabel="Cumulative %",
    good_color="blue",
    bad_color="red",
    savefig=False,
    figname="ks.png",
):
    """
    Create Kolmogorov-Smirnov plot for df with probabilities and binary target.

    Args:
        df (pandas.DataFrame): Dataframe with the columns to be plotted.
        target_col (str): Column with binary target.
        proba_col (str): Column with the calculated probabilities.
        title (str, default 'Kolmogorov-Smirnov'): Title for the plot.
        xlabel (str, default 'Estimated Probability for being Good'): Label
            for the x-axis.
        ylabel (str, default 'Cumulative %'): Label for the y-axis.
        good_color (str, default 'blue'): Color for the goods lineplot.
        bad_color (str, default 'red'): Color for the bads lineplot.
        savefig (boolean, default False): Whether to save the figure.
        figname (string, default 'ks.png'): Name for the figure file.

    Return:
        ks_score (float): Calculated Kolmogorov-Smirnov statistic.

    """
    # Sort dataframe by probabilities and reset index
    df = df.sort_values(proba_col)
    df = df.reset_index()

    # Calculate cumulative columns
    df["cumulative_n_population"] = df.index + 1
    df["cumulative_n_good"] = df[target_col].cumsum()
    df["cumulative_n_bad"] = (
        df["cumulative_n_population"] - df[target_col].cumsum()
    )
    df["cumulative_perc_population"] = df["cumulative_n_population"] / (
        df.shape[0]
    )
    df["cumulative_perc_good"] = df["cumulative_n_good"] / df[target_col].sum()
    df["cumulative_perc_bad"] = df["cumulative_n_bad"] / (
        df.shape[0] - df[target_col].sum()
    )

    # Plot KS
    plt.plot(
        df[proba_col], df["cumulative_perc_bad"], color=bad_color,
    )
    plt.plot(
        df[proba_col], df["cumulative_perc_good"], color=good_color,
    )
    plt.xlabel(xlabel, fontdict={'fontsize': 16})
    plt.ylabel(ylabel, fontdict={'fontsize': 16})
    plt.title(title, fontdict={'fontsize': 16})

    ks_score = max(df["cumulative_perc_bad"] - df["cumulative_perc_good"])
    ks_max_idx = np.argmax(
        df["cumulative_perc_bad"] - df["cumulative_perc_good"]
    )
    # Define the KS line

    plt.vlines(
        df[proba_col].iloc[ks_max_idx],
        ymin=df["cumulative_perc_good"].iloc[ks_max_idx],
        ymax=df["cumulative_perc_bad"].iloc[ks_max_idx],
        color='black',
        linestyles='--',
    )

    plt.text(
        df[proba_col].iloc[ks_max_idx] + 0.1,
        (
            (
                df["cumulative_perc_good"].iloc[ks_max_idx]
                + df["cumulative_perc_bad"].iloc[ks_max_idx]
            )
            * 0.5
        ),
        f'KS = {round(ks_score * 100, 2)}%',
        fontsize=12,
        rotation_mode='anchor',
    )

    # Save figure if requested
    if savefig:
        plt.savefig(figname, dpi=300)

    return ks_score
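
A hypothetical usage example for the three functions above, assuming a scored dataframe df_scored with a binary column "target" and a probability column "proba":

# Each function draws on the current matplotlib figure, so open a new
# figure between calls to avoid overplotting the curves.
plt.figure()
auroc = plot_auroc(df_scored, target_col="target", proba_col="proba")

plt.figure()
gini = plot_gini(df_scored, target_col="target", proba_col="proba")

plt.figure()
ks = plot_ks(df_scored, target_col="target", proba_col="proba")

print(f"AUROC={auroc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")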

How to bin new samples with splits generated from training?

Hi,

First of all, congrats for the library!

I'm trying to use the bins generated and use it to bin data in production.
Here is my example:

from optbinning import OptimalBinning

x = df["var1"].to_numpy()
y = df["target"].to_numpy()

optb = OptimalBinning(name='var1')
optb.fit(x, y)
optb.splits

Then I get an array with the splits (for a numerical variable).
That's great, but if I want to categorize 'var1' for one row of new data (not part of training), what's the best way to do it?

One way is to use pandas.cut(), but I don't know if there is a better way using optbinning.

pd.cut(df_new["var1"], bins=optb.splits)

Also, how can I deal with categorical variables, where some bins are more than one category merged into an array (e.g., ['cat1', 'cat3', 'catn'])?

Best regards,

Gabriel
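
For reference, a fitted OptimalBinning can apply its bins to new data directly through transform; the metric argument selects the output representation (a sketch based on the documented API of recent optbinning versions):

x_new = df_new["var1"].to_numpy()

# WoE value of the bin each new sample falls into
woe_new = optb.transform(x_new, metric="woe")

# Human-readable bin labels; for a categorical variable these reflect the
# merged categories, so no manual pd.cut handling is needed
bins_new = optb.transform(x_new, metric="bins")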

Categorical binning does not reach the optimal IV

Hi @guillermo-navas-palencia,

When I use OptimalBinning with dtype='categorical', min_bin_size=0.05, max_bin_size=None on a column with a distribution like this:

Value  Count  Event rate
G-1    16631  0.011665
G10       39  0.025641
G2         2  0
G3        75  0.0133333
G4         6  0.166667
G7        21  0
G8      3536  0.0246041
G9        12  0

I get the result look like:

Bin Count Count(%) Non-event Event Eventrate WoE IV JS
0 ['G2' 'G7' 'G9' 'G-1' 'G3'] 16741 0.823787 16546 195 0.011648 0.184488589528379 0.0256645 0.00320352
1 ['G8' 'G10' 'G4'] 3581 0.176213 3492 89 0.0248534 -0.586817964430649 0.0816331 0.0100602
2 Special 0 0 0 0 0 0.0 0 0
3 Missing 0 0 0 0 0 0.0 0 0
Totals 20322 1 20038 284 0.013975 0.107298 0.0132637

But when I manually group all values except G8 and G-1 into a single value G1 and use the same binning options, I get a result like this:

Bin Count Count(%) Non-event Event Eventrate WoE IV JS
0 ['G-1'] 16631 0.712798 16437 194 0.011665 0.38550821182938577 0.0883861 0.0109804
1 ['G8'] 3536 0.151552 3449 87 0.0246041 -0.37399230503155456 0.0255081 0.00317006
2 ['G1'] 3165 0.135651 3048 117 0.0369668 -0.7938568173050311 0.127864 0.0155761
3 Special 0 0 0 0 0 0.0 0 0
4 Missing 0 0 0 0 0 0.0 0 0
Totals 23332 1 22934 398 0.0170581 0.241758 0.0297265

So my question is: if the objective is to maximize IV, why doesn't the algorithm group these values the way I did manually, given that the second IV is much better than the first?

Setting value for max_bin_size will break binning process

When I run any type of binning, especially continuous binning, and set the parameter max_bin_size expecting each bin's fraction to be smaller than this value, I usually get a result where all observations belong to just one bin. I think this may be a bug.

Release 0.7.0

  • OptimalBinningSketch introduction.
  • OptimalBinningSketch data streams tutorial.
  • OptimalBinningSketch pyspark tutorial.
  • Update to version 0.7.0
  • Update tutorials
  • Update README
  • Update Release Notes

AttributeError: 'DataFrame' object has no attribute 'dtype'

I am trying to run the scorecard model

scorecard_full = Scorecard(
    target='FLAG',
    binning_process=binning_process,
    estimator=lr,
    scaling_method=scaling_method,
    scaling_method_params=scaling_method_data,
    intercept_based=False,
    reverse_scorecard=False,
    rounding=True)
scorecard_full.fit(df_train)

but I get the error below (traceback excerpt):

--> return self._fit(df, metric_special, metric_missing, show_digits, check_input)
AttributeError: 'DataFrame' object has no attribute 'dtype'

My dataframe has some missing values which I have not imputed.

serializing the scorecard

Thank you for the great work.
Kindly advise on how to serialize the scorecard object.
I tried pickle, joblib and dill, but they all give this error: "TypeError: cannot pickle 'SwigPyObject' object".
Thanks
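
A generic diagnostic sketch (not optbinning-specific) that can help locate which attribute of the fitted object holds the SWIG wrapper that pickle rejects:

import pickle

# scorecard is the fitted Scorecard instance from your own session
for name, value in vars(scorecard).items():
    try:
        pickle.dumps(value)
    except Exception as exc:
        print(f"attribute {name!r} is not picklable: {exc}")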

Information Value whole dataset - Scorecard

How do I compute the information value for the entire dataset? (See the sketch below.)

Second, is it advisable to do the binning on the train set and later apply those bins to the test set, or to bin the entire dataset?
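
On the first question, a sketch assuming a BinningProcess fitted on the full dataset (X and y are placeholders here); to the best of my knowledge its summary() method reports per-variable statistics, including IV:

binning_process.fit(X, y)

# One row per variable; the "iv" column holds the information value
summary = binning_process.summary()
print(summary[["name", "iv"]].sort_values("iv", ascending=False))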

Support prebinning algorithm MDLP

For large datasets, the time spent in pre-binning can be reduced by replacing the CART algorithm with the MDLP algorithm, using a small number of candidates for the dynamic split method. Significant IV degradation is not expected. (A usage sketch follows.)
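
With this in place, selecting MDLP should presumably come down to the prebinning_method parameter (a sketch, assuming "mdlp" is accepted alongside the default "cart"):

from optbinning import OptimalBinning

# x and y are placeholders for the feature and the binary target
optb = OptimalBinning(name="var1", dtype="numerical",
                      prebinning_method="mdlp")
optb.fit(x, y)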

Test data: can't build and plot binning_table

In a scorecard, if we want to compare a variable's results on train and test, we cannot compute WoE or event rate on the test data; we can only do it on the train data via optb.binning_table.plot(metric='woe'). Could you please add a function for test data? (A workaround sketch follows.)
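
A workaround sketch in the meantime: map each test sample to its training bin with transform(metric="indices") and aggregate the test target per bin (optb, x_test and y_test are placeholders for a fitted binning and the test data):

import pandas as pd

# Bin index of each test sample, according to the splits learned on train
idx_test = optb.transform(x_test, metric="indices")

# Empirical event rate of the test data within each training bin
test_event_rate = pd.Series(y_test).groupby(idx_test).mean()
print(test_event_rate)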

Stochastic Optimal Binning

Implement the deterministic equivalent of the stochastic optimal binning problem.

Roadmap

  • Base class DSOptimalBinning
  • Binary target
  • Unit testing

  • Implementation for continuous variables.

Feature request: return a set of near-optimal solutions in addition to the best solution

Hi @guillermo-navas-palencia.
Thank you for this great project. It's very helpful for my work. I wonder whether the model could also return other, slightly less optimal solutions. In some circumstances a business needs additional conditions that cannot be expressed through the model's parameters, so the best approach would be to filter the set of near-optimal solutions against those conditions to find suitable ones. That's why I would like to ask for this feature in a next version.
Thanks in advance.

Population Stability Index (PSI)

Develop a function to compare two populations using PSI (a reference sketch follows the list below).

  • Scorecard monitoring (System stability report)
  • Variable monitoring (Characteristic analysis report)
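
For reference, PSI between an expected distribution (e.g. the training population) and an actual one (e.g. a recent population) over the same bins is commonly computed as the sum over bins of (a_i - e_i) * ln(a_i / e_i). A minimal sketch:

import numpy as np

def psi(expected_prop, actual_prop, eps=1e-6):
    """PSI between two vectors of bin proportions (each summing to 1)."""
    e = np.asarray(expected_prop, dtype=float) + eps  # eps guards empty bins
    a = np.asarray(actual_prop, dtype=float) + eps
    return float(np.sum((a - e) * np.log(a / e)))

# Identical distributions give PSI = 0; a common rule of thumb flags
# PSI > 0.25 as a significant population shift.
print(psi([0.3, 0.5, 0.2], [0.3, 0.5, 0.2]))
print(psi([0.3, 0.5, 0.2], [0.5, 0.3, 0.2]))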

Problem with monotonic trend

Hi @guillermo-navas-palencia,

I see this problem when I try to set the monotonic trend. With monotonic_trend='descending', OptBinning fails to bin. On the other hand, with monotonic_trend='auto' the result looks like this:
[image]

I think this problem arises when the number of splits is 1.

One-hot encoding option in BinningProcess

Issue: one-hot encoding cannot generate the same number of columns if the number of categories (bins) is not fixed. For example, if we split data into train and test, the number of columns after one-hot encoding might differ. This cannot occur using WoE transformation.

Comments @GabrielSGoncalves?

Quick questions

Nice package I came across. By the way, may I know:

  1. How can I remove the special/missing rows from the binning_table when plotting?
  2. How can I do customised binning (e.g., supply my own splits)? (See the sketch below.)

Thanks.
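
Two sketches, assuming a fitted OptimalBinning optb and the API of recent optbinning versions:

# 1. Plot without the Special and Missing bars
optb.binning_table.plot(metric="woe", add_special=False, add_missing=False)

# 2. Customised binning: supply your own split points via user_splits
optb_custom = OptimalBinning(name="var1", dtype="numerical",
                             user_splits=[10, 20, 30])
optb_custom.fit(x, y)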

Assign missing class to a bin & BinningProcess with user splits

I am running the below

user_splits = [2, 6, 14, 24, 34]
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip", user_splits=user_splits)
optb.fit(x, y)
binning_table = optb.binning_table
binning_table.build()

That gives the sample output below:
[image]
How do I force the missing bin to be in bin 0?

After running the user splits for the specific variables, how do I then use those splits when running binning_process = BinningProcess(variable_names, categorical_variables=categorical_variables, selection_criteria=selection_criteria) for the scorecard? (See the sketch below.)
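
On the second question, a sketch: per-variable user splits can be forwarded to BinningProcess through binning_fit_params (the keys are variable names, the values OptimalBinning options):

binning_fit_params = {variable: {"user_splits": [2, 6, 14, 24, 34]}}

binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables,
                                 selection_criteria=selection_criteria,
                                 binning_fit_params=binning_fit_params)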

Scorecard

Is there any way to avoid getting Special and Missing bins in the scorecard? And how can I get rid of features that have more than 50% of the population?
The last question (for this post): how can I make the bins integer? The feature has NaNs, which means the datatype is float, but the values are integers, so I get [1.5, 2.5), which means 2 is in this range.

Release 0.6.0

Prepare release 0.6.0

  • Documentation: scorecard and plots.
  • Tutorial: scorecard for binary and continuous target.
  • Release notes.
  • List of companies using OptBinning.

Sample Weights Error

An error occurs when optbinning.OptimalBinning.fit() is used with sample_weight specified and a dataset contains missing and/or special values.

Example (Kaggle Telco Customer Churn Dataset):

# %%

import optbinning
import numpy as np
import pandas as pd

# %%

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=",", engine="c")

y = df["Churn"].values

mask = y == "Yes"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)  # np.int was removed in recent NumPy versions


# %%

# Add special cases (use .loc to avoid chained-assignment pitfalls)
df.loc[0:99, 'MonthlyCharges'] = -1
df.loc[100:199, 'MonthlyCharges'] = -2

# Add missing values
df.loc[200:399, 'MonthlyCharges'] = np.nan


# %%

print('n_missing:', df['MonthlyCharges'].isna().sum())
print('n_special_cases:', sum(df['MonthlyCharges'] == -1))
print('n_special_cases:', sum(df['MonthlyCharges'] == -2))

# %%

binning = optbinning.OptimalBinning(dtype="numerical",
                                    prebinning_method="cart",
                                    special_codes=[-1, -2],
                                    verbose=True)

# Fit with random values as weights
binning.fit(x=df['MonthlyCharges'].values,
            y=y,
            sample_weight=list(np.random.rand(len(df))))

Then the error occurs:

Traceback (most recent call last):
  File "<input>", line 8, in <module>
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 528, in fit
    return self._fit(x, y, sample_weight, check_input)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 781, in _fit
    sw_special, sw_others)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/binning.py", line 857, in _fit_prebinning
    ).fit(x, y, sample_weight)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/optbinning/binning/prebinning.py", line 117, in fit
    est.fit(x.reshape(-1, 1), y, sample_weight=sample_weight)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 894, in fit
    X_idx_sorted=X_idx_sorted)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 288, in fit
    sample_weight = _check_sample_weight(sample_weight, X, DOUBLE)
  File "/Users/alexander/.conda/envs/woe/lib/python3.7/site-packages/sklearn/utils/validation.py", line 1303, in _check_sample_weight
    .format(sample_weight.shape, (n_samples,)))
ValueError: sample_weight.shape == (7043,), expected (6643,)!

I assume the error can be fixed by updating the _fit function in binning.py: self._fit_prebinning should be called with sw_clean in the current position of sample_weight:

            splits, n_nonevent, n_event = self._fit_prebinning(
                x_clean, y_clean, y_missing, y_special, y_others,
                self.class_weight, sample_weight, sw_clean, sw_missing,
                sw_special, sw_others)

In this case, sw_clean will be used together with x_clean and y_clean inside _fit_prebinning, in the pre-binning initialisation:

        prebinning = PreBinning(method=self.prebinning_method,
                                n_bins=self.max_n_prebins,
                                min_bin_size=min_bin_size,
                                problem_type=self._problem_type,
                                class_weight=class_weight,
                                **self.prebinning_kwargs
                                ).fit(x, y, sample_weight)

Alternatively, the pre-binning initialisation could be made to use x_clean, y_clean and sw_clean explicitly.

Binning process sketch

Implementation of the binning process sketch to perform binning on datasets with the same structure stored on a distributed infrastructure, with no data centralization. This new class will allow binning very large datasets that do not fit in memory.

scorecard.intercept_ did not work

Hey, I'm on Ubuntu 20.04, Python 3.8.5.

binning_process = BinningProcess(variable_names,
                                 binning_fit_params={"solver": "mip"},
                                 categorical_variables=variable_names,
                                 special_codes=special_codes,
                                 selection_criteria=selection_criteria)
estimator = LogisticRegression(solver="lbfgs")
scorecard = Scorecard(target='IBM', binning_process=binning_process,
                      estimator=estimator, intercept_based=True)
scorecard.fit(data)

optbinning (Version 0.9.0)
Copyright (c) 2019-2021 Guillermo Navas-Palencia, Apache License 2.0

Begin options
  target                   IBM     * U
  binning_process          yes     * U
  estimator                yes     * U
  scaling_method           no      * d
  scaling_method_params    no      * d
  intercept_based          True    * U
  reverse_scorecard        False   * d
  rounding                 False   * d
  verbose                  False   * d
End options

Statistics
Number of records 7299
Number of variables 20
Target type binary

Number of numerical                    0
Number of categorical                 20
Number of selected                     5

print(scorecard.intercept_) returns 0.

saving OptimalBinning & some other issues

Hi,

I am dealing with a big dataset, so the scorecard module can't be used on my PC. I resort to applying OptimalBinning to each variable:

Code
optb = OptimalBinning(name=variable, dtype="numerical")
optb.fit(x, y)

import joblib
joblib.dump(optb, output+'txt.pkl')

Error
TypeError: can't pickle _thread.RLock objects

Would love to either:

  1. Obtain the bin criteria (to transform the dataset later), or
  2. Save the fitted OptimalBinning object in pickle format.

Any thoughts on the above are much appreciated.
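
On point 1, a sketch: the learned split points are plain data, so they can be persisted without pickling the whole object and reapplied later through user_splits (with user_splits_fixed to keep them unchanged); note the re-fit still needs the training data:

import json

# Persist the learned split points (a numpy array on the fitted object)
with open("var1_splits.json", "w") as f:
    json.dump(optb.splits.tolist(), f)

# Later / elsewhere: rebuild the binning with the saved splits fixed
with open("var1_splits.json") as f:
    splits = json.load(f)

optb_restored = OptimalBinning(name=variable, dtype="numerical",
                               user_splits=splits,
                               user_splits_fixed=[True] * len(splits))
optb_restored.fit(x, y)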

Question On Custom Metric

Hi,

thanks for sharing your project!

I'd like to do an optimal binning fit with a binary target and therefore have two rather specific questions:

  1. Is it possible to optimize a custom metric?
    Quick example: if I declare S to be what you refer to as "Event" and B to be what you refer to as "Non-event", I am looking for a binning that optimizes/maximizes e.g. S/(S+B) at a certain threshold.

  2. If my values range from 0 to 1, I'd like to have the binning within [0,1] as well, rather than having the first bin starting from -inf and the last bin ranging to +inf. Is there a way to achieve this?

Cheers!

Missing values considered as Special on transform method

Hi, @guillermo-navas-palencia ! Again, thanks for sharing your work with the community 🥇

I was writing a class using BinningProcess to create a binning table on a test dataset. So, I have a df_train and a df_test; I fitted the BinningProcess on df_train and transformed both df_train and df_test, then checked the binning table. So far so good: the binning_table showed that I had some missing values in the Missing bin, which is expected.
After that I did a sanity check: I joined df_train_transformed with the binning_table by WoE and re-grouped it by Bin. Then I realized that in the transformation, BinningProcess was treating the missing values as Special values and transformed them to WoE = 0. Am I doing something wrong in this process?

Steps to reproduce:

# Get any DataFrame with missing values in column "feature_na"

binning_process = BinningProcess(variable_names=features,)

X_train, y_train = df_train.loc[:, features].copy(), df_train[target]

X_train_binned = binning_process.fit_transform(X_train, y_train)

# Check the WoE of each bin
binning_process.get_binned_variable(name='feature_na').binning_table.build()

# Check if these WoE are in the binned dataframe
X_train_binned.loc[:, 'feature_na'].unique()

# Or, for a more complex check, merge the X_train_binned with the binning_table on WoE and groupby Bin

Setting 'user_splits_fixed' in categorical binning

Hi @guillermo-navas-palencia,

There is a problem when I try to set a value for user_splits_fixed. Suppose I have a column with the event rate for each value like this:

Value  Event rate
-1     0.011665
2      0
3      0.0133333
4      0.166667
7      0
8      0.0246041
9      0
10     0.025641

Then when I set user_splits = [[2., 7., 9., 3., 10., 4.], [8], [-1]], user_splits_fixed=[True, True, True], monotonic_trend=None, dtype='categorical',
the program raises ValueError: Fixed user_splits [list([2.0, 7.0, 9.0, 3.0, 10.0, 4.0])] are removed because produce pure prebins. Provide different splits to be fixed. What is wrong here?

optbinning.binning module not found

The import fails and throws a ModuleNotFoundError:

from optbinning import OptimalBinning

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/surya/miniconda3/envs/ml/lib/python3.7/site-packages/optbinning/__init__.py", line 1, in <module>
    from .binning.binning import OptimalBinning
ModuleNotFoundError: No module named 'optbinning.binning'
