Comments (21)

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

Just wanted to start by saying I'm really excited about the idea of helping develop this module inside optbinning.

As I mentioned before on the other issue, I'm working on the development of a ScoreCard model, and the functionalities offered by optbinning are already extremely powerful. As I'm not an expert in scorecards, I'm trying to follow the traditional logic based on WoE and IV for feature selection.

What I have drafted so far:

  1. Define the dtype and fillna method for each variable and apply them to the dataframe;
  2. Use OptimalBinning to get the bin categories for each variable;
  3. Check the IV of each variable and filter the most relevant ones;
  4. Transform the dataframe into bin categories and perform one-hot encoding.

After these steps I would have a processed dataframe with each column corresponding to a bin category, ready for model training using sklearn LogisticRegression (or any other algorithm).

Here is what I have constructed until now.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd
from optbinning import OptimalBinning


class OptScoreCard:
    """Create ScoreCards using optbinning.

    Args:
        df (pandas.DataFrame): Dataframe containing the features.
        dict_parameters (dict): Dictionary with the column name as key and a
            (fillna value, dtype) pair as value.
        target_col (str): Name of the binary target column.

    Attributes:
        df (pandas.DataFrame): Stores the input dataframe with the variables.
        target_array (numpy.ndarray): Array containing the target variable.
        dict_parameters (dict): Dictionary containing each feature column of
            interest from df as key, and a pair with the fillna placeholder
            and dtype as value.
        df_bins (pandas.DataFrame): Dataframe with bin categories replacing
            the original values.
        df_dummies (pandas.DataFrame): Dataframe with one-hot encoding applied
            to df_bins.
        dict_optbins (dict): Stores the OptimalBinning object, IV and
            WoE-to-bin mapping for each variable processed.
        dict_optbins_filtered (dict): Variables from dict_optbins kept after
            IV filtering.
        reference_bin_categories (dict): Variable name as key and reference
            bin category as value for each selected variable.

    """

    def __init__(self, df, dict_parameters, target_col):
        self.df = df
        self.target_array = self.df[target_col].to_numpy()
        self.dict_parameters = dict_parameters
        self.df_bins = pd.DataFrame()
        self.df_dummies = pd.DataFrame()
        self.dict_optbins = {}
        self.dict_optbins_filtered = {}
        self.reference_bin_categories = {}

    def fillna(self, filter_cols=False):
        """
        Replace null values based on pre-defined parameters.

        Args:
            filter_cols (bool): If True, keep only the columns defined in
                dict_parameters.
        """
        dict_median = {
            k: v[0]
            for k, v in self.dict_parameters.items()
            if v[0] == "median"
        }
        dict_mean = {
            k: v[0] for k, v in self.dict_parameters.items() if v[0] == "mean"
        }
        dict_values = {
            k: v[0]
            for k, v in self.dict_parameters.items()
            if v[0] not in ["median", "mean"]
        }
        dict_datatype = {
            k: v[1]
            for k, v in self.dict_parameters.items()
            if v[1] is not None
        }
        dict_median = (
            self.df[list(dict_median.keys())].median().round(2).to_dict()
        )
        dict_mean = self.df[list(dict_mean.keys())].mean().round(2).to_dict()

        # Fillna
        dict_fillna = {**dict_median, **dict_mean, **dict_values}
        self.df.fillna(dict_fillna, inplace=True)

        # Convert datatype
        self.df = self.df.astype(dict_datatype)
        if filter_cols:
            # Keep only the columns configured in dict_parameters
            self.df = self.df[list(self.dict_parameters.keys())]

    def get_optimal_bins(self, solver="cp"):
        """Perform optimal binning on each column of the dataframe.

        Args:
            solver (str, default='cp'): Algorithm used to solve the binning
                optimization problem.

        """
        for col, params in self.dict_parameters.items():
            if params[1] in ("category", "str"):
                dtype = "categorical"
            elif params[1] in ("float", "int"):
                dtype = "numerical"
            else:
                # Unsupported dtype: flag it instead of raising
                self.dict_optbins[col] = [None, 0, "dtype_error"]
                continue

            x = self.df[col].to_numpy()
            optb = OptimalBinning(name=col, dtype=dtype, solver=solver)
            optb.fit(x, self.target_array)

            table = optb.binning_table.build()
            # Map each bin's WoE to its label, skipping empty/Special/Missing bins
            woe_to_bin = {
                woe: str(bin_label)
                for woe, bin_label in zip(table.WoE, table.Bin)
                if str(bin_label) not in ("", "Special", "Missing")
            }
            self.dict_optbins[col] = [
                optb,
                table.at["Totals", "IV"],
                woe_to_bin,
            ]

    def show_iv_table(self):
        """Show a dataframe with variables and IV in ascending order."""
        return pd.DataFrame.from_dict(
            self.dict_optbins,
            orient="index",
            columns=["OptimalBinning_obj", "IV", "WoE-index"],
        )[["IV"]].sort_values(by="IV", ascending=False)

    def filter_features_by_IV(self, by='best', threshold=50):
        """Select the features based on the IV score.

        Keeps the `threshold` variables with the highest IV. Other filtering
        strategies (e.g. percentage) still need to be implemented.
        """
        self.dict_optbins_filtered = {
            k: v
            for k, v in self.dict_optbins.items()
            if k in self.show_iv_table().IV[:threshold].index
        }
        self.df = self.df[list(self.dict_optbins_filtered.keys())]

    def transform_into_bin_categories(self):
        """Transform values into bin categories based on OptimalBinning."""
        for k, v in self.dict_optbins_filtered.items():
            # v[0] is the fitted OptimalBinning object for variable k
            self.df_bins[k] = pd.Series(
                v[0].transform(self.df[k].to_numpy(), metric="bins")
            )

    def create_dummies_from_bins(self):
        """Create a one-hot encoded dataframe from df_bins."""
        for k in self.dict_optbins_filtered:
            self.df_dummies = pd.concat(
                [
                    self.df_dummies,
                    pd.get_dummies(
                        self.df_bins[k], prefix=k, prefix_sep=":"
                    ),
                ],
                axis=1,
            )

    def get_reference_bin_categories(self):
        """
        Define reference category for binned variables.

        The reference categories are selected by the lowest WoE of the bins
        of a variable.
        """
        self.reference_bin_categories = {
            k: f'{k}:'
            + str(
                self.dict_optbins_filtered[k][2].get(
                    min(self.dict_optbins_filtered[k][2].keys())
                )
            )
            for k in self.dict_optbins_filtered.keys()
        }

The template for dict_parameters would be something like:

dict_parameters = {
    'var1': ["ind", "category"],
    'var2': ["median", "int"],
    'var3': ["mean", "float"],
    'var4': [0, "bool"]
}
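For clarity, here is a usage sketch of the draft above (the dataframe, column names and threshold are just placeholders for illustration):

# Hypothetical usage of the OptScoreCard draft above; `df` is assumed to be a
# pandas DataFrame containing var1-var4 plus a binary "target" column.
scorecard = OptScoreCard(df, dict_parameters, target_col="target")

scorecard.fillna(filter_cols=True)         # impute missing values and cast dtypes
scorecard.get_optimal_bins(solver="cp")    # fit one OptimalBinning per variable
print(scorecard.show_iv_table())           # inspect IV per variable

scorecard.filter_features_by_IV(threshold=3)  # keep the 3 variables with highest IV
scorecard.transform_into_bin_categories()     # raw values -> bin labels
scorecard.create_dummies_from_bins()          # bin labels -> one-hot columns
scorecard.get_reference_bin_categories()      # reference bin per variable

# Drop the reference category of each variable before fitting a logistic regression
X = scorecard.df_dummies.drop(columns=list(scorecard.reference_bin_categories.values()))
y = scorecard.target_array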

As you can see there is a lot of room for improvement and optimization.

Let me know what you think about it.

Best regards,

Gabriel

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

Thank you for the code and the effort!! :)

I still have to take a closer look, but here are some points:

  • I would definitely make use of the class BinningProcess, which already implements the methods get_optimal_bins and transform_into_bin_categories; it's best to reuse code. Note that this class also allows selecting variables based on the IV or Gini statistic.
  • I would rename the method show_iv_table as table to be more consistent with other classes.
  • The scorecard is usually part of the credit risk modeling cycle, but I am tempted to generalize it as much as possible, so it could be used "somehow" for other applications, even with a continuous target and not just a binary one. For example, could we compute scorecard points using other methods besides logistic regression coefficients? We could initially focus on the binary target case, but I think we should keep this generalization in mind.

BinningProcess API and tutorials. Note that the API has slightly changed after 4d54535.
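For reference, a minimal sketch of that reuse, following the BinningProcess API from the tutorials (the dataset here is only an example; parameter names may differ slightly between versions):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from optbinning import BinningProcess

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# One optimal binning per variable, fitted in a single call
binning_process = BinningProcess(variable_names=list(data.feature_names))
binning_process.fit(df, data.target)

# Replace raw values by WoE (or by bin labels with metric="bins")
df_woe = binning_process.transform(df, metric="woe")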

guillermo-navas-palencia commented on May 27, 2024

BTW: I think you are right about the lack of filtering strategies; they should be implemented in BinningProcess as well. The top % or top x are good additions.

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

I have been thinking about several improvements to BinningProcess required for building scorecards and for enhancing the variable selection criteria. I will replace the current parameters:

    min_iv : float or None, optional (default=None)
        The minimum information value. Applicable if target type is binary.

    max_iv : float or None, optional (default=None)
        The maximum information value. Applicable if target type is binary.

    min_js : float or None, optional (default=None)
        The minimum Jensen-Shannon divergence value. Applicable if target type
        is binary or multiclass.

    max_js : float or None, optional (default=None)
        The maximum Jensen-Shannon divergence value. Applicable if target type
        is binary or multiclass.

    quality_score_cutoff : float or None, optional (default=None)
        The quality score cutoff value. Applicable if target type is binary or
        multiclass.

by a more general approach: parameter selection_criteria, a dictionary with the following structure:

{"iv": {"min_value": 0.02, "max_value": 0.5, "strategy": "best", "top": 20}}

In this particular case, the top 20 variables with IV in [0.02, 0.5] will be selected. Several metrics could be combined, for example:

{"iv": {"min_value": 0.02, "max_value": 0.5, "strategy": "best", "top": 20},
"quality_score": {"min_value": 0.001}}
  • Values for key strategy are "best" and "worst".
  • Values for key top: if a decimal value is provided, it is treated as a percentage, e.g. strategy="best" and top=0.25 means the 25% best variables. If an integer is provided, e.g. top=25, the 25 best variables are selected.

Finally, by default selection_criteria is set to None.
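Just to make the proposed semantics concrete, a small standalone sketch of how such a criteria entry could be applied to a {variable: IV} mapping (purely illustrative, not the actual implementation):

def apply_criterion(metric_values, criterion):
    # Keep variables whose metric lies within [min_value, max_value]
    selected = {
        name: value for name, value in metric_values.items()
        if criterion.get("min_value", float("-inf"))
        <= value <= criterion.get("max_value", float("inf"))
    }
    # Then optionally keep only the top n (integer) or top fraction (decimal)
    if "top" in criterion:
        top = criterion["top"]
        n = round(top * len(selected)) if isinstance(top, float) and top < 1 else int(top)
        best = criterion.get("strategy", "best") == "best"
        ranked = sorted(selected.items(), key=lambda kv: kv[1], reverse=best)
        selected = dict(ranked[:n])
    return set(selected)


iv = {"var1": 0.6, "var2": 0.35, "var3": 0.04, "var4": 0.005}
print(apply_criterion(iv, {"min_value": 0.02, "max_value": 0.5, "strategy": "best", "top": 2}))
# -> {'var2', 'var3'}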

Does it make sense?

Thanks!

guillermo-navas-palencia commented on May 27, 2024

In addition, the method BinningProcess.transform() is a bit confusing. After transformation, the resulting dataset (numpy or pandas depending on the input type) should include only the selected features. Follow scikit-learn style: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.transform
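For comparison, this is the scikit-learn convention being referenced: transform() returns a dataset containing only the selected features.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X, y)

# Only the 10 selected columns remain after transform
print(X.shape, "->", rfe.transform(X).shape)  # (569, 30) -> (569, 10)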

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

I forgot to tell you why I decided to implement the get_optimal_bins method yesterday.
I was trying to use the BinningProcess class, but even when defining a few categorical_variables, with some numerical variables, it was considering all my variables as categorical.

The selection_criteria would help the user a lot, and the way you defined it looks easy to use.

It would also be a good idea to give the user access, inside this object, to the preprocessed dataframe and to the filtered and transformed dataframe.

I'm still trying to figure out the best way to organize the one-hot encoded dataframe, with and without the reference categories for each variable. I'll send you my solution for it next week.

If you need any help, please let me know.

Best regards,

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

The problem you mentioned is due to pandas converting a dataframe with mixed column types to numpy. This is a known problem.

Note, however, that version 0.4.0 introduced the option to pass a pandas dataframe as input. By doing so, you should not run into such problems. In summary, one should only use numpy arrays as X input if all columns have the same data type. See: http://gnpalencia.org/optbinning/tutorials/tutorial_binning_process_telco_churn.html
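A small illustration of the pandas behavior described above (column names are placeholders):

import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31], "job": ["teacher", "nurse", "driver"]})

# Converting a mixed-dtype dataframe to numpy upcasts every column to object,
# so numeric variables end up looking categorical downstream.
print(df.to_numpy().dtype)          # object
print(df["age"].to_numpy().dtype)   # int64

# Passing the dataframe itself (supported since optbinning 0.4.0) keeps the
# per-column dtypes, so numerical and categorical variables are detected correctly.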

In the meantime, I am working on the new selection_criteria.

Guillermo

guillermo-navas-palencia commented on May 27, 2024

It is not a good idea to store a dataframe within the class using self.df = df, because this could lead to memory issues if several BinningProcess objects are instantiated.

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

Now I understand the inconsistencies with the data types and NumPy arrays.

About storing a dataframe as an attribute of a class, what would be a more efficient way to do it?

Best regards,

guillermo-navas-palencia commented on May 27, 2024

I would not store any dataframe inside the class, just use it when needed. For example, following the BinningProcess design pattern, the user should provide it via the fit method. Storing dataframes is OKish if they are small, but we cannot assume that...

guillermo-navas-palencia commented on May 27, 2024

If you prefer, I can work on a proposal and we can discuss it by the end of the week. What do you think? It would then be much easier to iterate.

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

That would be great.
Looking forward to reading it.

Best regards,

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

I have been thinking about the scorecard class structure and functionalities. It is not implemented yet, but I thought it would be beneficial to have a custom logistic regression supporting bound and linear constraints to fulfill business requirements. Thus, I wrote a lightweight library implementing constrained logistic regression: https://github.com/guillermo-navas-palencia/clogistic.

Give it a try if you have time :)

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

I just added the first prototype with several scaling options. I think it is quite simple but powerful. The class parameters are self-explanatory, but I can provide you with more details tomorrow. An example of scaling_method and scaling_method_data is given below:

    scaling_method = "pdo_odds"
    scaling_method_data = {"pdo": 20, "odds": 50, "scorecard_points": 600}

Commit: e2dcc13
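For context, this is the textbook points-to-double-the-odds scaling that a pdo/odds parametrization usually corresponds to (shown here as the conventional formula, not necessarily the exact implementation in the prototype):

import numpy as np

pdo, odds, scorecard_points = 20, 50, 600

# "scorecard_points" points correspond to odds of "odds":1,
# and every additional "pdo" points double the odds.
factor = pdo / np.log(2)
offset = scorecard_points - factor * np.log(odds)

def score(log_odds):
    return offset + factor * log_odds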

Let me know if you try it!

Thanks

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

Just took a look at your code. Looks really interesting. I'm going to perform some tests tomorrow morning and I'll let you know.

Thanks again for the great work!

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

I have some questions about your ScoreCard implementation:

  1. I understood that you used the scikit-learn BaseEstimator class to construct your ScoreCard class. But what exactly is the advantage of it? The documentation shows only two methods, get_params and set_params, and I couldn't see where you are using them.

  2. To instantiate the ScoreCard you basically need to provide the name of your target column (target), a BinningProcess object (that generated the bins for the original dataframe) and an estimator. Is the estimator the model? What kinds of models can I use? Only LogisticRegression?

  3. About the min and max values for the ScoreCard, in which function/method can I define them? For example, a FICO ScoreCard goes from 300 to 850 points, so what is usually done is a rescaling from the predicted probabilities to this predefined range.
    The compute_scorecard_points function has some definitions for min and max, but it's not very clear to me.

Best regards,

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel.

  1. BaseEstimator has other private methods used in scikit-learn that might be needed in the future. Besides, get_params and set_params are very handy when performing hyperparameter optimization using hyperopt, rbfopt, and similar global solvers.
  2. Indeed, the estimator is a model. You can use any linear model for classification or regression, for example: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. Any estimator with the methods fit, predict (and predict_proba for classification) and the attributes coef_ and intercept_ is suitable.
  3. Two scaling methods are available; for min_max you might use the following:
scaling_method = "min_max"
scaling_method_data = {"min": 300, "max": 850}

The method min_max is a simple scaling to guarantee that the minimum and maximum scores are 300 and 850, respectively.
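In other words, min_max looks like a plain linear rescaling of the raw score onto the requested range (a sketch under that assumption, not the actual implementation):

def min_max_scale(raw_score, raw_min, raw_max, target_min=300, target_max=850):
    # Map [raw_min, raw_max] linearly onto [target_min, target_max], so the
    # lowest achievable score is 300 and the highest is 850.
    scale = (target_max - target_min) / (raw_max - raw_min)
    return target_min + scale * (raw_score - raw_min)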

guillermo-navas-palencia commented on May 27, 2024

Example binary target

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from optbinning import BinningProcess
from optbinning.scorecard.scorecard import Scorecard


data = load_breast_cancer()
variable_names = data.feature_names
target = "target"
df = pd.DataFrame(data.data, columns=variable_names)
df[target] = data.target

# Estimator
lr = LogisticRegression()

# Binning process
selection_criteria = {"iv": {"min": 0.025, "max": 1, "strategy": "highest", "top": 20}}
binning_process = BinningProcess(variable_names=variable_names, selection_criteria=selection_criteria)

# Scorecard
scaling_method = "min_max"
scaling_method_data = {"min": 300, "max": 850}
scorecard = Scorecard(target=target, binning_process=binning_process, estimator=lr,
                      scaling_method=scaling_method, scaling_method_data=scaling_method_data,
                      intercept_based=False, reverse_scorecard=False)
scorecard.fit(df)

# Check min_max points
sc = scorecard.table(style="detailed")
sc.groupby("Variable").agg({'Points' : [np.min, np.max]}).sum()

# Compute score
score = scorecard.score(df)

# Compute predicted probabilities
pred_proba = scorecard.predict_proba(df)

# Plots
score_good = score[df[target] == 0]
score_bad = score[df[target] == 1]

plt.hist(score_good, alpha=0.5, label="good")
plt.hist(score_bad, alpha=0.5, label="bad")
plt.legend()
plt.show()

plt.scatter(score, pred_proba[:, 1])

guillermo-navas-palencia commented on May 27, 2024

Example continuous target

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, HuberRegressor, Ridge
from optbinning import BinningProcess
from optbinning.scorecard.scorecard import Scorecard

data = load_boston()
variable_names = data.feature_names
target = "target"
df = pd.DataFrame(data.data, columns=variable_names)
df[target] = data.target

# Estimator
lr = Ridge()

# Binning process
binning_process = BinningProcess(variable_names=variable_names)

# Scorecard
scaling_method = "min_max"
scaling_method_data = {"min": 0, "max": 100}

scorecard = Scorecard(target=target, binning_process=binning_process, estimator=lr,
                      scaling_method=scaling_method, scaling_method_data=scaling_method_data,
                      intercept_based=False, reverse_scorecard=True)

scorecard.fit(df)

# Check min_max points
sc = scorecard.table(style="detailed")
sc.groupby("Variable").agg({'Points' : [np.min, np.max]}).sum()

# Compute score
score = scorecard.score(df)

# Compute predicted target
pred = scorecard.predict(df)

# Plots
plt.hist(score)
plt.show()

plt.scatter(score, pred)

GabrielSGoncalves commented on May 27, 2024

Hi Guillermo,

I just tested the code you provided and it works flawlessly! Congrats!

I have a few more questions:

  1. During the process of creating a traditional ScoreCard we usually perform the one-hot encoding and then train the model. What we get after that is a coefficient for each dummy category. In your example I'm getting the same coefficient value for all the categories (example below for mean smoothness). How is the LogisticRegression dealing with the binned categories? How am I supposed to interpret the coefficients?
    Bin id  Bin           Count  Count (%)  Non-event  Event  Event rate  WoE        IV         JS           Variable         Coefficient  Points
    0       [-inf, 0.08)  82     0.144112   4          78     0.95122     -2.44926   0.488921   0.0493247    mean smoothness  -0.728402    29.1772
    1       [0.08, 0.09)  110    0.193322   22         88     0.8         -0.865145  0.123478   0.0149707    mean smoothness  -0.728402    61.6618
    2       [0.09, 0.10)  159    0.279438   64         95     0.597484    0.126156   0.0045139  0.000563863  mean smoothness  -0.728402    81.9898
    3       [0.10, 0.11)  114    0.200351   55         59     0.517544    0.450945   0.0424645  0.00526355   mean smoothness  -0.728402    88.65
    4       [0.11, 0.12)  57     0.100176   34         23     0.403509    0.912016   0.0875094  0.0105747    mean smoothness  -0.728402    98.1049
    5       [0.12, inf)   47     0.0826011  33         14     0.297872    1.3786     0.160531   0.0186144    mean smoothness  -0.728402    107.673
    6       Special       0      0          0          0      0           0          0          0            mean smoothness  -0.728402    79.4028
    7       Missing       0      0          0          0      0           0          0          0            mean smoothness  -0.728402    79.4028
  2. Also, what's the definition of the category Special? Is it for new values that were not contemplated in the initial binning process but are not missing values?

guillermo-navas-palencia commented on May 27, 2024

Hi Gabriel,

  1. The Coefficient column is the linear model coefficient for a given variable, in this case "mean smoothness", whereas the Points column is the score, computed from WoE * coefficient. The Points column is the value assigned to each variable and bin. This is the usual approach implemented in SAS, MATLAB, and FICO software. (See the quick numeric check after this list.)

  2. Special is a category including special codes, i.e., values with a specific meaning.
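A quick numeric check on the table above (done by hand, not part of the library): within one variable the Points are an affine function of the WoE, and the bins with WoE = 0 (Special, Missing) receive exactly the baseline value.

import numpy as np

# WoE and Points for "mean smoothness", copied from the table above
woe = np.array([-2.44926, -0.865145, 0.126156, 0.450945, 0.912016, 1.3786])
points = np.array([29.1772, 61.6618, 81.9898, 88.65, 98.1049, 107.673])

slope, intercept = np.polyfit(woe, points, 1)
print(slope, intercept)  # ~20.5 and ~79.4, i.e. Points = 79.4 + 20.5 * WoE
# 79.4 is exactly the value of the Special/Missing bins (WoE = 0); the slope
# combines the -0.728402 coefficient with the min_max scaling factor.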
