Comments (21)
Hi Guillermo,
I just wanted to start by saying that I'm really excited about the idea of helping with the development of this module inside optbinning.
As I mentioned before in the other issue, I'm working on the development of a ScoreCard model, and the functionalities offered by optbinning are already extremely powerful. As I'm not an expert in ScoreCards, I'm trying to follow the traditional logic based on WoE and IV for feature selection.
What I have drafted so far:
- Define the variable dtype and fillna method and apply them to the dataframe;
- Use OptimalBinning to get categories for each variable;
- Verify the IV for each variable and keep the most relevant ones;
- Transform the dataframe into binning categories and perform one-hot encoding.
After these steps I would have a processed dataframe with each column corresponding to a binning category, ready for model training using sklearn LogisticRegression (or any other algorithm).
Here is what I have built so far.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
from optbinning import OptimalBinning


class OptScoreCard:
    """Create ScoreCards using optbinning.

    Args:
        df (pandas.DataFrame): Dataframe containing the features.
        dict_parameters (dict): Dictionary with name of column as key and
            tuple with value to replace and datatype as value.
        target_col (str): Name of the target binary column.

    Attributes:
        df (pandas.DataFrame): Stores input dataframe with the variables.
        target_array (numpy.array): Array containing the target variable.
        dict_parameters (dict): Dictionary containing each feature column of
            interest from the df as key, and a respective tuple with the
            fillna placeholder and dtype as values.
        df_bins (pandas.DataFrame): Dataframe with bin categories replacing
            original values.
        df_dummies (pandas.DataFrame): Dataframe with one-hot encoding applied
            from df_bins.
        dict_optbins (dict): Store OptimalBinning objects from each variable
            processed.
        dict_optbins_filtered (dict): Filtered variables from dict_optbins.
        reference_bin_categories (dict): Variable name as key and bin
            category for each selected variable.
    """

    def __init__(self, df, dict_parameters, target_col):
        self.df = df
        self.target_array = self.df[target_col].to_numpy()
        self.dict_parameters = dict_parameters
        self.df_bins = pd.DataFrame()
        self.df_dummies = pd.DataFrame()
        self.dict_optbins = {}
        self.dict_optbins_filtered = {}
        self.reference_bin_categories = {}
    def fillna(self, filter_cols=False):
        """
        Replace null values based on pre-defined parameters.

        Args:
            filter_cols (bool): Keep only the columns defined in
                dict_parameters in df.
        """
        dict_median = {
            k: v[0]
            for k, v in self.dict_parameters.items()
            if v[0] == "median"
        }
        dict_mean = {
            k: v[0] for k, v in self.dict_parameters.items() if v[0] == "mean"
        }
        dict_values = {
            k: v[0]
            for k, v in self.dict_parameters.items()
            if v[0] not in ["median", "mean"]
        }
        dict_datatype = {
            k: v[1]
            for k, v in self.dict_parameters.items()
            if v[1] is not None
        }
        # Replace the "median"/"mean" placeholders with the actual statistics
        dict_median = (
            self.df[list(dict_median.keys())].median().round(2).to_dict()
        )
        dict_mean = self.df[list(dict_mean.keys())].mean().round(2).to_dict()
        # Fillna
        dict_fillna = {**dict_median, **dict_mean, **dict_values}
        self.df.fillna(dict_fillna, inplace=True)
        # Convert datatype
        self.df = self.df.astype(dict_datatype)
        # Optionally keep only the columns listed in dict_parameters
        if filter_cols:
            self.df = self.df[list(dict_fillna.keys())]
    def get_optimal_bins(self, solver="cp"):
        """Perform optimal binning in each column of the dataframe.

        Args:
            solver (str, default='cp'): Type of algorithm to solve the binning
                task.
        """
        for k, v in self.dict_parameters.items():
            if v[1] in ("category", "str"):
                dtype = "categorical"
            elif v[1] in ("float", "int"):
                dtype = "numerical"
            else:
                # Unsupported dtype: no OptimalBinning object is fitted
                self.dict_optbins[k] = [None, 0, 'dtype_error']
                continue
            x = self.df[k].to_numpy()
            optb = OptimalBinning(name=k, dtype=dtype, solver=solver)
            optb.fit(x, self.target_array)
            # Build the binning table once and reuse it below
            table = optb.binning_table.build()
            self.dict_optbins[k] = [
                optb,
                table.at["Totals", "IV"],
                # Map each WoE value to its bin label, skipping the
                # totals, special and missing rows
                {
                    woe: str(bin_label)
                    for woe, bin_label in zip(table.WoE, table.Bin)
                    if str(bin_label) not in ("", "Special", "Missing")
                },
            ]
    def show_iv_table(self):
        """Show a dataframe with variables and IV in descending order."""
        return pd.DataFrame.from_dict(
            self.dict_optbins,
            orient="index",
            columns=["OptimalBinning_obj", "IV", "WoE-index"],
        )[["IV"]].sort_values(by="IV", ascending=False)

    def filter_features_by_IV(self, by='best', threshold=50):
        """Select the features based on the IV score.

        Need to implement other ways of filtering (eg: Percentage)
        """
        # Keep the `threshold` variables with the highest IV
        selected = self.show_iv_table().IV[:threshold].index
        self.dict_optbins_filtered = {
            k: v for k, v in self.dict_optbins.items() if k in selected
        }
        self.df = self.df[list(self.dict_optbins_filtered.keys())]
    def transform_into_bin_categories(self):
        """
        Transform values into binning categories based on OptBinning.
        """
        for k, v in self.dict_optbins_filtered.items():
            self.df_bins[k] = pd.Series(
                self.dict_optbins_filtered.get(k)[0].transform(
                    self.df[k], metric="bins"
                )
            )

    def create_dummies_from_bins(self):
        """
        Create a One-Hot encoded dataframe from df_bins.
        """
        for k, v in self.dict_optbins_filtered.items():
            self.df_dummies = pd.concat(
                [
                    self.df_dummies,
                    pd.get_dummies(
                        self.df_bins[k],
                        prefix=k,
                        prefix_sep=":",
                        columns=list(
                            self.dict_optbins_filtered[k][2].values()
                        ),
                    ),
                ],
                axis=1,
            )

    def get_reference_bin_categories(self):
        """
        Define reference category for binned variables.

        The reference categories are selected by the lowest WoE of the bins
        of a variable.
        """
        self.reference_bin_categories = {
            k: f'{k}:'
            + str(
                self.dict_optbins_filtered[k][2].get(
                    min(self.dict_optbins_filtered[k][2].keys())
                )
            )
            for k in self.dict_optbins_filtered.keys()
        }
The template for dict_parameters would be something like:
dict_parameters = {
    'var1': ["ind", "category"],
    'var2': ["median", "int"],
    'var3': ["mean", "float"],
    'var4': [0, "bool"]
}
As you can see, there is a lot of room for improvement and optimization.
Let me know what you think about it.
Att
Gabriel
from optbinning.
Hi Gabriel,
Thank you for the code and the effort!! :)
I still have to take a closer look, but here are some points:
- I would definitely make use of the class BinningProcess, which already implements what get_optimal_bins and transform_into_bin_categories do; it's best to reuse code. Note that this class also allows selecting variables based on the IV or Gini statistic.
- I would rename the method show_iv_table as table to be more consistent with other classes.
- The scorecard is usually part of the credit risk modeling cycle, but I am tempted to generalize it as much as possible, so it could be used "somehow" for other applications, and even for continuous targets, not just binary ones. For example, could we compute scorecard points using other methods besides logistic regression coefficients? We could initially focus on the binary target case, but I think we should keep this generalization in mind.
See the BinningProcess API and tutorials. Note that the API has slightly changed after 4d54535.
from optbinning.
BTW: I think you are right about the lack of filtering strategies; they should be implemented in BinningProcess as well. The top % or top x are good additions.
from optbinning.
Hi Gabriel,
I have been thinking about several improvements to BinningProcess required for building scorecards and enhancing the variable selection criteria. I will replace the current parameters:
min_iv : float or None, optional (default=None)
    The minimum information value. Applicable if target type is binary.
max_iv : float or None, optional (default=None)
    The maximum information value. Applicable if target type is binary.
min_js : float or None, optional (default=None)
    The minimum Jensen-Shannon divergence value. Applicable if target type
    is binary or multiclass.
max_js : float or None, optional (default=None)
    The maximum Jensen-Shannon divergence value. Applicable if target type
    is binary or multiclass.
quality_score_cutoff : float or None, optional (default=None)
    The quality score cutoff value. Applicable if target type is binary or
    multiclass.
by a more general approach: a parameter selection_criteria, a dictionary with the following structure:
{"iv": {"min_value": 0.02, "max_value": 0.5, "strategy": "best", "top": 20}}
In this particular case: top 20 IV variables with IV in [0.02, 0.5] will be selected. Several metrics could be combined, for example:
{"iv": {"min_value": 0.02, "max_value": 0.5, "strategy": "best", "top": 20},
"quality_score": {"min_value": 0.001}}
- Values for key strategy are "best" and "worst".
- Values for key top: if a decimal value is provided, it is read as a percentage; for example, strategy="best" and top=0.25 means the 25% best variables. If an integer is provided, it is read as a count; for example, 25 means the 25 best variables.
Finally, by default selection_criteria is set to None.
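A minimal sketch of how a single entry of such a selection_criteria dictionary could be applied to a set of precomputed metric values. The function name select_variables and the example IV values are hypothetical, and reading a decimal top as a fraction of the variables that pass the min/max filter (rather than of all variables) is an assumption, not something the proposal specifies.

```python
import math


def select_variables(metric_by_variable, min_value=None, max_value=None,
                     strategy="best", top=None):
    """Hypothetical helper applying one selection criterion to a metric."""
    # Keep only variables whose metric lies inside [min_value, max_value].
    candidates = {
        name: value
        for name, value in metric_by_variable.items()
        if (min_value is None or value >= min_value)
        and (max_value is None or value <= max_value)
    }
    # "best" ranks from highest to lowest metric, "worst" the other way round.
    ranked = sorted(candidates, key=candidates.get, reverse=(strategy == "best"))
    if top is None:
        return ranked
    # Assumption: a float is a fraction of the candidates, an int is a count.
    n_keep = math.ceil(top * len(ranked)) if isinstance(top, float) else top
    return ranked[:n_keep]


# Made-up IV values for illustration only.
iv = {"age": 0.31, "income": 0.12, "zip": 0.01, "balance": 0.74, "tenure": 0.05}
selected = select_variables(iv, min_value=0.02, max_value=0.5,
                            strategy="best", top=2)
print(selected)
```

Here "zip" and "balance" fall outside [0.02, 0.5], so only the two highest-IV variables among the remaining three are kept.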
Does it make sense?
Thanks!
from optbinning.
In addition, the method BinningProcess.transform() is a bit confusing. After transformation, the resulting dataset (numpy or pandas, depending on the input type) should include only the selected features, following scikit-learn style: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.transform
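For reference, this is the scikit-learn behaviour being pointed to, shown with RFE on synthetic data: transform returns only the columns that were selected during fit. The dataset and the number of features to select are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 4 are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

# transform() drops every column that was not selected during fit().
X_selected = selector.transform(X)
print(X.shape, X_selected.shape)
print(selector.get_support())  # boolean mask over the original columns
```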
from optbinning.
Hi Guillermo,
I forgot to tell you why I decided to implement the get_optimal_bins method yesterday.
I was trying to use the BinningProcess class, but even when I defined a few categorical_variables alongside some numerical variables, it was treating all my variables as category.
The selection_criteria would help the user a lot, and the way you defined it looks easy to use.
It would also be a good idea to give the user access, inside this object, to the preprocessed dataframe and to the filtered and transformed dataframe.
I'm still trying to figure out the best way to organize the one-hot encoded dataframe, with and without the reference categories for each variable. I'll send you my solution for it next week.
If you need any help, please let me know.
Att
from optbinning.
Hi Gabriel,
The problem you mentioned is caused by pandas when converting a dataframe with mixed column types to numpy. This is a known problem; see these links:
- https://pandas.pydata.org/pandas-docs/version/0.24.0/reference/api/pandas.DataFrame.values.html
- https://pandas.pydata.org/pandas-docs/version/0.24.0/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy
Note, however, that version 0.4.0 introduced the option to pass a pandas dataframe as input. By doing so, you should not run into such problems. In summary, numpy arrays should be used as X input only if all columns have the same data type. See: http://gnpalencia.org/optbinning/tutorials/tutorial_binning_process_telco_churn.html
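A small illustration of the pandas behaviour described above, using a made-up three-column dataframe: once a string column is present, to_numpy() falls back to the object dtype for the whole array, so the numerical columns can no longer be recognised by their dtype.

```python
import pandas as pd

# Made-up data with two numerical columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [1800.0, 3200.5, 2500.0],
    "segment": ["a", "b", "a"],
})

# Column dtypes are preserved inside the dataframe.
print(df.dtypes)

# A single numpy array needs one common dtype, which here is object.
X = df.to_numpy()
print(X.dtype)

# With numerical columns only, the common dtype stays numerical.
numeric_only = df[["age", "income"]].to_numpy()
print(numeric_only.dtype)
```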
In the meantime, I am working on the new selection_criteria.
Guillermo
from optbinning.
It is not a good idea to store a dataframe within the class using self.df = df, because this could lead to memory issues if several BinningProcess objects are instantiated.
from optbinning.
Hi Guillermo,
Now I understand the inconsistencies between the datatypes and the Numpy array.
About storing a dataframe as an attribute of a class, what would be a more efficient way to do it?
Att
from optbinning.
I would not store any dataframe inside the class, just use it when needed. For example, following the BinningProcess design patterns, the user should provide it via the fit method. Storing dataframes is OKish if they are small, but we cannot assume that...
from optbinning.
If you prefer, I can work on a proposal and we can discuss it by the end of the week. What do you think? It would then be much easier to iterate.
from optbinning.
Hi Guillermo,
That would be great.
Looking forward to reading it.
Att
from optbinning.
Hi Gabriel,
I have been thinking about the scorecard class structure and functionalities. It is not implemented yet, but I think it would be beneficial to have a custom logistic regression supporting bound and linear constraints to fulfill business requirements. Thus, I wrote a lightweight library implementing constrained logistic regression: https://github.com/guillermo-navas-palencia/clogistic.
Give it a try if you have time :)
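To illustrate what a bound constraint means for logistic regression, here is a minimal sketch that does not use clogistic (its API is not shown in this thread). It minimises the unregularised logistic loss with scipy's L-BFGS-B solver under box constraints; the function name fit_bounded_logistic and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize


def fit_bounded_logistic(X, y, bounds):
    """Hypothetical helper: logistic regression with box constraints.

    bounds has one (low, high) pair per parameter, intercept first.
    """
    X1 = np.column_stack([np.ones(len(X)), X])

    def neg_log_likelihood(w):
        z = X1 @ w
        # log(1 + exp(z)) evaluated stably, minus the y * z term.
        return np.sum(np.logaddexp(0.0, z) - y * z)

    def gradient(w):
        p = 1.0 / (1.0 + np.exp(-(X1 @ w)))
        return X1.T @ (p - y)

    result = minimize(neg_log_likelihood, np.zeros(X1.shape[1]), jac=gradient,
                      method="L-BFGS-B", bounds=bounds)
    return result.x


# Synthetic data: the second feature truly has a negative effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_logit = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Intercept is free; both coefficients are forced to be non-negative.
w = fit_bounded_logistic(X, y, bounds=[(None, None), (0.0, None), (0.0, None)])
print(w)
```

Because the second feature's unconstrained coefficient would be negative, the non-negativity bound becomes active and that coefficient is pinned at the bound. This is the kind of business requirement (for example, a fixed sign per variable) that motivates a constrained solver.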
from optbinning.
Hi Gabriel,
I just added the first prototype with several scaling options. I think it is quite simple but powerful. The class parameters are self-explanatory, but I can provide you with more details tomorrow. An example of scaling_method and scaling_method_data is given below:
scaling_method = "pdo_odds"
scaling_method_data = {"pdo": 20, "odds": 50, "scorecard_points": 600}
Commit: e2dcc13
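For readers unfamiliar with these three values, the sketch below uses the textbook "points to double the odds" convention: score = offset + factor * ln(odds), with factor = pdo / ln(2) and the offset chosen so that the given odds map to scorecard_points. That the prototype in this commit follows exactly this convention is an assumption, and the helper names are hypothetical.

```python
import math


def pdo_odds_scaling(pdo, odds, scorecard_points):
    """Hypothetical helper returning (offset, factor) for PDO scaling."""
    # Adding `pdo` points must correspond to doubling the odds.
    factor = pdo / math.log(2)
    # The reference odds must map exactly to the reference score.
    offset = scorecard_points - factor * math.log(odds)
    return offset, factor


offset, factor = pdo_odds_scaling(pdo=20, odds=50, scorecard_points=600)


def score_from_odds(odds):
    return offset + factor * math.log(odds)


print(round(score_from_odds(50), 6))   # reference odds give the reference score
print(round(score_from_odds(100), 6))  # doubled odds give 20 more points
```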
Let me know if you try it!
Thanks
from optbinning.
Hi Guillermo,
Just took a look at your code. Looks really interesting. I'm going to perform some tests tomorrow morning and I'll let you know.
Thanks again for the great work!
from optbinning.
Hi Guillermo,
I have some questions about your ScoreCard implementation:
- I understood that you used the scikit-learn BaseEstimator class to construct your ScoreCard class. But what exactly is the advantage of it? The documentation shows only two methods, get_params and set_params, and I couldn't find where you are using them.
- To instantiate the ScoreCard you basically need to provide the name of your target column (target), a BinningProcess object (that generated the bins for the original dataframe) and an estimator. Is the estimator the model? What kinds of models can I use? Only LogisticRegression?
- About the min and max values for the ScoreCard, in which function/method can I define them? For example, a FICO ScoreCard goes from 300 to 850 points, so what is usually done is a rescaling from the predicted probabilities to this predefined range. The compute_scorecard_points function has some definitions for min and max, but it's not very clear to me.
Att
from optbinning.
Hi Gabriel.
- BaseEstimator has other private methods used in scikit-learn that might be needed in the future. Besides, get_params and set_params are very handy when performing hyperparameter optimization using hyperopt, rbfopt and similar global solvers.
- Indeed, the estimator is a model. You can use any linear model for classification or regression, for example: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. Any estimator with methods fit, predict (and predict_proba for classification) and attributes coef_ and/or intercept_ is suitable.
- Two scaling methods are available. For min_max you might use the following:
scaling_method = "min_max"
scaling_method_data = {"min": 300, "max": 850}
The method min_max is a simple scaling that guarantees that the minimum and maximum scores are 300 and 850, respectively.
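One way such a min_max scaling can work is sketched below. It assumes an additive scorecard in which the total score is the sum of one bin's points per variable, and applies a single linear map so that the sum of per-variable minima lands on "min" and the sum of per-variable maxima on "max". The function rescale_points and the raw point values are hypothetical, and this is not necessarily the exact formula used in the prototype.

```python
def rescale_points(raw_points, min_score, max_score):
    """Hypothetical helper; raw_points maps variable -> raw points per bin."""
    n_variables = len(raw_points)
    # Lowest and highest total score reachable with the raw points.
    raw_min = sum(min(points) for points in raw_points.values())
    raw_max = sum(max(points) for points in raw_points.values())
    slope = (max_score - min_score) / (raw_max - raw_min)
    # The constant shift is spread evenly over the variables.
    shift = (min_score - slope * raw_min) / n_variables
    return {
        variable: [slope * p + shift for p in points]
        for variable, points in raw_points.items()
    }


# Made-up raw points (e.g. WoE * coefficient) for two variables.
raw = {"age": [-1.2, 0.1, 0.9], "income": [-0.4, 0.0, 1.5]}
scaled = rescale_points(raw, min_score=300, max_score=850)

lowest = sum(min(points) for points in scaled.values())
highest = sum(max(points) for points in scaled.values())
print(lowest, highest)
```

This is also why the check in the examples sums the per-variable minimum and maximum points: those two sums should reproduce the requested range.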
from optbinning.
Example binary target
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from optbinning import BinningProcess
from optbinning.scorecard.scorecard import Scorecard
data = load_breast_cancer()
variable_names = data.feature_names
target = "target"
df = pd.DataFrame(data.data, columns=variable_names)
df[target] = data.target
# Estimator
lr = LogisticRegression()
# Binning process
selection_criteria = {"iv": {"min": 0.025, "max": 1, "strategy": "highest", "top": 20}}
binning_process = BinningProcess(variable_names=variable_names, selection_criteria=selection_criteria)
# Scorecard
scaling_method = "min_max"
scaling_method_data = {"min": 300, "max": 850}
scorecard = Scorecard(target=target, binning_process=binning_process, estimator=lr,
                      scaling_method=scaling_method, scaling_method_data=scaling_method_data,
                      intercept_based=False, reverse_scorecard=False)
scorecard.fit(df)
# Check min_max points
sc = scorecard.table(style="detailed")
sc.groupby("Variable").agg({'Points' : [np.min, np.max]}).sum()
# Compute score
score = scorecard.score(df)
# Compute predicted probabilities
pred_proba = scorecard.predict_proba(df)
# Plots
score_good = score[df[target] == 0]
score_bad = score[df[target] == 1]
plt.hist(score_good, alpha=0.5, label="good")
plt.hist(score_bad, alpha=0.5, label="bad")
plt.legend()
plt.show()
plt.scatter(score, pred_proba[:, 1])
from optbinning.
Example continuous target
import matplotlib.pyplot as plt
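import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston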
from sklearn.linear_model import LinearRegression, HuberRegressor, Ridge
from optbinning import BinningProcess
from optbinning.scorecard.scorecard import Scorecard
data = load_boston()
variable_names = data.feature_names
target = "target"
df = pd.DataFrame(data.data, columns=variable_names)
df[target] = data.target
# Estimator
lr = Ridge()
# Binning process
binning_process = BinningProcess(variable_names=variable_names)
# Scorecard
scaling_method = "min_max"
scaling_method_data = {"min": 0, "max": 100}
scorecard = Scorecard(target=target, binning_process=binning_process, estimator=lr,
                      scaling_method=scaling_method, scaling_method_data=scaling_method_data,
                      intercept_based=False, reverse_scorecard=True)
scorecard.fit(df)
# Check min_max points
sc = scorecard.table(style="detailed")
sc.groupby("Variable").agg({'Points' : [np.min, np.max]}).sum()
# Compute score
score = scorecard.score(df)
# Compute predicted target
pred = scorecard.predict(df)
# Plots
plt.hist(score)
plt.show()
plt.scatter(score, pred)
from optbinning.
Hi Guillermo,
I just tested the code you provided and it works flawlessly! Congrats!
I have a few more questions:
- During the process of creating a traditional ScoreCard we usually perform the one-hot encoding and then train the model. What we get after that is a coefficient for each dummy category. In your example I'm getting the same coefficient value for all the categories (example below for mean smoothness). How is LogisticRegression dealing with the binned categories? How am I supposed to interpret the coefficients?
Bin id | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS | Variable | Coefficient | Points
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | [-inf, 0.08) | 82 | 0.144112 | 4 | 78 | 0.95122 | -2.44926 | 0.488921 | 0.0493247 | mean smoothness | -0.728402 | 29.1772
1 | [0.08, 0.09) | 110 | 0.193322 | 22 | 88 | 0.8 | -0.865145 | 0.123478 | 0.0149707 | mean smoothness | -0.728402 | 61.6618
2 | [0.09, 0.10) | 159 | 0.279438 | 64 | 95 | 0.597484 | 0.126156 | 0.0045139 | 0.000563863 | mean smoothness | -0.728402 | 81.9898
3 | [0.10, 0.11) | 114 | 0.200351 | 55 | 59 | 0.517544 | 0.450945 | 0.0424645 | 0.00526355 | mean smoothness | -0.728402 | 88.65
4 | [0.11, 0.12) | 57 | 0.100176 | 34 | 23 | 0.403509 | 0.912016 | 0.0875094 | 0.0105747 | mean smoothness | -0.728402 | 98.1049
5 | [0.12, inf) | 47 | 0.0826011 | 33 | 14 | 0.297872 | 1.3786 | 0.160531 | 0.0186144 | mean smoothness | -0.728402 | 107.673
6 | Special | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | mean smoothness | -0.728402 | 79.4028
7 | Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | mean smoothness | -0.728402 | 79.4028
- Also, what's the definition of the category Special? Is it for new values that were not contemplated in the initial binning process but are not missing values?
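As a worked check, the WoE and IV columns of the table above can be reproduced from the Non-event and Event counts alone. The sketch assumes the convention WoE = ln(share of non-events in the bin / share of events in the bin) and IV = (difference of the two shares) * WoE, which is the convention that matches the numbers in the table.

```python
import math

# Non-event and Event counts for bins 0 to 5 of "mean smoothness" (from the table).
non_event = [4, 22, 64, 55, 34, 33]
event = [78, 88, 95, 59, 23, 14]

total_non_event = sum(non_event)
total_event = sum(event)

woe, iv = [], []
for ne, e in zip(non_event, event):
    share_non_event = ne / total_non_event
    share_event = e / total_event
    w = math.log(share_non_event / share_event)
    woe.append(w)
    iv.append((share_non_event - share_event) * w)

for bin_id, (w, i) in enumerate(zip(woe, iv)):
    print(bin_id, round(w, 5), round(i, 6))
print("Total IV:", round(sum(iv), 6))
```

The total IV of a variable is the sum of the per-bin IV values, which is the quantity used for ranking variables in the filtering step.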
from optbinning.
Hi Gabriel,
- The Coefficient column is the linear model coefficient for a given variable, in this case "mean smoothness", whereas the Points column is the score/point, computed as WoE * coefficient. The Points column is the value assigned to each variable and bin. This is the usual approach implemented in SAS, MATLAB, and FICO software.
- Special is a category including special codes, i.e., values with a specific meaning. See:
  - http://gnpalencia.org/optbinning/tutorials/tutorial_binary.html#Missing-data-and-special-codes
  - https://documentation.sas.com/?docsetId=emref&docsetTarget=p1qzwz7onopjqcn11uc04i18urg7.htm&docsetVersion=14.3&locale=en#p101vz9f11cdz4n19mdj3i56gl29
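A quick numerical check on the table from the question: with a single coefficient per variable, the Points column should be a straight-line function of WoE. This is consistent with WoE * coefficient followed by a linear scaling step such as min_max, although the exact scaling constants are not stated in the thread. The sketch fits a line to the (WoE, Points) pairs and reports how far the table deviates from it.

```python
import numpy as np

# WoE and Points columns for "mean smoothness", copied from the table.
woe = np.array([-2.44926, -0.865145, 0.126156, 0.450945,
                0.912016, 1.3786, 0.0, 0.0])
points = np.array([29.1772, 61.6618, 81.9898, 88.65,
                   98.1049, 107.673, 79.4028, 79.4028])

# Least-squares straight line: points ~ slope * woe + intercept.
slope, intercept = np.polyfit(woe, points, deg=1)
residual = np.max(np.abs(slope * woe + intercept - points))

print("slope:", round(float(slope), 3))
print("intercept:", round(float(intercept), 3))
print("largest deviation from the line:", round(float(residual), 4))
```

The intercept is the number of points given to a bin with WoE equal to 0, which is why the Special and Missing rows share the same value in the table.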
from optbinning.