rsyi / pylift
Uplift modeling and evaluation library. Actively maintained pypi version.
Home Page: https://docs.pylift.org/
License: BSD 2-Clause "Simplified" License
In pylift/docs/eda.rst there is a typo in the LaTeX. It should swap \elem for \in to get the symbol ∈ in the following line:
\text{WOE}_{ij} = \log \frac{P(X_j \elem B_i | Y = 1)}{P(X_j \elem B_i | Y = 0)}
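For reference, applying the substitution proposed above (and changing nothing else in the line) would give the following:

```latex
\text{WOE}_{ij} = \log \frac{P(X_j \in B_i | Y = 1)}{P(X_j \in B_i | Y = 0)}
```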
In the online docs on readthedocs, a few links are rendered incorrectly. On the Usage page, the links for introduction and quickstart are missing the .html extension.
There's an error in eda.rst: in the Weight of Evidence part, the mathematical symbol for "belongs to" is not rendered because of the incorrect command \elem. The correct command is \in.
Is there an easy way to convert the sklearn make_scorer metrics (e.g., _aqini_score, _qini_score, etc.) to eval functions that can be used with the 'eval_metric' argument in xgboost?
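One possible direction, sketched under assumptions: XGBoost's native training API accepts a custom metric callable of the form (preds, dtrain) -> (name, value), passed as feval or custom_metric depending on the XGBoost version. A generic adapter from a score_func(y_true, y_pred) to that shape could look like the snippet below. The names as_xgb_metric, toy_score, and FakeDMatrix are hypothetical and only serve the illustration. Note that the pylift scorers call self.untransform on the labels, so a real score_func would need access to the fitted model object; that part is not shown here.

```python
def as_xgb_metric(score_func, name):
    """Wrap score_func(y_true, y_pred) -> float as a (preds, dtrain) callable.

    The returned function has the (preds, dtrain) -> (name, value) shape
    used by XGBoost's native custom-metric hook.
    """
    def metric(preds, dtrain):
        y_true = dtrain.get_label()  # DMatrix exposes its labels this way
        return name, float(score_func(y_true, preds))
    return metric


# --- Illustration only: stand-ins so the sketch runs without xgboost ---
class FakeDMatrix:
    """Minimal stand-in for xgboost.DMatrix (hypothetical, for the demo)."""
    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


def toy_score(y_true, y_pred):
    """Hypothetical score: mean of label * prediction."""
    return sum(t * p for t, p in zip(y_true, y_pred)) / len(y_true)


metric = as_xgb_metric(toy_score, "toy_score")
result = metric([0.5, 0.2, 1.0], FakeDMatrix([1, 0, 1]))
print(result)
```

Because a qini-style score is better when higher, early stopping on such a metric would also need the maximize setting (or a negated value), depending on how training is configured.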
Hi @rsyi
First of all, thanks for a very good package.
I want to use your code on my data, and I have a few questions.
First, if I want to use XGBoost, where can I change the parameters? I am asking because my data is highly imbalanced, so I need to change the value of XGBoost's scale_pos_weight to balance it.
Second, I don't understand the difference between your plots of uplift gain and lift.
Could you please explain it briefly here?
Thanks in advance
Should all future issues and PRs be posted to this fork instead of wayfair/pylift?
Hello,
I've set up my model via TransformedOutcome and then wanted to check all the NIV features. However, when I use NIV (dict or plot), all the included features are empty. The rest of the steps work and I get plots further down the line, but the empty NIV leads me to believe they are incorrect.
Setting up the model
up = TransformedOutcome(df, col_treatment='Response', col_outcome='TotalRevenueFoodItems', random_state=4, stratify=y)
Call NIV
up.NIV_dict
Thank you.
Given 100K (N=1e5) samples with the following distribution:
Treatment = 98% (W=1)
Control = 2% (W=0)
Hence p = 0.98
The samples were balanced for response such that, for response (Y=1), the samples are split 1% (W=0, Y=1) Control vs 49% (W=1, Y=1) Treatment. Similarly, for no response (Y=0), the samples are split 1% (W=0, Y=0) Control vs 49% (W=1, Y=0) Treatment.
Based on this I would expect the two functions _get_counts and _get_tc_counts in pylift.eval to return the following values.
Nt1o1 = 49K, Nt0o1 = 1K, Nt1o0 = 49K, Nt0o0 = 1K
Nt1 = 98K, Nt0 = 2K, N = 1e5
However, the functions are returning the following values instead
Nt1o1 = 25K, Nt0o1 = 25K, Nt1o0 = 25K, Nt0o0 = 25K
Nt1 = 50K, Nt0 = 50K, N = 1e5
Could it be that the implemented logic, which is based on summing 1/p and 1/(1-p) values, needs to be modified?
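The arithmetic behind the two sets of numbers can be checked with a short sketch. It compares plain cell counts with sums of inverse-propensity weights (1/p for treated, 1/(1-p) for control). Rescaling the weighted sums so that they total N is an assumption made here to match the reported figures; whether pylift.eval normalizes in exactly this way is not confirmed by this thread.

```python
# Cell sizes keyed by (treatment W, outcome Y), as described above.
cells = {(1, 1): 49_000, (0, 1): 1_000, (1, 0): 49_000, (0, 0): 1_000}
N = sum(cells.values())
p = sum(n for (w, _), n in cells.items() if w == 1) / N  # treated fraction

# Raw counts: just the number of samples in each cell.
raw = {f"Nt{w}o{y}": n for (w, y), n in cells.items()}


def ipw(w):
    """Inverse-propensity weight for a sample in treatment group w."""
    return 1 / p if w == 1 else 1 / (1 - p)


# Weighted "counts": each sample contributes its weight instead of 1.
unscaled = {f"Nt{w}o{y}": n * ipw(w) for (w, y), n in cells.items()}

# Assumption: rescale so the weighted counts again sum to N.
scale = N / sum(unscaled.values())
weighted = {key: value * scale for key, value in unscaled.items()}

for key in raw:
    print(key, "raw:", raw[key], "weighted:", round(weighted[key]))
```

Under these assumptions the raw counts reproduce the expected 49K/1K/49K/1K split, while the weighted sums come out at roughly 25K per cell. This suggests the reported values are propensity-weighted quantities rather than literal sample counts.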
Should stratified cross-validation based on the Treatment vs Outcome 2x2 matrix split be used when performing a grid search, to ensure that each fold follows the same distribution as the overall data? If it is not used, and cross-validation is used for hyperparameter search, should we expect the scores used for evaluating each fold (the qini coefficient, for example) to represent the qini coefficient of the overall training dataset? Put differently, an increase in the number of folds may affect the stability of the uplift score in each fold.
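A minimal sketch of what stratifying on the 2x2 matrix could mean, assuming binary treatment and binary outcome: encode each (treatment, outcome) pair as one of four joint labels and spread each label evenly over the folds. The helper names joint_label and stratified_folds are hypothetical. This is a plain round-robin assignment for illustration and not pylift's implementation. With scikit-learn, the same joint label could be passed as y to StratifiedKFold.split.

```python
from collections import Counter, defaultdict


def joint_label(treatment, outcome):
    """Encode each (treatment, outcome) pair as a single label in {0, 1, 2, 3}."""
    return [2 * t + o for t, o in zip(treatment, outcome)]


def stratified_folds(labels, n_splits):
    """Assign a fold index to every sample, round-robin within each label."""
    seen = defaultdict(int)  # how many samples of each label were placed so far
    folds = []
    for label in labels:
        folds.append(seen[label] % n_splits)
        seen[label] += 1
    return folds


# Toy data: 8 treated and 4 control samples, outcomes alternating 1, 0.
treatment = [1] * 8 + [0] * 4
outcome = [1, 0] * 6
labels = joint_label(treatment, outcome)
folds = stratified_folds(labels, n_splits=2)

# Each fold should show the same mix of the four joint labels.
for k in range(2):
    print(k, Counter(lab for lab, f in zip(labels, folds) if f == k))
```

On real data the samples would normally be shuffled first so that fold membership does not depend on row order.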
Another closely related question, given the following two cross validation outputs, which one should we prefer? Higher mean score across folds, or uniform score across folds?
Number of cross folds: 4
Split_1_Score | Split_2_Score | Split_3_Score | Split_4_Score | Mean_Score
0.4 | 0.9 | -0.2 | -0.3 | 0.2
Number of cross folds: 2
Split_1_Score | Split_2_Score | Mean_Score
0.12 | 0.16 | 0.14
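To make the comparison concrete, the spread across folds can be put next to the mean. The sketch below uses only the scores quoted above and the sample standard deviation from Python's statistics module. It does not settle which configuration to prefer; it only quantifies the trade-off being asked about.

```python
from statistics import mean, stdev

# Per-fold scores copied from the two cross-validation outputs above.
runs = {
    "4 folds": [0.4, 0.9, -0.2, -0.3],
    "2 folds": [0.12, 0.16],
}

summary = {}
for name, scores in runs.items():
    # Sample standard deviation as a simple measure of fold-to-fold spread.
    summary[name] = (mean(scores), stdev(scores))
    print(f"{name}: mean={summary[name][0]:.3f}, stdev={summary[name][1]:.3f}")
```

The 4-fold run has the higher mean, but its spread is larger than the mean itself, whereas the 2-fold run has a lower mean with a much smaller spread.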
Q1)
Given that, for a continuous outcome, the theoretical max (i.e., q1_) and practical max (i.e., q2_) curves are not well defined and will not be correct, only the following six metrics can be used to evaluate the model. Is this correct?
Q2)
Based on line 205
score_name = 'q1_'+method
And the _score function in base.py
def _score(self, y_true, y_pred, method, plot_type, score_name):
    """ scoring function to be passed to make_scorer.
    """
    treatment_true, outcome_true, p = self.untransform(y_true)
    scores = get_scores(treatment_true, outcome_true, y_pred, p, scoring_range=(0, self.scoring_cutoff[method]), plot_type=plot_type)
    return scores[score_name]
three of the scoring methods that can be used for grid search ('q1_qini', 'q1_cgains', 'q1_aqini') should not be used with continuous variables. If this is indeed the case, I would suggest fixing it by using the continuous_outcome argument that is already available, perhaps replacing these with 'Q_' scores for continuous variables.
https://github.com/rsyi/pylift/blob/master/pylift/eval.py#L605
Dividing by an additional N does not affect the trend of the curve, but it makes the Adjusted Qini and Cgain curves very similar on data where the control and treatment groups are relatively evenly distributed (after sorting by uplift value). This makes the two curves difficult to distinguish and easy to misread.