xgboost_notebook's Issues
Understanding and applying KS and AUC
KS
To understand the relationship between the KS value and the AUC value, you first need to understand how the ROC curve and the KS curve are drawn. In a sense they are the same thing, just with different choices of axes. Take logistic regression as an example: after training, every sample gets a predicted class probability. Sort the samples by this probability and split them into 10 equal buckets; for each bucket compute its true positive rate and false positive rate, then accumulate them across buckets. Plotting the cumulative true positive rate against the cumulative false positive rate gives the ROC curve; plotting the 10 buckets on the x-axis and the cumulative true positive rate and cumulative false positive rate as two separate curves on the y-axis gives the KS curve.
The AUC is the area under the ROC curve, while the KS value is the maximum vertical gap between the two curves in the KS chart. Because KS pinpoints the single segment where the model separates the classes best, it is well suited for choosing a cut-off, e.g., for scorecards. However, KS only reflects the most discriminative segment rather than the overall performance across all segments; for an overall view, AUC is the better metric.
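A minimal sketch of computing AUC and KS from predicted probabilities with scikit-learn (the arrays y_true and y_score below are made-up examples):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Assumed inputs: y_true holds 0/1 labels, y_score holds predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

auc = roc_auc_score(y_true, y_score)  # area under the ROC curve

# KS: the maximum gap between cumulative TPR and cumulative FPR.
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")
```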
imbalanced data
learning_rate setting principles
This Google paper offers a possible explanation for the phenomenon: an appropriately sized, rather than an overly small, learning rate introduces an implicit gradient-penalty term into the optimization, which helps it converge to flatter regions. I think the analysis in the paper is worth studying.
References:
Imbalanced y: SMOTE oversampling
install imblearn
conda install -c conda-forge imbalanced-learn
https://imbalanced-learn.readthedocs.io/en/stable/install.html
import
from imblearn.over_sampling import SMOTE
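A minimal SMOTE usage sketch on a toy imbalanced dataset (the synthetic data below is only for illustration):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy imbalanced dataset (roughly 9:1 class ratio) just for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("after: ", Counter(y_resampled))
```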
Parameter Tuning in XGBoost
anaconda python install settings
Anaconda installation
Large gap in KS and AUC between INN and OOT
Model evaluation: Accuracy, Precision, Recall & F1 Score
TP, TN, FP, FN
- True Positives (TP) - correctly predicted positive values: the actual class is yes and the predicted class is also yes.
- True Negatives (TN) - correctly predicted negative values: the actual class is no and the predicted class is also no.
- False Positives (FP) - the actual class is no but the predicted class is yes.
- False Negatives (FN) - the actual class is yes but the predicted class is no.
- From the above, "positive" and "negative" refer to the predicted label being 1 or 0 (or 1 or -1), respectively.
Accuracy, Precision, Recall and F1 score.
Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations.
- Accuracy is a good measure, but only for symmetric datasets where the numbers of false positives and false negatives are roughly the same.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
- High precision relates to a low false positive rate.
Precision = TP/(TP+FP)
Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. A recall above 0.5 is generally considered good.
- High recall relates to a low false negative rate.
Recall = TP/(TP+FN)
F1 score - F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
- Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
*Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
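A minimal sketch of computing these metrics with scikit-learn (the arrays y_true and y_pred below are made-up examples):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumed inputs: hard 0/1 predictions compared against the true labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```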
Conclusion:
Data | Metric
---|---
Balanced data | Accuracy
Imbalanced data | F1 score
References:
Feature Engineering
Basic modeling framework
Modeling steps (by workflow)
Sample definition
- Basic requirement: keep the training sample as consistent as possible with the population the model will score online
- Stratified sampling: sample proportionally from different strata (sub-populations with large feature differences), e.g., splitting new vs. existing customers
- Prediction target: the definition of y; distinguish whether it is time-dependent and specify the time window explicitly
Feature engineering: RFM (example: post-loan delinquency performance)
Basic structure
Break the big framework into pieces, then fill in the small details
Exploratory data analysis (EDA)
- EDD
- Missing rate
- Duplicate values
- Constant (single-value) columns
- Other quality checks (see the pandas sketch below)
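A minimal pandas sketch of these quality checks; the toy DataFrame df is an assumption standing in for the real modeling data:

```python
import pandas as pd

# Assumed input: df is the raw modeling DataFrame (toy data here).
df = pd.DataFrame({"age": [25, 30, None, 30], "flag": [1, 1, 1, 1]})

missing_rate = df.isnull().mean()       # missing rate per column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

print(missing_rate, duplicate_rows, constant_cols, sep="\n")
```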
Feature selection
Univariate analysis: IV (WOE)
- Usually cut into 5 bins, at most 10 (see the WOE/IV sketch below)
- Combine with business knowledge to improve variable interpretability (is time measured in days more interpretable?)
- Transform features to be closer to linear
- e.g., for age, derive a feature such as log(age/10)
- If two variables have a correlation close to 1, one of them can be dropped (tree models handle collinear variables themselves, so no special treatment is needed)
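A minimal sketch of WOE/IV for one numeric variable cut into 5 quantile bins; the data, the feature x, and the 0/1 label y are synthetic assumptions:

```python
import numpy as np
import pandas as pd

# Assumed inputs: a DataFrame with a numeric feature x and a 0/1 label y (synthetic here).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000)})
df["y"] = (df["x"] + rng.normal(size=1000) > 0.5).astype(int)

df["bin"] = pd.qcut(df["x"], q=5, duplicates="drop")  # usually 5 bins, at most 10

grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
grouped["bad"] = grouped["sum"]
grouped["good"] = grouped["count"] - grouped["sum"]

bad_dist = grouped["bad"] / grouped["bad"].sum()
good_dist = grouped["good"] / grouped["good"].sum()

woe = np.log(good_dist / bad_dist)          # WOE per bin
iv = ((good_dist - bad_dist) * woe).sum()   # total IV for this variable
print(woe, f"IV = {iv:.3f}", sep="\n")
```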
shap
STEP-WISE
- Start by putting one variable into the model
- Add variables by IV (with a threshold), also checking correlation with variables already in the model: the lower the correlation, the larger the expected lift (see the sketch below)
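A rough sketch of this selection loop; the IV values and features below are made up for illustration, and in practice you would also re-check model lift after each addition:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: precomputed IV per feature and the candidate feature frame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
df["f3"] = df["f1"] * 0.95 + rng.normal(scale=0.1, size=500)  # f3 nearly collinear with f1
iv_by_feature = {"f1": 0.45, "f2": 0.32, "f3": 0.30, "f4": 0.05}

IV_MIN = 0.1    # IV threshold for a feature to be considered
CORR_MAX = 0.7  # skip features highly correlated with those already selected

selected = []
for feat, iv in sorted(iv_by_feature.items(), key=lambda kv: kv[1], reverse=True):
    if iv < IV_MIN:
        continue
    if any(abs(df[feat].corr(df[s])) > CORR_MAX for s in selected):
        continue
    selected.append(feat)

print(selected)  # expected: ['f1', 'f2'] since f3 is nearly collinear with f1
```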
Model training
- IS (in-sample) training set: used to train the model
- OOS (out-of-sample) validation set: used for evaluating and tuning the model; feeds back into parameter tuning on the training set
- OOT (out-of-time) test set: used only for evaluation, to assess how well the model generalizes when deployed (see the split sketch below)
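A minimal sketch of how such a split might be built; the DataFrame, the date column apply_date, and the cut-off date are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: df with a date column used to carve out the OOT window.
df = pd.DataFrame({
    "apply_date": pd.date_range("2023-01-01", periods=12, freq="MS").repeat(50),
    "x": range(600),
    "y": [0, 1] * 300,
})

oot = df[df["apply_date"] >= "2023-10-01"]  # most recent months held out as OOT
dev = df[df["apply_date"] < "2023-10-01"]   # development sample

train, oos = train_test_split(dev, test_size=0.3, random_state=42)  # IS / OOS
```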
Evaluation
- Accuracy
- Precision and Recall
- AUC
- Overfitting
- Underfitting
- Parameter selection
objective: 'rank:pairwise' explained
predict()
Actually, in the Learning to Rank field, we are trying to predict a relative score for each document with respect to a specific query. That is, this is not a regression problem or a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to the query when compared to other documents with positive scores.
https://www.kaggle.com/c/expedia-hotel-recommendations/discussion/20225
https://blog.csdn.net/weixin_42001089/article/details/84146238
xgboost.predict() can output negative values. Based on the explanation above, the magnitude (including the sign) of the predicted score reflects how strongly the sample is associated with the positive class, so to obtain something like a probability of Y=1 you can normalize y_predict (see the sketch below).
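One simple way to map raw scores into (0, 1) is a sigmoid or min-max scaling; note these are monotone transforms that preserve the ranking but are not calibrated probabilities. A minimal sketch (scores is an assumed array of raw predict() outputs):

```python
import numpy as np

# Assumed input: raw margin scores returned by predict() under rank:pairwise.
scores = np.array([-1.2, -0.3, 0.1, 0.8, 2.4])

sigmoid = 1.0 / (1.0 + np.exp(-scores))                            # squashes into (0, 1)
minmax = (scores - scores.min()) / (scores.max() - scores.min())   # rescales to [0, 1]

print(sigmoid, minmax, sep="\n")
```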
install bayesian-optimization anaconda
$ conda install -c conda-forge bayesian-optimization
Reference: https://github.com/fmfn/BayesianOptimization
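A minimal usage sketch of this package for tuning a couple of XGBoost parameters via cross-validation; the objective function, bounds, and toy data below are illustrative assumptions:

```python
import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification

# Toy data just for illustration.
X, y = make_classification(n_samples=500, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

def xgb_cv_auc(max_depth, learning_rate):
    """Return mean CV AUC for the given hyperparameters (to be maximized)."""
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "max_depth": int(max_depth),
        "learning_rate": learning_rate,
    }
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3, seed=42)
    return cv["test-auc-mean"].iloc[-1]

optimizer = BayesianOptimization(
    f=xgb_cv_auc,
    pbounds={"max_depth": (3, 8), "learning_rate": (0.01, 0.3)},
    random_state=42,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)  # best AUC and the parameters that achieved it
```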
xgboost install
first: remove
conda remove xgboost
second: install
conda install py-xgboost
third: import
import xgboost
https://stackoverflow.com/questions/36292484/xgboost-installation-issue-with-anaconda
install shap
conda install -c conda-forge shap
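A minimal usage sketch with a tree model; the model and data below are toy assumptions just to show the calls:

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy model just for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)       # global feature-importance view
```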
matplotlib.pyplot : Errors
Document some errors.....