
xgboost_notebook's Issues

Understanding and applying KS and AUC

KS

To understand the relationship between the KS statistic and AUC, you first need to understand how the ROC curve and the KS curve are drawn. In a sense, the ROC curve and the KS curve are the same thing, just with different choices of axes. Take logistic regression as an example: after training, each sample gets a class probability. Sort the samples by this probability and split them into 10 equal-sized bins; for each bin compute its true positive rate and false positive rate, then accumulate these across bins. Plotting cumulative TPR against cumulative FPR gives the ROC curve; plotting the 10 bins on the x-axis and the cumulative TPR and cumulative FPR as two separate curves on the y-axis gives the KS curve. AUC is the area under the ROC curve, while the KS statistic is the maximum vertical gap between the two curves on the KS plot. Because KS identifies the single segment where the model separates the classes the most, it is well suited for choosing a cut-off, which is why scorecards are often evaluated with KS. However, KS only reflects the most discriminating segment rather than performance across all segments, so AUC is the better overall measure.
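
A minimal sketch of computing both measures with scikit-learn, assuming `y_true` holds 0/1 labels and `y_score` the predicted class probabilities; note that `roc_curve` sweeps every score threshold rather than 10 fixed bins, so the KS it yields has no decile approximation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ks_and_auc(y_true, y_score):
    """Return (KS, AUC) for binary labels and predicted scores."""
    auc = roc_auc_score(y_true, y_score)
    # roc_curve returns FPR/TPR over all thresholds; KS is the maximum
    # vertical gap between the cumulative TPR and FPR curves.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ks = float(np.max(tpr - fpr))
    return ks, auc
```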

Large gaps in KS and AUC between INN and OOT


Feature differences between INN and OOT

  1. Check whether the IV and WOE of each feature flip direction between INN and OOT
  2. Check feature stability between INN and OOT: missing rate, mean, and quantile values (a quick sketch follows this list)
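
A rough sketch of the stability check in item 2, assuming `df_inn` and `df_oot` are pandas DataFrames holding the INN and OOT samples with identical feature columns (the names are placeholders):

```python
import pandas as pd

def stability_report(df_inn: pd.DataFrame, df_oot: pd.DataFrame, features):
    """Compare missing rate, mean and median of each feature across INN and OOT."""
    rows = []
    for col in features:
        rows.append({
            "feature": col,
            "miss_rate_inn": df_inn[col].isna().mean(),
            "miss_rate_oot": df_oot[col].isna().mean(),
            "mean_inn": df_inn[col].mean(),
            "mean_oot": df_oot[col].mean(),
            "p50_inn": df_inn[col].quantile(0.5),
            "p50_oot": df_oot[col].quantile(0.5),
        })
    return pd.DataFrame(rows)
```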

Dig further for features with artificial (man-made) missingness

  • Check when each feature went live during the sample observation window

Model evaluation: Accuracy, Precision, Recall & F1 Score

TP, TN, FP, FN

  1. True Positives (TP) - Correctly predicted positives: the actual class is yes and the predicted class is also yes.
  2. True Negatives (TN) - Correctly predicted negatives: the actual class is no and the predicted class is also no.
  3. False Positives (FP) - The actual class is no but the predicted class is yes.
  4. False Negatives (FN) - The actual class is yes but the predicted class is no.
  • From the above, positives and negatives refer to the predicted labels 1 and 0 (or 1 and -1), respectively.

Accuracy, Precision, Recall and F1 score.

Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations.

  • Accuracy is a great measure, but only when the dataset is symmetric, i.e., the numbers of false positives and false negatives are roughly the same.

Accuracy = (TP+TN)/(TP+FP+FN+TN)

Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.

  • High precision relates to a low false positive rate.
    Precision = TP/(TP+FP)

Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. A recall above 0.5 is generally considered good.

  • High recall relates to a low false negative rate.
    Recall = TP/(TP+FN)

F1 score - F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.

  • Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially when the class distribution is uneven.
  • Accuracy works best when false positives and false negatives have similar costs. When those costs are very different, it is better to look at both Precision and Recall.

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
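
All four metrics are available in scikit-learn; a small self-contained example (the toy labels below are made up purely for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+FP+FN+TN)
print("Precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1 score :", f1_score(y_true, y_pred))          # 2*P*R/(P+R)
```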

Conclusion:

Data type         Recommended metric
Balanced data     Accuracy
Imbalanced data   F1 score

References:

https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

Basic modeling framework

Modeling steps (by workflow)

Sample definition

  1. Basic requirement: the training sample should match, as closely as possible, the population the model will score online
  2. Stratified sampling: sample proportionally from strata (sub-populations with large feature differences), e.g., new vs. existing customers
  3. Target definition: define y; decide whether it is time-dependent and, if so, specify the time window explicitly

Feature engineering: RFM (example: post-loan overdue performance)

Basic structure

Break the large framework into pieces, then fill in the small details

Exploratory data analysis (EDA)

  1. EDD
  2. Missing rate
  3. Duplicate records
  4. Constant (single-value) columns
  5. Other quality checks (a quick pandas sketch follows this list)
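
A rough pandas sketch covering items 2-5, assuming `df` is the raw modeling DataFrame (the function name is arbitrary):

```python
import pandas as pd

def quick_quality_check(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing rate, cardinality and constant-column flag."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_rate": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    summary["is_constant"] = summary["n_unique"] <= 1   # single-value columns
    print("duplicate rows:", df.duplicated().sum())     # fully duplicated records
    return summary
```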

Feature selection

Univariate analysis: IV (WOE)

  1. Usually cut into 5 bins, at most 10 (see the WOE/IV sketch after this list)
  2. Incorporate business knowledge to improve the interpretability of the variable (using time as the unit may make it even more interpretable??)
  3. Transform features so that they relate to the target more linearly
  4. For example, from age derive the feature log(age/10)
  5. When two variables have a correlation close to 1, one of them can be dropped (tree models handle collinear variables on their own, so no special treatment is needed)
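
A hedged sketch of the binning and IV calculation in item 1, using equal-frequency bins for a single numeric feature; `x` is the feature as a pandas Series, `y` the 0/1 target, and the smoothing constant is only there to avoid division by zero:

```python
import numpy as np
import pandas as pd

def woe_iv(x: pd.Series, y: pd.Series, n_bins: int = 5) -> float:
    """Equal-frequency binning followed by WOE/IV for one numeric feature."""
    bins = pd.qcut(x, q=n_bins, duplicates="drop")
    counts = pd.crosstab(bins, y)                  # columns 0 (good) and 1 (bad)
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log((dist_bad + 1e-6) / (dist_good + 1e-6))
    return float(((dist_bad - dist_good) * woe).sum())
```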

shap

STEP-WISE

  1. Start by putting a single variable into the model
  2. Add further variables by IV (above a set threshold), while also checking their correlation with the variables already in the model: the lower the correlation, the larger the expected lift (a sketch follows this list)
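
A sketch of this greedy procedure, assuming `iv_table` is a precomputed dict mapping feature name to IV and `df` holds the candidate feature values; both thresholds are illustrative:

```python
import pandas as pd

def stepwise_by_iv(df: pd.DataFrame, iv_table: dict,
                   iv_threshold: float = 0.02, corr_threshold: float = 0.7):
    """Greedily add features in descending IV order, skipping highly correlated ones."""
    selected = []
    for feat, iv in sorted(iv_table.items(), key=lambda kv: -kv[1]):
        if iv < iv_threshold:
            break                                  # remaining features are too weak
        if any(abs(df[feat].corr(df[s])) > corr_threshold for s in selected):
            continue                               # too similar to a selected feature
        selected.append(feat)
    return selected
```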

Model training

  1. IS (in-sample) training set: used to fit the model
  2. OOS (out-of-sample) validation set: used to evaluate and tune the model, feeding back into hyperparameter choices for the training run
  3. OOT (out-of-time) test set: used only for final evaluation, to assess how well the model generalizes once deployed (a training sketch follows this list)
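
A minimal training sketch under this split, using the xgboost scikit-learn API (version 1.6 or later, where `eval_metric` and `early_stopping_rounds` are constructor arguments); `X_is/y_is`, `X_oos/y_oos` and `X_oot/y_oot` are assumed to be the pre-split samples:

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

model = xgb.XGBClassifier(
    n_estimators=500, max_depth=4, learning_rate=0.05,
    eval_metric="auc", early_stopping_rounds=50,
)
# OOS drives early stopping and tuning; OOT is never used during training.
model.fit(X_is, y_is, eval_set=[(X_oos, y_oos)], verbose=False)

print("OOS AUC:", roc_auc_score(y_oos, model.predict_proba(X_oos)[:, 1]))
print("OOT AUC:", roc_auc_score(y_oot, model.predict_proba(X_oot)[:, 1]))
```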

Performance evaluation

  1. Accuracy
  2. Precision and Recall
  3. AUC
  4. Overfitting
  5. Underfitting
  6. Hyperparameter selection

Interpreting objective: 'rank:pairwise'

predict()

Actually, in the Learning to Rank setting, we are trying to predict a relative score for each document with respect to a specific query; this is neither a regression problem nor a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to that query than the documents with positive scores.

https://www.kaggle.com/c/expedia-hotel-recommendations/discussion/20225
https://blog.csdn.net/weixin_42001089/article/details/84146238

xgboost.predict() can return negative values. Following the explanation above, the predicted score (including its sign) only expresses relative relevance to y, so to obtain something resembling a probability of Y = 1, y_predict can be normalized (see the sketch below).
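
Two common ways to squash the raw scores into [0, 1], assuming `raw` holds the predictions of a 'rank:pairwise' model; note that neither is a calibrated probability of Y = 1, they only preserve the ordering:

```python
import numpy as np

# raw = model.predict(dtest)  # raw scores from a 'rank:pairwise' model (may be negative)
raw = np.array([-1.3, -0.2, 0.4, 2.1])

sigmoid = 1.0 / (1.0 + np.exp(-raw))                           # logistic squashing
minmax = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)   # min-max scaling
```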
