xgboost_notebook's Issues
Understanding and applying KS and AUC
KS
To understand the relationship between the KS value and the AUC value, you first need to understand how the ROC curve and the KS curve are drawn. In a sense they are the same thing, just with different choices of axes. Take logistic regression as an example: after training, every sample gets a predicted class probability. Sort the samples by this probability and split them into 10 equal buckets; for each bucket compute its true positive rate and false positive rate, then accumulate them across buckets. Plotting the cumulative true positive rate against the cumulative false positive rate gives the ROC curve; plotting the 10 buckets on the x-axis and the cumulative true positive rate and cumulative false positive rate as two separate curves on the y-axis gives the KS curve.
The AUC is the area under the ROC curve, while the KS value is the maximum vertical gap between the two curves in the KS chart. Because KS pinpoints the single segment where the model separates the classes best, it is well suited for choosing a cut-off, e.g., for scorecards. However, KS only reflects the most discriminative segment rather than the overall performance across all segments; for an overall view, AUC is the better metric.
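A minimal sketch of computing AUC and KS from predicted probabilities with scikit-learn (the arrays y_true and y_score below are made-up examples):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Assumed inputs: y_true holds 0/1 labels, y_score holds predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

auc = roc_auc_score(y_true, y_score)  # area under the ROC curve

# KS: the maximum gap between cumulative TPR and cumulative FPR.
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")
```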
imbalanced data
learning_rate setting principles
This Google paper offers a possible explanation for the phenomenon: an appropriately sized, rather than an overly small, learning rate introduces an implicit gradient-penalty term into the optimization, which helps it converge to flatter regions. I think the analysis in the paper is worth studying.
References:
Imbalanced y: SMOTE oversampling
install imblearn
conda install -c conda-forge imbalanced-learn
https://imbalanced-learn.readthedocs.io/en/stable/install.html
import
from imblearn.over_sampling import SMOTE
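A minimal SMOTE usage sketch on a toy imbalanced dataset (the synthetic data below is only for illustration):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy imbalanced dataset (roughly 9:1 class ratio) just for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("after: ", Counter(y_resampled))
```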
Parameter Tuning in XGBoost
anaconda python install settings
Anaconda installation
Large gap in KS and AUC between INN and OOT
Model evaluation: Accuracy, Precision, Recall & F1 Score
TP, TN, FP, FN
- True Positives (TP) - correctly predicted positive values: the actual class is yes and the predicted class is also yes.
- True Negatives (TN) - correctly predicted negative values: the actual class is no and the predicted class is also no.
- False Positives (FP) - the actual class is no but the predicted class is yes.
- False Negatives (FN) - the actual class is yes but the predicted class is no.
- From the above, "positive" and "negative" refer to the predicted label being 1 or 0 (or 1 or -1), respectively.
Accuracy, Precision, Recall and F1 score.
Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations.
- Accuracy is a good measure, but only for symmetric datasets where the numbers of false positives and false negatives are roughly the same.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
- High precision relates to a low false positive rate.
Precision = TP/(TP+FP)
Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. A recall above 0.5 is generally considered good.
- High recall relates to a low false negative rate.
Recall = TP/(TP+FN)
F1 score - F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
- Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
*Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
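A minimal sketch of computing these metrics with scikit-learn (the arrays y_true and y_pred below are made-up examples):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumed inputs: hard 0/1 predictions compared against the true labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```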
Conclusion:
Data | Metric
---|---
Balanced data | Accuracy
Imbalanced data | F1 score
References:
Feature Engineering
Basic modeling framework
Modeling steps (by workflow)
Sample definition
- Basic requirement: keep the training sample as consistent as possible with the population the model will score online
- Stratified sampling: sample proportionally from different strata (sub-populations with large feature differences), e.g., splitting new vs. existing customers
- Prediction target: the definition of y; distinguish whether it is time-dependent and specify the time window explicitly
Feature engineering: RFM (example: post-loan delinquency performance)
Basic structure
Break the big framework into pieces, then fill in the small details
Exploratory data analysis (EDA)
- EDD
- Missing rate
- Duplicate values
- Constant (single-value) columns
- Other quality checks (see the pandas sketch below)
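A minimal pandas sketch of these quality checks; the toy DataFrame df is an assumption standing in for the real modeling data:

```python
import pandas as pd

# Assumed input: df is the raw modeling DataFrame (toy data here).
df = pd.DataFrame({"age": [25, 30, None, 30], "flag": [1, 1, 1, 1]})

missing_rate = df.isnull().mean()       # missing rate per column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

print(missing_rate, duplicate_rows, constant_cols, sep="\n")
```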
Feature selection
Univariate analysis: IV (WOE)
- Usually cut into 5 bins, at most 10 (see the WOE/IV sketch below)
- Combine with business knowledge to improve variable interpretability (is time measured in days more interpretable?)
- Transform features to be closer to linear
- e.g., for age, derive a feature such as log(age/10)
- If two variables have a correlation close to 1, one of them can be dropped (tree models handle collinear variables themselves, so no special treatment is needed)
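A minimal sketch of WOE/IV for one numeric variable cut into 5 quantile bins; the data, the feature x, and the 0/1 label y are synthetic assumptions:

```python
import numpy as np
import pandas as pd

# Assumed inputs: a DataFrame with a numeric feature x and a 0/1 label y (synthetic here).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000)})
df["y"] = (df["x"] + rng.normal(size=1000) > 0.5).astype(int)

df["bin"] = pd.qcut(df["x"], q=5, duplicates="drop")  # usually 5 bins, at most 10

grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
grouped["bad"] = grouped["sum"]
grouped["good"] = grouped["count"] - grouped["sum"]

bad_dist = grouped["bad"] / grouped["bad"].sum()
good_dist = grouped["good"] / grouped["good"].sum()

woe = np.log(good_dist / bad_dist)          # WOE per bin
iv = ((good_dist - bad_dist) * woe).sum()   # total IV for this variable
print(woe, f"IV = {iv:.3f}", sep="\n")
```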
shap
STEP-WISE
- Start by putting one variable into the model
- Add variables by IV (with a threshold), also checking correlation with variables already in the model: the lower the correlation, the larger the expected lift (see the sketch below)
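A rough sketch of this selection loop; the IV values and features below are made up for illustration, and in practice you would also re-check model lift after each addition:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: precomputed IV per feature and the candidate feature frame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
df["f3"] = df["f1"] * 0.95 + rng.normal(scale=0.1, size=500)  # f3 nearly collinear with f1
iv_by_feature = {"f1": 0.45, "f2": 0.32, "f3": 0.30, "f4": 0.05}

IV_MIN = 0.1    # IV threshold for a feature to be considered
CORR_MAX = 0.7  # skip features highly correlated with those already selected

selected = []
for feat, iv in sorted(iv_by_feature.items(), key=lambda kv: kv[1], reverse=True):
    if iv < IV_MIN:
        continue
    if any(abs(df[feat].corr(df[s])) > CORR_MAX for s in selected):
        continue
    selected.append(feat)

print(selected)  # expected: ['f1', 'f2'] since f3 is nearly collinear with f1
```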
Model training
- IS (in-sample) training set: used to train the model
- OOS (out-of-sample) validation set: used for evaluating and tuning the model; feeds back into parameter tuning on the training set
- OOT (out-of-time) test set: used only for evaluation, to assess how well the model generalizes when deployed (see the split sketch below)
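A minimal sketch of how such a split might be built; the DataFrame, the date column apply_date, and the cut-off date are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: df with a date column used to carve out the OOT window.
df = pd.DataFrame({
    "apply_date": pd.date_range("2023-01-01", periods=12, freq="MS").repeat(50),
    "x": range(600),
    "y": [0, 1] * 300,
})

oot = df[df["apply_date"] >= "2023-10-01"]  # most recent months held out as OOT
dev = df[df["apply_date"] < "2023-10-01"]   # development sample

train, oos = train_test_split(dev, test_size=0.3, random_state=42)  # IS / OOS
```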
Evaluation
- Accuracy
- Precision and Recall
- AUC
- Overfitting
- Underfitting
- Parameter selection
objective: 'rank:pairwise' explained
predict()
Actually, in the Learning to Rank field, we are trying to predict a relative score for each document with respect to a specific query. That is, this is not a regression problem or a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to the query when compared to other documents with positive scores.
https://www.kaggle.com/c/expedia-hotel-recommendations/discussion/20225
https://blog.csdn.net/weixin_42001089/article/details/84146238
xgboost.predict() can output negative values. Based on the explanation above, the magnitude (including the sign) of the predicted score reflects how strongly the sample is associated with the positive class, so to obtain something like a probability of Y=1 you can normalize y_predict (see the sketch below).
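One simple way to map raw scores into (0, 1) is a sigmoid or min-max scaling; note these are monotone transforms that preserve the ranking but are not calibrated probabilities. A minimal sketch (scores is an assumed array of raw predict() outputs):

```python
import numpy as np

# Assumed input: raw margin scores returned by predict() under rank:pairwise.
scores = np.array([-1.2, -0.3, 0.1, 0.8, 2.4])

sigmoid = 1.0 / (1.0 + np.exp(-scores))                            # squashes into (0, 1)
minmax = (scores - scores.min()) / (scores.max() - scores.min())   # rescales to [0, 1]

print(sigmoid, minmax, sep="\n")
```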
install bayesian-optimization anaconda
$ conda install -c conda-forge bayesian-optimization
Reference: https://github.com/fmfn/BayesianOptimization
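A minimal usage sketch of this package for tuning a couple of XGBoost parameters via cross-validation; the objective function, bounds, and toy data below are illustrative assumptions:

```python
import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification

# Toy data just for illustration.
X, y = make_classification(n_samples=500, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

def xgb_cv_auc(max_depth, learning_rate):
    """Return mean CV AUC for the given hyperparameters (to be maximized)."""
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "max_depth": int(max_depth),
        "learning_rate": learning_rate,
    }
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3, seed=42)
    return cv["test-auc-mean"].iloc[-1]

optimizer = BayesianOptimization(
    f=xgb_cv_auc,
    pbounds={"max_depth": (3, 8), "learning_rate": (0.01, 0.3)},
    random_state=42,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)  # best AUC and the parameters that achieved it
```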
xgboost install
first: remove
conda remove xgboost
second: install
conda install py-xgboost
third: import
import xgboost
https://stackoverflow.com/questions/36292484/xgboost-installation-issue-with-anaconda
install shap
conda install -c conda-forge shap
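A minimal usage sketch with a tree model; the model and data below are toy assumptions just to show the calls:

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy model just for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)       # global feature-importance view
```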
matplotlib.pyplot : Errors
Document some errors.....