Giter Club home page Giter Club logo

ibm-hr-employee-attrition-prediction-study's Introduction

IBM-HR-employee-Attrition-Prediction-Study

Dataset source

In this project, we perform a complete machine learning work flow from data exploratory and feature engineering to build different machine learning models and evaluate model performances to predict if an employee will leave his/her current employer and analyze what features will mainly affect employee's attrition activity.

Contents of the project

  1. Data Exporatory and analysis

  2. Feature engineering

  3. Model training and performance evaluations

  4. Feature importance analysis

Data Exporatory and analysis

Data exporatory is the most important part of the work flow for machine learning project as it is the first approach to understand the whole dataset and all the features including numerical and non numerical, missing data, duplicate data, meaningful and meaningless. Since we will try the best to understand and apply all the useful features to the models later on, we would have to make ourself understand all the features that we have.

Visualization

It is one of the most effective ways to understand the statistical distribution and detect potential outliers in our dataset.

Distribution of Target Variable

Attrition Distribution

As we could see the distribution of our target variable is imbalanced for the binary variable yes and no.

Distribution of Numerical Features

Numerical Distribution

This is the distribution of our numerical features. We have preprocessed all the numerical features so there are no missing data or duplicate data. From the exproratory, it looks like exit employee's age and working time with his current position give out some signals of attrition as well as dailyrate which refers to paystub.

Distribution of Non Numerical Features

Non Numerical Distribution

This is the distribution of non numerical features. Even though all the distribution is imbalanced, we could still tell that too much over time work definitely will drive away a loyal employee. Traveling and martial status will also be considered as factors of the effect.

Feature Engineering

Correlation Matrix

Correlationn Matrix

From this matrix we could be able to see correlations between each feature and between features and target variable. For correlation score > 0.5, we would consider the two features are correlated.

For non numerical features, we applied one hot encoding and transform them into numerical.

Model Training

Naive Approach

Naive Approach

Naive approach is to apply features to the models without hyperparameter tuning. We could roughly get the result of performance of each model. Here we use the linear model, tree based model, ensemble learning model and deep learning model.

Hyperparameter Tuning

Here are the graphs with different hyperparameter affect the performance of logistic regression and k nearest neighbors.

LR KNN

K fold cross validation is the method we use to check the performance of the model on different dataset, so basically we split our dataset into trainig set and testing set, and we split training set into same different portions, and we apply each portion to our model and get the mean score of the model performance. Then we will apply use our testing set to verify the accuracy of our predictions.

Model Evaluation

There are different metrics to evalutate a model's performance. For classification problem, which is what we tried to solve in this project, accuracy is one of the metric that we will look at. Beside accuracy, precision and recall is another metric that we need to pay more attention especially for imbalanced variable. Assuming we only have 1 employee exit in our dataset and we predict everyone is staying, then we will have a model with accuracy 99% which still cannot be able to help us to find out the employee who would like to exit. So precision recall and f1 score will the metric for our evaluation. And to better evaluate and visualize the result of precision and recell, we use confusion matrix graph with labels of acutal and prediction.

cm_lr cm_knn cm_rf cm_dt cm_mlp cm_xgb

ROC analysis is another metric to evaluation how well the model could separate the true label and false lablel accurately. roc

AUC a.k.a area under curve, the higher the value is the less randomness the model will generate for the correct true or false label here is yes or no. The ideal AUC will be 1 which will fill up the whole square and worst is the triangle that goes diagnoal across the squre.

Feature Importance

This is the most important analysis that gained from random forest and LASSO aka linear model with regularization. The feature importance reveals how important each feature is and how it will affect the performance of the model.

rf_fi LASSO_fi

Conclusions

It is quite obvious and it is a conclusion that everyone will accept, which is paying more and working less may be the eltimate dream for all employees. Overtime, monthlyrate, age, years of works they all are very important feature that will affect the attrition activity and performance of an employee. So it will be quite efficient to apprach to those higher risk employees and better communicate with them, so that their intention of attrition could be resolved. It will be beneficial for both employer and employee. Otherwise, plan ahead of time to hire new candidate will be an alternative option.

Future Plans and Works

This is a complete flow for machine learning and there is still much of work to do. We will need to spend more time on tuning the model so that we will have a more robust model that could be deployed in the future. If possible, we could collect more data and more feaetures so that the model could learn more from data and make more accurate predictions.

ibm-hr-employee-attrition-prediction-study's People

Contributors

whwbj avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.