Credit_Risk_Analysis

Using supervised machine learning to predict credit risk.

Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, the analysis applies oversampling with the RandomOverSampler and SMOTE algorithms and undersampling with the ClusterCentroids algorithm. It then applies the SMOTEENN algorithm, which combines over- and undersampling. Finally, it compares two new machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk.

Tools: Jupyter Notebooks, Pandas, Python, supervised machine learning algorithms

Resources: LendingClub credit card data LoanStats_2019Q1.csv.zip

Overview of the analysis:

The purpose of this analysis is to use supervised machine learning to predict credit risk. Because there are far fewer risky loans than good loans, this is an imbalanced classification problem. The analysis looks at oversampling, undersampling, and decision tree classifiers to try to determine the best model for this problem. To do so, it compares the accuracy, precision, recall, and F1 scores from six different models to determine whether one is better than the others at detecting risky loans.
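To make the oversampling idea concrete, here is a minimal sketch in plain Python of what random oversampling does: it duplicates randomly chosen minority-class rows until the classes are the same size. This is an illustration only, not the RandomOverSampler implementation used in the analysis. The function name `random_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement to fill the gap up to the majority count.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 8 low-risk loans and 2 high-risk loans (made-up feature values).
rows = [[i] for i in range(10)]
labels = ["low_risk"] * 8 + ["high_risk"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

After resampling, both classes have 8 rows. In practice, resampling should be applied only to the training split, so that the test set keeps the original class balance.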

Results:

We cannot depend solely on the accuracy score, since a model could do well at detecting all the low-risk loans, miss the few risky loans, and still get a good accuracy score. Therefore precision and recall are also considered, as well as the F1 score. Precision divides the number of true positives by the total number of predicted positives (true positives plus false positives); it measures how reliable a positive classification is. Recall measures how likely the model is to detect the risky loans; it is calculated as true positives over true positives plus false negatives. Recall is more important for this data: a high-recall model flags some good loans as risky, which means some good loans will not be awarded, but there won't be many risky loans that go undetected. The F1 score is 2(Precision * Sensitivity)/(Precision + Sensitivity), where sensitivity is another name for recall. A low score means that precision, recall, or both are low, here because of a large imbalance between precision and recall, which is true of this dataset.

All models did predictably well at detecting the low-risk loans, so the analysis needs to focus on how many of the high-risk loans each model detects without unnecessarily flagging too many good loans as high risk.
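The metric definitions above can be sketched directly from confusion-matrix counts. The counts below are hypothetical and are not results from this analysis; they are chosen only to show how a model can score high on accuracy while detecting no risky loans, and how low precision drags the F1 score down. The function name `classification_metrics` is an assumption for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive (high-risk) class from
    confusion-matrix counts. Ratios with a zero denominator are set to 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Balanced accuracy is the mean of recall on each class.
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical case 1: 1000 loans, 10 of them high risk, and a model that
# labels every loan low risk. Accuracy looks excellent, recall is zero.
naive = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Hypothetical case 2: a model that catches 7 of the 10 risky loans but
# also flags 300 good loans. Recall is decent, precision and F1 are low.
flagging = classification_metrics(tp=7, fp=300, fn=3, tn=690)

for name, metrics in (("naive", naive), ("flagging", flagging)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Balanced accuracy averages the recall of the two classes, which is why it does not reward the first model for ignoring the minority class.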

  • Oversampling with RandomOverSampler: accuracy score 65%

  • Oversampling with SMOTE: accuracy score 64%

  • Undersampling with ClusterCentroids: accuracy score 53%

  • Combination sampling with SMOTEENN: accuracy score 64%

  • Ensemble machine learning with BalancedRandomForestClassifier: accuracy score 67%

  • Ensemble machine learning with EasyEnsembleClassifier: accuracy score 93%

Summary:

None of the oversampling models did well on precision for the high-risk loans, and they did not have particularly high recall either. This is concerning because it means the models have trouble predicting when a high-risk loan is actually high risk. The F1 scores are all low with oversampling as well, because precision was so much lower than recall on all of them. Recall was best for the combination sampling SMOTEENN model, at .71, which also had one of the higher accuracy scores of the four sampling models, yet it was only 71% accurate.

The ensemble decision tree classifiers did a bit better, with the Easy Ensemble Classifier having the only good accuracy score, at 93%. This model also had the best recall score, at .91, but this came at the expense of precision, which was very low. The Balanced Random Forest was the opposite, with the best precision score and a terrible recall. Because recall is more important where overlooked cases (false negatives) are more costly than false alarms (false positives), the Easy Ensemble model is the only one I would recommend, so that too many risky loans are not approved.

The analysis would benefit from some more constraints. Perhaps zero risk tolerance would be appropriate if the loan is over a certain amount of money, while under a certain amount there is an acceptable level of risk. Additionally, focusing on the features that contribute the most could help the model.
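One way the amount-based constraint could look is a decision threshold that depends on the loan amount. The sketch below is purely illustrative: the function name `flag_high_risk`, the cutoff amount, and both thresholds are hypothetical values, not figures from this analysis, and it assumes the model can output a predicted probability of high risk.

```python
def flag_high_risk(prob_high_risk, loan_amount,
                   large_loan_cutoff=20_000,    # hypothetical dollar cutoff
                   small_loan_threshold=0.5,    # hypothetical tolerance
                   large_loan_threshold=0.1):   # stricter for large loans
    """Flag a loan as high risk using a stricter probability threshold
    for large loans than for small ones."""
    if loan_amount >= large_loan_cutoff:
        threshold = large_loan_threshold
    else:
        threshold = small_loan_threshold
    return prob_high_risk >= threshold


# The same predicted probability leads to different decisions by amount.
print(flag_high_risk(0.3, loan_amount=25_000))
print(flag_high_risk(0.3, loan_amount=5_000))
```

Lowering the threshold for large loans raises recall on those loans and lowers precision, which matches the preference stated above for avoiding false negatives where they are most costly.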
