
Introduction

Abstract

The IEEE-CIS Fraud Detection competition on Kaggle is a binary classification problem in which we have to predict the probability that a transaction is fraudulent. Necessary preprocessing steps such as merging, dropping sparse columns, and label encoding are applied to the train and test datasets. Exploratory data analysis (EDA) is then performed on the train dataset. Finally, a LightGBM model is trained on the training data, achieving an AUC score of 0.936113, i.e. about 93.6%, on the test dataset.

Methodology

Both the train and test datasets come in two parts: a transaction file and an identity file. First, I merged the two parts with a left join on the TransactionID feature. Then I performed exploratory data analysis to understand the train dataset. Most of the columns had missing values, so I dropped the columns with more than 50% missing data, and I applied label encoding to the categorical features. To visualize the train dataset, I plotted the feature distributions and printed summary statistics for the relevant features to the console.
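The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper and the toy column names are assumptions, and pandas' `factorize` stands in for a label encoder.

```python
import pandas as pd

def preprocess(transaction, identity):
    # Left join on TransactionID keeps every transaction row, even those
    # without a matching identity record.
    df = transaction.merge(identity, how="left", on="TransactionID")

    # Drop columns with more than 50% missing values.
    df = df.loc[:, df.isnull().mean() <= 0.5]

    # Label-encode the remaining categorical (object-typed) features.
    for col in df.select_dtypes(include="object"):
        df[col] = pd.factorize(df[col])[0]
    return df
```

In the real pipeline the same function would be applied to both the train and test parts so that the two end up with the same encoded feature set.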

Initially, the train dataset had around 434 columns/features. After preprocessing, I was left with 217 features for training. This report would become very lengthy if I tried to explain all 217 features and their distributions, so I will provide just two examples of the EDA performed in my project.

Statistical Data of C1, C2, C3, C4, C5, C6 Features

Histogram of card4 Feature Distribution

The C1-C6 features were not plotted because they would first need normalization. Instead, I gained insight into C1-C6 by printing their summary statistics.
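The kind of statistical summary mentioned above can be produced with pandas' `describe`. The values below are made-up placeholders, not the competition's actual C1-C6 data:

```python
import pandas as pd

# Toy stand-in for the count features; the real values come from the train set.
df = pd.DataFrame({"C1": [1, 1, 2, 5, 1], "C2": [1, 2, 2, 3, 1]})

# describe() reports count, mean, std, min, quartiles, and max per column.
stats = df.describe()
print(stats)
```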

Result Analysis

I used a LightGBM model to generate the probabilities because it is faster than most other classification models such as XGBoost, k-nearest neighbours, random forests, and single decision trees. Another advantage of LightGBM is that it tolerates null values in the training dataset. LightGBM is similar to XGBoost in that both use gradient boosting: they build a prediction from many 'weak learners', typically decision trees. The main difference lies in how the trees are grown: LightGBM grows them leaf-wise (best-first, comparable to a depth-first search), whereas XGBoost by default grows them level-wise (comparable to a breadth-first search).

The evaluation metric used is AUC, the area under the ROC curve, which plots the true-positive rate against the false-positive rate across classification thresholds. I used 217 features to train the LightGBM model and plotted the 50 most important features according to the model, shown below:
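The metric itself can be illustrated with scikit-learn's `roc_auc_score`; the labels and scores below are made-up placeholders, not model output.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities

# AUC equals the probability that a random positive outranks a random
# negative: here 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```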

50 Most Important Features

Conclusion

Even though my model's score is 0.936113, it could be improved with a more sophisticated feature-selection process such as recursive feature elimination (RFE) and with further feature engineering. I also intend to apply other classification models, such as XGBoost, logistic regression, and k-nearest neighbours, to this dataset and compare their performance.
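As a pointer for the future work mentioned above, recursive feature elimination is available in scikit-learn as `RFE`. The estimator, data, and feature counts below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the 217-feature fraud dataset.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursively refit the estimator, dropping the weakest feature each
# round until only 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
mask = selector.support_   # boolean mask over the 10 input features
```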
