
fraud-detection-using-machine-learning's Introduction

Fraud Detection using Machine Learning

With businesses moving online, fraud and abuse in online systems are constantly increasing as well. Traditionally, rule-based fraud detection systems have been used to combat online fraud, but these rely on a static set of rules created by human experts. This project uses machine learning to create models for fraud detection that are dynamic, self-improving, and maintainable. Importantly, they can scale with the online business.

Specifically, we show how to use Amazon SageMaker to train supervised and unsupervised machine learning models on historical transactions, so that they can predict the likelihood of incoming transactions being fraudulent. We also show how to deploy the models, once trained, to a REST API that can be integrated into an existing business software infrastructure. This project includes a demonstration of this process using a public, anonymized credit card transactions dataset provided by ULB, but it can be easily modified to work with custom labelled or unlabelled data provided as a relational table in CSV format.

Getting Started

You will need an AWS account to use this solution. Sign up for an account here.

To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you will need to create an active SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your Studio instance is Ready, use the instructions in SageMaker JumpStart to 1-Click Launch the solution.

The solution artifacts are included in this GitHub repository for reference.

Note: Solutions are available in most regions, including us-west-2 and us-east-1.

Caution: Cloning this GitHub repository and running the code manually could lead to unexpected issues! Use the AWS CloudFormation template. You'll get an Amazon SageMaker Notebook instance that's been correctly set up and configured to access the other resources in the solution.

Architecture

The project architecture deployed by the CloudFormation template is shown here.

Project Description

The project uses Amazon SageMaker to train both a supervised and an unsupervised machine learning model, which are then deployed to Amazon SageMaker-managed endpoints.

If you have labels for your data, for example if some of the transactions have been annotated as fraudulent and some as legitimate, then you can train a supervised learning model to discern the two classes. In this project, we provide a recipe to train a gradient boosted decision tree model using XGBoost on Amazon SageMaker. The supervised model training process also handles the common issue of highly imbalanced data in fraud detection problems. The project addresses this issue in two ways: 1) by upsampling the minority class using the "imbalanced-learn" package, and 2) by using the scale_pos_weight hyperparameter to control the balance of positive and negative weights.
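As a rough illustration of these two strategies, the sketch below uses the open-source imbalanced-learn and xgboost packages on synthetic data; it is not the solution's notebook code, which uses the SageMaker XGBoost container.

    # Minimal sketch of the two imbalance-handling strategies, on synthetic data.
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from xgboost import XGBClassifier

    # Highly imbalanced stand-in for the transaction features and fraud labels.
    X_train, y_train = make_classification(n_samples=10000, n_features=30,
                                           weights=[0.99], random_state=0)

    # 1) Upsample the minority (fraud) class with imbalanced-learn's SMOTE.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

    # 2) Alternatively, keep the original data and set scale_pos_weight to
    #    (number of negative examples) / (number of positive examples).
    ratio = (y_train == 0).sum() / (y_train == 1).sum()
    clf = XGBClassifier(scale_pos_weight=ratio)
    clf.fit(X_train, y_train)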

If you don't have labelled data, or if you want to augment your supervised model's predictions with an anomaly score from an unsupervised model, then the project also trains a RandomCutForest model using Amazon SageMaker. The RandomCutForest algorithm is trained on the entire dataset, without labels, and takes advantage of the highly imbalanced nature of fraud datasets to assign higher anomaly scores to the fraudulent transactions in the dataset.
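A hedged sketch of what this can look like with the SageMaker Python SDK (v2) is shown below; the instance types, hyperparameters, S3 paths, and the random input matrix are illustrative placeholders, not the solution's actual configuration.

    # Sketch: train RandomCutForest on unlabeled features and deploy an endpoint.
    import numpy as np
    import sagemaker
    from sagemaker import RandomCutForest

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # assumes execution inside SageMaker
    bucket = session.default_bucket()

    rcf = RandomCutForest(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        num_samples_per_tree=512,
        num_trees=50,
        data_location=f"s3://{bucket}/rcf/input",
        output_path=f"s3://{bucket}/rcf/output",
    )

    # Train on the full feature matrix, without labels (placeholder data here).
    X = np.random.rand(1000, 30).astype("float32")
    rcf.fit(rcf.record_set(X))

    # Deploy a real-time endpoint that returns an anomaly score per record.
    rcf_predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")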

Both trained models are deployed to Amazon SageMaker-managed real-time endpoints that host the models and can be invoked to provide predictions for new transactions.

The model training and endpoint deployment are orchestrated by running a Jupyter notebook on a SageMaker Notebook instance. The notebook runs a demonstration of the project using the aforementioned anonymized credit card dataset, which is automatically downloaded to the Amazon S3 bucket created when you launch the solution. However, the notebook can be modified to run the project on a custom dataset in S3. The notebook instance also contains some example code that shows how to invoke the REST API for inference, as sketched below.
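For reference, here is a minimal sketch of sending one transaction to the REST API; the URL and payload are placeholders, and the deployed API may additionally require IAM (SigV4) request signing.

    # Sketch: send a single CSV-formatted transaction to the solution's REST API.
    import requests

    api_url = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"  # placeholder
    transaction = ",".join(["0.0"] * 30)  # Time, V1..V28, Amount as a CSV line (illustrative)

    response = requests.post(api_url, data=transaction,
                             headers={"Content-Type": "text/csv"})
    print(response.status_code, response.text)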

In order to encapsulate the project as a stand-alone microservice, Amazon API Gateway is used to provide a REST API that is backed by an AWS Lambda function. The Lambda function runs the code necessary to preprocess incoming transactions, invoke the SageMaker endpoints, merge results from both endpoints if necessary, store the model inputs and model predictions in S3 via Amazon Kinesis Data Firehose, and provide a response to the client.
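A hedged sketch of a Lambda handler of this shape is shown below; the endpoint names, delivery stream name, event payload, and response parsing are illustrative assumptions, not the solution's actual code.

    # Sketch of a Lambda handler: invoke both endpoints, log to Firehose, respond.
    import json
    import boto3

    sm_runtime = boto3.client("sagemaker-runtime")
    firehose = boto3.client("firehose")

    def handler(event, context):
        data = event["data"]  # CSV string of transaction features (assumed payload shape)

        # Anomaly score from the unsupervised RandomCutForest endpoint.
        rcf_resp = sm_runtime.invoke_endpoint(
            EndpointName="fraud-detection-rcf",  # placeholder name
            ContentType="text/csv",
            Accept="application/json",
            Body=data,
        )
        # Assumes the default RCF JSON response shape: {"scores": [{"score": ...}]}
        anomaly_score = json.loads(rcf_resp["Body"].read())["scores"][0]["score"]

        # Fraud probability from the supervised XGBoost endpoint.
        xgb_resp = sm_runtime.invoke_endpoint(
            EndpointName="fraud-detection-xgb",  # placeholder name
            ContentType="text/csv",
            Body=data,
        )
        fraud_probability = float(xgb_resp["Body"].read().decode("utf-8"))

        output = {"anomaly_score": anomaly_score, "fraud_probability": fraud_probability}

        # Persist model inputs and predictions to S3 via Kinesis Data Firehose.
        firehose.put_record(
            DeliveryStreamName="fraud-detection-stream",  # placeholder name
            Record={"Data": (json.dumps({"input": data, **output}) + "\n").encode("utf-8")},
        )
        return {"statusCode": 200, "body": json.dumps(output)}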

Data

The example dataset used in this solution was originally released as part of a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions made by European cardholders in 2013. As is common in fraud detection, it is highly unbalanced, with 492 fraudulent transactions out of 284,807 total transactions. The dataset contains only numerical features, because the original features have been transformed for confidentiality using PCA. As a result, the dataset contains 28 PCA components and two features that have not been transformed, Amount and Time. Amount is the transaction amount, and Time is the number of seconds elapsed between a given transaction and the first transaction in the dataset.
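For example, a quick way to confirm the class imbalance, assuming the dataset has been downloaded locally as creditcard.csv with the usual column names (Time, V1–V28, Amount, Class):

    # Sketch: inspect the class imbalance in the ULB credit card dataset.
    import pandas as pd

    df = pd.read_csv("creditcard.csv")
    print(df.shape)                    # expected: (284807, 31)
    print(df["Class"].value_counts())  # 0 = legitimate, 1 = fraud (492 rows)
    print(f"{df['Class'].mean() * 100:.3f}% of transactions are fraudulent")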

More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the DefeatFraud project page.

We cite the following works:

  • Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A.; Bontempi, Gianluca. Calibrating Probability with Undersampling for Unbalanced Classification. Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
  • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
  • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
  • Dal Pozzolo, Andrea. Adaptive Machine Learning for Credit Card Fraud Detection. ULB MLG PhD thesis (supervised by G. Bontempi).
  • Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. SCARFF: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
  • Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.

Contents

  • deployment/
    • fraud-detection-using-machine-learning.yaml: Creates AWS CloudFormation Stack for solution
  • source/
    • lambda/
      • model-invocation/
        • index.py: Lambda function script for invoking SageMaker endpoints for inference
    • notebooks/
      • src/
        • package/
          • config.py: Reads in the environment variables set during the AWS CloudFormation stack creation
          • generate_endpoint_traffic.py: Custom script showing how to send transaction traffic to the REST API for inference
          • util.py: Helper functions and utilities
      • sagemaker_fraud_detection.ipynb: Orchestrates the solution; trains the models and deploys the trained models
      • endpoint_demo.ipynb: A small notebook that demonstrates how to use the solution's endpoint to make predictions.
    • scripts/
      • set_kernelspec.py: Used to update the kernelspec name at deployment.
    • test/
      • Files that are used to automatically test the solution

License

This project is licensed under the Apache-2.0 License.

fraud-detection-using-machine-learning's People

Contributors

dependabot[bot], ehsanmok, jonomon, sojiadeshina, thvasilo


fraud-detection-using-machine-learning's Issues

Metric for evaluation and other issues

I found the following issues when running this notebook:

  1. Metric for evaluation: after running this notebook, I found that the recall was only about 73%; for a fraud detection use case, the target should be higher, such as 90%.

    Suggested resolution: follow the example at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/scientific_details_of_algorithms/linear_learner_class_weights_loss_functions/linear_learner_class_weights_loss_functions.ipynb, including (see the sketch after this list):

    Set 'binary_classifier_model_selection_criteria': 'precision_at_target_recall'
    Split the dataset into validation and test along with train.
    Use class weights etc.
    
  2. The data upload cell cannot be run as written. Instead, use the default session bucket or allow the user to provide their own:

      bucket = session.default_bucket()
      prefix = 'fraud-detection-end-to-end-demo/linear-learner'
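    A rough sketch of the Linear Learner configuration suggested in item 1 (the instance type and the 90% recall target are illustrative choices):

      # Sketch of the suggested Linear Learner setup (SageMaker Python SDK v2).
      import sagemaker
      from sagemaker import LinearLearner

      role = sagemaker.get_execution_role()  # assumes execution inside SageMaker

      linear = LinearLearner(
          role=role,
          instance_count=1,
          instance_type="ml.m5.xlarge",
          predictor_type="binary_classifier",
          binary_classifier_model_selection_criteria="precision_at_target_recall",
          target_recall=0.9,
          positive_example_weight_mult="balanced",  # class weights for the rare fraud class
      )
      # linear.fit(linear.record_set(train_features, labels=train_labels))
      # ...with separate validation and test splits held out for evaluation.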
    

The timestamp part in the job name after training is different from what rcf.deploy is looking for

I am trying to run the fraud detection using machine learning stack and I am encountering this error:

UnexpectedStatusException: Error hosting endpoint sagemaker-soln-fd1-rcf: Failed. Reason: Failed to download model data for container "container_1" from URL: "s3://cloud-bucket-test/fraud-classifier/output/sagemaker-soln-fd1-rcf-2021-03-26-18-36-26-322/output/model.tar.gz". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object..

The S3 path where the model.tar.gz is actually present is given below; my local time is PST:

s3://cloud-bucket-test/fraud-classifier/output/sagemaker-soln-fd1-rcf-2021-03-27-00-48-29-687/output/model.tar.gz.

How can I change the training container time? Please help.
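Not a confirmed fix, but one possible workaround is to re-attach the estimator to the training job that actually completed (using its real name, as seen in S3) before calling deploy:

    # Workaround sketch: bind the estimator to the completed training job by name.
    from sagemaker import RandomCutForest

    rcf = RandomCutForest.attach("sagemaker-soln-fd1-rcf-2021-03-27-00-48-29-687")
    rcf_predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")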

AttributeError when trying to set model attributes

Hello!

I'm trying to run through the "sagemaker_fraud_detection" notebook and I'm running into an issue when trying to set the 'content_type' and 'accept' attributes for the different predictors (Random Cut Forest, SMOTE).

Specifically, the commands with the issue:

rcf_predictor.content_type = 'text/csv'
rcf_predictor.serializer = csv_serializer
rcf_predictor.accept = 'application/json'
rcf_predictor.deserializer = json_deserializer

smote_predictor.content_type = 'text/csv'
smote_predictor.serializer = csv_serializer
smote_predictor.deserializer = None

Here is the error that I'm seeing:


AttributeError Traceback (most recent call last)
in
4
5 # Specify input and output formats.
----> 6 smote_predictor.content_type = 'text/csv'
7 smote_predictor.serializer = csv_serializer
8 smote_predictor.deserializer = None

AttributeError: can't set attribute

This issue seems to resolve itself with the random cut forest model but not with the SMOTE model.

Thanks in advance for any insights into this issue, and my apologies if I'm not doing something correctly.

Thanks!
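For anyone hitting the same error: with SageMaker Python SDK v2, content_type and accept became read-only properties on predictors, so a hedged sketch of the usual change is to assign serializer/deserializer objects instead:

    # Sketch of the SDK v2 way to configure input/output formats on predictors.
    from sagemaker.serializers import CSVSerializer
    from sagemaker.deserializers import JSONDeserializer

    rcf_predictor.serializer = CSVSerializer()        # sends text/csv
    rcf_predictor.deserializer = JSONDeserializer()   # accepts application/json

    smote_predictor.serializer = CSVSerializer()
    # leave smote_predictor.deserializer at its default, or set one as needed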

