

credit_card_fraud_detection_2023

AI CUP 2023 玉山人工智慧公開挑戰賽-信用卡冒用偵測

This method achieved 7th position on the private leaderboard (TEAM_4043).

Final competition slide [link]

Introduction

There is no new idea in this method: it just computes some rule-based features (basic statistics) and classifies them with an XGBoost classifier.

Note: I have not optimized the pipeline. The processing steps are very time-consuming and memory-hungry (>50 GB), so make sure you have a capable CPU, a GPU, and sufficient RAM before proceeding.

Preprocessed Data and Checkpoints

The preprocessed data and checkpoints are not currently available. If you need the checkpoints, please contact me.

  • Preprocessed table: ~14 GB
  • XGBoost models: ~400 MB per model

Data Preprocessing

Preprocessing concatenates the training, public_test, and private tables, then computes basic statistics for the "cano" and "chid" groups.
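A minimal sketch of this kind of group statistic, using pandas. The column "conam" and the toy values are hypothetical stand-ins for illustration; only "cano" and "chid" come from the description above.

```python
import pandas as pd

# Toy stand-ins for the three tables ("conam" is a hypothetical
# transaction-amount column).
train = pd.DataFrame({"cano": [1, 1, 2], "chid": [10, 10, 20], "conam": [100.0, 300.0, 50.0]})
public = pd.DataFrame({"cano": [2], "chid": [20], "conam": [150.0]})
private = pd.DataFrame({"cano": [1], "chid": [10], "conam": [200.0]})

# Concatenate all tables so the group statistics see every transaction.
df = pd.concat([train, public, private], ignore_index=True)

# Basic per-group statistics, e.g. per-card mean/std/count of the amount.
stats = (df.groupby("cano")["conam"]
           .agg(["mean", "std", "count"])
           .add_prefix("cano_conam_")
           .reset_index())
df = df.merge(stats, on="cano", how="left")
```

The same pattern repeats for the "chid" grouping and for whatever other aggregations are used as features.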

Note: Before preprocessing, place the tables in the tables directory, named training.csv, public.csv, and private.csv.

cd Preprocessing
python preprocessing.py -o output/preprocessing.csv

Note: This step could take several hours to complete.

Training

The model is XGBoost. The file model.py contains the model parameters, which are currently hard-coded.

python train.py \
    --input output/preprocessing.csv \
    --model_output_dir output/checkpoints/ \
    --thr_path config/your_thr.json \
    --epochs 300 \
    --runs 3 \
    --gpu 0
# --runs: number of models to train (for the ensemble)

Note: This step takes about 1 hour on a GPU.
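The per-model thresholds written to the JSON file are picked by a select_thr_f1-style search (the function name appears in the repository's issue discussion). A minimal reconstruction, not the repo's exact code, assuming a simple grid search over validation probabilities:

```python
import numpy as np

def select_thr_f1(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the probability threshold that maximizes F1 on validation data."""
    best_thr, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (y_prob >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = float(t), float(f1)
    return best_thr, best_f1

# Toy usage: maximizing F1 implicitly trades off precision against recall.
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
thr, best_f1 = select_thr_f1(y_true, y_prob)
```

Scanning thresholds rather than fixing 0.5 matters for imbalanced fraud data, where the positive class is rare and the default cutoff is rarely optimal.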

Inference

To run inference without preprocessing the data and training your own model, first download the preprocessed table and the model checkpoints. Then move them to the output directory.

python inference_submit.py \
    --input output/preprocessing.csv \
    --thrs config/thr.json \
    --ckpts output/checkpoints/ \
    --output submission.csv
# --thrs: the best thresholds of my models

Note: After inference, you must merge the predictions with the example submission file on "txkey" to get a correct submission.
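A minimal sketch of that merge step. The "pred" column name and the toy keys are hypothetical; only "txkey" comes from the text above.

```python
import pandas as pd

# The example submission provides the required txkey universe and order.
example = pd.DataFrame({"txkey": ["a", "b", "c"]})

# Our inference output may come back in a different order.
preds = pd.DataFrame({"txkey": ["c", "a", "b"], "pred": [1, 0, 0]})

# Left-merge onto the example file so every required txkey is present,
# then fill any txkey we did not score with the negative class.
submission = example.merge(preds, on="txkey", how="left")
submission["pred"] = submission["pred"].fillna(0).astype(int)
submission.to_csv("submission.csv", index=False)
```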

Final Competition (2023.12.02)

All the code can be found in the final_code directory.

Step 1: Preprocessing. Use training.csv (provided before the final competition), private_1.csv, and private_2_processed.csv to compute basic statistics for the "cano" and "chid" groups as features. Place these files in the tables directory.

However, the available CPU performance is insufficient to complete all the preprocessing, so we transform only the essential features (only the transformed numerical features).

cd final_code

python preprocess_numerical.py -o ../output/preprocessing_final.csv

Step 2: Training. Since the final-competition environment has no GPU, we change the XGBoost classifier's device to "cpu". You must also first turn off the subsample and sampling_method parameters in model.py, because they are only available on GPU.
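A sketch of the parameter change described above. The parameter names are standard XGBoost keys; the exact layout of model.py is an assumption, and the GPU values shown are illustrative.

```python
# GPU configuration used during development (illustrative values;
# sampling_method="gradient_based" requires a GPU in XGBoost).
gpu_params = {
    "device": "cuda",
    "subsample": 0.8,
    "sampling_method": "gradient_based",
}

# CPU configuration for the final competition: switch the device and
# drop the sampling options that the README says are GPU-only.
cpu_params = {k: v for k, v in gpu_params.items()
              if k not in ("subsample", "sampling_method")}
cpu_params["device"] = "cpu"
```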

python train_numerical.py \
    --input ../output/preprocessing_final.csv \
    --model_output_dir  ../output/checkpoints/ \
    --thr_path thr_final.json \
    --epochs 100 \
    --runs 3 \
    --gpu cpu

Step 3: Inference (ensemble of 3 models)

python inference_numerical.py \
    --input ../output/preprocessing_final.csv \
    --thrs thr_final.json \
    --ckpts ../output/checkpoints/ \
    --output submission.csv

Note: After step 3, you must merge the predictions with the example submission file on "txkey" to get a correct submission.
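The 3-model ensemble can be combined by majority voting over the per-model thresholded predictions. This is a reconstruction based on the issue discussion (which mentions that an odd run count makes voting convenient); the repository may instead average probabilities.

```python
import numpy as np

def ensemble_vote(prob_list, thr_list):
    """Majority vote over per-model thresholded predictions.

    prob_list: one probability array per trained run
    thr_list:  the matching per-model thresholds (e.g. from the JSON file)
    """
    votes = np.stack([(p >= t).astype(int) for p, t in zip(prob_list, thr_list)])
    # With an odd number of models a strict majority always exists.
    return (votes.sum(axis=0) * 2 > len(prob_list)).astype(int)

# Toy usage with 3 runs and identical thresholds.
probs = [np.array([0.9, 0.2, 0.6]),
         np.array([0.8, 0.4, 0.3]),
         np.array([0.7, 0.1, 0.55])]
thrs = [0.5, 0.5, 0.5]
final = ensemble_vote(probs, thrs)
```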


Issues

Question about model training

Hello,
Sorry to bother you again.

Two questions:
1. Why did you write the select_thr_f1 function yourself? Wouldn't simply maximizing the F1-score from the start be enough? What made you think of searching for a trade-off between recall and precision?

2. I noticed you train the model 3 times. Is that because an odd number of runs makes voting convenient, so you chose 3? Could 5 runs be even better? (I tried your code today, and one run does indeed give a lower F-score than three, but I haven't had time to try five yet.)

Thanks for your answers.

Some questions about the preprocessing

Hello,
I also took part in this year's E.SUN credit card fraud competition, though I ranked far down the leaderboard.

After reading your approach, I have two questions:

Question 1 (code location): Why did you fill missing values with -1? More precisely, many of the variables should really be categorical, so why did you handle them directly as numeric types such as int8 or int4? When I first loaded the dataset I wanted to treat them as categorical variables, although I admit I didn't know how to handle the missing values either.

Question 2: Following on from the first question, your feature engineering is quite impressive. How did you come up with this approach from scratch? Looking at those variables, the most I could think of was Box-Cox or a few other simple transforms. How did you think of so many feature transformations?

Thanks for your answers.
