
T_Brain_Malware_Detection_Competition

This competition was held by TrendMicro, a well-known antivirus software company, on the data science competition platform T-Brain. T-Brain was established this year and aims to create a competitive data science environment in Taiwan. It plans to launch a series of competitions in the coming months. If you are interested, take a look at this promising platform!

Problem Statement

Malware detection is a crucial issue in the field of cyber security. Traditionally, antivirus vendors detect malware by matching known signatures, which costs a lot of time and computational resources. Therefore, TrendMicro attempts to use machine learning techniques to detect malware in time and reduce costs.
Given the query log over three months, the task is to build a predictive model that detects malware in a file-agnostic setting (i.e., without access to file content).

Dataset

  1. Query Log
    The query log contains 83,273,110 records of 81,977 unique FileIDs, with columns FileID, CustomerID, ProductID and QueryTime.
  2. Train
    The training dataset contains 52,559 unique FileIDs and their malware_or_not (target value) tags.
  3. Test
    The testing dataset contains 29,418 unique FileIDs.
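
The datasets above join on FileID. A minimal pandas sketch of assembling a labeled table (the tiny DataFrames stand in for the real competition files, whose exact file names are not given here):

```python
import pandas as pd

# Synthetic stand-ins for the real query log and training labels.
query_log = pd.DataFrame({
    "FileID": ["f1", "f1", "f2", "f3"],
    "CustomerID": ["c1", "c2", "c1", "c3"],
    "ProductID": ["p1", "p1", "p2", "p1"],
    "QueryTime": [100, 160, 200, 300],
})
train = pd.DataFrame({"FileID": ["f1", "f2"], "malware_or_not": [1, 0]})

# Attach the target to each query-log record; files that appear only
# in the test set drop out of the inner join.
labeled = query_log.merge(train, on="FileID", how="inner")
```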

Timeline

Starts at: Jan 22 2018
Closed on: Mar 23 2018

Measure

AUC (Area Under ROC Curve)
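
For illustration, AUC can be computed with scikit-learn (the labels and scores below are toy values):

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels and predicted malware probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random positive is ranked above
# a random negative; here 3 of 4 positive/negative pairs are ordered
# correctly, so AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
```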

Method

I generated per-file aggregated features from the query log, creating almost 300 variables in total.

Feature Engineering

  1. Frequency-based features:
    First, I counted frequencies grouped by FileID and another categorical variable. Then, I calculated aggregate features such as mean, variance, max and min on the frequency counts, grouped by FileID. For example, "groupby(['FileID','CustomerID']).count()" yields each file's customer usage distribution; taking the mean then gives the file's mean customer usage frequency.
  2. Time series features:
    I calculated aggregate features such as mean, variance, max and min on QueryTime, grouped by FileID and other categorical variables.
  3. Time difference features:
    I computed the time difference between each file's consecutive usages, and then calculated aggregate features such as mean, variance, max and min on the time differences, grouped by FileID and other categorical variables.
  4. Average response features:
    I generated users' average response rate within each cross-validation fold and found it to be a key feature.
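
The first three feature families above can be sketched in pandas (the toy log and the particular aggregations shown are illustrative choices, not the exact 300-variable set):

```python
import pandas as pd

# Toy query log with the columns from the dataset description.
log = pd.DataFrame({
    "FileID":     ["f1", "f1", "f1", "f2", "f2"],
    "CustomerID": ["c1", "c1", "c2", "c3", "c3"],
    "QueryTime":  [10, 40, 100, 5, 65],
})

# 1. Frequency-based: per-(FileID, CustomerID) counts, then aggregates per FileID.
freq = log.groupby(["FileID", "CustomerID"]).size().rename("cnt").reset_index()
freq_feats = freq.groupby("FileID")["cnt"].agg(["mean", "var", "max", "min"])

# 2. Time series: aggregates of QueryTime per FileID.
time_feats = log.groupby("FileID")["QueryTime"].agg(["mean", "var", "max", "min"])

# 3. Time difference: gaps between a file's consecutive queries, then aggregates.
log = log.sort_values(["FileID", "QueryTime"])
log["dt"] = log.groupby("FileID")["QueryTime"].diff()
dt_feats = log.groupby("FileID")["dt"].agg(["mean", "var", "max", "min"])

# One row per FileID, columns suffixed to keep the families apart.
features = freq_feats.join(time_feats, lsuffix="_freq", rsuffix="_time").join(dt_feats)
```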

Modeling

I used a stacking model to make the final prediction. In layer 1, XGBoost, LightGBM and random forest generate meta features on the train and test data. For the train data, I used 3-fold cross-validation to generate out-of-fold features fold by fold; for the test data, I averaged the three folds' predictions. In layer 2, a logistic regression model takes the meta features as input to classify the test data.
I used 3-fold cross-validation with grid search to train the four models. The hyper-parameters I tuned are as follows:

  1. XGBoost: n_iterations, max_depth, learning rate.
  2. LightGBM: n_iterations, max_depth, learning rate.
  3. Random Forest: n_iterations, max_depth, min_samples_leaf.
  4. Logistic Regression: C, penalty.
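
The stacking scheme above can be sketched with scikit-learn. For brevity, a random forest and a logistic regression stand in for the full XGBoost/LightGBM/random-forest layer, and the data and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the engineered feature table.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]
kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Layer 1: out-of-fold meta features on train; fold-averaged predictions on test.
meta_train = np.zeros((len(X_train), len(base_models)))
meta_test = np.zeros((len(X_test), len(base_models)))
for j, model in enumerate(base_models):
    test_fold_preds = []
    for tr_idx, va_idx in kf.split(X_train, y_train):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        meta_train[va_idx, j] = model.predict_proba(X_train[va_idx])[:, 1]
        test_fold_preds.append(model.predict_proba(X_test)[:, 1])
    meta_test[:, j] = np.mean(test_fold_preds, axis=0)

# Layer 2: logistic regression on the meta features.
stacker = LogisticRegression()
stacker.fit(meta_train, y_train)
final_pred = stacker.predict_proba(meta_test)[:, 1]
```

Generating the train-side meta features out-of-fold is what keeps layer 2 from overfitting to layer 1's in-sample predictions.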

Result

My team name is BigPikachu. We placed 4th on the public leaderboard (AUC = 0.962997) and 7th on the private leaderboard (AUC = 0.967284).

Improvement

  1. More detailed variables:
    Maybe I should put more effort into EDA to understand the data trends along each dimension. Then I could generate features such as "count the number of times a FileID's time difference exceeds 300 seconds" to capture important trends.
  2. Matrix factorization methods:
    Referring to other competitors' solutions, I could try ALS and FFT to generate key features, which would simultaneously reduce dimensionality and capture File-Customer, File-Time or File-Product relationships.
  3. Dimension reduction:
    Although my score was higher on the private leaderboard, my rate of improvement was not as good as others'. I think that is because my model learned the training data too well, indicating slight over-fitting. Although I tried an autoencoder for dimension reduction, my local CV score did not improve. I should train my autoencoder for more iterations (beyond my previous 500) to let it converge.
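
One hedged sketch of the matrix-factorization idea: build a sparse File x Customer count matrix and reduce it with truncated SVD (a stand-in for ALS; the toy data and component count are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy query log; the real one has ~83M rows.
log = pd.DataFrame({
    "FileID": ["f1", "f1", "f2", "f3", "f3", "f3"],
    "CustomerID": ["c1", "c2", "c2", "c1", "c2", "c3"],
})

# Sparse File x Customer co-occurrence counts.
files = pd.Categorical(log["FileID"])
custs = pd.Categorical(log["CustomerID"])
mat = csr_matrix(
    (np.ones(len(log)), (files.codes, custs.codes)),
    shape=(len(files.categories), len(custs.categories)),
)

# Low-dimensional per-file embeddings capturing File-Customer structure,
# usable directly as model features.
svd = TruncatedSVD(n_components=2, random_state=0)
file_vecs = svd.fit_transform(mat)
```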

Reference

  1. T-Brain: Malware Detection: https://tbrain.trendmicro.com.tw/Competitions/Details/1
  2. Autoencoder: https://www.kaggle.com/deepspacelearning/autoencoder-for-dimensionality-reduction
  3. Stacking Model Introduction: http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
