ml-ids's Introduction

A machine learning based approach towards building an Intrusion Detection System

Problem Description

With the rising number of network-enabled devices connected to the internet, such as mobile phones, IoT appliances or vehicles, concern about the security implications of using these devices is growing. The increase in the number and types of networked devices inevitably leads to a wider attack surface, while the impact of successful attacks is becoming increasingly severe as more critical responsibilities are assumed by these devices.

To identify and counter network attacks, it is common to combine multiple systems in order to prevent attacks from happening or to detect and stop ongoing attacks that could not be prevented initially. These systems usually consist of an intrusion prevention system, such as a firewall, as the first layer of security, with intrusion detection systems representing the second layer. Should the intrusion prevention system be unable to prevent a network attack, it is the task of the detection system to identify malicious network traffic in order to stop the ongoing attack and keep the recorded network traffic data for later analysis. This data can subsequently be used to update the prevention system so that it can detect the specific network attack in the future. The need for intrusion detection systems is rising because absolute prevention of attacks is not possible given the rapid emergence of new attack types.

Even though intrusion detection systems are an essential part of network security, many detection systems deployed today have a significant weakness: they rely on signature-based attack classification, which can detect the most common known attack patterns but cannot detect novel attack types. To overcome this limitation, research in intrusion detection systems is focusing on more dynamic approaches based on machine learning and anomaly detection methods. In these systems the normal network behaviour is learned by processing previously recorded benign data packets, which allows the system to identify new attack types by analyzing network traffic for anomalous data flows.

This project aims to implement a classifier capable of identifying network traffic as either benign or malicious based on machine learning and deep learning methodologies.

Data

The data used to train the classifier is taken from the CSE-CIC-IDS2018 dataset provided by the Canadian Institute for Cybersecurity. It was created by capturing all network traffic during ten days of operation inside a controlled network environment on AWS, in which realistic background traffic was generated and different attack scenarios were conducted. As a result, the dataset contains both benign network traffic and captures of the most common network attacks. The dataset consists of the raw network captures in pcap format as well as csv files created using CICFlowMeter-V3, containing 80 statistical features of the individual network flows combined with their corresponding labels. A network flow is defined as an aggregation of interrelated network packets identified by the following properties:

  • Source IP
  • Destination IP
  • Source port
  • Destination port
  • Protocol
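
The five properties above can be read as a grouping key. The following is a minimal sketch of that idea, not the CICFlowMeter implementation: the `FlowKey` type, the `group_packets_by_flow` helper and the packet field names are all hypothetical. It also treats each direction as its own flow, whereas the forward/backward features in the dataset indicate that CICFlowMeter works with bidirectional flows.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The five properties that identify a network flow (hypothetical type)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP, 17 = UDP


def group_packets_by_flow(packets):
    """Group packet records (plain dicts) by their five-tuple flow key."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)
    return dict(flows)


# Illustrative packet records; the field names are assumptions for this sketch.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000,
     "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000,
     "dst_port": 80, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5353,
     "dst_port": 53, "protocol": 17, "length": 80},
]
flows = group_packets_by_flow(packets)
```

Statistical features such as packet counts or mean packet length would then be computed per group.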

The dataset contains approximately 16 million individual network flows and covers the following attack scenarios:

  • Brute Force
  • DoS
  • DDoS
  • Heartbleed
  • Web Attack
  • Infiltration
  • Botnet

Approach

The goal of this project is to create a classifier capable of categorising network flows as either benign or malicious. The problem is treated as a supervised learning problem using the labels provided in the dataset, which identify the network flows as either benign or malicious. Different approaches to classifying the data will be evaluated, formulating the problem either as a binary classification problem or as a multiclass classification problem, in the latter case differentiating between the individual classes of attacks provided in the dataset. A relevant subset of the features provided in the dataset will be used as predictors to classify individual network flows. Machine learning methods like k-nearest neighbours, random forest or SVM will be applied to the problem and evaluated first in order to assess the feasibility of using traditional machine learning approaches. Subsequently, deep learning models like convolutional neural networks, autoencoders or recurrent neural networks will be employed to create a competing classifier, as recent research has shown that deep learning methods are promising in the field of anomaly detection. The results of both approaches will be compared to select the best performing classifier.
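
The difference between the binary and the multiclass formulation comes down to how the dataset labels are encoded as targets. The sketch below illustrates this under stated assumptions: the helper names are hypothetical, the benign label is assumed to be spelled `Benign`, and the attack label strings are placeholders rather than the exact strings used in the dataset.

```python
BENIGN_LABEL = "Benign"  # assumed spelling of the benign label


def to_binary_target(label):
    """Binary formulation: 0 for benign flows, 1 for any attack."""
    return 0 if label == BENIGN_LABEL else 1


def build_class_index(labels):
    """Multiclass formulation: benign is class 0, attacks get ids 1..n."""
    attack_labels = sorted({lab for lab in labels if lab != BENIGN_LABEL})
    index = {BENIGN_LABEL: 0}
    for class_id, name in enumerate(attack_labels, start=1):
        index[name] = class_id
    return index


# Placeholder labels for illustration only.
labels = ["Benign", "DoS", "Botnet", "Benign", "DoS"]

binary_targets = [to_binary_target(lab) for lab in labels]
class_index = build_class_index(labels)
multiclass_targets = [class_index[lab] for lab in labels]
```

Either target vector can be passed to the same kind of estimator; only the number of output classes changes.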

Deliverables

The classifier will be deployed and served via a REST API in conjunction with a simple web application providing a user interface to utilize the API.

The REST API will provide the following functionality:

  • an endpoint to submit network capture files in pcap format. Individual network flows are extracted from the capture files and analysed for malicious network traffic.
  • (optional) an endpoint to stream continuous network traffic captures which are analysed in near real-time combined with
  • (optional) an endpoint to register a web-socket in order to get notified upon detection of malicious network traffic.

To further showcase the project, a testbed could be created against which various attack scenarios can be performed. This testbed would be connected to the streaming API for near real-time detection of malicious network traffic.

Computational resources

The requirements regarding the computational resources to train the classifiers are given below:

Category   Resource
CPU        Intel Core i7 processor
RAM        32 GB
GPU        1 GPU, 8 GB RAM
HDD        100 GB

Classifier

The machine learning estimator created in this project follows a supervised approach and is trained using the Gradient Boosting algorithm. Employing the CatBoost library a binary classifier is created, capable of classifying network flows as either benign or malicious. The chosen parameters of the classifier and its performance metrics can be examined in the following notebook.

Deployment Architecture

The deployment architecture of the complete ML-IDS system is explained in detail in the system architecture.

Model Training and Deployment

The model can be trained and deployed either locally or via Amazon SageMaker.
In each case the MLflow framework is utilized to train the model and create the model artifacts.

Installation

To install the necessary dependencies, check out the project and create a new Anaconda environment from the environment.yml file.

conda env create -f environment.yml

Afterwards activate the environment and install the project resources.

conda activate ml-ids

pip install -e .

Dataset Creation

To create the dataset for training use the following command:

make split_dataset \
  DATASET_PATH={path-to-source-dataset}

This command will read the source dataset and split the dataset into separate train/validation/test sets with a sample ratio of 80%/10%/10%. The specified source dataset should be a folder containing multiple .csv files.
You can use the CIC-IDS-2018 dataset provided via Google Drive for this purpose.
Once the command completes, a new folder named dataset is created that contains the split datasets in .h5 format.
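
To make the 80%/10%/10% ratio concrete, here is a minimal sketch of a shuffled index split using only the standard library. It is not the code behind split_dataset: the `split_indices` helper and its seed are hypothetical, and the sketch uses a plain random split, while the README does not state whether the project's split is stratified by label.

```python
import random


def split_indices(n_samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * ratios[0])
    n_val = round(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```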

Local Mode

To train the model in local mode, using the default parameters and dataset locations created by split_dataset, use the following command:

make train_local

If the datasets are stored in a different location or you want to specify different training parameters, you can optionally supply the dataset locations and a training parameter file:

make train_local \
  TRAIN_PATH={path-to-train-dataset} \
  VAL_PATH={path-to-val-dataset} \
  TEST_PATH={path-to-test-dataset} \
  TRAIN_PARAM_PATH={path-to-param-file}

Upon completion of the training process the model artifacts can be found in the build/models/gradient_boost directory.

To deploy the model locally the MLflow CLI can be used.

mlflow models serve -m build/models/gradient_boost -p 5000

The model can also be deployed as a Docker container using the following commands:

mlflow models build-docker -m build/models/gradient_boost -n ml-ids-classifier:1.0

docker run -p 5001:8080 ml-ids-classifier:1.0
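
Once the model is served locally, it can be queried over HTTP. The sketch below builds such a request with the standard library, assuming the MLflow scoring server's `/invocations` path and the `pandas-split` content type used in the REST API example of this README. The `build_invocation_request` helper is hypothetical, the two columns are placeholders (a real request needs the full feature list), and newer MLflow releases expect a different payload shape.

```python
import json
import urllib.request


def build_invocation_request(columns, rows, host="127.0.0.1", port=5000):
    """Build a POST request for a locally served MLflow model."""
    payload = {"columns": list(columns), "data": [list(row) for row in rows]}
    return urllib.request.Request(
        url=f"http://{host}:{port}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


# Placeholder columns; the deployed model expects every feature column.
request = build_invocation_request(["dst_port", "protocol"], [[80, 17]])

# Sending the request requires the server to be running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

For the Docker container started above, pass `port=5001` to match the published port.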

Amazon SageMaker

To train the model on Amazon SageMaker the following command sequence is used:

# build a new docker container for model training
make sagemaker_build_image \
  TAG=1.0

# upload the container to AWS ECR
make sagemaker_push_image \
  TAG=1.0

# execute the training container on Amazon SageMaker
make sagemaker_train_aws \
  SAGEMAKER_IMAGE_NAME={ecr-image-name}:1.0 \
  JOB_ID=ml-ids-job-0001

These commands require a valid AWS account with the appropriate permissions to be configured locally via the AWS CLI. Furthermore, AWS ECR and Amazon SageMaker must be configured for the account.

Using this repository, the manual invocation of the aforementioned commands is not necessary as training on Amazon SageMaker is supported via a GitHub workflow that is triggered upon creation of a new tag of the form m* (e.g. m1.0).

To deploy a trained model on Amazon SageMaker a GitHub Deployment request using the GitHub API must be issued, specifying the tag of the model.

{
  "ref": "refs/tags/m1.0",
  "payload": {},
  "description": "Deploy request for model version m1.0",
  "auto_merge": false
}
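
The request body above can be sent to GitHub's "create a deployment" endpoint (`POST /repos/{owner}/{repo}/deployments`). The following is a minimal sketch using only the standard library; the `build_deployment_request` helper, the owner and repository names, and the token value are placeholders and not taken from this project.

```python
import json
import urllib.request


def build_deployment_request(owner, repo, tag, token):
    """Build a GitHub deployment request for the given model tag."""
    body = {
        "ref": f"refs/tags/{tag}",
        "payload": {},
        "description": f"Deploy request for model version {tag}",
        "auto_merge": False,
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/deployments",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; substitute your own account, repository and token.
deploy_request = build_deployment_request(
    "your-account", "your-fork", "m1.0", "YOUR_TOKEN")

# Sending the request needs a token with permission to create deployments:
# with urllib.request.urlopen(deploy_request) as response:
#     print(response.status)
```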

This deployment request triggers a GitHub workflow, deploying the model to SageMaker. After successful deployment the model is accessible via the SageMaker HTTP API.

Using the Classifier

The classifier deployed on Amazon SageMaker is not directly available publicly, but can be accessed using the ML-IDS REST API.

REST API

To invoke the REST API the following command can be used to submit a prediction request for a given network flow:

curl -X POST \
  http://ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com/api/predictions \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json; format=pandas-split' \
  -H 'Host: ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com' \
  -H 'cache-control: no-cache' \
  -d '{"columns":["dst_port","protocol","timestamp","flow_duration","tot_fwd_pkts","tot_bwd_pkts","totlen_fwd_pkts","totlen_bwd_pkts","fwd_pkt_len_max","fwd_pkt_len_min","fwd_pkt_len_mean","fwd_pkt_len_std","bwd_pkt_len_max","bwd_pkt_len_min","bwd_pkt_len_mean","bwd_pkt_len_std","flow_byts_s","flow_pkts_s","flow_iat_mean","flow_iat_std","flow_iat_max","flow_iat_min","fwd_iat_tot","fwd_iat_mean","fwd_iat_std","fwd_iat_max","fwd_iat_min","bwd_iat_tot","bwd_iat_mean","bwd_iat_std","bwd_iat_max","bwd_iat_min","fwd_psh_flags","bwd_psh_flags","fwd_urg_flags","bwd_urg_flags","fwd_header_len","bwd_header_len","fwd_pkts_s","bwd_pkts_s","pkt_len_min","pkt_len_max","pkt_len_mean","pkt_len_std","pkt_len_var","fin_flag_cnt","syn_flag_cnt","rst_flag_cnt","psh_flag_cnt","ack_flag_cnt","urg_flag_cnt","cwe_flag_count","ece_flag_cnt","down_up_ratio","pkt_size_avg","fwd_seg_size_avg","bwd_seg_size_avg","fwd_byts_b_avg","fwd_pkts_b_avg","fwd_blk_rate_avg","bwd_byts_b_avg","bwd_pkts_b_avg","bwd_blk_rate_avg","subflow_fwd_pkts","subflow_fwd_byts","subflow_bwd_pkts","subflow_bwd_byts","init_fwd_win_byts","init_bwd_win_byts","fwd_act_data_pkts","fwd_seg_size_min","active_mean","active_std","active_max","active_min","idle_mean","idle_std","idle_max","idle_min"],"data":[[80,17,"21\\/02\\/2018 10:15:06",119759145,75837,0,2426784,0,32,32,32.0,0.0,0,0,0.0,0.0,20263.87212,633.2460039,1579.1859130859,31767.046875,920247,1,120000000,1579.1859130859,31767.046875,920247,1,0,0.0,0.0,0,0,0,0,0,0,606696,0,633.2460327148,0.0,32,32,32.0,0.0,0.0,0,0,0,0,0,0,0,0,0,32.0004234314,32.0,0.0,0,0,0,0,0,0,75837,2426784,0,0,-1,-1,75836,8,0.0,0.0,0,0,0.0,0.0,0,0]]}'
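
The request body follows the `pandas-split` layout: one list of column names and one list of rows in the same column order. As a rough sketch, flow records held as dictionaries could be converted into that layout as shown below. The `to_pandas_split` helper is hypothetical, and the record is truncated to a few fields from the example above, whereas a real request must carry all columns.

```python
import json


def to_pandas_split(records):
    """Convert a list of flow dicts into {"columns": [...], "data": [[...]]}."""
    if not records:
        raise ValueError("at least one record is required")
    columns = list(records[0].keys())
    data = []
    for record in records:
        if set(record) != set(columns):
            raise ValueError("all records must share the same columns")
        # Emit the values in the column order of the first record.
        data.append([record[column] for column in columns])
    return {"columns": columns, "data": data}


# Truncated record for illustration; values taken from the curl example.
flow = {"dst_port": 80, "protocol": 17, "flow_duration": 119759145,
        "tot_fwd_pkts": 75837}
body = json.dumps(to_pandas_split([flow]))
```

The resulting `body` string is what would be passed as the request payload.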

ML-IDS API Clients

For convenience, the Python clients implemented in the ML-IDS API Clients project can be used to submit new prediction requests to the API and receive real-time notifications on detection of malicious network flows.

ml-ids's People

Contributors

cstub

ml-ids's Issues

RecursionError

Hi there,

I got an error when I ran the command "make train_local":

python ./models/gradient_boost/envs/local/train.py
--train-path dataset/train.h5
--val-path dataset/val.h5
--test-path dataset/test.h5
--output-path build/models/gradient_boost
--param-path models/gradient_boost/training_params.json
2021/02/05 14:44:47 INFO mlflow.projects: === Created directory /tmp/tmp99qbsv62 for downloading remote URIs passed to arguments of type 'path' ===
2021/02/05 14:44:47 INFO mlflow.projects: === Running command 'source /root/miniconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70 1>&2 && pip install -e ../../../. && python train.py --train-path /data/mlids/ml-ids-master/dataset/train.h5 --val-path /data/mlids/ml-ids-master/dataset/val.h5 --test-path /data/mlids/ml-ids-master/dataset/test.h5 --output-path /data/mlids/ml-ids-master/build/models/gradient_boost --artifact-path /data/mlids/ml-ids-master/build/models/gradient_boost --use-val-set True --random-seed -1 --nr-iterations 10 --tree-depth 2 --l2-reg 4.813919374945952 --border-count 254 --random-strength 5 --task-type GPU --nr-samples-attack-category 10' in run with ID '8e89b8c786a6428e993655dd08b1b373' ===
Looking in indexes: https://mirrors.ustc.edu.cn/pypi/web/simple
Obtaining file:///data/mlids/ml-ids-master
Installing collected packages: ml-ids
Found existing installation: ml-ids 0.1
Uninstalling ml-ids-0.1:
Successfully uninstalled ml-ids-0.1
Running setup.py develop for ml-ids
Successfully installed ml-ids
/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/catboost/core.py:5: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/imp.py:342: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return _load(spec)
Using TensorFlow backend.
2021-02-05 14:44:48,206 - main - INFO - Loading datasets...
2021-02-05 14:44:48,212 - numexpr.utils - INFO - NumExpr defaulting to 6 threads.
2021-02-05 14:44:48,661 - main - INFO - Evaluation dataset will be used for early stopping.
2021-02-05 14:44:48,667 - main - INFO - Starting training...
2021-02-05 14:44:48,667 - ml_ids.models.gradient_boost.train - INFO - Training model with parameters [samples-per-attack-category=10, hyperparams=GradientBoostHyperParams(nr_iterations=10, tree_depth=2, l2_reg=4.813919374945952, border_count=254, random_strength=5, task_type='GPU')]
Traceback (most recent call last):
File "train.py", line 193, in <module>
train()
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "train.py", line 159, in train
random_seed=random_seed)
File "/data/mlids/ml-ids-master/ml_ids/models/gradient_boost/train.py", line 153, in train_model
X_train, y_train = preprocess_train_dataset(pipeline, train_dataset, nr_attack_samples, random_seed)
File "/data/mlids/ml-ids-master/ml_ids/models/gradient_boost/train.py", line 74, in preprocess_train_dataset
random_state=random_state)
File "/data/mlids/ml-ids-master/ml_ids/transform/sampling.py", line 31, in upsample_minority_classes
sample_dict[i] = max(counts[i], min_samples)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/series.py", line 1071, in __getitem__
result = self.index.get_value(self, key)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/indexes/category.py", line 548, in get_value
return series.take([indexer])[0]
...
...
...
return series.take([indexer])[0]
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/series.py", line 1071, in __getitem__
result = self.index.get_value(self, key)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/indexes/category.py", line 548, in get_value
return series.take([indexer])[0]
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/series.py", line 4447, in take
new_values, index=new_index, fastpath=True
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/series.py", line 321, in __init__
self._set_axis(0, index, fastpath=True)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/series.py", line 427, in _set_axis
is_all_dates = labels.is_all_dates
File "pandas/_libs/properties.pyx", line 34, in pandas._libs.properties.CachedProperty.__get__
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 1888, in is_all_dates
return is_datetime_array(ensure_object(self.values))
File "pandas/_libs/algos_common_helper.pxi", line 303, in pandas._libs.algos.ensure_object
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 1337, in __array__
ret = take_1d(self.categories.values, self._codes)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/algorithms.py", line 1653, in take_nd
elif is_interval_dtype(arr):
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 675, in is_interval_dtype
return IntervalDtype.is_dtype(arr_or_dtype)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 1110, in is_dtype
return super().is_dtype(dtype)
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/dtypes/base.py", line 256, in is_dtype
if isinstance(dtype, (ABCSeries, ABCIndexClass, ABCDataFrame, np.dtype)):
File "/root/miniconda3/envs/mlflow-d3cf0f0fbdbc8a62d4c965feb9e68417c7f69b70/lib/python3.7/site-packages/pandas/core/dtypes/generic.py", line 9, in _check
return getattr(inst, attr, "_typ") in comp
RecursionError: maximum recursion depth exceeded while calling a Python object

Then I ran the command again after changing the argument "n" in "import sys; sys.setrecursionlimit(n)". No matter what value "n" is set to, the error is still there. How can I fix it?

Thanks a lot.

Master project

Hello Dear

I need to implement an AI-IDS as a project for university. Can you help me, please?

Thank you in advance

Regards

Deployed model locally, what next?

Hi, is anyone able to use the model after deploying it locally? I successfully deployed the model locally, but I have no idea how to use it to scan traffic.

Error on running dl-classifier notebook code

Hello Christoph,

I ran into an issue when running the dl-classifier.ipynb. I used the data you suggested in your README and basically wanted to see the results before using it on another data source. I was hoping you might've run into this earlier and could shed some light on this issue.


The error message I got was: RecursionError: maximum recursion depth exceeded while calling a Python object.

The code causing this is in 2.1 Data loading and preperation
X_train, y_train, X_val, y_val, X_test, y_test, column_names = transform_data(
    dataset=dataset,
    imputer_strategy='median',
    scaler=StandardScaler,
    attack_samples=100000,
    random_state=rand_state)

This is defined in notebook_utils.py, in which line 79 is where the issue is caused.
X_train, y_train = upsample_minority_classes(
    X_train,
    y_train,
    min_samples=attack_samples,
    random_state=random_state)


I've tried debugging this for some time now and would find your input extremely valuable.

Thank you

real time detection

Hi, could you please tell me how to use your model in a real-time situation, including how to capture traffic and preprocess it?
