
ml4logs

Installation

  1. Clone the source: https://github.com/LogAnalysisTeam/ml4logs

  2. Activate your virtual environment (conda, venv).

  3. Either install the package as usual:

python setup.py install

or in development mode:

python setup.py develop

Usage

Various pipelines are run using the batch scripts in scripts/. We suggest running them via the Makefile:

make COMMAND_NAME

The scripts support the SLURM cluster batch scheduler. Set the ML4LOGS_SHELL environment variable to sbatch if you run the experiments on a cluster. See the RCI Quick Start for full details on how to set up the development environment.

If an init_environment.sh script exists in the project root directory, it is sourced (via the bash source command) prior to running any batch in scripts/. Use it to set up your virtual environment, scheduler modules, etc.

Run Benchmark on HDFS1 (100k lines)

  • make hdfs1_100k_data
  • wait until it finishes
  • make hdfs1_100k_preprocess
  • wait until it finishes
  • make hdfs1_100k_train_test

Run Benchmark on HDFS1

  • make hdfs1_data
  • wait
  • make hdfs1_preprocess
  • wait
  • make hdfs1_train_test

Results

The following table (generated using a script) shows the current log anomaly detection (LAD) method leaderboard for the HDFS1 dataset. The methods are sorted by decreasing F1 score.

Unsupervised/Semi-Supervised Methods

Method                          Preprocess          Precision  Recall  F1     MCC
PCA                             Drain3              0.849      0.809   0.828  0.824
Isolation Forest (sklearn)      Drain3              0.808      0.800   0.804  0.798
Local Outlier Factor (sklearn)  Drain3              0.429      0.928   0.587  0.616
Isolation Forest (sklearn)      fastText block-max  0.989      0.364   0.532  0.594
PCA                             fastText block-max  0.380      0.384   0.382  0.363
Local Outlier Factor (sklearn)  fastText block-max  0.258      0.014   0.027  0.055

Supervised Methods

Method                          Preprocess          Precision  Recall  F1     MCC
Decision Tree                   Drain3              0.997      0.999   0.998  0.998
Logistic Regression             Drain3              0.980      0.995   0.988  0.987
LSTM M2O                        fastText            0.992      0.471   0.639  0.678
Decision Tree                   fastText block-max  0.614      0.634   0.624  0.612
Logistic Regression             fastText block-max  0.911      0.420   0.575  0.612
Linear SVC                      fastText block-max  0.948      0.387   0.550  0.599
Linear SVC                      Drain3              1.000      0.230   0.375  0.475
LSTM M2M                        fastText            0.874      0.111   0.197  0.309
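For reference, the table's columns follow the standard binary-classification definitions. A minimal pure-Python sketch (the label and prediction vectors below are hypothetical, not taken from the experiments):

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, f1, mcc
```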

Notes:

  • Currently, only the LOF and IF methods on Drain3-preprocessed data have their meta-parameters tuned (using grid or random search). We found meta-parameter tuning extremely important. The results for other combinations of methods and preprocessing pipelines will follow soon.
  • All experiments above included time deltas merged with the rest of the features.
  • The features differ based on the selected preprocessing pipeline:
    • Drain3: Log keys are extracted, yielding per-block BOWs, which are in turn weighted using TF-IDF. While Drain3 currently gives the best results, its big disadvantage is that the fixed categorical distribution over the log keys does not allow log lines based on yet unseen templates to be processed.
    • fastText: Block log lines are represented as a sequence of 100-dimensional fastText embeddings.
    • fastText block-max: The same 100-dimensional fastText embeddings aggregated into a single 100-dimensional vector using max-pooling.
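The block-max aggregation above amounts to an element-wise max-pooling over a block's per-line embeddings. A minimal NumPy sketch with hypothetical 3-dimensional embeddings standing in for the 100-dimensional fastText vectors:

```python
import numpy as np

# Hypothetical per-line embeddings for one block: (n_lines, embedding_dim).
line_embeddings = np.array([
    [0.1, 0.9, -0.3],
    [0.4, 0.2,  0.7],
    [-0.5, 0.6, 0.0],
])

# "block-max": element-wise maximum over all log lines of the block,
# yielding a single fixed-size vector per block.
block_vector = line_embeddings.max(axis=0)
```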

Scripts and Configuration Files

  • Each script corresponds to a single pipeline config (see the configs/ directory)
  • Each config describes a sequential pipeline of actions applied to the data

data

  • Downloads archive.
  • Extracts archive.
  • Prepares the dataset:
    • TODO: ADD DETAILS HERE
    • Time deltas are computed. Time deltas measure the time differences between successive log lines.
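The time-delta computation can be sketched as a first difference over the parsed timestamps (a minimal NumPy sketch with hypothetical values; the first line has no predecessor, so its delta is set to 0 here):

```python
import numpy as np

# Hypothetical log-line timestamps in seconds, parsed from the raw logs.
timestamps = np.array([0.0, 0.5, 0.5, 2.0])

# Time delta of line i = timestamps[i] - timestamps[i - 1].
timedeltas = np.concatenate(([0.0], np.diff(timestamps)))
```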

drain_preprocess

  • Parses log keys (log templates) using IBM/Drain3.
  • Aggregates log lines by blocks, which correspond to the level at which anomaly labels are given.
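The per-block aggregation of Drain-parsed lines can be sketched as grouping event ids (log keys) by block into bags-of-words (a pure-Python sketch with hypothetical event ids; a TF-IDF weighting step would be applied on top of these counts):

```python
from collections import Counter

# Hypothetical Drain3 output: one event id (log key) per log line,
# plus the block each line belongs to.
event_ids = ["E1", "E2", "E1", "E3", "E2"]
blocks    = ["blk_1", "blk_1", "blk_2", "blk_2", "blk_2"]

# Per-block bag-of-words over event ids.
bows = {}
for blk, eid in zip(blocks, event_ids):
    bows.setdefault(blk, Counter())[eid] += 1
```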

fasttext_preprocess

  • Trains the fastText model.
  • Gets embeddings for all log lines.
  • Concatenates the embeddings with the time deltas.
  • Aggregates per-log line embeddings to per-block ones using the selected method (sum, average, min, max).
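The final aggregation step above can be sketched as a column-wise reduction selected by name (names and shapes are illustrative, not the pipeline's actual API):

```python
import numpy as np

# Map method names to column-wise reductions over one block's
# per-log-line feature rows, shape (n_lines, F + 1).
AGGREGATIONS = {"sum": np.sum, "average": np.mean, "min": np.min, "max": np.max}

def aggregate_block(line_features, method):
    """Reduce per-log-line features of one block to a single vector."""
    return AGGREGATIONS[method](line_features, axis=0)
```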

drain_loglizer

Trains and tests the models specified by loglizer on the Drain-parsed dataset. These are:

  • Logistic regression
  • Decision tree
  • Linear SVC
  • LOF
  • One class SVM
  • Isolation forest
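The supervised train/test step for these models can be sketched with scikit-learn (hypothetical 2-dimensional per-block feature vectors stand in for the real count vectors; 1 = anomalous block):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-block features and anomaly labels.
X_train = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
y_train = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict on unseen blocks; metrics are then computed from these.
predictions = clf.predict([[0.05, 0.05], [0.95, 0.95]])
```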

fasttext_loglizer

Trains and tests the loglizer-specified models on the aggregated fastText embeddings:

  • Logistic regression
  • Decision tree
  • Linear SVC
  • LOF
  • One class SVM
  • Isolation forest
  • PCA

fasttext_seq2seq

  • Trains and tests a sequential model as defined in [1].
  • Predicts the next log line embedding based on a history of log line embeddings.
  • Uses an LSTM-based Torch model.
  • Computes the threshold on the training dataset (assuming 5% of the logs are anomalies).
  • Tests different thresholds and saves the statistics.
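Under the 5%-anomaly assumption, the threshold step can be sketched as taking the 95th percentile of the per-line prediction errors on the training data (a minimal NumPy sketch with hypothetical error values):

```python
import numpy as np

# Hypothetical per-line prediction errors of the seq2seq model on the
# training data; the anomaly score of a line is its prediction error.
train_errors = np.array([0.1, 0.2, 0.15, 0.3, 5.0])

# Assuming ~5% of training logs are anomalous, flag the top 5%.
threshold = np.percentile(train_errors, 95)
is_anomaly = train_errors > threshold
```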

Results

TODO put result tables here

Data Files Description

Block-Level Labeled Datasets (e.g., HDFS)

N - Number of log lines
B - Number of blocks (e.g. blk_ in HDFS)
E - Number of event ids (e.g. extracted by drain)
F - Embedding dimension (e.g. fasttext)
data
├── interim
│   └── {DATASET_NAME}
│       ├── blocks.npy                  (N, )       Block ids
│       ├── fasttext-timedeltas.npy     (N, F + 1)  Fasttext embeddings with timedeltas
│       ├── fasttext.npy                (N, F)      Fasttext embeddings
│       ├── ibm_drain-eventids.npy      (N, )       Event ids
│       ├── ibm_drain-templates.csv     (E, )       Event ids, their templates and occurrences
│       ├── labels.npy                  (B, )       Labels (1 stands for anomaly, 0 for normal)
│       ├── logs.txt                                Raw logs
│       └── timedeltas.npy              (N, )       Timedeltas
├── processed
│   └── {DATASET_NAME}
│       ├── fasttext-average.npz        (B, F + 1)  Fasttext embeddings with timedeltas aggregated by blocks
│       └── ibm_drain.npz               (B, E)      Count vectors
└── raw
    └── {DATASET_NAME}
        ├── {ARCHIVE}.tar.gz
        └── Dataset specific files
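The .npy arrays in the layout above can be read back with NumPy; a minimal sketch (a tiny example array written to the current directory stands in for e.g. data/interim/{DATASET_NAME}/labels.npy):

```python
import numpy as np

# Stand-in for a dataset's labels.npy: (B,) labels, 1 = anomalous block.
np.save("labels.npy", np.array([0, 1, 0], dtype=np.int64))

labels = np.load("labels.npy")
anomaly_rate = labels.mean()  # fraction of anomalous blocks
```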

References

[1] M. Souček, "Log Anomaly Detection", master thesis, Czech Technical University in Prague, 2020.
