
deeplog's Introduction

DeepLog

A PyTorch implementation of DeepLog's log-key anomaly detection model.

If you are confused about how to extract log keys (i.e., log templates), I recommend Drain, which was proposed in this paper. As far as I know, it is the most effective log-parsing method. By the way, there is a toolkit and a set of benchmarks for automated log parsing in this repository.

Requirement

  • python>=3.6
  • pytorch==1.4
  • tensorboard==2.0.2

Dataset

The dataset can be downloaded HERE. That website can no longer be accessed, but you can find the HDFS data in this repository.

The original HDFS logs can be found HERE (http://people.iiis.tsinghua.edu.cn/~weixu/sospdata.html).

Visualization

Run the following command in a terminal, then navigate to http://localhost:6006.

tensorboard --logdir=log

Reference

Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar. "DeepLog: Anomaly detection and diagnosis from system logs through deep learning." ACM SIGSAC Conference on Computer and Communications Security (CCS), 2017.

Contributing

If you have any questions, please open an issue.

Pull requests implementing the rest of the paper are welcome!

deeplog's People

Contributors

asdqwqqq, gutjuri, wuyifan18


deeplog's Issues

Anomaly Detection

A more typical use case would be to have the normal and abnormal test data mixed together as one dataset. How would one identify the anomalies (abnormal data) from the predictions of the fitted model? As it stands, the prediction is run individually on each line in test_xx_loader - and is only really useful for calculating metrics. How would you map this back to the original dataset? How would we calculate a row-wise loss or 'anomaly score' for each sequence (row) in the test data? For example, I would like to say line number 5 in the test data was an anomalous/abnormal sequence.
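For what it's worth, here is a minimal sketch of one way to get a per-row anomaly score, assuming the repository's trained model and its window_size / num_candidates settings; score_session is a hypothetical helper, not part of the code:

import torch

# Score one session (a list of 0-based log keys) with a trained DeepLog model.
# A window is a "miss" when the true next key is not among the top
# num_candidates predictions; the session's anomaly score is its miss rate,
# and any miss marks the whole session (i.e. that input row) as anomalous.
def score_session(model, session, window_size=10, num_candidates=9, device='cpu'):
    misses, windows = 0, 0
    for i in range(len(session) - window_size):
        seq = torch.tensor(session[i:i + window_size], dtype=torch.float,
                           device=device).view(1, window_size, 1)
        label = session[i + window_size]
        with torch.no_grad():
            output = model(seq)
        predicted = torch.argsort(output, 1)[0][-num_candidates:]
        windows += 1
        if label not in predicted:
            misses += 1
    return misses / windows if windows else 0.0

Keeping the row index while iterating then maps any flagged session back to its line in the original file, e.g. for row_idx, session in enumerate(sessions): report row_idx whenever score_session(model, session) > 0.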

Question regarding the predicted variable

Yifan,

Source: LogKeyModel_predict.py

In the code below, can you please explain the difference between the output and predicted variables? Is output the same as predicted, except for being sorted into tensors? Also, shouldn't the value of the predicted variable be something binary, so that we can determine whether the predicted outcome is anomalous or not?

output = model(seq)
predicted = torch.argsort(output, 1)[0][-num_candidates:]
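For context, one way to read those two lines (assuming, as in this repository's model, that output holds one raw, unnormalized score per log-key class):

output = model(seq)  # shape (1, num_classes): one logit per candidate next log key
predicted = torch.argsort(output, 1)[0][-num_candidates:]  # indices of the num_candidates highest-scoring keys
# The binary decision is made afterwards: the window counts as normal if the
# true next key appears in `predicted`, and anomalous otherwise.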

Thanks,
Deep

CPU for inference?

Could you please advise why you are using the CPU for inference and not the GPU?

Case of single log message in a process.

When the data contains sequences shorter than a typical window size of 3, those sequences are classified as anomalies even though they are already present in the training data.
Making the window size smaller than 3 doesn't make sense for syslog data, but some processes emit only 1, 2, or 3 logs over time, and the LSTM flags them as anomalies.

How can we handle these sequences without generating more false positives?
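One workaround, sketched below on the assumption that sessions shorter than the window are padded with a sentinel key of -1 at prediction time (as discussed in another issue on this tracker), is to apply the same padding when building the training windows, so that known-normal short sessions are learned rather than flagged; pad_session is a hypothetical helper:

# Pad a session shorter than window_size with the sentinel key -1, mirroring
# what the prediction script does, so a short session still yields at least
# one (window, label) pair. Training on these padded windows as well keeps
# known-normal short sessions from being flagged as anomalies.
def pad_session(session, window_size=10, pad_key=-1):
    if len(session) < window_size + 1:
        session = session + [pad_key] * (window_size + 1 - len(session))
    return session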

Hi Wuyifan

Hi,
Please let me know the order in which to run the three given Python files, as well as what inputs need to be given and what output is expected at each step of execution.

Thanks
Hamad

Final Trained Model application on test data

Yifan,

I have one question about your final model. Apart from seeing the precision, recall, and F measure, where can I actually get the predicted results of the anomalies by applying the trained model on the target dataset? In a nutshell, I would want to see which logs from the test data are anomalous, and which are not.

Thanks,
Deep

Reproducing from the HDFS logs including parsing and encoding

The data in this repo is already encoded. I tried looking through the other issues to get an understanding of how to reproduce the results using the original HDFS dataset and haven't been able to understand what to do.

I understand that the data needs to be parsed and encoded, and that Drain is a recommended tool for parsing. From there, it isn't clear whether that is actually the tool that was used, or which part or parts of the parsed data to use. The conclusion of the paper says: "DeepLog learns and encodes entire log message including timestamp, log key, and parameter values." I am unsure whether that is also what was done for this implementation.

conversion of system log

How can I use my own system logs? I am not sure how to convert them into the data model.
Any help would be appreciated.

How to convert the parsed data to training data?

Thank you so much for your efforts!
Would you please help with this issue? I think it is somewhat of a repeat, but I wanted to clear up the confusion.
After we get the logs, I used a SPELL parser and ended up with a dictionary that looks like this:

{ "0": {"log_id": [10,2,3], "abstraction": "......"},
  "1":{"log_id": [0,2,7,9,10], "abstraction": "......"}
...
....
}

I used another SPELL parser and got this result:

0[]
1[]
0[['203519'], ['/10.250.10.6:40524'], ['/10.250.10.6:50010']]
0[['203519', '145'], ['/10.250.14.224:42420'], ['/10.250.14.224:50010']]
...
...

What is the next step? I mean, how do I come up with the training data now, as well as test_normal and test_abnormal?
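In case it helps, here is a rough sketch of that next step, assuming the parser can emit one (block id, log key id) pair per log line; write_sessions and the variable names are illustrative, not the repository's tooling:

from collections import defaultdict

# parsed_lines: iterable of (blk_id, key_id) pairs in original log order.
# abnormal_blocks: set of block ids labeled anomalous in the HDFS ground truth.
def write_sessions(parsed_lines, abnormal_blocks):
    sessions = defaultdict(list)
    for blk_id, key_id in parsed_lines:
        sessions[blk_id].append(str(key_id))
    normal = [' '.join(keys) for blk, keys in sessions.items() if blk not in abnormal_blocks]
    abnormal = [' '.join(keys) for blk, keys in sessions.items() if blk in abnormal_blocks]
    with open('hdfs_test_abnormal', 'w') as f:
        f.write('\n'.join(abnormal) + '\n')
    with open('hdfs_test_normal', 'w') as f:
        f.write('\n'.join(normal) + '\n')
    # A subset of the normal sessions would then be split off into hdfs_train.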

How to get log key?

I read this paper recently. Before DeepLog can run, logs need to be parsed into log keys, so I want to know: are there any tools to parse logs, or how do you parse logs into log keys? If you know of any other tools or source code, please tell me. Thanks a lot!

Training Sequence Size

Yifan,

Is there a limit to the length of the training data log sequence? Can I have a training data sequence as small as the following?

9, 7, 8
8, 9, 7, 8, 4, 7
9, 7, 6, 9, 2
.
.
.
.
7, 8, 9, 5, 8

or a training data sequence as large as the following?

6, 8, 5, 9, 2, 9, ...... 1024 sequences
9, 8, 4, 7,
6, 1, 5, 6, 2, 9, ...... 1004 sequences
.
.
.
6, 10, 5, 9, 3, 9, ...... 1059 sequences

In the latter case, I see that deeplog is misclassifying the sequence in most cases.

Do you know whether the number of sequences has anything to do with the classification results?

Thanks,
Deep

TensorFlow Implementation

First of all, I want to thank @wuyifan18 for this awesome code. I am trying to translate this Pytorch code into Tensorflow and I am having some trouble. I am wondering if anyone here has done it or can take a look at my implementation.

I have reproduced the Model class with something that I think is equivalent:

import numpy as np
import tensorflow as tf
from tensorflow import keras

window_size = 10
input_size = 1
hidden_size = 64
num_layers = 2
num_classes = 28
num_epochs = 20
num_candidates = 9
batch_size = 2048
model_path = 'model/'
log = 'TFAdam_batch_size=' + str(batch_size) + ';epoch=' + str(num_epochs)


modeltf = keras.models.Sequential([
  keras.layers.LSTM(hidden_size, return_sequences=True,
                         batch_input_shape=(batch_size, 10, 1)),
  keras.layers.LSTM(hidden_size, return_sequences=False),
  keras.layers.Dense(num_classes),
])

modeltf.compile(optimizer=keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])

# slight modification of the generate function, keeping numpy arrays instead of pytorch tensors
def generate(name):
    num_sessions = 0
    inputs = []
    outputs = []
    with open('data/' + name, 'r') as f:
        for line in f.readlines():
            num_sessions += 1
            line = tuple(map(lambda n: n - 1, map(int, line.strip().split())))
            for i in range(len(line) - window_size):
                inputs.append(line[i:i + window_size])
                outputs.append(line[i + window_size])
    print('Number of sessions({}): {}'.format(name, num_sessions))
    print('Number of seqs({}): {}'.format(name, len(inputs)))
    #dataset = [tf.convert_to_tensor(inputs), tf.convert_to_tensor(outputs)]
    dataset = [np.array(inputs), np.array(outputs)]
    return dataset

seq_dataset = generate('hdfs_train')
trainX = seq_dataset[0]
trainY = seq_dataset[1]

# I am dropping the last sessions, as the fixed batch_input_shape only accepts an input size that is a multiple of the batch size
trainX = trainX[:45056]
trainY = trainY[:45056]

trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))

modeltf.fit(trainX, trainY, epochs=num_epochs, batch_size=batch_size)

The problem is that my results are not good; they do not resemble at all what the PyTorch code achieves. This can be seen directly in the loss and accuracy the model reports during training: the loss gets stuck at around 3. I have tried batch normalization without much success, and it is really bugging me, because in PyTorch it works perfectly. I think it might be something subtle, but I am not able to find it.

If anyone here has done it or wants to take the time to have a look, I will be very grateful :)
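One subtle difference worth checking (an educated guess rather than a confirmed diagnosis): PyTorch's CrossEntropyLoss consumes raw logits, while the string loss 'sparse_categorical_crossentropy' in Keras defaults to from_logits=False and therefore expects softmax probabilities, yet the final Dense layer above has no activation. Compiling with the loss object and from_logits=True would make the two setups match:

from tensorflow import keras

modeltf.compile(optimizer=keras.optimizers.Adam(),
                # Dense(num_classes) outputs raw logits, so say so explicitly;
                # the string form of the loss assumes probabilities instead.
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

If this is the mismatch, note that chance-level loss for 28 classes is ln(28) ≈ 3.33, which is consistent with the loss being stuck around 3.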

Procedure for DeepLog

Hi,

I want to make sure I get the procedure to implement DeepLog correct. Here's what I'm thinking. Given train log data and test log data, do the following:

  1. Run Spell on the train and test log data to get log keys and features.
  2. Sort the outputs from Spell into different sessions or blocks for both the train and test data.
  3. Take only the log keys and put them into a file, with each row representing one session or block of log output. Do this for both train and test data.
  4. Take the sessions from the train data that do not contain errors and train DeepLog on them.
  5. Run the test data through the model to make predictions.

Can someone confirm whether this thinking is correct?

update model if false positive

Hello,
In the DeepLog paper, it is said that when false positives are found, we must update the workflow.
It doesn't look to me like that works here; am I wrong?
P.S.: Thank you for this complex algorithm.

Two concerns about DeepLog

I have two questions / concerns about DeepLog.

Question 1:
The example of false positive detection in section 3.3 of the paper is as follows:
{k1, k2, k3 -> k1} and {k1, k2, k3 -> k2}. The former was trained on normal logs, and the latter is a false positive in the detection phase. Online updating of the model with the latter then fixes the issue.

In the real world, there are cases where the sequence pattern itself (without considering the target) is abnormal, and such sequences of course never appear in the normal training dataset.
E.g., suppose h=3 and {k1, k3, k2} is a known abnormal sequence.
In this case the detection result will not be reliable; e.g., the probability of {k1, k3, k2 -> k*} at the output layer might be high.

How does DeepLog handle this abnormal sequence pattern?

Question 2:
For the online update of the model, it looks like the DeepLog authors used a small training set that includes the FP instances for the update. My concern is that if I use a large training set to update the model, I will hit the catastrophic forgetting issue that incremental learning in general faces. However, it is not a big issue, as an offline retrain from scratch on the accumulated dataset is always a fallback.

How does DeepLog handle the catastrophic forgetting issue in the incremental model update?

Question about the reproducible results?

I notice your repo uses the same parameters as the original paper, but in my experiment the resulting F1-measure of this repo is only about 89%, which is not as good as the 96% in the paper.
What is your experimental result, @wuyifan18?
Thank you!

Missing License

It would be great to add a license file such as Apache v2.0, MIT, or BSD. This would make it clear how your code can be used, and is required for other codebases to properly call your code / import / include it.

RuntimeError: required keyword attribute 'name' has the wrong type

runfile('/Users/mandevay/LogKey_Model_Train.py', wdir='/Users/mandevay')
Number of sessions(hdfs_train.txt): 4856
Number of seqs(hdfs_train.txt): 46573
Traceback (most recent call last):

File "/Users/mandevay/LogKey_Model_Train.py", line 99, in
writer.add_graph(model, seq)

File "/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 707, in add_graph
self._get_file_writer().add_graph(graph(model, input_to_model, verbose))

File "/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 295, in graph
list_of_nodes = parse(graph, trace, args)

File "/anaconda2/envs/py36/lib/python3.6/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 219, in parse
attr_name = node.s('name')

RuntimeError: required keyword attribute 'name' has the wrong type

Settings:
python==3.6
pytorch==1.4.0
tensorboard==2.0.0

When I first ran the code, I ran into this error and could not find any solution via Google. Could you give me any suggestions? Thank you.

Online learning support

Hi,
thanks for the great work.
I read in the paper that there is strong support in their algorithm for online learning, meaning the model can update itself in an "online stream mode" when a new log arrives. However, I can't see any mention of this functionality in the code.
Can anyone help me clarify this issue?
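As far as I can tell, the code indeed only covers offline training; in the paper, the online update amounts to a few additional gradient steps on the user-confirmed false-positive sequence. A minimal sketch of what that could look like with this model (online_update is hypothetical, and the hyperparameters are illustrative):

import torch
import torch.nn as nn

# Incrementally update a trained model on one false-positive session,
# i.e. a list of 0-based log keys that a user has confirmed as normal.
def online_update(model, fp_session, window_size=10, lr=1e-4, steps=5, device='cpu'):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for i in range(len(fp_session) - window_size):
            seq = torch.tensor(fp_session[i:i + window_size], dtype=torch.float,
                               device=device).view(1, window_size, 1)
            label = torch.tensor([fp_session[i + window_size]], device=device)
            optimizer.zero_grad()
            loss = criterion(model(seq), label)
            loss.backward()
            optimizer.step()
    model.eval()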

parameter value vector anomaly detection

Hi, first of all thank you so much for your work! I am trying to implement the second part of the DeepLog model, concerning parameter value vector anomaly detection, and to test it on the HDFS dataset used in this code. According to the paper, the problem is reduced to multivariate time-series anomaly detection, but I don't understand how to build the vector when the parameter is text, such as a file id. Can anyone help me clarify this issue?

Thank you.
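For reference, the paper's second model treats the parameter value vectors of each log key as a multivariate time series: an LSTM predicts the next vector, and the prediction's mean squared error is compared against a Gaussian fitted to errors on held-out normal data. Identifier-like text such as a file id is generally not encoded as a value; only numeric fields (e.g. durations, sizes) and the time elapsed since the previous entry go into the vector. A rough sketch, not part of this repository:

import torch
import torch.nn as nn

# Forecast the next parameter value vector for one log key.
class ParamModel(nn.Module):
    def __init__(self, num_params, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(num_params, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_params)

    def forward(self, x):                  # x: (batch, history_len, num_params)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])      # predicted next vector

# Detection sketch: fit the mean and std of the validation-set MSEs, then flag
# a new vector whose error lies outside mean +/- k * std (k chosen on validation).
def is_anomalous(mse, mean, std, k=3.0):
    return abs(mse - mean) > k * std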

Model Performance is just 88%

I trained the model using the parameters in the paper for 300 epochs, but the final training accuracy is just 88%, and the testing F1-score is also 88%. Could you please offer some ideas? Has anyone achieved higher performance?

Difference between train data and test_normal data

Hi Wuyifan,

I am able to convert structured logs to numbers in the format of your data files, and when I feed the train data to LogKeyTrain, a model is generated. In the detection stage, the question is what the test_normal data should be. I tried providing a set of logs with similar log keys but different blk_ids, and the results were not that good. If I provide the train data as test_normal, I get 100% accuracy, which is to be expected.
Could you guide me on how the test_normal data should be constructed?

Great help.
Thank you.

multiple block_ids in sorted.log

Hello,

I have noticed that in the raw log file (sorted.log), which you used to generate the txt files with the sessions, there are lines that contain more than one block_id (ranging from 2 up to 100 block_ids per line).

An example of a template containing multiple block_ids is '<>BLOCK ask <> to delete <>'.

For these lines we cannot know which of the block_ids is the actual id used for the given session and which are "noise" (if we can use that term).

How did you tackle this issue?

Did you, for example, delete such lines from your structured file, or, wherever applicable, did you assign the given event_id (the action described in the line) to the sessions reflected by the multiple block_ids? (In that case the number of sessions would remain the same, but their lengths would be slightly longer.)

I appreciate your time and thank you very much.

Test Dataset Question

Yifan,

Can you please let me know how you created the samples for the hdfs_test_normal and hdfs_test_abnormal test datasets? Was the sampling of the normal and abnormal test datasets done manually? I guess I am misunderstanding something here. If I have to sample the normal and abnormal datasets manually on data of such a large scale, that can be a really tedious task.

Thanks in advance for your help.

Thanks,
Deep

One-hot encoding

Hello @wuyifan18 ,
First, thank you for your implementation!
I'm wondering why you're not using one-hot encoding for the log keys in the input, as these are categorical variables (in addition, the DeepLog paper says that they did this).
You are instead doing an ordinal encoding by mapping each log key to an integer.
Thank you for your answer!
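For illustration, switching to one-hot inputs would mean feeding vectors of length num_classes instead of scalar key ids; a small sketch assuming the repository's (batch, window, feature) input layout:

import torch
import torch.nn.functional as F

keys = torch.tensor([[5, 2, 7]])                   # (batch, window) of 0-based log keys
onehot = F.one_hot(keys, num_classes=28).float()   # (batch, window, 28) instead of (batch, window, 1)
# The LSTM's input_size would then be num_classes rather than 1.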

Converting text to numbers

Hi,
Thanks for providing such a complete algorithm. I tested it using everything that was provided, and it worked great!

However, I am trying to use it for a school project (we are receiving logs from an electronic card). The logs I have are structured (we have already preprocessed the data), but I really struggle to convert the logs to numbers (converting a text file to the HDFS format). Therefore, I cannot train the algorithm on my data.
I know this question has been asked before, but I really do not understand how it is done, even after reading the paper... Any help is appreciated!

Cheers,

The processed dataset

Hi, I found that the dataset in the ZIP I cloned, which is parsed from the raw HDFS log data, is different from what I processed. I processed the same raw log data according to the log templates produced by the method called Spell, but I found that the log sequences differ from those in the "hdfs_train" file and the other three files. I also found that there are actually 33 log keys rather than 29. Did you filter the dataset? It has confused me for days.

Number of sessions in test data

Thanks for sharing this code, firstly. But I have a question about the number of sessions in the test data. To my understanding, each row in both the train and test data represents one session or one block, so the numbers of sessions in hdfs_test_normal / hdfs_test_abnormal should be 553365 / 16838 respectively. But after running "LogKeyModel_predict.py", it shows the following output:

Number of sessions(hdfs_test_normal): 14177
Number of sessions(hdfs_test_abnormal): 4123

I wonder whether my understanding is wrong, or whether the output is wrong.

I will always be grateful for your or others' replies.
Thank you again.

Input dimension of Deeplog

Hi, wuyifan.
I am confused about the input dimension of DeepLog in your code. It seems that the input dimension is (batch_size, sequence_length, input_size), but according to the documentation the input dimension is (sequence_length, batch_size, input_size). So I don't know whether I have misunderstood something.
Any help would be appreciated.
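For reference, PyTorch's nn.LSTM expects (sequence_length, batch_size, input_size) only by default; constructing it with batch_first=True switches the expected layout to (batch_size, sequence_length, input_size), which would explain the shape used in this code:

import torch.nn as nn

# With batch_first=True the LSTM expects (batch, seq_len, input_size),
# matching inputs shaped (batch_size, window_size, 1).
lstm = nn.LSTM(input_size=1, hidden_size=64, num_layers=2, batch_first=True)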

About the baseline method mentioned in the Deeplog paper: n-gram

The n-gram method mentioned in the paper as a baseline also predicts log keys well, but the paper only describes the prediction algorithm for log keys; it does not explain how the n-gram model is trained, whether anything like backpropagation applies, or how its prediction accuracy gradually improves. I am therefore opening this for discussion.
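For discussion: an n-gram model is not trained by backpropagation at all. It is a table of counts, and "training" is a single pass that tallies how often each history of h keys is followed by each candidate next key; prediction then ranks candidates by count. A minimal sketch (fit_ngram and top_candidates are illustrative names):

from collections import defaultdict, Counter

# Count-based n-gram next-key predictor with history length h.
def fit_ngram(sessions, h=3):
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - h):
            counts[tuple(session[i:i + h])][session[i + h]] += 1
    return counts

# Return the g most frequent next keys seen after this history.
def top_candidates(counts, history, g=9):
    return [key for key, _ in counts[tuple(history)].most_common(g)]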

Error

Hi everybody,

I tried to run "LogKeyModel_Train.py" but I get the following error: FileNotFoundError: [Errno 2] No such file or directory: 'model/Adam with batch_size=2048;epoch=300.pt'.
Does anyone know what is wrong, please?

Thanks in advance.

Error executing LogKeyModel_train.py

(base) jay@pop-os:~/git/DeepLog$ python ./LogKeyModel_train.py
Number of sessions(hdfs_train): 4855
Number of seqs(hdfs_train): 46575
Traceback (most recent call last):
File "./LogKeyModel_train.py", line 68, in
writer = SummaryWriter(log_dir='log/' + log)
File "/home/jay/.local/lib/python3.6/site-packages/tensorboardX/writer.py", line 254, in init
self._get_file_writer()
File "/home/jay/.local/lib/python3.6/site-packages/tensorboardX/writer.py", line 310, in _get_file_writer
self.file_writer = FileWriter(logdir=self.logdir, **self.kwargs)
TypeError: init() got an unexpected keyword argument 'log_dir'

Change the line in test_data

Hi @wuyifan18, thank you for the great tool. It performs very well, but I have a question about when I should start a new line in the data files, such as hdfs_test_abnormal.

I have read DeepLog and Drain, and I know the numbers in the test data represent log events, but I still do not understand when to break to a new line. Any suggestion is highly appreciated.

Thanks.

Raw log data format

Hi,

I am applying DeepLog to firewall logs. Does the raw log data need to abide by a certain format for log parsers such as Spell or Drain?

Thanks,
Deep

Update model - change of num_classes

Hi Wuyifan

I am training on my own data using your approach.
The model works fine when tested, but gives some false positives for sequences not present in the training dataset.
Hence, to improve accuracy, I am trying to load this pretrained model and train it again with new data (i.e., update the pretrained model with a new dataset).
This retraining builds a new model with more num_classes, so I am not able to combine it with, or append it to, the pretrained model.
It also doesn't seem to detect the sequences from the pretrained model (it overwrites the pretrained model).
What steps can I follow to use both the pretrained model and the newly trained data for further testing?
Any help appreciated :)
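One approach worth trying, sketched under the assumption that the model's output layer is a plain nn.Linear named fc as in this repository: construct a new model with the larger num_classes, copy every pretrained weight across, leave only the output rows for the new keys at their fresh initialization, and then fine-tune on a mix of old and new sessions so the old patterns are not forgotten. expand_model is a hypothetical helper:

import torch

# Grow a trained DeepLog model from old_classes outputs to a new model that
# was constructed with a larger num_classes.
def expand_model(old_model, new_model, old_classes):
    old_state = old_model.state_dict()
    new_state = new_model.state_dict()
    for name, tensor in old_state.items():
        if name.startswith('fc'):
            new_state[name][:old_classes] = tensor   # copy the pretrained rows
        else:
            new_state[name] = tensor                 # shapes match; copy as-is
    new_model.load_state_dict(new_state)
    return new_model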

About generate data and window size

@wuyifan18 Thank you for your implementation of DeepLog!

I have a question about the data-processing part. When calling the generate function, is there any specific reason we need line = tuple(map(lambda n: n - 1, map(int, line.strip().split()))), which is line 21 in LogKeyModel_train.py? I'm not sure why we need lambda n: n - 1, and it seems that map(int, line.strip().split()) doesn't change anything for the given train data.

Also, is there any specific reason for a window size of 10? It seems that any session shorter than 10 will be padded with -1 (correct me if I'm wrong), but doesn't that make such sessions automatically be detected as abnormal? In that case, wouldn't different window sizes, like 10 versus 3, greatly affect the result?

Any hint will be helpful! Thank you!

python version

Can you share the Python version you are using in the project? Python 2 or Python 3?
Actually, I am using Python 3.5 to run the project on CentOS 6, but I am running into some problems.

Memory Error while running the training code

Hi @wuyifan18, thank you for the great tool. It works perfectly with a very small dataset, but whenever I try running it with a larger dataset I get this error during the training phase:

[enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 278528000 bytes. Error code 12 (Cannot allocate memory)

My machine has plenty of RAM, so I am not sure why this is happening or how I can resolve it. Alternatively, how could we edit the code to use less memory in each epoch? Right now each epoch takes approximately 1 GB of memory.

Any suggestion is highly appreciated.

Thanks
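One general way to reduce memory (a sketch, not functionality the repository provides; it needs PyTorch >= 1.2) is to stop materializing every (window, label) pair up front and stream them instead, so only the current batch lives in RAM; WindowStream is a hypothetical class:

import torch
from torch.utils.data import IterableDataset, DataLoader

# Yield (window, label) pairs lazily from the session file instead of
# building giant in-memory lists before training.
class WindowStream(IterableDataset):
    def __init__(self, path, window_size=10):
        self.path, self.window_size = path, window_size

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                keys = [int(n) - 1 for n in line.split()]
                for i in range(len(keys) - self.window_size):
                    seq = torch.tensor(keys[i:i + self.window_size],
                                       dtype=torch.float).view(-1, 1)
                    yield seq, keys[i + self.window_size]

loader = DataLoader(WindowStream('data/hdfs_train'), batch_size=2048)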
