logpai / loglizer

A machine learning toolkit for log-based anomaly detection [ISSRE'16]

License: MIT License

Python 2.53% Jupyter Notebook 97.47%

log-analysis anomaly-detection failure-diagnosis machine-learning aiops

loglizer's Introduction

loglizer

Loglizer is a machine learning-based log analysis toolkit for automated anomaly detection.

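Loglizer is an AI-based big-data log analysis tool that can be used for scenarios such as automated anomaly detection and intelligent failure diagnosis.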

Logs are essential to the development and maintenance of many software systems. They record detailed runtime information during system operation, which allows developers and support engineers to monitor their systems and track abnormal behaviors and errors. Loglizer provides a toolkit that implements a number of machine learning-based log analysis techniques for automated anomaly detection.

🔭 If you use loglizer in your research for publication, please kindly cite the following paper.

Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Experience Report: System Log Analysis for Anomaly Detection, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2016. [Bibtex] [Chinese version] (ISSRE Most Influential Paper)

Framework

The log analysis framework for anomaly detection usually comprises the following components:

Log collection: Logs are generated at runtime and aggregated into a centralized place with a data streaming pipeline, such as Flume and Kafka.
Log parsing: The goal of log parsing is to convert unstructured log messages into a map of structured events, based on which sophisticated machine learning models can be applied. The details of log parsing can be found at our logparser project.
Feature extraction: Structured logs can be sliced into short log sequences through interval window, sliding window, or session window. Then, feature extraction is performed to vectorize each log sequence, for example, using an event counting vector.
Anomaly detection: Anomaly detection models are trained to check whether a given feature vector is an anomaly or not.
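
The feature extraction step above can be sketched in a few lines. This is a minimal illustration, not loglizer's implementation: the event IDs (`E1`, `E2`, ...), the session names, and the `event_count_vectors` helper are hypothetical, and each session is assumed to be already grouped into a list of event IDs (as a session window would produce).

```python
from collections import Counter

def event_count_vectors(sessions):
    """Turn lists of event IDs into fixed-length event count vectors.

    `sessions` maps a session ID to the ordered list of event IDs it contains.
    Returns the sorted event vocabulary and one count vector per session.
    """
    vocabulary = sorted({event for events in sessions.values() for event in events})
    vectors = {}
    for session_id, events in sessions.items():
        counts = Counter(events)
        # One column per known event; events absent from the session count as 0.
        vectors[session_id] = [counts[event] for event in vocabulary]
    return vocabulary, vectors

# Hypothetical sessions: the event IDs are illustrative, not from a real dataset.
sessions = {
    "session_a": ["E1", "E2", "E2", "E3"],
    "session_b": ["E1", "E3"],
}
vocabulary, vectors = event_count_vectors(sessions)
print(vocabulary)
print(vectors)
```

Each vector has one column per event type, so all sessions end up with the same length regardless of how many log lines they contain.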

Models

Anomaly detection models currently available:

Model	Paper reference
Supervised models
LR	[EuroSys'10] Fingerprinting the Datacenter: Automated Classification of Performance Crises, by Peter Bodík, Moises Goldszmidt, Armando Fox, Hans Andersen. [Microsoft]
Decision Tree	[ICAC'04] Failure Diagnosis Using Decision Trees, by Mike Chen, Alice X. Zheng, Jim Lloyd, Michael I. Jordan, Eric Brewer. [eBay]
SVM	[ICDM'07] Failure Prediction in IBM BlueGene/L Event Logs, by Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra Sahoo. [IBM]
Unsupervised models
LOF	[SIGMOD'00] LOF: Identifying Density-Based Local Outliers, by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander.
One-Class SVM	[Neural Computation'01] Estimating the Support of a High-Dimensional Distribution, by John Platt, Bernhard Schölkopf, John Shawe-Taylor, Alex J. Smola, Robert C. Williamson.
Isolation Forest	[ICDM'08] Isolation Forest, by Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou.
PCA	[SOSP'09] Large-Scale System Problems Detection by Mining Console Logs, by Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael I. Jordan. [Intel]
Invariants Mining	[ATC'10] Mining Invariants from Console Logs for System Problem Detection, by Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, Jiang Li. [Microsoft]
Clustering	[ICSE'16] Log Clustering based Problem Identification for Online Service Systems, by Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen. [Microsoft]
DeepLog (coming)	[CCS'17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning, by Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar.
AutoEncoder (coming)	[Arxiv'18] Anomaly Detection using Autoencoders in High Performance Computing Systems, by Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini.

Log data

We have collected a set of labeled log datasets in loghub for research purposes. If you are interested in the datasets, please follow the link to submit your access request.

Install

git clone https://github.com/logpai/loglizer.git
cd loglizer
pip install -r requirements.txt

API usage

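# Imports as used in the repository's demos
from loglizer import dataloader, preprocessing
from loglizer.models import PCA

# Load HDFS dataset. If you would like to try your own logs, you need to rewrite the load function.
(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(...)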

# Feature extraction and transformation
feature_extractor = preprocessing.FeatureExtractor()
feature_extractor.fit_transform(...) 

# Model training
model = PCA()
model.fit(...)

# Feature transform after fitting
x_test = feature_extractor.transform(...)
# Model evaluation with labeled data
model.evaluate(...)

# Anomaly prediction
x_test = feature_extractor.transform(...)
model.predict(...) # predict anomalies on given data

For more details, please follow the demo in the docs to get started. Please note that ML models are not magic: you need to tune the parameters to make them work on your own data.

Benchmarking results

If you would like to reproduce the following results, please run benchmarks/HDFS_bechmark.py on the full HDFS dataset (HDFS100k is for demo only).

		HDFS
Model	Precision	Recall	F1
LR	0.955	0.911	0.933
Decision Tree	0.998	0.998	0.998
SVM	0.959	0.970	0.965
LOF	0.967	0.561	0.710
One-Class SVM	0.995	0.222	0.363
Isolation Forest	0.830	0.776	0.802
PCA	0.975	0.635	0.769
Invariants Mining	0.888	0.945	0.915
Clustering	1.000	0.720	0.837
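
For readers comparing their own runs against this table, the F1 column is the harmonic mean of precision and recall. The sketch below recomputes it for two rows; the `f1_score` helper is a hypothetical name, and because the table values are already rounded, recomputed F1 values for some rows may differ in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall for PCA and LOF, copied from the table above.
pca_f1 = f1_score(0.975, 0.635)
lof_f1 = f1_score(0.967, 0.561)
print(round(pca_f1, 3))
print(round(lof_f1, 3))
```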

Contributors

Shilin He, The Chinese University of Hong Kong
Jieming Zhu, The Chinese University of Hong Kong, currently at Huawei Noah's Ark Lab
Pinjia He, The Chinese University of Hong Kong, currently at ETH Zurich

Feedback

For any questions or feedback, please post to the issue page.

History

May 14, 2016: initial commit
Sep 21, 2017: update code and readme
Mar 21, 2018: rewrite most of the code and add detailed comments
Feb 18, 2019: restructure the repository with hands-on demo


loglizer's Issues

still two issues after update

Hi,
Thank you very much for the update. I tried it for the first time after the update and successfully ran the anomaly detection:
Train validation:
====== Evaluation summary ======
Precision: 0.966, recall: 0.365, F1-measure: 0.530

Test validation:
====== Evaluation summary ======
Precision: 0.967, recall: 0.561, F1-measure: 0.710

But I still have two questions:
1. How can I get the anomaly_label.csv used in the demo?
2. How do I get the specific anomalous log entries after anomaly detection is complete?

Best wishes, and thank you!

Please, can you assist? I am working on a research project using log files from AM/3D printing machines

Dear Sir/Madam,

I am currently writing a paper, "Using Machine Learning Techniques for Anomaly Detection in Additive Manufacturing/3D Printing Log Files". The data was collected from additive manufacturing/3D printing machines, and it consists entirely of log files. I don't know how to extract useful features from the data before applying a machine learning algorithm. I believe your open source code is in Python. Is it compulsory to use a log parser to extract the valuable features, or is there another way to extract useful data from the log files? Or can I use them directly? Another question: will your Python source code be suitable for this type of study?

Below is a sample of the log files collected:

Wed 04/25/18 11:44:26 DEBUG NOTICE 0 NextSubState: CHECK_PLATFORM_STATE -> WAIT_CONFIRM_PLATFORM_STATE
Wed 04/25/18 11:44:26 DEBUG NOTICE 0 Setting resolution mode to (PP) 2
Wed 04/25/18 11:44:26 DEBUG NOTICE 0 LoadBuildStyle(): leaving isVerifySupport and isVerifyPart unchanged, both set to FALSE, strStyleName= UHD COLUMN
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: UHD COLUMN
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: style file: ..\data\BuildStyles_HD_3500_Plus.ini
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: params: ..\data\Param_HD_Plus_UHD_TaiPan.ini
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: common: ..\data\Common_TAIPAN.ini
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: calibration: ..\data\Calibration_656dpi.ini
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: machine: ..\data\Machine.ini
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: support type: original
Wed 04/25/18 11:44:26 DEBUG PARAM_CHANGE 0 LoadBuildStyle: support: C:\ProJet\Bin..\data\fill_HiRes_TaiPan.dct
Wed 04/25/18 11:44:26 DSP LOG 0 SetResolutionMode : 2
Wed 04/25/18 11:44:26 DSP LOG 0 Changed passes to 20
Wed 04/25/18 11:44:27 DEBUG NOTICE 0 Read and updated ini parameters. Cure Passes = 10
Wed 04/25/18 11:44:27 DEBUG NOTICE 0 Send Parameters To DSP
Wed 04/25/18 11:44:27 DEBUG NOTICE 0 Setting resolution mode to (PP) 2
Wed 04/25/18 11:44:27 PREP IDLE 0 Max Files=25, Params=9999
Wed 04/25/18 11:44:27 PREP CTL_INITIALIZE 2 Input File: C:\ProJet\Bin..\work\jobs\Job642_2018-4-25_11-36-592\parts.ctl, Support Type: ORIGINAL, Minimum Supports: 0.110000
Wed 04/25/18 11:44:27 DSP LOG 0 SetResolutionMode : 2
Wed 04/25/18 11:44:27 DSP LOG 0 Changed passes to 20
Wed 04/25/18 11:44:34 DEBUG BTN_PRESSED 11 Yes Pressed
Wed 04/25/18 11:44:34 DEBUG NOTICE 0 NextSubState: WAIT_CONFIRM_PLATFORM_STATE -> AFTER_CLOSE_CHAMBER_DOOR_STATE
Wed 04/25/18 11:44:36 DEBUG NOTICE 0 NextSubState: AFTER_CLOSE_CHAMBER_DOOR_STATE -> CHECK_WASTE_STATE
Wed 04/25/18 11:44:40 DEBUG NOTICE 0 NextSubState: CHECK_WASTE_STATE -> WAIT_CLOSE_WASTE_DRAWER_STATE_SHARK_ENTRY_POINT
Wed 04/25/18 11:44:40 DEBUG NOTICE 0 NextSubState: WAIT_CLOSE_WASTE_DRAWER_STATE_SHARK_ENTRY_POINT -> AFTER_CLOSE_WASTE_DRAWER_STATE
Wed 04/25/18 11:44:41 DEBUG NOTICE 0 NextSubState: AFTER_CLOSE_WASTE_DRAWER_STATE -> WAIT_CLOSE_WASTE_DRAWER_STATE

interpretation of Invariants mining result

I am having a problem understanding the meaning of the invariants mining results. For example, in the demo:

Invariant space dimension: 11
Mined 10 invariants: {(1, 13): [1.0, -1.0], (3, 11): [-2.0, 1.0], (0, 12): [-10.0, 1.0], (1, 8): [-1.0, 1.0], (3, 9): [-101.0, 1.0], (1, 6): [1.0, -3.0], (1, 10): [-1.0, 1.0], (3, 7): [1.0, -1.0], (3, 4): [1.0, -1.0], (0, 2): [1.0, -1.0]}

I thought the pair (0, 12): [-10.0, 1.0] means that whenever event 12 happens once, event 0 has to happen 10 times. But that doesn't make sense, because the HDFS data does not have any event with ID 0 or 12.
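
One possible reading of a mined pair can be checked numerically. The sketch below rests on two assumptions that this thread does not confirm: the tuple holds column indices of the event count matrix (not raw event IDs), and the list holds coefficients of a linear relation that should sum to zero. Under that reading, (0, 12): [-10.0, 1.0] would say that the count in column 12 is ten times the count in column 0. The `invariant_holds` helper and the sample vector are hypothetical.

```python
def invariant_holds(columns, coefficients, count_vector, tolerance=1e-9):
    """Check whether sum(coefficient * count) over the given columns is zero."""
    total = sum(c * count_vector[col] for col, c in zip(columns, coefficients))
    return abs(total) <= tolerance

# Hypothetical event count vector with 14 columns.
counts = [0] * 14
counts[0] = 1    # column 0 occurs once
counts[12] = 10  # column 12 occurs ten times
holds = invariant_holds((0, 12), [-10.0, 1.0], counts)   # -10*1 + 1*10

counts[12] = 9
broken = invariant_holds((0, 12), [-10.0, 1.0], counts)  # -10*1 + 1*9
print(holds, broken)
```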

Missing file HDFS.npz

Missing HDFS.npz file

Where is 'rm_repeat_mlabel.txt' for the HDFS demo?

Invariants Mining recall and precision

Hi,

I am using the Invariants Mining code without labels to train on data, and I have modified the code to produce an anomaly.csv from the output on the training data.

Then I tried using the generated anomaly.csv, instead of the anomaly file provided in the project, to test data with a train ratio of 0.5 (using Invariants Mining with labels).

I get results (the number of anomalies in train and test), but recall and precision are shown as 0.00.
I tried modifying the HDFS log file, but it did not help.

Please help. The code is indeed helpful, but documentation is needed.
Also, can I deploy this algorithm with Apache Spark? It will be too slow otherwise.

vectorization

I did log parsing and have HDFS_2.log_structured.csv.
Now, how can I get rm_repeat_rawTFVector.txt (the log sequence file) by vectorization based on block ID?
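
Grouping parsed log lines by block ID can be sketched as follows. This is an illustration under assumptions, not the script that produced rm_repeat_rawTFVector.txt: it assumes the structured CSV has `Content` and `EventId` columns, that block IDs look like `blk_` followed by an optionally negative number, and the sample rows and event IDs are made up.

```python
import csv
import io
import re

# Assumed block ID format for HDFS logs.
BLOCK_ID_PATTERN = re.compile(r"blk_-?\d+")

def sequences_by_block(rows):
    """Group EventId values by every block ID found in the Content field."""
    sequences = {}
    for row in rows:
        # dict.fromkeys removes duplicate IDs within one line but keeps their order.
        for block_id in dict.fromkeys(BLOCK_ID_PATTERN.findall(row["Content"])):
            sequences.setdefault(block_id, []).append(row["EventId"])
    return sequences

# Hypothetical stand-in for a structured log file.
structured_csv = io.StringIO(
    "LineId,Content,EventId\n"
    "1,Receiving block blk_-101 src: /10.0.0.1:50010,E5\n"
    "2,Receiving block blk_202 src: /10.0.0.2:50010,E5\n"
    "3,PacketResponder 1 for block blk_-101 terminating,E11\n"
)
sequences = sequences_by_block(csv.DictReader(structured_csv))
print(sequences)
```

Each resulting list is one log sequence; counting the events in it gives the event count vector for that block.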

Missing Log-event mapping - log sequence data

Hi, thanks for your project. It is really helpful.
I'm trying to run loglizer, but I have a problem with the log event mapping (for the BGL demo) and the log sequence data (for the HDFS demo). After successfully running logparser, the output includes templates.txt, which contains all templates, and several different text files containing log IDs. That matches the description in your README file. So how can I generate the two missing files?
Thank you.

To teach students on Unsupervised Machine learning based Log Analysis

Sir,
I am an assistant professor. I have chosen Big Data as the subject for this semester, and the learning methodology I have selected is "learn by doing". I kindly request that you guide me in demonstrating a project on "Unsupervised Machine Learning based Log Analysis", as I am new to this field and would like to pursue this subject further.
Thanking you,
With warm regards,
Hanumantha Rao K R
Assistant Professor
Dept. of Computer Applications
JSS Academy of Technical Education
c 20 / 1, sector 62, NOIDA, 201301, U. P , INDIA

Steps for running model

Hi,

Can you please give the steps that we should follow to run a model? For example, if I want to run anomaly prediction on ZooKeeper logs, which steps should I follow (the sequence of programs to run) with your code and my data? I can add my data inside the data folder of your downloaded package.
This would be very helpful for us to understand how this works. Thanks a lot.

How to use loglizer?

Hi,
I have just started researching log anomaly detection and am very interested in your results. I have the following questions and would appreciate your guidance.
1. How should I use the results of logparser in loglizer?
'path':'../../Data/SOSP_data/', # directory for input data
'log_seq_file_name':'rm_repeat_rawTFVector.txt', # filename for log sequence data file
'label_file_name':'rm_repeat_mlabel.txt', # filename for label data file
(1) Is the 'path' data log_structured.csv or log_templates.csv?
(2) Where can I find 'rm_repeat_rawTFVector.txt' and 'rm_repeat_mlabel.txt'?
2. In demo_bgl, the input parameters are log_file_name and log_event_mapping, while in demo_hdfs the input parameters are log_seq_file_name and label_file_name. Do you intend to support both of these input methods?
Thank you very much. I look forward to your reply!

Suggestions for collecting logs

Hi, looking forward to trying your project! It's a great fit because we have millions of lines of Apache logs per day (not including our local edge caching and CDN caching), we are an all-Python shop, and we randomly get hit by malicious bots that impact our public experience.

I'm trying to figure out the collection piece of the equation, whether it's for ad hoc exploration of a specific range of dates or ongoing collection for feeding loglizer.

Do you have any suggestions?

Thanks
Thatcher

mining_invariants

In mining_invariants.py:

a. What is "scale_list" in the para dict? Why is "scale_list" set to [1, 2, 3]?

def check_invar_validity(para, event_count_matrix, selected_columns):
 
    .....   

1.      min_theta, FLAG_contain_zero = compute_eigenvector(sub_matrix)
2.      abs_min_theta = [np.fabs(it) for it in min_theta]
3.      if FLAG_contain_zero:
4.          return validity, []
5.      else:
6.          for i in para['scale_list']:
7.              min_index = np.argmin(abs_min_theta)
8.              scale = float(i) / min_theta[min_index]
9.              scaled_theta = np.array([round(item * scale) for item in min_theta])
10.             scaled_theta[min_index] = i
11.             scaled_theta = scaled_theta.T
12.             if 0 in np.fabs(scaled_theta):
13.                 continue
14.             dot_submat_theta = np.dot(sub_matrix, scaled_theta)

b. What is the meaning of the transformations of the vector "min_theta" in rows 7-10?
c. Since 0 is not in "abs_min_theta", 0 is consequently not in "scaled_theta", so why is the condition in row 12 needed?
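
To make rows 7-10 concrete, the sketch below replays them in plain Python on a made-up vector. It is a reading of the quoted snippet, not an authoritative explanation of the authors' intent; `scale_theta` and the sample values are hypothetical, and the input is assumed to contain no zero (the quoted code returns early in that case).

```python
def scale_theta(min_theta, i):
    """Rescale a float vector so its smallest-magnitude entry becomes i, rounding the rest."""
    abs_theta = [abs(value) for value in min_theta]
    min_index = abs_theta.index(min(abs_theta))              # row 7
    scale = float(i) / min_theta[min_index]                  # row 8
    scaled = [round(value * scale) for value in min_theta]   # row 9
    scaled[min_index] = i                                    # row 10
    return scaled

# A made-up float vector whose entries are roughly in the ratio 1 : -3.
scaled_by_one = scale_theta([0.316, -0.949], 1)
scaled_by_two = scale_theta([0.316, -0.949], 2)
print(scaled_by_one, scaled_by_two)
```

In this sketch the effect is to turn a float-valued direction into small integer coefficients, trying each multiplier in scale_list in turn. Every other entry has a magnitude at least as large as the smallest one, so for i >= 1 each scaled entry has magnitude of at least 1 before rounding and cannot round to 0 here.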

Code of DeepLog

Is the code of DeepLog open source? How can I get it for experiments? Thanks!

about "epsilon" in mining_invariants

In the paper on mining invariants by Lou et al., it is written that "epsilon" ~ 0.5 (sqrt(4)/4) [page 8, before Section 6.2], but in

loglizer/demo_hdfs/mining_invariants_hdfs.py

para = {
'path':'../../Data/SOSP_data/',        # directory for input data
'log_seq_file_name':'rm_repeat_rawTFVector.txt', # filename for log sequence data file
'label_file_name':'rm_repeat_mlabel.txt', # filename for label data file
'epsilon':2.0,                          # threshold for the step of estimating invariant space
'threshold':0.98,                       # percentage of vector Xj in matrix satisfies the condition that |Xj*Vi|<epsilon
'scale_list':[1,2,3],                   # list used to scale the float theta into integers
'stop_invar_num':3                      # stop if the invariant length is larger than stop_invar_num. None if not given
}

'epsilon'=2.0. Why?
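
The roles of 'epsilon' and 'threshold', as described by the comments in the para dict above, can be illustrated with a small check. This is a sketch based only on those comments, not the repository's code; `satisfies_invariant` and the sample rows are hypothetical.

```python
def satisfies_invariant(matrix, theta, epsilon, threshold):
    """Return True if at least `threshold` of the rows satisfy |row . theta| < epsilon."""
    if not matrix:
        return False
    hits = 0
    for row in matrix:
        dot = sum(x * t for x, t in zip(row, theta))
        if abs(dot) < epsilon:
            hits += 1
    return hits / len(matrix) >= threshold

# Hypothetical event counts for two columns, with candidate relation x0 - x1 = 0.
rows = [[1, 1], [2, 2], [3, 3], [4, 1]]
strict = satisfies_invariant(rows, [1.0, -1.0], epsilon=2.0, threshold=0.98)
loose = satisfies_invariant(rows, [1.0, -1.0], epsilon=2.0, threshold=0.75)
off_by_one = satisfies_invariant([[2, 1]], [1.0, -1.0], epsilon=0.5, threshold=1.0)
print(strict, loose, off_by_one)
```

With integer counts and integer coefficients the dot product is an integer, so in this sketch epsilon=0.5 accepts only an exact 0, while epsilon=2.0 also accepts rows that are off by 1.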

Where can I get the log file named in this error?

Traceback (most recent call last):
File "F:/aiops/loglizer-master/demo_bgl/classifiers_bgl.py", line 26, in
raw_data, event_mapping_data = data_loader.bgl_data_loader(para)
File "F:\aiops\loglizer-master\demo_bgl\utils\data_loader.py", line 55, in bgl_data_loader
data_df = pd.read_csv(file_path, delimiter=r'\s+', header=None, names = ['label','time'], usecols = para['select_column']) #, parse_dates = [1], date_parser=dateparse)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 405, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 764, in init
self._make_engine(self.engine)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 985, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1605, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.cinit (pandas_libs\parsers.c:4209)
File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source (pandas_libs\parsers.c:8873)
FileNotFoundError: File b'../../Data/BGL_data/BGL_MERGED.log' does not exist

BGL_MERGED.log does not exist

label file

Hi, I did not find the rm_repeat_mlabel.txt file. Could you please tell me where it is?
What is the structure of this file? Is it like this?
BlockId | Label

Predictive maintenance

Hello,
First, thank you for your implementations, it helped me a lot!
I want to build a predictive maintenance pipeline based on logs for research purposes.
Therefore, for predictive maintenance, I'm implementing DeepLog (that's the only solution I found, I don't think there are others, let me know if you have other solutions) but it supposes that the training set in input contains logs from normal execution.
In order to apply my pipeline for any logs, I don't want to use labelled datasets. In addition, logs can be huge and it's in many cases complicated to say whether it's an anomaly or not if I'm not an expert (Spark logs for example).
The LSTM (DeepLog) needs normal execution logs (otherwise, it would consider anomalies as normal behavior).
I thought of, as a preprocessing step, using one of the unsupervised learning methods (Log Clustering, PCA or Invariant Mining for example) to keep normal execution logs for my training set of DeepLog.
I would run this whole pipeline on datasets from loghub which are not labelled.
What do you think about this approach? Is there anything better that can be done? Any advice?
Thank you for your answer!

invariants mining without label file

My problem is that when I try to re-implement InvariantsMiner_demo.py to work without the label file (I don't care about measuring correctness at this point), it produces a different set of invariants compared to the original InvariantsMiner_demo.py.

With my implementation I get the following results:
Invariant space dimension: 17
Mined 17 invariants: {(19,): [1], (0, 3): [-1.0, 1.0], (0, 4): [1.0, -1.0], (0, 5): [-17.0, 1.0], (0, 6): [-17.0, 1.0], (0, 10): [-17.0, 1.0], (0, 15): [-35.0, 1.0], (0, 16): [-9.0, 1.0], (0, 17): [1.0, -2.0], (1, 9): [1.0, -3.0], (1, 11): [-1.0, 1.0], (1, 14): [-1.0, 1.0], (1, 18): [1.0, -1.0], (5, 8): [1.0, -1.0], (5, 13): [-101.0, 1.0], (6, 8): [1.0, -1.0], (6, 13): [-101.0, 1.0]}

With the original InvariantsMiner_demo.py code I get the following results:

Invariant space dimension: 12
Mined 11 invariants: {(14,): [1], (0, 2): [1.0, -1.0], (0, 12): [-15.0, 1.0], (1, 6): [1.0, -3.0], (1, 8): [-1.0, 1.0], (1, 10): [-1.0, 1.0], (1, 13): [1.0, -1.0], (3, 4): [1.0, -1.0], (3, 7): [1.0, -1.0], (3, 9): [-101.0, 1.0], (3, 11): [-2.0, 1.0]}

Can you please point out what I am doing wrong?

training a new log parser

Hello,

Is it possible to train the system with my own device logs?

Thanks

Run the project Loglizer

I'm new to Python and machine learning. I want to know how to run the project at this link:

https://github.com/logpai/loglizer
Also, can I get full documentation of the entire project? Any help would be appreciated.

Loglizer metrics with PCA

Hello,

Using the loglizer/data/HDFS/HDFS_100k.log_structured.csv file as the structured data set and the anomaly_label.csv from the same path, I ran PCA_demo.py with the same dataloader.py parameters as yours (for the train/test split).

The results I got for this case are here:

We see that recall metrics don't match. Is there something i am missing?

Thank you in advance.

Questions regarding dataset and implementation details.

With great interest I've read (nearly) all of the papers released by this research group. I've found the papers a great resource, as the combination of papers give a broad view on the area of automated log parsing.

I've been working on an implementation for automated log parsing. Thus far I've adopted the Drain algorithm to generate log templates, and trained an LSTM to detect anomalies within the sequences of log keys. Seems to work great!

I still have some questions regarding the project, I hope you could answer these for me:

The datasets (found here; training, test - normal, test - abnormal) contain log keys (or template ID's) of parsed HDFS logs, is that correct?
Is there a specific reason that the log keys within these datasets are separated by newlines? I.e., does every line describe the log keys from a specific time bucket?
In my implementation I haven't bucketed any of the data, as I'm using a sliding window to generate sequences of log keys out of the entire dataset of log keys.
Could you elaborate on how the workflow diagram is constructed? Did you consider the raw logs, or the parsed logkeys+parameters to construct the diagram?
Could you elaborate on how the parameter time series anomaly detection is created? The way I interpret it, for every unique log key a specific LSTM is trained, is that correct?
What kind of representation did you use to build such a model? Several parameters, such as filenames, have high variety (most of the names are unique). If these are converted to one-hot encodings, we end up with a very high-dimensional (sparse) vector representation, which makes it computationally expensive.
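
The sliding-window setup mentioned in the questions above can be sketched as follows: each window of log keys is paired with the key that follows it, which is a common way to frame next-key prediction for a sequence model. The `sliding_windows` helper, the window size, and the sample keys are hypothetical.

```python
def sliding_windows(log_keys, window_size):
    """Yield (history, next_key) pairs from a sequence of log keys."""
    for start in range(len(log_keys) - window_size):
        history = log_keys[start:start + window_size]
        yield history, log_keys[start + window_size]

# Made-up log keys (template IDs) for a single sequence.
keys = [5, 5, 22, 11, 9, 11]
pairs = list(sliding_windows(keys, window_size=3))
for history, next_key in pairs:
    print(history, "->", next_key)
```

If each line of a dataset holds the keys of one session (an assumption, the thread does not confirm it), applying the window per line keeps a window from spanning two unrelated sessions, which windowing over the concatenated keys would not.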

I hope you can provide me with some answers to my questions. A big thanks and thumbs up for the great research you guys are doing!

A question about 'Feature Extraction' in PCA.

Hi! I have read your papers on log parsers and log anomaly detection; they, and especially your toolkit, are very helpful for my study. Thanks a lot!
I have a question about PCA. I have read the paper on the PCA method, and I know the meaning of the files 'rm_repeat_rawTFVector.txt' and 'rm_repeat_mlabel.txt'. They belong to the 'Feature Extraction' phase, but I cannot find any method that generates them. I also checked the log parser toolkit and could not find such a method there either. I hope you can answer my question. Thanks very much!

ValueError: shapes (16,16) and (14,) not aligned: 16 (dim 1) != 14 (dim 0)

I use PCA_demo.py and change train_ratio to 0.8 to train model.

(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(struct_log,
                                                            label_file=label_file,
                                                            window='session', 
                                                            train_ratio=0.8,
                                                            split_type='uniform')

After training finished, I dumped the model and wanted to use it for prediction. For prediction, I lowered train_ratio to 0.3 to generate the test data and then encountered the ValueError above.
If all follow-up data is to be used for prediction, how do I generate data in the format that model.predict expects?

(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(struct_log,
                                                            label_file=label_file,
                                                            window='session', 
                                                            train_ratio=0.3,
                                                            split_type='uniform')
feature_extractor = preprocessing.FeatureExtractor()
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf', 
                                          normalization='zero-mean')
x_test = feature_extractor.transform(x_test)

with open('pca.pickle', 'rb') as f:
	model = pickle.load(f)
model.predict(x_test[0:100])

pwrai@ab434bffe3be:~/loglizer/loglizer$ python ../demo/PCA_demo3.py
====== Input data summary ======
Total: 7940 instances, 313 anomaly, 7627 normal
Train: 2381 instances, 93 anomaly, 2288 normal
Test: 5559 instances, 220 anomaly, 5339 normal

====== Transformed train data summary ======
Train data shape: 2381-by-14

====== Transformed test data summary ======
Test data shape: 5559-by-14

Traceback (most recent call last):
File "../demo/PCA_demo3.py", line 33, in
print(model.predict(x_test[0:100]))
File "../loglizer/models/PCA.py", line 93, in predict
y_a = np.dot(self.proj_C, X[i, :])
ValueError: shapes (16,16) and (14,) not aligned: 16 (dim 1) != 14 (dim 0)
pwrai@ab434bffe3be:~/loglizer/loglizer$
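
The shapes in the traceback (16 versus 14) are consistent with the feature extractor having been refitted on a different split, which can produce a different set of event columns than the one the pickled model was trained on. The sketch below uses a toy stand-in class (`ToyCountExtractor`, hypothetical, not loglizer's FeatureExtractor) to show the effect and one possible remedy: keep the fitted state from training and only call transform at prediction time.

```python
import json
from collections import Counter

class ToyCountExtractor:
    """Toy stand-in for a fitted feature extractor (not loglizer's class)."""

    def fit_transform(self, sequences):
        # The column set is fixed by the events seen during fitting.
        self.events = sorted({e for seq in sequences for e in seq})
        return self.transform(sequences)

    def transform(self, sequences):
        # Events not seen during fitting are ignored, so the width never changes.
        rows = []
        for seq in sequences:
            counts = Counter(seq)
            rows.append([counts[e] for e in self.events])
        return rows

train_large = [["E1", "E2"], ["E2", "E3"], ["E4"]]
train_small = [["E1", "E2"]]

original = ToyCountExtractor()
width_at_training = len(original.fit_transform(train_large)[0])

refitted = ToyCountExtractor()
width_after_refit = len(refitted.fit_transform(train_small)[0])
print(width_at_training, width_after_refit)  # the widths differ after refitting

# Save the fitted column set at training time and restore it for prediction.
saved_state = json.dumps(original.events)
restored = ToyCountExtractor()
restored.events = json.loads(saved_state)
new_rows = restored.transform([["E1", "E5"], ["E3"]])
print(len(new_rows[0]) == width_at_training)
```

With real objects, the same idea would be to persist the fitted extractor next to the model (for example with pickle, as the issue already does for the model) rather than calling fit_transform again on new data.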

Invariant Mining with Drain

Hello Colleagues,

Thank you for your work, I really appreciate it.

At the moment we are in the stage of evaluation of different approaches to analyze the logs (SAP HANA database) and search for anomalies in it. After reading your paper I would like to start deploying two of your approaches:

Invariant Mining
Decision Tree

Before that however, I would like to confirm that my thinking is correct:

The input to the algorithms should be the parsed log events. In your paper you referred to logparser as a possible way to go. Looking at the accuracy of the different logparser algorithms, I would try Drain first.
--> Would the output of logparser's Drain be a proper input to the loglizer algorithms (Invariant Mining / Decision Tree)?
Or are there any intermediate steps needed?

Kind Regards,
Kamil Damuc

How can I view the anomalous data?

data Loader

What is the structure of the log sequence data file for PCA_hdfs.py?

PCA components

Hi!

Is there a way to get the components from the trained PCA model? (like in sklearn.decomposition.PCA.components_)

Thanks

Python version

Hi, thanks for your work! Which Python version do you use for this implementation?

event count matrix for bgl data

Hi, thanks for open-sourcing your code. When I tried to produce the event count matrix for the BGL data, I found the method (bgl_preprocess_data) in your loglizer/loglizer/dataloader.py file, and I got a similar F-measure using this method. However, I found a problem in lines 232-233 (if label_data[k]: label = 1). If I'm right, these two lines label the windows. But they will label all windows as 1 (anomaly) except those containing no events. From the description of the BGL data, I think "-" should indicate normal logs.
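
The labeling rule the issue argues for can be sketched as follows: a line is normal when its label field is "-", and a window is anomalous only if it contains at least one non-normal line. This is an illustration of that rule, not a patch to dataloader.py; `label_windows`, the sample labels, and the window boundaries are hypothetical.

```python
def label_windows(line_labels, windows):
    """Label each window 1 if any line in it is an alert, else 0.

    `line_labels` holds the label field of each log line; "-" marks a normal line.
    `windows` is a list of (start, end) index pairs, with end exclusive.
    """
    is_alert = [0 if label == "-" else 1 for label in line_labels]
    return [int(any(is_alert[start:end])) for start, end in windows]

# Hypothetical labels: "-" is normal, anything else is treated as an alert tag.
line_labels = ["-", "-", "KERNDTLB", "-", "-", "-"]
windows = [(0, 3), (3, 6), (2, 4)]
window_labels = label_windows(line_labels, windows)
print(window_labels)
```

The key difference from a truthiness test such as `if label_data[k]` is that the string "-" is itself truthy in Python, so the normal marker has to be compared explicitly.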

Not able to reproduce accuracy results

First of all, thank you for the amazing work you have done on the loglizer project.
I tried to run the code on some of the example datasets (BGL) shared with me via loghub.

But the accuracy numbers that I am getting do not match the figures published in the paper.
Please let me know if I am missing something. Thanks for your time.

Below are the results of my test.

python3 log_clustering_bgl.py
The raw data shape is (4747963, 2) and label shape is (4747963, 1)
The number of anomaly logs is 348460, but it requires further processing
Loading start_end_index_list from file
there are 5151 instances (sliding windows) in this dataset
There are 385 log events
Among all instances, 1400 are anomalies
weighted data size is (5151, 385)
seperating the initial training Data...
knowledge base size: (3090, 385), online learning: (1030, 385), testing size: (1031, 385)
(907, 385) (2183, 385)
There are 46 clusters in this initial clustering
failure data clustering finished...
There are 469 clusters in this initial clustering
success data clustering finished...
Start online learning...
==================== RESULT ====================
Precision: 0.159817, Recall: 0.297872, F1_score: 0.208024

python3 PCA_bgl.py
The raw data shape is (4747963, 2) and label shape is (4747963, 1)
The number of anomaly logs is 348460, but it requires further processing
Loading start_end_index_list from file
there are 5151 instances (sliding windows) in this dataset
There are 385 log events
Among all instances, 1400 are anomalies
principal components number is 7
there are 1400 anomalies
the threshold is 409048.391027
==================== RESULT ====================
Precision: 0.503597, Recall: 0.300000, F1_score: 0.376007

python3 classifiers_bgl.py
The raw data shape is (4747963, 2) and label shape is (4747963, 1)
The number of anomaly logs is 348460, but it requires further processing
Loading start_end_index_list from file
there are 5151 instances (sliding windows) in this dataset
There are 385 log events
Among all instances, 1400 are anomalies
Training size is 4120 while testing size is 1031
Train a Decision Tree Model
==================== RESULT ====================
Precision: 1.000000, Recall: 0.629787, F1_score: 0.772846

Invariant Mining
==================== RESULT ====================
Precision: 0.715909, Recall: 0.405000, F1_score: 0.517336

How to generate the graph ?

Thanks for the details, they help a lot.
But if I want to generate a graph to illustrate the algorithm, how can I do it with the help of EventId?
To generate the graph we need an n x 2 matrix (n rows and 2 columns), but with this logic we get an n x m matrix, so I need to know how we can generate the graph from that.

Correct me if I'm wrong.

Fit Drain output to Invariant Mining input

Hello,

I have trouble understanding how to fit the output of the Drain parser to the Mining Invariants input.
Drain outputs two files, structured.csv and templates.csv. The mining_invariants_bgl.py script, which I took to rework for my case, calls data_loader.py and feeds it the log file itself and the log_event_mapping (for the BGL case).
Now, my naive thinking would be that the log_event_mapping is templates.csv, where I have two columns: an alphanumeric EventId and the EventTemplate. But the loader does not accept it that way.
Also, the raw log gets converted to a 2-column matrix consisting of the label, which is always '1', and the time difference in seconds. Somehow I am not able to make sense of that and cannot see how raw_data and event_mapping_data should fit together.
Also, I am not able to find the corresponding input files for the BGL case in the directories, so I cannot debug your working example.
I read your paper and understand the idea; however, I would like to reuse as much of your code as possible, based on the BGL case.

So, could you please explain how the required input for the BGL data_loader (raw_data, and especially the log_event_mapping) relates to the Drain output (structured.csv and templates.csv)?
Is the log_event_mapping expected to be the patterns (templates.csv) in another form/structure, or should it be ALL the logs with event IDs assigned up front and converted to numeric form (like structured.csv after keeping only two columns: EventId, converted from alphanumeric --> int, and Content)?

Kind Regards,
Kamil
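
One possible way to bridge the two formats, sketched under assumptions: the snippet below takes rows shaped like Drain's structured output (only the alphanumeric EventId column is used), assigns each distinct EventId an integer index, and builds one event count vector per window. The sample rows, the window grouping, and the helper names are hypothetical; this is not loglizer's actual loader.

```python
import csv
import io

# Hypothetical sample shaped like a Drain structured CSV (column names assumed).
STRUCTURED = """LineId,EventId,Content
1,E5a,instruction cache parity error corrected
2,9bc1,generating core.2275
3,E5a,instruction cache parity error corrected
4,77f0,data TLB error interrupt
"""


def build_event_index(rows):
    """Map each alphanumeric EventId to an integer, in order of first appearance."""
    index = {}
    for row in rows:
        index.setdefault(row["EventId"], len(index))
    return index


def count_vector(event_ids, index):
    """Count how often each event occurs in one window of log lines."""
    vec = [0] * len(index)
    for eid in event_ids:
        vec[index[eid]] += 1
    return vec


rows = list(csv.DictReader(io.StringIO(STRUCTURED)))
event_index = build_event_index(rows)

# Treat lines 1-2 and 3-4 as two windows, purely for illustration.
windows = [rows[0:2], rows[2:4]]
matrix = [count_vector([r["EventId"] for r in w], event_index) for w in windows]
```

Each row of matrix is then a numeric feature vector for one window, which is the kind of input a counting-based model works on.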

SVM: evaluate on same train set

Hello,

I am using the HDFS structured file you provide, data/HDFS/HDFS_100k.log_structured.csv. I split it into train and test sets and train an SVM:

(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(struct_log,
                                                            label_file=label_file,
                                                            window='session',
                                                            train_ratio=0.5,
                                                            split_type='uniform')

feature_extractor = preprocessing.FeatureExtractor()
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
model = SVM()
model.fit(x_train, y_train)

Now, to check that everything is right, I evaluate on the same x_train dataset:
precision, recall, f1 = model.evaluate(x_train, y_train)

However, the metrics are:
'Precision: 1.000, recall: 0.365, F1-measure: 0.535'
I would expect almost perfect metrics since I am predicting on the training set. Do you know what the issue is?
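
To read these numbers, it may help to recompute them from raw counts. The sketch below uses the standard definitions of precision, recall, and F1; the counts are hypothetical and chosen only to reproduce the reported figures. A precision of 1.000 with a recall of 0.365 means the model raised no false alarms but missed most of the anomalies, which points toward an underfitting model rather than an inconsistent metric.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 73 of 200 anomalies detected, no false alarms.
p, r, f1 = precision_recall_f1(tp=73, fp=0, fn=127)
print(f"Precision: {p:.3f}, recall: {r:.3f}, F1-measure: {f1:.3f}")
```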

Invariants mining broken invariants

I am using the invariants mining tool for anomaly detection. I would like to print the list of broken invariants, if possible, to check whether the results make sense, or at least print the log sequences that contain anomalies. Is this doable right now? If so, what should I do?
Thanks
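
A minimal sketch of what such a report could look like, assuming each invariant is represented as a tuple of event columns plus a coefficient vector theta whose weighted sum of event counts is zero for normal sequences. The representation, the data, and the function name are hypothetical; this is not an existing loglizer API.

```python
def broken_invariants(count_vector, invariants):
    """Return the invariants whose weighted event counts do not sum to zero."""
    broken = []
    for columns, theta in invariants:
        total = sum(t * count_vector[c] for c, t in zip(columns, theta))
        if total != 0:
            broken.append((columns, theta, total))
    return broken


# Hypothetical invariants: count(E0) == count(E1), and count(E1) == 2 * count(E2).
invariants = [((0, 1), (1, -1)), ((1, 2), (1, -2))]
sequences = {
    "seq_a": [3, 3, 2],  # breaks the second invariant: 3 - 2 * 2 = -1
    "seq_b": [4, 4, 2],  # satisfies both invariants
}

report = {name: broken_invariants(vec, invariants) for name, vec in sequences.items()}
for name, broken in report.items():
    for columns, theta, total in broken:
        print(f"{name}: invariant on columns {columns} with theta {theta} gives {total}")
```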

Find the position of the anomalies in the log file

First, thank you for this toolkit.
I have a question about the position of the anomalies in the original log file.
Is there a way to detect and display them after using PCA or another algorithm?
Thank you in advance.
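
One way to approach this, sketched under assumptions: record which raw line numbers went into each session (or window) while grouping the log, then look those lines up for every row the detector flags. The session IDs, line numbers, and predictions below are hypothetical, and the detector itself is left out.

```python
from collections import defaultdict

# Hypothetical parsed log lines: (line_number, session_id).
lines = [(1, "blk_1"), (2, "blk_2"), (3, "blk_1"), (4, "blk_3"), (5, "blk_2")]

# Remember which raw line numbers fed each session while grouping.
session_lines = defaultdict(list)
for line_no, session_id in lines:
    session_lines[session_id].append(line_no)

session_ids = list(session_lines)  # row order of the feature matrix
predictions = [0, 1, 0]            # hypothetical detector output, one flag per row

# Map each flagged row back to the raw line numbers it was built from.
anomalous = {
    sid: session_lines[sid]
    for sid, flag in zip(session_ids, predictions)
    if flag == 1
}
print(anomalous)
```

This locates anomalies at the granularity of a session or window, not a single line, because the models score whole sequences.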

error in implementation of loglizer.dataloader.load_HDFS function

In this function, line 115 returns x_data (the whole data set):
return (x_data, None), (x_test, None)

Instead, it should return:
return (x_train, None), (x_test, None)

Can you please confirm whether this is actually an issue?

Thanks a lot for your work.
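
For illustration, a split of unlabeled data that returns disjoint train and test parts could look like the sketch below. The function name and the sequential split are assumptions made for this example; it is not the actual body of load_HDFS.

```python
def split_unlabeled(x_data, train_ratio):
    """Split sequences into disjoint train and test parts; labels are None."""
    num_train = int(train_ratio * len(x_data))
    x_train = x_data[:num_train]
    x_test = x_data[num_train:]
    # Return x_train, not x_data, so the test rows are never seen in training.
    return (x_train, None), (x_test, None)


(x_train, _), (x_test, _) = split_unlabeled(list(range(10)), train_ratio=0.5)
```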

Next Steps

I have spent some time working with your solution, and after some struggle I was able to use your code. Reading the issues, I saw that many people had the same problems.

So I wrote a wrapper that lets me use the framework through one configurable class, as a friendlier API.

Would you be interested in a pull request? If so, I can clean up my code and contribute it to your project.

Also, I would like to know about the future of the project. Would you like some help making this tool more user-friendly?

I also think there is a need to add unit testing and other automation if you would like to make this more reliable!

Using loglizer to analyze Jenkins Logs

Hi,
Could you please give me some pointers on how to use loglizer to perform anomaly detection on Jenkins console logs?

I want to analyze the build console logs and classify them into, say, infrastructure vs. non-infrastructure failures.

Thanks

About the format of the BGL_MERGED.log file

Hi,
Is the BGL_MERGED.log file a raw log file or a parsed log file?

How to load log data in sliding window

We need to process a batch of logs and want to load the data using a sliding window.
But in dataloader.py, the function load_BGL is marked TODO.
Has this function been completed? If so, could you provide the source code?
We would appreciate your help. Thanks.
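
While load_BGL is unavailable, a time-based sliding window can be sketched as follows. It assumes the log timestamps are already sorted and expressed as numbers (for example, seconds); the function name, parameters, and sample data are hypothetical and not taken from loglizer.

```python
def sliding_windows(timestamps, window_size, step_size):
    """Return (start_index, end_index) pairs, end exclusive, over sorted timestamps."""
    if not timestamps:
        return []
    windows = []
    n = len(timestamps)
    start_time = timestamps[0]
    last_time = timestamps[-1]
    start_idx = 0
    end_idx = 0
    while start_time <= last_time:
        end_time = start_time + window_size
        # Advance the left edge to the first line at or after start_time.
        while start_idx < n and timestamps[start_idx] < start_time:
            start_idx += 1
        end_idx = max(end_idx, start_idx)
        # Advance the right edge past every line before end_time.
        while end_idx < n and timestamps[end_idx] < end_time:
            end_idx += 1
        windows.append((start_idx, end_idx))
        start_time += step_size
    return windows


ts = [0, 1, 2, 5, 6, 9]
wins = sliding_windows(ts, window_size=4, step_size=2)
```

Each index pair selects the log lines of one instance; counting the event IDs inside each pair then gives the feature vector for that window. A window with no log lines shows up as a pair whose two indices are equal.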

Where does the structured.csv come from

Hello,

I would like to know which raw file produces the structured CSV file in data/HDFS and which parser algorithm was used, if you don't mind.
Thanks

PCA.py

Hi,
Thanks for your project. It's really helpful.
But I am confused about 'c_alpha' in PCA.py. How is its value determined? It was 3.2905 before; currently it is 8.1.
Thank you.
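
Some background that may help, offered as a sketch and not as a description of PCA.py: in PCA-based detection with a Q-statistic threshold, c_alpha is usually the normal deviate for a chosen confidence level, and 3.2905 is approximately the 0.9995 quantile of the standard normal distribution. The threshold function below follows the commonly cited Jackson-Mudholkar form; that PCA.py uses exactly this form is an assumption, the eigenvalues are hypothetical, and the sketch does not explain why 8.1 was chosen.

```python
import math
from statistics import NormalDist


def normal_deviate(quantile):
    """Standard normal quantile, e.g. 0.9995 gives roughly 3.2905."""
    return NormalDist().inv_cdf(quantile)


def q_threshold(residual_eigenvalues, c_alpha):
    """Q-statistic threshold computed from the eigenvalues of the residual subspace."""
    phi1, phi2, phi3 = (
        sum(lam ** i for lam in residual_eigenvalues) for i in (1, 2, 3)
    )
    h0 = 1.0 - 2.0 * phi1 * phi3 / (3.0 * phi2 ** 2)
    inner = (
        c_alpha * math.sqrt(2.0 * phi2 * h0 ** 2) / phi1
        + 1.0
        + phi2 * h0 * (h0 - 1.0) / phi1 ** 2
    )
    return phi1 * inner ** (1.0 / h0)


c_old = normal_deviate(0.9995)
eigenvalues = [0.5, 0.3, 0.1]  # hypothetical residual eigenvalues
low = q_threshold(eigenvalues, c_old)
high = q_threshold(eigenvalues, 8.1)
```

For these sample eigenvalues, a larger c_alpha yields a larger threshold, so fewer sequences are flagged as anomalies.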

More info needed

Can you update the README to indicate which Python file is used for what purpose? There are many Python files, and the purpose of each is not clear.

Evaluation phase of Mining Invariants is completely wrong.

Line 265 of models/mining_invariants.py:

for key in invar_dict:
    valid_col_list.append(list(key))
    valid_invar_list.append(list(invar_dict[key]))

Please note that a set is an unordered data structure. You use a frozenset to index the dict, but converting the frozenset to a list can produce an unexpected mismatch between column numbers and theta values.

Due to this implementation error, I believe the algorithm performance reported in your paper is not convincing.
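
One way to rule out this kind of mismatch, sketched under assumptions: keep each theta coefficient tied to its column explicitly and derive both lists from one sorted column order. How the original code stores theta is not shown in the excerpt above, so the dict-per-invariant layout and the sample values here are hypothetical.

```python
# Hypothetical invariants: a frozenset of column indices mapped to theta values.
# Storing theta keyed by column keeps each coefficient tied to its column.
invar_dict = {
    frozenset({7, 2}): {2: 1, 7: -2},
    frozenset({5, 0, 3}): {0: 1, 3: 1, 5: -1},
}

valid_col_list = []
valid_invar_list = []
for key, theta_by_col in invar_dict.items():
    columns = sorted(key)  # fixed, reproducible column order
    valid_col_list.append(columns)
    valid_invar_list.append([theta_by_col[c] for c in columns])
```

With this layout the pairing no longer depends on the iteration order of the frozenset.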

benchmark dataset HDFS.npz???

I could not find "HDFS.npz" in the data path ..\loglizer\data\HDFS to run the benchmark. Where can we get the file?

Thanks in advance...

-Ranjith G

loading

Hi, when I run PCA_hdfs.py I get the following error at this line of data_loader.py (the label file cannot be loaded):
line 32
label_df = pd.read_csv(label_path, delimiter=r'\s+', header=None, usecols = [0], dtype =int) # usecols must be a list

Error:
File "pandas/_libs/parsers.pyx", line 1162, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for long() with base 10: 'BlockId,Label'
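
The message suggests that the header text 'BlockId,Label' was parsed as data: the call passes header=None and dtype=int, while the file apparently starts with a header row and is comma-separated. The sketch below shows one way to read such a file with Python's standard csv module, which is a swapped-in technique and not the pandas call used by data_loader.py. The file contents and the 'Normal'/'Anomaly' label values are assumptions.

```python
import csv
import io

# Hypothetical label file; the 'BlockId,Label' header matches the error message,
# while the label values 'Normal' and 'Anomaly' are assumed for illustration.
LABEL_FILE = """BlockId,Label
blk_-1608999687919862906,Normal
blk_7503483334202473044,Anomaly
"""

# DictReader consumes the first row as the header instead of treating it as data.
reader = csv.DictReader(io.StringIO(LABEL_FILE))
labels = {row["BlockId"]: 1 if row["Label"] == "Anomaly" else 0 for row in reader}
```

In pandas terms, this roughly corresponds to letting read_csv treat the first row as the header instead of passing header=None, and converting the label column to integers afterwards.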

about "epsilon" in mining_invariants #2

Is "epsilon" a hyperparameter, or is it calculated by some formula? (The mining invariants paper by Lou et al. does not give a formula for "epsilon".)

demo_bgl failed to run

Hi,
I used the latest loglizer to test demo_bgl in the benchmark folder and found the following problem.
I used BGL_templates.csv from the latest logparser as log_event_mapping, but the result was as follows:
File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1162, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'EventId,EventTemplate'
Thank you!
