zentures / sequence
(Unmaintained) High performance sequential log analyzer and parser
Home Page: http://sequencer.io
Steps to Reproduce:
echo "t=|" > input.txt
go run sequence.go analyze --input input.txt
Expected Results:
2016/05/12 13:13:56 Analyzed 1 messages, found 1 unique patterns, 1 are new.
(No error and message is analyzed.)
Actual Results:
2016/05/12 13:13:56 Error analyzing: t=|
2016/05/12 13:13:56 Analyzed 1 messages, found 0 unique patterns, 0 are new.
Comments:
I think something is going wrong with the heuristics for key=value pairs. I found this bug while processing an actual log file. One of the log events in question:
81.181.146.13 - - [15/Mar/2005:05:06:49 -0500] "GET //cgi-bin/awstats/awstats.pl?configdir=|%20id%20| HTTP/1.1" 404 1050 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
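For comparison, a tokenizer that splits a key=value token only on the first `=` would keep values like `|` or URL-encoded payloads intact. This is my own standalone sketch of that idea, not sequence's actual heuristic; `splitKV` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// splitKV splits a token on the first '=' only, so values containing
// arbitrary characters (pipes, percent-encoding, more '=' signs)
// survive intact. Hypothetical helper, not part of sequence.
func splitKV(tok string) (key, val string, ok bool) {
	parts := strings.SplitN(tok, "=", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	k, v, _ := splitKV("t=|")
	fmt.Printf("key=%q value=%q\n", k, v) // key="t" value="|"
	k, v, _ = splitKV("configdir=|%20id%20|")
	fmt.Printf("key=%q value=%q\n", k, v) // key="configdir" value="|%20id%20|"
}
```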
Hi,
I tried to add a rule for this log message from sshd.
msg:
Feb 06 13:37:00 box sshd[4388]: Accepted publickey for cryptix from dead:beef:1234:5678:223:32ff:feb1:2e50 port 58251 ssh2: RSA de:ad:be:ef:74:a6:bb:45:45:52:71:de:b2:12:34:56
rule:
%msgtime% %apphost% %appname% [ %sessionid% ] : Accepted publickey for %dstuser% from %srcipv6% port %integer% ssh2: RSA %string%
but I get Error (sequence: no pattern matched for this message).
I can't match the address or the fingerprint because they are tokenized too much.
Here is what sequence scan -m returns for the message:
# 0: { Field="%funknown%", Type="%time%", Value="Feb 06 16:00:44" }
# 1: { Field="%funknown%", Type="%literal%", Value="higgs" }
# 2: { Field="%funknown%", Type="%literal%", Value="sshd" }
# 3: { Field="%funknown%", Type="%literal%", Value="[" }
# 4: { Field="%funknown%", Type="%integer%", Value="4388" }
# 5: { Field="%funknown%", Type="%literal%", Value="]" }
# 6: { Field="%funknown%", Type="%literal%", Value=":" }
# 7: { Field="%funknown%", Type="%literal%", Value="Accepted" }
# 8: { Field="%funknown%", Type="%literal%", Value="publickey" }
# 9: { Field="%funknown%", Type="%literal%", Value="for" }
# 10: { Field="%funknown%", Type="%literal%", Value="cryptix" }
# 11: { Field="%funknown%", Type="%literal%", Value="from" }
# 12: { Field="%funknown%", Type="%literal%", Value="dead" }
# 13: { Field="%funknown%", Type="%literal%", Value=":" }
# 14: { Field="%funknown%", Type="%literal%", Value="beef" }
# 15: { Field="%funknown%", Type="%literal%", Value=":" }
# 16: { Field="%funknown%", Type="%integer%", Value="1234" }
# 17: { Field="%funknown%", Type="%literal%", Value=":" }
# 18: { Field="%funknown%", Type="%integer%", Value="5678" }
# 19: { Field="%funknown%", Type="%literal%", Value=":" }
# 20: { Field="%funknown%", Type="%integer%", Value="223" }
# 21: { Field="%funknown%", Type="%literal%", Value=":" }
# 22: { Field="%funknown%", Type="%literal%", Value="32ff" }
# 23: { Field="%funknown%", Type="%literal%", Value=":" }
# 24: { Field="%funknown%", Type="%literal%", Value="feb1" }
# 25: { Field="%funknown%", Type="%literal%", Value=":" }
# 26: { Field="%funknown%", Type="%literal%", Value="2e50" }
# 27: { Field="%funknown%", Type="%literal%", Value="port" }
# 28: { Field="%funknown%", Type="%integer%", Value="58251" }
# 29: { Field="%funknown%", Type="%literal%", Value="ssh2" }
# 30: { Field="%funknown%", Type="%literal%", Value=":" }
# 31: { Field="%funknown%", Type="%literal%", Value="RSA" }
# 32: { Field="%funknown%", Type="%mac%", Value="de:ad:be:ef:74:a6" }
# 33: { Field="%funknown%", Type="%literal%", Value=":" }
# 34: { Field="%funknown%", Type="%mac%", Value="bb:45:45:52:71:de" }
# 35: { Field="%funknown%", Type="%literal%", Value=":" }
# 36: { Field="%funknown%", Type="%literal%", Value="b2" }
# 37: { Field="%funknown%", Type="%literal%", Value=":" }
# 38: { Field="%funknown%", Type="%integer%", Value="12" }
# 39: { Field="%funknown%", Type="%literal%", Value=":" }
# 40: { Field="%funknown%", Type="%integer%", Value="34" }
# 41: { Field="%funknown%", Type="%literal%", Value=":" }
# 42: { Field="%funknown%", Type="%integer%", Value="56" }
I would like to see this:
# 0: { Field="%funknown%", Type="%time%", Value="Feb 06 16:00:44" }
# 1: { Field="%funknown%", Type="%literal%", Value="higgs" }
# 2: { Field="%funknown%", Type="%literal%", Value="sshd" }
# 3: { Field="%funknown%", Type="%literal%", Value="[" }
# 4: { Field="%funknown%", Type="%integer%", Value="4388" }
# 5: { Field="%funknown%", Type="%literal%", Value="]" }
# 6: { Field="%funknown%", Type="%literal%", Value=":" }
# 7: { Field="%funknown%", Type="%literal%", Value="Accepted" }
# 8: { Field="%funknown%", Type="%literal%", Value="publickey" }
# 9: { Field="%funknown%", Type="%literal%", Value="for" }
# 10: { Field="%funknown%", Type="%literal%", Value="cryptix" }
# 11: { Field="%funknown%", Type="%literal%", Value="from" }
# 12: { Field="%funknown%", Type="%ipv6%", Value="2a02:8108:2140:6b64:223:32ff:feb1:2e50" }
# 13: { Field="%funknown%", Type="%literal%", Value="port" }
# 14: { Field="%funknown%", Type="%integer%", Value="58251" }
# 15: { Field="%funknown%", Type="%literal%", Value="ssh2" }
# 16: { Field="%funknown%", Type="%literal%", Value=":" }
# 17: { Field="%funknown%", Type="%literal%", Value="RSA" }
# 18: { Field="%funknown%", Type="%fingerprint%", Value="d1:93:fd:09:74:a6:bb:45:45:52:71:de:b2:38:9b:54" }
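Both token types can be validated cheaply before a generic tokenizer splits on `:`. The following is a standalone Go sketch of that check, not sequence's scanner; `isIPv6` and `fpRe` are my own names, and the fingerprint pattern assumes the 16-group colon-separated MD5 form shown above:

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// fpRe matches a colon-separated hex fingerprint of 16 two-digit
// groups, as in an ssh key MD5 fingerprint. Hypothetical pattern.
var fpRe = regexp.MustCompile(`^([0-9a-f]{2}:){15}[0-9a-f]{2}$`)

// isIPv6 reports whether s parses as an IPv6 address.
func isIPv6(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && ip.To4() == nil
}

func main() {
	fmt.Println(isIPv6("dead:beef:1234:5678:223:32ff:feb1:2e50")) // true
	fmt.Println(fpRe.MatchString("de:ad:be:ef:74:a6:bb:45:45:52:71:de:b2:12:34:56")) // true
}
```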
kind regards,
Steps to Reproduce:
echo "get http://example.com" > input.txt
go run sequence.go analyze --input input.txt --output patterns.txt
go run sequence.go parse --input input.txt --patterns patterns.txt
Expected Results:
Message is parsed and there is no error.
Actual Results:
2016/05/12 12:18:40 Error (sequence: no pattern matched for this message) parsing: get http://example.com
2016/05/12 12:18:40 Parsed 1 messages in 0.00 secs, ~ 999.90 msgs/sec
2016/05/12 12:18:40 Quiting...
Comments:
There is no error and the results are correct if the URI is removed. I think the URI fails to match because the scanner says it is type uri, but the patterns file is looking for %object%. I was unable to confirm this because %uri% is not accepted in a patterns file (Invalid tag token "%uri%": unknown type), and I have not figured out how to prevent the URI from being tagged as an object.
Also note this bug causes the analyze command to report the incorrect number of new patterns.
What are your thoughts on parsing iostat output?
For example:
03/18/2015 03:17:55 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.44 0.00 0.38 0.31 0.00 98.87
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 64.00 0.00 1584.00 0.00 49.50 0.04 0.69 0.69 0.00 0.25 1.60
xvda 0.00 0.00 0.00 3.00 0.00 12.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 110.00 0.00 71.00 0.00 1392.00 39.21 0.03 0.39 0.00 0.39 0.39 2.80
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 64.00 0.00 1584.00 0.00 49.50 0.04 0.69 0.69 0.00 0.25 1.60
03/18/2015 03:17:56 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.19 0.00 0.13 0.19 0.00 99.50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 35.00 0.00 708.00 0.00 40.46 0.06 1.71 1.71 0.00 0.46 1.60
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 140.00 0.00 87.00 0.00 1472.00 33.84 0.01 0.14 0.00 0.14 0.14 1.20
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 35.00 0.00 708.00 0.00 40.46 0.06 1.71 1.71 0.00 0.46 1.60
03/18/2015 03:17:57 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 0.06 0.13 0.00 99.69
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 18.00 0.00 456.00 0.00 50.67 0.00 0.00 0.00 0.00 0.00 0.00
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 125.00 0.00 78.00 0.00 1332.00 34.15 0.04 0.46 0.00 0.46 0.36 2.80
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 18.00 0.00 456.00 0.00 50.67 0.00 0.00 0.00 0.00 0.00 0.00
In this case it's a bit funny, because iostat splits the output over multiple lines.
So you need to extract the timestamp on the first line, then combine it with the lines below, up until the next timestamp, to get one "entry" - i.e. the CPU/IO stats at that one-second instant.
The CPU info itself is split over two lines, and the per-device IO is laid out as a table.
Can you see sequence being used to parse something like this, or is it a bit out of scope?
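Whatever sequence decides, the grouping step would have to happen before tokenizing: treat each timestamp line as the start of a record and attach every line up to the next timestamp. A rough standalone sketch of that pre-pass (none of these names come from sequence):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tsRe matches iostat's timestamp lines, e.g. "03/18/2015 03:17:55 AM".
var tsRe = regexp.MustCompile(`^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (AM|PM)$`)

// groupEntries splits iostat output into one slice of lines per
// timestamp, dropping blank lines and any preamble before the
// first timestamp.
func groupEntries(out string) [][]string {
	var entries [][]string
	for _, line := range strings.Split(out, "\n") {
		switch {
		case tsRe.MatchString(line):
			entries = append(entries, []string{line})
		case len(entries) > 0 && strings.TrimSpace(line) != "":
			last := len(entries) - 1
			entries[last] = append(entries[last], line)
		}
	}
	return entries
}

func main() {
	out := "03/18/2015 03:17:55 AM\navg-cpu: ...\n03/18/2015 03:17:56 AM\navg-cpu: ..."
	for _, e := range groupEntries(out) {
		fmt.Println(len(e), "lines starting at", e[0])
	}
}
```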
Consider the exim log message:
2015-02-11 11:04:40 H=(amoricanexpress.com) [64.20.195.132]:10246 F=<[email protected]> rejected RCPT <[email protected]>: Sender verify failed
I might make the pattern:
%msgtime% H=( %srchost% ) [ %srcipv4% ] : %srcport% F=< %srcemail% > rejected RCPT < %dstemail% >: Sender verify failed
But I cannot find a way to turn the Sender verify failed bit into a single field, because %string% appears to break on whitespace.
Any ideas?
It's great to see the analyzer finally released with all the other bits, by the way. This project is amazing.
I have never built any Go code and have not yet worked out how to build this project.
Neither go build nor go run sequence.go produces anything useful.
Please consider including a two-liner on how to build/run, if possible.
Hi,
I'd like to discuss the possibility of writing some kind of patterndb integration. I was thinking about a program that would generate the syslog-ng db from sequence analyzer output.
What are your thoughts/ideas/comments on that?
Since a path is a frequent object in a log, it would be very useful if it had a parsing category like "timeFormats", so we would be able to parse different path syntaxes (Linux vs. Windows, file/folder, %appdata%, ../../path, ./path, etc.)
Currently:
C:\test\test\test.cxx
would be analyzed as:
c : %string%
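A path token type could be recognized with a couple of patterns before the generic tokenizer splits on `:` and `\`. This is a hypothetical sketch of such a classifier, not sequence code; the regexes only cover the common cases mentioned above:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// winPathRe matches drive-letter paths like C:\test\test.cxx.
	winPathRe = regexp.MustCompile(`^[A-Za-z]:\\[^\s]*$`)
	// unixPathRe matches absolute and relative Unix paths
	// like /var/log, ./path, or ../../path.
	unixPathRe = regexp.MustCompile(`^(\.{1,2}/|/)[^\s]*$`)
)

// pathType classifies a token; "" means it is not a recognized path.
func pathType(tok string) string {
	switch {
	case winPathRe.MatchString(tok):
		return "windows"
	case unixPathRe.MatchString(tok):
		return "unix"
	}
	return ""
}

func main() {
	fmt.Println(pathType(`C:\test\test\test.cxx`)) // windows
	fmt.Println(pathType("../../path"))            // unix
	fmt.Println(pathType("hello") == "")           // true
}
```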
I have a number of log sources that have characters that force the analyzer to stop before completion with an "unknown token encountered" error. Is there a way to run the analyzer in a "best effort" mode so that if the analyzer encounters a line with characters it is unable to tokenize it skips that line?
If not, is there documentation on how one might deal with this type of error? I have no problem pre-processing the logs; I am just unable to tell from the source code alone which characters result in the TokenUnknown case.
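Until a best-effort mode exists, one workaround is to pre-filter the input to lines that are plain printable ASCII. Note the ASCII-only assumption here is my guess at what the tokenizer accepts, not something derived from sequence's source:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isScannable reports whether every byte in the line is printable
// ASCII (space through '~'). This is an assumption about what the
// tokenizer accepts, not a rule taken from sequence's code.
func isScannable(line string) bool {
	for i := 0; i < len(line); i++ {
		if line[i] < 0x20 || line[i] > 0x7e {
			return false
		}
	}
	return true
}

func main() {
	in := "good line\nbad \x01 line\nanother good line"
	sc := bufio.NewScanner(strings.NewReader(in))
	kept := 0
	for sc.Scan() {
		if isScannable(sc.Text()) {
			kept++ // in real use, write the line to the filtered file
		}
	}
	fmt.Println("kept", kept, "of 3 lines") // kept 2 of 3 lines
}
```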
(First, thanks for this tool - it's potentially really useful)
When analysing a log file, it would be very useful to have not only the patterns and the example line, but also the number of lines in the log file that matched each pattern, to get an idea of the prevalence of each pattern in the file.
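In the meantime this can be approximated by counting how often each pattern fires in a second pass over the file. The sketch below uses a toy stand-in for the analyzer (`toPattern` just collapses numeric tokens to %integer%; it is not sequence's algorithm) to show the counting shape:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var numRe = regexp.MustCompile(`^\d+$`)

// toPattern is a toy stand-in for sequence's analyzer: it replaces
// purely numeric tokens with %integer% so similar lines collapse
// into one pattern.
func toPattern(line string) string {
	toks := strings.Fields(line)
	for i, t := range toks {
		if numRe.MatchString(t) {
			toks[i] = "%integer%"
		}
	}
	return strings.Join(toks, " ")
}

func main() {
	lines := []string{"accepted port 22", "accepted port 2222", "rejected port 80"}
	counts := map[string]int{}
	for _, l := range lines {
		counts[toPattern(l)]++
	}
	for pat, n := range counts {
		fmt.Printf("%6d  %s\n", n, pat)
	}
}
```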
>> go test
# github.com/strace/sequence
./analyzer_test.go:188:59: Sprintf format %s has arg len(atree.levels[l]) of wrong type int
FAIL github.com/strace/sequence [build failed]
Hi,
do you have an idea why these rules don't match these messages?
msg:
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max connection rate 1/60s for (smtp:5.5.5.5) at Jan 31 21:39:37
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max connection count 1 for (smtp:5.5.5.5) at Jan 31 21:39:37
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max cache size 1 at Jan 31 21:39:37
rules:
%msgtime% %apphost% %appname%[%integer%]: statistics: max connection rate %string% for (smtp:%appipv4%) at %time%
%msgtime% %apphost% %appname%[%integer%]: statistics: max connection count %integer% for (smtp:%appipv4%) at %time%
%msgtime% %apphost% %appname%[%integer%]: statistics: max cache size %integer% at %time%
The rules work perfectly, except for the %time% at the end.
Steps to Reproduce:
echo "get //example.com" > input.txt
go run sequence.go scan --input input.txt
Expected Results:
# 0: { Tag="funknown", Type="uri", Value="//example.com", ... }
Actual Results:
# 0: { Tag="funknown", Type="literal", Value="//example.com", ... }
Comments:
I found this bug processing an actual log file. One of the log events in question:
81.181.146.13 - - [15/Mar/2005:05:06:49 -0500] "GET //cgi-bin/awstats/awstats.pl?configdir=|%20id%20| HTTP/1.1" 404 1050 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
A related question: what is the best way to handle relative URIs? Sequence's heuristic algorithm for processing URIs breaks down on these...
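For detection at least, Go's standard net/url package already understands scheme-relative references, so a scanner could classify //example.com before generic tokenizing. A sketch of my own (uriKind is a hypothetical helper, not part of sequence):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// uriKind classifies a token as an absolute URI, a scheme-relative
// URI (//host/...), or not a URI at all. Hypothetical helper.
func uriKind(tok string) string {
	u, err := url.Parse(tok)
	if err != nil {
		return "not-a-uri"
	}
	switch {
	case u.Scheme != "":
		return "absolute"
	case strings.HasPrefix(tok, "//") && u.Host != "":
		return "scheme-relative"
	}
	return "not-a-uri"
}

func main() {
	fmt.Println(uriKind("http://example.com")) // absolute
	fmt.Println(uriKind("//example.com"))      // scheme-relative
	fmt.Println(uriKind("hello"))              // not-a-uri
}
```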
Hi,
I can't match the port at the end of this message.
msg:
Feb 06 15:56:09 higgs sshd[902]: Server listening on 0.0.0.0 port 22.
rule:
%msgtime% %apphost% %appname% [ %sessionid% ] : Server listening on %srcipv4% port %integer% .
It's a minor issue in this simple case but it's a bit confusing while writing rules.
Hi,
I have been working with sequence for the last two months, extending it to output its patterns in syslog-ng patterndb and Logstash grok formats. I have had to make a few changes to the sequence code, largely around remembering where the spaces are and adding a database so patterns can be printed on demand rather than after each analysis, among other things.
It is in a company repo for now, but the goal is to make it available to the open source community.
I would love to discuss this with you.
https://www.linkedin.com/in/louise-harding-3b964551/
Regards
Louise
devid=0 date="2013/05/21 09:53:17" dname=themis logtype=9 pri=6 ver=0.3.0 mod=webui from=10.1.5.200 agent="Mozilla/5.0 " admin=administrator act=登录 result=0 msg="成功" dsp_msg="administrator 登录" fwlog=0
(The Chinese field values translate as: act=登录 "login", msg="成功" "success", dsp_msg "administrator logged in".)
I was hoping to run a few quick tests; however, I am unable to find the data folder containing the log messages. I checked all your GitHub repos: github.com/strace/sequence, zentures/sequence, and the library download.
When analyzing this part of the log,
GET%3Cbody%3E%3CSCRIPT
the analyzer identifies "%3Cbody%" as a tag, although it is only part of the URL-encoded GET request.
Steps to reproduce:
go run sequence.go analyze -i input.txt
Expected output:
%action% %object%
#1 log messages matched
# get http://example.com
2016/05/12 11:27:43 Analyzed 1 messages, found 1 unique patterns, 1 are new.
Actual output:
2016/05/12 11:27:43 Analyzed 1 messages, found 1 unique patterns, 1 are new.
Hi,
I'm trying to write rules for postfix; most of them contain a hexadecimal string, like a message ID.
Feb 4 02:01:15 mail postfix/oqmgr[86819]: BF468251C: removed
Feb 4 02:01:17 mail postfix/oqmgr[86819]: 746702526: removed
Feb 4 02:33:33 mail postfix/oqmgr[86819]: 24CB02536: removed
Feb 4 04:01:55 mail postfix/oqmgr[86819]: EC3DE2562: removed
The problem here for me is that in some cases these IDs are all digits, and then sequence wants them to be an %integer% instead, forcing me to duplicate my rules.
Another instance of this: when it logs delays like
delay=0.4
delay=1
the numbers are tokenized as %float% and %integer% respectively.
With each such case the number of rules explodes. I'm asking whether it wouldn't make more sense to let later processing deal with the validity of single fields. Maybe collapse float and int into a single number type? Or maybe have another %literal% that ignores the type.
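Collapsing the two types is easy to prototype: anything strconv.ParseFloat accepts (which includes plain integers) could be emitted as a single token type. A sketch of that idea under a hypothetical %number% tag, not sequence's current behavior:

```go
package main

import (
	"fmt"
	"strconv"
)

// classify returns "%number%" for anything that parses as a float,
// which covers plain integers too, so "delay=1" and "delay=0.4"
// would produce the same pattern. %number% is a hypothetical tag.
func classify(tok string) string {
	if _, err := strconv.ParseFloat(tok, 64); err == nil {
		return "%number%"
	}
	return "%literal%"
}

func main() {
	for _, t := range []string{"1", "0.4", "BF468251C"} {
		fmt.Printf("%s -> %s\n", t, classify(t))
	}
	// 1 -> %number%
	// 0.4 -> %number%
	// BF468251C -> %literal%
}
```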
kind regards,
cryptix