zentures / sequence
(Unmaintained) High performance sequential log analyzer and parser
Home Page: http://sequencer.io
Steps to Reproduce:
echo "t=|" > input.txt
go run sequence.go analyze --input input.txt
Expected Results:
2016/05/12 13:13:56 Analyzed 1 messages, found 1 unique patterns, 1 are new.
(No error and message is analyzed.)
Actual Results:
2016/05/12 13:13:56 Error analyzing: t=|
2016/05/12 13:13:56 Analyzed 1 messages, found 0 unique patterns, 0 are new.
Comments:
I think something is going wrong with the heuristics for key=value pairs. I found this bug while processing an actual log file. One of the log events in question:
81.181.146.13 - - [15/Mar/2005:05:06:49 -0500] "GET //cgi-bin/awstats/awstats.pl?configdir=|%20id%20| HTTP/1.1" 404 1050 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
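For comparison, a tokenizer that splits a key=value token only on the first `=` would keep values like `|` or URL-encoded payloads intact. This is my own standalone sketch of that idea, not sequence's actual heuristic; `splitKV` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// splitKV splits a token on the first '=' only, so values containing
// arbitrary characters (pipes, percent-encoding, more '=' signs)
// survive intact. Hypothetical helper, not part of sequence.
func splitKV(tok string) (key, val string, ok bool) {
	parts := strings.SplitN(tok, "=", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	k, v, _ := splitKV("t=|")
	fmt.Printf("key=%q value=%q\n", k, v) // key="t" value="|"
	k, v, _ = splitKV("configdir=|%20id%20|")
	fmt.Printf("key=%q value=%q\n", k, v) // key="configdir" value="|%20id%20|"
}
```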
Hi,
I tried to add a rule for this log message from sshd.
msg:
Feb 06 13:37:00 box sshd[4388]: Accepted publickey for cryptix from dead:beef:1234:5678:223:32ff:feb1:2e50 port 58251 ssh2: RSA de:ad:be:ef:74:a6:bb:45:45:52:71:de:b2:12:34:56
rule:
%msgtime% %apphost% %appname% [ %sessionid% ] : Accepted publickey for %dstuser% from %srcipv6% port %integer% ssh2: RSA %string%
but I get Error (sequence: no pattern matched for this message).
I can't match the address or the fingerprint because they are tokenized too much.
Here is what sequence scan -m returns for the message:
# 0: { Field="%funknown%", Type="%time%", Value="Feb 06 16:00:44" }
# 1: { Field="%funknown%", Type="%literal%", Value="higgs" }
# 2: { Field="%funknown%", Type="%literal%", Value="sshd" }
# 3: { Field="%funknown%", Type="%literal%", Value="[" }
# 4: { Field="%funknown%", Type="%integer%", Value="4388" }
# 5: { Field="%funknown%", Type="%literal%", Value="]" }
# 6: { Field="%funknown%", Type="%literal%", Value=":" }
# 7: { Field="%funknown%", Type="%literal%", Value="Accepted" }
# 8: { Field="%funknown%", Type="%literal%", Value="publickey" }
# 9: { Field="%funknown%", Type="%literal%", Value="for" }
# 10: { Field="%funknown%", Type="%literal%", Value="cryptix" }
# 11: { Field="%funknown%", Type="%literal%", Value="from" }
# 12: { Field="%funknown%", Type="%literal%", Value="dead" }
# 13: { Field="%funknown%", Type="%literal%", Value=":" }
# 14: { Field="%funknown%", Type="%literal%", Value="beef" }
# 15: { Field="%funknown%", Type="%literal%", Value=":" }
# 16: { Field="%funknown%", Type="%integer%", Value="1234" }
# 17: { Field="%funknown%", Type="%literal%", Value=":" }
# 18: { Field="%funknown%", Type="%integer%", Value="5678" }
# 19: { Field="%funknown%", Type="%literal%", Value=":" }
# 20: { Field="%funknown%", Type="%integer%", Value="223" }
# 21: { Field="%funknown%", Type="%literal%", Value=":" }
# 22: { Field="%funknown%", Type="%literal%", Value="32ff" }
# 23: { Field="%funknown%", Type="%literal%", Value=":" }
# 24: { Field="%funknown%", Type="%literal%", Value="feb1" }
# 25: { Field="%funknown%", Type="%literal%", Value=":" }
# 26: { Field="%funknown%", Type="%literal%", Value="2e50" }
# 27: { Field="%funknown%", Type="%literal%", Value="port" }
# 28: { Field="%funknown%", Type="%integer%", Value="58251" }
# 29: { Field="%funknown%", Type="%literal%", Value="ssh2" }
# 30: { Field="%funknown%", Type="%literal%", Value=":" }
# 31: { Field="%funknown%", Type="%literal%", Value="RSA" }
# 32: { Field="%funknown%", Type="%mac%", Value="de:ad:be:ef:74:a6" }
# 33: { Field="%funknown%", Type="%literal%", Value=":" }
# 34: { Field="%funknown%", Type="%mac%", Value="bb:45:45:52:71:de" }
# 35: { Field="%funknown%", Type="%literal%", Value=":" }
# 36: { Field="%funknown%", Type="%literal%", Value="b2" }
# 37: { Field="%funknown%", Type="%literal%", Value=":" }
# 38: { Field="%funknown%", Type="%integer%", Value="12" }
# 39: { Field="%funknown%", Type="%literal%", Value=":" }
# 40: { Field="%funknown%", Type="%integer%", Value="34" }
# 41: { Field="%funknown%", Type="%literal%", Value=":" }
# 42: { Field="%funknown%", Type="%integer%", Value="56" }
I would like to see this:
# 0: { Field="%funknown%", Type="%time%", Value="Feb 06 16:00:44" }
# 1: { Field="%funknown%", Type="%literal%", Value="higgs" }
# 2: { Field="%funknown%", Type="%literal%", Value="sshd" }
# 3: { Field="%funknown%", Type="%literal%", Value="[" }
# 4: { Field="%funknown%", Type="%integer%", Value="4388" }
# 5: { Field="%funknown%", Type="%literal%", Value="]" }
# 6: { Field="%funknown%", Type="%literal%", Value=":" }
# 7: { Field="%funknown%", Type="%literal%", Value="Accepted" }
# 8: { Field="%funknown%", Type="%literal%", Value="publickey" }
# 9: { Field="%funknown%", Type="%literal%", Value="for" }
# 10: { Field="%funknown%", Type="%literal%", Value="cryptix" }
# 11: { Field="%funknown%", Type="%literal%", Value="from" }
# 12: { Field="%funknown%", Type="%ipv6%", Value="2a02:8108:2140:6b64:223:32ff:feb1:2e50" }
# 13: { Field="%funknown%", Type="%literal%", Value="port" }
# 14: { Field="%funknown%", Type="%integer%", Value="58251" }
# 15: { Field="%funknown%", Type="%literal%", Value="ssh2" }
# 16: { Field="%funknown%", Type="%literal%", Value=":" }
# 17: { Field="%funknown%", Type="%literal%", Value="RSA" }
# 18: { Field="%funknown%", Type="%fingerprint%", Value="d1:93:fd:09:74:a6:bb:45:45:52:71:de:b2:38:9b:54" }
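Both token types can be validated cheaply before a generic tokenizer splits on `:`. The following is a standalone Go sketch of that check, not sequence's scanner; `isIPv6` and `fpRe` are my own names, and the fingerprint pattern assumes the 16-group colon-separated MD5 form shown above:

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// fpRe matches a colon-separated hex fingerprint of 16 two-digit
// groups, as in an ssh key MD5 fingerprint. Hypothetical pattern.
var fpRe = regexp.MustCompile(`^([0-9a-f]{2}:){15}[0-9a-f]{2}$`)

// isIPv6 reports whether s parses as an IPv6 address.
func isIPv6(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && ip.To4() == nil
}

func main() {
	fmt.Println(isIPv6("dead:beef:1234:5678:223:32ff:feb1:2e50")) // true
	fmt.Println(fpRe.MatchString("de:ad:be:ef:74:a6:bb:45:45:52:71:de:b2:12:34:56")) // true
}
```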
kind regards,
Steps to Reproduce:
echo "get http://example.com" > input.txt
go run sequence.go analyze --input input.txt --output patterns.txt
go run sequence.go parse --input input.txt --patterns patterns.txt
Expected Results:
Message is parsed and there is no error.
Actual Results:
2016/05/12 12:18:40 Error (sequence: no pattern matched for this message) parsing: get http://example.com
2016/05/12 12:18:40 Parsed 1 messages in 0.00 secs, ~ 999.90 msgs/sec
2016/05/12 12:18:40 Quiting...
Comments:
There is no error and the results are correct if the URI is removed. I think the URI fails to match because the scanner says it is type uri, but the patterns file is looking for %object%. I was unable to confirm this because %uri% is not accepted in a patterns file (Invalid tag token "%uri%": unknown type), and I have not figured out how to prevent the URI from being tagged as an object.
Also note this bug causes the analyze command to report the incorrect number of new patterns.
What are your thoughts on parsing iostat output?
For example:
03/18/2015 03:17:55 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.44 0.00 0.38 0.31 0.00 98.87
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 64.00 0.00 1584.00 0.00 49.50 0.04 0.69 0.69 0.00 0.25 1.60
xvda 0.00 0.00 0.00 3.00 0.00 12.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 110.00 0.00 71.00 0.00 1392.00 39.21 0.03 0.39 0.00 0.39 0.39 2.80
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 64.00 0.00 1584.00 0.00 49.50 0.04 0.69 0.69 0.00 0.25 1.60
03/18/2015 03:17:56 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.19 0.00 0.13 0.19 0.00 99.50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 35.00 0.00 708.00 0.00 40.46 0.06 1.71 1.71 0.00 0.46 1.60
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 140.00 0.00 87.00 0.00 1472.00 33.84 0.01 0.14 0.00 0.14 0.14 1.20
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 35.00 0.00 708.00 0.00 40.46 0.06 1.71 1.71 0.00 0.46 1.60
03/18/2015 03:17:57 AM
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 0.06 0.13 0.00 99.69
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdb 0.00 0.00 18.00 0.00 456.00 0.00 50.67 0.00 0.00 0.00 0.00 0.00 0.00
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdf 0.00 125.00 0.00 78.00 0.00 1332.00 34.15 0.04 0.46 0.00 0.46 0.36 2.80
xvde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 18.00 0.00 456.00 0.00 50.67 0.00 0.00 0.00 0.00 0.00 0.00
In this case it's a bit funny, because iostat splits the output over multiple lines.
So you need to extract the timestamp on the first line, then combine it with the lines below, up until the next timestamp, to get one "entry" - i.e. the CPU/IO stats at that one-second instant.
The CPU info itself is split over two lines, and the per-device IO is laid out as a table.
Can you see sequence being used to parse something like this, or is it a bit out of scope?
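Whatever sequence decides, the grouping step would have to happen before tokenizing: treat each timestamp line as the start of a record and attach every line up to the next timestamp. A rough standalone sketch of that pre-pass (none of these names come from sequence):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tsRe matches iostat's timestamp lines, e.g. "03/18/2015 03:17:55 AM".
var tsRe = regexp.MustCompile(`^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (AM|PM)$`)

// groupEntries splits iostat output into one slice of lines per
// timestamp, dropping blank lines and any preamble before the
// first timestamp.
func groupEntries(out string) [][]string {
	var entries [][]string
	for _, line := range strings.Split(out, "\n") {
		switch {
		case tsRe.MatchString(line):
			entries = append(entries, []string{line})
		case len(entries) > 0 && strings.TrimSpace(line) != "":
			last := len(entries) - 1
			entries[last] = append(entries[last], line)
		}
	}
	return entries
}

func main() {
	out := "03/18/2015 03:17:55 AM\navg-cpu: ...\n03/18/2015 03:17:56 AM\navg-cpu: ..."
	for _, e := range groupEntries(out) {
		fmt.Println(len(e), "lines starting at", e[0])
	}
}
```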
Consider the exim log message:
2015-02-11 11:04:40 H=(amoricanexpress.com) [64.20.195.132]:10246 F=<[email protected]> rejected RCPT <[email protected]>: Sender verify failed
I might make the pattern:
%msgtime% H=( %srchost% ) [ %srcipv4% ] : %srcport% F=< %srcemail% > rejected RCPT < %dstemail% >: Sender verify failed
But I cannot find a way to turn the Sender verify failed bit into a single field, because %string% appears to break on whitespace.
Any ideas?
It's great to see the analyzer finally released with all the other bits, by the way. This project is amazing.
I have never built any Go code and have not yet worked out how to build this project.
Neither go build nor go run sequence.go produces anything useful.
Please consider including a two-liner on how to build/run, if possible.
Hi,
I'd like to discuss the possibility of writing some kind of patterndb integration. I was thinking about a program that would generate the syslog-ng db from sequence analyzer output.
What are your thoughts/ideas/comments on that?
Since a path is a frequent object in a log, it would be very useful if it had a parsing category like "timeFormats", so we would be able to parse different path syntaxes (Linux vs. Windows, file/folder, %appdata%, ../../path, ./path, etc.)
Currently:
C:\test\test\test.cxx
would be analyzed as:
c : %string%
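A path token type could be recognized with a couple of patterns before the generic tokenizer splits on `:` and `\`. This is a hypothetical sketch of such a classifier, not sequence code; the regexes only cover the common cases mentioned above:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// winPathRe matches drive-letter paths like C:\test\test.cxx.
	winPathRe = regexp.MustCompile(`^[A-Za-z]:\\[^\s]*$`)
	// unixPathRe matches absolute and relative Unix paths
	// like /var/log, ./path, or ../../path.
	unixPathRe = regexp.MustCompile(`^(\.{1,2}/|/)[^\s]*$`)
)

// pathType classifies a token; "" means it is not a recognized path.
func pathType(tok string) string {
	switch {
	case winPathRe.MatchString(tok):
		return "windows"
	case unixPathRe.MatchString(tok):
		return "unix"
	}
	return ""
}

func main() {
	fmt.Println(pathType(`C:\test\test\test.cxx`)) // windows
	fmt.Println(pathType("../../path"))            // unix
	fmt.Println(pathType("hello") == "")           // true
}
```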
I have a number of log sources that have characters that force the analyzer to stop before completion with an "unknown token encountered" error. Is there a way to run the analyzer in a "best effort" mode so that if the analyzer encounters a line with characters it is unable to tokenize it skips that line?
If not, is there documentation on how one might deal with this type of error? I have no problem pre-processing the logs; I am just unable to tell from the source code alone which characters result in the TokenUnknown case.
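Until a best-effort mode exists, one workaround is to pre-filter the input to lines that are plain printable ASCII. Note the ASCII-only assumption here is my guess at what the tokenizer accepts, not something derived from sequence's source:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isScannable reports whether every byte in the line is printable
// ASCII (space through '~'). This is an assumption about what the
// tokenizer accepts, not a rule taken from sequence's code.
func isScannable(line string) bool {
	for i := 0; i < len(line); i++ {
		if line[i] < 0x20 || line[i] > 0x7e {
			return false
		}
	}
	return true
}

func main() {
	in := "good line\nbad \x01 line\nanother good line"
	sc := bufio.NewScanner(strings.NewReader(in))
	kept := 0
	for sc.Scan() {
		if isScannable(sc.Text()) {
			kept++ // in real use, write the line to the filtered file
		}
	}
	fmt.Println("kept", kept, "of 3 lines") // kept 2 of 3 lines
}
```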
(First, thanks for this tool - it's potentially really useful)
When analysing a log file, it would be very useful to have not only the patterns and the example line, but also the number of lines in the log file that matched each pattern, to get an idea of the prevalence of each pattern in the file.
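In the meantime this can be approximated by counting how often each pattern fires in a second pass over the file. The sketch below uses a toy stand-in for the analyzer (`toPattern` just collapses numeric tokens to %integer%; it is not sequence's algorithm) to show the counting shape:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var numRe = regexp.MustCompile(`^\d+$`)

// toPattern is a toy stand-in for sequence's analyzer: it replaces
// purely numeric tokens with %integer% so similar lines collapse
// into one pattern.
func toPattern(line string) string {
	toks := strings.Fields(line)
	for i, t := range toks {
		if numRe.MatchString(t) {
			toks[i] = "%integer%"
		}
	}
	return strings.Join(toks, " ")
}

func main() {
	lines := []string{"accepted port 22", "accepted port 2222", "rejected port 80"}
	counts := map[string]int{}
	for _, l := range lines {
		counts[toPattern(l)]++
	}
	for pat, n := range counts {
		fmt.Printf("%6d  %s\n", n, pat)
	}
}
```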
>> go test
# github.com/strace/sequence
./analyzer_test.go:188:59: Sprintf format %s has arg len(atree.levels[l]) of wrong type int
FAIL github.com/strace/sequence [build failed]
Hi,
do you have an idea why these rules don't match these messages?
msg:
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max connection rate 1/60s for (smtp:5.5.5.5) at Jan 31 21:39:37
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max connection count 1 for (smtp:5.5.5.5) at Jan 31 21:39:37
Jan 31 21:42:59 mail postfix/anvil[14606]: statistics: max cache size 1 at Jan 31 21:39:37
rules:
%msgtime% %apphost% %appname%[%integer%]: statistics: max connection rate %string% for (smtp:%appipv4%) at %time%
%msgtime% %apphost% %appname%[%integer%]: statistics: max connection count %integer% for (smtp:%appipv4%) at %time%
%msgtime% %apphost% %appname%[%integer%]: statistics: max cache size %integer% at %time%
The rules work perfectly, except for the %time% at the end.
Steps to Reproduce:
echo "get //example.com" > input.txt
go run sequence.go scan --input input.txt
Expected Results:
# 0: { Tag="funknown", Type="uri", Value="//example.com", ... }
Actual Results:
# 0: { Tag="funknown", Type="literal", Value="//example.com", ... }
Comments:
I found this bug processing an actual log file. One of the log events in question:
81.181.146.13 - - [15/Mar/2005:05:06:49 -0500] "GET //cgi-bin/awstats/awstats.pl?configdir=|%20id%20| HTTP/1.1" 404 1050 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
A related question: what is the best way to handle relative URIs? Sequence's heuristic algorithm for processing URIs breaks down on these...
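For detection at least, Go's standard net/url package already understands scheme-relative references, so a scanner could classify //example.com before generic tokenizing. A sketch of my own (uriKind is a hypothetical helper, not part of sequence):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// uriKind classifies a token as an absolute URI, a scheme-relative
// URI (//host/...), or not a URI at all. Hypothetical helper.
func uriKind(tok string) string {
	u, err := url.Parse(tok)
	if err != nil {
		return "not-a-uri"
	}
	switch {
	case u.Scheme != "":
		return "absolute"
	case strings.HasPrefix(tok, "//") && u.Host != "":
		return "scheme-relative"
	}
	return "not-a-uri"
}

func main() {
	fmt.Println(uriKind("http://example.com")) // absolute
	fmt.Println(uriKind("//example.com"))      // scheme-relative
	fmt.Println(uriKind("hello"))              // not-a-uri
}
```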
Hi,
I can't match the port at the end of this message.
msg:
Feb 06 15:56:09 higgs sshd[902]: Server listening on 0.0.0.0 port 22.
rule:
%msgtime% %apphost% %appname% [ %sessionid% ] : Server listening on %srcipv4% port %integer% .
It's a minor issue in this simple case but it's a bit confusing while writing rules.
Hi,
I have been working with sequence for the last two months, extending it to output its patterns in syslog-ng patterndb and Logstash grok formats. I have had to make a few changes to the sequence code, largely around remembering where the spaces are and adding a database so patterns can be printed on demand rather than after each analysis, among other things.
It is in a company repo for now, but the goal is to make it available to the open source community.
I would love to discuss this with you.
https://www.linkedin.com/in/louise-harding-3b964551/
Regards
Louise
devid=0 date="2013/05/21 09:53:17" dname=themis logtype=9 pri=6 ver=0.3.0 mod=webui from=10.1.5.200 agent="Mozilla/5.0 " admin=administrator act=登录 result=0 msg="成功" dsp_msg="administrator 登录" fwlog=0
(The Chinese field values translate as: act=登录 "login", msg="成功" "success", dsp_msg "administrator logged in".)
I was hoping to run a few quick tests; however, I am unable to find the data folder containing the log messages. I checked all your GitHub repos: github.com/strace/sequence, zentures/sequence, and the library download.
When analyzing this part of the log,
GET%3Cbody%3E%3CSCRIPT
the analyzer identifies "%3Cbody%" as a tag, although it is only part of the URL-encoded GET request.
Steps to reproduce:
go run sequence.go analyze -i input.txt
Expected output:
%action% %object%
#1 log messages matched
# get http://example.com
2016/05/12 11:27:43 Analyzed 1 messages, found 1 unique patterns, 1 are new.
Actual output:
2016/05/12 11:27:43 Analyzed 1 messages, found 1 unique patterns, 1 are new.
Hi,
I'm trying to write rules for postfix; most of them contain a hexadecimal string, like a message ID.
Feb 4 02:01:15 mail postfix/oqmgr[86819]: BF468251C: removed
Feb 4 02:01:17 mail postfix/oqmgr[86819]: 746702526: removed
Feb 4 02:33:33 mail postfix/oqmgr[86819]: 24CB02536: removed
Feb 4 04:01:55 mail postfix/oqmgr[86819]: EC3DE2562: removed
The problem here for me is that in some cases these IDs are all digits, and then sequence wants them to be an %integer% instead, forcing me to duplicate my rules.
Another instance of this: when it logs delays like
delay=0.4
delay=1
the numbers are tokenized as %float% and %integer% respectively.
With each such case the number of rules explodes. I'm asking whether it wouldn't make more sense to let later processing deal with the validity of single fields. Maybe collapse float and int into a single number type? Or maybe have another %literal% that ignores the type.
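Collapsing the two types is easy to prototype: anything strconv.ParseFloat accepts (which includes plain integers) could be emitted as a single token type. A sketch of that idea under a hypothetical %number% tag, not sequence's current behavior:

```go
package main

import (
	"fmt"
	"strconv"
)

// classify returns "%number%" for anything that parses as a float,
// which covers plain integers too, so "delay=1" and "delay=0.4"
// would produce the same pattern. %number% is a hypothetical tag.
func classify(tok string) string {
	if _, err := strconv.ParseFloat(tok, 64); err == nil {
		return "%number%"
	}
	return "%literal%"
}

func main() {
	for _, t := range []string{"1", "0.4", "BF468251C"} {
		fmt.Printf("%s -> %s\n", t, classify(t))
	}
	// 1 -> %number%
	// 0.4 -> %number%
	// BF468251C -> %literal%
}
```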
kind regards,
cryptix