grokky

Package grokky is a pure Go, Grok-like pattern library that helps you parse log files and other text. It is based on RE2 regular expressions, which are much faster than Oniguruma in some cases. Check out the "much more faster" article to understand the difference.

The library was designed for creating many patterns and using them many times. Its behavior and capabilities are slightly different from the original library. The goals of the library are:

  1. simplicity,
  2. speed,
  3. ease of use.

Also

See also another golang implementation vjeantet/grok that is closer to the original library.

The difference:

  1. grokky allows named captures only. The name of a pattern is just a name and nothing more; you can treat it as an alias for a regexp. It's impossible to use the name of a pattern as a capture group. In this respect grokky is similar to a grok instance created with g, err := grok.NewWithConfig(&grok.Config{NamedCapturesOnly: true}).

  2. grokky prefers the top named group. If you have two patterns, and the second one is nested in the first and has a named group with the same name, then the named group of the first pattern is used. grok uses the last group (the one closest to the tail) in all cases, but it also has a ParseToMultiMap method. To see the difference explained, get the package (using go get -t) and run go test -v -run the_difference github.com/logrusorgru/grokky, or check out the source code of the test.

  3. grokky was designed as a factory of patterns, i.e. compile once and use many times.

Get it

go get -u -t github.com/logrusorgru/grokky

Run test case

go test github.com/logrusorgru/grokky

Run benchmark comparison with vjeantet/grok

go test -bench=.* github.com/logrusorgru/grokky

Example

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/logrusorgru/grokky"
)

func createHost() grokky.Host {
	h := grokky.New()
	// add patterns to the Host
	h.Must("YEAR", `(?:\d\d){1,2}`)
	h.Must("MONTHNUM2", `0[1-9]|1[0-2]`)
	h.Must("MONTHDAY", `(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]`)
	h.Must("HOUR", `2[0123]|[01]?[0-9]`)
	h.Must("MINUTE", `[0-5][0-9]`)
	h.Must("SECOND", `(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?`)
	// time.RFC3339 renders the zone either as "Z" or as an offset like "+07:00"
	h.Must("TIMEZONE", `Z|[+-]%{HOUR}:%{MINUTE}`)
	h.Must("DATE", "%{YEAR:year}-%{MONTHNUM2:month}-%{MONTHDAY:day}")
	h.Must("TIME", "%{HOUR:hour}:%{MINUTE:min}:%{SECOND:sec}")
	return h
}

func main() {
	h := createHost()
	// compile the pattern for RFC3339 time
	p, err := h.Compile("%{DATE:date}T%{TIME:time}%{TIMEZONE:tz}")
	if err != nil {
		log.Fatal(err)
	}
	for k, v := range p.Parse(time.Now().Format(time.RFC3339)) {
		fmt.Printf("%s: %v\n", k, v)
	}
	//
	// Yes, it's better to use time.Parse for time values,
	// but this is just an example.
	//
}
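To make the %{NAME:alias} syntax in the example less magical, here is a minimal sketch of how such references could be expanded into a plain RE2 expression. It uses only the standard library; the host type and its add and expand methods are hypothetical and are not grokky's implementation. The sketch assumes definitions are expanded when they are added, so later lookups are a plain map access.

```go
package main

import (
	"fmt"
	"regexp"
)

// ref matches a reference of the form %{NAME} or %{NAME:alias}.
var ref = regexp.MustCompile(`%\{(\w+)(?::(\w+))?\}`)

// host is a hypothetical pattern registry. It stores definitions with all
// references already expanded.
type host map[string]string

// expand replaces every reference in expr with its stored definition.
// %{NAME:alias} becomes a named group, %{NAME} a non-capturing group.
func (h host) expand(expr string) (string, error) {
	var firstErr error
	out := ref.ReplaceAllStringFunc(expr, func(m string) string {
		parts := ref.FindStringSubmatch(m)
		def, ok := h[parts[1]]
		if !ok {
			if firstErr == nil {
				firstErr = fmt.Errorf("unknown pattern %q", parts[1])
			}
			return m
		}
		if parts[2] != "" {
			return "(?P<" + parts[2] + ">" + def + ")"
		}
		return "(?:" + def + ")"
	})
	return out, firstErr
}

// add expands expr and stores the result under name.
func (h host) add(name, expr string) error {
	def, err := h.expand(expr)
	if err != nil {
		return err
	}
	h[name] = def
	return nil
}

func main() {
	h := host{}
	for _, p := range [][2]string{
		{"HOUR", `2[0123]|[01]?[0-9]`},
		{"MINUTE", `[0-5][0-9]`},
		{"TIME", `%{HOUR:hour}:%{MINUTE:min}`},
	} {
		if err := h.add(p[0], p[1]); err != nil {
			panic(err)
		}
	}
	expr, err := h.expand(`%{TIME:time}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(expr)

	re := regexp.MustCompile(expr)
	m := re.FindStringSubmatch("23:59")
	for i, name := range re.SubexpNames() {
		if name != "" {
			fmt.Printf("%s: %s\n", name, m[i])
		}
	}
}
```

Wrapping each substituted definition in a group keeps a top-level alternation such as the one in HOUR from leaking into the surrounding expression.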

Performance note

Don't complicate regular expressions. Use the simplest regular expressions possible. Here is an example for the Nginx access log, combined format:

h := New()

h.Must("NSS", `[^\s]*`) // not a space *
h.Must("NS", `[^\s]+`)  // not a space +
h.Must("NLB", `[^\]]+`) // not a right bracket +
h.Must("NQS", `[^"]*`)  // not a double quote *
h.Must("NQ", `[^"]+`)   // not a double quote +

h.Must("nginx", `%{NS:remote_addr}\s\-\s`+
	`%{NSS:remote_user}\s*\-\s\[`+
	`%{NLB:time_local}\]\s\"`+
	`%{NQ:request}\"\s`+
	`%{NS:status}\s`+
	`%{NS:body_bytes_sent}\s\"`+
	`%{NQ:http_referer}\"\s\"`+
	`%{NQ:user_agent}\"`)

nginx, err := h.Get("nginx")
if err != nil {
	panic(err)
}

for logLine := range catLogFileLineByLineChannel {
	values := nginx.Parse(logLine)

	// stuff

}

Or here is another version (thanks to @nanjj):

h := New()

h.Must("NSS", `[^\s]*`) // not a space *
h.Must("NS", `[^\s]+`)  // not a space +
h.Must("NLB", `[^\]]+`) // not a right bracket +
h.Must("NQS", `[^"]*`)  // not a double quote *
h.Must("NQ", `[^"]+`)   // not a double quote +
h.Must("A", `.*`)       // all (get tail)

h.Must("nginx", `%{NS:clientip}\s%{NSS:ident}\s%{NSS:auth}`+
	`\s\[`+
	`%{NLB:timestamp}\]\s\"`+
	`%{NS:verb}\s`+
	`%{NSS:request}\s`+
	`HTTP/%{NS:httpversion}\"\s`+
	`%{NS:response}\s`+
	`%{NS:bytes}\s\"`+
	`%{NQ:referrer}\"\s\"`+
	`%{NQ:agent}\"`+
	`%{A:blob}`)

// [...]
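To see what the second pattern captures, the sketch below expands its references by hand into a plain regexp and applies it to one log line. It stands in for the grokky version and uses only the standard library; the sample line and the parseLine helper are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// nginxRe is the second pattern with every %{NAME:alias} reference expanded
// by hand. It is compiled once and reused for every line.
var nginxRe = regexp.MustCompile(
	`(?P<clientip>[^\s]+)\s(?P<ident>[^\s]*)\s(?P<auth>[^\s]*)` +
		`\s\[` +
		`(?P<timestamp>[^\]]+)\]\s"` +
		`(?P<verb>[^\s]+)\s` +
		`(?P<request>[^\s]*)\s` +
		`HTTP/(?P<httpversion>[^\s]+)"\s` +
		`(?P<response>[^\s]+)\s` +
		`(?P<bytes>[^\s]+)\s"` +
		`(?P<referrer>[^"]+)"\s"` +
		`(?P<agent>[^"]+)"` +
		`(?P<blob>.*)`)

// parseLine returns the named captures for one log line, or nil when the
// line does not match.
func parseLine(line string) map[string]string {
	m := nginxRe.FindStringSubmatch(line)
	if m == nil {
		return nil
	}
	out := make(map[string]string, len(m)-1)
	for i, name := range nginxRe.SubexpNames() {
		if name != "" {
			out[name] = m[i]
		}
	}
	return out
}

func main() {
	// A hypothetical access log line in the combined format.
	line := `127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ` +
		`"GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "curl/8.0"`
	v := parseLine(line)
	fmt.Println(v["clientip"], v["verb"], v["request"], v["response"])
}
```

Every field is a negated character class with a single delimiter after it, so no field can run past its own delimiter.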

More performance

Since grokky.Pattern embeds regexp.Regexp, you can call any method of regexp.Regexp on it. For example, you can use FindStringSubmatch instead of (grokky.Pattern).Parse.

Check out Benchmark_parse_vs_findStringSubmatch for example.

On my machine the result of this benchmark is (the map is Parse, and the slice is FindStringSubmatch):

map-4      200000    9980 ns/op    1370 B/op    5 allocs/op
slice-4    200000    7508 ns/op     416 B/op    2 allocs/op
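The slice approach can be sketched as follows. This example uses a plain *regexp.Regexp in place of a grokky.Pattern, and the timeRe pattern and hourMinute helper are hypothetical. It assumes Go 1.15 or newer for SubexpIndex.

```go
package main

import (
	"fmt"
	"regexp"
)

// timeRe is compiled once at package level and reused.
var timeRe = regexp.MustCompile(`(?P<hour>\d{2}):(?P<minute>\d{2})`)

// The group indexes are resolved once, so each call only indexes a slice
// instead of building a map.
var (
	hourIdx   = timeRe.SubexpIndex("hour")
	minuteIdx = timeRe.SubexpIndex("minute")
)

// hourMinute extracts both fields from the first match in s.
func hourMinute(s string) (hour, minute string, ok bool) {
	m := timeRe.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	return m[hourIdx], m[minuteIdx], true
}

func main() {
	h, m, ok := hourMinute("backup started at 09:41")
	fmt.Println(h, m, ok)
}
```

The trade-off is readability: the caller has to track group indexes, which is why a map-returning Parse is more convenient when the extra allocations don't matter.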

Licensing

Copyright © 2016-2018 Konstantin Ivanov
This work is free. It comes without any warranty, to the extent permitted by applicable law. You can redistribute it and/or modify it under the terms of the Do What The Fuck You Want To Public License, Version 2, as published by Sam Hocevar. See the LICENSE file for more details.

grokky's People

Contributors

logrusorgru, mpenick

grokky's Issues

Slightly different base patterns

I'm looking to use this package to port over some Logstash functionality into a new Go codebase, but when I copy over my grok patterns, they don't seem to be matching the same way. I checked, and it looks like the base patterns in this repo are slightly different from Logstash's implementation of it.

For example, here is this repo's IPV4 pattern:

(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))

And here's Logstash's:

(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

The result of this is that when an IP is matched, it isn't consumed from the string which through. I didn't dig into why the pattern behaves that way (since it's pretty complicated), but either way, it seems like undesired behavior.

If I put up a PR to switch everything to Logstash's implementation of the patterns, would that be merged in? What do you think?
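One relevant constraint here is that Go's regexp package (RE2 syntax) has no lookbehind or lookahead, so the (?<![0-9]) and (?![0-9]) guards in the Logstash pattern cannot be compiled as they are. The sketch below checks that and shows one possible workaround that is not taken from this repository: wrap the IPV4 pattern in explicit non-digit guards and read the address from a capture group. The names findIP and lookaroundCompiles are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// octet and ipv4 reproduce this repo's IPV4 pattern quoted above.
const (
	octet = `(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])`
	ipv4  = `(?:` + octet + `[.]` + octet + `[.]` + octet + `[.]` + octet + `)`
)

// guarded emulates the digit guards without lookaround: the neighbors must be
// non-digits or the string boundary, and group 1 holds the address itself.
var guarded = regexp.MustCompile(`(?:^|[^0-9])(` + ipv4 + `)(?:[^0-9]|$)`)

// findIP returns the first address in s that is not glued to other digits.
func findIP(s string) (string, bool) {
	m := guarded.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// lookaroundCompiles reports whether the Logstash form compiles in Go.
func lookaroundCompiles() bool {
	_, err := regexp.Compile(`(?<![0-9])` + ipv4 + `(?![0-9])`)
	return err == nil
}

func main() {
	fmt.Println(lookaroundCompiles())
	fmt.Println(findIP("addr=192.168.1.10;"))
	fmt.Println(findIP("9192.168.1.10"))
}
```

Unlike real lookaround, the guards consume the neighboring characters, so in a larger pattern the delimiters around the address need to be accounted for.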
