jjneely / statsrelay
A Golang consistent hashing proxy for statsd
License: MIT License
Hey Jack,
Thanks for releasing statsrelay, this is a great tool! I'd like to run multiple statsrelay daemons behind a simple UDP load balancer (LVS in round-robin mode) and was wondering: would the statsrelay daemons be able to share the same hash table? Would they each send the same metrics to the same backend server, or would they each maintain their own hash table?
Thanks!
Dave
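For what it's worth, the answer usually hinges on the hashing being deterministic: if every relay is configured with the identical backend list, each one computes the same metric-to-backend mapping independently, so no shared state is needed. A minimal sketch of that idea (plain fnv1a-mod-N for illustration, not statsrelay's actual ring algorithm):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend maps a metric name onto one of the configured backends.
// Because the result is a pure function of the name and the backend
// list, any relay with the same configuration computes the same answer.
func pickBackend(metric string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(metric))
	return backends[int(h.Sum32()%uint32(len(backends)))]
}

func main() {
	backends := []string{"10.0.0.10:12000", "10.0.0.20:12000"}
	// Two "independent relays" with identical config agree on every metric.
	for _, m := range []string{"a.b.c", "foo.bar.bob", "foo.bar.john"} {
		relay1 := pickBackend(m, backends)
		relay2 := pickBackend(m, backends)
		fmt.Printf("%s -> %s (relays agree: %v)\n", m, relay1, relay1 == relay2)
	}
}
```

The flip side is that the mapping only stays identical while the backend lists are identical; a config change rolled out to some relays but not others would split traffic for the same metric name.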
I am running statsrelay as a repeater daemon on my application boxes (EC2 m4.large) and forwarding the UDP packets to a group of boxes listening behind an AWS Network Load Balancer (UDP). This opens around 50K connection flows on the NLB. The count of metrics published by the application boxes is around 10K/second. What is the reason behind opening such a high number of flows?
Hey Guys,
I'm facing a bit of a strange problem at the moment: a statsrelay process is crashing at one of our sites for some reason. I've run the daemon in verbose mode, and this is the error message just after it crashes.
Could this be the result of a malformed metric name being received?
2015/10/27 12:15:01 Procssed 11261 metrics. Running total: 4006453. Metrics/sec: 14411
panic: runtime error: slice bounds out of range
goroutine 384 [running]:
runtime.panic(0x53b7c0, 0x624c6f)
/usr/lib/golang/src/pkg/runtime/panic.c:279 +0xf5
main.getMetricName(0xc208312cbb, 0x30, 0xbb345, 0x0, 0x0)
/admin/golang/statsrelay/statsrelay.go:79 +0xa6
main.handleBuff(0xc2082ce000, 0xcceb5, 0x100000)
/admin/golang/statsrelay/statsrelay.go:144 +0x9d8
created by main.runServer
/admin/golang/statsrelay/statsrelay.go:281 +0x1db
goroutine 16 [select]:
main.runServer(0x7fff8c1a06b7, 0xc, 0x2ee1)
/admin/golang/statsrelay/statsrelay.go:278 +0x2db
main.main()
/admin/golang/statsrelay/statsrelay.go:346 +0x694
goroutine 19 [finalizer wait, 4 minutes]:
runtime.park(0x416580, 0x6283d8, 0x626f09)
/usr/lib/golang/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x6283d8, 0x626f09)
/usr/lib/golang/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
/usr/lib/golang/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
/usr/lib/golang/src/pkg/runtime/proc.c:1445
goroutine 20 [syscall, 4 minutes]:
os/signal.loop()
/usr/lib/golang/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/usr/lib/golang/src/pkg/os/signal/signal_unix.go:27 +0x32
goroutine 21 [IO wait]:
net.runtime_pollWait(0x7fd668027730, 0x72, 0x0)
/usr/lib/golang/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc20802c220, 0x72, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc20802c220, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).Read(0xc20802c1c0, 0xc20846fd3f, 0xfc2c1, 0xfc2c1, 0x0, 0x7fd6680262b8, 0xb)
/usr/lib/golang/src/pkg/net/fd_unix.go:242 +0x34c
net.(*conn).Read(0xc20803c028, 0xc20846fd3f, 0xfc2c1, 0xfc2c1, 0x4d, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/net.go:122 +0xe7
main.readUDP(0x7fff8c1a06b7, 0xc, 0x2ee1, 0xc208070000)
/admin/golang/statsrelay/statsrelay.go:240 +0x6fa
created by main.runServer
/admin/golang/statsrelay/statsrelay.go:275 +0x132
The current code base does nothing to detect or react to StatsD daemons that are not alive. The UDP StatsD protocol is designed to be fire-and-forget and offers no way to detect if the other side has received the packet.
StatsD daemons have a TCP administrative interface that's probably very useful for checking if the process is alive. That may be of help with this issue.
Things to think about:
Hi,
We are having an issue with statsrelay and how it distributes metrics between statsite hosts. We run 4 statsrelay servers and 5 statsite servers at the backend. All are configured identically.
It appears that one statsite node is always getting a higher amount of metrics; look at node is-1941b:
We tried to add INSTANCE values (in 'HOST:PORT:INSTANCE'), but it did not have any visible effect.
Our current idea is that the consistent hashing is not distributing metrics efficiently.
We sometimes use very long metric names, for example:
env.analyze.document_analyzer.analyzeDocument.entities.InternetDomainName.source.DataProviderNameHERE.DocumentNameHashHere.occurrences
In our case, it looks like all metric names prefixed with env.analyze.document_analyzer.analyzeDocument.entities are forwarded to the same single statsite backend. That is not good; we would expect those metric names to be distributed between different statsite backends.
Any ideas how to solve this?
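With a reasonable hash function, a long shared prefix alone should not pin names to one backend, because the entire name is hashed, not just the prefix. A quick way to sanity-check how full names that differ only after the prefix spread out (a simple fnv1a-mod-5 sketch, not statsrelay's actual ring; the suffixes are made-up placeholders):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket hashes the full metric name and reduces it to one of n slots.
func bucket(metric string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(metric))
	return int(h.Sum32() % uint32(n))
}

func main() {
	prefix := "env.analyze.document_analyzer.analyzeDocument.entities."
	for _, suffix := range []string{
		"InternetDomainName.source.providerA.hash1.occurrences",
		"InternetDomainName.source.providerB.hash2.occurrences",
		"InternetDomainName.source.providerC.hash3.occurrences",
	} {
		name := prefix + suffix
		fmt.Printf("bucket %d <- %s\n", bucket(name, 5), name)
	}
}
```

If names like these really do all land on one statsite in production, the skew more likely comes from the ring construction (e.g. too few virtual nodes per backend) than from the metric names themselves.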
I have started testing statsrelay in our environment by using the statsd repeater. One thing I noticed is a difference in the metrics received in Graphite vs. the statsd proxy.
statsd repeater -> statsrelay -> statsd -> graphite (currently using one statsrelay and one statsd)
And the difference is huge when we add more statsd backends.
The same graphs work fine when I replace statsrelay with statsd: [statsd repeater -> statsd -> graphite]
Any thoughts on this, @jjneely @szibis?
Hi Guys,
For some reason I've noticed that the statsrelay processes for one of our sites are crashing every 10-20 minutes. This is the error:
panic: runtime error: index out of range
goroutine 6 [running]:
main.readUDP(0x7fffc2e446b9, 0xc, 0x2ee1, 0xc2080640c0)
/admin/downloads/statsrelay/latest/golang/statsrelay/statsrelay.go:276 +0x9a2
created by main.runServer
/admin/downloads/statsrelay/latest/golang/statsrelay/statsrelay.go:309 +0x174
goroutine 1 [select]:
main.runServer(0x7fffc2e446b9, 0xc, 0x2ee1)
/admin/downloads/statsrelay/latest/golang/statsrelay/statsrelay.go:312 +0x396
main.main()
/admin/downloads/statsrelay/latest/golang/statsrelay/statsrelay.go:382 +0x762
goroutine 5 [syscall, 3 minutes]:
os/signal.loop()
/usr/lib/golang/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/lib/golang/src/os/signal/signal_unix.go:27 +0x35
Has anyone come across this problem before?
Thanks,
Dave
I have the following architecture in-place:
statsd metrics -> statsrelay -> x2 gostatsd -> x2 go-carbon
As far as I understand, statsrelay uses the fnv1a hashing algorithm for its hash ring, but the buckytools inconsistent command shows inconsistency with all supported algorithms (carbon, fnv1a, jump_fnv1a).
$ statsrelay -p=8125 10.0.0.1:18125:node-1 10.0.0.2:18125:node-2
$ echo "foo.bar.bob:1|c" | nc -w1 -u 10.0.0.1 8125 <-- landed on node-2
$ echo "foo.bar.john:1|c" | nc -w1 -u 10.0.0.1 8125 <-- landed on node-1
$ buckyd -node 10.0.0.1 -hash fnv1a 10.0.0.1:2003=node-1 10.0.0.2:2003=node-2
$ buckyd -node 10.0.0.2 -hash fnv1a 10.0.0.1:2003=node-1 10.0.0.2:2003=node-2
$ bucky inconsistent -f | grep foo
2020/09/02 17:42:35 10.0.0.1:4242 returned 110656 metrics
2020/09/02 17:42:35 10.0.0.2:4242 returned 109531 metrics
2020/09/02 17:42:35 Hashing...
2020/09/02 17:42:35 Hashing time was: 0s
2020/09/02 17:42:35 36404 inconsistent metrics found on 10.0.0.1:4242
2020/09/02 17:42:35 35111 inconsistent metrics found on 10.0.0.2:4242
10.0.0.1:4242: stats.foo.bar.john
10.0.0.1:4242: stats_counts.foo.bar.john
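One plausible source of the divergence (an assumption, not confirmed from either codebase): statsrelay hashes the raw name as sent on the wire ("foo.bar.john"), while by the time the metric reaches carbon, statsd has prepended "stats." / "stats_counts." — which is exactly what the bucky output above shows — so buckyd is hashing a different string than statsrelay did. Different inputs generally hash differently:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fnv1a32 returns the 32-bit FNV-1a hash of s.
func fnv1a32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	raw := "foo.bar.john"          // what statsrelay sees on its UDP input
	stored := "stats.foo.bar.john" // what statsd/graphite actually store
	fmt.Printf("fnv1a(%q) = %d\n", raw, fnv1a32(raw))
	fmt.Printf("fnv1a(%q) = %d\n", stored, fnv1a32(stored))
}
```

If that is what is happening, no choice of -hash flag on buckyd can reconcile the two rings, because they are keyed on different strings.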
I've run into a strange problem recently which I did not have before, and I am slowly losing my mind trying to get to the bottom of it :)
This is my test environment (using RHEL 7):
I'm running statsrelay with the following command options on serverx (2 backend statsd hosts):
statsrelay -bind=10.0.0.1 -port=12001 -prefix=serverx -verbose=true 10.0.0.10:12000 10.0.0.20:12000
2015/05/12 15:11:59 Listening on 10.0.0.1:12001
2015/05/12 15:11:59 Setting socket read buffer size to: 212992
2015/05/12 15:11:59 Rock and Roll!
Then on my test client I'm sending it a statsd metric using netcat like so:
echo "a.b.c:1|c" | nc -w 1 -u 10.0.0.1 12001
As soon as the metric hits serverx on 12001 (UDP), statsrelay crashes and the following appears in the console:
panic: runtime error: slice bounds out of range
goroutine 27 [running]:
runtime.panic(0x53b7c0, 0x624c6f)
/usr/lib/golang/src/pkg/runtime/panic.c:279 +0xf5
main.getMetricName(0xc208380011, 0x0, 0xfffef, 0x0, 0x0)
/home/dave/statsrelay/statsrelay.go:79 +0xa6
main.handleBuff(0xc208380000, 0x11, 0x100000)
/home/dave/statsrelay/statsrelay.go:139 +0x1ea
created by main.runServer
/home/dave/statsrelay/statsrelay.go:276 +0x1db
goroutine 16 [select]:
main.runServer(0x7fffec2bc801, 0x7, 0x2ee1)
/home/dave/statsrelay/statsrelay.go:273 +0x2db
main.main()
/home/dave/statsrelay/statsrelay.go:341 +0x694
goroutine 19 [finalizer wait]:
runtime.park(0x416570, 0x6283d8, 0x626f09)
/usr/lib/golang/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x6283d8, 0x626f09)
/usr/lib/golang/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
/usr/lib/golang/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
/usr/lib/golang/src/pkg/runtime/proc.c:1445
goroutine 20 [syscall]:
os/signal.loop()
/usr/lib/golang/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/usr/lib/golang/src/pkg/os/signal/signal_unix.go:27 +0x32
goroutine 21 [IO wait]:
net.runtime_pollWait(0x7fc7fa7be708, 0x72, 0x0)
/usr/lib/golang/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc20802c060, 0x72, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc20802c060, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).Read(0xc20802c000, 0xc208480000, 0x100000, 0x100000, 0x0, 0x7fc7fa7bd2b8, 0xb)
/usr/lib/golang/src/pkg/net/fd_unix.go:242 +0x34c
net.(*conn).Read(0xc20803a018, 0xc208480000, 0x100000, 0x100000, 0x0, 0x0, 0x0)
/usr/lib/golang/src/pkg/net/net.go:122 +0xe7
main.readUDP(0x7fffec2bc801, 0x7, 0x2ee1, 0xc208054000)
/home/dave/statsrelay/statsrelay.go:235 +0x6fa
created by main.runServer
/home/dave/statsrelay/statsrelay.go:270 +0x132
Any help/pointers would be greatly appreciated! Most likely I've completely missed the boat here and am doing something wrong; I have been stuck on this issue for a couple of hours now!
Thanks!
Dave
From the beginning of statsrelay, there has been a problem with handling messages on the UDP input.
What I observe is that with some clients, messages are sliced into 1024-byte parts, or bigger depending on the buffer size set in the system or in the tool. Example using nc:
cat test | nc -w 1 -u 127.0.0.1 8125
One file with multiple metrics inside, each terminated with \n:
2017/04/19 10:55:29 Buffer from server run
2017/04/19 10:55:29 input buffer debug &jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-re
1024 bytes read from 127.0.0.1:56579
2017/04/19 10:55:29 input buffer debug &quests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:4|g
jetty.handler.put-requests.15MinuteRate:0|g
jetty.handler.put-requests.1MinuteRate:0|g
jetty.handler.put-requests.10MinuteRate:
1024 bytes read from 127.0.0.1:56579
This produces very bad things, like fragments of metric names landing in multiple random locations in graphite. When we add tags, they get sliced at the wrong places and generate multiple garbage metrics.
I found this when I started to use nginx-statsd with statsrelay with a huge number of metrics, and multiple trash metrics started to appear in graphite.
Setting -bufsize is OK, but some apps do not respect it, and setting net.core.rmem_max is not always the right thing and does not appear to solve this problem in all cases.
I think in this scenario it would be better to somehow test whether the whole line is in the buffer, maybe in this place?
https://github.com/jjneely/statsrelay/blob/master/statsrelay.go#L374
@jjneely can you take a look and tell us what you think about this? You can test it using nc with a bigger file containing multiple lines of metrics.
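One possible mitigation (a sketch of the idea, not a patch against the line linked above): only hand complete, newline-terminated lines to the metric parser, and carry any trailing partial line over to the next read:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitComplete returns the portion of buf that ends on a newline
// (safe to parse as whole metric lines) and the trailing partial
// line, which should be prepended to the next read from that source.
func splitComplete(buf []byte) (complete, partial []byte) {
	i := bytes.LastIndexByte(buf, '\n')
	if i < 0 {
		return nil, buf // no complete line in this buffer yet
	}
	return buf[:i+1], buf[i+1:]
}

func main() {
	carry := []byte{}
	// Simulates two reads where the sender sliced a metric mid-name,
	// as in the debug output above.
	reads := [][]byte{
		[]byte("a:1|c\nb:2|c\njetty.handler.put-re"),
		[]byte("quests.15MinuteRate:0|g\n"),
	}
	for _, r := range reads {
		complete, partial := splitComplete(append(carry, r...))
		fmt.Printf("parse: %q  carry: %q\n", complete, partial)
		carry = partial
	}
}
```

This reassembles streams that the sender sliced at arbitrary byte boundaries, but it assumes successive reads from the same peer arrive in order, which holds for a local nc pipe but not for UDP in general; simply detecting and dropping the trailing partial line may be the safer variant.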