
Comments (17)

jjneely commented on August 16, 2024

Dave,

That is exactly my use case. Provided that you invoke statsrelay with the same list of hash ring members in the same order on each machine you will get identical behavior from each daemon. So no matter what daemon handles the stat it will always go back to the same StatsD.

Jack

jimbrowne commented on August 16, 2024

@jjneely you might want to put a version of what you wrote above in the readme.

justdaver commented on August 16, 2024

Thanks Jack, this is exactly what I've been searching for. I'm going to run a few tests today to see how far I can push it and will report back with some results.

This may be unrelated, but if I configure statsrelay with 2 backend hosts, say A and B, and A goes offline, what happens to the metrics which were meant to go to A? Are they queued up, or would they be directed to the next backend host B?

jjneely commented on August 16, 2024

Benchmarks / feedback would be fantastic. I've had fairly reasonable performance at 200,000 metrics/sec through one statsrelay daemon using a load generator:

https://github.com/octo/statsd-tg

The UDP protocol that StatsD uses is specifically designed not to have any feedback. So just using that protocol, statsrelay can't tell if the statsd daemon it's talking to is alive. (Nagios steps in here for me.) StatsD does have a TCP port for some administrative commands -- but I imagine the best use of that port is to monitor the process remotely (or, in this case, from statsrelay). That should be its own issue, I think, as that code hasn't been written yet.

In short, the current code base can't tell if the StatsD daemons are alive and will just continue to send traffic in their direction.
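
A minimal sketch of what such a remote liveness check could look like, assuming etsy/statsd's management interface on TCP port 8126 and its "health" admin command (the host addresses are placeholders, and nothing like this exists in statsrelay yet):

package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// checkStatsd connects to a statsd management port (TCP 8126 by default in
// etsy/statsd), sends the "health" admin command, and reports whether the
// daemon answered "health: up". A sketch only, not statsrelay code.
func checkStatsd(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(2 * time.Second))
	if _, err := fmt.Fprintf(conn, "health\n"); err != nil {
		return false
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false
	}
	return strings.Contains(reply, "up")
}

func main() {
	for _, addr := range []string{"statsd1.example.com:8126", "statsd2.example.com:8126"} {
		fmt.Printf("%s alive: %v\n", addr, checkStatsd(addr))
	}
}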

jjneely commented on August 16, 2024

Issue #2 created.

justdaver commented on August 16, 2024

Jack - my apologies for the late reply, unfortunately I was sidetracked with work and studies just as I was really getting into testing this. Thanks for the heads up about statsd-tg, works great. I've run into a few small issues with our tests but will send feedback as soon as I can trust our test results :)

jjneely commented on August 16, 2024

Excellent! I look forward to your feedback.

justdaver commented on August 16, 2024

OK, so far the tests we're running are looking great. I'm running multiple statsrelay daemons behind an LVS load balancer. The statsrelay daemons are each running with the same options, e.g.:

statsrelay -bind=10.0.0.1 -port=12001 -verbose -prefix=statsrelayX 10.0.0.10:12000 10.0.0.20:12000

On the statsd servers (statsd1/10.0.0.10 and statsd2/10.0.0.20) I can confirm metrics are going to both, and neither is receiving the other's metric namespaces - brilliant :)

One thing I have noticed, however, is that the distribution of metrics to each backend statsd server is not what I expected, most likely because I am still trying to get my head around how consistent hashing works. Would running statsrelay with the options above distribute, say, 50% of metrics to statsd1 and the other 50% to statsd2? From my tests I'm seeing roughly 20% of metrics hit statsd1 and the remaining 80% hit statsd2.

How would I run statsrelay to get a 50/50 split of metrics to the backend statsd hosts?

Thanks!

jjneely commented on August 16, 2024

Very carefully choosing the hostnames of the statsd locations is the only way to get an even balance across 2 statsd daemons.

The hash ring works by setting up a ring buffer with 64k (or some other power of 2) slots. Each host/port is hashed to figure out where on the ring it lives. When a metric name comes in, that name is also hashed to find where it would fall in the ring buffer. Then the algorithm finds the closest slot with a host/port and the metric is sent to that host.

So most likely the two slots that represent the hosts in the hash ring are fairly close together which gives you an uneven distribution. The more statsd hosts you have in the ring the more even the distribution will be.
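
To make the ring mechanics concrete, here is a small self-contained Go sketch of that kind of hash ring. It is not statsrelay's actual code (the FNV hash and the details are assumptions), but it shows why two backends can easily end up owning very unequal arcs of the ring:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// slot hashes a string onto a ring of 64k slots. FNV-1a stands in for
// whatever hash statsrelay really uses.
func slot(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32() % 65536
}

func main() {
	hosts := []string{"10.0.0.10:12000", "10.0.0.20:12000"}

	// Place each host on the ring at the slot its host:port hashes to.
	ring := map[uint32]string{}
	slots := []uint32{}
	for _, h := range hosts {
		s := slot(h)
		ring[s] = h
		slots = append(slots, s)
	}
	sort.Slice(slots, func(i, j int) bool { return slots[i] < slots[j] })

	// Route metric names: each name hashes to a slot, and the metric goes to
	// the next host slot at or after it, wrapping around the ring.
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		s := slot(fmt.Sprintf("app.server%d.requests", i))
		owner := ring[slots[0]] // wrap-around default
		for _, hs := range slots {
			if s <= hs {
				owner = ring[hs]
				break
			}
		}
		counts[owner]++
	}
	// The split depends entirely on where the two host names happened to land.
	fmt.Println(counts)
}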

I run several statsrelay daemons under LVS and the ring includes 18 statsd daemons. At last check I'm doing about 500,000 statsd metrics per second with room to spare. Although I haven't looked closely at how even the distribution is... I normally track the CPU usage of the nodejs code.

justdaver commented on August 16, 2024

Thanks Jack, much appreciated. I managed to get an almost even balance by altering the hostnames a bit, as you mentioned.

At the moment I'm running 6 statsrelay daemons across 2 servers, each doing about 45000 metrics a second (270k/sec total). I'm a bit confused about the numbers reported by the statsProcessed metric statsrelay reports - I'm seeing figures of about 2.3-2.8 million for statsProcessed. Is that total metrics processed per minute?

jjneely commented on August 16, 2024

StatsRelay sends a counter metric into Statsd to report metrics processed. If your Statsd flush interval is set to 60 seconds then that's probably right. You are seeing metrics processed per minute. This is what I do in Graphite:

alias(sum(1min.statsd.prod.statsrelay.graphite*.statsProcessed.rate), "Metrics Processed Per Second")
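
(Rough arithmetic check against the numbers above: 2.3-2.8 million metrics per 60-second flush works out to roughly 38,000-47,000 metrics/sec per daemon, which lines up with the ~45k/s per daemon figure mentioned earlier.)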

justdaver commented on August 16, 2024

Ah Ok that makes sense - thanks for clearing that up :)

Here are my final test results & info.

Test environment:
[test environment diagram]

Fired off some test metrics using statsd-tg (https://github.com/octo/statsd-tg) with the following options:

statsd-tg -d 10.0.0.1 -D 12000 -T 2 -s 0 -c 4000 -t 0 -g 400
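# (Flags above, as I understand statsd-tg: -d/-D are the destination host/port, -T the
# number of sender threads, and -s/-c/-t/-g the number of distinct sets, counters,
# timers and gauges to generate.)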

I'm making use of 2 load balancers: a Citrix NetScaler, and LVS on RHEL 7. Metrics first hit the NetScaler, which evenly distributes them to 2 RHEL 7 servers (server 1 & 2). The VS on the NetScaler is configured as roundrobin / sessionless / persistence:none.

The LVS configuration of servers 1 & 2 is (shown for server 1):

ipvsadm -A -u 10.0.0.2:12000 -s rr -o
ipvsadm -a -u 10.0.0.2:12000 -r 10.0.0.2:12001 -m
ipvsadm -a -u 10.0.0.2:12000 -r 10.0.0.2:12002 -m
ipvsadm -a -u 10.0.0.2:12000 -r 10.0.0.2:12003 -m
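# (Above: -A/-u add a UDP virtual service, -s rr selects round-robin, and -o enables
# one-packet scheduling so each UDP datagram is scheduled independently; -a/-r add
# each local statsrelay instance as a real server, with -m for masquerading/NAT.)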

Metrics are evenly passed to 3 statsrelay daemons using roundrobin on ports 12001/2/3 UDP.

There are 3 statsrelay daemons running with the following options (server 1):

statsrelay -bind=10.0.0.2 -port=12001 -prefix="server1.statsrelay1" 10.0.0.10:12000:100 10.0.0.20:12000:400
statsrelay -bind=10.0.0.2 -port=12002 -prefix="server1.statsrelay2" 10.0.0.10:12000:100 10.0.0.20:12000:400
statsrelay -bind=10.0.0.2 -port=12003 -prefix="server1.statsrelay3" 10.0.0.10:12000:100 10.0.0.20:12000:400

I tested a few different instance values when specifying the statsd hosts and eventually settled on :100 and :400. This seemed to give me the closest to a 50/50 distribution of metrics - more like 40/60, but that's good enough for me.

For the statsd servers (server 3 & 4) I'm running statsdnet (https://github.com/lukevenediger/statsd.net).

The statsrelay servers (server 1 & 2) are VMs running RHEL 7 with 4 GB memory and 4 CPUs. The statsd servers (server 3 & 4) are running Win Server 2012 R2 Std with 4 GB memory and 2 CPUs.

Combined metrics processed p/sec by both server 1 & 2 (all statsrelay daemons): 270k/s (135k/s each)
Avg metrics processed p/sec by each statsrelay daemon: 45k/s
Load Avg for statsrelay servers: 2-2.5 (memory usage is so low it's not worth mentioning)

I'm also using CollectD to report some basic UDP stats for LVS on the statsrelay servers (the ipvs module is awesome btw) and can see an average of about 3000 UDP InDatagrams/sec and 200 UDP InErrors/sec reported. I'm still investigating the cause of the errors and am not sure yet if this is something I should be really worried about.

All in all statsrelay is doing a great job, thanks again Jack :)

justdaver commented on August 16, 2024

One thing I forgot to mention: since there is no built-in fault tolerance in statsrelay, I put together a basic bash script which I run every 10 minutes to check if the statsd hosts are up. If one of them goes down and stops responding, the script kills all statsrelay processes and fires them up again leaving out the failed statsd host, so that no metrics are sent to it. So far this works, but it's definitely an area which could be improved in many ways.

denen99 commented on August 16, 2024

Out of curiosity, @jjneely, what's the advantage of running multiple statsrelay processes on a single server?

szibis commented on August 16, 2024

I get up to 300k metrics/sec on a single statsrelay in my load tests inside AWS. Statsite can handle ~700k metrics per second.

We are starting to use it with Kubernetes after some improvements, and the architecture looks like this:

[screenshot: Kubernetes-based architecture diagram]

Some older overview based on EC2 instances:

[screenshot: older EC2-based architecture overview]

jjneely commented on August 16, 2024

I should use some of these awesome diagrams for the actual documentation! Thanks!

300k/s matches my own benchmarks. I've run it hotter, but the UDP drops start to become unmanageable, and that's what I'd like to avoid. I have pondered a rewrite in C which could handle more throughput (similar to statsite), but I haven't quite gotten around to my next round of Statsd scaling.

szibis commented on August 16, 2024

It was about a year ago that I evaluated the idea of relaying statsd traffic to a central place, to reduce CPU usage on the EC2 instances, improve percentile/histogram calculation, and simplify maintenance.

I think 200k-300k is high enough, because in most cases statsrelay will live on a single instance/k8s Pod alongside a single app, and I think 99% of cases will generate at most tens of thousands of metrics per second; statsrelay is perfect there. Multiple statsrelays, and only a couple of statsites/telegrafs at the end.

The main feature we need now is buffering in memory and/or on disk, with backfill when the remote recovers. It is discussed in #2.

Maybe soon, when I get some spare time, I will try to run a new series of tests.
