Comments (10)
Super cool project, we've recently deployed this in our clusters and it's proving to be super useful.
Glad we could help !
I don't see why not - it would put some extra strain on Prometheus with the extra time series, but it would indeed make for better graphs and alerts.
👍
from goldpinger.
wdyt @seeker89?
I went through the code yesterday to see if I could work out how to implement this, but to be honest I got a bit lost. It seems the podIPs (and hostIPs) are used as primary identifiers almost everywhere and passed around in ad-hoc maps, so I suppose we'd need to define a struct for that and use it throughout the code?
Generally, I think it would be great if we could substitute pod IPs with pod names and host IPs with host names (probably keeping the IP data as well though, but make it secondary) because having just the IP addresses displayed is rather confusing.
It would be (at least from my point of view) so much more human-friendly if the metrics and the UI (graph, heatmap, etc.) referenced the pod names and host (node) names instead of plain IPs.
Sorry if I'm going out of scope here.
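For illustration, the struct idea above could look roughly like the following. This is only a sketch, not goldpinger's actual code: the `Peer` type, its fields, and its methods are all hypothetical names, and the `pod_name` and `host_name` labels are assumptions (only `pod_ip` and `host_ip` are mentioned in this thread as existing labels).

```go
package main

import "fmt"

// Peer is a hypothetical identifier for a probed pod. Names are the
// primary, human-friendly identifiers; IPs are kept as secondary data.
type Peer struct {
	PodName  string
	PodIP    string
	HostName string
	HostIP   string
}

// DisplayName returns the pod name, falling back to the pod IP when
// the name is unknown.
func (p Peer) DisplayName() string {
	if p.PodName != "" {
		return p.PodName
	}
	return p.PodIP
}

// HostDisplayName returns the host (node) name, falling back to the
// host IP when the name is unknown.
func (p Peer) HostDisplayName() string {
	if p.HostName != "" {
		return p.HostName
	}
	return p.HostIP
}

// Labels builds a label set for metrics. pod_name and host_name are
// assumed label names, not ones goldpinger is known to expose.
func (p Peer) Labels() map[string]string {
	return map[string]string{
		"pod_name":  p.PodName,
		"pod_ip":    p.PodIP,
		"host_name": p.HostName,
		"host_ip":   p.HostIP,
	}
}

func main() {
	p := Peer{
		PodName:  "goldpinger-abcde",
		PodIP:    "10.0.0.5",
		HostName: "node-1",
		HostIP:   "192.168.1.10",
	}
	fmt.Printf("%s on %s\n", p.DisplayName(), p.HostDisplayName())
}
```

Passing a single value like this around instead of bare IP strings would let the UI, logs, and metrics all pick whichever identifier suits them.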
👍
Looking at the source, it seems the code only cares about the HostIP of ping targets (at least in the Prometheus metrics). I believe it would be more useful when debugging issues if we could somehow give more context about the target nodes.
In the monitoring dashboard/alerting, instead of this
"node A and B are reporting that some other unknown nodes are down"
this would be better
"node X and Y are reported down by 2 nodes"
Hey @seeker89 😄
Do you think this is something you would consider implementing in the near future? We particularly feel that this is missing for alerting, where we know some node(s) are unhealthy but can't pinpoint which ones from the metrics.
Hey @dannyk81 !
Now that we've merged #53, I think we can add some more context to the metrics.
So, just to make sure I understand what your use case is - what extra labels on which metrics do you need ?
Great news!
I'll try and summarize our current experience:
- goldpinger_peers_response_time_* - these metrics have the pod_ip and host_ip labels of the pod being probed, but not the host_name.
- goldpinger_nodes_health_total - this metric is useful, but currently very difficult to use for alerting and troubleshooting: it lets us alert when nodes are misbehaving, but gives no way to identify which node. Right now we have an alert firing in one of our clusters, but we can't deduce from the metrics which node is actually misbehaving; we just know that one node is unhealthy (this relates to @akhy's comment, I believe).
/edit: I think it would also be great if the host name can be used on Goldpinger's UI and log instead of the host IP, it would make things much easier when trying to troubleshoot things.
Some good points here. I'll have a look when I get some free bandwidth. 👍
@seeker89 just curious if you had a chance to look into this? 🙏
Sorry, not yet. I did set aside some time for goldpinger next week, so I might be able to get into this.
An example latency check that works well!
- alert: goldpinger-node-latency
  expr: |
    sum (rate(goldpinger_peers_response_time_s_sum{call_type="ping"}[1m]))
    by (goldpinger_instance, host_ip, pod) > 0.040
  for: 1m
  annotations:
    description: |
      Goldpinger pod {{ "{{ $labels.pod }}" }} on node {{ "{{ $labels.goldpinger_instance }}" }} cannot reach remote node at IP {{ "{{ $labels.host_ip }}" }} in less than 40 ms!
    summary: Node {{ "{{ $labels.host_ip }}" }} likely fubar. Overlay network latency should be less than 40ms!
  labels:
    severity: critical