
Comments (10)

seeker89 commented on May 11, 2024

Super cool project, we've recently deployed this in our clusters and it's proving to be super useful.

Glad we could help!

I don't see why not - it would put some extra strain on Prometheus with the extra time series, but it would indeed make for better graphs and alerts.

👍

from goldpinger.

dannyk81 commented on May 11, 2024

wdyt @seeker89?

I tried going through the code yesterday to see if I could work out how to implement this, but to be honest I got a bit lost. It seems like the pod IPs (and host IPs) are used as primary identifiers almost everywhere and passed around in ad-hoc maps, so I suppose we'd need to define a struct for that and use it throughout the code?

Generally, I think it would be great if we could substitute pod IPs with pod names and host IPs with host names (probably keeping the IP data as well, but making it secondary), because having just the IP addresses displayed is rather confusing.

It would be (at least from my point of view) so much more human-friendly if the metrics and the UI (graph, heatmap, etc.) referenced the pod names and host (node) names instead of plain IPs.
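A minimal sketch of the struct idea, assuming nothing about Goldpinger's actual internals: the `Peer` type, its fields, and both methods below are hypothetical names for illustration. Only the `pod_ip`, `host_ip`, and `host_name` label names come from this thread; `pod_name` is an assumption.

```go
package main

import "fmt"

// Peer is a hypothetical record that keeps names and IPs together,
// instead of passing bare IP strings around in ad-hoc maps.
type Peer struct {
	PodName  string
	PodIP    string
	HostName string
	HostIP   string
}

// DisplayHost returns the host name when it is known and falls back
// to the host IP otherwise, so the IP stays available as secondary data.
func (p Peer) DisplayHost() string {
	if p.HostName != "" {
		return p.HostName
	}
	return p.HostIP
}

// Labels returns the peer's identifying fields as a label map.
// The "pod_name" key is an assumption made for this sketch.
func (p Peer) Labels() map[string]string {
	return map[string]string{
		"pod_name":  p.PodName,
		"pod_ip":    p.PodIP,
		"host_name": p.HostName,
		"host_ip":   p.HostIP,
	}
}

func main() {
	p := Peer{
		PodName:  "goldpinger-abcde",
		PodIP:    "10.0.0.1",
		HostName: "node-1",
		HostIP:   "192.168.0.1",
	}
	fmt.Println(p.DisplayHost(), p.Labels())
}
```

Keeping the IP fields on the same struct means existing lookups keyed by IP could keep working while the UI and metrics switch to names.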

Sorry if I'm going out of scope here.


akhy commented on May 11, 2024

👍

Looking at the source, it seems the code only cares about the HostIP of ping targets (at least in the Prometheus metrics). I believe it would be more useful when debugging issues if we could somehow give more context about the target nodes.

In the monitoring dashboard/alerting, instead of this

"node A and B are reporting that some other unknown nodes are down"

this would be better

"node X and Y are reported down by 2 nodes"
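To illustrate the second phrasing, here is a minimal sketch that inverts per-reporter results into per-target counts. It is not Goldpinger code: `DownReport`, `SummarizeDown`, and the shape of the input map (reporter name to the targets it considers unhealthy) are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// DownReport says how many distinct reporters flagged a target as unhealthy.
type DownReport struct {
	Target    string
	Reporters int
}

// SummarizeDown takes a map of reporter -> unhealthy targets and returns
// one entry per target, sorted by reporter count (descending), then by name.
func SummarizeDown(reports map[string][]string) []DownReport {
	counts := make(map[string]int)
	for _, targets := range reports {
		// Count each target at most once per reporter.
		seen := make(map[string]bool)
		for _, t := range targets {
			if !seen[t] {
				seen[t] = true
				counts[t]++
			}
		}
	}

	out := make([]DownReport, 0, len(counts))
	for t, n := range counts {
		out = append(out, DownReport{Target: t, Reporters: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Reporters != out[j].Reporters {
			return out[i].Reporters > out[j].Reporters
		}
		return out[i].Target < out[j].Target
	})
	return out
}

func main() {
	reports := map[string][]string{
		"node-a": {"node-x", "node-y"},
		"node-b": {"node-x", "node-y"},
	}
	for _, r := range SummarizeDown(reports) {
		fmt.Printf("%s reported down by %d node(s)\n", r.Target, r.Reporters)
	}
}
```

In a dashboard or alert, the same inversion would depend on the metric carrying a label that identifies the target by name, which is the gap being discussed here.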


dannyk81 commented on May 11, 2024

Hey @seeker89 😄

Do you think this is something you would consider implementing in the near future? We particularly feel that this is missing for alerting, where we know some node(s) are unhealthy but can't pinpoint which ones from the metrics.


seeker89 commented on May 11, 2024

Hey @dannyk81!

Now that we've merged #53, I think we can add some more context to the metrics.

So, just to make sure I understand your use case: what extra labels do you need, and on which metrics?


dannyk81 commented on May 11, 2024

Great news!

I'll try and summarize our current experience:

  1. goldpinger_peers_response_time_* - these metrics have the pod_ip and host_ip labels of the pod being probed, but not the host_name.

  2. goldpinger_nodes_health_total - this metric is useful but currently very difficult to use for alerting and troubleshooting. It lets us alert when nodes are misbehaving, but it gives no way to identify which node. Right now we have an alert firing in one of our clusters, but we can't deduce from the metrics which node is actually misbehaving; we just know that one node is unhealthy (this relates to @akhy's comment, I believe).

/edit: I think it would also be great if the host name could be used in Goldpinger's UI and logs instead of the host IP. It would make troubleshooting much easier.


seeker89 commented on May 11, 2024

Some good points here. I'll have a look when I get some free bandwidth. 👍


dannyk81 commented on May 11, 2024

@seeker89 just curious if you had a chance to look into this? 🙏


seeker89 commented on May 11, 2024

Sorry, not yet. I did set aside some time for goldpinger next week, so I might be able to get into this then.


ifelsefi commented on May 11, 2024

An example latency check that works well!

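    - alert: goldpinger-node-latency
      # Average ping latency over 1m: rate of summed durations divided by
      # rate of call count (rate of _sum alone is not a per-call latency).
      expr: |
        sum by (goldpinger_instance, host_ip, pod) (rate(goldpinger_peers_response_time_s_sum{call_type="ping"}[1m]))
        /
        sum by (goldpinger_instance, host_ip, pod) (rate(goldpinger_peers_response_time_s_count{call_type="ping"}[1m]))
        > 0.040
      for: 1m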
      annotations:
        description: |
           Goldpinger pod {{ "{{ $labels.pod }}" }} on node {{ "{{ $labels.goldpinger_instance }}" }} cannot reach remote node at IP {{ "{{ $labels.host_ip }}" }} in less than 40 ms!
        summary: Node {{ "{{ $labels.host_ip }}" }} likely fubar.  Overlay network latency should be less than 40ms!
      labels:
        severity: critical
