
goldpinger's Introduction

Goldpinger


Goldpinger makes calls between its instances to monitor your networking. It runs as a DaemonSet on Kubernetes and produces Prometheus metrics that can be scraped, visualised and alerted on.

Oh, and it gives you the graph below for your cluster. Check out the video explainer.

🎉 1M+ pulls from docker hub!


Rationale

We built Goldpinger to troubleshoot, visualise and alert on our networking layer while adopting Kubernetes at Bloomberg. It has since become the go-to tool to see connectivity and slowness issues.

It's small (~16MB), simple, and you'll wonder why you didn't have it before.

If you'd like to know more, you can watch our presentation at Kubecon 2018 Seattle.

Quick start

Getting from sources:

go get github.com/bloomberg/goldpinger/cmd/goldpinger
goldpinger --help

Getting from docker hub:

# get from docker hub
docker pull bloomberg/goldpinger:v3.0.0

Building

The repo comes with two ways of building a docker image: compiling locally, and compiling using a multi-stage Dockerfile image. ⚠️ Depending on your docker setup, you might need to prepend the commands below with sudo.

Compiling using a multi-stage Dockerfile

You will need docker version 17.05+ installed to support multi-stage builds.

# Build a local container without publishing
make build

# Build & push the image somewhere
namespace="docker.io/myhandle/" make build-release

This was contributed by @michiel - kudos!

Compiling locally

In order to build Goldpinger, you are going to need go version 1.15+ and docker.

Building from source code consists of compiling the binary and building a Docker image:

# step 0: check out the code
git clone https://github.com/bloomberg/goldpinger.git
cd goldpinger

# step 1: compile the binary for the desired architecture
make bin/goldpinger
# at this stage you should be able to run the binary
./bin/goldpinger --help

# step 2: build the docker image containing the binary
namespace="docker.io/myhandle/" make build

# step 3: push the image somewhere
docker push $(namespace="docker.io/myhandle/" make version)

Installation

Goldpinger works by asking Kubernetes for pods with particular labels (app=goldpinger). While you can deploy Goldpinger in a variety of ways, it works very nicely as a DaemonSet out of the box.

Authentication with Kubernetes API

Goldpinger supports using a kubeconfig (specify with --kubeconfig-path) or service accounts.
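
If you go the kubeconfig route, the fragment below is a minimal sketch of the relevant pod spec pieces: the --kubeconfig-path flag is Goldpinger's own, but the Secret name, key and mount path are hypothetical placeholders for your setup.

      containers:
        - name: goldpinger
          args:
            - --kubeconfig-path=/etc/goldpinger/kubeconfig
          volumeMounts:
            - name: kubeconfig
              mountPath: /etc/goldpinger
              readOnly: true
      volumes:
        - name: kubeconfig
          secret:
            # hypothetical Secret expected to hold the kubeconfig under the key "kubeconfig"
            secretName: goldpinger-kubeconfig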

Example YAML

Here's an example of what you can do (using the in-cluster authentication to Kubernetes apiserver).

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger-serviceaccount
  namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  namespace: default
  labels:
    app: goldpinger
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
      labels:
        app: goldpinger
    spec:
      serviceAccount: goldpinger-serviceaccount
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
        - name: goldpinger
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            # injecting real hostname will make for easier to understand graphs/metrics
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # podIP is used to select a randomized subset of nodes to ping.
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          image: "docker.io/bloomberg/goldpinger:v3.0.0"
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          resources:
            limits:
              memory: 80Mi
            requests:
              cpu: 1m
              memory: 40Mi
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  namespace: default
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 8080
      nodePort: 30080
      name: http
  selector:
    app: goldpinger

Note that you will also need to add an RBAC rule to allow Goldpinger to list other pods. If you're just playing around, you can consider a view-all default rule:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: goldpinger-serviceaccount
    namespace: default

You can also see an example of using kubeconfig in the ./extras.

Using with IPv4/IPv6 dual-stack

If your cluster is IPv4/IPv6 dual-stack and you want to force IPv6, you can set the IP_VERSIONS environment variable to "6" (the default is "4"), which will use the IPv6 address on the pod and host.
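
A minimal sketch of that addition to the DaemonSet env list from the example above (the variable name and value come straight from the description here):

            - name: IP_VERSIONS
              value: "6"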

ipv6

Note on DNS

Note that, on top of resolving the other pods, all instances can also try to resolve arbitrary DNS names. This allows you to test your DNS setup.

From --help:

--host-to-resolve=      A host to attempt dns resolve on (space delimited) [$HOSTS_TO_RESOLVE]

So in order to test two domains, we could add an extra env var to the example above:

            - name: HOSTS_TO_RESOLVE
              value: "www.bloomberg.com one.two.three"

and goldpinger should show something like this:

screenshot-DNS-resolution

TCP and HTTP checks to external targets

Instances can also be configured to do simple TCP or HTTP checks on external targets. This is useful for visualizing more nuanced connectivity flows.

From --help:

      --tcp-targets=             A list of external targets (<host>:<port> or <ip>:<port>) to attempt a TCP check on (space delimited) [$TCP_TARGETS]
      --http-targets=            A list of external targets (<http or https>://<url>) to attempt an HTTP{S} check on. A 200 HTTP code is considered successful. (space delimited) [$HTTP_TARGETS]
      --tcp-targets-timeout=     The timeout for a TCP check on the provided tcp-targets (default: 500) [$TCP_TARGETS_TIMEOUT]
      --dns-targets-timeout=     The timeout for a DNS check on the provided hosts to resolve (default: 500) [$DNS_TARGETS_TIMEOUT]

For example, to check one HTTP and two TCP targets, add these env vars to the example above:

            - name: HTTP_TARGETS
              value: http://bloomberg.com
            - name: TCP_TARGETS
              value: 10.34.5.141:5000 10.34.195.193:6442

The timeouts for the TCP, DNS and HTTP checks can be configured via TCP_TARGETS_TIMEOUT, DNS_TARGETS_TIMEOUT and HTTP_TARGETS_TIMEOUT respectively.
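
For example, a sketch that raises all three timeouts via those env vars (the numeric values follow the same format as the 500 defaults shown in the help output above and are illustrative only):

            - name: TCP_TARGETS_TIMEOUT
              value: "1000"
            - name: HTTP_TARGETS_TIMEOUT
              value: "1000"
            - name: DNS_TARGETS_TIMEOUT
              value: "1000"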

screenshot-tcp-http-checks

Usage

UI

Once you have it running, you can hit any of the nodes (port 30080 in the example above) and see the UI.

You can click on various nodes to gray out the clutter and see more information.

API

The API is exposed via a well-defined Swagger spec.

The spec is used to generate both the server and the client of Goldpinger. If you make changes, you can re-generate them using go-swagger via make swagger.

Prometheus

Once running, Goldpinger exposes Prometheus metrics at /metrics. All the metrics are prefixed with goldpinger_ for easy identification.

You can see the metrics by doing a curl http://$POD_IP:8080/metrics (using the pod IP and port from the example above).

These are probably the droids you are looking for:

goldpinger_kube_master_response_time_s_*
goldpinger_peers_response_time_s_*
goldpinger_nodes_health_total
goldpinger_stats_total
goldpinger_errors_total
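
The example DaemonSet above already carries the prometheus.io/scrape and prometheus.io/port annotations. As a rough sketch (not part of Goldpinger itself, and with an illustrative job name), a Prometheus scrape job that honours those annotations could look like this:

scrape_configs:
  - job_name: 'goldpinger'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # scrape the port declared in the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__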

Grafana

You can find an example of a Grafana dashboard that shows what's going on in your cluster in extras. This should get you started, and once you're on a roll, why not ❤️ contribute some kickass dashboards for others to use?

Alert Manager

Once you've gotten your metrics into Prometheus, you have all you need to set useful alerts.

To get you started, here's a rule that will trigger an alert if there are any nodes reported as unhealthy by any instance of Goldpinger.

alert: goldpinger_nodes_unhealthy
expr: sum(goldpinger_nodes_health_total{status="unhealthy"})
  BY (instance, goldpinger_instance) > 0
for: 5m
annotations:
  description: |
    Goldpinger instance {{ $labels.goldpinger_instance }} has been reporting unhealthy nodes for at least 5 minutes.
  summary: Instance {{ $labels.instance }} down

Similarly, why not ❤️ contribute some amazing alerts for others to use?

Chaos Engineering

Goldpinger also makes for a pretty good monitoring tool when practicing Chaos Engineering. Check out PowerfulSeal if you'd like to do some Chaos Engineering for Kubernetes.

Authors

Goldpinger was created by Mikolaj Pawlikowski and ported to Go by Chris Green.

Contributions

We ❤️ contributions.

Have you had a good experience with Goldpinger? Why not share some love and contribute code, dashboards and alerts?

If you're thinking of making some code changes, please be aware that most of the code is auto-generated from the Swagger spec. The spec is used to generate both the server and the client of Goldpinger. If you make changes, you can re-generate them using go-swagger via make swagger.

Before you create that PR, please make sure you read CONTRIBUTING and DCO.

License

Please read the LICENSE file here.

For each version built by travis, there is also an additional version, appended with -vendor, which contains all source code of the dependencies used in goldpinger.

goldpinger's People

Contributors

0xflotus, abctaylor, akhy, angelbarrera92, cgreen12, christianjedrocdt, dannyk81, ifooth, ivkalita, jkroepke, johscheuer, jpmondet, kitfoman, kpfleming, marcosdiez, maxime1907, michiel, mtougeron, ottoyiu, rbtr, sandeepmendiratta, seeker89, sethp-verica, skamboj, stuartnelson3, suleimanwa, twexler, tyler-lloyd, ufou, wedaly


goldpinger's Issues

Add liveness/readiness endpoints

I think this would be beneficial and will ensure accurate results, i.e. avoid false positives due to some internal issue in one of the goldpinger pods.

Also test if each node is resolving DNS properly?

Somehow, today I created a new k8s cluster. With goldpinger I could easily check that all of the nodes talk to each other, but it took me a lot of time to find out that one of the nodes was not resolving DNS (all the others were).

I believe goldpinger would be the perfect tool to check that. It would also check if the nodes are talking to the outside world, which is also a great diagnostic.

Goldpinger helm deployment issue

I have tried deploying goldpinger. Below is what shows up in the logs when I deploy the helm chart for goldpinger.

2019/10/17 16:45:05 Metrics setup - see /metrics
2019/10/17 16:45:05 Goldpinger version: 1.0.2 build: Tue Dec 18 21:42:45 UTC 2018
2019/10/17 16:45:05 Kubeconfig not specified, trying to use in cluster config
2019/10/17 16:45:05 Added the static middleware
2019/10/17 16:45:05 Added the prometheus middleware
2019/10/17 16:45:05 All good, starting serving the API
2019/10/17 16:45:05 Serving goldpinger at http://[::]:80
2019/10/17 16:45:24 Shutting down...
2019/10/17 16:45:24 Stopped serving goldpinger at http://[::]:80

Please suggest how I can get past the issue and get goldpinger working. I started looking into it yesterday.

thanks

Unable to specify name of instances using HOSTNAME env

Describe the bug
The value of env variable HOSTNAME is not applied to goldpinger instance.

Steps To Reproduce
Deploy goldpinger using yaml example:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger-serviceaccount
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  namespace: default
  labels:
    app: goldpinger
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
      labels:
        app: goldpinger
    spec:
      serviceAccount: goldpinger-serviceaccount
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
        - name: goldpinger
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            # injecting real hostname will make for easier to understand graphs/metrics
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # podIP is used to select a randomized subset of nodes to ping.
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          image: "docker.io/bloomberg/goldpinger:v3.0.0"
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          resources:
            limits:
              memory: 80Mi
            requests:
              cpu: 1m
              memory: 40Mi
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  namespace: default
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 8080
      nodePort: 30080
      name: http
  selector:
    app: goldpinger

Expected behavior
In the UI, goldpinger instances have the names of the nodes where they are placed.

Actual behavior
In the UI, goldpinger instances have the names of the pods where they are placed.

Screenshots
kubectl get nodes -o wide output:
Screenshot 2021-01-22 at 16 29 01
Goldpinger UI screenshot:
Screenshot 2021-01-22 at 16 26 12

Environment (please complete the following information):

  • Kubernetes version v1.17.6

Comments

  • If I change PORT env - goldpinger applies this change.
  • If I start a pod with same yaml, but without readiness probes and change image to ubuntu - the environment variable HOSTNAME is passed in container.

Monitoring other endpoints?

Is there an option to check / graph other endpoints?

The general idea is that I'd like to check an external resource, like an external service.

Access Via Ingress

Hi,

Is there an option to configure the deployment to use a parameter like baseURL?

I am trying to access this via an Nginx ingress, I could only get the homepage working but the remaining links are breaking (/raw, /metrics). With the baseURL, all calls will be appended to it.

My ingress config that I used:

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: goldpinger
  namespace: kube-system
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
    nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/upstream-fail-timeout: "30"
    nginx.ingress.kubernetes.io/limit-connections: "500"
    nginx.ingress.kubernetes.io/limit-rpm: "500"
    nginx.ingress.kubernetes.io/limit-rps: "50"
    nginx.ingress.kubernetes.io/rewrite-target: /
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  rules:
    - host: xxx.xx.xxx
      http:
        paths:
          - path: /goldpinger
            backend:
              serviceName: goldpinger
              servicePort: 80

Should set GOOS in build/build.sh

I'm on a mac and I just cut a release of goldpinger; the pod is failing because the executable is the wrong format. It's built for Darwin right now. In order to build for the Linux kernel you have to set GOOS=linux.

Offline usage

Hello !

First, thanks for this interesting project !

Maybe I did something wrong but it looks like the UI is not usable offline.
I'm not a UI expert but it seems like it's trying to contact Cloudflare (and other CDNs) for some JS files, thus the UI is not fully working on an offline setup.

Would it be possible to embed those JS files needed by the index.html ?
Or maybe add the steps in the Dockerfile to embed those files during the docker build ?

Thanks in advance !

Feature Request - RBAC Access

Not sure if this is on the roadmap or not, but is there a possibility to use the kubernetes RBAC access to allow access to the metrics server? Ideally I should be able to provide the following:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: goldpinger
  namespace: default
rules:
- apiGroups: [""]
  resources:
    - "nodes"
    - "nodes/metrics"
    - "nodes/stats"
    - "nodes/proxy"
    - "pods"
    - "services"
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: goldpinger
subjects:
- kind: ServiceAccount
  name: goldpinger
  namespace: default
roleRef:
  kind: ClusterRole
  name: goldpinger
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: goldpinger
  name: goldpinger
  namespace: default
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: goldpinger
      version: 1.0.0
  template:
    metadata:
      labels:
        app: goldpinger
        version: 1.0.0
    spec:
      containers:
      - env:
        - name: HOST
          value: 0.0.0.0
        - name: PORT
          value: "80"
        - name: KUBECONFIG
          value: ./kube/config
        - name: REFRESH_INTERVAL
          value: "30"
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: <REDACTED>
        imagePullPolicy: IfNotPresent
        name: goldpinger
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: goldpinger
      serviceAccountName: goldpinger
      terminationGracePeriodSeconds: 30
  templateGeneration: 3
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

Currently providing this will cause the container to crash.

2018/12/13 17:54:16 Metrics setup - see /metrics
2018/12/13 17:54:16 Kubeconfig specified in  ./kube/config
2018/12/13 17:54:16 Error getting config  stat ./kube/config: no such file or directory

goldpinger pod CrashLoopBackOff

Hi,
I am trying to run goldpinger in minikube, but I am unable to bring the goldpinger pod up. The pod is crashing. Please find the kubernetes version below:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I built the docker image and pushed it to my registry.

Thanks
Vijay

Race condition with data map

On occasion goldpinger will crash. This isn't often, but it ranges anywhere from once a week to once a day. Logs suggest that there is a race condition between writing to and reading from the data map; this results in an exit code 2.
I think this may possibly be triggered by slow I/O of some sort, but just to be safe we should have some guards in there to help prevent it.

kubectl -n kube-system logs -p goldpinger-8rtwn | grep fatal
fatal error: concurrent map read and map write
    lastState:
      terminated:
        containerID: docker://189a097394e2ac3ebc871db42d9c670055b48b5f94aa659af28665ecdef1363f
        exitCode: 2
        finishedAt: 2018-12-13T13:38:26Z
        reason: Error
        startedAt: 2018-12-13T11:47:21Z

Expose dnsResults metrics through the /metrics endpoint

I'd love the ability to track dnsResults generated with the HOSTS_TO_RESOLVE parameter as a metric that prometheus can ingest.

It looks like another option to get similar results is to use prometheus's blackbox-exporter...

Multi-arch docker images

Is your feature request related to a problem? Please describe.

I am unable to use the standard docker hub image on a raspberry pi cluster.

Describe the solution you'd like

I'm not sure if travis supports it, but buildx can be used to build multi-arch images that can then be pushed to docker hub.

Describe alternatives you've considered

I have forked the repo and added my own github action to build the multi-arch image. It would be nice to have the official image support this though.

Readiness probe failed: Get "http://172.16.1.4:8080/healthz": dial tcp 172.16.1.4:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)

Describe the bug
A goldpinger pod gets in crashloopbackoff reporting that the readiness probe failed due to timeout

This may not be a bug per se

But I am able to create pods (we'll call these "toolbox" pods) in the same network as the gold-pinger pods.

The toolbox pods can listen on port 8085, and also can serve up traffic to other toolbox pods

Question is: why might the gold-pinger pods not be able to serve up traffic? Or at least, get blocked while doing so?

Goldpinger describe:

k describe po k8s-podpool-25044293-0-pod-0
Name:         k8s-podpool-25044293-0-pod-0
Namespace:    default
Priority:     0
Node:         k8s-podpool-25044293-0/10.240.0.4
Start Time:   Mon, 28 Mar 2022 13:44:30 -0700
Labels:       app=goldpinger
Annotations:  kubernetes.io/psp: privileged
Status:       Running
IP:           172.16.1.4
IPs:
  IP:  172.16.1.4
Containers:
  goldpinger:
    Container ID:   containerd://cf57961601d1733cedaaef946371e01c3d587441126792643af791e615eebde4
    Image:          docker.io/bloomberg/goldpinger:v3.0.0
    Image ID:       docker.io/bloomberg/goldpinger@sha256:6314bae32e0ec5e993369d6114cf4ade3eb055214ba80d0463d79435c7b2c7eb
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 28 Mar 2022 13:57:47 -0700
      Finished:     Mon, 28 Mar 2022 13:58:06 -0700
    Ready:          False
    Restart Count:  9
    Limits:
      memory:  80Mi
    Requests:
      cpu:      1m
      memory:   40Mi
    Liveness:   http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      HOST:              0.0.0.0
      PORT:              8080
      HOSTNAME:           (v1:spec.nodeName)
      POD_IP:             (v1:status.podIP)
      HOSTS_TO_RESOLVE:  1.1.1.1 8.8.8.8 www.bing.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cd5nc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-cd5nc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              dncnode=k8s-podpool-25044293-0
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                 From                             Message
  ----     ------                  ----                ----                             -------
  Normal   Scheduled               <unknown>                                            Successfully assigned default/k8s-podpool-25044293-0-pod-0 to k8s-podpool-25044293-0
  Warning  FailedCreatePodSandBox  16m                 kubelet, k8s-podpool-25044293-0  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b4f6176516b42c0f9f7e8e1c605c7962998d3a01ba2e88cbd88f5c8170394c92": GetMultiTenancyCNIResult failed:GetContainerNetworkConfiguration failed:[Azure cnsclient] Code: 18 , Error: NetworkContainer doesn't exist.
  Normal   Started                 15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Started container goldpinger
  Warning  Unhealthy               15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Readiness probe failed: Get "http://172.16.1.4:8080/healthz": dial tcp 172.16.1.4:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Normal   Killing                 15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Container goldpinger failed liveness probe, will be restarted
  Warning  Unhealthy               15m (x6 over 15m)   kubelet, k8s-podpool-25044293-0  Liveness probe failed: Get "http://172.16.1.4:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               15m (x6 over 15m)   kubelet, k8s-podpool-25044293-0  Readiness probe failed: Get "http://172.16.1.4:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Created                 15m (x3 over 15m)   kubelet, k8s-podpool-25044293-0  Created container goldpinger
  Normal   Pulled                  15m (x3 over 15m)   kubelet, k8s-podpool-25044293-0  Container image "docker.io/bloomberg/goldpinger:v3.0.0" already present on machine
  Warning  BackOff                 65s (x59 over 14m)  kubelet, k8s-podpool-25044293-0  Back-off restarting failed container

Goldpinger logs

k logs k8s-podpool-25044293-0-pod-0
{"level":"info","ts":1648504183.5840726,"caller":"goldpinger/main.go:63","msg":"Goldpinger","version":"v3.0.0","build":"Wed Apr  8 12:36:28 UTC 2020"}
{"level":"info","ts":1648504183.5996177,"caller":"goldpinger/main.go:102","msg":"Kubeconfig not specified, trying to use in cluster config"}
{"level":"info","ts":1648504183.6012673,"caller":"goldpinger/main.go:127","msg":"--ping-number set to 0: pinging all pods"}
{"level":"info","ts":1648504183.602398,"caller":"restapi/configure_goldpinger.go:132","msg":"Added the static middleware"}
{"level":"info","ts":1648504183.60244,"caller":"restapi/configure_goldpinger.go:149","msg":"Added the prometheus middleware"}

Goldpinger pod yaml

 k get po -oyaml k8s-podpool-25044293-0-pod-0
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: privileged
  creationTimestamp: "2022-03-28T20:44:30Z"
  labels:
    app: goldpinger
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"goldpinger"}:
            .: {}
            f:env:
              .: {}
              k:{"name":"HOST"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"HOSTNAME"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"HOSTS_TO_RESOLVE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"POD_IP"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"PORT"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":8080,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:name: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:allowPrivilegeEscalation: {}
              f:readOnlyRootFilesystem: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext:
          .: {}
          f:fsGroup: {}
          f:runAsNonRoot: {}
          f:runAsUser: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
    manager: __debug_bin
    operation: Update
    time: "2022-03-28T20:44:30Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.16.1.4"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    subresource: status
    time: "2022-03-28T20:44:44Z"
  name: k8s-podpool-25044293-0-pod-0
  namespace: default
  resourceVersion: "910397"
  uid: afd422d4-b742-48c9-887e-0321c421dc78
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - goldpinger
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - env:
    - name: HOST
      value: 0.0.0.0
    - name: PORT
      value: "8080"
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: HOSTS_TO_RESOLVE
      value: 1.1.1.1 8.8.8.8 www.bing.com
    image: docker.io/bloomberg/goldpinger:v3.0.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: goldpinger
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        memory: 80Mi
      requests:
        cpu: 1m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cd5nc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k8s-podpool-25044293-0
  nodeSelector:
    dncnode: k8s-podpool-25044293-0
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccount: goldpinger-serviceaccount
  serviceAccountName: goldpinger-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-cd5nc
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    message: 'containers with unready status: [goldpinger]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    message: 'containers with unready status: [goldpinger]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://4e4a3883c9b0b7e472341069ae896b320bafb7c11410b03cbed8b8ecfb7660db
    image: docker.io/bloomberg/goldpinger:v3.0.0
    imageID: docker.io/bloomberg/goldpinger@sha256:6314bae32e0ec5e993369d6114cf4ade3eb055214ba80d0463d79435c7b2c7eb
    lastState:
      terminated:
        containerID: containerd://4e4a3883c9b0b7e472341069ae896b320bafb7c11410b03cbed8b8ecfb7660db
        exitCode: 2
        finishedAt: "2022-03-28T22:01:26Z"
        reason: Error
        startedAt: "2022-03-28T22:01:09Z"
    name: goldpinger
    ready: false
    restartCount: 31
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=goldpinger pod=k8s-podpool-25044293-0-pod-0_default(afd422d4-b742-48c9-887e-0321c421dc78)
        reason: CrashLoopBackOff
  hostIP: 10.240.0.4
  phase: Running
  podIP: 172.16.1.4
  podIPs:
  - ip: 172.16.1.4
  qosClass: Burstable
  startTime: "2022-03-28T20:44:30Z"

Toolbox pod describe:

k describe po toolbox2
Name:         toolbox2
Namespace:    default
Priority:     0
Node:         k8s-podpool-25044293-1/10.240.0.64
Start Time:   Mon, 28 Mar 2022 14:46:31 -0700
Labels:       app=toolbox
Annotations:  kubernetes.io/psp: privileged
Status:       Running
IP:           172.16.1.7
IPs:
  IP:  172.16.1.7
Containers:
  toolbox:
    Container ID:   containerd://257f4c29c8f8f559762ae81bd4babea0a4f6bfd06cc0125dc147d6596b60cd9f
    Image:          matmerr/toolbox:lite
    Image ID:       docker.io/matmerr/toolbox@sha256:e47201c473eb0cfdeccd381a696492ac2f8f2bae62b2d92dc8832bffa341502c
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 28 Mar 2022 14:46:32 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      TCP_PORT:   8085
      UDP_PORT:   8086
      HTTP_PORT:  8087
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4tcqp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-4tcqp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              dncnode=k8s-podpool-25044293-1
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                             Message
  ----    ------     ----       ----                             -------
  Normal  Scheduled  <unknown>                                   Successfully assigned default/toolbox2 to k8s-podpool-25044293-1
  Normal  Pulled     13m        kubelet, k8s-podpool-25044293-1  Container image "matmerr/toolbox:lite" already present on machine
  Normal  Created    13m        kubelet, k8s-podpool-25044293-1  Created container toolbox
  Normal  Started    13m        kubelet, k8s-podpool-25044293-1  Started container toolbox

toolbox logs:

 k logs toolbox2
[HTTP] Listening on 8087
[UDP] Listening on [::]:8086
[TCP] Listening on [::]:8085
[TCP] Received Connection from 172.16.1.6:49240
[TCP] Received Connection from 172.16.1.6:49616

kubectl get po -owide

azure-cni-manager-snrc2        1/1     Running            0                78m   10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
azure-cni-manager-zsxv6        1/1     Running            0                78m   10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-765kd                1/1     Running            0                78m   10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-twhhv                1/1     Running            0                78m   10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
dnc-788c667694-x2bwj           1/1     Running            0                76m   10.240.0.94   k8s-dncpool-25044293-0   <none>           <none>
k8s-podpool-25044293-0-pod-0   0/1     CrashLoopBackOff   29 (4m46s ago)   76m   172.16.1.4    k8s-podpool-25044293-0   <none>           <none>
k8s-podpool-25044293-1-pod-0   0/1     CrashLoopBackOff   29 (4m21s ago)   76m   172.16.1.5    k8s-podpool-25044293-1   <none>           <none>
toolbox1                       1/1     Running            0                65m   172.16.1.6    k8s-podpool-25044293-0   <none>           <none>
toolbox2                       1/1     Running            0                14m   172.16.1.7    k8s-podpool-25044293-1   <none>           <none>

Trying to connect to the goldpinger readiness probe from the toolbox1 pod:

curl -v "http://172.16.1.4:8080/healthz"
*   Trying 172.16.1.4:8080...
* TCP_NODELAY set
* connect to 172.16.1.4 port 8080 failed: Connection refused
* Failed to connect to 172.16.1.4 port 8080: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.16.1.4 port 8080: Connection refused

Don't use sudo to build docker image

The docker cli should not need sudo to talk to the docker daemon. On a linux system you can do sudo usermod -aG docker $USER and you'll get access to the docker.sock (check the ownership of the sock to make sure you're doing the right usermod). On mac os x the docker for mac will set up docker properly, no sudo needed.

Open to a PR to fix this?

Support advanced zap configuration

Is your feature request related to a problem? Please describe.
We are using goldpinger on our self-hosted kubernetes clusters with fluentd to gather logs into our elasticsearch, but it's not possible to adjust the logger configuration (such as the level) since it's hardcoded, leaving us with a lot of ingested logs that cost us a certain amount of money.

Describe the solution you'd like
Implement a way to load a custom zap configuration through a json file, as it's described in their official documentation.

Describe alternatives you've considered
Implement command line arguments to configure each component of zap like --zap-level but it will be more work to maintain the link between command line arguments and zap's expected parameters:

  • What if they remove/change the name of one of their config variables?

Clarity on Master vs. Peer Response Time values

I've been trying to interpret the difference between "master_response_time" and "peers_response_time" in the ping results. The closest thing I see is these labels in the Prometheus metrics:

# HELP goldpinger_kube_master_response_time_s Histogram of response times from kubernetes API server, when listing other instances
# HELP goldpinger_peers_response_time_s Histogram of response times from other hosts, when making peer calls

This explanation still isn't very obvious, especially with the results I get from each (I've seen peer response time be lower than master response time at times). One way I've been thinking about it is intranode vs. internode, but that wouldn't make complete sense either since our "master" pod that we're port-forwarding is still reaching pods on other nodes.

It would be great if these definitions could be elaborated on.

Proposal: Allow limiting number of pods a single goldpinger "pings"

For larger clusters, it becomes prohibitively expensive cardinality-wise to scrape and store the metrics from every pod connecting to every other pod (n^2 timeseries growth).

We aren't necessarily interested in seeing that 95% of pods aren't able to reach the 5% that are in a rack with some sort of networking issue, just that each pod is being contacted by a large enough subset of goldpingers that we can detect an issue, while still keeping the amount of prometheus metrics exported at a reasonable size.

Would you consider a flag that would indicate to goldpinger to select X (where X is configurable) nodes, and to ping against those?

If you're interested, we can talk about potential implementation details.

On a side note: Thanks! We've been using goldpinger in smaller k8s clusters and it's detected issues, but we can't use it in any larger clusters, hence the feature request.

Support multiple pod networks

Multiple CNIs can run in parallel, creating multiple data planes in a Kubernetes cluster. Goldpinger should allow me to ping over an arbitrary data plane.

It would seem that POD_IP is the place to set this, but when I set POD_IP for each goldpinger pod to an IP on an alternative dataplane, this setting is lost after discovery; goldpinger finds the peer pods but uses the primary data plane IP addresses.

Not all data planes are discoverable from the Pod API. E.g., Multus data plane specs are in Pod annotations. Meshnet-CNI specs data planes in a CRD that refers to a Pod resource by name.

Default path not showing UI

Describe the bug

Hi guys, big fan of the app - solves a real problem for me and my team in terms of networking monitoring.

When testing out the app I was able to port-forward a pod on my cluster, go to localhost:8080/ and see the UI displaying the topology health however on more recent deployments every time I follow the same procedure the / path redirects me to http://localhost:8080/alive which 404s and is obviously not the UI.

To Reproduce
Steps to reproduce the behavior:

  1. deployed goldpinger with the following manifest
# Source: goldpinger/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: goldpinger
  labels:
    helm.sh/chart: goldpinger-0.1.1
    app: goldpinger
    k8s-app: goldpinger
---
# Source: goldpinger/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: goldpinger
  labels:
    helm.sh/chart: goldpinger-0.1.1
    app: goldpinger
    k8s-app: goldpinger
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: goldpinger
    namespace: goldpinger
---
# Source: goldpinger/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 8080
      name: http
  selector:
    app: goldpinger
---
# Source: goldpinger/templates/app.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  labels:
    helm.sh/chart: goldpinger-0.1.1
    app: goldpinger
    k8s-app: goldpinger
spec:
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      labels:
        app: goldpinger
    spec:
      serviceAccountName: goldpinger
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: goldpinger
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            # injecting real hostname will make for easier to understand graphs/metrics
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # podIP is used to select a randomized subset of nodes to ping.
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: HOSTS_TO_RESOLVE
              value: "www.google.com"
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          image: "docker.io/bloomberg/goldpinger:v3.0.0"
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          resources:
            limits:
              memory: 80Mi
            requests:
              cpu: 1m
              memory: 40Mi
      tolerations:
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists
  2. Portforward via
kubectl -n goldpinger port-forward ds/goldpinger 8080
  3. Go to localhost:8080/ in your browser
  4. See error

Expected behavior

I expected to see the UI showing all the nodes and their connection health. The /metrics endpoints work correctly as expected.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • EKS v1.16.15
  • Chrome (Mac OS)

Additional context
Add any other context about the problem here.

goldpinger UI fails if a node is in bad state.

Describe the bug
When I disable a node (i.e. manually stop a VM, kill the flanneld service, etc.), I expect to see a red mark on that particular node in the goldpinger UI. However, the entire goldpinger UI fails. This happens even if I port-forward other goldpinger pods on healthy nodes.

Is this an expected behavior? If so, doesn't it defeat the purpose of using goldpinger UI when a node is in bad state?

To Reproduce
Steps to reproduce the behavior:

  1. Deploy goldpinger as daemonsets
  2. Make a node unhealthy. (stops the VM, disables flanneld service, etc)
  3. Try to load goldpinger UI.

Expected behavior
Goldpinger UI loads and shows the red marks on the network paths to the affected node.

Screenshots
If applicable, add screenshots to help explain your problem.
When accessing via pod port-forwarding:
Screen Shot 2019-11-07 at 2 53 15 PM

When accessing via service:
Screen Shot 2019-11-07 at 2 56 39 PM

logs from still working goldpinger pod:
Screen Shot 2019-11-07 at 2 57 46 PM

Environment (please complete the following information):

  • Operating System and Version: [e.g. Ubuntu Linux 18.04]
  • Browser [e.g. Firefox, Safari] (if applicable):
    Tried in Chrome and Safari

Additional context
Add any other context about the problem here.

** Deployment configuration **

{
  apiVersion: "apps/v1",
  kind: "DaemonSet",
  metadata: {
    name: "goldpinger",
    namespace: "goldpinger",
    labels: {
      app: "goldpinger",
    },
  },
  spec: {
    updateStrategy: {
      type: "RollingUpdate",
    },
    selector: {
      matchLabels: {
        app: "goldpinger",
      },
    },
    template: {
      metadata: {
        labels: {
          app: "goldpinger",
        },
      },
      spec: {
        serviceAccount: "goldpinger",
        containers: [
          {
            name: "goldpinger",
            env: [
              {
                name: "HOST",
                value: "0.0.0.0",
              },
              {
                name: "PORT",
                value: "80",
              },
              {
                name: "HOSTNAME",
                valueFrom: {
                  fieldRef: {
                    fieldPath: "spec.nodeName",
                  },
                },
              },
            ],
            image: "docker.io/bloomberg/goldpinger:2.0.0",
            ports: [
              {
                containerPort: 80,
                name: "http",
              },
            ],
          },
        ],
      },
    },
  },
}

Fix the example kubernetes yaml files

The example in README is incomplete. The example in extras is out of date (goldpinger version). We should also add an ingress definition as an alternative to NodePort.

Question: is the type NodePort service needed to function?

In the README.md and the example-serviceaccounts.yml

we got

---
apiVersion: v1
kind: Service
metadata:
  name: goldpinger
  namespace: default
  labels:
    app: goldpinger
spec:
  type: NodePort
  ports:
    - port: 80
      nodePort: 30080
      name: http
  selector:
    app: goldpinger

which exposes the service to the outside world over the NodePort.

Question is: does goldpinger need it to function or can we just use a normal ClusterIP service?

Security-wise it doesn't seem to be a good default to me.
If I wanted, I could also just use kubectl port-forward or create an nginx resource for it.

Missing "response-time-ms" when querying check api

Describe the bug
Running curl $IP:8080/check on a goldpinger pod, some of the returned results are missing the field "response-time-ms". Are values less than 1ms just missing?

To Reproduce
Steps to reproduce the behavior:
On a kube cluster with sufficiently low network latency you'll see status-code 200s returned but no "response-time-ms" field:

# curl 172.20.0.221:8080/check | python -mjson.tool
...
    "podResults": {
        "x.x.x.x": {
            "HostIP": "x.x.x.x",
            "OK": true,
            "response": {
                "boot_time": "2020-01-03T20:51:07.082Z"
            },
            "status-code": 200
        },
...

Expected behavior
The field "response-time-ms" to exist when status-code 200 is returned

Environment (please complete the following information):

  • OpenShift 3.11

Goldpinger is too sensitive to autoscaler activity

Describe the bug

The order of operations for removing a node in Kubernetes is:

  1. Cluster autoscaler removes VM from the underlying cloud provider
  2. Node object enters NotReady state since the Kubelet process stops reporting
  3. Cloud Controller eventually notices that the VM is gone and removes the Node object

The time between 2 and 3 can be quite long (many minutes in some clouds). Goldpinger continuously tries to reach the node during this time causing spikes in Goldpinger metrics.

To Reproduce
Steps to reproduce the behavior:

  1. Overscale a cluster in a cloud with long deletion times such as Azure
  2. Allow cluster to scale down
  3. Observe peer failures in Goldpinger metrics and logs

Expected behavior

Goldpinger should provide a mechanism to filter out NotReady nodes from metric queries to focus on Nodes which are expected to be functioning normally.

Screenshots

Here's an example showing Goldpinger error rates spike as a cluster scaled down over a period of hours.

Screen Shot 2020-05-26 at 4 50 07 PM

Environment (please complete the following information):

  • Operating System and Version: N/A
  • Browser [e.g. Firefox, Safari] (if applicable): N/A

Additional context
Add any other context about the problem here.

goldpinger metrics explanation

Hi! I find goldpinger very useful, but I can't understand what its metrics mean. Can you explain, please?

goldpinger_nodes_health_total{app="goldpinger",controller_revision_hash="6bd97ddd49",goldpinger_instance="node22",instance="10.44.35.135:8080",job="pods",kubernetes_namespace="goldpinger",kubernetes_pod_name="goldpinger-qzpd7",status="unhealthy"} 4

The result here is 4. What does that mean exactly?

goldpinger_nodes_health_total{goldpinger_instance="node30",instance="node08",job="goldpingers",status="unhealthy"} 8

The result here is 8. What does that mean exactly?

Goldpinger starts so quickly it often gets pods without IPs

Goldpinger starts quickly, which is good, but it has an interesting side effect: often when it first lists pods, kubernetes doesn't list their IPs. With the current implementation, it often takes 30 seconds to restart the updater goroutines. During these initial 30 seconds, the graph and metrics are unhealthy.

It would be good to enhance it so that, if the pods don't have IPs, we retry getting them more quickly than on other kinds of errors.

Add TCP checks on top of HTTP checks

I've been trying Goldpinger out recently, and have been finding it useful already. I was looking through the code, and was curious why you chose to do http monitoring, rather than tcp or icmp pings?

The REST API should have a health endpoint

Currently goldpinger's REST API doesn't have a /healthz endpoint, which is not ideal for a kubernetes service as you're forced to check / in a (liveness|readiness)Probe. This isn't a best practice, so a health endpoint should be added.

Should support IPv4/IPv6 dual-stack

Is your feature request related to a problem? Please describe.

Goldpinger should allow for IPv6 IPs to be specified in the config and ping over IPv6 rather than only IPv4.

Describe the solution you'd like

Setting an env var like "PING_IPV6" would use the IPv6 IP address on pods instead of IPv4.

Describe alternatives you've considered

I haven't tested if IPv6 was the default and status.PodIP was an IPv6 address, but that is the only way to use IPv6 addresses with goldpinger. It should support dual-stack clusters that have IPv4/IPv6 addresses in spec.PodIPs

Long Ping Times

Hi, this looks to be a really neat project and I love the name.

When I tried to install via Helm chart the default Grafana chart showed very long "ping" times; circa 2 seconds for some larger clusters and even a ten node cluster is showing ~1 second ping times.

I set up another service to run requests against the /ping endpoint that Goldpinger exposes and saw the expected low ping times (circa a few mills).

Have you noticed this before with any of your clusters? Any tips on configuration gotchas that I may have missed?

When PING_NUMBER is nonzero, there are many nodes that are immediately marked as unhealthy

Describe the bug
If I start off with 50 healthy nodes and then set PING_NUMBER to 20, I notice that roughly 30 nodes get marked as unhealthy. It appears that 20 nodes ping 20 nodes instead of 50 (all) nodes pinging 20 nodes. When I unset PING_NUMBER, the nodes go back to being healthy.

I think the bug is here. It seems to me that AllPods() should be "checked" but only SelectPods() are actually being checked which results in SelectPods() pinging SelectPods().

Expected behavior
All nodes ping PING_NUMBER nodes and no nodes are marked as unhealthy as a result of applying change to set PING_NUMBER.

Long ping times

Hey all,

Saw a similar post with this problem. When I try and install via Helm chart the default Grafana chart showed very long ping times; circa ~3 seconds for some larger clusters, and even a ten node cluster is showing ~1 to ~2 second ping times.

As one user has done, I made a new build and instrumented out the calls a bit and the reported time seems accurate, but when I hit the /ping endpoints directly from pods with other tools like a timed curl the time seems much more reasonable (~a few mills).

Ping Hostname

I'd like to configure it to ping the hostname, not the IP. This would allow our team to test network connections using the external IP.

[feature request] topology.kubernetes.io values in metrics

It would be awesome if the metrics could include the node's topology.kubernetes.io/region & topology.kubernetes.io/zone (not sure what the non-AWS fields are called). That would help quickly identify patterns in the results. e.g., zone "a" is slower than zone "b" or that cross-zone traffic is what's causing a large spike in response times. Not sure if this is possible within the framework of this tool.

keep up the great work & thank you!

Include PodSpec.nodeName as a label in goldpinger_peers_response_time_s_* metrics

Hey guys!

Happy New Year!!! 🎉

Super cool project, we've recently deployed this in our clusters and it's proving to be super useful.

I was wondering if it would be possible to add a nodeName label to these (goldpinger_peers_response_time_s_*) metrics so it would be easier to identify the target nodes being probed.

the node name should be available in the PodSpec.nodeName

Kubernetes and OpenShift Operator

Hello, I would like to collaborate with your team on creating a goldpinger operator for Kubernetes and OpenShift.
I have most of the logic already working, based on operator-sdk.
I would love it if any of you could be assigned to help me.
Thanks!
have a great week.
Dan

REFRESH_INTERVAL should have a sane default

Currently if REFRESH_INTERVAL is not set in the environment, goldpinger will not update/refresh, as the zero-value of an int is 0. This seems counter-intuitive as a default. I think we should set this to 30s, as in the examples.

Windows Container support

Is your feature request related to a problem? Please describe.
I'm testing running Kubernetes with Windows nodes, so of course I need this tool, but there are no instructions on how to build and run this in a Windows container.

Describe the solution you'd like
Instructions for how to build this in a Windows container, and preferably a pre-built container on docker-hub.
Instructions for how to deploy this on a mixed Linux/Windows Kubernetes cluster.

Describe alternatives you've considered
Building this myself, but then I need to understand how Go things are built.

Additional context

Helm charts repo deprecation

Is your feature request related to a problem? Please describe.
Due to the deprecation of the helm/charts repository, the goldpinger helm chart should be moved somewhere. Do the bloomberg maintainers want to create their own repo for helm charts or put the helm chart in the goldpinger repo?

Describe the solution you'd like
The chart should be moved to another repo.

Describe alternatives you've considered

Additional context
