
fury-kubernetes-monitoring's Introduction

Kubernetes Fury Monitoring


Kubernetes Fury Monitoring provides a fully-fledged monitoring stack for the Kubernetes Fury Distribution (KFD). This module extends and improves upon the Kube-Prometheus project.

If you are new to KFD, please refer to the official documentation on how to get started with KFD.

Overview

This module is designed to give you full control and visibility over your cluster operations. Metrics from the cluster and the applications are collected, and clear analytics are offered through Grafana, the visualization platform.

The centerpiece of this module is the Prometheus Operator, which provides the following components as custom resources:

  • Prometheus: An open-source monitoring and alerting toolkit for cloud-native applications
  • Alertmanager: Manages alerts sent by the Prometheus server and routes them through receiver integrations such as email, Slack, or PagerDuty
  • ServiceMonitor: Declaratively specifies how groups of services should be monitored; the operator automatically generates Prometheus scrape configuration from the definition (see the example sketch below)
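For illustration, here is a minimal ServiceMonitor sketch (the application name, labels, and port are hypothetical placeholders, not resources shipped by this module); the operator turns a definition like this into scrape configuration for every matching Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical name
  namespace: monitoring
  labels:
    app: example-app
spec:
  selector:
    matchLabels:
      app: example-app           # Services carrying this label get scraped
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics              # name of the Service port exposing /metrics
      interval: 30s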

Since the export of certain metrics can be heavily cloud-provider specific, we provide a set of cloud-provider-specific configurations. The setups we currently support include:

  • Google Kubernetes Engine (GKE)
  • Azure Kubernetes Service (AKS)
  • Amazon Elastic Kubernetes Service (EKS)
  • on-premises or self-managed cloud clusters

Most of the components in this module are deployed in the monitoring namespace, unless their functionality requires permissions that force them to be deployed in the kube-system namespace.

Packages

Kubernetes Fury Monitoring provides the following packages:

| Package | Version | Description |
|---------|---------|-------------|
| prometheus-operator | 0.67.1 | Operator to deploy and manage Prometheus and related resources |
| prometheus-operated | 2.46.0 | Prometheus instance deployed with the Prometheus Operator's CRD |
| alertmanager-operated | 0.26.0 | Alertmanager instance deployed with the Prometheus Operator's CRD |
| blackbox-exporter | 0.24.0 | Prometheus exporter that allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC |
| grafana | 9.5.5 | Grafana deployment to query and visualize metrics collected by Prometheus |
| karma | 0.113 | Karma deployment to visualize alerts sent by Alertmanager |
| kube-proxy-metrics | 0.14.0 | RBAC proxy to securely expose kube-proxy metrics |
| kube-state-metrics | 2.9.2 | Service that generates metrics from Kubernetes API objects |
| node-exporter | 1.6.1 | Prometheus exporter for hardware and OS metrics exposed by *NIX kernels |
| prometheus-adapter | 0.11.1 | Implementation of the Kubernetes resource metrics, custom metrics, and external metrics APIs |
| thanos | 0.34.0 | High-availability Prometheus setup that provides long-term storage via an external object store |
| x509-exporter | 3.12.0 | Provides monitoring for certificates |
| mimir | 2.11.0 | Open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics |
| haproxy | N.A. | Grafana dashboards and Prometheus rules (alerts) for HAProxy |

Integration with cloud providers

One of the following components can be used to enable service monitoring in each cloud environment:

| Component | Description |
|-----------|-------------|
| aks-sm | ServiceMonitor to collect Kubernetes components metrics from AKS |
| gke-sm | ServiceMonitor to collect Kubernetes components metrics from GKE |
| eks-sm | ServiceMonitor to collect Kubernetes components metrics from EKS |
| kubeadm-sm | ServiceMonitors, Prometheus rules and alerts for Kubernetes components of self-managed or on-premises clusters |

Please refer to the individual package documentation for further details.

Compatibility

| Kubernetes Version | Compatibility Notes |
|--------------------|---------------------|
| 1.27.x | No known issues |
| 1.28.x | No known issues |
| 1.29.x | No known issues |

Check the compatibility matrix for additional information about previous releases of the modules.

Usage

Prerequisites

| Tool | Version | Description |
|------|---------|-------------|
| furyctl | >=0.25.0 | The recommended tool to download and manage KFD modules and their packages. To learn more about furyctl, read the official documentation. |
| kustomize | >=3.5.3 | Packages are customized using kustomize. To learn how to create your customization layer with kustomize, please refer to the repository. |

Deployment

  1. List the packages you want to deploy and their versions in a Furyfile.yml:
versions:
  monitoring: v3.2.0

bases:
    - name: monitoring/prometheus-operator
    - name: monitoring/prometheus-operated
    - name: monitoring/alertmanager-operated
    - name: monitoring/blackbox-exporter
    - name: monitoring/kube-proxy-metrics
    - name: monitoring/kube-state-metrics
    - name: monitoring/grafana
    - name: monitoring/node-exporter
    - name: monitoring/prometheus-adapter
    - name: monitoring/x509-exporter

Along with the primary components, include one of the following, based on your cloud provider, to enable service monitoring:

  • ServiceMonitor for AWS EKS cluster
  ...
  - name: monitoring/eks-sm
  • ServiceMonitor for Azure AKS cluster
  ...
  - name: monitoring/aks-sm
  • ServiceMonitor for GCP GKE cluster
  ...
  - name: monitoring/gke-sm
  • ServiceMonitor for on-premises and self-managed cluster
  ...
  - name: monitoring/kubeadm-sm

See the furyctl documentation for additional details about the Furyfile.yml format.

  2. Execute furyctl legacy vendor -H to download the packages.

  3. Inspect the downloaded packages under ./vendor/katalog/monitoring.

  4. To deploy the packages to your cluster, define a kustomization.yaml with the following content:

bases:
    - ./vendor/katalog/monitoring/prometheus-operator
    - ./vendor/katalog/monitoring/prometheus-operated
    - ./vendor/katalog/monitoring/alertmanager-operated
    - ./vendor/katalog/monitoring/blackbox-exporter
    - ./vendor/katalog/monitoring/kube-proxy-metrics
    - ./vendor/katalog/monitoring/kube-state-metrics
    - ./vendor/katalog/monitoring/grafana
    - ./vendor/katalog/monitoring/node-exporter
    - ./vendor/katalog/monitoring/prometheus-adapter
    - ./vendor/katalog/monitoring/x509-exporter

Also include in the kustomization the ServiceMonitor package specific to your cloud provider, as follows:

  • For AWS EKS
  ...
  - ./vendor/katalog/monitoring/eks-sm
  • For GCP GKE
  ...
  - ./vendor/katalog/monitoring/gke-sm
  • For Azure AKS
  ...
  - ./vendor/katalog/monitoring/aks-sm
  • For on-premises and self-managed
  ...
  - ./vendor/katalog/monitoring/kubeadm-sm
  5. Deploy the packages to your cluster by executing:
kustomize build . | kubectl apply -f - --server-side

Examples

To see examples of how to customize Kubernetes Fury Monitoring packages, please go to the examples directory.

Contributing

Before contributing, please first read the Contributing Guidelines.

Reporting Issues

In case you experience any problems with the module, please open a new issue.

License

This module is open-source and released under the following LICENSE.


fury-kubernetes-monitoring's Issues

machine-id value monitor to assure unique values around cluster nodes

After an outage, I opened an issue with @angelbarrera92 on the node-exporter project to also implement a machine-id collector: prometheus/node_exporter#1546.
Long story short, a meta-collector that can be adapted to our use case already exists: the textfile collector.
As mentioned in the issue referenced above, it can be configured for this purpose by following these steps:

  • add the following parameter to the node-exporter binary: --collector.textfile.directory=/var/textfile-dir, where /var/textfile-dir can be whatever directory you want

  • create, inside the chosen directory (in our example /var/textfile-dir), a file machine-id.prom with the following content: systemd_machine_id{id="abc"} 1, where I guess "abc" will be the label stored in the Prometheus time-series DB.

I think it should just work out of the box; it is probably necessary to adjust the id key, but nothing else.
After this, it will be mandatory to add a ServiceMonitor and an alert so we can verify that every machine-id is unique in the cluster (see the sketch below).
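A hedged sketch of what that uniqueness check could look like as a PrometheusRule, assuming the metric is exposed under the name systemd_machine_id as in the example above (the resource name, duration, and severity are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-id-rules         # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: machine-id.rules
      rules:
        - alert: DuplicatedMachineId
          # Fires when the same machine-id is exposed by more than one node.
          expr: count by (id) (systemd_machine_id) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: The same machine-id is reported by more than one node in the cluster.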

node-exporter targetdown

We are facing the same problem in different clusters: every single one of them has begun firing this alert after the Kubernetes upgrade to 1.21, and even after going to 1.23 the alert is still there.

Error faced and symptoms
The logs of the pods are full of this: node-exporter-gsg5p kube-rbac-proxy http: proxy error: context canceled
Prometheus sees the pods as down and starts alerting for TargetDown: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10

Module version
The monitoring module version is usually v1.14.3.

Kubernetes version
From 1.21 upwards.

Tests done looking for a solution
removed runAsNonRoot: true
removed any limits on the resources
checked if there were throttling problems
edited scrapeTimeout and interval

Solution that has worked
Remove the kube-rbac-proxy container from the node-exporter pod (a sketch follows).
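For reference, a minimal sketch of that workaround as a kustomize strategic-merge patch; the DaemonSet name and namespace are assumptions about this module's node-exporter manifests, so verify them before applying:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter            # assumed DaemonSet name
  namespace: monitoring
spec:
  template:
    spec:
      containers:
        # Delete the kube-rbac-proxy sidecar from the pod spec.
        - name: kube-rbac-proxy
          $patch: delete

Note that without the proxy the metrics endpoint is exposed differently, so the node-exporter ServiceMonitor may need to be adjusted as well.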

Improve v2.0.0 release notes

Since the Grafana Service changes, we need to delete it before upgrading. We could add a note saying that, in addition to deleting the Grafana Deployment, you should also delete the Service before applying the new version.

CPUThrottlingHigh alert too easily triggered

The CPUThrottlingHigh alert is used to notify if the Kubelet is throttling a pod's CPU usage for more than 25% of the time in the last 5 minutes. While this might indicate wrong CPU limits, there are particular classes of pods (e.g. node-exporter) that are more subject to throttling than others.

My proposal is either to drop this alert or to increase the threshold to at least 50-75%, given the narrow time window (a sketch follows).
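A hedged sketch of the second option, with the expression adapted from the upstream kube-prometheus rule (the exact expression shipped by this module may differ, and 75% is just one of the values proposed above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling-high-override   # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-resources
      rules:
        - alert: CPUThrottlingHigh
          # Same shape as the upstream rule, with the threshold raised from 25% to 75%.
          expr: |
            sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
              /
            sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
              > (75 / 100)
          for: 15m
          labels:
            severity: info
          annotations:
            summary: Processes experience elevated CPU throttling.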

What's your opinion about this?

I am also attaching some issues from upstream projects.

Refs:

reduce prometheus memory usage

We spent some days working on a Prometheus installation and we found that we "leave" too many labels on the metrics; some are even duplicated.

Proposal
By adding the following configuration we can remove some of the labels which have high cardinality and aren't used in dashboards or rules from the various Fury modules:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
spec:
  writeRelabelConfigs:
    - job_name: id-label
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'id'
          action: drop
    - job_name: container_id-label
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'container_id'
          action: drop
    - job_name: le-label
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'le'
          action: drop
    - job_name: service-name-label
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'service-name'
          action: drop
    - job_name: address-label
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'address'
          action: drop

In order to further save memory, it would be a good idea to choose between the service and job labels, since the two carry duplicated information, but in some places job is used and in others service.
By choosing one to use everywhere, we can drastically reduce cardinality by removing the other, as per the example posted above.
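As a complementary option, labels can also be dropped at ingestion time on a per-scrape-job basis with metricRelabelings on the relevant ServiceMonitor. A hedged sketch against a hypothetical ServiceMonitor (the labeldrop action removes the named label from every scraped sample):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical ServiceMonitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      metricRelabelings:
        # Drop high-cardinality labels before the samples are ingested.
        - action: labeldrop
          regex: id
        - action: labeldrop
          regex: container_id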

What do you think about this?

prometheus adapter ApiService for custom and external metrics

The prometheus-adapter currently does not create the custom and external metrics APIServices needed by the HPA.

In the implementation, we only enable the most common metrics based on CPU and memory. Still, in the README, we describe the possibility of using aggregated metrics based on other Kubernetes resource types (custom) and on resources that are not Kubernetes objects (external):

The Prometheus adapter provides an implementation of the Kubernetes resource metrics, custom metrics, and external metrics APIs.
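For context, exposing the custom metrics API means registering an APIService that points at the adapter's Service. A hedged sketch following the common kube-prometheus layout (the Service name and namespace are assumptions about this module's prometheus-adapter package):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  version: v1beta1
  service:
    name: prometheus-adapter     # assumed Service name
    namespace: monitoring
  insecureSkipTLSVerify: true    # or set caBundle for proper TLS verification
  groupPriorityMinimum: 100
  versionPriority: 100

An analogous APIService would be needed for external.metrics.k8s.io, and the adapter's configuration would also need rules describing which metrics to serve.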

Prometheus-operated: ClusterRole - cannot list resources

Kubernetes Version: v1.24.8
fury-kubernetes-monitoring: v2.0.0
furyctl: Furyctl version 0.8.0
Furyfile.yml

versions:
  monitoring: v2.0.0
bases:
    - name: monitoring/prometheus-operator
    - name: monitoring/prometheus-operated
    - name: monitoring/alertmanager-operated
    - name: monitoring/kube-state-metrics
    - name: monitoring/grafana
    - name: monitoring/node-exporter
    - name: monitoring/kubeadm-sm

In the default configuration prometheus-operated has insufficient permissions in its ClusterRole.
https://github.com/sighupio/fury-kubernetes-monitoring/blob/v2.0.0/katalog/prometheus-operated/clusterRole.yaml

Logs from prometheus pod:

ts=2022-12-01T12:50:47.605Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.605Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"monitoring\""
ts=2022-12-01T12:50:47.607Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)"
ts=2022-12-01T12:50:47.607Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)"

After updating prometheus-operated/clusterRole.yaml with the file below, it started working correctly:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - services
  - pods
  - endpoints
  verbs:
  - get
  - list
  - watch

examples are outdated

The examples provided in the examples folder are outdated; they won't work with the latest versions of furyctl and of the module.

At least those for Alertmanager (the ones I checked).

KubeletDown alert always firing

The KubeletDown alert is always firing.

The alert's expression is absent(up{job="kubelet",metrics_path="/metrics"} == 1), but since the up time series lacks the metrics_path label, the absent function always returns a 1-element vector, thus triggering the alert.

I will provide a PR to fix this.
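A hedged sketch of the direction such a fix could take, simply dropping the metrics_path matcher so that absent() only fires when the kubelet job is truly missing (the rule metadata, duration, and severity below are illustrative, not necessarily what the PR will contain):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubelet-down-fix         # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: kubelet.rules
      rules:
        - alert: KubeletDown
          # Without the metrics_path matcher, absent() returns an empty result
          # as long as at least one kubelet target is up.
          expr: absent(up{job="kubelet"} == 1)
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Kubelet has disappeared from Prometheus target discovery.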

UX: Use Folders in Grafana

Currently, all the dashboards deployed by default end up inside the "Default" folder in Grafana's UI, which makes it a little messy to browse the dashboards or find a specific one.

It would be better to organize the dashboards into folders. We could, for example, have a folder for each KFD module, or group them by some other criteria.
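One possible direction, sketched under the assumption that the dashboards are file-provisioned: Grafana's dashboard provider configuration supports a folder field, so each provider entry (e.g. one per KFD module) could place its dashboards in a dedicated folder. The provider names and paths below are hypothetical:

apiVersion: 1
providers:
  - name: kfd-monitoring                 # hypothetical provider name
    folder: KFD / Monitoring             # Grafana UI folder for these dashboards
    type: file
    disableDeletion: true
    options:
      path: /grafana-dashboard-definitions/monitoring   # hypothetical mount path
  - name: kfd-logging
    folder: KFD / Logging
    type: file
    disableDeletion: true
    options:
      path: /grafana-dashboard-definitions/logging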

Alertmanager adds namespace label to its configuration

Hello! During a KFD upgrade, I noticed this weird behavior using the current AlertmanagerConfig resources.

Inspecting /etc/alertmanager/config/alertmanager.yaml inside the alertmanager pods, I get this:

route:
  routes:
  - receiver: monitoring/deadmanswitch/healthchecks
    matchers:
    - alertname="DeadMansSwitch"
    - namespace="monitoring"
    continue: true
  - receiver: monitoring/infra/infra
    group_by:
    - alertname
    matchers:
    - namespace=~"calico-api|calico-system|cert-manager|gatekeeper-system|ingress-nginx|kube-system|logging|monitoring|pomerium|tigera-operator|vmware-system-csi"
    - alertname=~"TargetDown|KubePersistentVolumeFillingUp|NginxIngressCertificateExpiration|NginxIngressLatencyTooHigh|NginxIngressFailedReload|NginxIngressFailureRate|NginxIngressDown"
    - namespace="monitoring"
    continue: true
  - receiver: monitoring/k8s/k8s
    group_by:
    - alertname
    matchers:
    - alertname=~"NodeNetworkInterfaceFlapping|NodeClockSkewDetected|KubeAPIDown|KubeletDown|KubeletTooManyPods|KubeNodeNotReady|KubeClientCertificateExpiration|NodeFileDescriptorLimit|NodeFilesystemAlmostOutOfFiles|NodeFilesystemAlmostOutOfSpace|NodeFilesystemFilesFillingUp|NodeFilesystemSpaceFillingUp|NodeRAIDDegraded|NodeRAIDDiskFailure|etcdInsufficientMembers|etcdMembersDown|etcdNoLeader|CoreDNSPanic|CoreDNSHealthRequestsLatency|KubeControllerManagerDown|KubeSchedulerDown|PrometheusRuleFailures|AlertmanagerClusterDown|AlertmanagerConfigInconsistent|AlertmanagerFailedReload|X509CertificateExpiration|X509CertificateRenewal"
    - namespace="monitoring"
    continue: true

As you can see, Alertmanager adds a namespace="monitoring" matcher to every route, so the result is that I don't get the alerts delivered correctly to the target URL, because the alerts don't have that label set.

My solution, for now, is to set spec.enforcedNamespaceLabel: namespace in the Prometheus resource, so every alert generated by Prometheus is injected with a namespace=<namespace> label.
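A minimal sketch of that workaround, assuming the Prometheus resource shipped by this module is named k8s in the monitoring namespace (the field comes from the Prometheus Operator API):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                      # assumed name of the Prometheus resource
  namespace: monitoring
spec:
  # Adds a namespace label (set to the namespace of the originating rule or
  # monitor) to scraped metrics and generated alerts, so that matchers like
  # namespace="monitoring" can match.
  enforcedNamespaceLabel: namespace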

However, this solves only the DeadMansSwitch issue; at the moment I'm not sure whether the other alert groups are working fine. I will give you more details whenever possible.

Upgrade components

Prometheus-operator: v0.28.0
prometheus: v2.7.1
alertmanager: v0.16.0

Kubernetes Cluster Grafana Dashboard should move to Calico

Currently, the aforementioned dashboard ("Kubernetes Cluster", one of the main dashboards) is showing some metrics specific to Weave Net, which have no data points if Weave Net is not deployed.

Given that Calico is now our default CNI, we should move these graphs to a separate dashboard, if they are still relevant, and add relevant Calico-related metrics to the main dashboard if necessary.
