kubernetes-iteration-toolkit's Introduction

What is the Kubernetes Iteration Toolkit (KIT)?

KIT is a set of decoupled tools designed to accelerate the development of Kubernetes through testing. It combines a variety of open source projects to define an opinionated way to rapidly configure and test Kubernetes components on AWS.

Why did we build KIT?

The EKS Scalability team is responsible for improving performance across the Kubernetes stack. We started our journey by manually running tests against modified dev clusters. This helped us to identify some bottlenecks, but results were difficult to demonstrate and reproduce. We wanted to increase the velocity of our discoveries, as well as our confidence in our results. We set out to build automation to help us configure cluster components, execute well known test workloads, and analyze the results. This evolved into KIT, and we’re ready to share it to help accelerate testing in other teams.

What can I do with KIT?

KIT can help you run scale tests against a KIT cluster or an EKS cluster, and collect logs and metrics from the cluster control plane and nodes to help you analyze the performance of a Kubernetes cluster. KIT ships with a set of tools, such as Karpenter, the AWS load balancer (ELB) controller, Prometheus, Grafana, and Tekton, installed and configured to manage cluster lifecycle, run tests, and collect results.

What are KIT Environments?

KIT Environments provide an opinionated testing environment with support for test workflow execution, analysis, and observability. Developers can use the kitctl CLI to create a personal or shared testing environment for one-shot or periodic tests. A KIT Environment consists of a management Kubernetes cluster that comes preinstalled with a suite of Kubernetes operators that execute tests, make test results easy to analyze, and persist logs and control plane metrics for each test run.

Additionally, KIT Environments provide a library of predefined Tasks to configure clusters, generate load, and analyze results. For example, you can combine the “MegaXL KIT cluster” task and “upstream pod density load generator” task to reproduce the scalability team’s MegaXL test results. You can then swap in the “EKS Cluster” task and verify the results as improvements are merged into EKS. You can also parameterize existing tasks or define your own to meet your use cases.

What are KIT clusters?

KIT clusters enable developers to declaratively configure EKS-like clusters with arbitrary modifications. Using a Kubernetes CRD, you can modify the EC2 instance types, container image, environment variables, or command line arguments of any cluster component. These configurations can be checked into git and reproduced for periodic regression testing or against new test scenarios.

KIT clusters are implemented using Kubernetes primitives like deployments, statefulsets, and services. More advanced use cases can be achieved by implementing a new feature in the KIT cluster Operator and exposing it as a new parameter in the CRD. You can install the KIT cluster Operator on any Kubernetes cluster or with kitctl bootstrap.
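For example, a minimal ControlPlane manifest looks roughly like the following. This sketch only uses the kit.k8s.sh/v1alpha1 fields that appear in the issues further down this page; other fields and defaults may differ, so treat it as illustrative rather than authoritative.

$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: example                # desired guest cluster name
spec:
  kubernetesVersion: "1.21"    # control plane version to test
  etcd:
    replicas: 3                # size of the etcd cluster
  master:
    apiServer:
      replicas: 3              # number of kube-apiserver replicas
EOF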

How do I get started with KIT?

KIT v0.1 (alpha) is available now. You can get started with kitctl by following the instructions in "How to get started with KIT".

Note: KIT is an alpha project; things are changing and evolving rapidly. If you run into any issues, feel free to open an issue.

kubernetes-iteration-toolkit's People

Contributors

adityavenneti, ashishranjan738, bwagner5, dependabot[bot], ellistarn, engedaam, felix-zhe-huang, ganiredi, hakuna-matatah, haugenj, ivelichkovich, jonathan-innis, melnikalex, mengqiy, natherz97, oliviassss, orsenthil, prateekgogia, premdass, riskyadventure, rschalo, smahendarkar, spring1843, xdu31

kubernetes-iteration-toolkit's Issues

[operator] Configure cloud provider in KCM

KCM (kube-controller-manager) currently has no information about the cloud provider; it should be configured so that it can talk to cloud provider APIs. As a result, once nodes are terminated in EC2, the corresponding Node objects in the KIT cluster are not deleted.
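A hedged sketch of the kind of configuration this implies (these are standard kube-controller-manager flags; how the KIT operator would wire them into the KCM pod spec, or whether it would instead run an external cloud-controller-manager, is an open design question):

kube-controller-manager \
  --cloud-provider=aws \
  --cloud-config=/etc/kubernetes/cloud.conf
# --cloud-provider=aws enables the legacy in-tree AWS integration (an external cloud-controller-manager is the alternative)
# with a cloud provider configured, the node lifecycle controller can delete Node objects whose EC2 instances are gone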

[operator] control plane service name too long

The control plane service is named <cluster-name>-controlplane-endpoint and the referenced port is <cluster-name>-controlplane-endpoint-port. In some cases the cluster name is long enough that the resulting DNS name exceeds 63 characters, which breaks DNS name creation.
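A quick illustration of the problem (63 characters is the DNS label length limit):

$ CLUSTER_NAME=my-really-long-descriptive-scalability-test-cluster-name
$ echo -n "${CLUSTER_NAME}-controlplane-endpoint" | wc -c
78
# anything over 63 characters cannot be used as a single DNS label, so creation fails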

[operator] Missing documentation for KIT operator users

Additional requirements that we uncovered for the KIT operator need to be documented as part of its installation process, so that users of the KIT operator do not run into the issues below:

  • NAT subnet / IGW routing requirement
  • Karpenter dependency for KIT, including the minimum version (e.g. 0.4.1)
  • Custom subnet tags required for KIT/Karpenter to launch nodes that can schedule KIT control plane and data plane pods (an example tagging command is sketched below)
  • DNS issue for dataplane nodes created by KIT, where pods end up with the wrong DNS IP in resolv.conf

All of these need to be documented in the README, since users need to know them when installing the KIT operator, particularly users who want to run KIT on kOps, a testbed, etc.
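As an example of the subnet-tagging requirement, discovery-by-tag typically looks like the command below. The tag key/value here is hypothetical; the exact tags that KIT/Karpenter expect need to be confirmed and documented.

$ aws ec2 create-tags \
    --resources subnet-0abc1234example subnet-0def5678example \
    --tags Key=kit.aws/environment,Value=my-test-environment
# hypothetical tag shown above; replace with whatever tag KIT/Karpenter subnet discovery actually requires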

Improvements to KIT

  • IPv6 mode for running guest clusters (more KIT clusters per environment)
  • Use ClusterIP from Tekton tasks (avoid the slow NLB)
  • Don't wait in helm (avoid hanging on cluster creation)
  • Rename substrate -> environment
  • Simplify overly parameterized pipelines

[operator] API server crashloops for version 1.21

After adding the --insecure-port flag for monitoring on version 1.21, the API server crashes with this error:

Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
Flag --insecure-bind-address has been deprecated, This flag has no effect now and will be removed in v1.24.
Error: invalid port value 8080: only zero is allowed

[operator] Updating KIT Controlplane spec brings up all pods at once

When the ControlPlane spec is updated with new args, the KIT operator recycles all of the system components (apiserver, etc.) at once, which leads to an outage during the update.

As shown below, all of the apiserver pods for this control plane are unavailable during the update:

dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods | grep c0955120-57bc-420f-bec0-f79d28d86fbe
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b57t5d2   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b58v2wt   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b5zqqzg   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-bfnrm   1/1     Running             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-gf4lp   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-ms4vq   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-0                     1/1     Running             0          100m
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-1                     0/1     Pending             0          5s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-2                     0/1     Error               1          62s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787gwpl4   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787m57zq   1/1     Running             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787p82fz   0/1     Pending             0          64s


dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods | grep c0955120-57bc-420f-bec0-f79d28d86fbe
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b57t5d2   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b58v2wt   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b5zqqzg   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-bfnrm   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-gf4lp   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-ms4vq   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-0                     1/1     Running     0          3m25s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-1                     1/1     Running     2          5m15s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-2                     1/1     Running     4          6m12s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787gwpl4   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787m57zq   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787p82fz   1/1     Running     0          6m14s

Expected: pods should be updated according to a rolling update strategy and should not disrupt the current state of the cluster (a sketch of the relevant settings follows).
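A minimal sketch of the relevant Deployment settings (standard Kubernetes rolling-update fields; whether and how the KIT operator exposes them is an assumption):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take an existing apiserver pod down before its replacement is Ready
      maxSurge: 1         # roll replacements in one at a time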

[operator] Scaling up number of apiserver pods will result in pods being assigned to the same node

If we apply the default ControlPlane object, we launch one apiserver pod and Karpenter launches one control plane node for this pod. If we then scale the number of replicas in the apiserver deployment from one to three, the two new apiserver pods are scheduled on the existing node rather than being assigned their own nodes. Note that creating the ControlPlane object initially with three replicas would result in three nodes being launched. A possible mitigation using pod anti-affinity is sketched after the reproduction steps below.

Reproduction:

  1. Create a default ControlPlane object which defaults to one apiserver replica:
$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: ${GUEST_CLUSTER_NAME} # Desired Cluster name
spec: {}
EOF
  2. Edit the object and increase the number of apiserver replicas from one to three:
$ kubectl get controlplane example -o yaml
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kit.k8s.sh/v1alpha1","kind":"ControlPlane","metadata":{"annotations":{},"name":"example","namespace":"default"},"spec":{}}
  creationTimestamp: "2022-01-28T21:38:40Z"
  finalizers:
  - kit.k8s.sh/control-plane
  generation: 2
  name: example
  namespace: default
  resourceVersion: "925126"
  uid: 98474866-0d03-443d-a501-bff16a7f8a9b
spec:
  etcd:
    replicas: 3
  kubernetesVersion: "1.19"
  master:
    apiServer:
      replicas: 3
status:
...
  3. Check the placement of the apiserver pods:
$ kubectl describe node ip-192-168-191-222.us-west-2.compute.internal
...
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  default                     example-apiserver-7f588874d9-6qjtq    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-apiserver-7f588874d9-7ldgm    1 (1%)        0 (0%)      0 (0%)           0 (0%)         2m3s
  default                     example-apiserver-7f588874d9-j7l7d    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-authenticator-wsvxp           0 (0%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-controller-manager-hqrck      1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-scheduler-5vf8g               1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  kube-system                 aws-node-7h2kr                        25m (0%)      0 (0%)      0 (0%)           0 (0%)         77s
  kube-system                 kube-proxy-qwwvb                      100m (0%)     0 (0%)      0 (0%)           0 (0%)         77s
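One possible mitigation, sketched under the assumption that the operator can add scheduling constraints to the apiserver pod template (the label used here is hypothetical): require pod anti-affinity so that no two apiserver pods share a node, which forces Karpenter to provision additional nodes.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname      # at most one apiserver pod per node
      labelSelector:
        matchLabels:
          app: example-apiserver               # hypothetical label; match whatever the operator puts on apiserver pods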

[operator] Pending tasks for EBS support in etcd

  • Update the README to include steps to install the EBS CSI driver
  • PVCs are not cleaned up after deleting the ControlPlane object
    We will need to clean them up in the operator until this feature is available: https://kubernetes.io/blog/2021/12/16/kubernetes-1-23-statefulset-pvc-auto-deletion/ (see the sketch after this list)
  • [Document] PVCs are not configurable for an existing cluster; the user will need to delete the cluster and recreate it with the desired PVCs.
    StatefulSets don't allow updating the PVC spec: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
  • Add an option in the control plane spec to let the user create etcd with a host volume mounted to the pods
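Once the StatefulSet PVC auto-deletion feature linked above is available (alpha in Kubernetes 1.23 behind the StatefulSetAutoDeletePVC feature gate), the etcd StatefulSet could opt in roughly like this:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove the PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs on scale-down so member data survives scale-down/up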

Expose API server duration metrics and enable log analytics

Our customer would like us to provide the following features:

  • Expose the APIserver request duration metrics through Prometheus.
  • Enable log data analytics

By default, the API server metrics endpoint exposes apiserver_request_duration_seconds_bucket; however, these metrics do not show up in Prometheus. This is likely because the Prometheus add-on Helm chart is configured to disable apiserver scraping. The tricky part will be enabling apiserver scraping for the guest cluster (one possible approach is sketched below).
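One possible approach, sketched with the Prometheus Operator's ServiceMonitor CRD. The Service label, port name, and namespace handling are assumptions, and authenticating against the guest cluster's apiserver would still need to be worked out, so treat this as a starting point only.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: guest-apiserver
spec:
  selector:
    matchLabels:
      kit.k8s.sh/app: example-apiserver     # hypothetical: label on the guest cluster's apiserver Service
  endpoints:
  - port: https                             # assumed port name on that Service
    scheme: https
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true              # the guest apiserver cert will not match the in-cluster Service name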

To support log analytics, the KIT substrate should also deploy an ELK stack to collect logs and provide Elasticsearch for data analysis.

[operator] aws-iam-authenticator DaemonSet names need to be cluster-specific

Because the aws-iam-authenticator DaemonSet is named aws-iam-authenticator for every KIT control plane, deploying more than one KIT ControlPlane CRD on a host cluster/testbed causes issues such as continual recycling of the authenticator pods (which in turn drives high STS and ec2:DescribeInstances call volume).
We need a unique name for each aws-iam-authenticator component to prevent such restarts (a sketch follows the output below).

dev-dsk-hakuna-2c-dfc2b72b % kubectl get ds 
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                     AGE
aws-iam-authenticator   3         3         3       3            3           kit.k8s.sh/app=s-e74df950-2d04-42c7-9614-af831b0a3ef3-apiserver   3d4h
dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods -o wide | grep aws-iam-authenticator 
aws-iam-authenticator-frhhh                                       1/1     Running     0          3m41s   10.22.237.119   ip-10-22-237-119.us-west-2.compute.internal   <none>           <none>
aws-iam-authenticator-h55sg                                       1/1     Running     0          3m41s   10.22.127.7     ip-10-22-127-7.us-west-2.compute.internal     <none>           <none>
aws-iam-authenticator-k5x55                                       1/1     Running     0          3m41s   10.22.201.10    ip-10-22-201-10.us-west-2.compute.internal    <none>           <none>
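A hedged sketch of the naming change (the exact prefix scheme is up to the operator; the kit.k8s.sh/app selector shown above already carries a per-control-plane identifier that could be reused):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: s-<controlplane-id>-aws-iam-authenticator   # unique per control plane, mirroring the s-<id>-apiserver naming above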

[cli] Create a substrate cluster

  • Provision AWS infrastructure required to create Kubernetes clusters
  • Create a substrate node with Kubernetes control plane
  • Deploy addons such as:
    • kube-proxy
    • DNS
    • CNI
  • Deploy required controllers such as Karpenter, KIT-operator

To run the tests successfully, we will also need the following tools deployed:

  • Prometheus and Grafana stack
  • Tekton and Flux to sync the tests
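A minimal sketch of what this looks like from the CLI. kitctl bootstrap is referenced in the README above; whether it installs every item on this list is an assumption worth verifying, and the grep is just a generic sanity check.

$ kitctl bootstrap
$ kubectl get pods -A | grep -Ei 'karpenter|kit|tekton|prometheus|grafana'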

[operator] Improve how args for a pod spec are merged

Currently we use strategicpatch.StrategicMergePatch, so if a user needs to update a single flag in etcd or any other component, they must provide all of the flags in the ControlPlane CR along with the updated value.

We should be able to merge the flags, using the user-provided values where given and the existing defaults otherwise.

[operator] KIT Controller restarting continuously, it couldn't come up healthy

I1123 15:10:54.158002       1 request.go:655] Throttling request took 1.0440526s, request: GET:https://172.20.0.1:443/apis/authorization.k8s.io/v1?timeout=32s
2021-11-23T15:10:54.261Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a mutating webhook, admission.Defaulter interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=ControlPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a validating webhook, admission.Validator interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=ControlPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a mutating webhook, admission.Defaulter interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=DataPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a validating webhook, admission.Validator interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=DataPlane"}
I1123 15:10:54.263057       1 leaderelection.go:243] attempting to acquire leader lease kit/kit-leader-election...
2021-11-23T15:10:54.263Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I1123 15:11:11.689185       1 leaderelection.go:253] successfully acquired lease kit/kit-leader-election
2021-11-23T15:11:11.689Z	INFO	controller-runtime.manager.controller.dataplane	Starting EventSource	{"reconciler group": "kit.k8s.sh", "reconciler kind": "DataPlane", "source": "kind source: /, Kind="}
2021-11-23T15:11:11.689Z	INFO	controller-runtime.manager.controller.control-plane	Starting EventSource	{"reconciler group": "kit.k8s.sh", "reconciler kind": "ControlPlane", "source": "kind source: /, Kind="}
I1123 15:11:12.738906       1 request.go:655] Throttling request took 1.047619315s, request: GET:https://172.20.0.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
2021-11-23T15:11:12.840Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "DataPlane.kit.k8s.sh", "error": "no matches for kind \"DataPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.489Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "ControlPlane.kit.k8s.sh", "error": "no matches for kind \"ControlPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.490Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "no matches for kind \"ControlPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.490Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "leader election lost"}
panic: Unable to start manager, no matches for kind "DataPlane" in version "kit.k8s.sh/v1alpha1"

goroutine 1 [running]:
main.main()
	github.com/awslabs/kit/operator/cmd/controller/main.go:64 +0x86a

[operator] etcd cluster should be resilient to node/pod failures

Creating this issue for tracking purposes.

There are two ways to configure the etcd pod:

  • When the etcd data directory /var/lib/etcd is mounted from the node into the etcd pod, the following issues are seen:

    • When an etcd cluster is created and quorum is established, and a node gets terminated, Karpenter creates a replacement node, but the new member comes up with a different peer/member ID and quorum is not re-established.
    • When an etcd cluster is created and quorum is established, and the statefulset is deleted and recreated on the same set of nodes, the new etcd pods come up with different IDs and quorum is never established.

  • When the etcd data directory /var/lib/etcd is not mounted into the etcd pod:

    • If a pod gets restarted, quorum is not re-established and we see the following error:
      member 86adba6a3f2b2193 has already been bootstrapped
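For the last failure mode, the relevant etcd knob is --initial-cluster-state: "new" is only valid when the member set is bootstrapped for the first time, while a member rejoining an already-running cluster should start with "existing" (and, if its data directory was lost, typically needs an etcdctl member remove / member add cycle as well). A hedged sketch of the flags involved:

etcd \
  --data-dir=/var/lib/etcd \
  --initial-cluster-state=existing
# remaining name/peer/client URL flags unchanged; this is a pointer to the relevant setting, not a complete recovery procedure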

[operator] service cluster IP range can conflict

Currently the service cluster IP range in a KIT cluster is 10.96.0.0/12, and the DNS cluster IP is 10.100.0.10.

A user's underlying VPC may use an overlapping CIDR, which can cause conflicts. At some point we should revisit this: detect the CIDR block of the VPC the cluster is running in and select a non-overlapping service cluster IP range such as 172.16.0.0/12, with a DNS cluster IP of 172.16.0.10 (see the sketch below).
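For reference, these are the settings that would have to change together (standard component flags; how KIT surfaces them is an assumption):

kube-apiserver:           --service-cluster-ip-range=172.16.0.0/12
kube-controller-manager:  --service-cluster-ip-range=172.16.0.0/12
kubelet:                  --cluster-dns=172.16.0.10   (must match the clusterIP of the CoreDNS/kube-dns Service)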

[operator] DataPlane spec mandates allocationStrategy

  • The DataPlane spec currently requires allocationStrategy to be set. If it is not provided, provisioning fails with the following error:
message: "creating auto scaling group for s-9a7d13d2-e635-4418-8852-7d528e66fa69,
     ValidationError: 1 validation error detected: Value '' at 'mixedInstancesPolicy.launchStrategy.onDemandAllocationStrategy'
     failed to satisfy constraint: Member must have length greater than or equal
     to 1\n\tstatus code: 400, request id: 270dd3de-8aa9-4ff2-baeb-81ac9cfd1fd0"
   status: "False"
   type: Active

We need to look into why defaulting doesn't work. If we plan to remove defaulting, we need to update the example spec with an allocation strategy (a hedged sketch follows).
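Until then, a hypothetical sketch of a DataPlane spec with the field set explicitly (the exact field path and accepted values should be checked against the CRD schema; "prioritized" is one of the EC2 Auto Scaling on-demand allocation strategy values):

apiVersion: kit.k8s.sh/v1alpha1
kind: DataPlane
metadata:
  name: example-nodes
spec:
  allocationStrategy: prioritized   # set explicitly until operator defaulting works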

KIT restricts us to one KIT cluster per host cluster

KIT is limited to one load balancer per testbed/host cluster, which indirectly restricts us to one KIT cluster per testbed/host cluster: creating multiple KIT clusters requires multiple KIT control plane load balancers, and that fails because of the AWS/EKS load balancer tagging issue (during subnet discovery, only one security group may carry the kubernetes.io/cluster/<cluster-name>: owned tag).

[monitoring] Grafana dashboard access

We can access the Grafana dashboard through the kube proxy, but in scenarios where the API server is under load, this becomes a challenge.

  • Configure a load balancer Service to expose the Grafana dashboard for the cluster (see the sketch below)
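A minimal sketch of such a Service, assuming the AWS load balancer controller mentioned in the README handles provisioning (the namespace and selector labels are assumptions and must match however Grafana is actually deployed):

apiVersion: v1
kind: Service
metadata:
  name: grafana-external
  namespace: monitoring                                       # assumed namespace of the monitoring stack
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb    # ask for an NLB instead of a classic ELB
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: grafana                           # assumed label on the Grafana pods
  ports:
  - port: 80
    targetPort: 3000                                          # Grafana's default listen port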
