kubernetes-iteration-toolkit's Introduction

What is the Kubernetes Iteration Toolkit (KIT)?

KIT is a set of decoupled tools designed to accelerate the development of Kubernetes through testing. It combines a variety of open source projects to define an opinionated way to rapidly configure and test Kubernetes components on AWS.

Why did we build KIT?

The EKS Scalability team is responsible for improving performance across the Kubernetes stack. We started our journey by manually running tests against modified dev clusters. This helped us to identify some bottlenecks, but results were difficult to demonstrate and reproduce. We wanted to increase the velocity of our discoveries, as well as our confidence in our results. We set out to build automation to help us configure cluster components, execute well known test workloads, and analyze the results. This evolved into KIT, and we’re ready to share it to help accelerate testing in other teams.

What can I do with KIT?

KIT can help you run scale tests against a KIT cluster or an EKS cluster, and collect logs and metrics from the cluster control plane and nodes to help you analyze the performance of a Kubernetes cluster. KIT ships with a set of tools, such as Karpenter, the AWS load balancer (ELB) controller, Prometheus, Grafana, and Tekton, installed and configured to manage cluster lifecycle, run tests, and collect results.

What are KIT Environments?

KIT Environments provide an opinionated testing environment with support for test workflow execution, analysis, and observability. Developers can use the kitctl CLI to create a personal or shared testing environment for one-shot or periodic tests. A KIT Environment consists of a management Kubernetes cluster that comes preinstalled with a suite of Kubernetes operators that execute tests, make test results easy to analyze, and persist logs and control plane metrics for each test run.

Additionally, KIT Environments provide a library of predefined Tasks to configure clusters, generate load, and analyze results. For example, you can combine the “MegaXL KIT cluster” task and “upstream pod density load generator” task to reproduce the scalability team’s MegaXL test results. You can then swap in the “EKS Cluster” task and verify the results as improvements are merged into EKS. You can also parameterize existing tasks or define your own to meet your use cases.

What are KIT clusters?

KIT clusters enable developers to declaratively configure EKS-like clusters with arbitrary modifications. Using a Kubernetes CRD, you can modify the EC2 instance types, container image, environment variables, or command line arguments of any cluster component. These configurations can be checked into git and reproduced for periodic regression testing or against new test scenarios.

KIT clusters are implemented using Kubernetes primitives like deployments, statefulsets, and services. More advanced use cases can be achieved by implementing a new feature in the KIT cluster Operator and exposing it as a new parameter in the CRD. You can install the KIT cluster Operator on any Kubernetes cluster or with kitctl bootstrap.
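For example, a minimal ControlPlane manifest looks roughly like the following. This sketch only uses the kit.k8s.sh/v1alpha1 fields that appear in the issues further down this page; other fields and defaults may differ, so treat it as illustrative rather than authoritative.

$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: example                # desired guest cluster name
spec:
  kubernetesVersion: "1.21"    # control plane version to test
  etcd:
    replicas: 3                # size of the etcd cluster
  master:
    apiServer:
      replicas: 3              # number of kube-apiserver replicas
EOF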

How do I get started with KIT?

KIT v0.1 (alpha) is available now. You can get started with kitctl by following the instructions in "How to get started with KIT".

Note: KIT is an alpha project; things are changing and evolving rapidly. If you run into any issues, feel free to open an issue.

kubernetes-iteration-toolkit's People

Contributors

adityavenneti, ashishranjan738, bwagner5, dependabot[bot], ellistarn, engedaam, felix-zhe-huang, ganiredi, hakuna-matatah, haugenj, ivelichkovich, jonathan-innis, melnikalex, mengqiy, natherz97, oliviassss, orsenthil, prateekgogia, premdass, riskyadventure, rschalo, smahendarkar, spring1843, xdu31

kubernetes-iteration-toolkit's Issues

[operator] Configure cloud provider in KCM

KCM (kube-controller-manager) currently has no information about the cloud provider; it should be configured so that it can talk to cloud provider APIs. As a result, once nodes are terminated in EC2, the corresponding Node objects in the KIT cluster are not deleted.
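A hedged sketch of the kind of configuration this implies (these are standard kube-controller-manager flags; how the KIT operator would wire them into the KCM pod spec, or whether it would instead run an external cloud-controller-manager, is an open design question):

kube-controller-manager \
  --cloud-provider=aws \
  --cloud-config=/etc/kubernetes/cloud.conf
# --cloud-provider=aws enables the legacy in-tree AWS integration (an external cloud-controller-manager is the alternative)
# with a cloud provider configured, the node lifecycle controller can delete Node objects whose EC2 instances are gone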

[operator] control plane service name too long

The control plane service is named <cluster-name>-controlplane-endpoint and the referenced port is <cluster-name>-controlplane-endpoint-port. In some cases the cluster name is long enough that the resulting DNS name exceeds 63 characters, which breaks DNS name creation.
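A quick illustration of the problem (63 characters is the DNS label length limit):

$ CLUSTER_NAME=my-really-long-descriptive-scalability-test-cluster-name
$ echo -n "${CLUSTER_NAME}-controlplane-endpoint" | wc -c
78
# anything over 63 characters cannot be used as a single DNS label, so creation fails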

[operator] Missing documentation for KIT operator users

Additional requirements that we uncovered for the KIT operator need to be documented as part of its installation process, so that users of the KIT operator do not run into the issues below:

  • NAT subnet / IGW routing requirement
  • Karpenter dependency for KIT, including the minimum version (e.g. 0.4.1)
  • Custom subnet tags required for KIT/Karpenter to launch nodes that can schedule KIT control plane and data plane pods (an example tagging command is sketched below)
  • DNS issue for dataplane nodes created by KIT, where pods end up with the wrong DNS IP in resolv.conf

All of these need to be documented in the README, since users need to know them when installing the KIT operator, particularly users who want to run KIT on kOps, a testbed, etc.
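As an example of the subnet-tagging requirement, discovery-by-tag typically looks like the command below. The tag key/value here is hypothetical; the exact tags that KIT/Karpenter expect need to be confirmed and documented.

$ aws ec2 create-tags \
    --resources subnet-0abc1234example subnet-0def5678example \
    --tags Key=kit.aws/environment,Value=my-test-environment
# hypothetical tag shown above; replace with whatever tag KIT/Karpenter subnet discovery actually requires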

Improvements to KIT

  • IPv6 mode for running guest clusters (more KIT clusters per environment)
  • Use ClusterIP from Tekton tasks (avoid the slow NLB)
  • Don't wait in helm (avoid hanging on cluster creation)
  • Rename substrate -> environment
  • Simplify overly parameterized pipelines

[operator] API server crashloops for version 1.21

After adding the --insecure-port flag for monitoring on version 1.21, the API server crashes with this error:

Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
Flag --insecure-bind-address has been deprecated, This flag has no effect now and will be removed in v1.24.
Error: invalid port value 8080: only zero is allowed

[operator] Updating KIT Controlplane spec brings up all pods at once

When the ControlPlane spec is updated with new args, the KIT operator recycles all of the system components (apiserver, etc.) at once, which leads to an outage during the update.

As shown below, all of the apiserver pods for this control plane are unavailable during the update:

dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods | grep c0955120-57bc-420f-bec0-f79d28d86fbe
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b57t5d2   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b58v2wt   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b5zqqzg   0/1     ContainerCreating   0          54s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-bfnrm   1/1     Running             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-gf4lp   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-ms4vq   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-0                     1/1     Running             0          100m
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-1                     0/1     Pending             0          5s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-2                     0/1     Error               1          62s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787gwpl4   0/1     Pending             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787m57zq   1/1     Running             0          64s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787p82fz   0/1     Pending             0          64s


dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods | grep c0955120-57bc-420f-bec0-f79d28d86fbe
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b57t5d2   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b58v2wt   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-apiserver-d6cb488b5zqqzg   1/1     Running     3          6m4s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-bfnrm   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-gf4lp   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-controller-manager-ms4vq   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-0                     1/1     Running     0          3m25s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-1                     1/1     Running     2          5m15s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-etcd-2                     1/1     Running     4          6m12s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787gwpl4   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787m57zq   1/1     Running     0          6m14s
s-c0955120-57bc-420f-bec0-f79d28d86fbe-scheduler-7557cb787p82fz   1/1     Running     0          6m14s

Expected: pods should be updated according to a rolling update strategy and should not disrupt the current state of the cluster (a sketch of the relevant settings follows).
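A minimal sketch of the relevant Deployment settings (standard Kubernetes rolling-update fields; whether and how the KIT operator exposes them is an assumption):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take an existing apiserver pod down before its replacement is Ready
      maxSurge: 1         # roll replacements in one at a time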

[operator] Scaling up number of apiserver pods will result in pods being assigned to the same node

If we apply the default ControlPlane object, we launch one apiserver pod and Karpenter launches one control plane node for this pod. If we then scale the number of replicas in the apiserver deployment from one to three, the two new apiserver pods are scheduled on the existing node rather than being assigned their own nodes. Note that creating the ControlPlane object initially with three replicas would result in three nodes being launched. A possible mitigation using pod anti-affinity is sketched after the reproduction steps below.

Reproduction:

  1. Create a default ControlPlane object which defaults to one apiserver replica:
$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: ${GUEST_CLUSTER_NAME} # Desired Cluster name
spec: {}
EOF
  2. Edit the object and increase the number of apiserver replicas from one to three:
$ kubectl get controlplane example -o yaml
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kit.k8s.sh/v1alpha1","kind":"ControlPlane","metadata":{"annotations":{},"name":"example","namespace":"default"},"spec":{}}
  creationTimestamp: "2022-01-28T21:38:40Z"
  finalizers:
  - kit.k8s.sh/control-plane
  generation: 2
  name: example
  namespace: default
  resourceVersion: "925126"
  uid: 98474866-0d03-443d-a501-bff16a7f8a9b
spec:
  etcd:
    replicas: 3
  kubernetesVersion: "1.19"
  master:
    apiServer:
      replicas: 3
status:
...
  3. Check the placement of the apiserver pods:
$ kubectl describe node ip-192-168-191-222.us-west-2.compute.internal
...
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  default                     example-apiserver-7f588874d9-6qjtq    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-apiserver-7f588874d9-7ldgm    1 (1%)        0 (0%)      0 (0%)           0 (0%)         2m3s
  default                     example-apiserver-7f588874d9-j7l7d    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-authenticator-wsvxp           0 (0%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-controller-manager-hqrck      1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-scheduler-5vf8g               1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  kube-system                 aws-node-7h2kr                        25m (0%)      0 (0%)      0 (0%)           0 (0%)         77s
  kube-system                 kube-proxy-qwwvb                      100m (0%)     0 (0%)      0 (0%)           0 (0%)         77s
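One possible mitigation, sketched under the assumption that the operator can add scheduling constraints to the apiserver pod template (the label used here is hypothetical): require pod anti-affinity so that no two apiserver pods share a node, which forces Karpenter to provision additional nodes.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname      # at most one apiserver pod per node
      labelSelector:
        matchLabels:
          app: example-apiserver               # hypothetical label; match whatever the operator puts on apiserver pods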

[operator] Pending tasks for EBS support in etcd

  • Update the README to include steps to install the EBS CSI driver
  • PVCs are not cleaned up after deleting the ControlPlane object
    We will need to clean them up in the operator until this feature is available: https://kubernetes.io/blog/2021/12/16/kubernetes-1-23-statefulset-pvc-auto-deletion/ (see the sketch after this list)
  • [Document] PVCs are not configurable for an existing cluster; the user will need to delete the cluster and recreate it with the desired PVCs.
    StatefulSets don't allow updating the PVC spec: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
  • Add an option in the control plane spec to let the user create etcd with a host volume mounted to the pods
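Once the StatefulSet PVC auto-deletion feature linked above is available (alpha in Kubernetes 1.23 behind the StatefulSetAutoDeletePVC feature gate), the etcd StatefulSet could opt in roughly like this:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove the PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs on scale-down so member data survives scale-down/up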

Expose API server duration metrics and enable log analytics

Our customer would like us to provide the following features:

  • Expose the APIserver request duration metrics through Prometheus.
  • Enable log data analytics

By default, the API server metrics endpoint exposes apiserver_request_duration_seconds_bucket; however, these metrics do not show up in Prometheus. This is likely because the Prometheus add-on Helm chart is configured to disable apiserver scraping. The tricky part will be enabling apiserver scraping for the guest cluster (one possible approach is sketched below).
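One possible approach, sketched with the Prometheus Operator's ServiceMonitor CRD. The Service label, port name, and namespace handling are assumptions, and authenticating against the guest cluster's apiserver would still need to be worked out, so treat this as a starting point only.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: guest-apiserver
spec:
  selector:
    matchLabels:
      kit.k8s.sh/app: example-apiserver     # hypothetical: label on the guest cluster's apiserver Service
  endpoints:
  - port: https                             # assumed port name on that Service
    scheme: https
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true              # the guest apiserver cert will not match the in-cluster Service name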

To support log analytics, the KIT substrate should also deploy an ELK stack to collect logs and provide Elasticsearch for data analysis.

[operator] aws-iam-authenticator DaemonSet names need to be cluster-specific

Because the aws-iam-authenticator DaemonSet is named aws-iam-authenticator for every KIT control plane, deploying more than one KIT ControlPlane CRD on a host cluster/testbed causes issues such as continual recycling of the authenticator pods (which in turn drives high STS and ec2:DescribeInstances call volume).
We need a unique name for each aws-iam-authenticator component to prevent such restarts (a sketch follows the output below).

dev-dsk-hakuna-2c-dfc2b72b % kubectl get ds 
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                     AGE
aws-iam-authenticator   3         3         3       3            3           kit.k8s.sh/app=s-e74df950-2d04-42c7-9614-af831b0a3ef3-apiserver   3d4h
dev-dsk-hakuna-2c-dfc2b72b % kubectl get pods -o wide | grep aws-iam-authenticator 
aws-iam-authenticator-frhhh                                       1/1     Running     0          3m41s   10.22.237.119   ip-10-22-237-119.us-west-2.compute.internal   <none>           <none>
aws-iam-authenticator-h55sg                                       1/1     Running     0          3m41s   10.22.127.7     ip-10-22-127-7.us-west-2.compute.internal     <none>           <none>
aws-iam-authenticator-k5x55                                       1/1     Running     0          3m41s   10.22.201.10    ip-10-22-201-10.us-west-2.compute.internal    <none>           <none>
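A hedged sketch of the naming change (the exact prefix scheme is up to the operator; the kit.k8s.sh/app selector shown above already carries a per-control-plane identifier that could be reused):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: s-<controlplane-id>-aws-iam-authenticator   # unique per control plane, mirroring the s-<id>-apiserver naming above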

[cli] Create a substrate cluster

  • Provision AWS infrastructure required to create Kubernetes clusters
  • Create a substrate node with Kubernetes control plane
  • Deploy addons such as:
    • kube-proxy
    • DNS
    • CNI
  • Deploy required controllers such as Karpenter, KIT-operator

To run the tests successfully, we will also need the following tools deployed:

  • Prometheus and Grafana stack
  • Tekton and Flux to sync the tests
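A minimal sketch of what this looks like from the CLI. kitctl bootstrap is referenced in the README above; whether it installs every item on this list is an assumption worth verifying, and the grep is just a generic sanity check.

$ kitctl bootstrap
$ kubectl get pods -A | grep -Ei 'karpenter|kit|tekton|prometheus|grafana'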

[operator] Improve how args for a pod spec are merged

Currently we use strategicpatch.StrategicMergePatch, so if a user needs to update a single flag in etcd or any other component, they must provide all of the flags in the ControlPlane CR along with the updated value.

We should be able to merge the flags, using the user-provided values where given and the existing defaults otherwise.

[operator] KIT Controller restarting continuously, it couldn't come up healthy

I1123 15:10:54.158002       1 request.go:655] Throttling request took 1.0440526s, request: GET:https://172.20.0.1:443/apis/authorization.k8s.io/v1?timeout=32s
2021-11-23T15:10:54.261Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a mutating webhook, admission.Defaulter interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=ControlPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a validating webhook, admission.Validator interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=ControlPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a mutating webhook, admission.Defaulter interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=DataPlane"}
2021-11-23T15:10:54.262Z	INFO	controller-runtime.builder	skip registering a validating webhook, admission.Validator interface is not implemented	{"GVK": "kit.k8s.sh/v1alpha1, Kind=DataPlane"}
I1123 15:10:54.263057       1 leaderelection.go:243] attempting to acquire leader lease kit/kit-leader-election...
2021-11-23T15:10:54.263Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I1123 15:11:11.689185       1 leaderelection.go:253] successfully acquired lease kit/kit-leader-election
2021-11-23T15:11:11.689Z	INFO	controller-runtime.manager.controller.dataplane	Starting EventSource	{"reconciler group": "kit.k8s.sh", "reconciler kind": "DataPlane", "source": "kind source: /, Kind="}
2021-11-23T15:11:11.689Z	INFO	controller-runtime.manager.controller.control-plane	Starting EventSource	{"reconciler group": "kit.k8s.sh", "reconciler kind": "ControlPlane", "source": "kind source: /, Kind="}
I1123 15:11:12.738906       1 request.go:655] Throttling request took 1.047619315s, request: GET:https://172.20.0.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
2021-11-23T15:11:12.840Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "DataPlane.kit.k8s.sh", "error": "no matches for kind \"DataPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.489Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "ControlPlane.kit.k8s.sh", "error": "no matches for kind \"ControlPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.490Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "no matches for kind \"ControlPlane\" in version \"kit.k8s.sh/v1alpha1\""}
2021-11-23T15:11:15.490Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "leader election lost"}
panic: Unable to start manager, no matches for kind "DataPlane" in version "kit.k8s.sh/v1alpha1"

goroutine 1 [running]:
main.main()
	github.com/awslabs/kit/operator/cmd/controller/main.go:64 +0x86a

[operator] etcd cluster should be resilient to node/pod failures

Creating this issue for tracking purposes.

There are two ways to configure the etcd pod:

  • When the etcd data directory /var/lib/etcd is mounted from the node into the etcd pod, the following issues are seen:

    • When an etcd cluster is created and quorum is established, and a node gets terminated, Karpenter creates a replacement node, but the new member comes up with a different peer/member ID and quorum is not re-established.
    • When an etcd cluster is created and quorum is established, and the statefulset is deleted and recreated on the same set of nodes, the new etcd pods come up with different IDs and quorum is never established.

  • When the etcd data directory /var/lib/etcd is not mounted into the etcd pod:

    • If a pod gets restarted, quorum is not re-established and we see the following error:
      member 86adba6a3f2b2193 has already been bootstrapped
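For the last failure mode, the relevant etcd knob is --initial-cluster-state: "new" is only valid when the member set is bootstrapped for the first time, while a member rejoining an already-running cluster should start with "existing" (and, if its data directory was lost, typically needs an etcdctl member remove / member add cycle as well). A hedged sketch of the flags involved:

etcd \
  --data-dir=/var/lib/etcd \
  --initial-cluster-state=existing
# remaining name/peer/client URL flags unchanged; this is a pointer to the relevant setting, not a complete recovery procedure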

[operator] service cluster IP range can conflict

Currently the service cluster IP range in a KIT cluster is 10.96.0.0/12, and the DNS cluster IP is 10.100.0.10.

A user's underlying VPC may use an overlapping CIDR, which can cause conflicts. At some point we should revisit this: detect the CIDR block of the VPC the cluster is running in and select a non-overlapping service cluster IP range such as 172.16.0.0/12, with a DNS cluster IP of 172.16.0.10 (see the sketch below).
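For reference, these are the settings that would have to change together (standard component flags; how KIT surfaces them is an assumption):

kube-apiserver:           --service-cluster-ip-range=172.16.0.0/12
kube-controller-manager:  --service-cluster-ip-range=172.16.0.0/12
kubelet:                  --cluster-dns=172.16.0.10   (must match the clusterIP of the CoreDNS/kube-dns Service)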

[operator] DataPlane spec mandates allocationStrategy

  • The DataPlane spec currently requires allocationStrategy to be set. If it is not provided, provisioning fails with the following error:
message: "creating auto scaling group for s-9a7d13d2-e635-4418-8852-7d528e66fa69,
     ValidationError: 1 validation error detected: Value '' at 'mixedInstancesPolicy.launchStrategy.onDemandAllocationStrategy'
     failed to satisfy constraint: Member must have length greater than or equal
     to 1\n\tstatus code: 400, request id: 270dd3de-8aa9-4ff2-baeb-81ac9cfd1fd0"
   status: "False"
   type: Active

We need to look into why defaulting doesn't work. If we plan to remove defaulting, we need to update the example spec with an allocation strategy (a hedged sketch follows).
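Until then, a hypothetical sketch of a DataPlane spec with the field set explicitly (the exact field path and accepted values should be checked against the CRD schema; "prioritized" is one of the EC2 Auto Scaling on-demand allocation strategy values):

apiVersion: kit.k8s.sh/v1alpha1
kind: DataPlane
metadata:
  name: example-nodes
spec:
  allocationStrategy: prioritized   # set explicitly until operator defaulting works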

KIT restricts us to one KIT cluster per host cluster

KIT is limited to one load balancer per testbed/host cluster, which indirectly restricts us to one KIT cluster per testbed/host cluster: creating multiple KIT clusters requires multiple KIT control plane load balancers, and that fails because of the AWS/EKS load balancer tagging issue (during subnet discovery, only one security group may carry the kubernetes.io/cluster/<cluster-name>: owned tag).

[monitoring] Grafana dashboard access

We can access the Grafana dashboard through the kube proxy, but in scenarios where the API server is under load, this becomes a challenge.

  • Configure a load balancer Service to expose the Grafana dashboard for the cluster (see the sketch below)
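A minimal sketch of such a Service, assuming the AWS load balancer controller mentioned in the README handles provisioning (the namespace and selector labels are assumptions and must match however Grafana is actually deployed):

apiVersion: v1
kind: Service
metadata:
  name: grafana-external
  namespace: monitoring                                       # assumed namespace of the monitoring stack
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb    # ask for an NLB instead of a classic ELB
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: grafana                           # assumed label on the Grafana pods
  ports:
  - port: 80
    targetPort: 3000                                          # Grafana's default listen port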
