gardener / gardener

Kubernetes-native system managing the full lifecycle of conformant Kubernetes clusters as a service on Alicloud, AWS, Azure, GCP, OpenStack, vSphere, KubeVirt, Hetzner, EquinixMetal, MetalStack, and OnMetal with minimal TCO.

Home Page: https://gardener.cloud

License: Apache License 2.0

Languages: Go 98.22%, Shell 1.15%, Smarty 0.27%, Makefile 0.23%, Mustache 0.07%, Python 0.04%, Dockerfile 0.01%
Topics: kubernetes, gardener, golang, aws, azure, gcp, openstack, cluster, alicloud, metalstack

gardener's Introduction

Gardener Logo

Badges: REUSE status, CI Build status, Slack channel #gardener, Go Report Card, GoDoc, CII Best Practices

Gardener implements the automated management and operation of Kubernetes clusters as a service and provides a fully validated extensibility framework that can be adjusted to any programmatic cloud or infrastructure provider.

Gardener is 100% Kubernetes-native and exposes its own Cluster API to create homogeneous clusters on all supported infrastructures. This API differs from SIG Cluster Lifecycle's Cluster API, which only harmonizes how to get to clusters, while Gardener's Cluster API goes one step further and also harmonizes the make-up of the clusters themselves. That means Gardener gives you homogeneous clusters with exactly the same bill of materials, configuration, and behavior on all supported infrastructures, which you can see further down below in the section on our K8s Conformance Test Coverage.

In 2020, SIG Cluster Lifecycle's Cluster API made a huge step forward with v1alpha3 and the newly added support for declarative control plane management. This made it possible to integrate managed services like GKE or Gardener. If the community is interested, we would be more than happy to contribute a Gardener control plane provider. For more information on the relation between the Gardener API and SIG Cluster Lifecycle's Cluster API, please see here.

Gardener's main principle is to leverage Kubernetes concepts for all of its tasks.

In essence, Gardener is an extension API server that comes along with a bundle of custom controllers. It introduces new API objects in an existing Kubernetes cluster (which is called garden cluster) in order to use them for the management of end-user Kubernetes clusters (which are called shoot clusters). These shoot clusters are described via declarative cluster specifications which are observed by the controllers. They will bring up the clusters, reconcile their state, perform automated updates and make sure they are always up and running.
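
As a small illustration of this model, the Go sketch below lists the Shoot objects registered in a garden cluster using client-go's dynamic client; the kubeconfig path, the project namespace, and the exact group/version/resource of the Shoot API are assumptions, not taken from this README:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config for the garden cluster (kubeconfig path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/garden-kubeconfig")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Shoots are ordinary API objects in the garden cluster; this GVR is assumed.
	shootGVR := schema.GroupVersionResource{Group: "core.gardener.cloud", Version: "v1beta1", Resource: "shoots"}

	// Project namespace name is a placeholder.
	list, err := dyn.Resource(shootGVR).Namespace("garden-my-project").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, shoot := range list.Items {
		fmt.Printf("shoot %s/%s\n", shoot.GetNamespace(), shoot.GetName())
	}
}
```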

To accomplish these tasks reliably and to offer a high quality of service, Gardener controls the main components of a Kubernetes cluster (etcd, API server, controller manager, scheduler). These so-called control plane components are hosted in Kubernetes clusters themselves (which are called seed clusters). This is the main difference compared to many other OSS cluster provisioning tools: the shoot clusters do not have dedicated master VMs. Instead, the control plane is deployed as a native Kubernetes workload into the seeds (an architecture commonly referred to as kubeception or inception design). This not only effectively reduces the total cost of ownership but also allows easier implementations of "day-2 operations" (like cluster updates or robustness) by relying on all the mature Kubernetes features and capabilities.

Gardener reuses the identical Kubernetes design to span a scalable multi-cloud and multi-cluster landscape. Such familiarity with known concepts has proven to quickly ease the initial learning curve and accelerate developer productivity:

  • Kubernetes API Server = Gardener API Server
  • Kubernetes Controller Manager = Gardener Controller Manager
  • Kubernetes Scheduler = Gardener Scheduler
  • Kubelet = Gardenlet
  • Node = Seed cluster
  • Pod = Shoot cluster

Please find more information regarding the concepts and a detailed description of the architecture in our Gardener Wiki and our blog posts on kubernetes.io: Gardener - the Kubernetes Botanist (17.5.2018) and Gardener Project Update (2.12.2019).


K8s Conformance Test Coverage

Gardener takes part in the Certified Kubernetes Conformance Program to attest its compatibility with the K8s conformance test suite. Currently, Gardener is certified for K8s versions up to v1.29; see the conformance spreadsheet.

Continuous conformance test results of the latest stable Gardener release are uploaded regularly to the CNCF test grid:

Provider/K8s    v1.29   v1.28   v1.27   v1.26   v1.25
AWS             ✓       ✓       ✓       ✓       ✓
Azure           ✓       ✓       ✓       ✓       ✓
GCP             ✓       ✓       ✓       ✓       ✓
OpenStack       ✓       ✓       ✓       ✓       ✓
Alicloud        ✓       ✓       ✓       ✓       ✓
Equinix Metal   N/A     N/A     N/A     N/A     N/A
vSphere         N/A     N/A     N/A     N/A     N/A

(✓ = conformance test results for that provider/version combination are reported to TestGrid)

Get an overview of the test results at testgrid.

Start using or developing the Gardener locally

See our documentation in the /docs repository; please find the index here.

Setting up your own Gardener landscape in the Cloud

The quickest way to test drive Gardener is to install it virtually onto an existing Kubernetes cluster, just like you would install any other Kubernetes-ready application. You can do this with our Gardener Helm Chart.

Alternatively you can use our garden setup project to create a fully configured Gardener landscape which also includes our Gardener Dashboard.

Feedback and Support

Feedback and contributions are always welcome!

All channels for getting in touch or learning about our project are listed under the community section. We are cordially inviting interested parties to join our bi-weekly meetings.

Please report bugs or suggestions about our Kubernetes clusters as such or the Gardener itself as GitHub issues or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn More!

Please find further resources about our project here:


gardener's Issues

Make Backup Status Available

Issue by vlerenc
Tuesday Oct 24, 2017 at 00:35 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/161


Right now, neither the user nor the operator has access to the status (success or failure) of the backups, their full list with date/time (when taken, how long it took), or their size. As a first step towards on-demand restore, this would be helpful.

It is not clear whether this should be handled within the operator (as additional status information) or put into additional CRD "backup" resources that are then read from the seed cluster/shoot namespace. Other implementation proposals are welcome.

Project Member RoleBindings in Seed Clusters

Story

  • As operator I want access to the shoot cluster namespace within the seed cluster, so that I can do my ops work.

Motivation

In order to run operations for all clusters within a project, the future operators (we for now) need access to the shoot cluster namespace within the seed cluster for everything in it. At present, access is only possible via the admin credentials.

Acceptance Criteria

  • Project members have access to the shoot cluster namespace within the seed cluster

Implementation Proposal

Put the seed cluster under RBAC and grant all project members of a cluster appropriate role bindings in the corresponding seed cluster and the shoot cluster namespace therein.
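
A minimal sketch of such a binding with client-go, assuming a pre-existing (Cluster)Role and a known shoot namespace in the seed; all names here are placeholders, not Gardener's actual role names:

```go
package seedrbac

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// grantProjectMember binds a project member to an assumed, pre-existing ClusterRole
// inside the shoot's control plane namespace in the seed cluster.
func grantProjectMember(ctx context.Context, seedClient kubernetes.Interface, shootNamespace, user string) error {
	rb := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "project-member-ops", // placeholder name
			Namespace: shootNamespace,       // e.g. "shoot--my-project--my-shoot" (placeholder)
		},
		Subjects: []rbacv1.Subject{{
			Kind:     rbacv1.UserKind,
			APIGroup: rbacv1.GroupName,
			Name:     user,
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     "gardener-project-member", // assumed role carrying the required verbs
		},
	}
	_, err := seedClient.RbacV1().RoleBindings(shootNamespace).Create(ctx, rb, metav1.CreateOptions{})
	return err
}
```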

Open Questions

How do we handle changes to the project member list? Is the Gardener really supposed to watch those in the garden cluster and duplicate them across all seed clusters and shoot cluster namespaces?

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Enable Local Storage

Story

  • As user I want local storage, so that I get optimal performance for workload that depends on it/can cope with it.

Motivation

Network-attached storage only carries you so far; some applications require local storage (e.g. Hadoop). Such software usually copes well with/tolerates hardware failures, as it must expect them anyway.

Acceptance Criteria

  • Local storage (under quota)
  • When pods die, the scheduler tries to bring them up on the same node or deletes the data otherwise

Resources

See https://github.com/vishh/community/blob/ba62a3f6cb9a301e95c4b64b9052455bdac9a3fe/contributors/design-proposals/local-storage-overview.md and community progress tracker kubernetes/enhancements#245.
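
For reference, the upstream feature that eventually landed exposes local disks as PersistentVolumes pinned to one node; a hedged Go sketch (node name, path, capacity, and storage class are placeholders):

```go
package localstorage

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// localPV describes a PersistentVolume backed by a local disk on a specific node.
// Consuming pods can then only be placed onto that node.
func localPV(nodeName, path string) *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "local-pv-" + nodeName},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("100Gi"), // placeholder size
			},
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: "local-storage", // placeholder class
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				Local: &corev1.LocalVolumeSource{Path: path},
			},
			// Pin the volume to the node that owns the disk.
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{nodeName},
						}},
					}},
				},
			},
		},
	}
}
```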

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?

Multi-Tenancy

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/46


Stories

  • As user I don't want others to see or interact with (get logs, kill pod) containers deployed by me, so that nobody can spy on me.
  • As user I don't want to accidentally break containers deployed by others, so that I don't harm anybody.
  • As operator I want to utilise the hardware as well as possible, so that I can offer a competitive price.
  • As operator I want to manage as few clusters as possible, so that I can do a better job on those clusters.

Motivation

In order not to have to deploy too many individual clusters, we ought to be able to let multiple teams (later maybe even customers) work concurrently on one cluster.

Acceptance Criteria

  • No non-admin user can deploy privileged containers
  • Users from one namespace can't see or interact with (get logs, kill pod) containers deployed into other namespaces
  • Containers from one namespace can't access other containers in any direct way

Implementation Proposal

RBAC, security and network policies (defined per namespace) are the concepts that need to be explored at the time of this writing.
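
One of these building blocks, sketched in Go: a per-namespace NetworkPolicy that only admits ingress from pods of the same namespace (an assumption of how the isolation could be expressed, not an existing Gardener policy):

```go
package multitenancy

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// denyFromOtherNamespaces selects all pods in the given namespace and only allows
// ingress from pods of that same namespace (an empty pod selector in "from" with no
// namespace selector means "same namespace").
func denyFromOtherNamespaces(namespace string) *networkingv1.NetworkPolicy {
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "deny-from-other-namespaces",
			Namespace: namespace,
		},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // empty selector = all pods in this namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{}, // all pods, but only from the same namespace
				}},
			}},
		},
	}
}
```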

ℹ️ Migrated from Jira issue KUBE-23

Garden, Seed & Shoot Cluster Disaster Recovery

Story

  • As provider I need a plan for disaster recovery, so that I can resurrect lost infrastructure and restore my service.

Acceptance Criteria

Please find out what we need to do when we completely lose a shoot or even seed or soil cluster:

  • Total loss and recreation of only the shoot cluster worker environment: What do we need to do to reconstruct the worker environment if we lose it from the VPC and instance profiles down to the VMs (most likely a no-brainer, just rerun the provisioning)
  • Total loss and recreation of the shoot cluster control plane: What do we need to do to reconstruct the shoot cluster control plane (assuming we lost everything from the namespace down to the etcd persistence, but have a backup of the latter)
  • Total loss and move to another soil or seed cluster: What do we need to do to move to/reconstruct the soil or seed cluster (assuming we lost everything from the cluster itself down to the etcd persistence, but have a backup of the latter)
  • Total loss and move to another garden cluster: What do we need to do to move to/reconstruct the garden cluster (assuming we lost everything from the cluster itself down to the etcd persistence, but have a backup of the latter)

Please manually execute the reconstruction of the above scenarios, validate the reconstructed clusters behave properly, and document the procedure in this ticket.

Implementation Proposal

The idea is to learn how to move a control plane from one soil/seed cluster to another.

Resources

Heptio Ark (disaster recovery) may be interesting to check out.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?

Spot / Low-Priority / Preemptible VM Support

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/49


Story

  • As operator I want to use AWS Spot, Azure Low-Priority or GCP Preemptible VM instances, so that my landscape runs at lower costs.

Motivation

Money, sure, but also some form of chaos monkey that should help train the application developers that all resources will eventually fail.

Acceptance Criteria

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-95

Resilience & Robustness Hardening Concept

Check out chaos tools like (ask Juergen Schneider for input):

If they are suitable, how can we use #31 and one or more of the tools above for resilience & robustness hardening, and how can we integrate this into integration testing (for example, enabled on staging to introduce random failures)? Please provide a concept paper for resilience & robustness hardening.

Investigate Multi-Tenancy

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/47


See KUBE-23 and find out what we truly need in order to implement that epic. Key issues are security (RBAC, security policies, exploits) and isolation (disk quota, network policies, fair share of resources), but also other things that become necessary in a shared environment (we know of metering, but there are certainly more topics that will require work). Please report results here.

Hints:

ℹ️ Migrated from Jira issue KUBE-41

Backup & Restore

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:43 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/7


Stories

  • As operator I want to restore a broken cluster, so that I can ensure business continuity.

Motivation

We must ensure that if disaster strikes (human error or hardware failure alike), we can recreate the entire cluster.

Acceptance Criteria

  • Delete the etcd volumes, then recreate them from a backup
  • Delete the entire cluster, then recreate it from a backup

ℹ️ Migrated from Jira issue KUBE-17

Add Prometheus Monitoring

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:43 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/8


Stories

  • As operator I want internal monitoring, so that I can check on my cluster(s).
  • As operator I want alerts on issues available in internal monitoring, so that I don't have to check on my cluster(s) manually all the time.

Motivation

We need means to monitor our own cluster(s).

Acceptance Criteria

  • All key metrics are exposed to admins in a remotely accessible UI
    • Number of nodes, pods, containers, services (running, failing)
    • Corresponding infrastructure components
    • Cluster load, free memory, free disk space
    • Key data is put onto a dashboard
  • Historic data is kept for the past two weeks
  • When we reach certain thresholds, alerts are sent to pre-configured e-mail address(es)
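
A minimal Go sketch of how such key metrics could be exposed for Prometheus to scrape, using prometheus/client_golang; the metric names and values are made up for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauges for some of the key metrics listed above.
var (
	nodesReady = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "garden_cluster_nodes_ready",
		Help: "Number of nodes in Ready condition.",
	})
	podsFailing = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "garden_cluster_pods_failing",
		Help: "Number of pods in a failed state.",
	})
)

func main() {
	prometheus.MustRegister(nodesReady, podsFailing)

	// In a real exporter these values would be derived from the Kubernetes API.
	nodesReady.Set(3)
	podsFailing.Set(0)

	// Prometheus scrapes this endpoint; dashboards and alerts build on the stored series.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```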

ℹ️ Migrated from Jira issue KUBE-21

Workflow Jobs Testing Capacity Shortage on Seed Clusters

We had to terminate the cluster autoscaler on the seeds because:

  • An older Calico version is deployed there, and we know that it results in issues after lots of scale-up/down due to bugs that fail to clean up IP ranges
  • It still lacks the feature to reserve excess capacity (it was said to be implemented for v1.8, but wasn't), and our seed clusters started breathing with the integration tests, which resulted in many issues as pods took minutes to be scheduled

However, now we have a fixed number of workers in the seeds. This means we have to scale manually, but lack any indication of when to do so.

Therefore we need workflow jobs per seed that compare the used vs. the available capacity and warn the operator of an upcoming shortage.

Reserve Excess Cluster Capacity

Issue by vlerenc
Tuesday Aug 08, 2017 at 05:59 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/71


Story

  • As user I want to be able to schedule my workload (if of reasonable size) in my shoot cluster in seconds, even if all nodes of the shoot cluster are busy, so that I am more productive.
  • As operator I want to be able to create new shoot cluster control planes in the seed clusters in seconds, even if all nodes of the seed cluster are busy, so that I am more productive.

Motivation

The cluster autoscaler seems to react only when pods can't be scheduled on any of the current nodes due to insufficient resources. That means it takes minutes after such an event until the pod can be scheduled.

Acceptance Criteria

  • We run the cluster with excess capacity, a configurable combination of an absolute excess-capacity minimum and a percentage, e.g. max(1, 5% of the current number of nodes); the calculation here is in "nodes", but may use CPU/memory, too, if that's easier/better (see the sketch below)
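
A tiny Go sketch of that sizing rule; the minimum and percentage are the configurable values mentioned above:

```go
package capacity

import "math"

// desiredExcessNodes implements max(minimum, percentage of current nodes),
// e.g. max(1, 5% of the current number of nodes), rounded up to whole nodes.
func desiredExcessNodes(currentNodes, minimum int, percentage float64) int {
	byPercentage := int(math.Ceil(float64(currentNodes) * percentage / 100.0))
	if byPercentage < minimum {
		return minimum
	}
	return byPercentage
}

// Example: desiredExcessNodes(40, 1, 5) == 2, desiredExcessNodes(10, 1, 5) == 1.
```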

Implementation Proposal

None, but if this can't be done with the current cluster autoscaler and nobody works on it, consider whether a contribution would be possible.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?

Improve Logging / GroupBy Cluster and Operation + Show As Task List

Story

  • As operator (of the week) I want to access cluster operations logs (create, update/reconcile, delete) in a convenient way, so that I can easily see what the outcome of each operation was without trying to grep the information out of the overall/global Gardener logs.

Motivation

We should improve the way we interact with/access the Gardener/cluster logs. As an operator (of the week) I frequently have to check what the Gardener says/logs about a certain cluster and operation. Now, with the reconciler, the logs have grown, and with all the other planned features, they will grow even more. Of course, we will have a logging stack eventually, but that's not what this story is about. Usually I need to know something specific about a particular cluster, for a particular operation, at a particular date/time. A logging stack and UI will allow me to set filters, but it won't be semantic in this sense. Also, we can't give away the entire operator logs via a logging stack to the entirety of the people that will later use the Gardener dashboard as operators (there must be strong isolation, so that we can have only one Gardener instance running).

Acceptance Criteria

  • Keep track of logs per operation

Implementation Proposal

What springs to mind is something similar to Bosh tasks and logs. There, the user nicely sees what the Bosh director does and did: the creation of a certain deployment, the deletion of one, any scan & fix tasks that happened on it, etc. Information (a collection of logs) at that granularity (cluster + operation) would be a great improvement over a large shared operator log, where I have the access rights issue and can only filter by properties instead of seeing (in a list) which operations were carried out when, on one or all of my clusters in my project.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Automated Updates

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/53


Stories

  • As user I want the latest version with all security patches and bug fixes, so that my cluster is safe and sound.
  • As provider I can update the OS, so that I can deploy important security patches and bug fixes quickly.
  • As provider I can update Kubernetes and its components, so that I can deploy important security patches and bug fixes quickly.

Motivation

See above.

Acceptance Criteria

  • Changes to the OS can be rolled through all self-controlled landscapes
  • Changes to Kubernetes and its components can be rolled through all self-controlled landscapes
  • CI process that keeps track of patches, automatically pre-validates them, and prepares a PR for review

ℹ️ Migrated from Jira issue KUBE-16

Garden Operator CI

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/24


Stories

  • As contributor I depend on a continuous integration/delivery pipeline, so that I neither waste my time with repetitive tasks nor make (fatal) mistakes.

Motivation

Newly hired colleagues, or contributors from outside the group of colleagues who primarily work on a repo/subject/topic, have a hard time contributing, especially when sitting in remote locations; but in order to grow sustainably, we need everybody to be happy and able to help our cause.

Acceptance Criteria

  • Continuous integration (acting on PRs/Merges) including the following steps (in chronological order, not the order of importance):
    • Optional (can be done in phase 4): Formal checks like referenced issue
    • Optional (can be done in phase 3): Lint checks
    • Mandatory (must be done in phase 1): Build
    • Optional (can be done in phase 3): Unit tests
    • Optional (can be done in phase 3): Acceptance tests
    • Optional (can be done in phase 2): Deployment to staging
    • Optional (can be done in phase 2): Integration tests on staging
    • Mandatory (must be done in phase 1): Deployment to production
    • Mandatory (already done in phase 0): Integration tests on production

Details (not complete, to be extended by all of us as we go along)

  • Deployment of Jenkins into new cluster
  • Create secrets with kubernetes-jenkins password in new cluster (ideally automate that with e.g. jenkins-setup repo/script or the Jenkins-tools job/script), mount it into the pod/container template and adapt JenkinsFile accordingly
  • Deploy into Garden DEV cluster into separate NS (watching only this NS) and clone and run the garden-it against this NS
  • Run Jenkins pipeline also on PRs
  • Update PR during pipeline execution on certain stages, e.g. Lint tests passed, Unit tests passed, Integration tests passed, etc.
  • Exact plan of actual pipeline and how a change shall be prepared, what happens to the pull request, where it is tested, how the build result is promoted and deployed to production
  • In-depth evaluation of similar tools like Concourse or Drone which are more promising than Jenkins + Kubernetes-Plugin

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?

ℹ️ Migrated from Jira issue KUBE-149

Condense Backups

Issue by vlerenc
Tuesday Oct 24, 2017 at 00:29 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/160


Right now we only keep backups for the past 7 days (taken once a day), but it would be better to have a more flexible backup plan and condense older backups (a sketch of such a retention policy follows the list below):

  • Run backups hourly
  • Keep only the last 24 hourly backups and of all other backups only the last backup in a day
  • Keep only the last 7 daily backups and of all other backups only the last backup in a week
  • Keep only the last 4 weekly backups
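
A hedged Go sketch of this retention policy (pure bookkeeping over backup timestamps, not Gardener's actual backup code):

```go
package backup

import (
	"fmt"
	"sort"
	"time"
)

// condense decides which backup timestamps to keep, following the plan above:
// all hourly backups of the last 24 hours, then the newest backup per day for
// 7 days, then the newest backup per week for 4 weeks; everything older is dropped.
func condense(backups []time.Time, now time.Time) []time.Time {
	// Newest first, so the first backup we see for a day/week is the one to keep.
	sort.Slice(backups, func(i, j int) bool { return backups[i].After(backups[j]) })

	var keep []time.Time
	seenDay := map[string]bool{}
	seenWeek := map[string]bool{}

	for _, b := range backups {
		age := now.Sub(b)
		switch {
		case age <= 24*time.Hour:
			keep = append(keep, b) // hourly backups of the last day
		case age <= 7*24*time.Hour:
			day := b.Format("2006-01-02")
			if !seenDay[day] { // newest backup per day survives
				seenDay[day] = true
				keep = append(keep, b)
			}
		case age <= 4*7*24*time.Hour:
			year, week := b.ISOWeek()
			wk := fmt.Sprintf("%d-W%d", year, week)
			if !seenWeek[wk] { // newest backup per week survives
				seenWeek[wk] = true
				keep = append(keep, b)
			}
		}
	}
	return keep
}
```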

Provision via ACS

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/16


Stories

  • As provider I want to provision Kubernetes via ACS, so that we don't have to provision Kubernetes ourselves on Azure (when SAP or customers need to run it side-by-side) and benefit from a probably more competitive solution than whatever we can do (in terms of full management and pricing).

Motivation

Do not engage in Kubernetes provisioning when there are competitive native solutions, especially when customers already run on that infrastructure and use/manage/operate the same native solutions.

Acceptance Criteria

  • ACS Kubernetes cluster can be created via the Gardener and offers about the same functionality as what we have on AWS (to be clarified)

ℹ️ Migrated from Jira issue KUBE-83

Multi-Cloud

Stories

  • As provider I want to provision on AWS, Azure, GCP and OpenStack, so that Kubernetes runs in the Public Hyper-Scale Clouds where our customers already have their footprint.

Motivation

We shall go where our customers are, i.e. multi-cloud and also our own DCs. We shall offer this in an operator-friendly way with simplified operations.

Acceptance Criteria

Clusters can be created, operated and destroyed (following the seed & shoot cluster approach):

  • Kubernetes on AWS (covering the same functionality as our kops-based solution)
  • Kubernetes on Azure to (almost) the same extent as we have it on AWS
  • Kubernetes on GCP to (almost) the same extent as we have it on AWS
  • Kubernetes on OpenStack to (almost) the same extent as we have it on AWS

Pods/Infrastructure Not Reconciled After Restore

Issue by vlerenc
Tuesday Nov 07, 2017 at 16:06 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/186


We seem to have issues when we restore a backup: pods/containers that were created after the last backup turn into unmanaged zombies once the backup is restored, i.e. the kubelet doesn't kill them even though it should know that it created them and that they are no longer scheduled for that node.

The assumption is that we may have similar issues with PVs or LBs.

Therefore we should first check whether we indeed have reconciliation issues with PVs or LBs. Next we should check what else (that we control) might be misaligned/turn into zombies. Finally we should come up with a plan for how to address these issues. Maybe there are already issues filed by others, or we need to file new ones. Maybe we need to contribute, or we can mitigate with minimal effort (e.g. a rolling update will resolve the pod/container issue from above).

As for the customer load in the shoot cluster as such: If Kubernetes isn't addressing this issue, we can't. We should therefore inform the customers, in general (a priori) and when it happens (on occasion). Other than that, I see no way we could reconcile whatever the customer is running as cluster workload (could be custom resource definitions with all kinds of side effects or anything else). We simply don't know and can't handle this situation in a clean way.

Alert Volume

Issue by vlerenc
Tuesday Oct 24, 2017 at 07:03 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/163


Please investigate the current alert volume (at present far too high for manual inspection). Are too many alerts fired, or are they fired too early? If not, what's wrong with the clusters, what do we need to change, which features are missing?

The key point is: how can we improve either (alerts or clusters) and lower the alert volume to a reasonable amount that an operator can/should investigate (at present, that's simply not possible, given the sheer alert volume).

Add Cluster Logging

Stories

  • As operator I want cluster logging, so that I can check on my cluster(s).

Motivation

We need means to get to the logs of our own cluster(s), but we can influence/bend this BLI. The goal is to improve the work of an operator (of the week), not to provide a 100% solution. Certain logs are much more important than others: etcd and machine controller logs are kind of critical, and the API server is important as well (but may generate a lot of logs, of which I can't tell at the moment whether we can hold them all at a reasonable price), while others are less important or don't need to be preserved for long. If that's the case, we may not have to stuff them into the logging stack (which makes it less expensive), as we always have the Docker logs on the machines themselves. If e.g. networking fails, it fails immediately; there was never the need to go back two weeks to dig out some old logs, so maybe we do not need to preserve the daemon set logs in our shoot cluster. Keep that in mind to reasonably shape this BLI.

Acceptance Criteria

  • All pod/container logs are exposed to admins in a remotely accessible UI
    • Shoot cluster control plane in the seed cluster (if we need, we can cut that down to the more important pods like e.g. etcd and machine controller)
    • Shoot cluster addons in the shoot cluster (if we need, we can omit that)
  • Historic data is kept for the past two weeks (and truncated if it exceeds a certain unreasonable size)
  • CPU & Memory footprint is reasonable/small (if not we may have to consider having only one stack per seed; in that case we might also rethink our monitoring stack approach)

Out Of Scope

  • Like with the monitoring data we accept the loss of all logs when the Gardener has to drain a seed of its shoot clusters or loses its seed cluster for good and must move the shoot cluster control plane to another seed cluster

Implementation Proposal

Please check out the EFK stack (Elasticsearch + fluentd + Kibana; see various blogs, e.g. https://logz.io/blog/kubernetes-log-analysis). Is it what we should use?

If yes, should we handle it like the monitoring stack:

  • Aim at a logging stack that is co-deployed with every shoot cluster control plane and monitoring stack
  • Investigate what needs to be done technically to get to the logs in the shoot cluster namespace in the seed and the kube-system namespace in the shoot
  • Find out what needs to be done to contain the costs, i.e. keep historic data for the past two weeks and truncate it if it exceeds a certain unreasonable size

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Hibernate/Wake-Up Clusters

Story

  • As user/operator I want to hibernate clusters, so that I can save costs when I don't need the clusters anymore/temporarily.

Motivation

We have seen that users try to hibernate their clusters to save costs (we were also asked how to do that). That's actually great, and we haven't seen this form of cost responsibility anywhere else.

However, users have no means at present to do so in a clean way. One user manually tweaked the ASG, which turned off all nodes and hence also the cluster autoscaler; this turned the cluster into NotReady in our alert dashboard/flagged the cluster, which in turn created unnecessary ops effort. With the machine controller manager that path will be blocked (we will take more control of the cluster to provide our SLAs) and we will revert, or rather prevent, such tweaks.

Acceptance Criteria

  • Cluster can be scaled down to 0 machines across all pools
  • No false monitoring alerts fired

Implementation Proposal

Use the machine controller manager to achieve this. Alerts are temporarily disabled (VPN down, daemon sets down, nodes gone).
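
A sketch of the scale-to-zero building block in Go, using the dynamic client against the seed cluster; the MachineDeployment group/version/resource, namespace, and name are assumptions about the machine controller manager API, not verified against it:

```go
package hibernation

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// Assumed GVR of the machine controller manager's MachineDeployment resource.
var machineDeploymentGVR = schema.GroupVersionResource{
	Group:    "machine.sapcloud.io",
	Version:  "v1alpha1",
	Resource: "machinedeployments",
}

// hibernateWorkerPool patches a MachineDeployment in the shoot's namespace in the
// seed to zero replicas, which lets the machine controller manager delete the worker VMs.
func hibernateWorkerPool(ctx context.Context, seedClient dynamic.Interface, namespace, name string) error {
	patch := []byte(`{"spec":{"replicas":0}}`)
	_, err := seedClient.Resource(machineDeploymentGVR).Namespace(namespace).
		Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("scaling %s/%s to zero: %w", namespace, name, err)
	}
	return nil
}
```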

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Automated Cluster Updates

Stories

  • As user I want the latest version with all security patches and bug fixes, so that my cluster is safe and sound.
  • As provider I can update the OS, so that I can deploy important security patches and bug fixes quickly.
  • As provider I can update Kubernetes and its components, so that I can deploy important security patches and bug fixes quickly.

Motivation

See above.

Acceptance Criteria

  • Modification of shoot CRDs with a kubernetesVersion (e.g. v1.5), and an operatorVersion (that was used to create or last update the cluster, e.g. v1.35.0)
  • Rolling update when:
    • OS version gets updated
    • Kubelet version (only for new Kubernetes minor versions, otherwise by updating the cloud-config secrets)
  • Idempotent cluster update (Terraform & Control Plane) in all other cases (operator changes, Kubernetes patches, image or configuration changes)
  • Support multiple shoot cluster Kubernetes versions

Further Considerations

  • Kubernetes version upgrades (e.g. v1.5->v1.6) must be approved and scheduled by the end-user.
  • We agreed to postpone CoreOS Container Linux FastPatch updates (#37), because the nodes require a restart also with FastPatch and this isn't worth the effort on an IaaS platform where we can simply run a rolling-update of the nodes.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

CoreOS Container Linux FastPatch Updates

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/37


Story

  • As operator I want OS patches (functional and security) to be deployed as fast and automated as possible and ideally without any significant downtime, so that my cluster remains sound and safe.

Motivation

While we can do rolling updates on IaaS hyperscale providers (for now), we shouldn't do that, or at least shouldn't waste too much time doing it, on bare metal.

Acceptance Criteria

  • Update shoot cluster worker nodes using FastPatch (a.k.a. a/b channel updates without replacing the VM)
    • Shoot cluster TPR should hold exact garden operator version (which includes configuration such as CoreOS Container Linux image tag and version)
    • When a new garden operator gets deployed, it checks existing shoot clusters and updates them according to a canary process
      • At first only one cluster is updated, validating that it properly works after update
      • Then all other clusters are updated in parallel according to a max_in_flight property

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-170

Provision via GKE

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/14


Stories

  • As provider I want to provision Kubernetes via GKE, so that we don't have to provision Kubernetes ourselves on GCP (when SAP or customers need to run it side-by-side) and benefit from a presumably more competitive solution than whatever we can do (in terms of full management and pricing).

Motivation

Do not engage in Kubernetes provisioning when there are competitive native solutions, especially when customers already run on that infrastructure and use/manage/operate the same native solutions.

Acceptance Criteria

  • GKE Kubernetes cluster can be created via the Gardener and offers about the same functionality as what we have on AWS (to be clarified)

ℹ️ Migrated from Jira issue KUBE-84

Improve Prometheus Grafana Dashboards

Story

  • As operator I want useful Prometheus Grafana dashboards, so that I have best observability for my ops tasks.

Motivation

As discussed in the internal review, please take some time to finalize the dashboards, so that we have a good starting point/it is clear what purpose the dashboards serve.

Acceptance Criteria

  • Remove duplicate information (some duplicate information makes sense to be repeated in different dashboards, though)
  • Find better names
    • All Nodes -> Cluster Overview (as-is)
    • Kubernetes Cluster Monitoring -> Node Details (rename Cluster gauges into Node gauges, because that's what they are)
    • Deployments -> Kubernetes Deployments (we saw the data was missing for the API server)
    • Pod -> Kubernetes Pods
    • Resource Requests -> Resource Requests (the metrics are maybe wrong; let's either fix or remove them)

In addition we should also:

  • Provide a guide on how to tweak, create, or remove dashboards

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Optimise Cluster VPN

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/44


Story

  • As provider I want to have a fast, reliable and safe cluster VPN, so that I get all these benefits for this crucial component in the seed & shoot approach.

Motivation

Performance, simplicity, and security.

Acceptance Criteria

  • Investigate whether there is a more elegant approach than SSH for the cluster VPN between the API server in the seed cluster and the workers in the shoot cluster
  • Secure access with security groups

Related Resources

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-160

Broken Node Detection & Retirement

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/29


Story

  • As operator I want broken nodes to be removed automatically, so that nothing gets scheduled there anymore, the existing workload is drained/moved, and the nodes don't count against my budget (in general/financially and also for the autoscaler).

Motivation

Cost saving, business continuity, depending on the reason for the retirement also security.

Acceptance Criteria

  • Nodes are retired if:
    • Kubelet fails to call home
    • Kubelet indicates health issues
    • Ephemeral disk is full
  • Retirement means (cordoning; actually implicit with next step), draining, and finally terminating the VM so that a new one can be recreated by the scaling group

Implementation Proposal

See e.g. KUBE-70.
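
A sketch of the cordoning step in Go; draining the workload and terminating the VM would follow as separate steps (eviction API and cloud provider calls, omitted here):

```go
package retirement

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordon marks a broken node as unschedulable so that no new pods land on it.
// Draining the existing workload and terminating the VM are separate steps.
func cordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```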

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-236

GPU Support

Stories

  • As user I want to utilise GPUs if available, so that I can implement GPU-relevant applications like ML.

Motivation

See above.

Acceptance Criteria

  • Users can require nodes with GPUs
  • The GPU is actually available to (non-privileged) containers
  • CUDA usage for ML is possible

Open Questions

  • While CPUs are well compressible, how about GPUs (shader cores)?
  • How do we provision the drivers into the kops-managed nodes (AMI image is fixed, no means yet to apply direct changes through kops, so would daemon sets help)?

Resources

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ImagePullPolicy Admission Controller

Story

  • As operator I want to control which container images I allow (black-/whitelist), so that I can ensure the safety of my Kubernetes cluster.

Motivation

Security.

Acceptance Criteria

  • Admission controller that can be switched on or off:
    • Allow images from certain registries
    • Allow images with certain signatures

Implementation Proposal

Implement an image policy webhook.
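
A condensed Go sketch of such a validating webhook handler that checks pod images against an allowlist of registries; the allowlist, TLS serving setup, and webhook registration are placeholders or omitted:

```go
package imagepolicy

import (
	"encoding/json"
	"net/http"
	"strings"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowedRegistries is a placeholder allowlist.
var allowedRegistries = []string{"eu.gcr.io/my-project/", "registry.example.org/"}

// serveValidate handles AdmissionReview requests for pods and rejects images
// that are not pulled from one of the allowed registries.
func serveValidate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "invalid admission review", http.StatusBadRequest)
		return
	}

	allowed, message := true, ""
	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err == nil {
		for _, c := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
			if !fromAllowedRegistry(c.Image) {
				allowed, message = false, "image "+c.Image+" is not from an allowed registry"
				break
			}
		}
	}

	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: message},
	}
	_ = json.NewEncoder(w).Encode(review)
}

func fromAllowedRegistry(image string) bool {
	for _, prefix := range allowedRegistries {
		if strings.HasPrefix(image, prefix) {
			return true
		}
	}
	return false
}
```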

Resources

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Robustness and Resilience

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/54


Stories

  • As provider I want to ship a robust and resilient system, so that I get fewer complaints.

Motivation

The fewer complaints we get, the more productive our users probably are and the more we can concentrate on adding valuable features.

Acceptance Criteria

  • Vague: The system behaves correctly, also under stress

ℹ️ Migrated from Jira issue KUBE-86

ETCD Backup & Restore

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/31


Story

  • As provider I want etcd backups for my shoot clusters, so that I can restore them should they be lost.
  • Later: As provider I want etcd backups to be restored automatically should the PV be definitively lost or should the current etcd fail to start from its current state.

Motivation

Business continuity.

Acceptance Criteria

  • Regular etcd backups are taken by the garden operator for all shoot clusters and are made known to the Gardener UI via TPRs/CRDs
    • AWS
    • Azure
    • GCP
    • OpenStack

Resources

Open Questions

How frequently can we back up the data? Can we replicate the data on a lower level with a sidecar deployment, or would we need an etcd cluster then? Is there something like point-in-time recovery for etcd?
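
Partly answering the first question, a hedged Go sketch of taking a snapshot with the etcd v3 client; the endpoint and target path are placeholders, and uploading/pruning the snapshots is out of scope here:

```go
package etcdbackup

import (
	"context"
	"io"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// snapshot streams a full etcd snapshot to the given file. Uploading it to an
// object store (S3, GCS, ...) and pruning old snapshots would follow from here.
func snapshot(ctx context.Context, endpoint, targetPath string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint}, // e.g. the shoot's etcd service in the seed namespace (placeholder)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	rc, err := cli.Snapshot(ctx) // maintenance API: streams the backend database
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(targetPath)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}
```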

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-158

Shoot Cluster Control Plane Auto-Scaling

Story

  • As provider I want my shoot cluster control planes to scale automatically (within certain limits), so that my shoot clusters remain operational and respond fast.

Motivation

Response time behaviour. Avoiding issues like #79. Generally being able to run larger and very large clusters safely (also important for our own shoot'ed seed clusters).

Acceptance Criteria

  • XS (2 nodes), S (10 nodes), M (50 nodes), L (100 nodes) and XL clusters (250 nodes) with actual workload run safely without downtime for a week (hopefully longer, but that's the KPI and then we will free the resources).
  • Control plane shows good metrics-based utilisation (no large over-provisioning)
  • Minimal control plane unavailability (that is actually the greatest problem/threat: that we cause downtimes when scaling, most critically on the API server and etcd)

Implementation Proposal

  • Use metrics-based horizontal pod autoscaling (HPA) for all shoot cluster control plane components that can be scaled horizontally, like the API server (which also brings HA as a side benefit); see the sketch below.
  • Use metrics-based vertical pod autoscaling (VPA) for all shoot cluster control plane components that cannot be scaled horizontally, like the scheduler, the controller manager and, most importantly, etcd.
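
A sketch of the HPA half in Go, targeting a hypothetical kube-apiserver Deployment in the shoot's namespace in the seed; names, replica bounds, and the 80% CPU target are placeholders:

```go
package controlplane

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// apiServerHPA scales the shoot's API server deployment horizontally on CPU load.
func apiServerHPA(namespace string) *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	targetCPU := int32(80)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "kube-apiserver", Namespace: namespace},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "kube-apiserver", // placeholder deployment name
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 4,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}
}
```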

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Enable GPU Usage

Story

  • As user I want to run GPU-dependent applications, so that I can use libraries such as Theano or Tensorflow on GPUs.

Motivation

See above.

Acceptance Criteria

  • Users can require nodes with GPUs
  • The GPU is actually available to (non-privileged) containers
  • CUDA usage for ML is possible
  • Theano or Tensorflow can consume available node GPUs

Open Questions

  • While CPUs are well compressible, how about GPUs (shader cores)?
  • How do we provision the drivers into the OS images (image is fixed, so would daemon sets help)?

Resources

See https://vishh.github.io/docs/user-guide/gpus for details. See also https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?

Compliance and Performance Testing

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:44 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/26


Story

  • As provider I want to be sure our shoot clusters are Kubernetes compliant and show sufficient performance, so that I don't deliver crap to our consumers.

Motivation

We are still in an early phase and need to further validate our seed/shoot approach, before we roll it out.

Acceptance Criteria

  • Manually run, then automate in Jenkins CI:
    • E2E tests against a shoot cluster pass
    • Performance and scalability of a shoot cluster is comparable to our kops-based seed clusters or a Tectonic cluster

Resources

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

ℹ️ Migrated from Jira issue KUBE-184

Investigate ALB Ingress (In Contrast to NGINX Ingress)

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/45


Investigate https://coreos.com/blog/alb-ingress-controller-intro. Is it any good? Can it replace nginx ingress on AWS as the more natural and IaaS-specific solution? What impact will it have on costs? Would you advise switching to it? Please report results here.

Background: Vasu suggested to investigate the subject in https://git.infra.removed/kubernetes/landscape-template/issues/4. See comments there. Basically, we must weigh optimised support for individual infrastructures against the time we need for research, implementation and maintenance of optimised IaaS-specific solutions, while we have no issue with the general-purpose solution we have right now.

ℹ️ Migrated from Jira issue KUBE-111

Pod Container/Local Disk Quota

Issue by kubernetes-jenkins
Wednesday Jul 19, 2017 at 18:45 GMT
Originally opened as https://git.removed/kubernetes-attic/garden-operator/issues/48


Story

  • As operator I want to limit the impact of containers that consume all disk space, so that they don't put the entire node in danger or out of operation.

Motivation

At present, any container that fills up its implicitly created root volume can break down the entire node, as there is no quota in place (Docker's /var/lib/docker is hosted on the ephemeral root disk at present, together with everything else).

Acceptance Criteria

  • Ideally, restrict a container's root volume to a certain quota, so that only the container itself is impacted if it runs out of quota
  • Alternatively, if that's not possible, discuss whether mounting /var/lib/docker separately would help (then, however, all containers would be affected, and it is unclear whether that would improve anything at all; e.g. it is not clear how operable Docker will be if /var/lib/docker sits on a single mount point with no disk space left; will it respond to remote queries, and will it be possible to remove a container and its implicit volumes?)
  • Alternatively, if that's also not possible or doesn't make sense, check whether a saturated node would automatically be dumped and recreated if it becomes unresponsive, or find out what must be done to achieve this (if nodes are cattle, we still need to terminate the node so that it can be recreated by the autoscaling group)

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Developer documentation: Have you updated our documentation and/or the CHANGELOG.txt?

ℹ️ Migrated from Jira issue KUBE-31

Shoot Cluster Control Plane High-Availability

Story

  • As provider I want my shoot cluster control planes to be HA (within certain limits), so that my shoot clusters remain operational.

Motivation

Stability and scalability of our clusters (see also performance tests here https://jam4.sapjam.com/blogs/show/vKBhw5503MHfMQuId0b7t0).

Acceptance Criteria

  • Find out whether our assumption works out that we don't have to run control plane components in HA (i.e. single pods are OK):
    • How fast is a critical control plane pod rescheduled when the cluster is updated/nodes are rolled out
    • How fast is a critical control plane pod rescheduled when a node fails
  • Decide whether we can stick to our single-pod philosophy, arguing that Kubernetes is fast enough to reschedule a pod (this depends on detecting quickly enough that a pod has failed), or whether we have to give up this philosophy and run multiple pods with anti-affinity

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?
