
gardener-extension-os-gardenlinux's Introduction


This controller operates on the OperatingSystemConfig resource in the extensions.gardener.cloud/v1alpha1 API group.

It manages those objects that are requesting...

  • Garden Linux OS configuration (.spec.type=gardenlinux):

    ---
    apiVersion: extensions.gardener.cloud/v1alpha1
    kind: OperatingSystemConfig
    metadata:
      name: pool-01-original
      namespace: default
    spec:
      type: gardenlinux
      units:
        ...
      files:
        ...

    Please find a concrete example in the example folder.

  • MemoryOne on Garden Linux configuration (.spec.type=memoryone-gardenlinux):

    ---
    apiVersion: extensions.gardener.cloud/v1alpha1
    kind: OperatingSystemConfig
    metadata:
      name: pool-01-original
      namespace: default
    spec:
      type: memoryone-gardenlinux
      units:
        ...
      files:
        ...
      providerConfig:
        apiVersion: memoryone-gardenlinux.os.extensions.gardener.cloud/v1alpha1
        kind: OperatingSystemConfiguration
        memoryTopology: "2"
        systemMemory: "6x"

    Please find a concrete example in the example folder.

After reconciliation the resulting data will be stored in a secret within the same namespace (as the config itself might contain confidential data). The name of the secret will be written into the resource's .status field:

...
status:
  ...
  cloudConfig:
    secretRef:
      name: osc-result-pool-01-original
      namespace: default
  command: /usr/bin/env bash <path>
  units:
  - docker-monitor.service
  - kubelet-monitor.service
  - kubelet.service

The secret has one data key, cloud_config, which stores the generated cloud config.
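
For example, the generated cloud config can be read back from that secret (a minimal sketch; the secret name and namespace are taken from the status shown above):

kubectl -n default get secret osc-result-pool-01-original \
  -o jsonpath='{.data.cloud_config}' | base64 -d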

An example for a ControllerRegistration resource that can be used to register this controller to Gardener can be found here.

Please find more information regarding the extensibility concepts and a detailed proposal here.


How to start using or developing this extension controller locally

You can run the controller locally on your machine by executing make start. Please make sure to have the kubeconfig to the cluster you want to connect to ready in the ./dev/kubeconfig file. Static code checks and tests can be executed by running make verify. We are using Go modules for Golang package dependency management and Ginkgo/Gomega for testing.
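
For example (a minimal sketch; adapt the kubeconfig source to your environment):

# make the target cluster's kubeconfig available at the expected location
cp ~/.kube/config ./dev/kubeconfig

# run the controller locally against that cluster
make start

# run static code checks and tests
make verify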

Feedback and Support

Feedback and contributions are always welcome. Please report bugs or suggestions as GitHub issues or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn more!

Please find further resources about our project here:

gardener-extension-os-gardenlinux's People

Contributors

achimweigel, acumino, aleksandarsavchev, beckermax, ccwienk, danatsap, danielfoehrkn, dependabot[bot], dimityrmirchev, gardener-robot-ci-1, gardener-robot-ci-2, gardener-robot-ci-3, ialidzhikov, jordanjordanov, jschicktanz, kon-angelo, kostov6, mrbatschner, msohn, n-boshnakov, oliver-goetz, plkokanov, raphaelvogel, rfranzke, shafeeqes, stoyanr, timebertt, timuthy, voelzmo, vpnachev


gardener-extension-os-gardenlinux's Issues

Existing nodes get cgroup driver configured inconsistently on extension upgrade to 0.23.0

How to categorize this issue?

/area os
/kind bug
/os garden-linux

What happened:
We got into what looks like the same situation described in #98 after upgrading the extension to 0.23.0, but it's not a race condition for new nodes, as was the conclusion there, so I opted for a new issue.

For us, existing nodes somehow got their cgroup driver changed and now pods can't spawn.

Warning FailedCreatePodSandBox 112s (x1613 over 41m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/kubepods/burstable/podc8032134-6f4d-4eab-a086-bafd11ef6dd2/4f4f22b56ac4bc4c998c28ba608f35f8ad733b08b7e1c525e8f70d2c928639ee" instead: unknown

Something about the "Don't touch existing nodes" check in #133 did not work. We rolled back to 0.22.0 which fixed the problem. I'll update this issue with further logs and information when I can reproduce it on a test seed.

What you expected to happen:
Upgrade of the extension not to break existing nodes.

How to reproduce it (as minimally and precisely as possible):
Upgrade os-gardenlinux from 0.22.0 to 0.23.0 (while updating gardener/gardenlet from 1.86.1 to 1.87.0 at the same time)

Anything else we need to know?:

Environment:

  • Gardener version (if relevant): 1.87.0
  • Extension version: 0.23.0

cgroup driver configured inconsistently for containerd and kubelet

How to categorize this issue?

/area os
/kind bug
/os garden-linux

What happened:

The scripts introduced in PR #82 and PR #85 are supposed to configure containerd's cgroup driver according to the cgroup version of the underlying host and to make sure that the kubelet is configured with the corresponding cgroup driver. Both reconfigurations happen by running a script in the respective component's ExecStartPre= before it comes up.

Since these changes were released to Gardener landscapes, they proved to be reliable most of the time. However, we came across at least one instance of a new node with Garden Linux 934.7 (and hence with cgroup v2) being scheduled for an existing worker pool that failed to join the cluster due to an inconsistent cgroup driver configuration: containerd ran with the systemd cgroup driver, but the kubelet still ran with the cgroupfs cgroup driver.

From the logs of the internal support ticket:

2023-05-10 06:50:55 | {"_entry":"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format \"slice:prefix:name\" for systemd cgroups, got \"/kubepods/burstable/pod8d7a925b-ba2f-4fcd-9eb7-6a933858f3df/d6f8417b97f66a8e2fd5b7c23cf01222a1b7b476c29c12b1d557c219061e96db\" instead: unknown","count":1,"firstTimestamp":"2023-05-10T06:50:55Z","lastTimestamp":"2023-05-10T06:50:55Z","namespace":"kube-system","object":"Pod/egress-filter-applier-sx4d4","origin":"shoot","reason":"FailedCreatePodSandBox","source":"kubelet","sourceHost":"shoot--hyperspace--prod-int-worker-gqla1-z1-67c55-ntm74","type":"Warning"}
2023-05-10 06:50:55 | {"_entry":"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format \"slice:prefix:name\" for systemd cgroups, got \"/kubepods/burstable/pod2eed3a6f-3e07-42f9-9219-ba72f7f66c47/d85c081a0e7821d8e5a853ee48e97bc0bf2cebb8050b4ff9ac0a3ca9e10ae50b\" instead: unknown","count":1,"firstTimestamp":"2023-05-10T06:50:55Z","lastTimestamp":"2023-05-10T06:50:55Z","namespace":"kube-system","object":"Pod/apiserver-proxy-6fdfb","origin":"shoot","reason":"FailedCreatePodSandBox","source":"kubelet","sourceHost":"shoot--hyperspace--prod-int-worker-gqla1-z1-67c55-ntm74","type":"Warning"}
2023-05-10 06:50:55 | {"_entry":"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format \"slice:prefix:name\" for systemd cgroups, got \"/kubepods/burstable/pod0afb0879-f205-4b66-bc0b-28ff62c898b2/167ccf567fa2cd2a438b18b63b158c0bb21c859559e93924f2f2d1234c35d17b\" instead: unknown","count":1,"firstTimestamp":"2023-05-10T06:50:55Z","lastTimestamp":"2023-05-10T06:50:55Z","namespace":"kube-system","object":"Pod/node-local-dns-hhns4","origin":"shoot","reason":"FailedCreatePodSandBox","source":"kubelet","sourceHost":"shoot--hyperspace--prod-int-worker-gqla1-z1-67c55-ntm74","type":"Warning"}
2023-05-10 06:50:55 | {"_entry":"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format \"slice:prefix:name\" for systemd cgroups, got \"/kubepods/burstable/poda592f80e-1070-482b-94db-1b070e836f7a/930ac369e7035000c8810e6c72b720ff73485834269738d930d18ef96b489874\" instead: unknown","count":1,"firstTimestamp":"2023-05-10T06:50:55Z","lastTimestamp":"2023-05-10T06:50:55Z","namespace":"kube-system","object":"Pod/csi-driver-node-gm8p4","origin":"shoot","reason":"FailedCreatePodSandBox","source":"kubelet","sourceHost":"shoot--hyperspace--prod-int-worker-gqla1-z1-67c55-ntm74","type":"Warning"}

These log messages strongly point to the kubelet creating a classic cgroupfs hierarchy which containerd, running with the systemd driver, has no clue what to do with.
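
To check whether a node is affected, the cgroup version and the drivers of both components can be inspected directly on the node (a sketch; the kubelet config path is an assumption and may differ per landscape):

# cgroup hierarchy the node booted with: "cgroup2fs" means cgroup v2, "tmpfs" means cgroup v1
stat -fc %T /sys/fs/cgroup

# cgroup driver containerd is configured with ("SystemdCgroup = true" means systemd)
grep SystemdCgroup /etc/containerd/config.toml

# cgroup driver the kubelet is configured with (config path is an assumption)
grep -i cgroupDriver /var/lib/kubelet/config/kubelet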

What you expected to happen:

containerd and the kubelet always run with the same cgroup driver, and log messages like the ones given above do not occur.

How to reproduce it (as minimally and precisely as possible):

Steps to reproduce need to be investigated. It seems that this is caused by a timing issue or a race condition.

Use systemd cgroup driver on cgroupsv2

How to categorize this issue?

/area os
/kind enhancement
/os garden-linux

What would you like to be added:

Garden Linux nodes using cgroup v2 should be configured with the systemd cgroup driver.

Why is this needed:

  • Recommended by the Kubernetes documentation (which does not state a reason why, and AFAIK there are no inherent problems with using the cgroupfs driver on cgroup v2)
  • We have seen problems with nested docker-in-docker type setups (also for containerd). If the container runtime uses cgroupfs (the default for an OS without a running systemd, e.g. in a dind pod), the container runtime places the started containers not under the parent pod's cgroup, but under a separate sibling cgroup. This prevents any cgroup settings on the parent pod's cgroup (e.g. memory limits) from applying to the containers started by the nested container runtime.
    • With systemd as the cgroup driver, the container runtimes behave correctly (the reason is unknown; it is an implementation detail of dockerd and containerd).

For the implementation:

  • Ensure the systemd cgroup driver is set for both the kubelet and the container runtime (see the sketch after this list). Be aware that Gardener generates the kubelet configuration in the gardenlet (not in the OS extension). Hence, we either need to change the kubelet config centrally to use the systemd driver, or provide a GL-specific mechanism, e.g. a systemd service, to adjust the kubelet config on the host.
  • Make sure that GL 576.XX, which has a hybrid cgroup setup, works with systemd as the cgroup driver. If it does, the extension can simply enforce the systemd driver for all nodes, since GL 576.XX is already more than a year old.
  • Ensure node pools are rolled when the cgroup driver is switched.
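
The following sketch shows the essence of such a pre-start adjustment; it is not the extension's actual script, and the file paths are assumptions:

# switch both components to the systemd cgroup driver when the host runs cgroup v2
if [ "$(stat -fc %T /sys/fs/cgroup)" = "cgroup2fs" ]; then
  # containerd: enable the systemd cgroup driver in the CRI runc options
  sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
  # kubelet: align the KubeletConfiguration (file path is an assumption)
  sed -i 's/cgroupDriver: cgroupfs/cgroupDriver: systemd/' /var/lib/kubelet/config/kubelet
fi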

Support containerd in AliCloud Mainland

How to categorize this issue?

/area os
/kind enhancement
/priority blocker

What would you like to be added:
Add a configuration option for the pause container image to the containerd configuration.

Why is this needed:
Currently, when enabling containerd in a shoot on AliCloud (Shanghai region), nodes fail to start because addon pods (like CNI addons) can't be created. Error logs are below:

Nov 16 02:59:45 iZuf637vfm5lxo70hy8gv7Z containerd[3057]: time="2020-11-16T02:59:45.564237241Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:kube-proxy-zltgm,Uid:6fb22380-50b2-41e1-ad42-7461929d73fa,Namespace:kube-system,Attempt:0,}"
Nov 16 02:59:51 iZuf637vfm5lxo70hy8gv7Z containerd[3057]: time="2020-11-16T02:59:51.564223616Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:node-exporter-kb6zc,Uid:b74f37a9-c234-4388-b30e-3f46d9e9af01,Namespace:kube-system,Attempt:0,}"
Nov 16 03:00:13 iZuf637vfm5lxo70hy8gv7Z containerd[3057]: time="2020-11-16T03:00:13.567166810Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:calico-node-h5jwg,Uid:6863bee4-e0fc-432b-96f1-51c56906cb11,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp 74.125.203.82:443: i/o timeout"

Apparently, the k8s.gcr.io/pause:3.2 image can't be pulled from mainland China.

Though --pod-infra-container-image is configured in kubelet.service, it doesn't take effect, as the following warning shows:

Nov 16 05:55:53 iZuf67ypw1zih2l0iu5keuZ kubelet[4336]: W1116 05:55:53.385305    4336 server.go:191] Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead
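
For reference, containerd's sandbox (pause) image is set in the CRI plugin section of its configuration; a sketch is shown below (the Alibaba Cloud mirror reference is an assumption, and containerd must be restarted to pick up the change):

# /etc/containerd/config.toml (excerpt)
#   [plugins."io.containerd.grpc.v1.cri"]
#     sandbox_image = "registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2"

# restart containerd so the new sandbox image takes effect
systemctl restart containerd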

Kubelet Startup Crashes systemd on Garden Linux

What happened:

systemd crashes when the node starts. While this is most likely a systemd issue (that unfortunately we are not able to fix in a short period of time), we need a workaround. It also appears problematic to run Docker containers in that restricted context, at a time when Docker might not be fully started.

This happens with Garden Linux and only on GCP (AWS is fine). It happens with various systemd versions (244, 245).

What you expected to happen:

systemd does not crash

How to reproduce it (as minimally and precisely as possible):

Create a gardenlinux cluster on GCP (but there is no need to).

Anything else we need to know?:

Garden Linux ticket (still private): gardenlinux/gardenlinux#17

In the kubelet.service file replace the docker run command with the two following lines (this has been tested).

#ExecStartPre=/usr/bin/docker run --rm -v /opt/bin:/opt/bin:rw eu.gcr.io/sap-se-gcr-k8s-public/modified3/k8s_gcr_io/hyperkube:v1.15.11 /bin/sh -c "cp /usr/local/bin/kubelet /opt/bin"
ExecStartPre=wget -O /opt/bin/kubelet https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubelet
ExecStartPre=chmod 755 /opt/bin/kubelet

Alternatively the following should also work (but not tested):

ExecStartPre=sh -c 'docker cp $(docker create ):/hyperkube /opt/bin/kubelet'

Environment:

  • Gardener version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

make SELinux enforcing status configurable

How to categorize this issue?

/area os
/kind enhancement
/priority 4
/os garden-linux

What would you like to be added:
Future releases of GardenLinux will come with SELinux tools and policies (packages selinux-basics and selinux-policy-default). Having these packages in Garden Linux will make the system capable of enforcing SELinux policies, but by default SELinux will be in permissive mode.
It should be possible to enable SELinux (i.e. set it to enforcing) individually for shoots or worker pools through this OS extension.

Why is this needed:
Some workloads might require SELinux and some workloads might fail on worker nodes that have SELinux enabled. For those, it should be possible to enable/disable SELinux for individual shoots or even for worker pools.
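
On the node itself, switching to enforcing mode boils down to the standard SELinux tooling (a sketch; how the extension would expose this, e.g. via providerConfig, is an open design question):

# show the current mode ("Permissive" by default)
getenforce

# switch the running system to enforcing
setenforce 1

# persist the mode across reboots
sed -i 's/^SELINUX=.*/SELINUX=enforcing/' /etc/selinux/config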

Boot chost garden-linux with vsmp memory_one

How to categorize this issue?
/area os
/Enhancement Request
/vsmp memory_one
/os garden-linux

See also https://ddm-hana.slack.com/archives/C9CEBQPGE/p1687161450772459

Support boot chost garden-linux with vsmp memory_one (like the currently existing SuSE with memory_one).

Currently, when trying to start a memory_one stencil with garden-linux, the operation fails, and we see in our logs that cloud-init tries to run scripts that belong to SUSE (e.g. "zypper" and "rpm", which fail).

Hugepages of nodes must be configurable in order for pods to use them

In Kubernetes, pods have been able to request and make use of memory in hugepages for a long time (beta since Kubernetes 1.10, GA since 1.14):

apiVersion: v1
kind: Pod
metadata:
  name: huge-pages-example
spec:
  containers:
  - name: example
    image: fedora:latest
    resources:
      limits:
        hugepages-2Mi: 100Mi
        hugepages-1Gi: 2Gi

Gardener is currently lacking the capability to configure the hugepages on operating system level.
Ideally, the configuration should be possible on Gardener in a uniform way across all operating systems.

Other K8S offerings, for example Amazon EKS, offer possibilities to call OS commands before the kubelet of a node is started (see preBootstrapCommands below).

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-hugepages
  region: eu-central-1
managedNodeGroups:
  - name: my-nodes-group
    instanceType: c5d.large
    desiredCapacity: 1
    minSize: 1
    maxSize: 2  
    volumeSize: 50
    volumeType: gp2
    preBootstrapCommands:
      - sudo echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Such a configuration would enable hugepage configuration in a completely generic way.

As an alternative, a configuration that is handled by the OS-specific controllers of Gardener could offer the same functionality in a more controlled way.

See also the same issue for SUSE Linux: gardener/gardener-extension-os-suse-chost#64
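
On the node, the pre-kubelet step boils down to something like the following (a sketch; the page count and NUMA node are placeholders):

# reserve 64 2MiB hugepages on NUMA node 0
echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# the kubelet has to be (re)started afterwards so that it advertises the new
# hugepages-2Mi capacity to the API server
systemctl restart kubelet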

In-Place Node OS Updates

Story

  • As a user, I want to update my node operating system in-place with only a restart, and no rolling update required

Motivation

We were approached by parties that, for performance reasons, use locally attached disks with data. While the data is replicated, rolling a node and re-building/syncing the local data may take hours. Doing that during a cluster rolling update may therefore take many days, which is difficult for them.

This feature is also useful in cases where the used machine type is scarce (very special machine types) and it isn't easy / guaranteed to get new machines (no reserved instances).

GardenLinux is currently developing the capability to do that. Reminiscent of CoreOS' FastPatch updates, it will have 2 partitions, run on one, prepare the other one, reboot into the other one. Persistent data is stored on yet another and preserved. This may not work with every update, but with many. The GardenLinux developers expect full rolling only to happen later every 1-2 years, but all other updates could be handled in-place once they and we are done.

This ticket here is about Gardener's part, because we do not support in-place OS updates as of now and do need to think it through and do it then, if feasible. Just for historic reference, please see here one of our very first Gardener tickets when we implemented full automated cluster updates (no. 14 for K8s v1.5 -> v1.6 - time flies) and decided at first against FastPatch (gardener/gardener#14).

Labels

/area os
/kind enhancement
/os garden-linux
/topology shoot

Acceptance Criteria

  • Node OS updates (probably of something like patch versions, to also fit our Kubernetes versioning concept) are done without rolling the nodes
  • Ideally, the "dead time" from when the kubelet stops posting until it reposts (99th percentile) is shorter than the default machineHealthTimeout of 10m (even better, shorter than the default KCM nodeMonitorGracePeriod of 40s), but that can be tweaked (including pod tolerations) by the cluster admins if not sufficient (still, it would be great to achieve a.) if not b.), since "it was said" the rebooting shall take place in seconds)
  • ...

Enhancement/Implementation Proposal (optional)

This will require a GEP (https://github.com/gardener/gardener/tree/master/docs/proposals), as conceptual and core changes will be necessary, and everything else up until the update of the versioning guide/docs. The question is also what the main actor is, i.e. will we handle this use case like we handle Kubernetes patch updates, i.e. carried out by the maintenance controller? That's probably preferred for multiple reasons (means to opt out, shoot spec lists exact version, time scatter/jiggle resp. coordinated update, etc.) over the OS doing it itself.

Further Considerations

  • Rolling updates, as a side effect, help with some security obligations (regular fresh start), help build robust solutions (avoiding pet VMs), and act as some sort of safety net: only when the new node is registered and ready is the old node drained and subsequently terminated. In-place updates obviously do not offer this.
  • Because this is not generally desirable (only in certain cases, e.g. with nodes with local disks or of scarce machine types), it would be best to make the update policy (rolling or in-place) configurable per worker pool, which would require more changes. The maintenance section as of today is for the entire cluster.

Resources (optional)

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Adopt Garden Linux os extension due to removal of legacy configuration

How to categorize this issue?

/area os
/kind enhancement
/priority 2
/os garden-linux

What would you like to be added:

In one of the next releases we would like to make some changes to Garden Linux which will affect the Garden Linux extension and the way Kubernetes is configured. In the past we have disabled features that caused incompatibilities with Kubernetes at that time. Those disabled features are:

  1. nftables
  2. cgroupv2
  3. usage of systemd-resolved as resolver

We want to remove the code or configuration that disables these features. We would like to change this because it is legacy and, in the meantime, Kubernetes has gained support for them. The operating system extension should disable these features if this is still required for older Kubernetes versions. These are the implications and required configuration changes for older clusters:

1. nftables

nftables is supported in Kubernetes since version 1.17, see gardenlinux/gardenlinux#193 for details. Support for some Gardener owned components might still be missing. To switch back to the iptables legacy implementation run the following commands during the bootstrap process:

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy
systemctl mask nftables.service

2. cgroupv2

cgroupv2 is supported by Kubernetes starting with Kubernetes 1.19. To switch to cgroupv1 add the following kernel parameter (in /etc/kernel/cmdline.d/80-cgroup.cfg) and reboot.

CMDLINE_LINUX="$CMDLINE_LINUX systemd.unified_cgroup_hierarchy=0" 

There appears to be no other way for Docker to change this.

3. resolv.conf

Garden Linux uses systemd-resolved, which is a caching and validating DNS/DNSSEC stub resolver, as well as an LLMNR and MulticastDNS resolver and responder. This functionality is not used when /etc/resolv.conf is linked to /run/systemd/resolve/resolv.conf. For the local system it is better to link to /run/systemd/resolve/stub-resolv.conf.

The problem is that this configuration does not work with Kubernetes, as the local resolver (which listens on 127.0.0.53) is not routable from within pods. The kubelet has the configuration option to link a different resolv.conf file with the --resolv-conf option (e.g. /run/systemd/resolve/resolv.conf can be used). Some infrastructures might require additional settings (see for example gardener/gardener-extension-provider-openstack#340).
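
A sketch of the relevant kubelet setting and a quick check on the node (--resolv-conf and the resolvConf config field are the kubelet's standard options for this):

# check which file /etc/resolv.conf currently points to
readlink -f /etc/resolv.conf

# point the kubelet at the full resolver list instead of the 127.0.0.53 stub,
# either via the flag
#   --resolv-conf=/run/systemd/resolve/resolv.conf
# or via the corresponding KubeletConfiguration field
#   resolvConf: /run/systemd/resolve/resolv.conf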

Why is this needed:

see above
