rancher / system-upgrade-controller Goto Github PK

View Code? Open in Web Editor NEW

688.0 20.0 85.0 21.47 MB

In your Kubernetes, upgrading your nodes

License: Apache License 2.0

Makefile 0.35% Go 88.09% Dockerfile 1.55% Shell 10.01%

kubernetes custom-resource-definition crd-controller upgrades

system-upgrade-controller's Introduction

System Upgrade Controller

Introduction

This project aims to provide a general-purpose, Kubernetes-native upgrade controller (for nodes). It introduces a new CRD, the Plan, for defining any and all of your upgrade policies/requirements. A Plan is an outstanding intent to mutate nodes in your cluster. For up-to-date details on defining a plan please review v1/types.go.

Presentations and Recordings

April 14, 2020

CNCF Member Webinar: Declarative Host Upgrades From Within Kubernetes

Slides
Video

March 4, 2020

Rancher Online Meetup: Automating K3s Cluster Upgrades

Video

Considerations

Purporting to support general-purpose node upgrades (essentially, arbitrary mutations) this controller attempts minimal imposition of opinion. Our design constraints, such as they are:

content delivery via container image a.k.a. container command pattern
operator-overridable command(s)
a very privileged job/pod/container:
- host IPC, NET, and PID
- CAP_SYS_BOOT
- host root file-system mounted at /host (read/write)
optional opt-in/opt-out via node labels
optional cordon/drain a la kubectl

Additionally, one should take care when defining upgrades by ensuring that such are idempotent--there be dragons.

Deploying

The most up-to-date manifest is usually manifests/system-upgrade-controller.yaml but since release v0.4.0 a manifest specific to the release has been created and uploaded to the release artifacts page. See releases/download/v0.4.0/system-upgrade-controller.yaml

But in the time-honored tradition of curl ${script} | sudo sh - here is a nice one-liner:

# Y.O.L.O.
kubectl apply -k github.com/rancher/system-upgrade-controller

Example Plans

examples/k3s-upgrade.yaml
- Demonstrates upgrading k3s itself.
examples/ubuntu/bionic.yaml
- Demonstrates upgrading, apt-get style, arbitrary packages at pinned versions.
examples/ubuntu/bionic/linux-kernel-aws.yaml
- Demonstrates upgrading the kernel on Ubuntu 18.04 EC2 instances on AWS.
examples/ubuntu/bionic/linux-kernel-virtual-hwe-18.04.yaml
- Demonstrates upgrading the kernel on Ubuntu 18.04 (to the HWE version) on generic virtual machines.

Below is an example Plan developed for k3OS that implements something like an rsync of content from the container image to the host, preceded by a remount if necessary, immediately followed by a reboot.

---
apiVersion: upgrade.cattle.io/v1
kind: Plan

metadata:
  # This `name` should be short but descriptive.
  name: k3os-latest

  # The same `namespace` as is used for the system-upgrade-controller Deployment.
  namespace: k3os-system

spec:
  # The maximum number of concurrent nodes to apply this update on.
  concurrency: 1

  # The value for `channel` is assumed to be a URL that returns HTTP 302 with the last path element of the value
  # returned in the Location header assumed to be an image tag (after munging "+" to "-").
  channel: https://github.com/rancher/k3os/releases/latest

  # Providing a value for `version` will prevent polling/resolution of the `channel` if specified.
  version: v0.10.0

  # Select which nodes this plan can be applied to.
  nodeSelector:
    matchExpressions:
      # This limits application of this upgrade only to nodes that have opted in by applying this label.
      # Additionally, a value of `disabled` for this label on a node will cause the controller to skip over the node.
      # NOTICE THAT THE NAME PORTION OF THIS LABEL MATCHES THE PLAN NAME. This is related to the fact that the
      # system-upgrade-controller will tag the node with this very label having the value of the applied plan.status.latestHash.
      - {key: plan.upgrade.cattle.io/k3os-latest, operator: Exists}
      # This label is set by k3OS, therefore a node without it should not apply this upgrade.
      - {key: k3os.io/mode, operator: Exists}
      # Additionally, do not attempt to upgrade nodes booted from "live" CDROM.
      - {key: k3os.io/mode, operator: NotIn, values: ["live"]}

  # The service account for the pod to use. As with normal pods, if not specified the `default` service account from the namespace will be assigned.
  # See https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  serviceAccountName: k3os-upgrade

  # Specify which node taints should be tolerated by pods applying the upgrade.
  # Anything specified here is appended to the default of:
  # - {key: node.kubernetes.io/unschedulable, effect: NoSchedule, operator: Exists}
  tolerations:
  - {key: kubernetes.io/arch, effect: NoSchedule, operator: Equal, value: amd64}
  - {key: kubernetes.io/arch, effect: NoSchedule, operator: Equal, value: arm64}
  - {key: kubernetes.io/arch, effect: NoSchedule, operator: Equal, value: s390x}

  # The prepare init container, if specified, is run before cordon/drain which is run before the upgrade container.
  # Shares the same format as the `upgrade` container.
  prepare:
    # If not present, the tag portion of the image will be the value from `.status.latestVersion` a.k.a. the resolved version for this plan.
    image: alpine:3.18
    command: [sh, -c]
    args: ["echo '### ENV ###'; env | sort; echo '### RUN ###'; find /run/system-upgrade | sort"]

  # If left unspecified, no drain will be performed.
  # See:
  # - https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
  # - https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain
  drain:
    # deleteLocalData: true  # default
    # ignoreDaemonSets: true # default
    force: true
    # Use `disableEviction == true` and/or `skipWaitForDeleteTimeout > 0` to prevent upgrades from hanging on small clusters.
    # disableEviction: false # default, only available with kubectl >= 1.18
    # skipWaitForDeleteTimeout: 0 # default, only available with kubectl >= 1.18

  # If `drain` is specified, the value for `cordon` is ignored.
  # If neither `drain` nor `cordon` are specified and the node is marked as `schedulable=false` it will not be marked as `schedulable=true` when the apply job completes.
  cordon: true

  upgrade:
    # If not present, the tag portion of the image will be the value from `.status.latestVersion` a.k.a. the resolved version for this plan.
    image: rancher/k3os
    command: [k3os, --debug]
    # It is safe to specify `--kernel` on overlay installations as the destination path will not exist and so the
    # upgrade of the kernel component will be skipped (with a warning in the log).
    args:
      - upgrade
      - --kernel
      - --rootfs
      - --remount
      - --sync
      - --reboot
      - --lock-file=/host/run/k3os/upgrade.lock
      - --source=/k3os/system
      - --destination=/host/k3os/system

Building

make

Running

Use ./bin/system-upgrade-controller.

Also see manifests/system-upgrade-controller.yaml that spells out what a "typical" deployment might look like with default environment variables that parameterize various operational aspects of the controller and the resources spawned by it.

Testing

Integration tests are bundled as a Sonobuoy plugin that expects to be run within a pod. To verify locally:

make e2e

This will, via Dapper, stand up a local cluster (using docker-compose) and then run the Sonobuoy plugin against/within it. The Sonobuoy results are parsed and a Status: passed results in a clean exit, whereas Status: failed exits non-zero.

Alternatively, if you have a working cluster and Sonobuoy installation, provided you've pushed the images (consider building with something like make REPO=dweomer TAG=dev), then you can run the e2e tests thusly:

sonobuoy run --plugin dist/artifacts/system-upgrade-controller-e2e-tests.yaml --wait
sonobuoy results $(sonobuoy retrieve)

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

system-upgrade-controller's People

Contributors

Stargazers

Watchers

Forkers

dweomer rmweir ibuildthecloud galal-hussein daxmc99 masterzen dclark athosone moduspwnens devopstoday11 monzelmasry rksr robkooper vegesana erkki elmariofredo paulfantom dacrystal nooproblem amineaffane oats87 timtorchen brandond albertogviana suse-kevinklinger zdzichu kinarashah curx luthermonson cmurphy marcusianlevine strongmonkey snasovich robinmccorkell ezc longtian lu1as danmanstx sylvainol thomasferrandiz anarkis kchobantonov harrisonwaffel harshanarayana mmensonen valtechmobility pawe patrickjahns jorikseldeslachts fritterhoff santhoshdaivajna 0ffer manuelbuil jzandbergen tsde rishi-anand spectrocloud rancher-max jstewart612 aplanas jiaqiluo leijie31415926 jefedavis hurzelpurzel jrodonnell brooksn vardhaman22 auston-ivison-suse luis-sousa-pinto jhoelzel buroa 1898andco matttrach jpenner123 sfackler 831jsh sisheogorath onno204 vojtechpastyrik bfbachmann justdan96 damdo harsimranmaan vestigej

system-upgrade-controller's Issues

sometimes generated jobs exceed concurrency

Version
v0.3.0-rc1

Platform/Architecture
linux/amd64

Describe the bug
In working on the Ubuntu examples I have noticed that sometimes more jobs are generated than specified by the concurrency.

To Reproduce
With a concurrency of 2, spin up 5 nodes. Sometimes you will get 4 jobs.

Expected behavior
Only a concurrency number of upgrade jobs are running.

Actual behavior
Sometimes a number of jobs greater than the concurrency value for a plan are generated for that plan.

Additional context
n/a

after upgrade k3s continuous "1 controller.go:135] error syncing 'system-upgrade/k3s-agent' messages"

Version
rancher/system-upgrade-controller:v0.5.0
rancher/kubectl:v1.18.2
on
rancher/k3s: v1.18.2-k3s1

Platform/Architecture
arm64 (pine64: rock64pro master, rock64 nodes)

Describe the bug
After a succesfull upgrade run the, the upgrade agent pods are deleted, but the upgrade controller running on the maste keeps complaing with error syncing messages:
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Operation cannot be fulfilled on "system-upgrade-controller": delaying object set, requeuing"
"1 controller.go:135] error syncing 'system-upgrade/k3s-server': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp: lookup update.k3s.io on 10.43.0.10:53: server misbehaving, requeuing" (NB: 10.43.0.10 is actual kube-dns pod address)
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp 10.0.0.1:443: connect: network is unreachable, requeuing"

To Reproduce
Have not tried to reproduce as this would most likely mean downgrading the k3s cluster and redoing upgrade .

Expected behavior
Would expect only messages polling the channel for upgrades or plan changes.

Actual behavior
Upgrade controller started upgrade, concurrency 1 on agents. Have seen to be expected behaviour: containers downloading executing, upgrading master/server, and on switching to agents, draining, upgrading. Succesfully restarted all jobs. All well it seems, except these continous flooding of error syncing in log.

Additional context
Added controller deployment and plan yaml's.
1.system-upgrade-controller.txt
2.k3s_upgrade-plan.txt

Can't find Plan CRD

I can't find Plan CustomResourceDefinition object in the manifests folder. Installation fails on an RKE cluster with the object not found error for Plan CRD.

need an easy way to indicate that a plan has been disabled

Is your feature request related to a problem? Please describe.
When importing k3s into Rancher we would need to disable any existing upgrade plans so as to avoid clashing with Rancher-imposed plans.

Describe the solution you'd like
The controller should treat any plan in the same namespace with the annotation upgrade.cattle.io/disabled=true as disabled which means no job should be scheduled for it.

Describe alternatives you've considered

Add a new nodeSelector criteria to existing plans such as upgrade.cattle.io/enabled=true or upgrade.cattle.io/managed-by=rancher that nodes should not normally have.
Set existing plan's concurrency to zero

Additional context
rancher/rancher#25371 (comment)

Many logs stating "error syncing" DesiredSet - Replace Wait batch/v1, Kind=Job

Version

0.2.0

Describe the bug

The following lines are logged many times during Plan execution:

E0206 21:05:57.849250       1 controller.go:135] error syncing 'system-upgrade/ubuntu-latest': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/upgrade-node-10-with-ubuntu-latest-at-cea9cf1a0a721c805e3-42e4b for system-upgrade-controller system-upgrade/ubuntu-latest, requeuing

To Reproduce

There is nothing special to do to replicate it, happens during normal Plan operations.

Expected behavior

Either no such messages are logged if they are not an issue or the issue is solved.

Nil pointer exception when there is no upgrade field

Version

0.2.0

Platform/Architecture

linux amd64
Describe the bug

bug

To Reproduce

apply a plan without an upgrade field

Expected behavior

a message indicating it is a no op, or that upgrade is blank

Actual behavior

nil pointer exception

develop examples for general purpose applications

If it's really a general purpose upgrade controller it should work with (and provide examples for):

ubuntu (see #19)
suse/sles (see #128)
centos
alpine

For each system above provide:

a reasonably secure, default installation for the controller
- should include an enumeration of the rbac needs for the default user in the controller namespace (aka whatever drain needs; delete pods, edit nodes)
a useable Plan
kubectl snippets (or other script assets) for managing application of Plan(s)

Error: WithLatestTag(): invalid tag format (on rancher/k3s-upgrade)

Version
not using system-upgrade-controller binary, but applying manifest 0.6.1

Platform/Architecture
K3S v1.17.4+k3s1

Describe the bug
After applying server-plan:

time="2020-06-24T09:11:05Z" level=error msg="WithLatestTag(): invalid tag format (on rancher/k3s-upgrade)"
E0624 09:11:05.321244 1 controller.go:135] error syncing 'system-upgrade/server-plan': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/stable: x509: certificate signed by unknown authority, handler system-upgrade-controller: failed to create system-upgrade/apply-server-plan-on-dvkube007188.nl.novamedia.com-with- batch/v1, Kind=Job for system-upgrade-controller system-upgrade/server-plan: Job.batch "apply-server-plan-on-dvkube007188.nl.novamedia.com-with-" is invalid: [metadata.name: Invalid value: "apply-server-plan-on-dvkube007188.nl.novamedia.com-with-": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is 'a-z0-9?(.a-z0-9?)'), spec.template.labels: Invalid value: "apply-server-plan-on-dvkube007188.nl.novamedia.com-with-": a valid label must be an empty string or consist of alphanumeric characters, '-', '' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9.])?[A-Za-z0-9])?')], requeuing

To Reproduce
Create controller:

kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.6.1/system-upgrade-controller.yaml

server-plan.yaml:

kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  channel: https://update.k3s.io/v1-release/channels/stable

Add plan:

kubectl apply -f server-plan.yaml

prevent job creation for unresolved plans

Version
v0.5.0 , v0.6.1

Describe the bug
Plan generating handler doesn't check if the plan is resolved or failed, which results in generation of invalid job names.

Expected behavior
No jobs should be created.

Actual behavior
Plan generating handler tries to create jobs with invalid names, so they fail to be created

Additional context
k3s-io/k3s#1954
#94

error getting latest release because "x509: certificate signed by unknown authority"

Version
system-upgrade-controller version v0.6.2 (9ed50a5)

Platform/Architecture
linux-amd64

Describe the bug
After applying a Plan, omitting the version field and setting the channel field to a URL that resolves to a release of k3 as described in the documentation, the controller will complain:

W1010 07:58:53.832507       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-10-10T07:58:53Z" level=info msg="Creating CRD plans.upgrade.cattle.io"
time="2020-10-10T07:58:54Z" level=info msg="Waiting for CRD plans.upgrade.cattle.io to become available"
time="2020-10-10T07:58:54Z" level=info msg="Done waiting for CRD plans.upgrade.cattle.io to become available"
time="2020-10-10T07:58:54Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2020-10-10T07:58:54Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
time="2020-10-10T07:58:54Z" level=info msg="Starting /v1, Kind=Node controller"
time="2020-10-10T07:58:54Z" level=info msg="Starting batch/v1, Kind=Job controller"
E1010 08:07:35.109182       1 controller.go:135] error syncing 'system-upgrade/server-plan': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: x509: certificate signed by unknown authority, requeuing

This repeats continuously.

To Reproduce

Install the system upgrade controller:

kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.6.2/system-upgrade-controller.yaml

Apply the following plan:

---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  channel: https://update.k3s.io/v1-release/channels/latest

Expected behavior
It should find the latest release.

Actual behavior
It doesn't find the latest release due to "certificate signed by unknown authority".

Additional context
I saw a similar issue in #94. The system does have /etc/ssl/, and the system is up to date.

Upgrading Fedora CoreOS hosts

This is more a prospective ticket than an actual feature request, and to gather insight about using SUC in this context.

There's currently a lack of option regarding upgrading a fleet of Fedora CoreOS based kubernetes cluster, whereas we had coreos/container-linux-update-operator for CoreOS.

FCOS uses the coreos/zincati tools check with a Cincinnati server for new upgrades and to download/apply them. The process can be controlled by another protocol called FleetLock that can ensure only one host reboots at a time. The problem is that this solution is not kubernetes aware, unlike SUC is. That's the reason when we migrated to FCOS that we based our auto-upgrade system on SUC.

I've used it successfully to upgrade FCOS hosts by using an upgrade script that calls the rpm-ostree tool (the underlying upgrade command). Unfortunately, this doesn't work completely (see coreos/fedora-coreos-tracker#536 for a discussion of the issue).
Notably the issue is that the last step of my naive upgrade script automatically executes a reboot which kills the container. The job is consequently marked as in error, and when the node reboots, the container is rescheduled again (hopefully not doing anything). This increases the time it takes to rollout an upgrade. I have yet to find a solution to this issue, because there's a race between making sure the machine reboots (to apply the update) and signalling that the update has been performed correctly to SUC.

The other issue is that I have to maintain manually the version number to upgrade to in the Plan. So for instance if there's a new FCOS version, I manually update thespec.version plan field to trigger the upgrade.

My initial plan was to develop a service that would on one side implement the SUC channel system and on the other side the Cincinnati protocol so that plan would be triggered when the Cincinnati server would report the existence of a new version.

In retrospect, I'm wondering if this shouldn't be part of SUC itself, instead of being in another service. Would you accept a PR to implement a configurable channel system, where one of the implementation would be the Cincinnati protocol ?

In short, beside my questions above, I'm wondering how we can better connect FCOS upgrade tools (Zincati, etc) to SUC to build a powerful k8s based FCOS upgrade system.

Thanks,

/cc @lucab

Handle k3s semver (v1.17.0+k3s.1)

Version

v0.2.0
Platform/Architecture

all
Describe the bug

When I try to upgrade to v1.17.0+k3s.1 the upgrade fails.

To Reproduce

Create a plan such as this

 apiVersion: upgrade.cattle.io/v1
  kind: Plan
  metadata:
    annotations:
    generation: 3
    name: k3s-up
    namespace: system-upgrade
  spec:
    concurrency: 1
    drain:
      force: true
    nodeSelector:
      matchExpressions:
      - key: k3s-upgrade
        operator: Exists
    serviceAccountName: system-upgrade
    upgrade:
      image: rancher/k3s-upgrade
    version: v1.17.0+k3s.1

Expected behavior

Upgrade succeeds

Actual behavior

Image pulls fails since rancher/k3s-upgrade:v1.17.0+k3s.1 is not a valid image

develop default/example plan for self-upgrades

it's upgrades all the way down. 🐢

pod names are identical for different plans with the same hash

Version
v0.3.0

Describe the bug
When two plans are created and they happen to have the same hashes:

latestversion
serviceaccount
secrets

A job with the same name will be attempted to run which creates an issue

To Reproduce
create two plans with the different names but the same versions and service accounts

Expected behavior
two separate jobs should run

Actual behavior
only one job will run which belong to the first plan created

support tolerations for Plans

Is your feature request related to a problem? Please describe.

As an operator of a mixed-architecture kubernetes cluster, I rely on node taints to control scheduling of workloads to arm (arm64) based nodes (the reason for this is that there are still a lot of things that don't support arm/arm64). I leverage tolerations to control scheduling of 'safe' workloads that will actually run on arm-based nodes.

Describe the solution you'd like

I want the ability to specify node tolerations when defining system-upgrade-controller Plan(s),
so that I can upgrade my arm/arm64 based k3s nodes.

Describe alternatives you've considered

I tried adding tolerations to the system-upgrade-controller Deployment spec in hopes that the associated upgrade jobs created based on the Plans would somehow inherit this toleration, but they did not.

re-home kubectl image

I don't think it belongs under this repo.

upgrade pod init fails on agent node

Version
system-upgrade-controller version v0.4.0 (437dfc5)

Platform/Architecture
linux-arm64

Describe the bug

$ k logs -n system-upgrade apply-agent-plan-on-kpi3-with-ba5956d01658fc0365-qsc6h -c prepare
pgrep is /usr/bin/pgrep
+ verify_system
+ type pgrep
+ '[' -x /host/sbin/openrc-run ]
+ '[' -d /host/run/systemd ]
+ HAS_SYSTEMD=true
+ return
+ setup_verify_arch
+ '[' -z  ]
+ uname -m
+ ARCH=aarch64
+ ARCH=arm64
+ SUFFIX=-arm64
+ get_k3s_process_info
+ pgrep k3s-
+ K3S_PID=328
+ '[' -z 328 ]
+ cat /host/proc/328/cmdline
+ awk '{print $1}'
+ head -n 1
+ K3S_BIN_PATH=/usr/local/bin/k3s
+ '[' -z /usr/local/bin/k3s ]
+ return
+ replace_binary
+ NEW_BINARY=/bin/k3s-arm64
+ '[' '!' -f /bin/k3s-arm64 ]
+ fatal 'The new binary /bin/k3s-arm64 doesn'"'"'t exist'
+ echo '[ERROR] ' 'The new binary /bin/k3s-arm64 doesn'"'"'t exist'
[ERROR]  The new binary /bin/k3s-arm64 doesn't exist
+ exit 1

To Reproduce
k3s cluster on rpi4 using Debian Buster 10, aarch64. Installed with k3sup.

Expected behavior
Init pods should succeed

Actual behavior
init pod fails on agent node

Additional context

$ k get plan -n system-upgrade agent-plan -o yaml

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"upgrade.cattle.io/v1","kind":"Plan","metadata":{"annotations":{},"name":"agent-plan","namespace":"system-upgrade"},"spec":{"channel":"https://update.k3s.io/v1-release/channels/stable","concurrency":1,"cordon":true,"nodeSelector":{"matchExpressions":[{"key":"node-role.kubernetes.io/master","operator":"DoesNotExist"}]},"prepare":{"args":["prepare","server-plan"],"image":"rancher/k3s-upgrade"},"serviceAccountName":"system-upgrade","upgrade":{"image":"rancher/k3s-upgrade"}}}
  creationTimestamp: "2020-07-10T21:53:18Z"
  generation: 1
  managedFields:
  - apiVersion: upgrade.cattle.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:channel: {}
        f:concurrency: {}
        f:cordon: {}
        f:nodeSelector:
          .: {}
          f:matchExpressions: {}
        f:prepare:
          .: {}
          f:args: {}
          f:image: {}
        f:serviceAccountName: {}
        f:upgrade:
          .: {}
          f:image: {}
    manager: kubectl
    operation: Update
    time: "2020-07-10T21:53:18Z"
  - apiVersion: upgrade.cattle.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:applying: {}
        f:conditions: {}
        f:latestHash: {}
        f:latestVersion: {}
    manager: system-upgrade-controller
    operation: Update
    time: "2020-07-11T00:53:20Z"
  name: agent-plan
  namespace: system-upgrade
  resourceVersion: "3518652"
  selfLink: /apis/upgrade.cattle.io/v1/namespaces/system-upgrade/plans/agent-plan
  uid: 93247109-e292-4634-9235-afeb1983b715
spec:
  channel: https://update.k3s.io/v1-release/channels/stable
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
status:
  applying:
  - kpi3
  conditions:
  - lastUpdateTime: "2020-07-11T00:53:20Z"
    reason: Channel
    status: "True"
    type: LatestResolved
  latestHash: ba5956d01658fc0365f57050a4ba7c2ab414e5eccc5b4f0d2b8799b2
  latestVersion: v1.18.4-k3s1

Controller fails to receive latest version info for k3os: "i/o timeout" on arm SBCs

Version

system-upgrade-controller version v0.5.0 (5671782)
Tested earlier with v0.4.0, with same results.

Platform/Architecture
arm

Describe the bug
Controller fails to receive version information from default k3os release channel (https://github.com/rancher/k3os/releases/latest) and subsequently can't create API objects. Manual query for the URL works and returns 302 as expected.

To Reproduce

Install k3OS v0.10.0 using "overlay" method on nodes (Odroid C2)
Run k3s version v1.17.4+k3s1 as a service from Armbian on master (BananaPi M1)
Install upgrade contoller and default k3os upgrade plan following https://github.com/rancher/k3os#pre-v090 - Note: controller install was necessary, it did not exist by default, probably due to the fact that master is not running k3os.

Expected behavior
Node upgrades start

Actual behavior
Cotroller fails to fetch version information from release channel

Additional context

Debug logs:

time="2020-06-14T21:25:40Z" level=info msg="Starting /v1, Kind=Node controller" func="github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run" file="/go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:100"
time="2020-06-14T21:25:40Z" level=info msg="Starting /v1, Kind=Secret controller" func="github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run" file="/go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:100"
time="2020-06-14T21:25:40Z" level=info msg="Starting batch/v1, Kind=Job controller" func="github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run" file="/go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:100"
time="2020-06-14T21:25:40Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller" func="github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run" file="/go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:100"
time="2020-06-14T21:25:40Z" level=debug msg="PLAN STATUS HANDLER: plan=k3os-system/k3os-latest@6048727, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func1" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:29"
time="2020-06-14T21:25:40Z" level=debug msg="Preparing to resolve \"https://github.com/rancher/k3os/releases/latest\"" func=github.com/rancher/system-upgrade-controller/pkg/upgrade/plan.ResolveChannel file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/plan/plan.go:95"
time="2020-06-14T21:25:40Z" level=debug msg="Sending &{Method:GET URL:https://github.com/rancher/k3os/releases/latest Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[X-SUC-Cluster-ID:[c908f738-a03c-4665-b26e-6189d688297d]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:github.com Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0x3bb1400}" func=github.com/rancher/system-upgrade-controller/pkg/upgrade/plan.ResolveChannel file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/plan/plan.go:106"
time="2020-06-14T21:26:10Z" level=debug msg="PLAN GENERATING HANDLER: plan=k3os-system/k3os-latest@6048727, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func2" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:67"
time="2020-06-14T21:26:10Z" level=debug msg="concurrentNodeNames = [\"my-first-node\"]" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func2" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:72"
time="2020-06-14T21:26:10Z" level=error msg="WithLatestTag(): invalid tag format (on rancher/k3os)" func=github.com/rancher/system-upgrade-controller/pkg/upgrade/container.WithLatestTag.func1 file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/container/container.go:56"
time="2020-06-14T21:26:10Z" level=debug msg="PLAN STATUS HANDLER: plan=k3os-system/k3os-latest@6048727, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func1" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:29"
time="2020-06-14T21:26:10Z" level=debug msg="Preparing to resolve \"https://github.com/rancher/k3os/releases/latest\"" func=github.com/rancher/system-upgrade-controller/pkg/upgrade/plan.ResolveChannel file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/plan/plan.go:95"
E0614 21:26:10.778128       1 controller.go:135] error syncing 'k3os-system/k3os-latest': handler system-upgrade-controller: Get https://github.com/rancher/k3os/releases/latest: dial tcp: i/o timeout, handler system-upgrade-controller: failed to create k3os-system/apply-k3os-latest-on-my-first-node-with- batch/v1, Kind=Job for system-upgrade-controller k3os-system/k3os-latest: Job.batch "apply-k3os-latest-on-my-first-node-with-" is invalid: [metadata.name: Invalid value: "apply-k3os-latest-on-my-first-node-with-": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "apply-k3os-latest-on-my-first-node-with-": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')], requeuing
time="2020-06-14T21:26:10Z" level=debug msg="Sending &{Method:GET URL:https://github.com/rancher/k3os/releases/latest Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[X-SUC-Cluster-ID:[c908f738-a03c-4665-b26e-6189d688297d]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:github.com Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0x3bb1400}" func=github.com/rancher/system-upgrade-controller/pkg/upgrade/plan.ResolveChannel file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/plan/plan.go:106"

controller keeps retrying plan even once plan is delete

Version

v0.3.0-m4

Platform/Architecture

linux amd64

Describe the bug

If an apply-k3s pod is erroring, the controller continues to attempt to run it even after the plan is deleted. If the pod is delete another one is created. I don't see any deployment or replicaset for this. This continues even if another plan is applied with the same name as the deleted plan.

To Reproduce

apply a plan with an incorrect image name
wait for apply-k3s pod to be created and error
delete plan and delete pod
4*. delete another plan with name of deleted plan that has proper image name

Expected behavior

Controller stops trying to execute plan and does not start a new pod if deleted.
Controller only run the new plan if same name is used.

Actual behavior

Controller keeps trying to execute plan.
Controller attempts to apply old plan and new plan (separate pods).

remove unnecessary toleration from controller deployment

The node.kubernetes.io/unschedulable toleration in the deployment manifest is development cruft, get rid of it. Why? Because:

in a multi-master situation, it could cause a rescheduled deployment pod to land on an unschedulable node

service account for upgrades

Is your feature request related to a problem? Please describe.
Right now the service account that the controller runs with (having cluster-admin role binding) is passed off to the pods that run upgrades. This is far to much permission and should be paired down to just enough to run the pods (at a minimum, whatever kubectl drain needs).

Describe the solution you'd like
Ability to specify a service account name for the upgrade job/pod to run with.

Describe alternatives you've considered
n/a

Additional context
n/a

Fail when update docker package

Version
0.6.1

Platform/Architecture
Ubuntu

Describe the bug
job will fail when updating docker package

To Reproduce
apt-get update
apt-get upgrade -y "docker-ce update avaliable"

Expected behavior
Update docker package without crashing

Actual behavior
Docker daemon crash

Additional context
Jun 23 14:41:21 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:21.552045334Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 14:41:21 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:21.559750482Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.258549430Z" level=info msg="Container 1dad5e30dd6e75b2e8c5d9dcd89a064db8ebf6186b6753e95dd5bf9d3a801fed failed to exit within 10 seconds of signal 15 - using the force"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.259538935Z" level=info msg="Container cc195360ad0c57786b0e0bd7e1d9a5e5b3e2121da6ce2b4cfcc846dd1e0704a3 failed to exit within 10 seconds of signal 15 - using the force"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.267219221Z" level=info msg="Container b188d2811e8ab6ff7375fd932a255571c804c5b29341e6e106d965a57e0b7fd7 failed to exit within 10 seconds of signal 15 - using the force"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.351366864Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.397157353Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.495799768Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.549875577Z" level=info msg="stopping event stream following graceful shutdown" error="" module=libcontainerd namespace=moby
Jun 23 14:41:31 worker-dcg-01 dockerd[1593]: time="2020-06-23T14:41:31.550094969Z" level=info msg="Daemon shutdown complete"

can trigger number of apply jobs greater than plan concurrency

Version

v0.3.0+

Platform/Architecture

All.

Describe the bug

#38 was only partially mitigated. Refactor plan.SelectConcurrentNodeNames() to add one node at a time to the ${plan}.status.applying

To Reproduce

It depends on internal timing but the most reliable way to reproduce is setup a Plan that is ready to apply but prevented by lack of a label then apply the gating label to multiple nodes in a single invocation of kubectl label.

Expected behavior

Number of running apply jobs for a plan in a cluster does not exceed ${plan}.spec.concurrency

Actual behavior

The number of running apply jobs for a plan in a cluster will typically exceed plan ${plan}.spec.concurrency roughly 30% of the time.

Additional context

See #39

Tolerations are ignored

Version
rancher/system-upgrade-controller:v0.4.0

Describe the bug
The plan tolerations are ignored. The jobs created from the plan do not have the tolerations specified in the plan definition.

To Reproduce
Create a new plan and add the following tolerations:

  tolerations:
  - effect: NoSchedule
    operator: Exists
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists

Deploy the plan and attempt to patch a Kubernetes control plane + etcd node (which should have the following taints):

Taints:             node-role.kubernetes.io/etcd=true:NoExecute
                    node-role.kubernetes.io/controlplane=true:NoSchedule

The tolerations section in the job that is created from the plan looks like this (all the tolerations defined in the plan are ignored):

      tolerations:
      - effect: NoSchedule
        key: node.kubernetes.io/unschedulable
        operator: Exists

Therefore the pod created from the job remains in pending state forever, with the following error:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  16s (x2 over 16s)  default-scheduler  0/12 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 11 node(s) didn't match node selector.

Expected behavior
The tolerations should be set as defined in the plan so that the job can be executed.

Actual behavior
The pod remains in pending state forever.

Deleting a plan when the plan is failing should return nodes to schedulable

Is your feature request related to a problem? Please describe.

After my upgrade plan failed due to rancher/k3s-upgrade#22 I deleted the plan, with the hope that the nodes would be returned to status Ready. The controller continued to schedule jobs and the node remained with status Ready,SchedulingDisabled.

Describe the solution you'd like
The controller should see that attempts to upgrade are failing and return the node back to status Ready. No more attempts to schedule the upgrade should be made until the upgrade plan is modified.

Describe alternatives you've considered
I manually ran kubectl uncordon my-node and all was fine.

Additional context

prepare prior to cordon/drain+upgrade

Is your feature request related to a problem? Please describe.
Would like to (optionally) run a process that must exit successfully before invoking the (optional) cordon/drain and then moving on to the upgrade process.

Describe the solution you'd like
Add prepare "container spec" that looks a lot like the current upgrade spec.

Describe alternatives you've considered
n/a

Additional context
n/a

update manifest with develops developed for rancher/rancher k3s upgrades

SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT: "99"
SYSTEM_UPGRADE_JOB_IMAGE_PULL_POLICY: Always

provide guidance when upgrading leveraging drain with kubectl 1.18+

Version
v0.6.2

Platform/Architecture
amd64

Describe the bug
Lacking guidance in the readme and examples it is exceedingly easy to set oneself up with never-ending upgrades due to pod disruption budgets (when using kubectl 1.18+).

To Reproduce
Setup a k3s cluster with one server and two agents and apply the examples/k3s-upgrade,yaml. With the default array of services (specifically local-path-storage and traefik) hung upgrades are a common occurence.

Expected behavior
The example upgrade completes successfully and in a timely manner.

Actual behavior
The example upgrade will often become hung or take hours instead of minutes.

Additional context
k3s-io/k3s#2248

plans pre v0.4.0 do not get deleted automatically and can cause periodic flapping

Version

v0.4.0

Platform/Architecture

linux/amd64

Describe the bug
Since v0.4.0 plans are created, by default, with the upgrade.cattle.io/ttl-seconds-after-finished annotation to allow the SUC to replicate the TTL Controller for Finished Resources style of garbage collection for the Jobs that it creates to apply Plans. Jobs created by SUC prior to v0.4.0 were left to be garbage-collected manually and it turns out that these legacy Jobs confuse the new enqueue-or-delete cleanup logic such that the controller will delete Jobs with the upgrade.cattle.io/ttl-seconds-after-finished annotation but then next cycle it will process the legacy jobs as complete and label the node as per their stale Plan Latest Hash. This causes a Job to be created to apply the Plan again (because of a mismatch in the value of the plan.upgrade.cattle.io/${plan.name} label) and if it has drain configured it will terminate non-DaemonSet Pods

To Reproduce

install k3OS v0.9.1
kubectl edit -n k3os-system plan/k3os-latest # set version=v0.10.0
kubectl label node -l k3os.io/mode plan.upgrade.cattle.io/k3os-latest=enabled --overwrite
watch the apply-k3os-latest-on-* Job get re-created and run approximately every 15 minutes

Expected behavior
The upgrade job should apply only once.

Actual behavior
The upgrade job will be applied periodically ad nauseum.

Additional context
rancher/k3os#432

system-upgrade-controller CrashLoopBackOff

Version
0.4.0

Platform/Architecture
rpi3 (armv7)

Describe the bug
Following this doc: https://rancher.com/docs/k3s/latest/en/upgrades/automated/#install-the-system-upgrade-controller I get CrashLoopBackOff on system-upgrade-controller pod.
Pod logs:

W0328 06:42:10.648220       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-03-28T06:42:11Z" level=fatal msg="Error starting: Get https://10.43.0.1:443/api/v1/namespaces/kube-system: dial tcp 10.43.0.1:443: connect: connection refused"

Upgrade fails on k3s high availability with embedded db

Version
v0.5.0

Platform/Architecture
linux-arm64

Describe the bug
system-upgrade-controller fails to upgrade a node that was the initial node in a high-availability cluster with embedded database started using the --cluster-init option as described in the docs

To Reproduce

Start a new high availability k3s cluster using the --cluster-init flag
Add system-upgrade-controller and add a server plan

Expected behavior
k3s is upgraded as expected

Actual behavior
apply-server-plan fails with the following output:

+ upgrade
+ get_k3s_process_info
+ ps -ef
+ grep -E 'k3s .*(server|agent)'
+ grep -E -v '(init|grep|channelserver)'
+ awk '{print $1}'
+ K3S_PID=
+ '[' -z  ]
+ fatal 'K3s is not running on this server'
+ echo '[ERROR] ' 'K3s is not running on this server'
[ERROR]  K3s is not running on this server
+ exit 1

Additional context
Running through the bash script step by step I found that the issue lies with grep -E -v '(init|grep|channelserver)', which matches init in --cluster-init. This removes the k3s pid from the list causing the script to find no PID.

Notification Support

Is your feature request related to a problem? Please describe.

There is no way of knowing when the upgrade will occur, it can happen at any time.

Describe the solution you'd like

I would to be able to post to a webhook when an upgrade fires.

Describe alternatives you've considered

None.

Additional context

Ideally integration with OpsGenie or something like that would be great.

Need to fully move to wrangler library

Integrating system upgrade controller into Rancher fails with the message below. This is due to Rancher using a more up-to-date version of norman and wrangler. Norman is being deprecated in favor of wrangler and this package has been moved to wrangler.

go get github.com/rancher/system-upgrade-controller
go: finding github.com/rancher/system-upgrade-controller v0.2.0
go: downloading github.com/rancher/system-upgrade-controller v0.2.0
go: extracting github.com/rancher/system-upgrade-controller v0.2.0
go: finding github.com/rancher/norman latest
build github.com/rancher/system-upgrade-controller: cannot load github.com/rancher/norman/pkg/openapi: module github.com/rancher/norman@latest found (v0.0.0-20200212222655-bf773d02101e), but does not contain package github.com/rancher/norman/pkg/openapi

uncordon after successful upgrade

Is your feature request related to a problem? Please describe.
After a job has completed successfully it is left to a human operator to uncordon an upgraded node (if it was cordoned or drained).

Describe the solution you'd like
After a plan has been successfully applied to a node with cordon/drain configured the controller should uncordon the node.

Describe alternatives you've considered
n/a

Additional context
n/a

Provide optional upgrade window setting to limit when a plan can run

Is your feature request related to a problem? Please describe.
It would be great if you could limit the execution of a plan to a specific timeframe, so that the upgrade jobs of a plan can for example only run:

between 2am and 4am
every Wednesday between 2pm and 6pm
...

Describe the solution you'd like
I was thinking about adding to configurations upgrade-window-start and upgrade-window-length that work similar to the reboot windows in the container-linux-update-operator (https://github.com/coreos/container-linux-update-operator/blob/master/doc/reboot-windows.md).
The configuration flags could be optionally added as annotations to a Node and the system-upgrade-controller would only schedule a Job on this node if the current time is currently within this timeframe.
By adding the configuration to the Node one could easily configure different upgrade windows on different set of nodes within a cluster. A node setting would affect all plans the same way.

Describe alternatives you've considered
Another option would be to add the configuration flags globally to the system-upgrade-controller as environment variables and/or CLI arguments, but this would decrease the flexibility.

In addition the configuration flag could als be set as a field in the spec of a plan, but I'd propose to discuss this independently, because it adds more complexity to the feature:

What happens when the upgrade window of a plan conflicts with the upgrade window of Nodes? The plan would never run them.
How would an upgrade window setting of a plan interact with a use case where you want to schedule plans in a cron like way? Is this even something that should be supported or is the channel feature a better solution for this use-case?

Because of this I feel starting with adding these settings to Nodes is a good first step, that would already solve a large amount of use-cases.

Additional context
Once we decide on an implementation, I'll be happy to implement it and create a PR for this.

Support Common Tolerations in Deployment

Added tolerations for "CriticalAddonsOnly" and "node-role.kubernetes.io/master".
The Deployment has nodeAffinity to "node-role.kubernetes.io/master", but has no tolerations.
If the master node is tainted, the Pod cannot be scheduled and stays in Pending.

by default only deploy on master nodes

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - {key: "node-role.kubernetes.io/master", operator: In, values: ["true"]}

automatic cleanup of upgrade jobs

Is your feature request related to a problem? Please describe.
This controller purposely orphans upgrade jobs after they have been scheduled to make reasoning about system state less of a shot-in-the-dark but this can result in jobs and their pods, long since complete, hanging around and polluting kubectl output. We should consider cleaning up stale jobs every so often.

Describe the solution you'd like
Upgrade jobs of a certain age, regardless of the status of plans which were applied, should be cleaned up and removed from the system (this includes their pods). This would probably be most easily accomplished by integrating with the TTL Controller for Finished Resources

Describe alternatives you've considered
Mark and sweep effected by this controller itself.

Additional context
See the kubectl output in #23. Pods that incur a reboot as part of the upgrade job will in many cases end up in the Unknown state along with a second run of the pod. This may be confusing and shouldn't persist eternally.

drain options: add skip-wait-for-delete-timeout

add skip-wait-for-delete-timeout option to fix unsuccessful deletion of pods because of waitForDelete after the node is labeled for drain.

Reboot handling

Is your feature request related to a problem? Please describe.

In the current implementation the state of the upgrade process is determined by the Job status. If the Job is successful, the process is considered done and the node gets uncordoned. However if the Job is unsuccessful, no uncordoning happens. If the upgrade process f.e. just issues a reboot during it's work, the Job is most likely failing as the node gets suddenly rebooted and as such never gets uncordoned.

Also if you defer rebooting f.e. by using shutdown -r [TIME], it gets worse as the node is uncordoned and later rebooted without another proper draining.

Describe the solution you'd like

One solution to this issue could be to implement the reboot logic into SUC as a toplevel configuration in the Plan, like spec.reboot: true and issue a reboot when the Job is considered successful. Then track the reboot process and uncordon only when the node is properly back from reboot. This is something which is perfectly working in Kured, so probably cheating a bit there to implement a similar way could be a good way to start.

Describe alternatives you've considered

An alternative could be to combine SUC and Kured, but this would lead into draining the node two times, once by SUC and once by Kured which is considered suboptimal.

Pods with a pod disruption budget are not evicted

Version
rancher/system-upgrade-controller:v0.5.0

Describe the bug
System upgrade jobs running in hosts that have any pods with a pod disruption budget are stuck forever initializing.

To Reproduce
Run any workload that has a pod disruption budget and try to apply an upgrade plan. The job will be stuck on initializing forever on the node where the pod with the pod disruption budget is running.

Expected behavior
There should be a way to drain pods with pod disruption budget

Actual behavior
I added the following drain options to my plan:

  drain:
    deleteLocalData: true
    ignoreDaemonSets: true
    force: true

However, nodes that have any pods with a pod disruption budget are stuck forever initializing. The drain init container is stuck with the following error (that repeats every 5 seconds):

evicting pod "<pod-name>"
error when evicting pod "<pod-name>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

If I manually remove those pods (using --force --grace-period=0), then the drain container is able to finish and the plan is applied successfully.

secrets

Is your feature request related to a problem? Please describe.
Unable to inject any information into the upgrade pod/container other than what is on the host/node filesystem. We would like to inject secrets.

Describe the solution you'd like
The PlanSpec should have a way to specify name and mount-path for secrets available to the apply job/pod.

Describe alternatives you've considered
Leveraging PodPresets was considered but they are applied passively. Ideally we would like the update of the content of a secret to trigger application of the upgrade job.

K3S Automatic upgrade is not working

Version:
v1.17.5+k3s1

Describe the bug
I have followed the process as mentioned in k3s automatic upgrade. New image is getting pulled successfully and container is also started. However, when i tried running this command : Kubectl get nodes , I am unable to see the upgraded version. I have only one master machine.

To Reproduce

Deploy the system-upgrade-controller.yaml (v0.5.0)
Deploy k3s-upgrade.yaml
Below is the yaml file
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: k3s-server
namespace: system-upgrade
spec:
concurrency: 1
nodeSelector:
matchExpressions:
- {key: k3s-upgrade, operator: Exists}
  serviceAccountName: system-upgrade
  drain:
  force: true
  upgrade:
  image: rancher/k3s-upgrade
  version: v1.17.6+k3s1

Expected behavior
k3s version should be ugraded.

Actual behavior
Unable to see the upgraded k3s version.

Additional context / logs
root@sagar-VM:/home/sagar/Downloads/system-upgrade-controller/manifests# kubectl get nodes
NAME STATUS ROLES AGE VERSION
sagar-vm Ready master 4d18h v1.17.5+k3s1

root@sagar-VM:/home/sagar/Downloads/system-upgrade-controller/manifests# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
logging fluent-bit-r4wzj 1/1 Running 0 4d19h
kube-system svclb-traefik-65tmw 2/2 Running 0 4d19h
system-upgrade system-upgrade-controller-c75459c4-2kwvc 1/1 Running 0 97m
system-upgrade apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 0/1 Completed 0 66s
kube-system local-path-provisioner-58fb86bdfd-fbnmf 1/1 Running 0 63s
kube-system metrics-server-6d684c7b5-wc6s5 1/1 Running 0 63s
kube-system coredns-6c6bb68b64-m286b 1/1 Running 0 63s
kube-system traefik-7b8b884c8-qg5z9 1/1 Running 0 63s

root@sagar-VM:/home/sagar/Downloads/system-upgrade-controller/manifests# kubectl logs apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-8djm7 -n system-upgrade

upgrade
get_k3s_process_info
grep -E -v '(init|grep|channelserver)'
grep -E 'k3s .*(server|agent)'
awk '{print $1}'
ps -ef
K3S_PID=27838
'[' -z 27838 ]
info 'K3S binary is running with pid 27838'
echo '[INFO] ' 'K3S binary is running with pid 27838'
[INFO] K3S binary is running with pid 27838
cat /host/proc/27838/cmdline
awk '{print $1}'
head -n 1
K3S_BIN_PATH=/usr/local/bin/k3s
'[' 27838 '==' 1 ]
'[' -z /usr/local/bin/k3s ]
return
replace_binary
NEW_BINARY=/opt/k3s
FULL_BIN_PATH=/host/usr/local/bin/k3s
'[' '!' -f /opt/k3s ]
info 'Comparing old and new binaries'
echo '[INFO] ' 'Comparing old and new binaries'
[INFO] Comparing old and new binaries
sha256sum /opt/k3s /host/usr/local/bin/k3s
cut '-d ' -f1
uniq
wc -l
sha256sum: can't open '/host/usr/local/bin/k3s': No such file or directory
BIN_COUNT=1
'[' 1 '==' 1 ]
info 'Binary already been replaced'
echo '[INFO] ' 'Binary already been replaced'
exit 0

root@sagar-VM:/home/sagar/Downloads/system-upgrade-controller/manifests# kubectl get events -n system-upgrade
LAST SEEN TYPE REASON OBJECT MESSAGE
3m25s Normal SuccessfulCreate job/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-d1c05 Created pod: apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2
Normal Scheduled pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Successfully assigned system-upgrade/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 to sagar-vm
3m25s Normal Pulling pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Pulling image "rancher/kubectl:v1.18.2"
3m22s Normal Pulled pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Successfully pulled image "rancher/kubectl:v1.18.2"
3m22s Normal Created pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Created container drain
3m22s Normal Started pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Started container drain
2m49s Normal Pulling pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Pulling image "rancher/k3s-upgrade:v1.17.6-k3s1"
2m46s Normal Pulled pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Successfully pulled image "rancher/k3s-upgrade:v1.17.6-k3s1"
2m46s Normal Created pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Created container upgrade
2m46s Normal Started pod/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 Started container upgrade

root@sagar-VM:/home/sagar/Downloads/system-upgrade-controller/manifests# kubectl describe pod apply-k3s-server-on-shishir-vm-with-842bd0625d5c3225c5f50-tm7j2 -n system-upgrade
Name: apply-k3s-server-on-shishir-vm-with-842bd0625d5c3225c5f50-tm7j2
Namespace: system-upgrade
Priority: 0
Node: sagar-vm/192.168.225.36
Start Time: Mon, 15 Jun 2020 15:12:15 +0530
Labels: controller-uid=41bbebec-2d8d-4708-852d-45836c367d08
job-name=apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-d1c05
plan.upgrade.cattle.io/k3s-server=842bd0625d5c3225c5f50ac41b38ff3b5fc8e73d154bf982e19795df
upgrade.cattle.io/controller=system-upgrade-controller
upgrade.cattle.io/node=sagar-vm
upgrade.cattle.io/plan=k3s-server
upgrade.cattle.io/version=v1.17.6-k3s1
Annotations:
Status: Succeeded
IP: 192.168.225.36
IPs:
IP: 192.168.225.36
Controlled By: Job/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-d1c05
Init Containers:
drain:
Container ID: containerd://5ebe8a85625ecce15475e457982415ea17533c19a6828c7dd1b015be24eda715
Image: rancher/kubectl:v1.18.2
Image ID: docker.io/rancher/kubectl@sha256:0f19fc9ff70b07771ac3cbc701e118fb06c0c0d97a9605e4298c2c2469a83f05
Port:
Host Port:
Args:
drain
sagar-vm
--pod-selector
!upgrade.cattle.io/controller
--ignore-daemonsets
--delete-local-data
--force
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 15 Jun 2020 15:12:18 +0530
Finished: Mon, 15 Jun 2020 15:12:50 +0530
Ready: True
Restart Count: 0
Environment:
SYSTEM_UPGRADE_NODE_NAME: (v1:spec.nodeName)
SYSTEM_UPGRADE_POD_NAME: apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 (v1:metadata.name)
SYSTEM_UPGRADE_POD_UID: (v1:metadata.uid)
SYSTEM_UPGRADE_PLAN_NAME: k3s-server
SYSTEM_UPGRADE_PLAN_LATEST_HASH: 842bd0625d5c3225c5f50ac41b38ff3b5fc8e73d154bf982e19795df
SYSTEM_UPGRADE_PLAN_LATEST_VERSION: v1.17.6-k3s1
Mounts:
/host from host-root (rw)
/run/system-upgrade/pod from pod-info (ro)
/var/run/secrets/kubernetes.io/serviceaccount from system-upgrade-token-f2wxr (ro)
Containers:
upgrade:
Container ID: containerd://ff191c6f4f0d74d30b5bb31ad72fba6d332f1efc3639e697fdc91c93d701a211
Image: rancher/k3s-upgrade:v1.17.6-k3s1
Image ID: docker.io/rancher/k3s-upgrade@sha256:2d5fcc0642d5174c831fe32073db8202b7ba8540b1e156fbe781a53d3c853872
Port:
Host Port:
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 15 Jun 2020 15:12:54 +0530
Finished: Mon, 15 Jun 2020 15:12:54 +0530
Ready: False
Restart Count: 0
Environment:
SYSTEM_UPGRADE_NODE_NAME: (v1:spec.nodeName)
SYSTEM_UPGRADE_POD_NAME: apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 (v1:metadata.name)
SYSTEM_UPGRADE_POD_UID: (v1:metadata.uid)
SYSTEM_UPGRADE_PLAN_NAME: k3s-server
SYSTEM_UPGRADE_PLAN_LATEST_HASH: 842bd0625d5c3225c5f50ac41b38ff3b5fc8e73d154bf982e19795df
SYSTEM_UPGRADE_PLAN_LATEST_VERSION: v1.17.6-k3s1
Mounts:
/host from host-root (rw)
/run/system-upgrade/pod from pod-info (ro)
/var/run/secrets/kubernetes.io/serviceaccount from system-upgrade-token-f2wxr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType: Directory
pod-info:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
system-upgrade-token-f2wxr:
Type: Secret (a volume populated by a Secret)
SecretName: system-upgrade-token-f2wxr
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message

Normal Scheduled default-scheduler Successfully assigned system-upgrade/apply-k3s-server-on-sagar-vm-with-842bd0625d5c3225c5f50-tm7j2 to sagar-vm
Normal Pulling 5m39s kubelet, sagar-vm Pulling image "rancher/kubectl:v1.18.2"
Normal Pulled 5m36s kubelet, sagar-vm Successfully pulled image "rancher/kubectl:v1.18.2"
Normal Created 5m36s kubelet, sagar-vm Created container drain
Normal Started 5m36s kubelet, sagar-vm Started container drain
Normal Pulling 5m3s kubelet, sagar-vm Pulling image "rancher/k3s-upgrade:v1.17.6-k3s1"
Normal Pulled 5m kubelet, sagar-vm Successfully pulled image "rancher/k3s-upgrade:v1.17.6-k3s1"
Normal Created 5m kubelet, sagar-vm Created container upgrade
Normal Started 5m kubelet, sagar-vm Started container upgrade

kubernetes 1.18

the latest and greatest

Internal DNS resolution not working

Version
v0.5

Platform/Architecture
K3s - v1.18.2+k3s1

Describe the bug
The jobs generated by SUC can't resolve k8s services names. This can be fixed by setting the DNS policy to ClusterFirstWithHostNet: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pods-dns-policy

To Reproduce

dig pushgw.monitoring.svc.cluster.local within the pods generated by a plan

Expected behavior
Resolution of k8s services.

Actual behavior
Can't resolve the k8s services. dig +short pushgw.monitoring.svc.cluster.local returns nothing.

Additional context
Use case: we have a script that will push some package information to a prometheus push gateway during an Ubuntu upgrade.

Usage with package manager based Operating Systems

Is your feature request related to a problem? Please describe.

I want to use SUC with classic package manager based operating systems like Ubuntu, RHEL or Alpine which IMHO and according to the README ("Purporting to support general-purpose node upgrades...") should be supported by SUC.

In this issue I want to figure out this use-case in detail and what's needed for SUC to fully support it (hint: It doesn't currently).

Describe the solution you'd like

In my experiments with Ubuntu I crafted the following plan:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: ubuntu-latest
  namespace: system-upgrade
spec:
  concurrency: 1
  version: latest
  nodeSelector:
    matchExpressions:
      - {key: plan.upgrade.cattle.io/ubuntu-latest, operator: Exists}
  serviceAccountName: system-upgrade
  drain:
    force: true
  upgrade:
    image: busybox
    command:
      - chroot
      - /host
    args:
      - sh
      - -c
      - |
        export DEBIAN_FRONTEND=noninteractive && \
        apt-get -q update && \
        apt-get -qy dist-upgrade --auto-remove --purge && \
        [ -f /var/run/reboot-required ] && cat /var/run/reboot-required && reboot

To make this plan work, the following tweak is needed to get working network connectivity in the upgrade Pod:

--- a/pkg/upgrade/job/job.go
+++ b/pkg/upgrade/job/job.go
@@ -99,6 +99,7 @@ func NewUpgradeJob(plan *upgradeapiv1.Plan, nodeName, controllerName string) *ba
                                Spec: corev1.PodSpec{
                                        HostIPC:            true,
                                        HostPID:            true,
+                                       HostNetwork:        true,
                                        ServiceAccountName: plan.Spec.ServiceAccountName,
                                        Affinity: &corev1.Affinity{
                                                NodeAffinity: &corev1.NodeAffinity{

With that, the Plan does it's job, creates Jobs and runs Pods which will actually run upgrades on the nodes. So far so good. Now to the questions:

As installing updates via package manager there is no such thing like "channels", at least not without using something like Satellite or a similar tool. But even then, the channel or version doesn't map to a container image tag. Are there already ideas to solve this?
How could the controller be ableto figure out to undrain a node after upgrades with reboots? It seems not to work with reboots because after a reboot the Pod get's rescheduled but the node is still drained.
What do you think about this mechanism of upgrading OS packages?

Describe alternatives you've considered

SUC is THE BEST, there are no real alternatives 😄

Do not force upgrade container tag to be status.latestVersion

This is somehow both a bug report and a feature request.

Version

v0.4.0

Platform/Architecture

Fedora CoreOS

Describe the bug

We were planning to use the System Upgrade Controller to upgrade a fleet of Fedora CoreOS VM. The way the System Upgrade Controller is working fits our model: we could update the plan version field with the latest released Fedora CoreOS version (manually or automatically), and the upgrade container would just be an alpine container that chroot in the image and run the necessary commands to upgrade the OS and reboot to the intended version.

Except that it doesn't work because the System Upgrade Controller substitutes the upgrade image tag (provided in the plan) by the status.latestVersion (ie the plan version).
I understand that's how rancher is updated, but that breaks a whole lot of workflows where the upgrade container image is not that important.

I understand that I can workaround this and publish locally an alpine image tagged with the current version (possibly in the prepare container, since it's not yet subject to tag overwriting see #41 ), but that'd be awesome, IMHO to not overwrite a specified image tag in both the prepare or upgrade container:

IMHO, the behavior should be:

if the upgrade or prepare container image has the form image:tag, then use this explicitely
if the upgrade or prepare container image has the form image (without a tag) then use the latestVersion as the image tag.

This allows both workflows to coexist.

To Reproduce

Use the following plan:

apiVersion: upgrade.cattle.io/v1
kind: Plan

metadata:
  name: fedora-coreos-asmodee-production
  namespace: system-upgrade
  labels:
    addonmanager.kubernetes.io/mode: Reconcile

spec:
  concurrency: 1
  version: "31.20200323.3.2"
  nodeSelector:
    matchExpressions:
      - {key: role, operator: In, values: ["whatever","dont-care"]}

  serviceAccountName: system-upgrade

  secrets:
    - name: fedora-coreos
      path: /host/run/system-upgrade/fedora-coreos
  upgrade:
    image: alpine:3.11
    command: ["chroot", "/host"]
    args: ["/usr/bin/bash", "/run/system-upgrade/fedora-coreos/upgrade.sh"]

Expected behavior
It should run the /run/system-upgrade/fedora-coreos/upgrade.sh on every node after chrooting to the host.

Actual behavior

It fails scheduling the upgrade pod on the nodes with:

Failed to pull image "alpine:31.20200323.3.2": rpc error: code = Unknown desc = Error response from daemon: manifest for alpine:31.20200323.3.2 not found

Nil Pointer When Deleting a Plan

Version
v0.4.0

Platform/Architecture
linux-amd64

Describe the bug
When deleting a plan SUC will crash with a nil pointer, also the jobs won't get deleted.

To Reproduce

Start clean Ubuntu vagrant machine
curl -sfL https://get.k3s.io | sh -
curl -s "https://raw.githubusercontent.com/ kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
mv kustomize /usr/local/bin
kustomize build https://github.com/rancher/system-upgrade-controller | kubectl apply -f -
kubectl label node agent1 plan.upgrade.cattle.io/bionic=true
Apply example plan: https://github.com/rancher/system-upgrade-controller/blob/master/examples/ubuntu/bionic.yaml
Delete the plan

Expected behavior

Operator should not crash
Jobs associated with the plan should also get deleted

Actual behavior
Log output:

root@agent1:~# kubectl logs -f -n system-upgrade system-upgrade-controller-9bbf49c85-lm44s
W0421 11:55:00.376837       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-04-21T11:55:00Z" level=info msg="Updating CRD plans.upgrade.cattle.io"
time="2020-04-21T11:55:00Z" level=info msg="Starting /v1, Kind=Node controller"
time="2020-04-21T11:55:00Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2020-04-21T11:55:00Z" level=info msg="Starting batch/v1, Kind=Job controller"
time="2020-04-21T11:55:00Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
E0421 12:04:25.895701       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 125 [running]:
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1448560, 0x241b030)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1448560, 0x241b030)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handleSecrets.func1(0xc000428360, 0x15, 0x0, 0x24, 0xc1000000c0e, 0x40c44b)
        /go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_core.go:47 +0x10c
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler-api/pkg/generated/controllers/core/v1.FromSecretHandlerToHandler.func1(0xc000428360, 0x15, 0x0, 0x0, 0xc000043cb0, 0x24, 0x16948a0, 0xc00011f9e0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler-api/pkg/generated/controllers/core/v1/secret.go:96 +0xdb
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Handlers).Handle(0xc0004113e0, 0xc000428360, 0x15, 0x0, 0x0, 0x15, 0x0, 0x0, 0xc000411100)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/handlers.go:27 +0x100
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).syncHandler(0xc0000d0700, 0xc000428360, 0x15, 0x10894e1, 0xc00012d0c8)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:171 +0x122
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).processSingleItem(0xc0000d0700, 0x13db1a0, 0xc0004680c0, 0x0, 0x0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:156 +0xf0
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).processNextWorkItem(0xc0000d0700, 0xc000244668)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:133 +0x51
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).runWorker(0xc0000d0700)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:122 +0x2b
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00021be60)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00021be60, 0x3b9aca00, 0x0, 0x13db101, 0xc000086ae0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00021be60, 0x3b9aca00, 0xc000086ae0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:104 +0x178
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x128bfac]

goroutine 125 [running]:
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1448560, 0x241b030)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handleSecrets.func1(0xc000428360, 0x15, 0x0, 0x24, 0xc1000000c0e, 0x40c44b)
        /go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_core.go:47 +0x10c
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler-api/pkg/generated/controllers/core/v1.FromSecretHandlerToHandler.func1(0xc000428360, 0x15, 0x0, 0x0, 0xc000043cb0, 0x24, 0x16948a0, 0xc00011f9e0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler-api/pkg/generated/controllers/core/v1/secret.go:96 +0xdb
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Handlers).Handle(0xc0004113e0, 0xc000428360, 0x15, 0x0, 0x0, 0x15, 0x0, 0x0, 0xc000411100)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/handlers.go:27 +0x100
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).syncHandler(0xc0000d0700, 0xc000428360, 0x15, 0x10894e1, 0xc00012d0c8)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:171 +0x122
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).processSingleItem(0xc0000d0700, 0x13db1a0, 0xc0004680c0, 0x0, 0x0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:156 +0xf0
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).processNextWorkItem(0xc0000d0700, 0xc000244668)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:133 +0x51
github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).runWorker(0xc0000d0700)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:122 +0x2b
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00021be60)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00021be60, 0x3b9aca00, 0x0, 0x13db101, 0xc000086ae0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00021be60, 0x3b9aca00, 0xc000086ae0)
        /go/src/github.com/rancher/system-upgrade-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic.(*Controller).run
        /go/src/github.com/rancher/system-upgrade-controller/vendor/github.com/rancher/wrangler/pkg/generic/controller.go:104 +0x178

optionally use latest version for prepare container image

Is your feature request related to a problem? Please describe.
The prepare init container image is static and does not mutate as with the upgrade container. With use-cases such as k3s-upgrade that use essentially the same image for both prepare and upgrade it would be nice to have a way to let the controller know to use the same tag for prepare as is used for `upgrade.

Describe the solution you'd like
A way to specify on a Plan that the exact same image:tag should be used for the prepare container as is used for the upgrade container.

Describe alternatives you've considered
For Plans that have their version set explicitly they should also update the prepare container to use the same tag as will be used by upgrade.

Additional context
n/a

sonobuoy plugin for tests

write tests
publish image

rancher / system-upgrade-controller Goto Github PK

system-upgrade-controller's Introduction

System Upgrade Controller

Introduction

Presentations and Recordings

April 14, 2020

March 4, 2020

Considerations

Deploying

Example Plans

Building

Running

Testing

License

system-upgrade-controller's People

Contributors

Stargazers

Watchers

Forkers

system-upgrade-controller's Issues

Recommend Projects

Recommend Topics

Recommend Org