k8up-io / k8up
Kubernetes and OpenShift Backup Operator
Home Page: https://k8up.io/
License: Apache License 2.0
The operator should have the ability to limit how many jobs run concurrently.
This limit should be configurable by job type (prune, backup, check, etc.).
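A minimal sketch of how this could be configured on the operator Deployment, assuming hypothetical environment variables (the names below are illustrative only, not an existing K8up interface):
# Hypothetical operator settings; variable names are illustrative only.
env:
  - name: BACKUP_GLOBAL_CONCURRENT_BACKUP_JOBS_LIMIT
    value: "3"
  - name: BACKUP_GLOBAL_CONCURRENT_PRUNE_JOBS_LIMIT
    value: "1"
  - name: BACKUP_GLOBAL_CONCURRENT_CHECK_JOBS_LIMIT
    value: "1"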
Add unit tests to all packages in the operator.
Some operator-sdk guidelines on adding tests: https://sdk.operatorframework.io/docs/building-operators/golang/testing/
With #114 merged the new implementation lives on the development branch.
Description of the issue:
The k8up application does not provide options to set the region.
I think this makes it compatible with us-east-1 only: github.com/minio/minio-go/pull/1188
Error message:
Connection to S3 endpoint not possible: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'xx-yyy'
k8up version: docker.io/vshn/k8up:v0.1.6
Proposed Solution:
Allow region as another env variable
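For illustration, a region setting could sit next to the existing global S3 variables on the operator; the BACKUP_GLOBALS3REGION name below is an assumption of this proposal, not an existing variable:
# Sketch only; BACKUP_GLOBALS3REGION does not exist in v0.1.6.
env:
  - name: BACKUP_GLOBALS3ENDPOINT
    value: https://s3.eu-central-1.amazonaws.com
  - name: BACKUP_GLOBALS3BUCKET
    value: k8up-backup
  - name: BACKUP_GLOBALS3REGION
    value: eu-central-1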
As K8up User
I want more information in the status fields of objects
So that I know in what state my backups and restores are and where to investigate failures.
Error handling is currently done via logs. But we cannot assume that every K8up user has access to view the K8up operator logs. Thus we should report errors directly via the status field of the affected resource.
Given an existing K8up CRD object
When K8up reconciles an object
Then the object's .status field should be updated
Given an existing K8up CRD object
When K8up encounters an error during reconcile
Then the object's .status field should be updated with a clear error message
Given an existing K8up CRD object with error in the status field
When K8up reconciles successfully after a failed attempt
Then the object's .status field should be cleared of any errors.
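As an illustration of these acceptance criteria, a populated status could look roughly like the following; the exact field names are assumptions, not a defined K8up schema:
status:
  started: true
  finished: false
  conditions:
    - type: Ready
      status: "False"
      reason: BackupFailed
      message: "Connection to S3 endpoint not possible"
Once a later reconcile succeeds, the error condition would be cleared again.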
As a user
I want to have composable configuration
So that I can re-use defaults and repository configs
Currently, defaults for backups can be defined by setting environment variables. While this serves its purpose, it's not very flexible and rather cumbersome to configure.
So I propose a new way to configure and schedule backups. We should split the configuration:
Given a `k8up.io/v2/Schedule` spec
When I refer to a `k8up.io/v2/JobPlan` spec
Then K8up can spawn backups using the configuration provided in the `JobPlan` spec.
Given a `k8up.io/v2/JobPlan` spec
When I specify what PVCs I want backed up
Then K8up will only backup those PVCs
Given a `k8up.io/v2/JobPlan` spec
When I specify what pods I want for prebackup commands
Then K8up will only backup those pods via prebackup commands
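A rough sketch of how the split could look; both kinds, the apiVersion and all field names below are part of this proposal and do not exist today:
apiVersion: k8up.io/v2
kind: JobPlan
metadata:
  name: default-plan
spec:
  backend:
    s3:
      endpoint: https://objects.cloudscale.ch
      bucket: k8up_backup
  pvcs:
    - my-data            # only the listed PVCs are backed up
---
apiVersion: k8up.io/v2
kind: Schedule
metadata:
  name: daily-backup
spec:
  jobPlanRef:
    name: default-plan   # re-use the defaults and repository config from the plan
  backup:
    schedule: '0 2 * * *'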
Add the archive job logic to the new implementation: https://github.com/vshn/k8up/tree/master/service/archive
The controller is already implemented for that. Now we need to implement an Executor that actually triggers the job.
With #114 merged the new implementation lives on the development branch.
Dear all
I found another bug. Here is what I did:
Deleting the pods with STATUS=ContainerCreating gives them a second attempt and usually works.
Unless at least one of the PVCs has been deleted in the meantime.
Then all jobs created will be STATUS=pending forever: default-scheduler persistentvolumeclaim "gitlab-prometheus" not found.
This could also happen if between the automated creation of a job and the execution of the pod any of the found PVCs get deleted.
Workaround: delete old entries in jobs.batch manually.
Are there plans to automate this? Or mitigate it in another way?
From the looks of it, k8up then stops creating new backups.
Cheers,
Stefan
Implement a mechanism to backup Kubernetes objects. Probably based on #12.
refs APPU-1626
As K8up user
I want to interact with K8up snapshots via the K8s API
So that I don't need any other tools to trigger a restore
Right now if a user wants to trigger a restore we need to use restic to find the right snapshot. After the right snapshot is found we then have to create a K8up restore with the given ID.
Given a K8up snapshot,
When using `kubectl -n $namespace get snapshots`,
Then present an accurate list of snapshots to the user for this namespace
Given a K8up snapshot,
When using `kubectl -n $namespace get snapshots`,
Then list all snapshots for this namespace,
And include paths, tags, date and ID for each snapshot
Given a K8up snapshot,
When listing it via the K8s API for namespace `default`,
Then only list snapshots from the namespace `default`
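To illustrate the idea, a snapshot surfaced through the K8s API could look roughly like this; kind, group and fields are assumptions for this story, not an existing resource:
apiVersion: backup.appuio.ch/v1alpha1   # assumed group/version for illustration
kind: Snapshot
metadata:
  name: 7ebf084d
  namespace: default
status:
  id: 7ebf084d04
  date: "2021-03-01T02:00:00Z"
  paths:
    - /data/test-claim0
  tags:
    - daily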
restic snapshots needs the repository configurations
hostname field for PVC backups
Hi all
I am currently guessing from the code that it should work without a functioning backup.
However, is a prebackup-pod needed? (Suggested by #10)
Best,
Stefan
Given the announcement of Project Syn
Is Project Syn going to take precedence over this repository?
Should I skip straight to Project Syn, or will support for k8up continue?
Thanks !
Add the prune job logic to the new implementation: https://github.com/vshn/k8up/tree/master/service/prune
The controller is already implemented for that. Now we need to implement an Executor that actually triggers the job.
That one should be quite trivial to implement.
With #114 merged the new implementation lives on the development branch.
The restic command seems to run with uid 1001 and is therefore not able to back up files owned by other users (e.g. root).
Is there a possibility to run the restic command as root (by setting the securityContext of the job pod, for example)?
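For reference, in plain Kubernetes terms this would mean a securityContext like the one below on the backup job's pod; whether and how K8up exposes such a knob would need to be confirmed:
# Generic Kubernetes snippet; K8up would have to set this on the job it creates.
securityContext:
  runAsUser: 0   # run restic as root so it can read files owned by other users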
As K8up user
I want to specify smart schedules
So that I can let K8up figure out optimal schedules for optimal resource usage
The scheduler should be able to accept "auto" or "smart" schedules.
These can be similar to cron's predefined schedules:
They should behave in such a way that the jobs will be run at least once during that defined time. The old cron syntax should still be supported for use cases where specific time is necessary.
The idea for the auto schedules would be to be triggered at any time in the given frame. For example daily should mean that the job should run at least once every day sometime between 00:00 and 23:59. When exactly should be determined by the operator.
This feature is intended mostly for jobs that need exclusive access to the backup repository like prune and check. They don't have any impact on the applications and can thus run whenever no backups are running.
For prune and check jobs the operator has to figure out the best time between backups to a repository when the jobs can run. One idea could be that the prune could be triggered right after all backups have finished. That would eliminate the need for a separate prune schedule completely.
Additional from #118:
Also, the cron library used by k8up has some predefined schedules like @daily or @weekly, so for the auto schedules we'd have to define them without an @ so as not to break those pre-defined schedules: https://pkg.go.dev/github.com/robfig/cron#hdr-Predefined_schedules
And finally, the cron library supports intervals which could come in very handy for this feature.
Given a schedulable K8up object
When a standardized cron syntax is specified
Then schedule the resource at the specified times
(this keeps existing functionality)
Given a schedulable K8up object
When a non-standardized predefined cron syntax is specified
Then schedule the resource at randomized times within the given timeframe, with a stable randomization seed
(e.g. @hourly-randomized could result in a <random number between 0 and 59> * * * * schedule, in contrast to the predefined 0 * * * *, which would defeat the purpose of smart scheduling)
Given a schedulable K8up object
When a non-standardized predefined cron syntax with @every is specified
Then schedule the resource at randomized times with the given interval
(e.g. @every 2h could result in a schedule that runs every 2h with a random start time)
Given a schedulable K8up object
When any sort of non-standardized predefined cron syntax is specified
Then store the resulting schedule in the status field of the object
(to make the actual schedule transparent for the user)
@hourly -> same as https://pkg.go.dev/github.com/robfig/cron#hdr-Predefined_schedules
@hourly-randomized -> this will trigger randomization of the schedule and then generate the actual, standard cron syntax (e.g. for @daily-randomized) and fill that in until we get a stable, random schedule like 23 5 * * *
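A sketch of how this might be used in a Schedule, assuming the proposed -randomized suffixes get implemented on top of the existing schedule fields:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
  name: smart-schedule
spec:
  backup:
    schedule: '*/5 * * * *'        # standard cron syntax keeps working
  prune:
    schedule: '@weekly-randomized' # proposed: operator picks a stable random slot within the week
  check:
    schedule: '@daily-randomized'  # proposed: runs once per day at an operator-chosen time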
Dear vshn team
It would be very useful to have a per-namespace control to set the default behaviour for the PVCs.
I am guessing this should be possible by modifying this section:
https://github.com/vshn/k8up/blob/bf8c6386bb50d31a14aa73addffb6a30a5dd1c4e/service/backup/backupRunner.go#L116-L123
Cheers
Add the check job logic to the new implementation: https://github.com/vshn/k8up/tree/master/service/check
The controller is already implemented for that. Now we need to implement an Executor that actually triggers the job.
That one should be quite trivial to implement.
With #114 merged the new implementation lives on the development branch.
Automate the docs handling and bring the docs up to date, into a state where every Kubernetes admin is able to install and operate K8up and every Kubernetes user is able to get the best out of the K8up custom resources.
refs APPU-1545
When I try to create a PVC backup, the following error occurs:
No repository available, initialising...
created restic repository 7ebf084d04 at s3:https://minio-backup.local.example/volumes
Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.
Removing locks...
created new cache in /.cache/restic
successfully removed locks
Listing all pods with annotation k8up.syn.tools/backupcommand in namespace test
Listing snapshots
snapshots command:
0 Snapshots
backing up...
Starting backup for folder test-claim0
could not parse restic output: invalid character 'S' looking for beginning of value
could not parse restic output: invalid character 'S' looking for beginning of value
could not parse restic output: invalid character 'S' looking for beginning of value
could not parse restic output: invalid character 'S' looking for beginning of value
could not parse restic output: invalid character 'S' looking for beginning of value
Did I do something wrong? I've retried the backup job multiple times.
Log the version number and build information when the operator starts.
I created the following Schedule:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
name: backup-netbox
spec:
archive:
schedule: '0 0 1 * *'
backup:
schedule: '*/5 * * * *'
keepJobs: 6
check:
schedule: '0 1 * * 1'
prune:
schedule: '0 1 * * 0'
retention:
keepLast: 5
keepDaily: 14
and then a Deployment with a Pod and the following Annotation:
appuio.ch/backupcommand: PGPASSWORD=$(cat /etc/secrets/stolon-stolonsupg_su_password) pg_dump -h stolon-proxy -U $(cat /etc/secrets/stolon-stolonsupg_su_username) -d netbox
All PVCs in the namespace have appuio.ch/backup: "false", and then the Operator crashes:
2019/05/22 15:14:14 [INFO] Registering prune schedule backup-netbox in namespace netbox
2019/05/22 15:14:14 [INFO] Registering check schedule backup-netbox in namespace netbox
2019/05/22 15:14:14 [INFO] Registering backup schedule backup-netbox in namespace netbox
E0522 15:14:14.010293 1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/go/src/github.com/vshn/k8up/service/schedule/scheduleRunner.go:225
/go/src/github.com/vshn/k8up/service/schedule/scheduler.go:59
/go/src/github.com/vshn/k8up/operator/handler.go:29
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:305
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:279
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:248
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:224
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:208
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:208
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xfb3047]
goroutine 192 [running]:
github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x1123fa0, 0x1e02a20)
/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/vshn/k8up/service/schedule.(*scheduleRunner).Start(0xc00074e1e0, 0x10c2900, 0xc0002a0ae0)
/go/src/github.com/vshn/k8up/service/schedule/scheduleRunner.go:225 +0x877
github.com/vshn/k8up/service/schedule.(*Schedule).Ensure(0xc00036ff80, 0x13cb0a0, 0xc0000b65a0, 0x4, 0x4)
/go/src/github.com/vshn/k8up/service/schedule/scheduler.go:59 +0x381
github.com/vshn/k8up/operator.(*handler).Add(0xc0003ae800, 0x13e0600, 0xc0002f6180, 0x13cb0a0, 0xc0000b65a0, 0x13f2b40, 0x1e37598)
/go/src/github.com/vshn/k8up/operator/handler.go:29 +0x47
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).handleAdd(0xc0003de210, 0x13e0600, 0xc0002f6180, 0xc0003c0620, 0x14, 0x13cb0a0, 0xc0000b65a0, 0x0, 0x0)
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:305 +0x4b1
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).processJob(0xc0003de210, 0x13e0600, 0xc0002f6150, 0xc0003c0620, 0x14, 0xc0002f6150, 0x13f2b40)
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:279 +0xf7
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).getAndProcessNextJob(0xc0003de210, 0xc0001ac500)
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:248 +0x21a
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).runWorker(0xc0003de210)
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:224 +0x2b
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).runWorker-fm()
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:208 +0x2a
github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00005d7b0)
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000a11fb0, 0x3b9aca00, 0x0, 0x1, 0xc00003e900)
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00005d7b0, 0x3b9aca00, 0xc00003e900)
/go/src/github.com/vshn/k8up/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).run.func1(0xc0003de210, 0xc00003e900)
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:208 +0x5c
created by github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller.(*generic).run
/go/src/github.com/vshn/k8up/vendor/github.com/spotahome/kooper/operator/controller/generic.go:207 +0x1e1
Not yet sure if I'm just missing anything in the Schedule CRD, but I guess the Operator should be more robust.
The Operator runs with the following ENVs:
- name: BACKUP_IMAGE
value: docker.io/vshn/wrestic:v0.0.10
- name: BACKUP_GLOBALS3ENDPOINT
value: https://objects.cloudscale.ch
- name: BACKUP_GLOBALS3BUCKET
value: k8up_backup
- name: BACKUP_GLOBALACCESSKEYID
value: ....
- name: BACKUP_GLOBALSECRETACCESSKEY
value: ....
- name: BACKUP_GLOBALREPOPASSWORD
value: ....
- name: BACKUP_GLOBALRESTORES3ENDPOINT
value: https://objects.cloudscale.ch
- name: BACKUP_GLOBALRESTORES3BUCKET
value: k8up_restore
- name: BACKUP_GLOBALRESTORES3ACCESKEYID
value: ...
- name: BACKUP_GLOBALRESTORES3SECRETACCESSKEY
value: ...
As K8up admin
I want to override the default resource request and limits of Pods generated by K8up
So that I can optimize resource usage or comply with cluster or namespace resource policies.
On clusters with a default pod cpu/memory limit, backup jobs are currently limited to these default values, because there is no possibility to override or remove them from k8up.
Resource Limits
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container cpu 1m - 100m 250m -
Container memory 1Mi - 128Mi 256Mi -
This limits the use cases on such clusters heavily.
Given a K8up Schedule object with per-schedule-specified resources
When K8up schedules Jobs
Then the containers in Pods are scheduled with configured resource request and limits
Given a K8up Schedule object outside of the cluster admin's responsibility
When K8up schedules Jobs
Then the containers in Pods are scheduled with configured global default resource request and limits
(in a multi-tenant cluster, customers can create schedules, while cluster-admins can provide global defaults in case customer doesn't define those)
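A sketch of what per-schedule resources could look like; the placement of the resources field is an assumption of this story, not a confirmed API:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
  name: backup-with-resources
spec:
  backup:
    schedule: '*/5 * * * *'
    resources:             # proposed per-job-type override
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        memory: 512Mi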
global defaults < schedule defaults < job type specifics (right overrides left)
Good evening
I think I found a bug with the cleanup of the finished jobs.
What I did:
What I expected:
Workaround:
kubectl get jobs --all-namespaces | rg backupjob | rg 1/1
Then delete all completed jobs manually.
Cheers,
Stefan
The rewrite currently doesn't contain any k8up metrics.
The operator-sdk provides a default metrics endpoint though. So use that one to add the custom metrics, if possible.
Most metrics are defined in the observer or the scheduling in the old version:
https://github.com/vshn/k8up/blob/master/service/schedule/scheduleRunner.go#L30
https://github.com/vshn/k8up/blob/master/service/observe/subscription.go#L23
With #114 merged the new implementation lives on the development branch.
For #16 we need to disable automatic CRD management by the operator. At least make it configurable on operator startup.
OLM wants to manage the CRD itself and isn't happy if an operator does CRD management.
The BACKUP_FILEEXTENSIONANNOTATION and BACKUP_BACKUPCOMMANDANNOTATION env variables are used to set the annotations on the PreBackupPod Pods created by wrestic, which is great.
The problem is that these env variables are not passed into the backup pod itself, so wrestic is still searching for its own defaults.
With #91 the defaults in k8up have changed to be k8up.syn.tools/*, which is great, but they have not been updated in wrestic; this will be fixed with k8up-io/wrestic#21. But in my understanding, just changing them in k8up should be enough, as they should be passed to the backup pod by k8up, where they are then picked up by wrestic?
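For clarity, this is roughly how the two variables are set on the operator Deployment; the fileextension value below is assumed and should be checked against the actual k8up defaults:
- name: BACKUP_BACKUPCOMMANDANNOTATION
  value: k8up.syn.tools/backupcommand
- name: BACKUP_FILEEXTENSIONANNOTATION
  value: k8up.syn.tools/fileextension   # assumed default, verify in the k8up docs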
Currently we're on 0.19. It's probably easier to upgrade to the latest version of the operator-sdk before adding lots of code.
Migration guide:
https://sdk.operatorframework.io/docs/building-operators/golang/migration/
If a prune job is already running for a given repository and another one is triggered (either manually or some schedule), it should get skipped.
This ensures that multiple prune jobs don't clog up the operator and starve out the time for actual backups.
The same applies to check jobs, too.
When I try to create a PVC backup, the following error occurs:
Starting backup for folder nextcloud-claim0
done: 0.00%
done: 63.95%
error cannot open on file /data/nextcloud-claim0/config/config.php
backup finished! new files: 0 changed files: 0 bytes added: 356
Listing snapshots
snapshots command:
35 Snapshots
config.php stat:
-rw-r----- 1 www-data www-data 1.6K Aug 1 21:17 config.php
This permission error occurs in other containers too.
If a prune job is already running for a given repository and another one is triggered (either manually or some schedule), it should get skipped.
This ensures that multiple prune jobs don't clog up the operator and starve out the time for actual backups.
The same applies to check jobs, too.
I was instructed in this HN thread:
https://news.ycombinator.com/item?id=20772971
... to formally request a working SFTP remote. We (rsync.net) already support restic and would like to give our customers a recipe (and tech support) for pointing k8up at our cloud storage platform. Since we offer a stock, standard OpenSSH interface, the SFTP remote is what we would look at ...
I'm happy to give a free account to vshn/k8up for testing, etc., but that's probably superfluous since it's not any different than any other SFTP login you already have ...
With #154 we have set the groundwork for e2e testing with KIND. What's left are the actual e2e tests themselves.
In https://github.com/vshn/espejo we set up e2e tests with bash, but it's considered experimental. We have chosen to go with bash for the following reasons:
(kubectl apply -f .., etc.)
In https://github.com/vshn/wrestic/blob/master/TESTCASES.md there are a bunch of test cases, ideal candidates for automation.
There is also a question of concern: Should K8up include e2e test cases where the whole stack is tested? Or should part of the tests be automated in wrestic, while the e2e tests for K8up really test the Operator features and not transparently test wrestic as well?
The current CRD generated is of version apiextensions.k8s.io/v1beta1, but some generated properties aren't valid for that version in K8s 1.18+. The generator currently used could do apiextensions.k8s.io/v1 if we upgrade it, but v1 is not available in older Kubernetes server versions (e.g. OpenShift 3.11), which means we would not be able to install K8up there.
In order to stay compatible with both OpenShift 3.11 and Kubernetes 1.18+ (Rancher, K3s etc.) I propose:
config/crd/apiextensions.k8s.io/v1beta1/ (see #152)
config/crd/apiextensions.k8s.io/v1/
v1 to stay up-to-date with Operator SDK (which will eventually also migrate to controller-gen 0.4+) and K8s API
v1 by default
The CustomResourceDefinition "prebackuppods.backup.appuio.ch" is invalid:
* spec.validation.openAPIV3Schema.properties[spec].properties[pod].properties[spec].properties[initContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[pod].properties[spec].properties[containers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
make: *** [Makefile:163: setup_e2e_test] Error 1
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
Add the restore job logic to the new implementation: https://github.com/vshn/k8up/tree/master/service/restore
The controller is already implemented for that. Now we need to implement an Executor that actually triggers the job.
With #114 merged the new implementation lives on the development branch.
As a k8up developer
I don't want to handle the prebackup pod readiness in the operator
So that we can get rid of the complicated asynchronous prebackup pod handling
The backup execution is currently much more complex than the other ones. The main reason is that backups are handled asynchronously via a goroutine, because the actual backup job needs to wait to be applied until the prebackup pods are ready. It can take quite a while until those pods are running, and blocking during that time could result in reconcile congestion when multiple backups start at the same time.
Given a backup with prebackup pods
When the operator triggers it
Then it won't do it asynchronously
As K8up developer
I want automated whitebox tests of K8up internals
So that I can contribute tested features and changes to ensure quality and avoid accidental breaking changes
Currently it's unclear if there's a distinction between integration tests and unit tests.
A potential distinction could be:
But in all honesty, it doesn't really make a difference. However, some IDEs may have difficulties debugging tests when envtest is running if those are only run via the Makefile.
Given a K8up PR
When I push code into a feature branch
Then GitHub should run automated tests and indicate to code authors and reviewers whether tests pass or fail
I have two PVCs ("data1" and "data2") and I was trying to make a copy of them with k8up; each of them has only one text file named "data.txt" in it.
I restored the snapshot to S3, and I got only one file in it, backup-default-data2-2020-01-04T16_27_03Z.tar.gz:
data
└── data2
└── data.txt
Make K8up compatible with the Operator Lifecycle Manager (OLM) and list the operator on OperatorHub.io.
refs APPU-1705
Add a lot more monitoring information to the operator to close the gap.
Operator:
Wrestic:
refs APPU-1058
Implement high-availability for the operator so that more than one operator instance can run per cluster.
refs APPU-1623 and APPU-1625
There's an s missing in the Acces*s* part. Unsure what implications this has, but it might be unpleasant 😄
hi
I use k8up v0.1.10 and wrestic v0.2.0.
The backup job throws an error when the backup-job pushes to the prometheus pushgateway:
I0813 23:02:09.739192 1 handler.go:44] wrestic/statsHandler/promStats "level"=0 "msg"="sending prometheus stats" "url"="https://pushgateway.example.com"
E0813 23:02:09.741696 1 backup.go:145] wrestic "msg"="prometheus send failed" "error"="unexpected status code 200 while pushing to https://pushgateway.example.com/metrics/job/restic_backup/instance/backup: "
In the pushgateway I see that the data was sent successfully...
I have defined the pushgateway url in the k8up BACKUP_PROMURL
env variable.
Any ideas why this happens?
Restic has capabilities to add tags to snapshots: https://restic.readthedocs.io/en/stable/040_backup.html#tags-for-backup
It would be great to define these tags within a Schedule object, so we can use them to keep the bucket a bit organized.
At the same time it would be awesome to define tags in an Archive object so that the archive job only takes backups with a given tag into consideration.
In the case of Lagoon I would use the environment type (production, staging or development) for this, so we only archive snapshots of production environments.
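A sketch of how tags might be declared, assuming hypothetical tags fields on the Schedule's backup and archive specs:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
  name: tagged-backups
spec:
  backup:
    schedule: '*/5 * * * *'
    tags:                  # hypothetical: passed to restic as --tag production
      - production
  archive:
    schedule: '0 0 1 * *'
    tags:                  # hypothetical: archive only snapshots carrying this tag
      - production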
In a recent workshop we had with APPUiO customers, the developers seemed irritated about having to use restic to list historic snapshots of their backups and/or do a manual restore. Later, as we mentioned the Archive object, we explained that by design we don't use restic for long-term storage of backups.
Also, the customers - who had specifically asked for a demo of backing up and restoring (Postgres) database data - seemed irritated about the fact that k8up simply backs up data from the file system volume instead of specifically "doing database backups".
Apparently, from the perspective of an application developer, this seems like an inconsistent tool chain and/or user experience. That makes it harder to "sell" the promise that k8up makes backups and restore real simple.
In practical terms, this could be by a k8up CLI (with an idiomatic, self-explanatory interface), and/or ...
Currently there are a few technical debts in K8up:
This issue will track the rewrite of the following aspects:
Migrate K8up to the Project Syn GitHub Organization (https://github.com/projectsyn/), adapt the code, docs and description accordingly and therefore make it an integral part of Project Syn.
K8up needs a home with a good community infrastructure. Project Syn has that and is building up what's needed to be as open to contributors as possible.
Move the images from [docker.io,quay.io]/vshn/[k8up,wrestic] to [docker.io,quay.io]/projectsyn/[k8up,wrestic]
Phase out [docker.io,quay.io]/vshn/[k8up,wrestic] later (date tbd)
Dear vshn team
In some part of the docs it says it points at a Prometheus endpoint.
However, in the examples it is the same URL as the minio server.
So my question is: what does it do now?
Cheers
The docs currently mention how to set up a backup schedule with restic.
We use kustomize to manage our k8s deployments & overlays
Is there any intent to publish an open source kustomization CRD for apiVersion: backup.appuio.ch/v1alpha1? It would be especially useful when having multiple S3 buckets, so that I could define a variable per OCP project with the correct bucket name in it.
Currently I need to overwrite the whole object in every project:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
name: backup-pods
spec:
backend:
s3:
bucket: XXX # This is now for every OCP project different
This could be simplified (with the help of a CRD) to a single base config:
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
name: backup-pods
spec:
backend:
s3:
bucket: $(BUCKET_NAME)
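Until something like that exists, kustomize vars might cover this; a sketch, assuming a ConfigMap per project holds the bucket name and a varReference configuration allows substitution inside the Schedule (the exact setup would need to be verified):
# kustomization.yaml (sketch)
resources:
  - schedule.yaml
vars:
  - name: BUCKET_NAME
    objref:
      apiVersion: v1
      kind: ConfigMap
      name: backup-settings
    fieldref:
      fieldpath: data.bucket
configurations:
  - varreference.yaml   # must allow-list spec/backend/s3/bucket for kind Schedule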
Implement a generic Pre-Backup Pod mechanism. See PR #10.
refs APPU-1530
Hi
When trying to back up RWO PVCs, K8up doesn't check on which node the pod that has said PVC mounted is running.
Since RWO volumes can only be mounted multiple times on the same node, K8up should check on which node the pod is running and create the backup job on the same node.
As of now, the backup job will never start, since it's not scheduled on the same node.
Is it possible to implement that check?
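For illustration, the generated backup Job's pod template would need a constraint like the one below so it lands on the node where the RWO PVC is mounted; this is a generic Kubernetes snippet K8up would have to fill in itself, not an existing option:
# Sketch: pin the backup pod to the node that already mounts the RWO volume.
spec:
  template:
    spec:
      nodeName: worker-node-1   # node where the pod using the PVC is running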
As a workaround we'll try to add an application aware backup command to the pod; basically "tar"-ing all of the files to stdout. But it would be much nicer if we could just use the PVC annotation.
Many thanks,
gi8lino