zalando / ghe-backup
GitHub Enterprise backup at ZalandoTech (Kubernetes, AWS, Docker)
License: Apache License 2.0
Use delivery.yaml and the new CD platform
An in-progress file is left in the backup data folder if a backup is aborted.
The next backup attempt then fails with
Error: backup process 1468 of [myhost] already in progress in snapshot 20160219T112301. Aborting.
For now, prune the in-progress file on the EBS volume (if it exists in the backup data) only on container startup, because during normal operation this file indicates that a backup is currently running.
due to incorrect docker image at the point in time when docker build
is called: https://github.com/zalando/ghe-backup/blob/master/Jenkinsfile#L106
caused most likely by https://github.com/zalando/ghe-backup/blob/master/Jenkinsfile#L103
Permission issues on /kms/convert-kms-private-ssh-key.sh
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: % Total % Received % Xferd Average Speed Time Time Time Current
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: Dload Upload Total Spent Left Speed
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015100 469 100 469 0 0 114k 0 --:--:-- --:--:-- --:--:-- 114k
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: /backup/final-docker-cmd.sh: line 14: /kms/convert-kms-private-ssh-key.sh: Permission denied
If the EC2 instance is stopped or terminated, the Auto Scaling group spins up a new instance, but
ghe-backup-volume is not mounted and the Taupage scripts fail.
When a backup process starts, it creates a file named in-progress to prevent other backup processes from starting. If the process becomes unresponsive (stuck for some reason), the backup never finishes, the process stays in the process list, and the in-progress file remains until the next day, when /delete-instuck-backups/delete_instuck_progress.py deletes it (only after one day).
The issue is that this script does not take care of the running (stuck) process.
On the other hand, /start_backup.sh only checks whether the PID exists in the process list:
pidof -o $$ -x "$0" >/dev/null 2>&1 && exit 1
In this case no other backup is executed until someone manually kills the old stuck process or restarts the Docker machine.
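One way to deal with the stuck process itself, sketched in Python under stated assumptions: the PID of the hung backup is already known (how it is obtained, for example from the in-progress file, is not shown), and the names `is_alive` and `terminate` as well as the grace period are hypothetical, not part of this repository. The sketch asks the process to stop with SIGTERM and escalates to SIGKILL if it is still present after the grace period.

```python
import os
import signal
import time


def is_alive(pid):
    """Return True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate(pid, grace_seconds=10.0, poll_seconds=0.2):
    """Send SIGTERM, then SIGKILL, waiting grace_seconds after each signal.

    Returns True once the process is gone, False if it survived both signals.
    The grace period is an assumed value, not one taken from the repository.
    """
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return True  # already gone
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return True
            time.sleep(poll_seconds)
    return False
```

Deciding when a backup counts as stuck (for example by its runtime) is left to the caller. A PID can also be reused by an unrelated process, so a real script would want to verify the command line before sending a signal.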
id_rsa is written to the wrong path because "~" is not expanded correctly.
# find /backup -name id_rsa
/backup/~/.ssh/id_rsa
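A path like `/backup/~/.ssh/id_rsa` is what results when `~` reaches the file system as a literal character instead of being expanded. How exactly that happens in this repository is not shown here; the Python sketch below only illustrates the effect and one explicit fix. Both function names are hypothetical, and a POSIX system is assumed.

```python
import os


def naive_key_path(base_dir, configured):
    """Reproduce the symptom: '~' is treated as an ordinary directory name."""
    return os.path.join(base_dir, configured)


def resolved_key_path(configured):
    """Expand a leading '~' to the current user's home directory explicitly."""
    return os.path.abspath(os.path.expanduser(configured))
```

Expanding the path explicitly before writing the key avoids depending on whether a shell happened to expand the tilde first.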
Current situation: backups in both AWS accounts are triggered via cron at the same time, at minute 13.
goals:
see details in #76 (review)
there should be one python file returning the ssh key
allow:
some backups failed with
backup-utils 2.6.0 or greater is required!
fix that
this is done by pipelines doing the actual deployment
There are too many (zombie) backup processes running at the same time:
root 12081 0.0 0.0 45796 1000 ? S Aug24 0:00 | | \_ CRON
root 12082 0.0 0.0 4500 620 ? Ss Aug24 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 12083 0.0 0.0 9656 852 ? S Aug24 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 12102 0.0 0.0 11276 560 ? S Aug24 0:00 | | | \_ grep ghe-backup
root 12143 0.0 0.0 45796 1000 ? S Aug24 0:00 | | \_ CRON
root 12144 0.0 0.0 4500 624 ? Ss Aug24 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 12145 0.0 0.0 9656 852 ? S Aug24 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 12164 0.0 0.0 11276 560 ? S Aug24 0:00 | | | \_ grep ghe-backup
root 12216 0.0 0.0 45796 1000 ? S Aug24 0:00 | | \_ CRON
root 12217 0.0 0.0 4500 624 ? Ss Aug24 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 12218 0.0 0.0 9656 848 ? S Aug24 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 12237 0.0 0.0 11276 564 ? S Aug24 0:00 | | | \_ grep ghe-backup
root 13226 0.0 0.1 45796 1364 ? S 07:26 0:00 | | \_ CRON
root 13227 0.0 0.0 4500 664 ? Ss 07:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13228 0.0 0.1 9656 1512 ? S 07:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13247 0.0 0.0 11276 720 ? S 07:26 0:00 | | | \_ grep ghe-backup
root 13288 0.0 0.1 45796 1364 ? S 08:26 0:00 | | \_ CRON
root 13289 0.0 0.0 4500 660 ? Ss 08:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13290 0.0 0.1 9656 1520 ? S 08:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13309 0.0 0.0 11276 724 ? S 08:26 0:00 | | | \_ grep ghe-backup
root 13350 0.0 0.1 45796 1364 ? S 09:26 0:00 | | \_ CRON
root 13351 0.0 0.0 4500 664 ? Ss 09:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13352 0.0 0.1 9656 1516 ? S 09:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13371 0.0 0.0 11276 728 ? S 09:26 0:00 | | | \_ grep ghe-backup
root 13412 0.0 0.1 45796 1364 ? S 10:26 0:00 | | \_ CRON
root 13413 0.0 0.0 4500 664 ? Ss 10:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13414 0.0 0.1 9656 1516 ? S 10:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13433 0.0 0.0 11276 728 ? S 10:26 0:00 | | | \_ grep ghe-backup
root 13485 0.0 0.1 45796 1364 ? S 11:26 0:00 | | \_ CRON
root 13486 0.0 0.0 4500 664 ? Ss 11:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13487 0.0 0.1 9656 1512 ? S 11:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13506 0.0 0.0 11276 724 ? S 11:26 0:00 | | | \_ grep ghe-backup
root 13547 0.0 0.1 45796 1364 ? S 12:26 0:00 | | \_ CRON
root 13548 0.0 0.0 4500 664 ? Ss 12:26 0:00 | | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13549 0.0 0.1 9656 1516 ? S 12:26 0:00 | | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13568 0.0 0.0 11276 724 ? S 12:26 0:00 | | | \_ grep ghe-backup
root 13609 0.0 0.1 45796 1364 ? S 13:26 0:00 | | \_ CRON
root 13610 0.0 0.0 4500 660 ? Ss 13:26 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 13611 0.0 0.1 9656 1516 ? S 13:26 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 13630 0.0 0.0 11276 728 ? S 13:26 0:00 | | \_ grep ghe-backup
root 11015 0.0 0.0 11276 124 ? S Aug22 0:00 | | \_ grep ghe-backup
root 11132 0.0 0.0 45796 380 ? S Aug22 0:00 | \_ CRON
root 11133 0.0 0.0 4500 96 ? Ss Aug22 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 11134 0.0 0.0 9656 296 ? S Aug22 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 11153 0.0 0.0 11276 124 ? S Aug22 0:00 | | \_ grep ghe-backup
root 11255 0.0 0.0 45796 380 ? S Aug22 0:00 | \_ CRON
root 11256 0.0 0.0 4500 92 ? Ss Aug22 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 11257 0.0 0.0 9656 292 ? S Aug22 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 11276 0.0 0.0 11276 124 ? S Aug22 0:00 | | \_ grep ghe-backup
root 11379 0.0 0.0 45796 380 ? S Aug22 0:00 | \_ CRON
root 11380 0.0 0.0 4500 100 ? Ss Aug22 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 11381 0.0 0.0 9656 292 ? S Aug22 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 11400 0.0 0.0 11276 128 ? S Aug22 0:00 | | \_ grep ghe-backup
root 11505 0.0 0.0 45796 380 ? S Aug22 0:00 | \_ CRON
root 11506 0.0 0.0 4500 96 ? Ss Aug22 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 11507 0.0 0.0 9656 300 ? S Aug22 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 11526 0.0 0.0 11276 128 ? S Aug22 0:00 | | \_ grep ghe-backup
root 11641 0.0 0.0 45796 380 ? S Aug22 0:00 | \_ CRON
root 11642 0.0 0.0 4500 96 ? Ss Aug22 0:00 | | \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root 11643 0.0 0.0 9656 296 ? S Aug22 0:00 | | \_ bash /backup/backup-utils/bin/ghe-backup -v
root 11662 0.0 0.0 11276 128 ? S Aug22 0:00 | | \_ grep ghe-backup
Don't rely on the /meta folder being writable, so that tests can succeed on Taupage-based CIs.
@lotharschulz
Here is my understanding of the "k8s-master-like" and "master" branches:
"k8s-master-like" branch: its delivery.yaml should build and push a k8s-compatible image to pierone.
"master" branch: its delivery.yaml should build and push a Taupage-compatible image to pierone.
Are these definitions mixed up somehow?
Currently
k8s-master-like/delivery.yaml#L18 and k8s-master-like/delivery.yaml#L27 seem to be creating Taupage-compatible images, and
master/delivery.yaml#L22
is creating a k8s-compatible one.
Schedule hourly backups as proposed in "Scheduling backups", now that #51 is solved.
Remove the hard-coded AWS region:
https://github.com/zalando/ghe-backup/blob/master/convert-kms-private-ssh-key.sh#L36
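A minimal sketch of how the region could be resolved at runtime instead of being hard-coded. It is written in Python for illustration, while the linked script is a shell script. `AWS_REGION` and `AWS_DEFAULT_REGION` are environment variables read by AWS tooling; the function names and the fallback to an availability zone string (which a caller might obtain from instance metadata) are assumptions.

```python
import os


def region_from_zone(zone):
    """Derive a region from a zone name, e.g. 'eu-west-1a' -> 'eu-west-1'.

    Simplification: only handles standard zone names with a letter suffix.
    """
    return zone.rstrip("abcdefghijklmnopqrstuvwxyz")


def resolve_region(env=None, zone=None):
    """Pick the AWS region from the environment, then from the zone, else fail."""
    env = os.environ if env is None else env
    for name in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        if env.get(name):
            return env[name]
    if zone:
        return region_from_zone(zone)
    raise RuntimeError("AWS region is not configured")
```

Failing loudly when no region can be determined is a deliberate choice here: silently falling back to a fixed region would reintroduce the hard-coded value.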
Hi @lotharschulz
https://github.com/zalando/ghe-backup/blob/master/replace-convert-properties.sh is not added to
https://github.com/zalando/ghe-backup/blob/master/Dockerfile
# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
acc4c1a12aa3 pierone.stups.zalan.do/machinery/ghe-backup:cdp-master-16 "/bin/sh -c \"/backup/" 22 minutes ago Exited (127) 20 minutes ago taupageapp
# cat /var/log/application.log
May 30 09:55:21 ip-172-31-131-253 docker/acc4c1a12aa3[833]: /backup/final-docker-cmd.sh: line 13: ./replace-convert-properties.sh: No such file or directory
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]: % Total % Received % Xferd Average Speed Time Time Time Current
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]: Dload Upload Total Spent Left Speed
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015100 469 100 469 0 0 114k 0 --:--:-- --:--:-- --:--:-- 114k
Thanks,
Rasha
Removing intermediate container 3c89e8b17b65
Step 16 : "/KMS/CONVERT-KMS-PRIVATE-SSH-KEY.SH",
Unknown instruction: "/KMS/CONVERT-KMS-PRIVATE-SSH-KEY.SH",
Docker images have to be built from clean repositories (all changes added and committed).
https://github.com/zalando/ghe-backup/blob/master/Jenkinsfile#L102 is called twice when a deployment from the master branch happens. The repository is dirty when https://github.com/zalando/ghe-backup/blob/master/Jenkinsfile#L101 is called the second time.
Fix that.
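As an illustration of the "clean repository" requirement, a build step could refuse to continue when the working tree has pending changes. The sketch below is not the Jenkinsfile's logic; the function names are hypothetical, and it relies only on `git status --porcelain`, which prints nothing for a clean tree.

```python
import subprocess


def is_clean(porcelain_output):
    """True when `git status --porcelain` printed nothing (no pending changes)."""
    return porcelain_output.strip() == ""


def assert_clean_checkout(repo_dir="."):
    """Raise if the working tree has uncommitted or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    if not is_clean(result.stdout):
        raise RuntimeError("working tree is dirty:\n" + result.stdout)
```

Calling `assert_clean_checkout()` right before each image build would surface the second, dirty invocation instead of silently building from it.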
Hi,
Currently cron-ghe-backup-automata and cron-ghe-backup-bus are each configured to run every two hours.
The first automata backup took more than 1 hour; the next one took around 13 minutes.
This means we cannot yet estimate the normal backup time, and the automata and bus backup instances may overlap.
Suggestion: change the cron schedule to every 3-4 hours to prevent the overlap and to leave some time for the GHE job queue to be cleaned and completed.
@lotharschulz Please check if applicable.
$ du -hc --max-depth=1 /data/ghe-production-data
124G /data/ghe-production-data/20170321T121301
12G /data/ghe-production-data/20170321T101301
If a backup attempt gets stuck, all following backup attempts are incomplete because an in-progress file is left in the backup root folder.
Delete this file if it is older than 1 day.
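The age check can be sketched as follows. This is an illustrative Python version, not the repository's script; the function name is hypothetical, and the modification time of the file is assumed to be a good enough proxy for when the backup started.

```python
import os
import time

ONE_DAY_SECONDS = 24 * 60 * 60


def remove_stale_marker(marker_path, max_age_seconds=ONE_DAY_SECONDS, now=None):
    """Delete marker_path if its modification time is older than max_age_seconds.

    Returns True if the file was deleted, False otherwise.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except FileNotFoundError:
        return False  # nothing to clean up
    if age <= max_age_seconds:
        return False  # a backup may still be running
    os.remove(marker_path)
    return True
```

The `now` parameter exists only to make the function easy to test without waiting a day.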
The current variable expansion produces an unexpected error in some edge cases:
# /kms/convert-kms-ghe-mcpassword.sh
/kms/convert-kms-ghe-mcpassword.sh: line 18: $2: unbound variable
More details about parameter expansion in shell scripts:
https://www.quora.com/What-is-the-best-way-to-check-if-an-argument-exists-in-Bash
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_05_02
similar issues:
https://groups.google.com/forum/#!topic/comp.unix.shell/qklDGBv0Sdk
root@....:/data/ghe-production-data# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvdf 985G 985G 0 100% /data
Let's reduce the number of backups.
A script should implement a backup clean-up strategy, e.g.
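The concrete strategies proposed for this clean-up are not available here. Purely as a hypothetical illustration, a simple retention rule keeps the newest N snapshot directories and removes the rest. The sketch assumes snapshot directories are named by timestamp, like the `20170321T121301` entries in the `du` output, so that sorting by name equals sorting by age; the function name and the value of `keep` are assumptions.

```python
import re

# Matches names such as 20170321T121301; other directory entries are ignored.
SNAPSHOT_NAME = re.compile(r"^\d{8}T\d{6}$")


def snapshots_to_delete(names, keep):
    """Return snapshot names to remove, oldest first, keeping the `keep` newest."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshots = sorted(n for n in names if SNAPSHOT_NAME.match(n))
    return snapshots[:-keep]
```

The function only computes the list; actually deleting the directories is left to the caller so that the selection can be reviewed or logged first.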