
ghe-backup's People

Contributors

hjacobs, jmcs, kgalli, lars-zalando, lotharschulz, m4ntr4, mkempson, rashamalek, scherniavsky, tuxlife


ghe-backup's Issues

run on Zalando Kubernetes setup

run ghe-backup on Zalando Kubernetes cluster:

  • create docker images for Kubernetes cluster: #66
  • deploy to Kubernetes cluster: #74
  • gather experience running it in production

prune stuck in-progress backups

An in-progress file is left in the backup data folder if a backup is aborted.
The next backup attempt then fails with:

Error: backup process 1468 of [myhost] already in progress in snapshot 20160219T112301. Aborting.

For now, prune the in-progress file from the backup data on the EBS volume, if it exists, only on container startup, because this file indicates that a backup is currently running.
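A minimal sketch of the startup-only pruning described above. It assumes the marker lives at `<data dir>/in-progress` and that the data directory is passed in as an argument. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: drop a leftover "in-progress" marker when the container starts.
# Assumption: the marker is <data dir>/in-progress; the data dir is passed as $1.
prune_in_progress_on_startup() {
  local data_dir="$1"
  local marker="$data_dir/in-progress"
  if [ -f "$marker" ]; then
    # No backup can be running yet at container start, so the marker is stale.
    echo "removing stale marker $marker ($(cat "$marker"))" >&2
    rm -f "$marker"
  fi
}
```

The function would be called once from the container entrypoint before cron is started, for example `prune_in_progress_on_startup /data/ghe-production-data`, assuming that is the backup data directory.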

correct permissions for /kms/convert-kms-private-ssh-key.sh

Permission issues on /kms/convert-kms-private-ssh-key.sh
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: % Total % Received % Xferd Average Speed Time Time Time Current
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: Dload Upload Total Spent Left Speed
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015100 469 100 469 0 0 114k 0 --:--:-- --:--:-- --:--:-- 114k
May 30 13:02:46 ip-172-31-142-237 docker/d13c786d96fd[825]: /backup/final-docker-cmd.sh: line 14: /kms/convert-kms-private-ssh-key.sh: Permission denied
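"Permission denied" on a script path usually means the file lacks the executable bit in the image. The cleaner fix is to set the mode when the image is built. As a hedged illustration only, a runtime guard in a script like final-docker-cmd.sh could look like this (the helper name is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: make sure a helper script is executable before calling it.
# Setting the mode at image build time is preferable; this is a runtime fallback.
run_helper() {
  local script="$1"
  shift
  if [ ! -f "$script" ]; then
    echo "missing helper: $script" >&2
    return 127   # same status the shell uses for "command not found"
  fi
  # Add the executable bit only if it is missing.
  [ -x "$script" ] || chmod +x "$script"
  "$script" "$@"
}
```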

Backup process hangs after a couple of tries due to docker fifo issue

When the backup process starts, it creates a file named in-progress (intended to prevent other backup processes from starting). When the process becomes unresponsive (stuck for some reason), the backup does not finish, the process stays in the process list, and the in-progress file remains until the next day, when /delete-instuck-backups/delete_instuck_progress.py deletes it (only once it is one day old).

The issue is that this script does not deal with the stuck process itself, which keeps running.

On the other hand, /start_backup.sh only checks whether another instance of itself is already in the process list:

pidof -o $$ -x "$0" >/dev/null 2>&1 && exit 1
In this case, no other backup runs until someone manually kills the old stuck process or restarts the Docker machine.
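A hedged sketch of how a start script could detect a stuck run instead of waiting a full day. It assumes the marker file holds `<snapshot> <pid>`, which matches the "backup process 1468 ... in snapshot 20160219T112301" error above. It also assumes a normal backup finishes within a chosen number of minutes. All function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: detect a stale in-progress marker and the pid it refers to.
# Assumption: the marker contains "<snapshot> <pid>" on its first line.
marker_pid() {
  # Print the pid recorded in the marker (second field of the first line).
  awk '{ print $2; exit }' "$1"
}

marker_is_stale() {
  local marker="$1" max_minutes="$2"
  [ -f "$marker" ] || return 1
  # find prints the path only if it was modified more than max_minutes ago.
  [ -n "$(find "$marker" -mmin +"$max_minutes")" ]
}

pid_is_running() {
  # kill -0 sends no signal; it only checks that the process exists.
  kill -0 "$1" 2>/dev/null
}
```

Whether a stale-but-running pid should be killed automatically is a policy decision. Another option is to bound each run up front with coreutils `timeout`, for example `timeout 6h /backup/backup-utils/bin/ghe-backup -v`, where the 6h limit is an arbitrary example.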

fixing ssh private key

id_rsa is written to the wrong path because "~" is not expanded correctly:

# find /backup -name id_rsa
/backup/~/.ssh/id_rsa
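One common cause of a path like /backup/~/.ssh/id_rsa is a quoted "~". The shell only expands an unquoted tilde, so "~/.ssh" stays literal and is resolved relative to the working directory. A minimal sketch that uses `$HOME` instead (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: write the private key under $HOME rather than a quoted "~",
# which the shell would treat as a literal directory name.
write_private_key() {
  local key_content="$1"
  local ssh_dir="${HOME}/.ssh"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  # umask 077 in a subshell so a newly created key file gets mode 600.
  ( umask 077; printf '%s\n' "$key_content" > "$ssh_dir/id_rsa" )
  printf '%s\n' "$ssh_dir/id_rsa"
}
```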

different docker files/images/containers per AWS account

Current situation: backups in both AWS accounts are triggered via cron at the same time (at minute 13).
Goals:

  • backups should be triggered in one account on odd hours and in the other account on even hours
    • approach: different dockerfiles, docker images, docker containers per AWS account
  • trigger the backup process only if no other backup process is in the process list
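The odd/even split can be expressed with cron step values. A hedged sketch that prints one crontab line per account is below. It assumes /start_backup.sh is the cron entry point and keeps minute 13. Which account gets odd hours is left open.

```shell
#!/usr/bin/env bash
# Sketch: build a crontab line that starts a backup at minute 13
# on odd hours (1,3,...,23) or even hours (0,2,...,22).
# Assumption: /start_backup.sh is the script cron should run.
cron_entry() {
  local parity="$1" hours
  case "$parity" in
    odd)  hours="1-23/2" ;;
    even) hours="0-22/2" ;;
    *) echo "parity must be odd or even" >&2; return 1 ;;
  esac
  printf '13 %s * * * /start_backup.sh\n' "$hours"
}
```

Each account's Docker image would install only its own line, for example `cron_entry odd` in one image and `cron_entry even` in the other.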

reduce the number of backup attempts

There are too many (zombie) backup processes running at the same time:

  • bus instance:
root     12081  0.0  0.0  45796  1000 ?        S    Aug24   0:00      |       |   \_ CRON
root     12082  0.0  0.0   4500   620 ?        Ss   Aug24   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     12083  0.0  0.0   9656   852 ?        S    Aug24   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     12102  0.0  0.0  11276   560 ?        S    Aug24   0:00      |       |   |           \_ grep ghe-backup
root     12143  0.0  0.0  45796  1000 ?        S    Aug24   0:00      |       |   \_ CRON
root     12144  0.0  0.0   4500   624 ?        Ss   Aug24   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     12145  0.0  0.0   9656   852 ?        S    Aug24   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     12164  0.0  0.0  11276   560 ?        S    Aug24   0:00      |       |   |           \_ grep ghe-backup
root     12216  0.0  0.0  45796  1000 ?        S    Aug24   0:00      |       |   \_ CRON
root     12217  0.0  0.0   4500   624 ?        Ss   Aug24   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     12218  0.0  0.0   9656   848 ?        S    Aug24   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     12237  0.0  0.0  11276   564 ?        S    Aug24   0:00      |       |   |           \_ grep ghe-backup
root     13226  0.0  0.1  45796  1364 ?        S    07:26   0:00      |       |   \_ CRON
root     13227  0.0  0.0   4500   664 ?        Ss   07:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13228  0.0  0.1   9656  1512 ?        S    07:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13247  0.0  0.0  11276   720 ?        S    07:26   0:00      |       |   |           \_ grep ghe-backup
root     13288  0.0  0.1  45796  1364 ?        S    08:26   0:00      |       |   \_ CRON
root     13289  0.0  0.0   4500   660 ?        Ss   08:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13290  0.0  0.1   9656  1520 ?        S    08:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13309  0.0  0.0  11276   724 ?        S    08:26   0:00      |       |   |           \_ grep ghe-backup
root     13350  0.0  0.1  45796  1364 ?        S    09:26   0:00      |       |   \_ CRON
root     13351  0.0  0.0   4500   664 ?        Ss   09:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13352  0.0  0.1   9656  1516 ?        S    09:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13371  0.0  0.0  11276   728 ?        S    09:26   0:00      |       |   |           \_ grep ghe-backup
root     13412  0.0  0.1  45796  1364 ?        S    10:26   0:00      |       |   \_ CRON
root     13413  0.0  0.0   4500   664 ?        Ss   10:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13414  0.0  0.1   9656  1516 ?        S    10:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13433  0.0  0.0  11276   728 ?        S    10:26   0:00      |       |   |           \_ grep ghe-backup
root     13485  0.0  0.1  45796  1364 ?        S    11:26   0:00      |       |   \_ CRON
root     13486  0.0  0.0   4500   664 ?        Ss   11:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13487  0.0  0.1   9656  1512 ?        S    11:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13506  0.0  0.0  11276   724 ?        S    11:26   0:00      |       |   |           \_ grep ghe-backup
root     13547  0.0  0.1  45796  1364 ?        S    12:26   0:00      |       |   \_ CRON
root     13548  0.0  0.0   4500   664 ?        Ss   12:26   0:00      |       |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13549  0.0  0.1   9656  1516 ?        S    12:26   0:00      |       |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13568  0.0  0.0  11276   724 ?        S    12:26   0:00      |       |   |           \_ grep ghe-backup
root     13609  0.0  0.1  45796  1364 ?        S    13:26   0:00      |       |   \_ CRON
root     13610  0.0  0.0   4500   660 ?        Ss   13:26   0:00      |       |       \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     13611  0.0  0.1   9656  1516 ?        S    13:26   0:00      |       |           \_ bash /backup/backup-utils/bin/ghe-backup -v
root     13630  0.0  0.0  11276   728 ?        S    13:26   0:00      |       |               \_ grep ghe-backup
  • automata instance:
root     11015  0.0  0.0  11276   124 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
root     11132  0.0  0.0  45796   380 ?        S    Aug22   0:00              |   \_ CRON
root     11133  0.0  0.0   4500    96 ?        Ss   Aug22   0:00              |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     11134  0.0  0.0   9656   296 ?        S    Aug22   0:00              |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     11153  0.0  0.0  11276   124 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
root     11255  0.0  0.0  45796   380 ?        S    Aug22   0:00              |   \_ CRON
root     11256  0.0  0.0   4500    92 ?        Ss   Aug22   0:00              |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     11257  0.0  0.0   9656   292 ?        S    Aug22   0:00              |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     11276  0.0  0.0  11276   124 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
root     11379  0.0  0.0  45796   380 ?        S    Aug22   0:00              |   \_ CRON
root     11380  0.0  0.0   4500   100 ?        Ss   Aug22   0:00              |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     11381  0.0  0.0   9656   292 ?        S    Aug22   0:00              |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     11400  0.0  0.0  11276   128 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
root     11505  0.0  0.0  45796   380 ?        S    Aug22   0:00              |   \_ CRON
root     11506  0.0  0.0   4500    96 ?        Ss   Aug22   0:00              |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     11507  0.0  0.0   9656   300 ?        S    Aug22   0:00              |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     11526  0.0  0.0  11276   128 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
root     11641  0.0  0.0  45796   380 ?        S    Aug22   0:00              |   \_ CRON
root     11642  0.0  0.0   4500    96 ?        Ss   Aug22   0:00              |   |   \_ /bin/sh -c /backup/backup-utils/bin/ghe-backup -v 1>> /var/log/ghe-prod-backup.log 2>&1
root     11643  0.0  0.0   9656   296 ?        S    Aug22   0:00              |   |       \_ bash /backup/backup-utils/bin/ghe-backup -v
root     11662  0.0  0.0  11276   128 ?        S    Aug22   0:00              |   |           \_ grep ghe-backup
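In the process trees above, each cron run leaves one `bash .../ghe-backup -v` process behind. A hedged sketch for counting those runs is below, so a new attempt could be skipped while older ones are still alive. The pattern is tuned to the ps output shown here and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count running ghe-backup invocations in ps-style input on stdin.
# Only "bash .../ghe-backup" lines count; the CRON, "/bin/sh -c" wrapper
# and "grep ghe-backup" lines do not match the pattern.
count_ghe_backup_runs() {
  # grep -c prints 0 (and exits 1) when nothing matches; "|| true" keeps the 0.
  grep -c 'bash [^ ]*/ghe-backup' || true
}
```

A cron wrapper could then bail out early, for example `[ "$(ps -eo args | count_ghe_backup_runs)" -eq 0 ] || exit 0`.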

delivery.yaml in k8s-master-like and master branches

@lotharschulz
Here is my understanding of the "k8s-master-like" and "master" branches:
"k8s-master-like" branch: its delivery.yaml should build and push a k8s-compatible image to pierone.

"master" branch: its delivery.yaml should build and push a Taupage-compatible image to pierone.

Are these definitions mixed up somehow?

Currently, k8s-master-like/delivery.yaml#L18 and k8s-master-like/delivery.yaml#L27 seem to create Taupage-compatible images, while master/delivery.yaml#L22 creates a k8s-compatible one.

replace-convert-properties.sh is not added to the Dockerfile

Hi @lotharschulz
https://github.com/zalando/ghe-backup/blob/master/replace-convert-properties.sh is not added to
https://github.com/zalando/ghe-backup/blob/master/Dockerfile

# docker ps -a
CONTAINER ID        IMAGE                                                       COMMAND                   CREATED             STATUS                        PORTS               NAMES
acc4c1a12aa3        pierone.stups.zalan.do/machinery/ghe-backup:cdp-master-16   "/bin/sh -c \"/backup/"   22 minutes ago      Exited (127) 20 minutes ago                       taupageapp

# cat /var/log/application.log
May 30 09:55:21 ip-172-31-131-253 docker/acc4c1a12aa3[833]: /backup/final-docker-cmd.sh: line 13: ./replace-convert-properties.sh: No such file or directory
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]:                                  Dload  Upload   Total   Spent    Left  Speed
May 30 09:56:12 ip-172-31-131-253 docker/acc4c1a12aa3[833]: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   469  100   469    0     0   114k      0 --:--:-- --:--:-- --:--:--  114k

Thanks,
Rasha
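Exit status 127 is the shell's "command not found" status, which matches the missing ./replace-convert-properties.sh. The actual fix is to add the file to the Dockerfile. A startup check could also report every missing helper at once rather than failing on the first one. A hedged sketch, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Sketch: verify that every helper script exists and is executable,
# report all problems, and fail if any helper is unusable.
check_helpers() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -x "$f" ]; then
      echo "not found or not executable: $f" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```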

docker image build broken

Removing intermediate container 3c89e8b17b65
Step 16 : "/KMS/CONVERT-KMS-PRIVATE-SSH-KEY.SH", 
Unknown instruction: "/KMS/CONVERT-KMS-PRIVATE-SSH-KEY.SH",

backup schedule overlap

Hi,
Currently, cron-ghe-backup-automata and cron-ghe-backup-bus are each configured to run every two hours.
The first automata backup took more than 1 hour; the next one took around 13 minutes.
This makes it hard to estimate a normal backup duration, and it can cause the automata and bus backup instances to overlap.
Suggestion: change the cron schedule to every 3-4 hours to prevent the overlap and to leave the GHE job queue some time to be cleaned up and completed.
@lotharschulz Please check if applicable.

$ du -hc --max-depth=1 /data/ghe-production-data
124G    /data/ghe-production-data/20170321T121301
12G     /data/ghe-production-data/20170321T101301
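To estimate a "normal" backup duration before changing the schedule, each run could log how long it took. A hedged sketch of a wrapper is below. The log format and function name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run a command, log its wall-clock duration and exit code to stderr,
# and return the command's own exit code.
timed_run() {
  local start end rc=0
  start=$(date +%s)
  "$@" || rc=$?
  end=$(date +%s)
  echo "duration_seconds=$((end - start)) exit_code=$rc command=$*" >&2
  return "$rc"
}
```

In cron this might read `timed_run /backup/backup-utils/bin/ghe-backup -v`, so the durations land in the same log as the backup output.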

adapt backups

  • let's also run backups on Sundays, as there is activity then from time to time
  • we are seeing behavior like in #27 again; let's reduce the backup attempts until we have a bigger instance (again)

current variable expansion fails in some cases

The current variable expansion fails with an unexpected error in some edge cases:

# /kms/convert-kms-ghe-mcpassword.sh
/kms/convert-kms-ghe-mcpassword.sh: line 18: $2: unbound variable

More detail about parameter expansion in shell scripts:
https://www.quora.com/What-is-the-best-way-to-check-if-an-argument-exists-in-Bash
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_05_02

Similar issues:
https://groups.google.com/forum/#!topic/comp.unix.shell/qklDGBv0Sdk
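The "$2: unbound variable" error is what `set -u` (nounset) does when a script references a positional parameter it did not receive. A hedged sketch of argument handling that survives nounset is below. It is not the actual convert-kms-ghe-mcpassword.sh code, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read optional arguments safely under `set -u`.
# "${2:-}" expands to "" when $2 is unset (no "unbound variable" error);
# "${2+set}" distinguishes "unset" from "set but empty".
set -u

describe_args() {
  local first="${1:-}"
  local second="${2:-}"
  if [ -z "$first" ]; then
    echo "usage: describe_args <first> [second]" >&2
    return 1
  fi
  if [ -n "${2+set}" ]; then
    printf 'first=%s second=%s\n' "$first" "$second"
  else
    printf 'first=%s second=<unset>\n' "$first"
  fi
}
```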

full backup disk

root@....:/data/ghe-production-data# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdf       985G  985G     0 100% /data

Let's reduce the number of backups.
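Besides running fewer backups, a run could refuse to start when the volume is nearly full instead of filling it to 100%. A hedged sketch using POSIX `df -P` is below. The 90% default threshold and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check disk usage of the backup volume before starting a backup.
disk_use_percent() {
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "85%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

enough_space_for_backup() {
  local dir="$1" max_percent="${2:-90}"
  local used
  used=$(disk_use_percent "$dir") || return 1
  [ "$used" -lt "$max_percent" ]
}
```

A start script might then use `enough_space_for_backup /data || exit 1`, assuming /data is the backup mount shown above.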

deploy to kubernetes

  • adapt the internal delivery.yaml to pick the latest Docker images created by this repo's delivery.yaml

backup clean up script

A script should implement a backup clean-up strategy, e.g.:

  • delete all backups older than 4 weeks, but keep
  • one backup per calendar month within the last 12 months
  • one backup per calendar year
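The strategy above can be sketched as a filter over snapshot directory names. This relies on several assumptions:

- snapshot directories are named YYYYMMDDTHHMMSS, as in the du output in this list;
- "one backup per month/year" means the newest snapshot of that month or year;
- "4 weeks" means 28 days.

The function only prints the names to keep and deletes nothing. The name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read snapshot names on stdin, print those to KEEP under the policy:
#   - everything younger than 28 days
#   - the newest snapshot of each calendar month within the last 12 months
#   - the newest snapshot of each calendar year
# $1 is "today" as YYYYMMDD. Names that are not YYYYMMDDTHHMMSS are ignored.
snapshots_to_keep() {
  local now="$1"
  sort -r | awk -v now="$now" '
    # Days since a fixed epoch (civil-from-days algorithm); only differences are used.
    function days(y, m, d,   era, yoe, doy, doe) {
      y -= (m <= 2)
      era = int(y / 400)
      yoe = y - era * 400
      doy = int((153 * (m + (m > 2 ? -3 : 9)) + 2) / 5) + d - 1
      doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
      return era * 146097 + doe
    }
    BEGIN {
      ny = substr(now, 1, 4) + 0; nm = substr(now, 5, 2) + 0; nd = substr(now, 7, 2) + 0
      nowday = days(ny, nm, nd); nowmonth = ny * 12 + nm
    }
    length($0) == 15 && substr($0, 9, 1) == "T" {
      y = substr($0, 1, 4) + 0; m = substr($0, 5, 2) + 0; d = substr($0, 7, 2) + 0
      keep = 0
      if (nowday - days(y, m, d) < 28) keep = 1
      mk = y * 12 + m
      # Input is sorted newest first, so the first hit per month/year is the newest.
      if (!(mk in seen_month)) { seen_month[mk] = 1; if (nowmonth - mk < 12) keep = 1 }
      if (!(y in seen_year)) { seen_year[y] = 1; keep = 1 }
      if (keep) print
    }'
}
```

For example, `ls /data/ghe-production-data | snapshots_to_keep "$(date +%Y%m%d)"` lists what the policy would retain. Entries such as `in-progress` are skipped by the name check, and the actual deletion step is left to the operator.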
