clickhouse-backup


A tool for easy ClickHouse backup and restore with support for many cloud and non-cloud storage types.

Don't run clickhouse-backup remotely

To back up data, clickhouse-backup requires access to the same files as clickhouse-server in the /var/lib/clickhouse folder. For that reason, clickhouse-backup must run on the same host, in the same Kubernetes Pod, or in a neighboring container on the same host where clickhouse-server runs. WARNING: when connecting to remote clickhouse-server hosts, you can back up only the schema.
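
As an illustration of the "neighbor container" layout, a minimal docker-compose sketch could look like this (image tags, volume names, and the shared-network choice are illustrative assumptions, not a reference deployment):

```yaml
# Hypothetical sketch: clickhouse-backup as a sidecar sharing the
# clickhouse-server data volume on the same host.
services:
  clickhouse-server:
    image: clickhouse/clickhouse-server:latest
    volumes:
      - clickhouse-data:/var/lib/clickhouse
  clickhouse-backup:
    image: altinity/clickhouse-backup:latest
    command: server                            # run the REST API (see the API section)
    network_mode: "service:clickhouse-server"  # reach clickhouse on localhost:9000
    volumes:
      - clickhouse-data:/var/lib/clickhouse    # must see the same files as the server
volumes:
  clickhouse-data:
```

The key point is the shared volume: without it, only schema-level backups are possible.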

Features

  • Easy creating and restoring backups of all or specific tables
  • Efficient storing of multiple backups on the file system
  • Uploading and downloading with streaming compression
  • Works with AWS, GCS, Azure, Tencent COS, FTP, SFTP
  • Support for Atomic Database Engine
  • Support for multi disks installations
  • Support for custom remote storage types via rclone, kopia, restic, rsync, etc.
  • Support for incremental backups on remote storage

Limitations

  • ClickHouse versions above 1.1.54394 are supported
  • Only MergeTree family table engines are supported (more table engine types with clickhouse-server 22.7+ and USE_EMBEDDED_BACKUP_RESTORE=true)

Support

Altinity is the primary maintainer of clickhouse-backup. We offer a range of software and services related to ClickHouse.

  • Official website - Get a high level overview of Altinity and our offerings.
  • Altinity.Cloud - Run ClickHouse in our cloud or yours.
  • Altinity Support - Get Enterprise-class support for ClickHouse.
  • Slack - Talk directly with ClickHouse users and Altinity devs.
  • Contact us - Contact Altinity with your questions or issues.
  • Free consultation - Get a free consultation with a ClickHouse expert today.

Installation

Download the latest binary from the releases page and decompress with:

tar -zxvf clickhouse-backup.tar.gz

Use the official tiny Docker image and run it on a host with clickhouse-server installed:

docker run -u $(id -u clickhouse) --rm -it --network host -v "/var/lib/clickhouse:/var/lib/clickhouse" \
   -e CLICKHOUSE_PASSWORD="password" \
   -e S3_BUCKET="clickhouse-backup" \
   -e S3_ACCESS_KEY="access_key" \
   -e S3_SECRET_KEY="secret" \
   altinity/clickhouse-backup --help

Build from source (requires Go 1.21+):

GO111MODULE=on go install github.com/Altinity/clickhouse-backup/v2/cmd/clickhouse-backup@latest

Brief description of how clickhouse-backup works

Data files are immutable in clickhouse-server. During a backup operation, clickhouse-backup creates file system hard links to existing clickhouse-server data parts by executing the ALTER TABLE ... FREEZE query. During the restore operation, clickhouse-backup copies the hard links to the detached folder and executes the ALTER TABLE ... ATTACH PART query for each data part and each table in the backup. A more detailed description is available here: https://www.youtube.com/watch?v=megsNh9Q-dw
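
As an illustration only (clickhouse-backup issues these queries for you; the table and part names here are hypothetical), the two sides of the process boil down to:

```sql
-- backup: hard-link all current parts of the table into
-- /var/lib/clickhouse/shadow/ without copying any data
ALTER TABLE default.events FREEZE;

-- restore: after the backed-up part directories are copied into the
-- table's detached/ folder, attach each part back (one query per part)
ALTER TABLE default.events ATTACH PART 'all_1_1_0';
```

Because FREEZE only creates hard links, creating a local backup is fast and initially consumes almost no extra disk space.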

Default Config

By default, the config file is located at /etc/clickhouse-backup/config.yml, but it can be redefined via the CLICKHOUSE_BACKUP_CONFIG environment variable or via --config command line parameter. All options can be overwritten via environment variables. Use clickhouse-backup default-config to print default config.

Explanation of config parameters

The following values are not the defaults; they just explain what each config parameter means. Use clickhouse-backup print-config to print the current config.

general:
  remote_storage: none           # REMOTE_STORAGE, choice from: `azblob`,`gcs`,`s3`, etc; if `none` then `upload` and `download` commands will fail.
  max_file_size: 1073741824      # MAX_FILE_SIZE, 1G by default, useless when upload_by_part is true, use to split data parts files by archives
  backups_to_keep_local: 0       # BACKUPS_TO_KEEP_LOCAL, how many of the latest local backups should be kept, 0 means all created backups will be stored on the local disk
                                 # -1 means the backup will be kept after `create` but deleted after the `create_remote` command
                                 # You can run the `clickhouse-backup delete local <backup_name>` command to remove temporary backup files from the local disk
  backups_to_keep_remote: 0      # BACKUPS_TO_KEEP_REMOTE, how many of the latest backups should be kept on remote storage, 0 means all uploaded backups will be stored on remote storage.
                                 # If an old backup is required for a newer incremental backup, it won't be deleted. Be careful with long incremental backup sequences.
  log_level: info                # LOG_LEVEL, a choice from `debug`, `info`, `warn`, `error`
  allow_empty_backups: false     # ALLOW_EMPTY_BACKUPS
  # Concurrency means parallel tables and parallel parts inside tables
  # For example, 4 means max 4 parallel tables and 4 parallel parts inside one table, so equals 16 concurrent streams
  download_concurrency: 1        # DOWNLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
  upload_concurrency: 1          # UPLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
  
  # Throttling of upload and download speed is calculated at the part level, not the socket level; this means a short period of high traffic followed by time to sleep
  download_max_bytes_per_second: 0  # DOWNLOAD_MAX_BYTES_PER_SECOND, 0 means no throttling 
  upload_max_bytes_per_second: 0    # UPLOAD_MAX_BYTES_PER_SECOND, 0 means no throttling
  
  # when table data is stored on a disk from system.disks with type=ObjectStorage, a remote copy is executed inside the object storage service provider; this parameter restricts how many files will be copied in parallel for each table
  object_disk_server_side_copy_concurrency: 32 
  
  # RESTORE_SCHEMA_ON_CLUSTER, execute all schema related SQL queries with `ON CLUSTER` clause as Distributed DDL.
  # Check `system.clusters` table for the correct cluster name, also `system.macros` can be used.
  # This isn't applicable when `use_embedded_backup_restore: true`
  restore_schema_on_cluster: ""
  upload_by_part: true           # UPLOAD_BY_PART
  download_by_part: true         # DOWNLOAD_BY_PART
  use_resumable_state: true      # USE_RESUMABLE_STATE, allow resume upload and download according to the <backup_name>.resumable file

  # RESTORE_DATABASE_MAPPING, restore rules from backup databases to target databases, useful when changing the destination database; all atomic tables will be created with new UUIDs.
  # The format for this env variable is "src_db1:target_db1,src_db2:target_db2". For YAML please continue using map syntax
  restore_database_mapping: {}
  retries_on_failure: 3          # RETRIES_ON_FAILURE, how many times to retry after a failure during upload or download
  retries_pause: 30s             # RETRIES_PAUSE, duration time to pause after each download or upload failure

  watch_interval: 1h       # WATCH_INTERVAL, used only by the `watch` command, a backup will be created every 1h
  full_interval: 24h       # FULL_INTERVAL, used only by the `watch` command, a full backup will be created every 24h
  watch_backup_name_template: "shard{shard}-{type}-{time:20060102150405}" # WATCH_BACKUP_NAME_TEMPLATE, used only by the `watch` command, macro values are applied from `system.macros`; for time:XXX, see the format in https://go.dev/src/time/format.go

  sharded_operation_mode: none       # SHARDED_OPERATION_MODE, how different replicas will shard backing up data for tables. Options are: none (no sharding), table (table granularity), database (database granularity), first-replica (on the lexicographically sorted first active replica). If left empty, then the "none" option will be set as default.
  
  cpu_nice_priority: 15    # CPU niceness priority, to allow throttling CPU-intensive operations, more details https://manpages.ubuntu.com/manpages/xenial/man1/nice.1.html
  io_nice_priority: "idle" # IO niceness priority, to allow throttling disk intensive operation, more details https://manpages.ubuntu.com/manpages/xenial/man1/ionice.1.html
  
  rbac_backup_always: true # always backup RBAC objects
  rbac_resolve_conflicts: "recreate"  # action when an RBAC object with the same name already exists; allowed values: "recreate", "ignore", "fail"
clickhouse:
  username: default                # CLICKHOUSE_USERNAME
  password: ""                     # CLICKHOUSE_PASSWORD
  host: localhost                  # CLICKHOUSE_HOST, to back up data `clickhouse-backup` requires access to the same file system as clickhouse-server, so `host` should be localhost, the address of another docker container on the same machine, or an IP address bound to some network interface on the same host.
  port: 9000                       # CLICKHOUSE_PORT, don't use 8123, clickhouse-backup doesn't support HTTP protocol
  # CLICKHOUSE_DISK_MAPPING, use this mapping when your `system.disks` are different between the source and destination clusters during backup and restore process.
  # The format for this env variable is "disk_name1:disk_path1,disk_name2:disk_path2". For YAML please continue using map syntax.
  # If destination disk is different from source backup disk then you need to specify the destination disk in the config file:

  # disk_mapping:
  #  disk_destination: /var/lib/clickhouse/disks/destination
  
  # `disk_destination` needs to be referenced in the backup (source config), and all names from this map (`disk:path`) shall exist in `system.disks` on the destination server.
  # During download of the backup from a remote location (s3), if `name` is not present in `disk_mapping` (in the destination server config too) then the `default` disk path will be used for the download.
  # `disk_mapping` is used to understand, during download, where downloaded parts shall be unpacked (which disk) on the destination server, and where to search for data part directories during restore.
  disk_mapping: {}
  # CLICKHOUSE_SKIP_TABLES, the list of tables (patterns are allowed) which are ignored during the backup and restore process
  # The format for this env variable is "pattern1,pattern2,pattern3". For YAML please continue using list syntax
  skip_tables:
    - system.*
    - INFORMATION_SCHEMA.*
    - information_schema.*
  # CLICKHOUSE_SKIP_TABLE_ENGINES, the list of table engines which are ignored during the backup, upload, download, and restore process
  # The format for this env variable is "Engine1,Engine2,engine3". For YAML please continue using list syntax
  skip_table_engines: []
  timeout: 5m                  # CLICKHOUSE_TIMEOUT
  freeze_by_part: false        # CLICKHOUSE_FREEZE_BY_PART, allow freezing by part instead of freezing the whole table
  freeze_by_part_where: ""     # CLICKHOUSE_FREEZE_BY_PART_WHERE, allow parts filtering during freezing when freeze_by_part: true
  secure: false                # CLICKHOUSE_SECURE, use TLS encryption for connection
  skip_verify: false           # CLICKHOUSE_SKIP_VERIFY, skip certificate verification and allow potential certificate warnings
  sync_replicated_tables: true # CLICKHOUSE_SYNC_REPLICATED_TABLES
  tls_key: ""                  # CLICKHOUSE_TLS_KEY, filename with TLS key file
  tls_cert: ""                 # CLICKHOUSE_TLS_CERT, filename with TLS certificate file
  tls_ca: ""                   # CLICKHOUSE_TLS_CA, filename with TLS custom authority file
  log_sql_queries: true        # CLICKHOUSE_LOG_SQL_QUERIES, enable logging `clickhouse-backup` SQL queries on `system.query_log` table inside clickhouse-server
  debug: false                 # CLICKHOUSE_DEBUG
  config_dir:      "/etc/clickhouse-server"              # CLICKHOUSE_CONFIG_DIR
  # CLICKHOUSE_RESTART_COMMAND, use this command when restoring with --rbac, --rbac-only or --configs, --configs-only options
  # the command will be split by ; and executed one by one; all errors are logged and ignored
  # available prefixes
  # - sql: will execute SQL query
  # - exec: will execute command via shell
  restart_command: "exec:systemctl restart clickhouse-server" 
  ignore_not_exists_error_during_freeze: true # CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE, helps to avoid backup failures when running frequent CREATE / DROP tables and databases during backup, `clickhouse-backup` will ignore `code: 60` and `code: 81` errors during execution of `ALTER TABLE ... FREEZE`
  check_replicas_before_attach: true # CLICKHOUSE_CHECK_REPLICAS_BEFORE_ATTACH, helps avoiding concurrent ATTACH PART execution when restoring ReplicatedMergeTree tables
  use_embedded_backup_restore: false # CLICKHOUSE_USE_EMBEDDED_BACKUP_RESTORE, use BACKUP / RESTORE SQL statements instead of regular SQL queries to use features of modern ClickHouse server versions
  embedded_backup_disk: ""  # CLICKHOUSE_EMBEDDED_BACKUP_DISK - disk from system.disks which will be used when `use_embedded_backup_restore: true`
  backup_mutations: true # CLICKHOUSE_BACKUP_MUTATIONS, allow backing up mutations from system.mutations WHERE is_done=0 and applying them during restore
  restore_as_attach: false # CLICKHOUSE_RESTORE_AS_ATTACH, allow restore tables which have inconsistent data parts structure and mutations in progress
  check_parts_columns: true # CLICKHOUSE_CHECK_PARTS_COLUMNS, check data types from system.parts_columns during create backup to guarantee mutation is complete
  max_connections: 0 # CLICKHOUSE_MAX_CONNECTIONS, how many parallel connections could be opened during operations
azblob:
  endpoint_suffix: "core.windows.net" # AZBLOB_ENDPOINT_SUFFIX
  account_name: ""             # AZBLOB_ACCOUNT_NAME
  account_key: ""              # AZBLOB_ACCOUNT_KEY
  sas: ""                      # AZBLOB_SAS
  use_managed_identity: false  # AZBLOB_USE_MANAGED_IDENTITY
  container: ""                # AZBLOB_CONTAINER
  path: ""                     # AZBLOB_PATH, `system.macros` values can be applied as {macro_name}
  object_disk_path: ""         # AZBLOB_OBJECT_DISK_PATH, path for backup of parts from the `azure_blob_storage` object disk; if the disk is present, this shall not be empty and shall not be prefixed by `path`
  compression_level: 1         # AZBLOB_COMPRESSION_LEVEL
  compression_format: tar      # AZBLOB_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  sse_key: ""                  # AZBLOB_SSE_KEY
  buffer_size: 0               # AZBLOB_BUFFER_SIZE, if less than or equal to 0 then it is calculated as max_file_size / max_parts_count, between 2MB and 4MB
  max_parts_count: 10000       # AZBLOB_MAX_PARTS_COUNT, number of parts for AZBLOB uploads, used to properly calculate buffer size
  max_buffers: 3               # AZBLOB_MAX_BUFFERS
  debug: false                 # AZBLOB_DEBUG
s3:
  access_key: ""                   # S3_ACCESS_KEY
  secret_key: ""                   # S3_SECRET_KEY
  bucket: ""                       # S3_BUCKET
  endpoint: ""                     # S3_ENDPOINT
  region: us-east-1                # S3_REGION
  # AWS changed S3 defaults in April 2023 so that all new buckets have ACL disabled: https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/
  # They also recommend that ACLs are disabled: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ensure-object-ownership.html
  # use `acl: ""` if you see "api error AccessControlListNotSupported: The bucket does not allow ACLs"
  acl: private                     # S3_ACL 
  assume_role_arn: ""              # S3_ASSUME_ROLE_ARN
  force_path_style: false          # S3_FORCE_PATH_STYLE
  path: ""                         # S3_PATH, `system.macros` values can be applied as {macro_name}
  object_disk_path: ""             # S3_OBJECT_DISK_PATH, path for backup of parts from the `s3` object disk; if the disk is present, this shall not be empty and shall not be prefixed by `path`
  disable_ssl: false               # S3_DISABLE_SSL
  compression_level: 1             # S3_COMPRESSION_LEVEL
  compression_format: tar          # S3_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  # look at details in https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html
  sse: ""                          # S3_SSE, empty (default), AES256, or aws:kms
  sse_customer_algorithm: ""       # S3_SSE_CUSTOMER_ALGORITHM, encryption algorithm, for example, AES256
  sse_customer_key: ""             # S3_SSE_CUSTOMER_KEY, customer-provided encryption key use `openssl rand 32 > aws_sse.key` and `cat aws_sse.key | base64` 
  sse_customer_key_md5: ""         # S3_SSE_CUSTOMER_KEY_MD5, 128-bit MD5 digest of the encryption key according to RFC 1321 use `cat aws_sse.key |  openssl dgst -md5 -binary | base64`
  sse_kms_key_id: ""               # S3_SSE_KMS_KEY_ID, if S3_SSE is aws:kms then specifies the ID of the Amazon Web Services Key Management Service
  sse_kms_encryption_context: ""   # S3_SSE_KMS_ENCRYPTION_CONTEXT, base64-encoded UTF-8 string holding a JSON with the encryption context
                                   # Specifies the Amazon Web Services KMS Encryption Context to use for object encryption.
                                   # This is a collection of non-secret key-value pairs that represent additional authenticated data.
                                   # When you use an encryption context to encrypt data, you must specify the same (an exact case-sensitive match)
                                   # encryption context to decrypt the data. An encryption context is supported only on operations with symmetric encryption KMS keys
  disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION
  use_custom_storage_class: false  # S3_USE_CUSTOM_STORAGE_CLASS
  storage_class: STANDARD          # S3_STORAGE_CLASS, by default allow only from list https://github.com/aws/aws-sdk-go-v2/blob/main/service/s3/types/enums.go#L787-L799
  concurrency: 1                   # S3_CONCURRENCY
  part_size: 0                     # S3_PART_SIZE, if less than or equal to 0 then it is calculated as max_file_size / max_parts_count, between 5MB and 5GB
  max_parts_count: 10000           # S3_MAX_PARTS_COUNT, number of parts for S3 multipart uploads
  allow_multipart_download: false  # S3_ALLOW_MULTIPART_DOWNLOAD, allow faster download and upload speeds, but will require additional disk space, download_concurrency * part size in worst case
  checksum_algorithm: ""           # S3_CHECKSUM_ALGORITHM, use it when you use object lock, which prevents deleting keys from the bucket until some timeout after creation; use CRC32 as the fastest

  # S3_OBJECT_LABELS, allows setting metadata for each object during upload; use {macro_name} from system.macros and {backupName} for the current backup name
  # The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
  object_labels: {}
  # S3_CUSTOM_STORAGE_CLASS_MAP, allows setting the storage class depending on the backup name regexp pattern, format nameRegexp > className
  custom_storage_class_map: {}
  # S3_REQUEST_PAYER, defines who will pay for requests, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html for details; possible value: requester, if empty then the bucket owner pays
  request_payer: ""
  debug: false                     # S3_DEBUG
gcs:
  credentials_file: ""         # GCS_CREDENTIALS_FILE
  credentials_json: ""         # GCS_CREDENTIALS_JSON
  credentials_json_encoded: "" # GCS_CREDENTIALS_JSON_ENCODED
  # look https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create how to get HMAC keys for access to bucket
  embedded_access_key: ""      # GCS_EMBEDDED_ACCESS_KEY, use it when `use_embedded_backup_restore: true`, `embedded_backup_disk: ""`, `remote_storage: gcs`
  embedded_secret_key: ""      # GCS_EMBEDDED_SECRET_KEY, use it when `use_embedded_backup_restore: true`, `embedded_backup_disk: ""`, `remote_storage: gcs`
  skip_credentials: false      # GCS_SKIP_CREDENTIALS, skip adding credentials to requests to allow anonymous access to the bucket
  endpoint: ""                 # GCS_ENDPOINT, use it for custom GCS endpoint/compatible storage. For example, when using custom endpoint via private service connect
  bucket: ""                   # GCS_BUCKET
  path: ""                     # GCS_PATH, `system.macros` values can be applied as {macro_name}
  object_disk_path: ""         # GCS_OBJECT_DISK_PATH, path for backup of parts from the `s3` object disk (ClickHouse supports GCS only over the S3 protocol); if the disk is present, this shall not be empty and shall not be prefixed by `path`
  compression_level: 1         # GCS_COMPRESSION_LEVEL
  compression_format: tar      # GCS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  storage_class: STANDARD      # GCS_STORAGE_CLASS
  chunk_size: 0                # GCS_CHUNK_SIZE, default 16 * 1024 * 1024 (16MB)
  client_pool_size: 500        # GCS_CLIENT_POOL_SIZE, default max(upload_concurrency, download_concurrency) * 3, should be at least 3 times bigger than `UPLOAD_CONCURRENCY` or `DOWNLOAD_CONCURRENCY` in each upload and download case to avoid getting stuck
  # GCS_OBJECT_LABELS, allows setting metadata for each object during upload; use {macro_name} from system.macros and {backupName} for the current backup name
  # The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
  object_labels: {}
  # GCS_CUSTOM_STORAGE_CLASS_MAP, allows setting the storage class depending on the backup name regexp pattern, format nameRegexp > className
  custom_storage_class_map: {}
  debug: false                 # GCS_DEBUG
  force_http: false            # GCS_FORCE_HTTP
cos:
  url: ""                      # COS_URL
  timeout: 2m                  # COS_TIMEOUT
  secret_id: ""                # COS_SECRET_ID
  secret_key: ""               # COS_SECRET_KEY
  path: ""                     # COS_PATH, `system.macros` values can be applied as {macro_name}
  compression_format: tar      # COS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  compression_level: 1         # COS_COMPRESSION_LEVEL
ftp:
  address: ""                  # FTP_ADDRESS in format `host:port`
  timeout: 2m                  # FTP_TIMEOUT
  username: ""                 # FTP_USERNAME
  password: ""                 # FTP_PASSWORD
  tls: false                   # FTP_TLS
  tls_skip_verify: false       # FTP_TLS_SKIP_VERIFY
  path: ""                     # FTP_PATH, `system.macros` values can be applied as {macro_name}
  compression_format: tar      # FTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  compression_level: 1         # FTP_COMPRESSION_LEVEL
  debug: false                 # FTP_DEBUG
sftp:
  address: ""                  # SFTP_ADDRESS
  username: ""                 # SFTP_USERNAME
  password: ""                 # SFTP_PASSWORD
  port: 22                     # SFTP_PORT
  key: ""                      # SFTP_KEY
  path: ""                     # SFTP_PATH, `system.macros` values can be applied as {macro_name}
  concurrency: 1               # SFTP_CONCURRENCY
  compression_format: tar      # SFTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brotli, zstd, `none` for uploading data part folders as is
  compression_level: 1         # SFTP_COMPRESSION_LEVEL
  debug: false                 # SFTP_DEBUG
custom:
  upload_command: ""           # CUSTOM_UPLOAD_COMMAND
  download_command: ""         # CUSTOM_DOWNLOAD_COMMAND
  delete_command: ""           # CUSTOM_DELETE_COMMAND
  list_command: ""             # CUSTOM_LIST_COMMAND
  command_timeout: "4h"          # CUSTOM_COMMAND_TIMEOUT
api:
  listen: "localhost:7171"     # API_LISTEN
  enable_metrics: true         # API_ENABLE_METRICS
  enable_pprof: false          # API_ENABLE_PPROF
  username: ""                 # API_USERNAME, basic authorization for API endpoint
  password: ""                 # API_PASSWORD
  secure: false                # API_SECURE, use TLS for listen API socket
  ca_cert_file: ""             # API_CA_CERT_FILE
                               # openssl genrsa -out /etc/clickhouse-backup/ca-key.pem 4096
                               # openssl req -subj "/O=altinity" -x509 -new -nodes -key /etc/clickhouse-backup/ca-key.pem -sha256 -days 365 -out /etc/clickhouse-backup/ca-cert.pem
  private_key_file: ""         # API_PRIVATE_KEY_FILE, openssl genrsa -out /etc/clickhouse-backup/server-key.pem 4096
  certificate_file: ""         # API_CERTIFICATE_FILE,
                               # openssl req -subj "/CN=localhost" -addext "subjectAltName = DNS:localhost,DNS:*.cluster.local" -new -key /etc/clickhouse-backup/server-key.pem -out /etc/clickhouse-backup/server-req.csr
                               # openssl x509 -req -days 365000 -extensions SAN -extfile <(printf "\n[SAN]\nsubjectAltName=DNS:localhost,DNS:*.cluster.local") -in /etc/clickhouse-backup/server-req.csr -out /etc/clickhouse-backup/server-cert.pem -CA /etc/clickhouse-backup/ca-cert.pem -CAkey /etc/clickhouse-backup/ca-key.pem -CAcreateserial
  integration_tables_host: ""  # API_INTEGRATION_TABLES_HOST, allow using DNS name to connect in `system.backup_list` and `system.backup_actions`
  allow_parallel: false        # API_ALLOW_PARALLEL, enable parallel operations; this can allocate significant memory and spawn go-routines, don't enable it if you are not sure
  create_integration_tables: false # API_CREATE_INTEGRATION_TABLES, create `system.backup_list` and `system.backup_actions`
  complete_resumable_after_restart: true # API_COMPLETE_RESUMABLE_AFTER_RESTART, after API server startup, if `/var/lib/clickhouse/backup/*/(upload|download).state` present, then operation will continue in the background
  watch_is_main_process: false # WATCH_IS_MAIN_PROCESS, treats 'watch' command as a main api process, if it is stopped unexpectedly, api server is also stopped. Does not stop api server if 'watch' command canceled by the user. 
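
Putting a few of the parameters above together, a minimal working config for S3 could look like this (bucket name, credentials, and retention values are placeholders; anything omitted falls back to the defaults described above):

```yaml
# /etc/clickhouse-backup/config.yml -- minimal S3 sketch, values are placeholders
general:
  remote_storage: s3
  backups_to_keep_local: 3
  backups_to_keep_remote: 10
clickhouse:
  host: localhost
  port: 9000
s3:
  bucket: my-clickhouse-backups
  region: us-east-1
  access_key: "placeholder-access-key"
  secret_key: "placeholder-secret-key"
  path: backups/shard-{shard}   # {macro_name} values come from system.macros
```

With this in place, `clickhouse-backup create_remote` would create a local backup and upload it to the bucket, pruning old backups according to the two `backups_to_keep_*` settings.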

Concurrency, CPU and Memory usage recommendation

upload_concurrency and download_concurrency define how many parallel download / upload go-routines will start, independently of the remote storage type. In 1.3.0+ this means how many data parts will be uploaded in parallel, assuming upload_by_part and download_by_part are true (which is the default).

concurrency in the s3 section means how many concurrent upload streams will run during multipart upload in each upload go-routine. A high value for S3_CONCURRENCY and a high value for S3_PART_SIZE will allocate a lot of memory for buffers inside the AWS golang SDK.

concurrency in the sftp section means how many concurrent requests will be used for upload and download of each file.

For compression_format, a good default is tar, which uses less CPU. In most cases the data in clickhouse is already compressed, so you may not get a lot of space savings when compressing already-compressed data.

remote_storage: custom

All custom commands use the go-template language. For example, you can use {{ .cfg.* }}, {{ .backupName }}, {{ .diffFromRemote }}. A custom list_command returns JSON compatible with the metadata.BackupMetadata type in JSONEachRow format. For examples, see restic, rsync and kopia. Feel free to add your own custom storage.
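
A sketch of what such a custom section could look like (the script paths are illustrative stubs, not the exact commands from the linked examples):

```yaml
general:
  remote_storage: custom
custom:
  # go-template variables such as {{ .backupName }} and {{ .cfg.* }}
  # are expanded before each command runs
  upload_command: "/usr/local/bin/my_upload.sh {{ .backupName }}"
  download_command: "/usr/local/bin/my_download.sh {{ .backupName }}"
  delete_command: "/usr/local/bin/my_delete.sh {{ .backupName }}"
  # must print one JSON object per line (JSONEachRow), compatible with
  # the metadata.BackupMetadata type
  list_command: "/usr/local/bin/my_list.sh"
  command_timeout: "4h"
```

Each stub script is responsible for moving the local backup directory to and from whatever storage the wrapped tool supports.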

ATTENTION!

Never change file permissions in /var/lib/clickhouse/backup. This path contains hard links. Permissions on all hard links to the same data on disk are always identical. That means that if you change the permissions/owner/attributes on a hard link in backup path, permissions on files with which ClickHouse works will be changed too. That can lead to data corruption.

API

Use the clickhouse-backup server command to run as a REST API server. In general, the API attempts to mirror the CLI commands.

GET /

List all current applicable HTTP routes

POST /

POST /restart

Restart the HTTP server: close all current connections, close the listen socket, and open the listen socket again; all background go-routines are canceled via contexts.

GET /backup/kill

Kill the selected command from the GET /backup/actions command list. The kill should be near-immediate, but some go-routines (e.g. uploading one data part) could continue to run.

  • Optional query argument command may contain the command name to kill, or if it is omitted then kill the first "in progress" command.

GET /backup/tables

Print the list of tables: curl -s localhost:7171/backup/tables | jq ., excluding tables that match patterns from the skip_tables configuration parameter.

  • Optional query argument table works the same as the --table=pattern CLI argument.
  • Optional query argument remote_backup works the same as the --remote-backup=name CLI argument.

GET /backup/tables/all

Print the list of tables: curl -s localhost:7171/backup/tables/all | jq ., ignoring the skip_tables configuration parameter.

  • Optional query argument table works the same as the --table value CLI argument.
  • Optional query argument remote_backup works the same as the --remote-backup=name CLI argument.

POST /backup/create

Create new backup: curl -s localhost:7171/backup/create -X POST | jq .

  • Optional query argument table works the same as the --table value CLI argument.
  • Optional query argument partitions works the same as the --partitions value CLI argument.
  • Optional query argument name works the same as specifying a backup name with the CLI.
  • Optional query argument schema works the same as the --schema CLI argument (backup schema only).
  • Optional query argument rbac works the same as the --rbac CLI argument (backup RBAC).
  • Optional query argument configs works the same as the --configs CLI argument (backup configs).
  • Optional query argument callback allows passing a callback URL, which will be called via POST with an application/json payload {"status":"error|success","error":"not empty when an error happens"}.
  • Additional example: curl -s 'localhost:7171/backup/create?table=default.billing&name=billing_test' -X POST

Note: this operation is asynchronous, so the API will return once the operation has started.

POST /backup/watch

Run a background watch process and create a sequence of full + incremental backups: curl -s localhost:7171/backup/watch -X POST | jq . You can't run watch twice with the same parameters, even when allow_parallel: true.

  • Optional query argument watch_interval works the same as the --watch-interval value CLI argument.
  • Optional query argument full_interval works the same as the --full-interval value CLI argument.
  • Optional query argument watch_backup_name_template works the same as the --watch-backup-name-template value CLI argument.
  • Optional query argument table works the same as the --table value CLI argument (backup only selected tables).
  • Optional query argument partitions works the same as the --partitions value CLI argument (backup only selected partitions).
  • Optional query argument schema works the same as the --schema CLI argument (backup schema only).
  • Optional query argument rbac works the same as the --rbac CLI argument (backup RBAC).
  • Optional query argument configs works the same as the --configs CLI argument (backup configs).
  • Additional example: curl -s 'localhost:7171/backup/watch?table=default.billing&watch_interval=1h&full_interval=24h' -X POST

Note: this operation is asynchronous and can only be stopped with kill -s SIGHUP $(pgrep -f clickhouse-backup) or by calling /restart or /backup/kill. The API will return immediately once the operation has started.

POST /backup/clean

Clean the shadow folders using all available paths from system.disks

POST /backup/clean/remote_broken

Remove remote backups with broken status. Note: this operation is synchronous and could take a lot of time; increase HTTP timeouts during the call.

POST /backup/upload

Upload backup to remote storage: curl -s localhost:7171/backup/upload/<BACKUP_NAME> -X POST | jq .

  • Optional query argument delete-source works the same as the --delete-source CLI argument.
  • Optional query argument diff-from works the same as the --diff-from CLI argument.
  • Optional query argument diff-from-remote works the same as the --diff-from-remote CLI argument.
  • Optional query argument table works the same as the --table value CLI argument.
  • Optional query argument partitions works the same as the --partitions value CLI argument.
  • Optional query argument schema works the same as the --schema CLI argument (upload schema only).
  • Optional query argument resumable works the same as the --resumable CLI argument (save intermediate upload state and resume upload if data already exists on remote storage).
  • Optional query argument callback allows passing a callback URL, which will be called via POST with an application/json payload: {"status":"error|success","error":"not empty when error happens"}.

Note: this operation is asynchronous, so the API will return once the operation has started.

GET /backup/list/{where}

Print a list of backups: curl -s localhost:7171/backup/list | jq . Print a list of only local backups: curl -s localhost:7171/backup/list/local | jq . Print a list of only remote backups: curl -s localhost:7171/backup/list/remote | jq .

Note: the Size field is not set for local backups that have just been created or are in progress, nor for remote backups whose upload status is in progress.

POST /backup/download

Download backup from remote storage: curl -s localhost:7171/backup/download/<BACKUP_NAME> -X POST | jq .

  • Optional query argument table works the same as the --table value CLI argument.
  • Optional query argument partitions works the same as the --partitions value CLI argument.
  • Optional query argument schema works the same as the --schema CLI argument (download schema only).
  • Optional query argument resumable works the same as the --resumable CLI argument (save intermediate download state and resume download if it already exists on local storage).
  • Optional query argument callback allows passing a callback URL, which will be called via POST with an application/json payload: {"status":"error|success","error":"not empty when error happens"}.

Note: this operation is asynchronous, so the API will return once the operation has started.

POST /backup/restore

Create schema and restore data from backup: curl -s localhost:7171/backup/restore/<BACKUP_NAME> -X POST | jq .

  • Optional query argument table works the same as the --table value CLI argument.
  • Optional query argument partitions works the same as the --partitions value CLI argument.
  • Optional query argument schema works the same as the --schema CLI argument (restore schema only).
  • Optional query argument data works the same as the --data CLI argument (restore data only).
  • Optional query argument rm works the same as the --rm CLI argument (drop tables before restore).
  • Optional query argument ignore_dependencies works the same as the --ignore-dependencies CLI argument.
  • Optional query argument rbac works the same as the --rbac CLI argument (restore RBAC).
  • Optional query argument configs works the same as the --configs CLI argument (restore configs).
  • Optional query argument restore_database_mapping works the same as the --restore-database-mapping CLI argument.
  • Optional query argument callback allows passing a callback URL, which will be called via POST with an application/json payload: {"status":"error|success","error":"not empty when error happens"}.
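The restore_database_mapping argument takes origin:target pairs. A hedged sketch that assembles such a call — the backup name "bkp1", the prod/staging databases, and the rm=true boolean form are all hypothetical:

```shell
# Hedged sketch: build a restore call that maps database "prod" onto "staging"
# and drops existing objects first (rm=true form is an assumption).
MAPPING="prod:staging"
RESTORE_URL="localhost:7171/backup/restore/bkp1?restore_database_mapping=${MAPPING}&rm=true"
echo "$RESTORE_URL"
# To execute it: curl -s -X POST "$RESTORE_URL" | jq .
```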

POST /backup/delete

Delete specific remote backup: curl -s localhost:7171/backup/delete/remote/<BACKUP_NAME> -X POST | jq .

Delete specific local backup: curl -s localhost:7171/backup/delete/local/<BACKUP_NAME> -X POST | jq .

GET /backup/status

Display list of currently running asynchronous operations: curl -s localhost:7171/backup/status | jq .

POST /backup/actions

Execute multiple backup actions: curl -X POST -d '{"command":"create test_backup"}' -s localhost:7171/backup/actions
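The actions endpoint takes the same commands as the CLI, one JSON payload per request. A hedged sketch that builds payloads for a create-then-upload sequence ("actions_backup" is a hypothetical backup name; the commented curl line shows how each payload would be sent):

```shell
# Hedged sketch: build /backup/actions payloads for a create + upload sequence.
for cmd in "create actions_backup" "upload actions_backup"; do
  payload=$(printf '{"command":"%s"}' "$cmd")
  echo "$payload"
  # To send it: curl -X POST -d "$payload" -s localhost:7171/backup/actions
done
```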

GET /backup/actions

Display a list of all operations from start of API server: curl -s localhost:7171/backup/actions | jq .

  • Optional query argument filter filters actions on the server side.
  • Optional query argument last to show only the last N actions.

Storage types

S3

In order to make backups to S3, the following permissions should be set:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "clickhouse-backup-s3-access-to-files",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        },
        {
            "Sid": "clickhouse-backup-s3-access-to-bucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketVersioning"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        }
    ]
}
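With the IAM permissions in place, the matching storage section of config.yml looks roughly like this. The bucket name, region, and path are placeholders; run clickhouse-backup print-config to see the full set of keys and their defaults:

```yaml
general:
  remote_storage: s3
s3:
  bucket: "BUCKET_NAME"   # must match the ARNs in the policy above
  region: "us-east-1"
  path: "backup"          # optional prefix inside the bucket
  access_key: ""          # leave empty when using an instance profile
  secret_key: ""
```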

Examples

Simple cron script for daily backups and remote upload

#!/bin/bash
BACKUP_NAME=my_backup_$(date -u +%Y-%m-%dT%H-%M-%S)
clickhouse-backup create $BACKUP_NAME >> /var/log/clickhouse-backup.log 2>&1
exit_code=$?
if [[ $exit_code != 0 ]]; then
  echo "clickhouse-backup create $BACKUP_NAME FAILED and returned exit code $exit_code"
  exit $exit_code
fi

clickhouse-backup upload $BACKUP_NAME >> /var/log/clickhouse-backup.log 2>&1
exit_code=$?
if [[ $exit_code != 0 ]]; then
  echo "clickhouse-backup upload $BACKUP_NAME FAILED and returned exit code $exit_code"
  exit $exit_code
fi
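To schedule the script, a cron entry such as the following works; the script path and the 03:00 schedule are placeholders, not part of clickhouse-backup itself:

```
# /etc/cron.d/clickhouse-backup -- run the backup script daily at 03:00, as root
0 3 * * * root /usr/local/bin/clickhouse-backup-cron.sh
```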

Common CLI Usage

CLI command - tables

NAME:
   clickhouse-backup tables - List of tables, exclude skip_tables

USAGE:
   clickhouse-backup tables [--tables=<db>.<table>] [--remote-backup=<backup-name>] [--all]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --all, -a                                  Print table even when match with skip_tables pattern
   --table value, --tables value, -t value    List tables only match with table name patterns, separated by comma, allow ? and * as wildcard
   --remote-backup value                      List tables from remote backup
   

CLI command - create

NAME:
   clickhouse-backup create - Create new backup

USAGE:
   clickhouse-backup create [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--rbac] [--configs] [--skip-check-parts-columns] <backup_name>

DESCRIPTION:
   Create new backup

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --table value, --tables value, -t value    Create backup only matched with table name patterns, separated by comma, allow ? and * as wildcard
   --diff-from-remote value                   Create incremental embedded backup or upload incremental object disk data based on other remote backup name
   --partitions partition_id                  Create backup only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s                                      Backup schemas only, will skip data
   --rbac, --backup-rbac, --do-backup-rbac           Backup RBAC related objects
   --configs, --backup-configs, --do-backup-configs  Backup 'clickhouse-server' configuration files
   --rbac-only                                       Backup RBAC related objects only, will skip backup data, will backup schema only if --schema added
   --configs-only                                    Backup 'clickhouse-server' configuration files only, will skip backup data, will backup schema only if --schema added
   --skip-check-parts-columns                        Skip check system.parts_columns to disallow backup inconsistent column types for data parts
   
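To decide which --partitions format applies to a given table, it helps to inspect the partition and partition_id columns of system.parts directly. For example (database and table names are placeholders):

```sql
-- Compare human-readable partition values with internal partition_id values
SELECT DISTINCT database, table, partition, partition_id
FROM system.parts
WHERE database = 'default' AND table = 'billing' AND active;
```

If partition and partition_id match (numeric values), use the bare partition_id format; otherwise use the quoted tuple formats described above.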

CLI command - create_remote

NAME:
   clickhouse-backup create_remote - Create and upload new backup

USAGE:
   clickhouse-backup create_remote [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [--diff-from=<local_backup_name>] [--diff-from-remote=<local_backup_name>] [--schema] [--rbac] [--configs] [--resumable] [--skip-check-parts-columns] <backup_name>

DESCRIPTION:
   Create and upload

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --table value, --tables value, -t value    Create and upload backup only matched with table name patterns, separated by comma, allow ? and * as wildcard
   --partitions partition_id                  Create and upload backup only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --diff-from value                                 Local backup name which used to upload current backup as incremental
   --diff-from-remote value                          Remote backup name which used to upload current backup as incremental
   --schema, -s                                      Backup and upload metadata schema only, will skip data backup
   --rbac, --backup-rbac, --do-backup-rbac           Backup and upload RBAC related objects
   --configs, --backup-configs, --do-backup-configs  Backup and upload 'clickhouse-server' configuration files
   --rbac-only                                       Backup RBAC related objects only, will skip backup data, will backup schema only if --schema added
   --configs-only                                    Backup 'clickhouse-server' configuration files only, will skip backup data, will backup schema only if --schema added
   --resume, --resumable                             Save intermediate upload state and resume upload if backup exists on remote storage, ignore when 'remote_storage: custom' or 'use_embedded_backup_restore: true'
   --skip-check-parts-columns                        Skip check system.parts_columns to disallow backup inconsistent column types for data parts
   --delete, --delete-source, --delete-local         explicitly delete local backup during upload
   

CLI command - upload

NAME:
   clickhouse-backup upload - Upload backup to remote storage

USAGE:
   clickhouse-backup upload [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--diff-from=<local_backup_name>] [--diff-from-remote=<remote_backup_name>] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --diff-from value                          Local backup name which used to upload current backup as incremental
   --diff-from-remote value                   Remote backup name which used to upload current backup as incremental
   --table value, --tables value, -t value    Upload data only for matched table name patterns, separated by comma, allow ? and * as wildcard
   --partitions partition_id                  Upload backup only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s                               Upload schemas only
   --resume, --resumable                      Save intermediate upload state and resume upload if backup exists on remote storage, ignored with 'remote_storage: custom' or 'use_embedded_backup_restore: true'
   --delete, --delete-source, --delete-local  explicitly delete local backup during upload
   

CLI command - list

NAME:
   clickhouse-backup list - List of backups

USAGE:
   clickhouse-backup list [all|local|remote] [latest|previous]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - download

NAME:
   clickhouse-backup download - Download backup from remote storage

USAGE:
   clickhouse-backup download [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --table value, --tables value, -t value    Download objects which matched with table name patterns, separated by comma, allow ? and * as wildcard
   --partitions partition_id                  Download backup data only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s           Download schema only
   --resume, --resumable  Save intermediate download state and resume download if backup exists on local storage, ignored with 'remote_storage: custom' or 'use_embedded_backup_restore: true'
   

CLI command - restore

NAME:
   clickhouse-backup restore - Create schema and restore data from backup

USAGE:
   clickhouse-backup restore  [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [-s, --schema] [-d, --data] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] <backup_name>

OPTIONS:
   --config value, -c value                    Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value   override any environment variable via CLI parameter
   --table value, --tables value, -t value     Restore only database and objects which matched with table name patterns, separated by comma, allow ? and * as wildcard
   --restore-database-mapping value, -m value  Define the rule to restore data. For the database not defined in this struct, the program will not deal with it.
   --partitions partition_id                   Restore backup only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s                                        Restore schema only
   --data, -d                                          Restore data only
   --rm, --drop                                        Drop exists schema objects before restore
   -i, --ignore-dependencies                           Ignore dependencies when drop exists schema objects
   --rbac, --restore-rbac, --do-restore-rbac           Restore RBAC related objects
   --configs, --restore-configs, --do-restore-configs  Restore 'clickhouse-server' CONFIG related files
   --rbac-only                                         Restore RBAC related objects only, will skip backup data, will backup schema only if --schema added
   --configs-only                                      Restore 'clickhouse-server' configuration files only, will skip backup data, will backup schema only if --schema added
   

CLI command - restore_remote

NAME:
   clickhouse-backup restore_remote - Download and restore

USAGE:
   clickhouse-backup restore_remote [--schema] [--data] [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] [--skip-rbac] [--skip-configs] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                    Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value   override any environment variable via CLI parameter
   --table value, --tables value, -t value     Download and restore objects which matched with table name patterns, separated by comma, allow ? and * as wildcard
   --restore-database-mapping value, -m value  Define the rule to restore data. For the database not defined in this struct, the program will not deal with it.
   --partitions partition_id                   Download and restore backup only for selected partition names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s                                        Download and Restore schema only
   --data, -d                                          Download and Restore data only
   --rm, --drop                                        Drop schema objects before restore
   -i, --ignore-dependencies                           Ignore dependencies when drop exists schema objects
   --rbac, --restore-rbac, --do-restore-rbac           Download and Restore RBAC related objects
   --configs, --restore-configs, --do-restore-configs  Download and Restore 'clickhouse-server' CONFIG related files
   --rbac-only                                         Restore RBAC related objects only, will skip backup data, will backup schema only if --schema added
   --configs-only                                      Restore 'clickhouse-server' configuration files only, will skip backup data, will backup schema only if --schema added
   --resume, --resumable                               Save intermediate upload state and resume upload if backup exists on remote storage, ignored with 'remote_storage: custom' or 'use_embedded_backup_restore: true'
   

CLI command - delete

NAME:
   clickhouse-backup delete - Delete specific backup

USAGE:
   clickhouse-backup delete <local|remote> <backup_name>

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - default-config

NAME:
   clickhouse-backup default-config - Print default config

USAGE:
   clickhouse-backup default-config [command options] [arguments...]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - print-config

NAME:
   clickhouse-backup print-config - Print current config merged with environment variables

USAGE:
   clickhouse-backup print-config [command options] [arguments...]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - clean

NAME:
   clickhouse-backup clean - Remove data in 'shadow' folder from all 'path' folders available from 'system.disks'

USAGE:
   clickhouse-backup clean [command options] [arguments...]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - clean_remote_broken

NAME:
   clickhouse-backup clean_remote_broken - Remove all broken remote backups

USAGE:
   clickhouse-backup clean_remote_broken [command options] [arguments...]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   

CLI command - watch

NAME:
   clickhouse-backup watch - Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences

USAGE:
   clickhouse-backup watch [--watch-interval=1h] [--full-interval=24h] [--watch-backup-name-template=shard{shard}-{type}-{time:20060102150405}] [-t, --tables=<db>.<table>] [--partitions=<partitions_names>] [--schema] [--rbac] [--configs] [--skip-check-parts-columns]

DESCRIPTION:
   Execute create_remote + delete local, create full backup every `--full-interval`, create and upload incremental backup every `--watch-interval` use previous backup as base with `--diff-from-remote` option, use `backups_to_keep_remote` config option for properly deletion remote backups, will delete old backups which not have references from other backups

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --watch-interval value                     Interval for run 'create_remote' + 'delete local' for incremental backup, look format https://pkg.go.dev/time#ParseDuration
   --full-interval value                      Interval for run 'create_remote'+'delete local' when stop create incremental backup sequence and create full backup, look format https://pkg.go.dev/time#ParseDuration
   --watch-backup-name-template value         Template for new backup name, could contain names from system.macros, {type} - full or incremental and {time:LAYOUT}, look to https://go.dev/src/time/format.go for layout examples
   --table value, --tables value, -t value    Create and upload only objects which matched with table name patterns, separated by comma, allow ? and * as wildcard
   --partitions partition_id                  Partitions names, separated by comma
If PARTITION BY clause returns numeric not hashed values for partition_id field in system.parts table, then use --partitions=partition_id1,partition_id2 format
If PARTITION BY clause returns hashed string values, then use --partitions=('non_numeric_field_value_for_part1'),('non_numeric_field_value_for_part2') format
If PARTITION BY clause returns tuple with multiple fields, then use --partitions=(numeric_value1,'string_value1','date_or_datetime_value'),(...) format
If you need different partitions for different tables, then use --partitions=db.table1:part1,part2 --partitions=db.table?:*
Values depends on field types in your table, use single quotes for String and Date/DateTime related types
Look at the system.parts partition and partition_id fields for details https://clickhouse.com/docs/en/operations/system-tables/parts/
   --schema, -s                                      Schemas only
   --rbac, --backup-rbac, --do-backup-rbac           Backup RBAC related objects only
   --configs, --backup-configs, --do-backup-configs  Backup `clickhouse-server' configuration files only
   --skip-check-parts-columns                        Skip check system.parts_columns to disallow backup inconsistent column types for data parts
   

CLI command - server

NAME:
   clickhouse-backup server - Run API server

USAGE:
   clickhouse-backup server [command options] [arguments...]

OPTIONS:
   --config value, -c value                   Config 'FILE' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]
   --environment-override value, --env value  override any environment variable via CLI parameter
   --watch                                    Run watch go-routine for 'create_remote' + 'delete local', after API server startup
   --watch-interval value                     Interval for run 'create_remote' + 'delete local' for incremental backup, look format https://pkg.go.dev/time#ParseDuration
   --full-interval value                      Interval for run 'create_remote'+'delete local' when stop create incremental backup sequence and create full backup, look format https://pkg.go.dev/time#ParseDuration
   --watch-backup-name-template value         Template for new backup name, could contain names from system.macros, {type} - full or incremental and {time:LAYOUT}, look to https://go.dev/src/time/format.go for layout examples
   

More use cases of clickhouse-backup

Original Author

Altinity wants to thank @AlexAkulov for creating this tool and for his valuable contributions.

clickhouse-backup's People

Contributors

alexakulov, anikin-aa, anuriq, besteffects, combin, dependabot[bot], develar, elpeque, excitoon, farbodsalimi, felixoid, hodgesrm, lqhl, minguyen9988, mskwon, nicksherron, nikk0, nmcclain, o-mdr, rodrigargar, sanadhis, slach, tadus21, tvion, umang8223, vahid-sohrabloo, wangzhen11aaa, yuzhichang, zekker6, zvonand

clickhouse-backup's Issues

Unable to backup table.

Hi,
I have been trying to get this utility working for the past 3 days without success. I am able to connect to the ClickHouse database, but the create command isn't working.

$ clickhouse-backup tables -c ~/my_config.yml
This gives a list of tables.
However


$ clickhouse-backup create -c ~/my_config.yml bkp2

Output-

2019/05/01 15:50:11 Create backup 'bkp2'
2019/05/01 15:50:11 can't get partitions for "default.tmp_employee" with code: 47, message: Unknown identifier: partition_id

Details-
Ubuntu 16.04 LTS
Clickhouse version- 1.1.54370
Clickhouse-backup (latest 26Apr2019)

What could be wrong, any idea ?

Ignore schema

When I ignore some tables, it looks okay. But then, when I try restore-schema, it still tries to create them.
ch: 19.3.8
CLICKHOUSE_SKIP_TABLES=system.,default.test1,default..inner.

Inherit global options

The CLI interface should inherit global options. It is unintuitive to prefix each command with global options, as opposed to using them everywhere.

Upload to S3 from Yandex Cloud

Hi!
Tell me, should uploading a backup to S3 on Yandex Cloud work? I get an error:
can't upload with 403: "There were headers present in the request which were not signed"

Can't restore from backup

The metadata was restored. After that, I try to restore the data:

sudo ./clickhouse-backup -c ./config.yml restore-data $BACKUP_NAME

2019/05/03 15:16:55 Prepare data for restoring 'analytics.events'
2019/05/03 15:16:55 ALTER TABLE analytics.events ATTACH PARTITION ID '201905'
2019/05/03 15:16:55 can't attach partitions for table 'analytics.events' with code: 226, message: No columns in part 201905_1_1128_227

Error while creating backups of tables with DISTRIBUTED engine

I have observed an issue when creating a backup of a table that uses the Distributed engine:

$ sudo docker run --rm -it --network host -v "/mnt/clickhouse:/var/lib/clickhouse" -e S3_BUCKET=test   alexakulov/clickhouse-backup create db_layer_0.sikandar_replicated
2019/09/09 16:35:04 Create backup 'db_layer_0.sikandar_replicated'
2019/09/09 16:35:04 Freeze 'db_layer_0.sikandar_replicated'
2019/09/09 16:35:04 Freeze 'db_layer_1.sikandar_replicated'
2019/09/09 16:35:04 Freeze 'default.sikandar'
2019/09/09 16:35:04 can't freeze 'default.sikandar' with: code: 48, message: Partition operations are not supported by storage Distributed

It looks like the issue is related to the fact that a distributed table does not hold data itself, and thus doesn't have any partitions.

CREATE TABLE sikandar
(
    sikandarDay Date, 
    sikandarId String, 
    sikandarAge UInt32
)
ENGINE = Distributed(level0, db_layer_0, sikandar_replicated, sipHash64(sikandarId));

The way the current freeze works in the code (https://github.com/AlexAkulov/clickhouse-backup/blob/5d3a0d0196d58eb00cad915738795a120554f1ed/clickhouse.go#L178) fails:

d40c477505fb :) ALTER TABLE default.sikandar FREEZE

ALTER TABLE default.sikandar
    FREEZE


Received exception from server (version 19.4.1):
Code: 48. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: Partition operations are not supported by storage Distributed. 

Is distributed table backup supported now, or have I missed something?

BR,
Aleksandr

hard links in /var/lib/clickhouse/shadow are not treated well

During multiple freeze executions, partitions that weren't changed just create an additional hard link to the already existing file. This can be seen from the command line:

# find shadow -xdev -samefile shadow/10/data/default/ontime/2009_1_1_0/checksums.txt -print
shadow/10/data/default/ontime/2009_1_1_0/checksums.txt
shadow/28/data/default/ontime/2009_1_1_0/checksums.txt

Hard links pointing to the same file should be handled by clickhouse-backup, but the current behaviour is as follows:

  • for the tree strategy they are uploaded to S3 as duplicates, and during download are created as separate files with the same content
  • for the archive strategy hard links are saved correctly in tar, and during download are recreated as hard links, so the result is the same structure as in the shadow directory. N.B. after the #5 merge

But both strategies suffer from the restore behaviour: it doesn't distinguish these kinds of links, so duplicate partitions are copied/moved to the detached folder multiple times and attached to the table. You end up with duplicate rows in your tables.

I plan to work on this, but wanted to discuss the way to fix it:

  • Shall we fix this on the upload/download side, so we have a clean single copy of the data in the backup folder?
  • Or fix it on the restore side: make it understand these links and process them?

P.S. To avoid this issue right now, the shadow directory should be cleaned after every backup; the clean command may be used.
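The duplicate-link situation above can be reproduced and detected with plain coreutils. This sketch (hypothetical paths, GNU stat/find) shows that both snapshot names share one inode, which is what a backup tool would need to check to keep a single copy per inode:

```shell
set -e
# Simulate two freeze snapshots that hard-link the same part file
dir=$(mktemp -d)
mkdir -p "$dir/shadow/10/data" "$dir/shadow/28/data"
echo "checksums" > "$dir/shadow/10/data/checksums.txt"
ln "$dir/shadow/10/data/checksums.txt" "$dir/shadow/28/data/checksums.txt"

# Both names point at the same inode, so the link count is 2 ...
links=$(stat -c %h "$dir/shadow/10/data/checksums.txt")
echo "link count: $links"

# ... and find -samefile enumerates every name for that inode
count=$(find "$dir/shadow" -xdev -samefile "$dir/shadow/10/data/checksums.txt" | wc -l)
echo "paths sharing inode: $count"
rm -rf "$dir"
```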

Empty data when creating a backup

Starting clickhouse-backup
Attaching to clickhouse-backup
clickhouse-backup | 2019/11/14 13:43:23 Create backup '1223456789'
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.act
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.events
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.notifies
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.redirect
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.url
clickhouse-backup | 2019/11/14 13:43:23 Freeze b.actual
clickhouse-backup | 2019/11/14 13:43:23 Freeze m.events
clickhouse-backup | 2019/11/14 13:43:23 Freeze m.notifies
clickhouse-backup | 2019/11/14 13:43:23 Freeze m.op
clickhouse-backup | 2019/11/14 13:43:23 Freeze m.redirects
clickhouse-backup | 2019/11/14 13:43:23 Freeze m.url
clickhouse-backup | 2019/11/14 13:43:23 Copy metadata
clickhouse-backup | 2019/11/14 13:43:23 Done.
clickhouse-backup | 2019/11/14 13:43:23 Move shadow
clickhouse-backup | 2019/11/14 13:43:23 Done.

clickhouse-backup connects to the database and sees the tables (as the log shows),
but in the end the backup and shadow folders turn out empty, without any data,
even though the tables from the log do contain data;
there is also data in the /var/lib/clickhouse/shadow and /var/lib/clickhouse/metadata directories.

I can't figure out what the problem is.

Upload to s3 error

Hello,
Sometimes I get errors when uploading files to S3:

2019/06/19 08:45:01 Upload backup 'clickhouse-2019-06-19-1560933864'\n2019/06/19 10:39:52 can't upload with shadow/***/***/***/***.bin: copying contents: Put https://***-backup-clickhouse-us.s3.amazonaws.com/clickhouse_backups_incremental/***-clickhouse-us1/clickhouse-2019-06-19-1560933864.tar.lz4?partNumber=760&uploadId=***: net/http: timeout awaiting response headers
2019/06/18 06:05:39 Upload backup '2019-06-18T06-05-19'\n2019/06/18 06:35:35 can't upload with /shadow/***/***/***/***/***/***.bin: copying contents: Put https://***-backup-clickhouse-us.s3.amazonaws.com/clickhouse_backups/***-clickhouse-us1/2019-06-18T06-05-19.tar.lz4?partNumber=209&uploadId=***: dial tcp ***:443: i/o timeout  

Upload exits with status 1 after this :(
Does the app retry uploading a part a few times, or does it stop after the first failed attempt?

Regards,
Andrii

Exclude tables

Hello,
It would be nice to add an --exclude option to back up all tables except the specified ones. Is it possible to add this?

Example
clickhouse-backup create --exclude=<db>.<table1>,<db>.<table2>

Regards,
Andrii

Download problem

root@stats:~# clickhouse-backup list
Local backups:
2019-04-08T18-23-33
Backups on S3:
2019-04-08T18-12-05.tar
root@stats:~# clickhouse-backup download
Select backup for download:
2019-04-08T18-12-05.tar
root@stats:~# clickhouse-backup download 2019-04-08T18-12-05.tar
2019/04/08 21:36:24 404: "The specified key does not exist."

Why so? :(

Can't restore from backup

Hi!

I can't restore from backup table (ReplicatedReplacingMergeTree).

clickhouse-backup restore-data -t marketing.a_events "2019-07-31T16-27-49"

CLI:

Prepare data for restoring 'marketing.a_events'
ALTER TABLE marketing.a_events ATTACH PARTITION 200812
can't attach partitions for table 'marketing.a_events' with code: 33, message: Cannot read all data. Bytes read: 27. Bytes expected: 74.

ClickHouse logs:

  1. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x22) [0x781c272]
  2. /usr/bin/clickhouse-server(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int)+0x22) [0x3a0a3e2]
  3. /usr/bin/clickhouse-server(DB::ReadBuffer::readStrict(char*, unsigned long)+0x181) [0x3a19f71]
  4. /usr/bin/clickhouse-server(DB::DataTypeString::deserializeBinary(DB::IColumn&, DB::ReadBuffer&) const+0x194) [0x66e9334]
  5. /usr/bin/clickhouse-server(DB::MergeTreeDataPart::loadIndex()+0x22c) [0x6a5ffec]
  6. /usr/bin/clickhouse-server(DB::MergeTreeDataPart::loadColumnsChecksumsIndexes(bool, bool)+0x58) [0x6a612d8]
  7. /usr/bin/clickhouse-server(DB::MergeTreeData::loadPartAndFixMetadata(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)+0x172) [0x6a3b432]
  8. /usr/bin/clickhouse-server(DB::StorageReplicatedMergeTree::attachPartition(std::shared_ptrDB::IAST const&, bool, DB::Context const&)+0x237) [0x69c6ac7]
  9. /usr/bin/clickhouse-server(DB::StorageReplicatedMergeTree::alterPartition(std::shared_ptrDB::IAST const&, DB::PartitionCommands const&, DB::Context const&)+0x193) [0x69c93a3]
  10. /usr/bin/clickhouse-server(DB::InterpreterAlterQuery::execute()+0x578) [0x6dc1b18]
  11. /usr/bin/clickhouse-server() [0x688ca65]
  12. /usr/bin/clickhouse-server(DB::executeQuery(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, DB::Context&, bool, DB::QueryProcessingStage::Enum, bool)+0x74) [0x688e6d4]
  13. /usr/bin/clickhouse-server(DB::TCPHandler::runImpl()+0x830) [0x3a15c20]
  14. /usr/bin/clickhouse-server(DB::TCPHandler::run()+0x2b) [0x3a1627b]
  15. /usr/bin/clickhouse-server(Poco::Net::TCPServerConnection::start()+0xf) [0x725666f]
  16. /usr/bin/clickhouse-server(Poco::Net::TCPServerDispatcher::run()+0xe9) [0x7256da9]
  17. /usr/bin/clickhouse-server(Poco::PooledThread::run()+0x81) [0x7927e41]
  18. /usr/bin/clickhouse-server(Poco::ThreadImpl::runnableEntry(void*)+0x38) [0x7924248]
  19. /usr/bin/clickhouse-server() [0xb2ac5bf]
  20. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f2dc4f9a6db]
  21. /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f2dc472188f]

can't get Clickhouse tables with: code: 47, message: Missing columns: 'data_path'

ClickHouse version is 19.15.3.6.

When I'm trying to create a backup, I get an error:

can't get Clickhouse tables with: code: 47, message: Missing columns: 'data_path' while processing query: 'SELECT database, name, is_temporary, data_path, metadata_path FROM system.tables WHERE (data_path != '') AND (is_
temporary = 0) AND (engine LIKE '%MergeTree')', required columns: 'data_path' 'is_temporary' 'engine' 'database' 'name' 'metadata_path', source columns: 'primary_key' 'storage_policy' 'sorting_key' 'data_paths' 'partition_key' 'engine_full'
'is_temporary' 'database' 'sampling_key' 'create_table_query' 'engine' 'dependencies_table' 'name' 'metadata_path' 'metadata_modification_time' 'dependencies_database'

Can't back up a ClickHouse DB running in Kubernetes

I am trying to back up a database that runs in a Docker container from the official Yandex image. This image runs in a Kubernetes cluster. I made a separate pod with the alexakulov/clickhouse-backup image and tried to make a backup, and got this error when I started the utility:

2019/09/26 11:08:49 envconfig.Process: assigning CLICKHOUSE_CLICKHOUSE_PORT to Port: converting 'tcp://192.168.186.165:9000' to type uint. details: strconv.ParseUint: parsing "tcp://192.168.186.165:9000": invalid syntax

My config:
clickhouse:
  username: default
  password: ""
  host: clickhouse #this is address of host with clickhouse server db
  port: 9000
  data_path: ""
  skip_tables:
  - system.*
s3:
  access_key: ""
  secret_key: ""
  bucket: ""
  endpoint: ""
  region: us-east-1
  acl: private
  force_path_style: false
  path: ""
  disable_ssl: false
  disable_progress_bar: false
  part_size: 104857600
  strategy: ""
  backups_to_keep_local: 0
  backups_to_keep_s3: 0
  compression_level: 1
  compression_format: lz4

I run this command:
clickhouse-backup create my_backup
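The error above looks like a known Kubernetes side effect: for a Service named clickhouse, the kubelet injects docker-link-style variables such as CLICKHOUSE_PORT=tcp://<ip>:9000, which collides with the integer CLICKHOUSE_PORT that envconfig expects. A sketch of the collision and two workarounds (the IP here is the one from the report; everything else is illustrative):

```shell
# Kubernetes injects docker-link style variables for every Service;
# a Service named "clickhouse" yields CLICKHOUSE_PORT=tcp://<ip>:9000,
# which collides with the integer CLICKHOUSE_PORT envconfig expects.
CLICKHOUSE_PORT='tcp://192.168.186.165:9000'

# Workaround 1: recover the bare port by stripping up to the last ':'
port="${CLICKHOUSE_PORT##*:}"
echo "$port"   # prints 9000

# Workaround 2: unset the variable so the port from config.yml wins
unset CLICKHOUSE_PORT
```

Newer Kubernetes also lets you set enableServiceLinks: false in the Pod spec to avoid the injection entirely.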

only metadata is getting backed up, no hard links to data in the shadow folder

I tried backing up a Log table:
./clickhouse-backup --config config.yml create --tables=default.test_Log

Output
2019/06/01 00:54:05 Create backup '2019-06-01T04-54-05'
2019/06/01 00:54:05 Freeze 'default.test_Log'
2019/06/01 00:54:05 Copy metadata
2019/06/01 00:54:05 Done.
2019/06/01 00:54:05 Move shadow
2019/06/01 00:54:05 Done.

However, only metadata was backed up under /var/lib/clickhouse/backup. Where are the contents of shadow being moved to?

Problem with restore

I try to restore:

2019/04/10 10:38:33 Attach partitions for xxxx increment 175:
2019/04/10 10:38:33 ALTER TABLE xxxx ATTACH PARTITION 201901
2019/04/10 10:38:33 can't attach partitions for table xxx.xxx with code: 228, message: /var/lib/clickhouse/data/xxx/xxxx/detached/20190101_20190118_9742_9747_1/date.bin has unexpected size: 62464 instead of 1408782

can't create backup with mkdir /mnt: read-only file system

Our ClickHouse data path is /mnt/data1/clickhouse, but running
./clickhouse_backup --config ./config.yml create -t table table_20191216.bin
got

2019/12/16 16:16:43 can't create backup with mkdir /mnt: read-only file system

How do I set the correct backup path in the config file?

Thanks

Allow not to use credentials.json when running on GCE

I want to add the option of not using credentials.json with Google Cloud Storage (GCS) when the application is running on Google Compute Engine (GCE), because there is a way to use the default service account. What do you think about it?

Expose default database name in config

My ClickHouse setup does not contain a database named 'default',
so I cannot create a backup: I get the error 'can't connect to clickouse with: code: 81, message: Database default doesn't exist'.
The config file does not contain a database option.
Am I missing something?

How to restore replicated tables?

Hello,

I need to backup all my replicated tables locally, before doing some dangerous operations.
Backup is obvious: clickhouse-backup create my-backup.
After deleting or modifying some data, how do I restore it?

  • Drop the database, then clickhouse-backup restore my-backup
    Replicated hosts will also drop all data and then copy all partitions after the restore. This could take long when the database is large.
  • No drop, then clickhouse-backup restore my-backup
    Previously existing data will be duplicated, and also copied to the replicas.
  • Delete the modified partitions from the database, keep only the modified partitions in the backup folder, then clickhouse-backup restore my-backup
    This looks better, since only the needed partitions will be copied to the replicas, but the manual step is dangerous.
    How can I do that in an automatic way?

Dockerfile and docker image on dockerhub

Hello,
could you add a Dockerfile and a Docker image on Docker Hub?
(something like this)

# Build container
FROM golang:1.12.1-stretch

RUN git clone https://github.com/AlexAkulov/clickhouse-backup.git /go/src/clickhouse-backup
WORKDIR /go/src/clickhouse-backup
RUN go get
RUN go build -o /clickhouse-backup .

# Run container
FROM ubuntu:18.04

RUN apt-get update
RUN apt-get install -yqq  \
    ca-certificates \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*
COPY --from=0 /clickhouse-backup /usr/local/bin/
RUN chmod +x /usr/local/bin/clickhouse-backup
ENTRYPOINT [ "clickhouse-backup" ]
CMD [ "--help" ]

[QUESTION/BUG] Freeze query fails due to syntax error: CH version is 19.5.3

Hi!

I have ClickHouse version 19.5.3, and the query fails:

ALTER TABLE db_test.test_table FREEZE PARTITION ID '201905'

Error output:

Code: 368. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: std::bad_typeid.

So basically, if I remove ID from the query, everything works fine.

Is this issue caused by the clickhouse-server version?

UPDATE:

Also, I've checked the official docs and a few links, and indeed ID is omitted there.

Crash when trying to backup an empty database

Crash with this version if ClickHouse is empty:

root@clickhouse00:/tmp# clickhouse-backup create
2019/04/22 03:49:40 Create backup '2019-04-22T03-49-40'
2019/04/22 03:49:40 There are no tables in Clickhouse, create something to freeze.
2019/04/22 03:49:40 Copy metadata
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x9ce303]

goroutine 1 [running]:
main.copyPath.func1(0xc000632080, 0x10, 0x0, 0x0, 0xc2a680, 0xc00009a8a0, 0x30, 0xad74c0)
	/home/travis/gopath/src/github.com/AlexAkulov/clickhouse-backup/utils.go:101 +0x413
path/filepath.Walk(0xc000632080, 0x10, 0xc00009a870, 0xc000022390, 0x2b)
	/home/travis/.gimme/versions/go1.11.4.linux.amd64/src/path/filepath/path.go:402 +0x6a
main.copyPath(0xc000632080, 0x10, 0xc000022390, 0x2b, 0x0, 0xa, 0x2328)
	/home/travis/gopath/src/github.com/AlexAkulov/clickhouse-backup/utils.go:85 +0x91
main.createBackup(0xc000024f58, 0x7, 0x0, 0x0, 0xc000024f80, 0xa, 0x2328, 0xc000024fb8, 0x7, 0xc000065fa0, ...)
	/home/travis/gopath/src/github.com/AlexAkulov/clickhouse-backup/main.go:466 +0x3db
main.main.func6(0xc0000aa9a0, 0x0, 0xc0000aa9a0)
	/home/travis/gopath/src/github.com/AlexAkulov/clickhouse-backup/main.go:107 +0x1a4
github.com/urfave/cli.HandleAction(0xa4b6c0, 0xb62768, 0xc0000aa9a0, 0xc000078a00, 0x0)
	/home/travis/gopath/pkg/mod/github.com/urfave/[email protected]/app.go:490 +0xc8
github.com/urfave/cli.Command.Run(0xb3dca0, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0xb54c60, 0x2b, 0xb5baa8, ...)
	/home/travis/gopath/pkg/mod/github.com/urfave/[email protected]/command.go:210 +0x9a2
github.com/urfave/cli.(*App).Run(0xc0001651e0, 0xc00000c060, 0x2, 0x2, 0x0, 0x0)
	/home/travis/gopath/pkg/mod/github.com/urfave/[email protected]/app.go:255 +0x687
main.main()
	/home/travis/gopath/src/github.com/AlexAkulov/clickhouse-backup/main.go:184 +0xa9c

Do not delete files from s3 if they were not found locally

Currently, if some files are missing from the local folder, they will be deleted from S3 when clickhouse-backup upload is run.

Consider the case where freeze is run only for the last few days' partitions (daily partition key), but it is necessary to keep the files for previous days on S3 and not delete them.

Docker problem

Hello.
The tool in the official Docker image does not work. This happens because it was built in another environment.

docker exec -ti clickhouse-backup sh
/ # clickhouse-backup create
2019/07/22 11:50:15 Create backup '2019-07-22T11-50-15'
2019/07/22 11:50:15 can't connect to clickouse with: could not load time location: open /home/travis/.gimme/versions/go1.11.4.linux.amd64/lib/time/zoneinfo.zip: no such file or directory

simplify cmd line

It would be nice to simplify the command line to allow backup or restore in a single line, for example:

  • backup (create + upload)
  • restore (download + restore-schema + restore-data)
    or allow multiple commands per line.

backup does not work

./clickhouse-backup -config config.yml create
2019/06/06 16:11:51 Create backup '2019-06-06T13-11-51'
2019/06/06 16:11:51 Freeze 'pf.blogger_stat'
2019/06/06 16:11:51 partition '201905'
2019/06/06 16:11:51 partition '201906'
2019/06/06 16:11:51 Freeze 'pf.migration'
2019/06/06 16:11:51 partition '197001'
2019/06/06 16:11:52 can't freeze partition '197001' on 'pf.migration' with: code: 368, message: std::bad_typeid

root@pclickhouse:/lib65/clickhouse-backup# clickhouse-server -V
ClickHouse server version 19.5.3.8 (official build).

From the console it works:

localhost :) ALTER TABLE pf.migration FREEZE PARTITION ''

ALTER TABLE pf.migration
FREEZE PARTITION ''

Ok.

0 rows in set. Elapsed: 0.091 sec.

Backup Duration

I'm curious, how does AWS work for you?
It takes half a day for every TB of data.
Am I missing something?
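For comparison, upload throughput depends heavily on the multipart part size and compression settings. A hedged config sketch (values are illustrative) using keys that already appear in configs quoted elsewhere in this thread:

```yaml
s3:
  part_size: 536870912      # 512 MiB parts: fewer requests per TB
  compression_format: lz4   # fast compression, low CPU cost
  compression_level: 1
```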

FREEZE WITH NAME

ClickHouse supports freezing a table with a given name:

ALTER TABLE cdp_tags FREEZE WITH NAME 'abc';

...
2019.11.30 09:47:21.941618 [ 608 ] {a6445f03-179f-4970-bb3c-a57a65af3377} <Debug> default.cdp_tags: Freezing part f6b4ea42fb9c719593aebbee00e5526f_2_835_20_862 snapshot will be placed at /var/lib/clickhouse/shadow/abc/

pkg/chbackup/backup.go Freeze(config Config, tablePattern string) can be improved so as not to require the shadow directory to be empty.

memory leak while uploading backup to S3

rss 4.1g( 4253984), repo size ~150GB

c000000000-c0fc000000 rw-p 00000000 00:00 0
Size: 4128768 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 3994276 kB
Pss: 3994276 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 202380 kB
Private_Dirty: 3791896 kB
Referenced: 3862512 kB
Anonymous: 3994276 kB
LazyFree: 202376 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac

Version: v0.5.0
Git Commit: 0655043
Build Date: 2019-10-27

download failed on fresh clickhouse

Hello,

I try to restore a backup on a fresh clickhouse with v0.4.1:

clickhouse-backup download backup1-1
mkdir /var/lib/clickhouse/backup/backup1-1: no such file or directory

It looks like the backup folder is only created by the create command.

"driver: bad connection " Error during table freeze on table with big amount of partitions

As I understand it, after upgrading ClickHouse to version 19.11.12.69, clickhouse-backup starts to use the FreezeTable method instead of FreezeTableOldWay, which uses ALTER TABLE %v.%v FREEZE; (instead of FREEZE PARTITION).
So when I try to freeze a table with a big number of partitions, the operation takes more time,
but after 3 minutes clickhouse-backup fails with the error:
"2020/01/24 16:39:46 can't freeze mydb.some_table_local with: driver: bad connection"
This looks like some timeout in the clickhouse-go driver?
Maybe we need an option like a connection timeout, or the ability to choose the freezing method (old way or new)?
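If the 3-minute cutoff really is a client-side timeout, a config-level knob would look something like this. This is only a sketch: the timeout key is an assumption about a later release, not a documented option of the version in this report:

```yaml
clickhouse:
  host: localhost
  port: 9000
  # assumed knob: allow long-running ALTER TABLE ... FREEZE
  timeout: 30m
```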

Error while creating backups of tables of type MergeTree

I've been trying to use this tool to test backups of tables of different types:
./clickhouse-backup --config config.yml create --tables=default.test_mergeTree back4

I get this error
2019/06/01 00:46:43 Create backup 'back4'
2019/06/01 00:46:43 Freeze 'default.test_mergeTree'
2019/06/01 00:46:43 partition '201906'
2019/06/01 00:46:43 can't freeze partition '201906' on 'default.test_mergeTree' with: code: 368, message: std::bad_typeid

I've been following the readme closely and am not sure what might be causing this!

Broken "region" variable

Hello,
I got this error in the latest release:
can't upload with 400:"The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-west-1'
Config:
...
s3:
region: "eu-west-1"
...

Files are reuploaded if there are more than 1000 objects

The table holds 6500 files (objects) for the entire period.
The config option s3.overwrite_strategy is set to etag.

If there are more than 1000 objects in the S3 bucket, running clickhouse-backup upload uploads the same files again, all except the first 1000.

We use Ceph radosgw as S3 (maybe the problem is there).
When the upload starts, the HTTP request log shows two GET requests:

[19/Mar/2019:07:17:38 +0000] "GET /?list-type=2&max-keys=1000&prefix=<prefix> HTTP/2.0" 200
GET /?list-type=2&max-keys=1000&prefix=<prefix> HTTP/1.1
Host: <host>
User-Agent: aws-sdk-go/1.15.58 (go1.12; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=.../default/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=<signature>
X-Amz-Content-Sha256: <content>
X-Amz-Date: 20190318T155745Z
Accept-Encoding: gzip

[19/Mar/2019:07:17:38 +0000] "GET /?list-type=2&max-keys=1000&prefix=d<prefix> HTTP/2.0" 200
GET /?list-type=2&max-keys=1000&prefix=<prefix> HTTP/1.1
Host: <host>
User-Agent: aws-sdk-go/1.15.58 (go1.12; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=.../default/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=<signature>
X-Amz-Content-Sha256: <content>
X-Amz-Date: 20190318T155745Z
Accept-Encoding: gzip

The XML response to the 1st request (only part of the elements is shown):

<MaxKeys>1000</MaxKeys><IsTruncated>true</IsTruncated><Marker></Marker><NextMarker>....</NextMarker>
...

The 2nd response contains the same objects in the body, but excludes NextMarker:

<MaxKeys>1000</MaxKeys><IsTruncated>true</IsTruncated><Marker></Marker>
...

When I change maxkeys to 7000 in the code, files are not reuploaded.

Backup clarification

Hello,
I have a few questions about backups.
Before, I used v0.2.0 with the archive strategy and full backups.
Today I tried v0.3.4.

  1. Is there any command to delete old backups from clickhouse_folder/backup?
  2. Is there any command to delete old backups from S3?
  3. I used the command "clickhouse-backup upload backup_name" (v0.3.4). This command deleted all my old backups from S3. WTF?)
  4. Do I need to keep old backups locally to use incremental backups (--diff-from=<old_backup_name>)?
  5. v0.3.2 DEPRECATIONS: the 'dry-run' flag and 'archive' strategy were marked as deprecated.
    Why was the 'archive' strategy marked as deprecated?

Regards,
Andrii.
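On questions 1-2: retention can also be driven from the config, using the keep-count keys that already appear in configs quoted elsewhere in this thread (the values below are illustrative):

```yaml
s3:
  backups_to_keep_local: 7   # prune local backups beyond the last 7
  backups_to_keep_s3: 31     # prune S3 backups beyond the last 31
```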

"restore-data" seems not doing its work completely

I have a table like:

CREATE TABLE "%table_name%" (
   date date,
   datetime Datetime,
   %some_fields%
) ENGINE = MergeTree()
PARTITION BY (date)
ORDER BY (date, ...)
PRIMARY KEY (date, ...)

The table contains data for several dates, in my case from 2019-08-24 to 2019-09-18.

Then I try to back it up and restore it to check that it works properly:

~$ sudo -u clickhouse ./clickhouse-backup create --tables=%my_database%.* test

2019/09/19 08:16:09 Create backup 'test'
2019/09/19 08:16:09 Freeze `%my_database%`.`%table_name%`
2019/09/19 08:16:09 Copy metadata
2019/09/19 08:16:09   Done.
2019/09/19 08:16:09 Move shadow
2019/09/19 08:16:09   Done.

~$ clickhouse-client -u %user% --password=%pass% -d %my_database% -q "DROP DATABASE %my_database%"

BANG!

~$ sudo -u clickhouse ./clickhouse-backup restore-schema --tables=%my_database%.* test

2019/09/19 08:16:57 Create table `%my_database%`.`%table_name%`

~$ sudo -u clickhouse ./clickhouse-backup restore-data --tables=%my_database%.* test

2019/09/19 08:17:28 Prepare data for restoring `%my_database%`.`%table_name%`
2019/09/19 08:17:28 ALTER TABLE `%my_database%`.`%table_name%` ATTACH PARTITION 201908
2019/09/19 08:17:28 ALTER TABLE `%my_database%`.`%table_name%` ATTACH PARTITION 201909
2019/09/19 08:17:28 ALTER TABLE `%my_database%`.`%table_name%` ATTACH PARTITION ID '20190918'

Finally, after restore-data I have data only for the last date.
I could execute some additional queries to finish restoring, but I think I'm doing something wrong:

ALTER TABLE `%my_database%`.`%table_name%` ATTACH PARTITION ID '20190917'
ALTER TABLE `%my_database%`.`%table_name%` ATTACH PARTITION ID '20190916'
-- And so on...

Can you help me figure out what goes wrong in my example?
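If only the daily ID partitions failed to attach, the remaining statements can be generated rather than typed by hand. A sketch (database/table names and the date range are placeholders; requires GNU date); it only prints the SQL, which could be piped into clickhouse-client:

```shell
# Print one ATTACH PARTITION ID statement per day in the range
start=2019-08-24
end=2019-08-26
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
  id=$(date -d "$d" +%Y%m%d)
  echo "ALTER TABLE my_database.my_table ATTACH PARTITION ID '$id';"
  d=$(date -d "$d + 1 day" +%Y-%m-%d)
done
```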

Support incremental backups

Hi,
Does clickhouse-backup support incremental backups? I mean, when I execute it the first time I expect it to send a full dump, but when I call it a second time a couple of hours later I'd expect the tool to send only the new blocks. I saw mentions of increments in the code, but I wanted to double check that the tool really works as I described.

Thanks,
Paweł

Error backing up table '.inner.test_table'

Hello!
Clickhouse version 19.1.8

When creating a backup, the request failed:

can't freeze partition '0590dc88487514d92e90360839939c5f' on 'default..inner.test_table' with: code: 62, message: Syntax error: failed at position 21: .inner.test_table FREEZE PARTITION ID '0590dc88487514d92e90360839939c5f';. Expected identifier

It may be worth escaping the table name with backticks?

Backup tool is not able to connect to CH server: wrong credentials issue

Hello,

This is a weird issue, but clickhouse-backup can't connect to the ClickHouse server. I tried passing credentials via env vars and via config, but no luck.

I have a ClickHouse server running on port 9001 with the default user and some password, let's say 123. I can connect to CH via clickhouse-client without any problems:

$ clickhouse-client --port 9001 -u default --password 123
ClickHouse client version 19.13.1.11 (official build).
Connecting to localhost:9001 as user default.
Connected to ClickHouse server version 19.13.1 revision 54425.

clickhouse :) exit
Bye.

But connection to CH fails if I do it via clickhouse-backup:

$ export CLICKHOUSE_USERNAME=default
$ export CLICKHOUSE_PASSWORD=123
$ export CLICKHOUSE_PORT=9001
$ ./clickhouse-backup tables
2019/12/22 22:09:51 can't connect to clickouse with: code: 193, message: Wrong password for user default

Clickhouse-backup is the latest build:

./clickhouse-backup -v
Version:	 v0.5.1
Git Commit:	 5dc6234a1052c076de409666858bd4c1c6dbb48a
Build Date:	 2019-12-03

The behavior is the same if I pass the creds and port via the config file.
What could be wrong? How can I debug this? Thanks.
