Cassback

Cassback is a project that aims to back up Cassandra SSTables and load them into HDFS for further usage. It is intended to be used as a command line tool; for now it can only be triggered via commands.

Installation

Build the application into a gem using the command

$ gem build cassback.gemspec

You should see the following output:

  Successfully built RubyGem
  Name: cassback
  Version: 0.1.0
  File: cassback-0.1.0.gem

Install the application into your local gem store using the following command:

$ gem install cassback-0.1.0.gem

You should then see the following output:

Successfully installed cassback-0.1.0
Parsing documentation for cassback-0.1.0
Done installing documentation for cassback after 0 seconds
1 gem installed

Usage

When the cassback gem is installed, it adds the cassback executable to your PATH. This means you can execute it using one of the following commands, which will print usage examples:

cassback
cassback -h

A simple command you can use to start a backup is:

cassback -S -C path_to_some_config_file.yml

Structure of the backup

A backup that has been pushed from multiple Cassandra nodes to one HDFS location has the following structure:

a) Backup files

<hadoop_root_folder>/<cluster_name>/<node_name>/<keyspace>/<table>/<snapshot_path>/<backup_file>

b) Backup complete flags (stored at cluster level in the metadata folder):

<hadoop_root_folder>/cass_snap_metadata/<cluster_name>/BACKUP_COMPLETE_<date>

c) Metadata files (stored at node level in the metadata folder):

<hadoop_root_folder>/cass_snap_metadata/<cluster_name>/<node_name>/cass_snap_<date>
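
For illustration, assume a hadoop root folder of /backup, a cluster named cluster1, a node named node1, a keyspace ks1 with a table t1, and a yyyy_MM_dd date format (all hypothetical names). The resulting paths could then look like:

  /backup/cluster1/node1/ks1/t1/<snapshot_path>/ks1-t1-ka-1-Data.db
  /backup/cass_snap_metadata/cluster1/BACKUP_COMPLETE_2016_05_12
  /backup/cass_snap_metadata/cluster1/node1/cass_snap_2016_05_12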

Incremental or full backups?

Backups are done incrementally but published as full backups: the tool determines locally which files would have to be uploaded and checks whether those files are already present in HDFS (because Cassandra SSTables are immutable, there is no risk of two files having the same name but different content). However, the metadata file that is published points to all the files that compose the backup, so it is effectively published as a full backup.
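
A minimal sketch of that check, written against the webhdfs gem that the project relies on (the client, paths and file names below are hypothetical, not the tool's actual code):

  require 'webhdfs'

  client = WebHDFS::Client.new('namenode.example.com', 50070)

  # Hypothetical SSTable files selected for this snapshot.
  local_files = ['ks1-t1-ka-1-Data.db', 'ks1-t1-ka-1-Index.db']

  local_files.each do |file|
    remote = "/backup/cluster1/node1/ks1/t1/snapshot1/#{file}"
    begin
      # SSTables are immutable: a file already present in HDFS under the
      # same name has the same content, so the upload can be skipped.
      client.stat(remote)
    rescue WebHDFS::FileNotFoundError
      client.create(remote, File.read(file))
    end
  end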

Configuration

The application has some default configuration defined. You can overwrite the defaults in two ways:

  1. Using a configuration file passed as parameter on the command line.

  2. Using individual configuration properties passed as parameters on the command line. The command line parameters take precedence over the configuration file.

An example configuration file is provided under conf/local.yml.
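
For reference, a minimal sketch of such a file, assuming the dotted property names used in this README map to nested YAML keys (the hadoop host and port, and the retention value, are hypothetical; check conf/local.yml for the authoritative layout):

  hadoop:
    host: namenode.example.com   # hypothetical
    port: 50070                  # hypothetical
    retryTimes: 5
    retryInterval: 1
    readTimeout: 300

  cleanup:
    retentionDays: 30            # hypothetical value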

Orchestration

The tool is designed to take snapshots at node level (not at cluster level): it has to be installed on each node, and a separate process has to be executed there to trigger a node-level snapshot. Because this task is quite complex, it is recommended to use an orchestration tool (like Rundeck) that allows you to execute the same command on multiple machines and run the processes in parallel.

After all node backups are finished, the orchestration tool has to take care of signaling to other applications that the backup is completely finished. This is done by adding a new empty file in the cluster metadata folder with the format BACKUP_COMPLETED_yyyy_MM_dd. It has to be triggered only once, using the following command:

cassback -B [-d date] -C conf/path_to_some_config_file.yml

Optionally you can also pass a date; if not present, the current date is assumed.
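
For example, assuming the -d argument uses the same yyyy_MM_dd format as the flag file (an assumption; run cassback -h to confirm):

cassback -B -d 2016_05_12 -C conf/path_to_some_config_file.yml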

Data Integrity

The project internally uses the webhdfs gem (see https://github.com/kzk/webhdfs), a Ruby library built on top of the WebHDFS API (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html). Because we're using the WebHDFS API, we get data integrity for free. The tool can also retry file uploads/downloads in case of errors. This is configurable via the following config file properties:

  1. hadoop.retryTimes - the number of retries the tool should attempt before giving up. Defaults to 5.
  2. hadoop.retryInterval - the interval (in seconds) the tool should wait between two attempts. Defaults to 1 second.

To learn more about Hadoop's checksum algorithm, which ensures data integrity, see the following link: https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html

There is also the hadoop.readTimeout property, which defaults to 300 seconds but can be changed if necessary (for example, if the HDFS cluster sometimes responds too slowly).
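
These properties correspond to the retry and timeout knobs exposed by the webhdfs client; roughly (a sketch with a hypothetical host and port, not the tool's actual wiring):

  require 'webhdfs'

  client = WebHDFS::Client.new('namenode.example.com', 50070)

  # Retry transient failures, per hadoop.retryTimes / hadoop.retryInterval.
  client.retry_known_errors = true
  client.retry_times = 5      # hadoop.retryTimes
  client.retry_interval = 1   # hadoop.retryInterval, in seconds

  # Give up if HDFS does not answer within hadoop.readTimeout seconds.
  client.read_timeout = 300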

Cleanup policy

Backups of databases usually take a lot of space. Even though the code is optimized so that backups are done incrementally (meaning a file is not stored twice even if it is present in multiple backups), cleanup still needs to be done. The tool's cleanup policy removes snapshots once a configured number of days have passed since the snapshot was published. This is configurable via the cleanup.retentionDays property in the configuration file. Note that cleanup is done at cluster level (for all nodes), since it doesn't make sense to keep data for only some of the nodes.
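
The retention check itself boils down to a date comparison; a minimal sketch of the idea (the retention value and snapshot dates are hypothetical):

  require 'date'

  retention_days = 30                  # cleanup.retentionDays
  cutoff = Date.today - retention_days

  # Hypothetical snapshot dates, e.g. parsed from cass_snap_<date> names.
  snapshots = [Date.new(2016, 5, 1), Date.new(2016, 6, 20)]

  expired = snapshots.select { |d| d < cutoff }
  expired.each { |d| puts "snapshot from #{d} is past retention" }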

The command for triggering a cleanup is:

cassback -A -C conf/path_to_some_config_file.yml

Unit tests

Unit tests can be executed locally by running the following command:

rake test

Guard file

Sometimes you may want to prevent backups from running on a specific node during maintenance operations. This is possible by creating a guard file.

touch /tmp/cassback.guardfile

While this file exists, no operation will be possible.

Note that you can change the guard file location in the configuration file.
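
Internally this amounts to a simple existence check; a sketch, assuming the default location shown above:

  # Refuse to run while the guard file exists.
  abort('Guard file present, skipping run') if File.exist?('/tmp/cassback.guardfile')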

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/criteo/cassback.
