Scientific Computing Data Set Download

scddl (pronounced scuttle) downloads data sets for scientific computing.

Goals and Features

Consistency

  • integrity checks

    When a data set provides file integrity information, e.g. MD5 checksums, the downloaded files are checked against it.

  • strict versioning

    Data sets that are not inherently versioned are tagged with the download date, which makes reproducible research possible. To enforce this strict versioning, there is deliberately no link to the latest version.

    As a result, existing files are never overwritten. If files were updated in place, running jobs could produce inconsistent results.
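The two consistency rules above can be sketched in a few lines of shell. This is only an illustration, not code taken from the scddl tools: the helper names `date_tag`, `make_version_dir` and `verify_md5` are hypothetical, and the sketch assumes the data set ships an `md5sum`-style checksum file next to its data files.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of scddl itself.

# Print a date tag (e.g. 2024-05-17) for data sets without their own version.
date_tag() {
    date +%Y-%m-%d
}

# Create the directory for one data set version below a base directory.
# The inner mkdir is called without -p on purpose: it fails if the version
# directory already exists, so an existing version is never overwritten.
make_version_dir() {
    local base=$1 version=$2
    mkdir -p "$base" || return 1
    mkdir "$base/$version"
}

# Verify files against an md5sum-style checksum file.
# Paths inside the checksum file are taken relative to its own directory.
# Returns non-zero if any file is missing or does not match.
verify_md5() {
    local checksum_file=$1
    (
        cd "$(dirname "$checksum_file")" || exit 1
        md5sum -c "$(basename "$checksum_file")" > /dev/null
    )
}
```

A download tool built this way would call `make_version_dir` first and abort if it fails, then download into the new directory and finish with `verify_md5`.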

Usability

  • centralized storage location

    Especially on scientific computing platforms, the data sets are intended to be downloaded to globally accessible storage locations. This spares users and groups from maintaining their own copies and from straining their file system quotas. New users can also start working immediately instead of having to download their data sets first.

  • improved file system performance

    Another advantage of centralized storage is that the file system can cache the data sets more effectively. This can improve I/O performance, especially when a single data set is used concurrently by many users. Your mileage may vary, depending on the caching capabilities of the file system and on how the data sets are used.

  • periodic, automatic updates

    The download tools can be run as cron jobs or systemd timers. This way, you can easily create periodic, automated updates of data sets.

  • logging to syslog

    When specified, the download tools send their output to syslog with their script name as the tag, e.g. the tool ncbidl.sh uses ncbidl as its tag. You can then search for these tags, e.g.:

    journalctl -t ncbidl
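The tagging scheme described above can be approximated with the standard `logger` utility. This is a minimal sketch and not the actual scddl implementation: the function names `syslog_tag` and `log_to_syslog` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of scddl itself.

# Derive the syslog tag from a script name by stripping the directory
# and the .sh suffix, e.g. /path/to/ncbidl.sh becomes ncbidl.
syslog_tag() {
    basename "$1" .sh
}

# Run a command and forward its stdout and stderr to syslog under a tag.
# logger reads standard input and writes one log entry per line.
# Note: without "set -o pipefail" the return status is that of logger,
# not of the command itself.
log_to_syslog() {
    local tag=$1
    shift
    "$@" 2>&1 | logger -t "$tag"
}
```

Inside a script, this could be used as `log_to_syslog "$(syslog_tag "$0")" some_command`, after which `journalctl -t` with the same tag shows the output.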

Supported Data Sets

Source Data Sets

Source data sets are downloaded directly from the internet.

  • EBI: ebidl.sh
  • Ensembl: ensembldl.sh
  • NCBI: ncbidl.sh
  • UCSC: ucscdl.sh

Derived Data Sets

Derived data sets are built from source data sets. They automatically download their sources if these are not yet available.

  • diamond: diamonddb.sh
    • builds diamond database from NCBI sources using the makedb sub-command
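The "download the sources only if they are missing" behaviour can be sketched as a small guard function. This is an assumption about the general pattern, not how diamonddb.sh is actually written: the name `ensure_source` is hypothetical, and the sketch treats a source as available as soon as its directory exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of scddl itself.

# ensure_source DIR COMMAND [ARGS...]
# Run COMMAND only if DIR does not exist yet. If DIR is already present,
# the source counts as available and nothing is downloaded.
ensure_source() {
    local dir=$1
    shift
    if [ -d "$dir" ]; then
        return 0
    fi
    "$@"
}
```

A derived tool could wrap its call to a source downloader in `ensure_source`, passing the directory where that downloader stores its result. That directory layout is not described here, so it is left open.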

Usage

Each tool provides online help via the --help command line argument, e.g.:

bash ncbidl.sh --help

The download tools can also be used as cron jobs, e.g.:

@monthly time bash /path/to/ncbidl.sh /data/db blast/db/nr
@monthly time bash /path/to/ncbidl.sh /data/db blast/db/nt

Contributors

  • wookietreiber
  • bernt-matthias