data-analysis

A framework to facilitate the development of (scientific) data analysis pipelines. It was originally designed to organize the processing and storage of results from different stages of analysis for a moderate-scale (tens of TB) archive of very diverse data. It is intended for ingesting and processing new data that are appended to existing data rather than overwriting them. State is determined from the natural ordering of the data.

The principal idea is to organize the pipeline into analysis units (classes inheriting from DataAnalysis) without side effects. The result of a DataAnalysis is some Data. Data is transformed by analysis into other Data. Any Data is identified by the tree of connected DataAnalysis units that were used to produce it.

Much (but not all) Data is cached: when requested, it is not recomputed but retrieved from a storage backend (Cache). Since every DataAnalysis is a pure function of its input, Data is uniquely characterized by the analysis graph that led to its production.
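
The identification-and-caching idea can be illustrated with a minimal, self-contained sketch. It does not use the framework's API: `Node`, `identity`, `key`, `get` and the in-memory `CACHE` dictionary are hypothetical names chosen for illustration only, and a real storage backend would persist results rather than keep them in a dictionary.

```python
import hashlib

# Hypothetical in-memory stand-in for a storage backend (Cache).
CACHE = {}


class Node:
    """A pure analysis unit: its output depends only on its inputs."""

    inputs = ()       # classes of the analysis units this one depends on
    version = "v1"    # changing it changes the identity of all derived data

    def identity(self):
        # The identity is the tree of analysis units that produce the data.
        return (type(self).__name__, self.version,
                tuple(cls().identity() for cls in self.inputs))

    def key(self):
        # Stable cache key derived from the identity tree.
        return hashlib.sha1(repr(self.identity()).encode()).hexdigest()

    def compute(self, *args):
        raise NotImplementedError

    def get(self):
        key = self.key()
        if key not in CACHE:  # compute only if the result is not stored yet
            args = [cls().get() for cls in self.inputs]
            CACHE[key] = self.compute(*args)
        return CACHE[key]


class RawEvents(Node):
    calls = 0  # counts how often the computation really runs

    def compute(self):
        RawEvents.calls += 1
        return [3, 1, 2]


class SortedEvents(Node):
    inputs = (RawEvents,)

    def compute(self, events):
        return sorted(events)


first = SortedEvents().get()
second = SortedEvents().get()   # served from CACHE, nothing is recomputed
```

In this sketch, changing `version` on `RawEvents` would change the key of `SortedEvents` as well, so stale results are never picked up by accident.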

The strong points of this approach are:

  • avoiding repeating analysis: frequently used results are stored and reused (saving computing time)
  • Data is stored according to its origin, for example in a clear directory structure, optionally with an index (saving disk space)
  • analysis is rerunnable, with a granularity of a single DataAnalysis (built-in fault tolerance)
  • analysis can be easily parallelized (saving implementation time)
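
As an illustration of the parallelization point, units without side effects share no state, so independent inputs can be processed concurrently with no extra coordination. The sketch below uses only the Python standard library, not the framework itself; `count_events` and the `observations` list are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_events(observation):
    """A pure function: the result depends only on the argument."""
    return len(observation["events"])


# Hypothetical, mutually independent pieces of an archive.
observations = [
    {"id": "obs-1", "events": [1, 2, 3]},
    {"id": "obs-2", "events": [4, 5]},
    {"id": "obs-3", "events": []},
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order, whatever the completion order.
    counts = list(pool.map(count_events, observations))
```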

The implementation is designed to be easy to use. Each DataAnalysis is provided with the necessary inputs by means of dependency injection.
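
A rough sketch of what dependency injection can look like in this setting is given below. It is an assumption made for illustration, not the framework's actual mechanism: the `Analysis` base class, the `input_` naming convention, and the `resolve` and `main` methods are all hypothetical.

```python
class Analysis:
    """Hypothetical base class; not the framework's DataAnalysis."""

    def resolve(self):
        # Find class attributes named input_* that refer to Analysis
        # classes, run them, and inject the resulting instances.
        for name in dir(type(self)):
            if not name.startswith("input_"):
                continue
            dep = getattr(type(self), name)
            if isinstance(dep, type) and issubclass(dep, Analysis):
                setattr(self, name, dep().resolve())
        self.main()
        return self

    def main(self):
        raise NotImplementedError


class Spectrum(Analysis):
    def main(self):
        self.counts = [10, 20, 30]


class TotalCounts(Analysis):
    input_spectrum = Spectrum   # declared dependency, injected before main()

    def main(self):
        # By now self.input_spectrum is a resolved Spectrum instance.
        self.total = sum(self.input_spectrum.counts)


result = TotalCounts().resolve()
```

The unit only declares what it needs; it never constructs or locates its inputs itself, which is what keeps each unit easy to test and to rerun in isolation.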

The weak points are:

  • special effort is needed to design the pipeline in the form of pure functions. However, there are no restrictions on the design within a single DataAnalysis. One can argue that this effort is equivalent to that of designing any analysis pipeline in a way that allows easy and controlled reuse of diverse data.
  • the analysis graph can be changed as a result of the analysis. This process may be confusing for those not familiar with higher-order functions and functional programming. The framework attempts to make this process easy to understand intuitively.
  • a very large analysis may eventually be described by a very large graph. Shortcuts and aliases for parts of the graph are provided and can be used to avoid this.
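
The idea of an analysis that changes the graph can be sketched as a unit whose result is itself another unit to run. The example below is a simplified illustration under that assumption and is not how the framework implements it; `run`, `ChooseFit` and the other class names are hypothetical.

```python
def run(node):
    """Evaluate a node; if the result is itself a node class, run it too."""
    result = node().main()
    while isinstance(result, type):   # the analysis returned further analysis
        result = result().main()
    return result


class Background:
    def main(self):
        return 120.0   # hypothetical measured background rate


class SimpleFit:
    def main(self):
        return "simple fit"


class RobustFit:
    def main(self):
        return "robust fit"


class ChooseFit:
    """Decides, from data, which analysis extends the graph."""

    def main(self):
        rate = run(Background)
        return RobustFit if rate > 100 else SimpleFit


outcome = run(ChooseFit)
```

Here the shape of the graph (whether `RobustFit` or `SimpleFit` follows `ChooseFit`) is only known after `Background` has been evaluated.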

The development was driven by the needs of analysing data from the INTEGRAL space observatory: as of 2015, this amounts to 20 TB in 20 million files, with about 1000 different kinds of data (see https://github.com/volodymyrss/dda-ddosa/).
