
[Travis build status badge] [Codecov badge] [Project Status: Active – The project has reached a stable, usable state and is being actively developed.]

Rflow

About

TLDR:

A library for creating and executing DAGs. Everything is stored (persistent), so you can restart a workflow where it failed. Supports R, SQL, Python and Julia.

Long version:

Rflow is an R package providing R users with a general-purpose workflow framework. Rflow allows you to describe your data declaratively as objects with dependencies and does the heavy lifting for you. It is suitable for various purposes: from managing several simple automation scripts to building powerful ETL pipelines.

Rflow makes your data pipelines better organized and more manageable (the workflow can be visualized, and objects may carry documentation and tags).

It saves you time, as your objects are rebuilt only when needed (objects are also persistent across sessions).

Development of the Rflow package is still in its beta phase (some breaking changes may happen).

Usecases

  1. You have a complex suite of scripts that prepare data from various sources for analysis or publication. You need to update your output repeatedly – e.g. whenever some of the inputs change or some of the scripts are modified. You want to organize the tasks and inspect the workflow visually.

  2. You need to use R as an ETL tool: to get data from a database, transform them using R/SQL/Python/Julia and then upload the results back to the database (or elsewhere).

  3. You have complex long-running computations that need to be run only when some of the inputs/parameters change. You need to skip the parts that were already computed last time with the same inputs.

Features

What’s working:

  • Languages supported (to be used in tasks): R, SQL, Python, Julia, RMarkdown
  • Objects supported: R objects, Python objects, Julia objects, DB/SQL tables, files, spreadsheets, R Markdown
  • Bottom-up approach (you choose the final outputs you need, and Rflow resolves dependencies)
  • Persistence
  • Logging
  • Plotting and workflow visualization

What’s on the roadmap:

  • Other types of objects: tests, R Markdown documents, XML files
  • Error handling
  • Recovery
  • GUI

Getting started

Prerequisites

  • R (tested with version >= 3.5.3)
  • devtools package
install.packages("devtools")

Installation

Rflow is hosted on GitHub. The easiest way to install it is with the devtools package:

devtools::install_github("vh-d/Rflow")

How it works

An rflow represents a directed acyclic graph (DAG) connecting nodes through dependency relations. There are three building blocks of rflows (a minimal end-to-end sketch follows the list):

  • nodes (aka targets) represent your data objects such as R values, DB tables, spreadsheets, files, etc.
  • environments serve as containers for nodes. For example, a database is a container for database tables, an R environment is a container for R objects, etc.
  • jobs represent dependency connections between nodes. A job carries the recipe for how to build a target object.
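
For illustration, here is a minimal end-to-end sketch assembled only from the functions used later in this README (new_rflow(), process_obj_defs(), add_nodes(), expression_r(), make()). The node name, its expression and the environment prefix "RENV" are made up for illustration; see the Examples section for full definitions.

library(Rflow)     # assumption: loads expression_r() and the node classes
library(magrittr)  # for the %>% pipe used throughout this README

RF <- Rflow::new_rflow()              # an empty workflow

list(
  "RENV.answer" = list(               # node "answer" in the R environment "RENV"
    desc   = "A trivial R-object target with no dependencies",
    r_expr = expression_r({ 41 + 1 })
  )
) %>%
  Rflow::process_obj_defs() %>%       # turn the plain list into node definitions
  Rflow::add_nodes(rflow = RF)        # register the nodes (and their jobs)

make(RF)                              # build whatever targets are out of date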

Currently, we have these types of nodes implemented:

  • node: a generic node class (really just a parent class other classes inherit from)
  • r_node: node representing R objects
  • db_node: node representing database tables and views
  • file_node: for representing files on disk
  • csv_node: descendant of file_node for representing csv files
  • excel_sheet: for Excel sheets (read-only)
  • julia_node: node representing a Julia object
  • python_node: node representing a Python object
  • rmd_node: node representing R Markdown targets

Examples

First, initialize an empty rflow:

MYFLOW <- Rflow::new_rflow()

We can define the target nodes using TOML files or directly in R as a list:

objs <- 
  list(
    
    "DB.mytable" = list(
      type = "db_node",
      desc = "A db table with data that serves as source for further computation in R"
      sql = "
        CREATE TABLE mytable AS 
        SELECT * FROM customers WHERE year > 2010
      "
    ),
    
    "RENV.mytable_summary" = list(
      type = "r_node", # you can skip this when defining R nodes
      desc = "Summary statistics of DB.mytable",
      depends = "DB.mytable", # dependencies have to be declared (this can be tested/automated)
      r_expr = expression_r({
        .RFLOW[["DB.mytable"]] %>% # Use .RFLOW to refer to the upstream nodes.
          RETL::etl_read() %>% 
          summary()
      })
    ),
    
    "RENV.main" = list(
      desc = "Main output",
      depends = "DB.mytable",
      r_expr = expression_r({
        .RFLOW[["DB.mytable"]] %>% some_fancy_computation()
      })
    ),
    
    "DB.output" = list(
      desc = "Outcome is loaded back to the DB",
      type = "db_node",
      depends = "main_product",
      r_expr = expression_r({
        .RFLOW[["R_OUT"]] %>%
          RETL::etl_write(to = self)
      })
    )
  ) 

Now we can add these definitions into an existing workflow:

objs %>% 
  Rflow::process_obj_defs() %>% 
  Rflow::add_nodes(rflow = MYFLOW)

and visualize it:

Rflow::visRflow(MYFLOW)

or build the targets:

make(MYFLOW)

For more examples see:

Details

Queries on nodes

Nodes are (R6) objects with properties and methods. You can make queries to find/filter nodes based on their properties such as tags, time of the last build, etc.

nodes(RF) %>% # list all nodes in the rflow
  FilterWith("slow" %in% tags) %>% # expressions ("slow" %in% tags) is evaluated within each node, results in list of nodes with positive results
  names()

Non-deterministic jobs

In case building a node involves non-deterministic functions (e.g. when it depends on random numbers, system time, etc.), we can use the trigger_condition = TRUE property to override all the other triggers and always build the node.
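
For illustration, here is a sketch of such a node definition in the list style used in the Examples section above. The node name and its expression are made up; only the trigger_condition property comes from the description above.

"RENV.random_sample" = list(
  desc              = "A fresh random sample on every build",
  r_expr            = expression_r({ rnorm(100) }),
  trigger_condition = TRUE  # overrides the other triggers: the node is always rebuilt
)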

Why Rflow? (comparison to other similar tools)

Rflow overlaps with several other tools in this domain.

GNU Make is a general-purpose build framework from the UNIX ecosystem. People use GNU Make in their Python/R data science projects. Compared to Rflow, GNU Make is strictly file-based: it requires that every job produces a file. If your targets are files (you can save R values in .RData files too), GNU Make may be a good choice.

gnumaker is an R package that builds upon GNU Make and helps you generate your Makefiles using R.

drake is an R package quite similar to Rflow. Compared to Rflow, drake has more features: it tracks all your code dependencies automatically (including functions), it is able to run your jobs in parallel, etc. Currently, Rflow does not track changes in the functions in your code. On the downside, drake is limited to the R language. It allows you to define input and output files, but all the logic has to be implemented in R. Rflow allows you to manage database tables via R or SQL recipes. Support for the Bash, Python and Julia languages, as well as knitting R Markdown files, is planned too.

orderly framework seems to have very similar goals (to tackle the problem of complexity and reproducibility with various R scripts, inputs and outputs).

ProjectTemplate provides a standardized skeleton for your project and a convenient API for preprocessing data from various formats and sources.

Luigi is a popular workflow management framework for Python.

Apache Airflow is a very general and sophisticated platform written in Python.


rflow's Issues

Wrap references to environments in a dedicated S3 classes

Currently, when disconnecting a DBI connector, all nodes representing targets from the associated DB have to be re-initialized. It would be convenient to have dedicated environment classes wrapping DBI (and other) connectors. Reconnecting to a DB would then happen through these objects.
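
A rough sketch of the idea (hypothetical code, not part of Rflow's current API): an S3 wrapper that remembers the connection arguments, so nodes can hold a stable reference while the underlying DBI connection is re-established on demand.

# Hypothetical sketch only -- names and structure are illustrative.
db_environment <- function(drv, ...) {
  env <- new.env()
  env$args <- list(drv = drv, ...)                 # remember how to connect
  env$con  <- do.call(DBI::dbConnect, env$args)    # open the initial connection
  structure(env, class = "db_environment")
}

reconnect <- function(x, ...) UseMethod("reconnect")
reconnect.db_environment <- function(x, ...) {
  if (!DBI::dbIsValid(x$con)) {
    x$con <- do.call(DBI::dbConnect, x$args)       # nodes keep pointing at `x`
  }
  invisible(x)
}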

Support for knitr and RMarkdown

Support for R Markdown documents may be implemented in several ways.

One way is to create an rmarkdown class (probably a subclass of file_node). This would take an R Markdown document and inject code that loads its dependencies at the beginning of the script.

Another way is to use the current file_node and implement an API to access the values of dependencies from R Markdown documents easily (without loading the whole Rflow, etc.).

Lazy cache loading

Cache for large R objects may be loaded when needed instead of when the object is initialized. This would speed up the initial setup at the cost of slower first builds. We would save the time of loading cache for nodes that are not touched during the session.

Prevent redundant hashing

If we store cache objects together with their hashes, we would not need to recompute the hash when restoring objects from the cache.
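
A minimal sketch of the idea, assuming RDS files for the cache and the digest package for hashing (the function names are hypothetical, not Rflow's actual cache API):

# Hypothetical sketch: persist the value together with its hash, so restoring
# from the cache does not require hashing the (possibly large) object again.
cache_store <- function(value, path) {
  saveRDS(list(value = value, hash = digest::digest(value)), path)
}

cache_restore <- function(path) {
  readRDS(path)  # list with $value and the precomputed $hash
}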

Remove .last_changed

Perhaps the .last_changed private property may be removed. It is only used in R nodes, where it is just an alias for the timestamp of the last hash anyway.
