gittargets

In computationally demanding data analysis pipelines, the targets R package maintains an up-to-date set of results while skipping tasks that do not need to rerun. This process saves time and increases trust in the final results. However, it also overwrites old output with new output, so past results disappear by default. To preserve historical output, the gittargets package captures version-controlled snapshots of the data store, and each snapshot links to the underlying commit of the source code. That way, when the user rolls back the code to a previous branch or commit, gittargets can recover the data contemporaneous with that commit so that all targets remain up to date.

Prerequisites

  1. Familiarity with the R programming language, covered in R for Data Science.
  2. Data science workflow management best practices.
  3. Git, covered in Happy Git and GitHub for the useR.
  4. targets, which has resources on the documentation website.
  5. Familiarity with the targets data store.

Installation

The package is available to install from any of the following sources.

Type        | Source   | Command
Release     | CRAN     | install.packages("gittargets")
Development | GitHub   | remotes::install_github("ropensci/gittargets")
Development | rOpenSci | install.packages("gittargets", repos = "https://ropensci.r-universe.dev")

You will also need command line Git, available at https://git-scm.com/downloads (see footnote 1). Please make sure Git is reachable from your system PATH. To control which Git executable gittargets uses, you may set the TAR_GIT environment variable with usethis::edit_r_environ() or Sys.setenv(). You will also need to configure your user name and user email at the global level using the instructions at https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup (or gert::git_config_global_set()). Run tar_git_ok() to check installation and configuration.

tar_git_ok()
#> ✓ Git binary: /path/to/git
#> ✓ Git config global user name: your_user_name
#> ✓ Git config global user email: your_email@example.com
#> [1] TRUE
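
As a minimal sketch of the setup described above, the following snippet looks up a Git executable on the system path and, if it finds one, points gittargets at it through TAR_GIT for the current R session. Locating Git with Sys.which() is an assumption made for illustration; any valid path to a Git binary should work.

```r
# Look for a Git executable on the system path (base R only).
git_path <- unname(Sys.which("git"))

if (!is.na(git_path) && nzchar(git_path)) {
  # Tell gittargets which Git binary to use for this R session.
  Sys.setenv(TAR_GIT = git_path)
} else {
  message("Git was not found on the system path. Install it or set TAR_GIT manually.")
}

# Inspect the value that gittargets will see (an empty string if unset).
Sys.getenv("TAR_GIT")
```

Sys.setenv() only affects the current session. To keep the setting across sessions, add a TAR_GIT line to the file opened by usethis::edit_r_environ(), restart R, and run tar_git_ok() again.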

There are also backend-specific installation requirements and recommendations in the package vignettes.

Motivation

Consider an example pipeline with source code in _targets.R and output in the data store.

# _targets.R
library(targets)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Wind, data = data)) # Regress on wind speed.
)

Suppose you run the pipeline and confirm that all targets are up to date.

tar_make()
#> • start target data
#> • built target data
#> • start target model
#> • built target model
#> • end pipeline
tar_outdated()
#> character(0)

It is good practice to track the source code in a version control repository so you can revert to previous commits or branches. However, the data store is usually too large to keep in the same repository as the code, which typically lives in a cloud platform like GitHub where space and bandwidth are pricey. So when you check out an old commit or branch, you revert the code, but not the data. In other words, your targets are out of sync and out of date.

gert::git_branch_checkout(branch = "other-model")

# _targets.R
library(targets)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Temp, data = data)) # Regress on temperature.
)

tar_outdated()
#> [1] "model"

Usage

With gittargets, you can keep your targets up to date even as you check out code from different commits or branches. The specific steps depend on the data backend you choose, and each supported backend has a package vignette with a walkthrough. For example, the most important steps of the Git data backend are as follows.

  1. Create the source code and run the pipeline at least once so the data store exists.
  2. tar_git_init(): initialize a Git/Git LFS repository for the data store.
  3. Bring the pipeline up to date (e.g. with tar_make()) and commit any changes to the source code.
  4. tar_git_snapshot(): create a data snapshot for the current code commit.
  5. Develop the pipeline. Create new code commits and code branches early and often, and create data snapshots at key strategic milestones.
  6. tar_git_checkout(): revert the data to the appropriate prior snapshot.
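
The steps above can be sketched as two small helpers. This is an illustration rather than the package's recommended script: the helper names and the commit message are hypothetical, all gittargets functions are called with their default arguments, and the code assumes that the working directory is the root of an existing Git repository for the source code. Wrapping the calls in functions means nothing runs until you call them.

```r
# Hypothetical helper covering steps 1 to 4: run the pipeline, set up the
# data repository, commit the code, and snapshot the data for that commit.
snapshot_current_commit <- function(message = "Update pipeline") {
  targets::tar_make()             # Step 1: run the pipeline so the data store exists.
  gittargets::tar_git_init()      # Step 2: create the data repository (run once per store).
  targets::tar_make()             # Step 3: make sure all targets are up to date...
  gert::git_add("_targets.R")     # ...and commit the source code.
  gert::git_commit(message)
  gittargets::tar_git_snapshot()  # Step 4: snapshot the data for the current code commit.
}

# Hypothetical helper covering step 6: after checking out an older code
# commit or branch, bring back the data snapshot that matches it.
restore_data_for_current_commit <- function() {
  gittargets::tar_git_checkout()
  targets::tar_outdated()         # Lists any targets that are still out of date.
}
```

For example, after gert::git_branch_checkout(branch = "other-model"), calling restore_data_for_current_commit() should restore the data for that branch, provided a snapshot was taken for its current commit.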

Performance

targets generates a large amount of data in _targets/objects/, and data snapshots and checkouts may take a long time. To work around performance limitations, you may wish to only snapshot the data at the most important milestones of your project. Please refer to the package vignettes for specific recommendations on optimizing performance.

Future directions

The first data versioning system in gittargets uses Git, which is designed for source code and may not scale to enormous amounts of compressed data. Future releases of gittargets may explore alternative data backends more powerful than Git LFS.

Alternatives

Newer versions of the targets package (>= 0.9.0) support continuous data versioning through cloud storage, e.g. Amazon Web Services S3 buckets with versioning enabled. In this approach, targets tracks the version ID of each cloud-backed target. That way, when the metadata file reverts to a prior version, the pipeline automatically uses prior versions of targets that were up to date at the time the metadata was written. This approach has two distinct advantages over gittargets:

  1. Cloud storage reduces the burden of local storage for large data pipelines.
  2. Target data is uploaded and tracked continuously, which means the user does not need to proactively take data snapshots.

However, not all users have access to cloud services like AWS, not everyone is able or willing to pay the monetary costs of cloud storage for every single version of every single target, and uploads and downloads to and from the cloud may bottleneck some pipelines. gittargets fills this niche with a data versioning system that is:

  1. Entirely local, and
  2. Entirely opt-in: users pick and choose when to register data snapshots, which consumes less storage than continuous snapshots or continuous cloud uploads to a versioned S3 bucket.

Code of Conduct

Please note that the gittargets project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Citation

citation("gittargets")
#> 
#> To cite gittargets in publications use:
#> 
#>   William Michael Landau (2021). gittargets: Version Control for the
#>   targets Package. https://docs.ropensci.org/gittargets/,
#>   https://github.com/ropensci/gittargets.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {gittargets: Version Control for the Targets Package},
#>     author = {William Michael Landau},
#>     note = {https://docs.ropensci.org/gittargets/, https://github.com/ropensci/gittargets},
#>     year = {2021},
#>   }

Footnotes

  1. gert does not have these requirements, but gittargets does not exclusively rely on gert because libgit2 does not automatically work with Git LFS.

gittargets's Issues

`system2()` call "git" leads to issue on Windows

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • Confirm that your issue is a genuine bug in the gittargets package itself and not a user error, known limitation, or issue from another package that gittargets depends on. For example, if you get errors running tar_make_clustermq(), try isolating the problem in a reproducible example that runs clustermq and not gittargets. And for miscellaneous troubleshooting, please post to discussions instead of issues.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

This is the same problem as yihui/crandalf#24. On Windows, tar_git_ok() complains even if the user name and email are configured globally. This happens because R for Windows sets the HOME environment variable to the Documents folder. Hopefully this is taken care of here.

Reproducible example

gittargets::tar_git_ok()
#> ✔ Git binary: 'C:\PROGRA~1\Git\cmd\git.exe'
#> ✖ No Git config global user email. See details in ?tar_git_ok().
#> ✖ No Git config global user email. See details in ?tar_git_ok().
#> [1] FALSE

Created on 2022-08-20 with reprex v2.0.2
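
A quick way to check the explanation above on a given machine is to compare the folders that R reports as possible home directories. The sketch below assumes that the global Git configuration lives in a file named .gitconfig in one of those folders, and that USERPROFILE is the relevant alternative to HOME on Windows. It only reads environment variables and checks whether the files exist.

```r
# Candidate home directories where a global .gitconfig might live.
home_dirs <- c(
  HOME = Sys.getenv("HOME"),
  USERPROFILE = Sys.getenv("USERPROFILE") # Typically set only on Windows.
)
home_dirs <- home_dirs[nzchar(home_dirs)]

# Check each location for a global Git configuration file.
config_paths <- file.path(home_dirs, ".gitconfig")
config_check <- data.frame(
  variable = names(home_dirs),
  path = config_paths,
  exists = file.exists(config_paths)
)
config_check
```

If the file exists under USERPROFILE but not under HOME, then a Git process started from R may be looking in the wrong folder, which would be consistent with the report.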

What to do about files outside the {targets} data store

Prework

Description

Currently, external input/output data files are not tracked by gittargets even if they are part of the pipeline. A workaround is for the user to move those files somewhere unobtrusive in _targets/ (not in _targets/objects/ or _targets/meta/). Should that be documented, or should it be an unofficial hack?

Branch-aware data snapshots

Prework

Description

gittargets could be capable of recording the code branch corresponding to the data commit. The current format of the data branch name is

code=<HASH>

where it could be

branch=<BRANCH_NAME>&commit=<HASH>

Then tar_git_log() could print out branch information and show data across all branches. That would give a full view of data snapshots across all code branches.
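
To make the proposed format concrete, here is a small sketch in base R that builds and parses such a name. The helper names are hypothetical and are not part of gittargets, and the hash "abc123" is a made-up placeholder.

```r
# Hypothetical helper: build a data branch name in the proposed format.
encode_snapshot_branch <- function(branch, commit) {
  sprintf("branch=%s&commit=%s", branch, commit)
}

# Hypothetical helper: split a data branch name back into its fields.
decode_snapshot_branch <- function(name) {
  fields <- strsplit(name, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", fields)      # Text before the first "=".
  values <- sub("^[^=]*=", "", fields) # Text after the first "=".
  as.list(stats::setNames(values, keys))
}

name <- encode_snapshot_branch("other-model", "abc123")
decoded <- decode_snapshot_branch(name)
decoded$branch
decoded$commit
```

One caveat with this naive parser: a code branch whose name itself contains "&" would be split incorrectly, so a real implementation would need to escape the branch name or parse the commit field from the end of the string.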

GitHub interactions are temporarily limited because the maintainer is out of office.

Vacation mode

When this issue is open, vacation mode is turned on. That means GitHub interactions are temporarily limited, so users cannot open or comment on issues or discussions until I return and re-enable interactions (see return date below). When this issue is closed, vacation mode is turned off and interactions are re-enabled.

Thanks

Vacation mode helps me rest because it prevents tasks from piling up in my absence. Thank you for your patience and understanding.

Day of my return

Already returned.

Vacation mode source code

Function to remove snapshots

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
  • Format your code according to the tidyverse style guide.

Proposal

Suggested by @smwindecker in her rOpenSci review. Plan:

  1. Make all Git snapshots orphan branches. History will no longer be shared among branches, but there is no loss of efficiency for fully up-to-date files. Efficiency from diffs may be lost, but diffs in target storage may be impossible anyway. I will need to check.
  2. Remove commits with no ref: https://stackoverflow.com/questions/3765234/listing-and-deleting-git-commits-that-are-under-no-branch-dangling

Pure AWS S3 backend

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
  • Format your code according to the tidyverse style guide.

Proposal

Similar to #2, but directly implemented on top of AWS S3 through something like aws.s3, paws, or botor. Use the historical versioning and tagging capabilities of buckets.

Leverage existing AWS S3 targets

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
  • Format your code according to the tidyverse style guide.

Proposal

If this idea works, it will supersede #5 and #2.

{targets} can already upload data to AWS S3. As explained in https://books.ropensci.org/targets/cloud.html, a target's data is uploaded, downloaded, and tracked via an S3 bucket while the pipeline is running. If versioning is enabled in the S3 bucket, then we could revert the S3 targets back to their original versions.

I propose enhancements to tar_git_snapshot() and tar_git_checkout() that should not require any changes to core targets.

tar_git_snapshot()

tar_git_snapshot() could call aws.s3::get_versions() to get all the version IDs of all the objects in all the buckets in the current tar_meta() metadata. This could be written to a file in _targets/gittargets/ and snapshotted with the rest of the local data.

We would need to first check that the user is actually using S3 and versioning is supported in the buckets (via aws.s3::get_versioning()).

tar_git_checkout()

tar_git_checkout() could invoke aws.s3::copy_object() to promote the historical version of a target to the current version pulled with tar_read(). We somehow need to supply the version ID to the REST headers in aws.s3::copy_object(). Should theoretically be possible, but I have not tested it.

Proof of concept

We should test all these steps with a simple example in aws.s3 first:

  1. Create a versioned bucket.
  2. Upload a couple versions while keeping track of their version IDs.
  3. Revert to the old version and test that the old version was successfully reverted.
  4. Bring back the new version and test the same way.

git-annex backend

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
  • Format your code according to the tidyverse style guide.

Description

  • https://git-annex.branchable.com
  • Unlike git-lfs, git-annex uses a different command line interface than standard Git, so we may need special new tar_git_annex_*() functions in gittargets.

DVC backend

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
  • Format your code according to the tidyverse style guide.

Proposal
