ds-pipelines-targets-3's Issues

Recognize the unique demands of data-rich pipelines

In this course you'll learn about tools for data-intensive pipelines: how to download many datasets, run many models, or make many plots. Whereas a for loop would work for many quick tasks, here our focus is on tools for managing sets of larger tasks that each take a long time and/or are subject to occasional failure.

A recurring theme in this activity will be the split-apply-combine paradigm, which has many implementations in many software languages because it is so darn useful for data analyses. It works like this:

  1. Split a large dataset or list of tasks into logical chunks, e.g., one data chunk per lake in an analysis of many lakes.
  2. Apply an analysis to each chunk, e.g., fit a model to the data for each lake.
  3. Combine the results into a single orderly bundle, e.g., a table of fitted model coefficients for all the lakes.
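
For example, here is a minimal dplyr sketch of those three steps, assuming a hypothetical lake_data table with columns lake_id and temperature:

library(dplyr)

lake_summary <- lake_data %>%
  group_by(lake_id) %>%              # split: one group per lake
  summarize(
    mean_temp = mean(temperature),   # apply: summarize each lake's data
    n_obs = n()
  )                                  # combine: one orderly table, one row per lake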

There can be variations on this basic paradigm, especially for larger projects:

  1. The choice of how to Split an analysis can vary - for example, it might be fastest to download data for chunks of 100 sites rather than downloading all 10,000 sites at once or downloading each site independently (see the short sketch after this list).
  2. Sometimes we have several Apply steps - for example, for each site you might want to munge the data, fit a model, extract the model parameters, and make a diagnostic plot specific to that site.
  3. The Combine step isn't always necessary to the analysis - for example, we may prefer to publish a collection of plot .png files, one per site, rather than combining all the site plots into a single unwieldy report file. That said, we may still find it useful to also create a table summarizing which plots were created successfully and which were not.
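
As a tiny sketch of that first variation, splitting a hypothetical site_ids vector into download chunks of 100 might look like:

# site_ids is a hypothetical character vector of ~10,000 site identifiers
download_chunks <- split(site_ids, ceiling(seq_along(site_ids) / 100))
length(download_chunks)  # roughly 100 chunks, each with up to 100 sites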

Assign yourself to this issue to explore the split-apply-combine paradigm further.


I, the Learning Lab Bot, will sit patiently until you've assigned yourself to this issue.

Please remember to be patient with me, too - sometimes I need a few seconds and/or a refresh before you'll see my response.

Scale up

Your pipeline is looking great, @jdiaz4302! It's time to put it through its paces and experience the benefits of a well-plumbed pipeline. The larger your pipeline grows, the more useful the tools you've learned in this course become.

In this issue you will:

  • Expand the pipeline to include all of the U.S. states and some territories
  • Learn one more method for making pipelines more robust to internet failures
  • Practice the other branching method, dynamic branching
  • Modify the pipeline to describe temperature sites instead of discharge sites

⌨️ Activity: Check for targets updates

Before you get started, make sure you have the most up-to-date version of targets:

packageVersion('targets')
## [1] ‘0.5.0.9002’

You should have package version >= 0.5.0.9002. If you don't, reinstall with:

remotes::install_github('ropensci/targets')

Appliers

Your pipeline is looking pretty good! Now it's time to add complexity. I've just added these two files to the repository:

  • 2_process/src/tally_site_obs.R
  • 3_visualize/src/plot_site_data.R

In this issue you'll add these functions to the branching code in the form of two new steps.

Background

The goal of this issue is to expose you to multi-step branching. It's not hugely different from the single-step branching we've already implemented, but there are a few details that you haven't seen yet. The changes you'll make for this issue will also set us up to touch on some miscellaneous pipeline topics. Briefly, we'll cover:

  • The syntax for adding multiple steps
  • How to declare dependencies among steps within branching
  • A quick look at / review of split-apply-combine with the lightweight dplyr syntax

Reminder about dynamic vs static branching

Remember that both dynamic and static branching can have multiple steps or appliers, but they are defined differently. We will focus on static branching for now, but remember that you can always reference the information for dynamic and static branching in their respective locations in the targets user guide (dynamic branching documentation; static branching documentation).
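
For contrast only (we'll stick with static branching below), a minimal dynamic-branching sketch could look like the following, where branches are created at runtime by pattern = map(); the target names are illustrative and oldest_active_sites is assumed to be defined elsewhere in the pipeline:

list(
  # A regular target holding the vector of state abbreviations
  tar_target(state_abb, c("WI", "MN", "MI")),
  # One branch per element of state_abb, created while the pipeline runs
  tar_target(
    nwis_inventory,
    dplyr::filter(oldest_active_sites, state_cd == state_abb),
    pattern = map(state_abb)
  )
)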

Adding another step in static branching

As you already know, static branching is set up using the tar_map() function, where task targets are defined by the values argument as either a list or data.frame, and steps are defined by calls to tar_target() passed as additional arguments. Up until now, your static branching code in _targets.R looks something like this:

tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_inventory, filter(oldest_active_sites, state_cd == state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter))
  # Insert step for tallying data here
  # Insert step for plotting data here
)

We actually already have more than one step in our branching setup - nwis_inventory and nwis_data. This shows that you can include additional calls to tar_target() to add more appliers to your branches. If you want to use a previous step's output, just use the target name from that step and targets will appropriately pass only the output relevant to each task target between the steps within tar_map(). We are going to add a few more steps to our static branching and there is already a hint for where we will add these ... ahem # Insert step for tallying data here and # Insert step for plotting data here ahem.
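
For example, a sketch of the tally step added as a third tar_target() within the existing tar_map(), assuming the new function is named after its source file (tally_site_obs()) and takes the downloaded data as its input:

tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_inventory, filter(oldest_active_sites, state_cd == state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter)),
  # The new applier uses the previous step's output (nwis_data) directly
  tar_target(tally, tally_site_obs(nwis_data))
  # Insert step for plotting data here
)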

Steps that require additional info per task

So far, we have used functions in static branching that only needed our state abbreviation, e.g. "WI" or "IL". What happens when we want to use other information per task? For example, suppose we need to save a file per task and want that filename passed into our step function. Easy! We can just edit the information we pass in as values. Currently, we are using a single-column tibble, but values can easily have multiple columns, and those columns can be used as arguments within the tar_target() commands inside tar_map(). We will try this out next!
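
For instance, here is a hedged sketch of a two-column values tibble that also carries a per-state output file name (the column name, file path, and plotting function signature are illustrative assumptions):

tar_map(
  values = tibble(
    state_abb = states,
    state_plot_file = paste0("3_visualize/out/timeseries_", states, ".png")
  ),
  tar_target(nwis_inventory, filter(oldest_active_sites, state_cd == state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter)),
  # Each column of `values` is available by name inside tar_target() commands
  tar_target(
    timeseries_png,
    plot_site_data(state_plot_file, nwis_data, parameter),
    format = "file"
  )
)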

Meet the example problem

It's time to meet the data analysis challenge for this course! Over the next series of issues, you'll connect with the USGS National Water Information System (NWIS) web service to learn about some of the longest-running monitoring stations in USGS streamgaging history.

The repository for this course is already set up with a basic targets data pipeline that:

  • Queries NWIS to find the oldest discharge gage in each of three Upper Midwest states
  • Maps the state-winner gages

Branching

In the last issue you noted some inefficiencies with writing out many nearly-identical targets within _targets.R:

  1. It's a pain (more typing and potentially a very long _targets.R file) to add new sites.
  2. Potential for errors (more typing, more copy/paste = more room for making mistakes).

In this issue we'll fix those inefficiencies by adopting the branching approach supported by targets and its companion package tarchetypes.

Definitions

Branching in targets refers to the approach of scaling up a pipeline to accommodate many tasks. It is the targets implementation of the split-apply-combine operation. In essence, we split a dataset into some number of tasks, then to each task we apply one or more analysis steps. Branches are the resulting targets for each unique task-and-step combination.

In the example analysis for this course, each task is a state and the first step is a call to get_site_data() for that state's oldest monitoring site. Later we'll create additional steps for tallying and plotting observations for each state's site. See the image below for a conceptual model of branching for this course analysis.

[Image: Branches - conceptual model of split-apply-combine branching for this course's analysis]

We implement branching in two ways: as static branching, where the task targets are predefined before the pipeline runs, and dynamic branching, where task targets are defined while the pipeline runs.

In this issue you'll adjust the existing pipeline to use branching for this analysis of USGS's oldest gages.

Splitters

In the last issue you noted a lingering inefficiency: When you added Illinois to the states vector, your branching pipeline built nwis_data_WI, nwis_data_MN, and nwis_data_MI again even though there was no need to download those files again. This happened because those three targets each depend on oldest_active_sites, the inventory target, and that target changed to include information about a gage in Illinois. As I noted in that issue, it would be ideal if each task branch only depended on exactly the values that determine whether the data need to be downloaded again. But we need a new tool to get there: a splitter.

The splitter we're going to create in this issue will split oldest_active_sites into a separate table for each state. In this case each table will be just one row long, but there are plenty of situations where the inputs to a set of tasks will be larger, even after splitting into task-size chunks. Some splitters will be quick to run and others will take a while, but either way, we'll be saving ourselves time in the overall pipeline!

Background

The object way to split

So far in our pipeline, we already have an object that contains the inventory information for all of the states, oldest_active_sites. Now, we can write a splitter to take the full inventory and one state name and return a one-row table.

get_state_inventory <- function(sites_info, state) {
  # Return only the inventory rows for the requested state
  dplyr::filter(sites_info, state_cd == state)
}

We could then insert an initial branching step that pulls out each state's information before passing it to the next step, such that our tar_map() call would look like:

tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_inventory, get_state_inventory(sites_info = oldest_active_sites, state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter))
)

The file way to split

The "object way to split" described above works well in many cases, but note that get_state_inventory() is called for each of our task targets (so each state). Suppose that oldest_active_sites was a file that took a long time to read in - we've encountered cases like this for large spatial data files, for example - you'd have to re-open the file for each and every call to get_state_inventory(), which would be excruciatingly slow for a many-state pipeline. If you find yourself in that situation, you can approach "splitting" with files rather than objects.

Instead of calling get_state_inventory() once for each state, we could write a single splitter function that accepts oldest_active_sites and writes a single-row table file for each state. It would be faster to run because the data being split would only be loaded once rather than redundantly reloaded. This type of splitter would sit outside your branching code and would instead return a single summary table describing the state-specific files that were just created.
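
Here is a hedged sketch of what such a file-based splitter could look like (the function name, output directory, and summary columns are illustrative, not the course's actual code):

split_inventory <- function(sites_info, states, out_dir = "1_fetch/tmp") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  # Write one single-row inventory file per state, touching sites_info only once
  summaries <- lapply(states, function(state) {
    out_file <- file.path(out_dir, sprintf("inventory_%s.csv", state))
    readr::write_csv(dplyr::filter(sites_info, state_cd == state), out_file)
    tibble::tibble(state_cd = state, inventory_file = out_file)
  })
  # Return a single summary table describing the files that were just created
  dplyr::bind_rows(summaries)
}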

For this next exercise, the object method for splitting described above will suit our needs just fine. There is no need to create a single splitter function that saves state-specific files for now. We are mentioning it here so that you are aware of the limitations of the object-based approach and know that other options exist.

Your mission

In this issue you'll create a splitter to make your task table more efficient in the face of a changing inventory in oldest_active_sites. Your splitter function will generate separate one-row inventory data for each state.

Ready?

What's next

Congratulations, @jdiaz4302, you've finished the course! ✨

Now that you're on or working with the USGS Data Science team, you'll likely be using data pipelines like the one you explored here in your visualization and analysis projects. More questions will surely come up - when they do, you have some options:

  1. Visit the targets documentation pages and The targets R Package User Manual
  2. If you're a member of the IIDD Data Science Branch, ask questions and share ideas in our Function - Data Pipelines Teams channel. If you're not a member, you are likely taking this course because you are working with someone who is. Reach out to them with questions/ideas.

You rock, @jdiaz4302! Time for a well-deserved break.

Combiners

So far we've implemented split and apply operations; now it's time to explore combine operations in targets pipelines.

In this issue you'll add two combiners to serve different purposes - the first will combine all of the annual observation tallies into one giant table, and the second will summarize the set of state-specific timeseries plots generated by the task table.

Background

Approach

Given your current level of knowledge, if you were asked to add a target combining the tally outputs, you would likely add a call to tar_target() and use the branches as input to a command that aggregates the data. While this would certainly work, the set of inputs to such a combiner would need to change whenever the number of tasks changes. If we hand-coded a combiner target with tar_target() that accepted a fixed set of inputs (e.g., tar_target(combined_tallies, combine_tallies(tally_WI, tally_MI, [etc]))), we'd need to manually edit the inputs to that function any time we changed the states vector. That would be a pain and would make our pipeline susceptible to human error if we forgot or made a mistake in that editing.

Implementation

The targets way to use combiners for static branching is to work with the tar_combine() function (recall that combiners are applied automatically to the output in dynamic branching). tar_combine() is set up in a similar way to tar_target(): you supply the target name and the function that builds the target as the command. The difference is that the input to the command will be multiple targets passed in to the ... argument. The output from a tar_combine() can be an R object or a file, but file targets need format = "file" passed in to tar_combine(), and the function used as command must return the filepath.

Some additional implementation considerations:

  • In order to use tar_combine() with the output from tar_map(), you will need to save the output of tar_map() as an object. Thus, the branching declaration should look something like mapped_output <- tar_map() so that mapped_output can be used in your tar_combine() call.

  • You can write your own combiner function or you can use built-in combiner functions for common types of combining (such as bind_rows(), c(), etc). If you write your own combiner function, it needs to be in a script sourced in the makefile using source(). The default combiner is ?vctrs::vec_c, which is a fancy version of c() that ensures the resulting vector keeps the common data type (e.g. factors remain factors).

  • When you pass the output of tar_map() to tar_combine(), all branch output from tar_map() will be used by default. If you had multiple steps in your tar_map() (i.e. multiple calls to tar_target()) and you only want to combine results from one of those, you can add unlist = FALSE to your tar_map() call so that the tar_map() output remains in a nested list. This makes it possible to reference just the output from one tar_target() and use it in tar_combine(). For example, if you had three steps in your tar_map() call and you wanted to combine only those branches from the third step, which had a target name of sum_results, you could use mapped_output[[3]] or mapped_output$sum_results as the input to tar_combine().

  • Within your tar_combine() function, pass the ... on to your command function by specifying !!!.x in its place. It feels strange, but it has to do with how the function handles non-standard evaluation. You can see an example of this syntax in the default for command in the help file for ?tarchetypes::tar_combine().

  • When specifying the command argument to tar_combine(), you need to include the argument name, e.g. command = my_function(). Since tar_combine() has ... as its second argument, anything else you pass in without an argument name will be treated as part of ..., which can result in some weird errors.
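
Putting these pieces together, a hedged sketch of a combiner for tally branches might look like the following (the target and function names are illustrative, and states, parameter, and the mapped functions are assumed to be defined elsewhere in _targets.R):

mapped_output <- tar_map(
  values = tibble(state_abb = states),
  unlist = FALSE,  # keep the output nested by step so one step can be combined
  tar_target(nwis_inventory, get_state_inventory(oldest_active_sites, state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter)),
  tar_target(tally, tally_site_obs(nwis_data))
)

combined_tallies <- tar_combine(
  obs_tallies,
  mapped_output$tally,               # only the tally step's branches
  command = dplyr::bind_rows(!!!.x)  # note !!!.x in place of ...
)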

Don't worry if not all of this clicked yet. We are about to see it all in action!
