datamations's Introduction

datamations

datamations is a framework for automatically generating explanations of the steps of an analysis pipeline. It turns code into animations that show the state of the data at each step of the analysis.

For more information, please visit the package website, which includes additional examples, defaults and conventions, and more.

Installation

You can install datamations from GitHub with:

# install.packages("devtools")
devtools::install_github("microsoft/datamations")

Usage

To get started, load datamations and dplyr:

library(datamations)
library(dplyr)

A datamation shows a plot of what the data looks like at each step of a tidyverse pipeline, animated by the transitions that lead to each state. The following shows an example taking the built-in small_salary data set, grouping by Degree, and calculating the mean Salary.

First, define the code for the pipeline, then generate the datamation with datamation_sanddance():

"small_salary %>%
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

datamations supports the following dplyr functions:

  • group_by() (up to three grouping variables)
  • summarize()/summarise() (limited to summarizing one variable)
  • filter()
  • count()/tally()

datamations's People

Contributors

chisingh, dependabot[bot], fpelaez, georgeiskander25, giorgi-ghviniashvili, hanschaudry, jhofman, linqingz, microsoft-github-policy-service[bot], seankross, sharlagelfand, willdebras, xiaoyingpu


datamations's Issues

Group by two variables consistently when using group_by

Right now, e.g., group_by(Work, Degree) only visualizes the second grouping variable if summarise() is also used afterwards. It should show the grouping by both variables from group_by(Work, Degree) alone.

datamation_sanddance("small_salary %>% group_by(Work, Degree) %>% summarise(mean = mean(Salary))", output = "summarise.gif", nframes = 5)

vs

datamation_sanddance("small_salary %>% group_by(Work, Degree)", output = "group.gif", nframes = 5)

Where should released repo(s) be hosted?

This depends on two factors.

First, whether this is hosted by Microsoft or on a personal account (say @jhofman's).

Second, how we structure things given that we'd like to support other languages some day, for instance Python. In that case it probably wouldn't make sense to have one huge repo. Should we instead have a repo for each base language (R, Python, etc.) and a repo for the rendering package (d3.js)?

Zoom to error bar range

Expanding on #44, would be good to have the final frame zoom in to only show the range of the error bars, not of the full data.

I'll generate a test spec for this.

Add error bars to summarized frames

we discussed adding error bars to the summarized plots.

@giorgi-ghviniashvili: can we do this with just one "layer" in vegalite, or do we have to hack things again? could you play with this in the vegalite editor and then see if gemini can handle it?

if it is possible (🤞), then @sharlagelfand, we'll just need to add to the vegalite spec when exporting.

Update README, ensure table code is working

Once #23 is all closed out and we have a functional widget, will need to update the README illustrating how it works! I don't think you can embed an htmlwidget into a static README, so this may have to be in the form of htmlwidget -> movie -> GIF, but will double check on that.

Since we haven't done anything with the table code yet, will take a pass through and ensure it all still works and can be left in.

Abstract away column names in frame generation functions

Right now the specific "salary" and "degree" column names are used inside of the frame generation functions in d3. If we end up going with our own d3 code for generating frames we'll need to abstract this away to "x" and "y" and have some way to pass which column names map to which variables when calling r2d3.

Let's put this on hold until we figure out #12.
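
As a rough illustration of the abstraction described above, the column mapping could happen on the R side before the data is handed to the renderer. This is only a sketch: map_columns() and its arguments are hypothetical names, not part of datamations or r2d3.

```r
# Hypothetical helper: rename the chosen columns to generic "x" and "y"
# so that rendering code never needs to know the original column names.
map_columns <- function(data, x, y) {
  stopifnot(all(c(x, y) %in% names(data)))
  out <- data[, c(x, y), drop = FALSE]
  names(out) <- c("x", "y")
  out
}

# Toy data for illustration only
salaries <- data.frame(
  degree = c("Masters", "PhD", "PhD"),
  salary = c(81.9, 84.5, 83.8)
)
mapped <- map_columns(salaries, x = "degree", y = "salary")
```

The mapped data frame could then be passed as the data argument when calling the renderer, with the original names sent separately for axis labels.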

Points moving across facets

@giorgi-ghviniashvili I haven't been able to figure out why points are moving across facets - as far as I can tell the IDs match! Could you take a look? The example in #47 shows it and also this one in the app with dataset: penguins and group by: species, island, sex:

flying_axes.mov

The specs for this case are here.

Thanks!!

Create some unit tests

This will require some careful thought because the output are visualizations, not numerical results.

Maybe comparing to a snapshot or checksum or something?

And perhaps hook this up to Github actions for continuous integration?
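
One possible shape for the checksum idea, assuming the thing being tested is a text specification (for example JSON) rather than rendered pixels. spec_checksum() is a hypothetical helper; tools::md5sum() ships with R.

```r
# Hypothetical helper: checksum of a text spec, computed via a temporary file.
spec_checksum <- function(spec_text) {
  path <- tempfile(fileext = ".json")
  on.exit(unlink(path), add = TRUE)
  writeLines(spec_text, path)
  unname(tools::md5sum(path))
}

# Toy specs for illustration only
spec_a <- '{"mark": "point"}'
spec_b <- '{"mark": "bar"}'

# Identical specs give identical checksums; a changed spec is detected.
same <- spec_checksum(spec_a) == spec_checksum(spec_a)
different <- spec_checksum(spec_a) != spec_checksum(spec_b)
```

A stored checksum from a known-good run could then be compared against the checksum of freshly generated output in each test.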

Add data point information on hover?

@giorgi-ghviniashvili is it possible to add tooltip hover over to the points on plots in vegalite / gemini?

if so this could be really useful, both for debugging (for instance to know what the summarized values are in the final frame when checking if the axes are right) and for casual users who want to see more about that datapoint.

Change "play" button to autoplay + slider + replay button

Can we modify the "play" button in the widget to be a slider that lets you scroll through the different steps in the analysis? And can we also make the animation play by default in the shiny app, with a "replay" button that allows for cycling back through the animation?

In a future version we could get really fancy about this where each notch on the slider lists the portion of the pipeline that the step corresponds to, so it was super clear which stage you're looking at. Kind of like a trimmed down version of the banner image at the top of the paper.

Look at gemini internals for fixes/enhancements

I think I should start reading gemini source code and see how it works to be able to fix some internal things. For example this.

Also, #48 is related to gemini recommendations, which do not return the correct gemini animation spec. It would be better to at least debug this and find out why it does not work correctly.

For example, it always gets encode: { enter: true, exit: true, update: true }, even if I set them to false in options.

Show tables and plots side-by-side?

@dggoldst and I realized that there are some data operations that make sense in table-based datamations but not in plot-based ones and vice-versa.

For instance, consider the "select" operation where only certain columns are retained. It's not clear how this would be represented in a plot, but in a table you'd simply drop the columns.

Maybe one way out of this is that we show tables and plots side-by-side, and some animations affect one but not the other. We could also add interactivity so that when you hover over an observation in the plot it highlights in the table and vice-versa.

Replicate a version of plot_degree.gif in d3 using r2d3

Create a version of the full animation here for small_salary data of 100 points in d3 using r2d3.

This should load the data:

load('src/dmpkg/data/small_salary_100.rda')

For now, let's go for the following key frames:

  1. Ungrouped data, in a grid all grey.
  2. Grouped data, in a grid blue and red.
  3. Scatter plot (x = degree, y = salary, color = degree), as you have from #5.
  4. Collapsed points for averages (x = degree, y = average_salary, color = degree), also from #5.

There will be issues with scales zooming in and with showing error bars, etc., that we can address in a future iteration.

Angle x-axis labels so they're not cut off

Discussed already with @giorgi-ghviniashvili, but right now long x-axis labels are cut off (Dream isn't shown at all)

Screen Shot 2021-05-25 at 10 46 02 AM

We should either: pass the angle for the axis labels in the specs, or set them within JS to all have labelAngle = -90, so it looks more like this:

Screen Shot 2021-05-25 at 10 49 21 AM

I'm inclined to just have this handled on the JS side and always set them to -90
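
For comparison, the first option (passing the angle in the specs) might look like the sketch below, assuming the spec is held as a nested R list before export. set_x_label_angle() is a hypothetical helper; labelAngle is the Vega-Lite axis property mentioned above.

```r
# Hypothetical helper: set the x-axis label angle on a Vega-Lite spec
# represented as a nested R list.
set_x_label_angle <- function(spec, angle = -90) {
  spec$encoding$x$axis$labelAngle <- angle
  spec
}

# Toy spec for illustration only
spec <- list(
  mark = "point",
  encoding = list(
    x = list(field = "Degree", type = "nominal"),
    y = list(field = "Salary", type = "quantitative")
  )
)
rotated <- set_x_label_angle(spec)
```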

Package license

We should include a license for the package - R CMD check warns if there isn't one, and just generally good practice to have something listed :) Not sure if there's an existing preference at MSFT, but some more info specifically for R package licensing here: https://r-pkgs.org/license.html

Investigate faceting in vegalite and gemini

For #25, we're looking at possibly using faceting to break up multi-variable groupings into rows / columns / subplots.

If we go this route, we'll need to know how well faceting works in vegalite and if we can animate between facets in gemini.

To investigate this:

  1. Try converting the existing group_by(degree, workplace) dot plot into a faceted plot where the degree is the row and workplace is the column, for a 2-by-2 faceted plot. See here for vegalite facet spec. You probably want the equivalent of a "free scale" within each facet so that it automatically sizes to the range of the data. There might be something in vegalite to specify this.
  2. See if we can animate between facets. You could start with the first dot plot that has no facets and try to transition to the second with facets.

Hopefully these facets show annotations to label the rows and columns, but if not we'll have to think about how to do this ourselves.

Try to render two key frames with vega(lite?) and animate between them

It looks like Vega (or Vega-lite) might be a good "exchange format" for passing plot specifications between a base language (like R) and a rendering language (like d3).

For now, let's try to just make plots of two key frames from the salary datamation using it and see if we can link them with a transition. Specifically:

Frame 1: Scatter plot of x = degree, y = salary, color = degree
Frame 2: Plot of average salary by degree, x = degree, y = mean_salary, color = degree

Allow more than one summarized value

Right now we only support one summarized value, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary))

Maybe in the future could think about how multiple operations (or summarizing multiple variables) could work, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))

Fix small salary data

Looks like there are actually two small salary data sets, which give different results:

library(dplyr)
library(datamations)

small_salary
#> # A tibble: 100 x 6
#>       ID Degree  Work     Salary i     order
#>    <int> <fct>   <fct>     <dbl> <chr> <int>
#>  1    22 Masters Academia   81.9 id        1
#>  2    96 PhD     Academia   84.5 id        2
#>  3    10 Masters Academia   82.9 id        3
#>  4    42 PhD     Academia   83.8 id        4
#>  5    55 PhD     Academia   83.8 id        5
#>  6    14 PhD     Academia   85.3 id        6
#>  7    33 PhD     Industry   91.4 id        7
#>  8   100 PhD     Academia   85.3 id        8
#>  9    57 Masters Academia   83.3 id        9
#> 10     2 PhD     Industry   92.3 id       10
#> # … with 90 more rows

small_salary %>% 
  group_by(Degree) %>%
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <fct>   <dbl>
#> 1 Masters  90.2
#> 2 PhD      88.2

small_salary_data
#> # A tibble: 30 x 3
#>    Degree  Work     Salary
#>    <chr>   <chr>     <dbl>
#>  1 Masters Industry     86
#>  2 Masters Academia     71
#>  3 PhD     Industry    104
#>  4 Masters Industry     94
#>  5 Masters Academia     93
#>  6 Masters Academia     96
#>  7 PhD     Academia    100
#>  8 Masters Industry     86
#>  9 PhD     Academia     80
#> 10 Masters Industry     85
#> # … with 20 more rows

small_salary_data %>%
  group_by(Degree) %>% 
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <chr>   <dbl>
#> 1 Masters  90.6
#> 2 PhD      92.1

@jhofman can you confirm that the one we want is the first, with means 90.2 and 88.2?

Downsample data

As per #50, we can't quite handle 3000 points - so we should warn people to downsample (and if they don't, downsample ourselves with a warning).

I'll do some experimenting to see what the cutoff for downsampling is first.
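
A minimal sketch of the "downsample ourselves with a warning" behavior, using base R only. The function name and the max_rows default of 500 are placeholders, since the actual cutoff is still to be determined.

```r
# Hypothetical helper: randomly downsample rows when the data is too large,
# warning the user that it happened. The default cutoff is a placeholder.
downsample_with_warning <- function(data, max_rows = 500) {
  n <- nrow(data)
  if (n <= max_rows) {
    return(data)
  }
  warning(
    sprintf("Data has %d rows; downsampling to %d rows.", n, max_rows),
    call. = FALSE
  )
  keep <- sort(sample.int(n, max_rows))  # sort to preserve original row order
  data[keep, , drop = FALSE]
}

# Toy data for illustration only
big <- data.frame(id = seq_len(3000))
small <- suppressWarnings(downsample_with_warning(big))
```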

Investigate vegalite in a Shiny app

Related to #27, how feasible is it to render a vegalite plot inside of a Shiny app?

To check this out, let's try a simple Shiny app with a dropdown that lets you choose between a bar chart and a colored bar chart.

The JSON can be hard-coded in. If successful hopefully we can port to a working datamation version that mirrors something like this.

(How) can we parse and handle a ggplot command at the end of a pipeline?

Right now we're sort of implicitly assuming that grouping variables become faceting variables, which is reasonable and will generalize. But what if someone wants control over this? More generally, we want to "respect" the final plot that they generate and have the steps leading up to that reflect this.

To illustrate, imagine the same data analysis pipeline, but with three different plotting commands at the end. Right now we'd show the same datamation for each, but in theory they should end in different frames (and so should also contain different frames leading up to that).

Degree on the x, Work as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)

vs

Degree on the x, Work and Degree as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_grid(Degree ~ Work)

vs

Degree on the x, Work as (dodged) color, no facet

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary, color = Work)) +
  geom_point(position = position_dodge(width = 0.25))

This will require a bunch of thinking and probably some hacking of ggproto objects, but let's do the thinking before the hacking.

Decide where calculation of infogrid and jitter coordinates happens

Should this be on the side of the "base language" (R) or the "rendering language" (javascript)?

If we can push this to the rendering language we'd save duplicated effort when porting to new base languages (like Python), but it doesn't seem like there's a natural way to do this vega, so we'd have to roll our own?

Shiny app illustrating widget

Again, once #23 is closed out and the widget is functional, should create a Shiny app that allows you to choose a data set, grouping variables, summary variable + operation, and shows the code and widget/animation, similar to the app here.

I had a live prototype of this that I wrote over (!) but the code is here for reusing.

What should multi-variable grouping look like in the general case?

We currently have something that looks good for degree and work in the salary example.

  • Should we put limits on the number of grouping variables and number of levels for each grouping variable? Can we handle 3 binary grouping variables, for instance?
  • Do we want the behavior that we currently have (spatial break vs. underlining)?
  • Should the order in which the grouping variables are specified in the code be reflected in the visualization?

Pipeline cases that don't work

Just keeping track of some examples of pipelines that don't work, for fixing/testing with later:

more than 1 value summarised

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))

(the second one is just ignored, gif includes mean only)

Allow pipeline without quoting

Would be very slick to allow the pipeline without needing to pass it as a character vector, e.g.

small_salary_data %>%
  group_by(Degree) %>%
  summarize(mean = mean(Salary)) %>%
  datamation_sanddance()

instead of

"small_salary_data %>% group_by(Degree) %>% summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

Have been looking into this a bit but writing it down here to track!
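
One base R building block that might help here is capturing an unevaluated expression with substitute() and turning it back into text with deparse(). The sketch below covers only a wrapper-call form, not the piped form shown above, and pipeline_to_string() is a hypothetical name.

```r
# Hypothetical helper: capture the pipeline expression without evaluating it
# and convert it to a single string.
pipeline_to_string <- function(expr) {
  paste(deparse(substitute(expr)), collapse = " ")
}

# The argument is never evaluated, so the names need not exist here.
code <- pipeline_to_string(small_salary %>% group_by(Degree))
```

Supporting `... %>% datamation_sanddance()` directly is harder, because the pipe passes its left-hand side to the function as a placeholder rather than as the original expression.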

Axes and legend appearing too late

@giorgi-ghviniashvili just wanted to track here the issue that axes are appearing too late - they should appear as soon as the infogrid -> jitter transition starts, but they appear when jitter -> summary starts:

axes_legend_too_late.mov

I also just noticed that the legend doesn't appear until jitter -> summary! Ideally it should appear as soon as the faceted infogrid starts to animate into the colored version. Maybe has to do with the fake axes and when those show up?

You can see this on the app with dataset: penguins and group by: species, island, sex

Thanks!

Use slider to change tab in Shiny app

As a next step on #30, looking into if it's possible / how difficult it is to have changing the slider in the datamation update which tab of data is shown in the app.

Create datamation-editor

Create a web based tool called datamation-editor (name suggestions are welcome).

With this tool, we can add json spec (either vega-lite or vega) or directly import from raw github.

We can add multiple specs and adjust their sequence as we want; pressing the play button starts the animation and transitions between any valid pair of specs.

Also add an input to paste or import gemini specs.

Resolve warnings that happen from README demo code

Running this:

library(tidyverse)
library(datamations)

mean_salary_by_degree_pipe <- "small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary))"

degree_title_step1 <- "Step 1: Each dot shows one person\n and each group shows degree type"
degree_title_step2 <- "Step 2: Next you plot the salary of each person\n within each group"
degree_title_step3 <- "Step 3: Lastly you plot the average salary \n of each group and zoom in"

datamation_sanddance(
  pipeline = mean_salary_by_degree_pipe,
  output = "mean_salary_group_by_degree.gif",
  titles = c(degree_title_step1, degree_title_step2, degree_title_step3),
  nframes = 30
)

gives the following warning messages:

Warning messages:
1: Unknown or uninitialised column: `.id`. 
2: Unknown or uninitialised column: `.frame`. 
3: Unknown or uninitialised column: `.id`. 
4: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
5: Unknown or uninitialised column: `.id`. 
6: Unknown or uninitialised column: `.frame`. 
7: Unknown or uninitialised column: `.id`. 
8: Unknown or uninitialised column: `.id`. 
