microsoft / datamations

Home Page: https://microsoft.github.io/datamations/

License: Other

R 4.19% CSS 0.09% JavaScript 19.04% HTML 67.76% Python 1.23% Jupyter Notebook 7.70% Shell 0.01%

datamations's Introduction

datamations

datamations is a framework for automatically generating explanations of the steps of an analysis pipeline. It turns code into animations, showing the state of the data at each step of an analysis.

For more information, please visit the package website, which includes additional examples, defaults and conventions, and more.

Installation

You can install datamations from GitHub with:

# install.packages("devtools")
devtools::install_github("microsoft/datamations")

Usage

To get started, load datamations and dplyr:

library(datamations)
library(dplyr)

A datamation shows a plot of what the data looks like at each step of a tidyverse pipeline, animated by the transitions that lead to each state. The following shows an example taking the built-in small_salary data set, grouping by Degree, and calculating the mean Salary.

First, define the code for the pipeline, then generate the datamation with datamation_sanddance():

"small_salary %>%
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

datamations supports the following dplyr functions:

group_by() (up to three grouping variables)
summarize()/summarise() (limited to summarizing one variable)
filter()
count()/tally()

datamations's People

sharlagelfand standardgalactic chisingh fpelaez willdebras g-arj seankross slee1009 test-mass-forker-org-1 darylroberts vjayalakshmik somachak

datamations's Issues

Group by two variables consistently when using group_by

Right now, group_by(Work, Degree), for example, only visualizes the second grouping variable if summarise() is also used afterwards. It should show the grouping by both variables from group_by(Work, Degree) alone.

datamation_sanddance("small_salary %>% group_by(Work, Degree) %>% summarise(mean = mean(Salary))", output = "summarise.gif", nframes = 5)

datamation_sanddance("small_salary %>% group_by(Work, Degree)", output = "group.gif", nframes = 5)

Where should released repo(s) be hosted?

This depends on two factors.

First, whether this is hosted by Microsoft or on a personal account (say @jhofman's).

Second, how we structure things given that we'd like to support other languages some day, for instance Python. In that case it probably wouldn't make sense for it to be one huge repo, but instead to have a repo for each base language (R, Python, etc.) and a repo for the rendering package (d3.js)?

Zoom to error bar range

Expanding on #44, would be good to have the final frame zoom in to only show the range of the error bars, not of the full data.

I'll generate a test spec for this.

Add error bars to summarized frames

we discussed adding error bars to the summarized plots.

@giorgi-ghviniashvili: can we do this with just one "layer" in vegalite, or do we have to hack things again? could you play with this in the vegalite editor and then see if gemini can handle it?

if it is possible (🤞), then @sharlagelfand, we'll just need to add to the vegalite spec when exporting.

Hack faceted vega specs to build fake faceted view in a single plot

Update README, ensure table code is working

Once #23 is all closed out and we have a functional widget, will need to update the README illustrating how it works! I don't think you can embed an htmlwidget into a static README so this may have to be in the form of htmlwidget -> movie -> GIF, but will double check on that.

Since we haven't done anything with the table code yet, will take a pass through and ensure it all still works and can be left in.

Abstract away column names in frame generation functions

Right now the specific "salary" and "degree" column names are used inside of the frame generation functions in d3. If we end up going with our own d3 code for generating frames we'll need to abstract this away to "x" and "y" and have some way to pass which column names map to which variables when calling r2d3.

Let's put this on hold until we figure out #12.
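
To make the idea concrete, here is a minimal base R sketch of what such a mapping step could look like before data is handed to the renderer. The helper name apply_column_mapping() and the shape of the mapping argument are hypothetical, not part of the package; the example values are only illustrative.

```r
# Hypothetical helper: rename the user's columns to generic roles ("x", "y")
# so that rendering code never needs to know the original column names.
apply_column_mapping <- function(data, mapping) {
  # `mapping` is a named character vector: names are roles, values are columns.
  stopifnot(all(mapping %in% names(data)))
  out <- data[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out
}

salaries <- data.frame(
  degree = c("Masters", "PhD"),
  salary = c(90.2, 88.2),
  work = c("Industry", "Academia")
)

mapped <- apply_column_mapping(salaries, c(x = "degree", y = "salary"))
print(mapped)
```

The renderer would then only ever see columns named x and y, and the mapping itself could be passed alongside the data for labelling axes.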

Points moving across facets

@giorgi-ghviniashvili I haven't been able to figure out why points are moving across facets - as far as I can tell the IDs match! Could you take a look? The example in #47 shows it and also this one in the app with dataset: penguins and group by: species, island, sex:

flying_axes.mov

The specs for this case are here.

Thanks!!

Export vega specs for all frames from R to json using vegawidget

@sharlagelfand, can you add each of the frames in vegawidget_exploration.R to a json file in that same directory for @giorgi-ghviniashvili to use, and number them sequentially?

Note, #21 depends on this.

Create some unit tests

This will require some careful thought because the outputs are visualizations, not numerical results.

Maybe comparing to a snapshot or checksum or something?

And perhaps hook this up to Github actions for continuous integration?
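
One way to picture the snapshot idea is to test the spec that describes a frame rather than the rendered image. The sketch below uses only base R; check_against_snapshot() is a hypothetical helper, and the assumption is that each frame can be represented as a plain R list. (testthat also offers snapshot expectations, which may be a better fit in practice.)

```r
# Hypothetical helper: record a snapshot on the first run, compare afterwards.
check_against_snapshot <- function(spec, path) {
  if (!file.exists(path)) {
    saveRDS(spec, path)  # first run: store the reference spec
    return(TRUE)
  }
  identical(spec, readRDS(path))
}

spec <- list(mark = "point", encoding = list(x = list(field = "Degree")))
snapshot_file <- tempfile(fileext = ".rds")

first_run <- check_against_snapshot(spec, snapshot_file)   # records the snapshot
second_run <- check_against_snapshot(spec, snapshot_file)  # compares, unchanged

spec$mark <- "bar"
after_change <- check_against_snapshot(spec, snapshot_file)  # spec has drifted
print(c(first_run, second_run, after_change))
```

This only catches changes in the generated specs, not in how they are rendered, so it would complement rather than replace a visual check.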

Add data point information on hover?

@giorgi-ghviniashvili is it possible to add tooltip hover over to the points on plots in vegalite / gemini?

if so this could be really useful, both for debugging (for instance to know what the summarized values are in the final frame when checking if the axes are right) and for casual users who want to see more about that datapoint.

Change "play" button to autoplay + slider + replay button

Can we modify the "play" button in the widget to be a slider that lets you scroll through the different steps in the analysis? And can we also make the animation play by default in the shiny app, with a "replay" button that allows for cycling back through the animation?

In a future version we could get really fancy about this where each notch on the slider lists the portion of the pipeline that the step corresponds to, so it was super clear which stage you're looking at. Kind of like a trimmed down version of the banner image at the top of the paper.

Look at gemini internals for fixes/enhancements

I think I should start reading gemini source code and see how it works to be able to fix some internal things. For example this.

Also #48 is related to gemini recommendations, which do not correctly produce the recommended gemini animation spec. It would be better to at least debug and find out why this does not work correctly.

For example, it always gets encode: { enter: true, exit: true, update: true }, even if I set them to false in options.

Show tables and plots side-by-side?

@dggoldst and I realized that there are some data operations that make sense in table-based datamations but not in plot-based ones and vice-versa.

For instance, consider the "select" operation where only certain columns are retained. It's not clear how this would be represented in a plot, but in a table you'd simply drop the columns.

Maybe one way out of this is that we show table and plots side-by-side, and some animations affect one but not the other. We could also add interactivity so that when you hover over an observation in the plot it highlights in the table and vice-versa.

Replicate a version of plot_degree.gif in d3 using r2d3

Create a version of the full animation here for small_salary data of 100 points in d3 using r2d3.

This should load the data:

load('src/dmpkg/data/small_salary_100.rda')

For now, let's go for the following key frames:

Ungrouped data, in a grid all grey.
Grouped data, in a grid blue and red.
Scatter plot (x = degree, y = salary, color = degree), as you have from #5.
Collapsed points for averages (x = degree, y = average_salary, color = degree), also from #5.

There will be issues with scales zooming in and with showing error bars, etc., that we can address in a future iteration.

List out all verbs we'd like to datamate

Let's discuss what different verbs should look like. Taking them from the Tidyverse cheat sheet.

verb	plot-based animation	table-based animation
mutate
group_by
summarize
arrange
filter
select
distinct
join
bind_rows
bind_cols

Feel free to edit inline here instead if useful.

Related to #30.

Flow for multiple group by + summarise steps

Want to test out if it's possible to do group_by -> summarise -> group_by -> summarise (or e.g. group_by -> summarise -> summarise) - @jhofman will provide an example

Collect examples of code we'd want to datamate

Drop snippets of tidyverse pipelines here!

Angle x-axis labels so they're not cut off

Discussed already with @giorgi-ghviniashvili, but right now long x-axis labels are cut off (Dream isn't shown at all)

We should either: pass the angle for the axis labels in the specs, or set them within JS to all have labelAngle = -90, so it looks more like this:

I'm inclined to just have this handled on the JS side and always set them to -90

Figure out how to view vega/gemini renderings in R browser

At some point we need to load the appropriate javascript libraries and pass data from R to html.

Can do so with files, but is there a better way (akin to r2d3)?

Get current version of R package working

Test on @giorgi-ghviniashvili and @jhofman's machines.

Add custom aggregation animations

Right now mean shows points collapsing. Here are suggestions for how other aggregation operations can be animated: https://idl.cs.washington.edu/files/2019-AnimatedAggregates-EuroVis.pdf

Package license

We should include a license for the package - R CMD check warns if there isn't one, and just generally good practice to have something listed :) Not sure if there's an existing preference at MSFT, but some more info specifically for R package licensing here: https://r-pkgs.org/license.html

Investigate faceting in vegalite and gemini

For #25, we're looking at possibly using faceting to break up multi-variable groupings into rows / columns / subplots.

If we go this route, we'll need to know how well faceting works in vegalite and if we can animate between facets in gemini.

To investigate this:

Try converting the existing group_by(degree, workplace) dot plot into a faceted plot where degree is the row and workplace is the column, for a 2-by-2 faceted plot. See here for the vegalite facet spec. You probably want the equivalent of a "free scale" within each facet so that it automatically sizes to the range of the data. There might be something in vegalite to specify this.
See if we can animate between facets. You could start with the first dot plot that has no facets and try to transition to the second with facets.

Hopefully these facets show annotations to label the rows and columns, but if not we'll have to think about how to do this ourselves.
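
As a starting point for the first step, here is a rough sketch of a faceted Vega-Lite spec written as a nested R list. It assumes Vega-Lite's facet operator with row and column channels, and uses resolve with an independent y scale as the "free scale" equivalent. The data rows are a handful taken from small_salary; the variable names are illustrative only.

```r
# A few rows from small_salary, inlined so the sketch is self-contained.
salary_values <- list(
  list(Degree = "Masters", Work = "Academia", Salary = 81.9),
  list(Degree = "PhD", Work = "Academia", Salary = 84.5),
  list(Degree = "PhD", Work = "Industry", Salary = 91.4),
  list(Degree = "PhD", Work = "Industry", Salary = 92.3)
)

faceted_spec <- list(
  data = list(values = salary_values),
  # Degree defines the rows, Work the columns of the facet grid.
  facet = list(
    row = list(field = "Degree", type = "nominal"),
    column = list(field = "Work", type = "nominal")
  ),
  # The inner spec is drawn once per facet cell.
  spec = list(
    mark = "point",
    encoding = list(
      x = list(field = "Degree", type = "nominal"),
      y = list(field = "Salary", type = "quantitative")
    )
  ),
  # Assumed "free scale" equivalent: each facet gets its own y scale.
  resolve = list(scale = list(y = "independent"))
)

str(faceted_spec, max.level = 2)
```

Whether gemini can animate into a spec shaped like this is exactly the open question of the second step.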

Try to render two key frames with vega(lite?) and animate between them

It looks like Vega (or Vega-lite) might be a good "exchange format" for passing plot specifications between a base language (like R) and a rendering language (like d3).

For now, let's try to just make plots of two key frames from the salary datamation using it and see if we can link them with a transition. Specifically:

Frame 1: Scatter plot of x = degree, y = salary, color = degree
Frame 2: Plot of average salary by degree, x = degree, y = mean_salary, color = degree
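
The two key frames could be sketched as Vega-Lite specs built from nested R lists, as below. This is only an illustration under assumptions: the data rows are the first four shown for small_salary, and frame 2 uses Vega-Lite's aggregate = "mean" on Salary instead of a precomputed mean_salary column, which is a substitution and not necessarily how the package should do it.

```r
# First four rows of small_salary, inlined so the sketch is self-contained.
salary_values <- list(
  list(Degree = "Masters", Salary = 81.9),
  list(Degree = "PhD", Salary = 84.5),
  list(Degree = "Masters", Salary = 82.9),
  list(Degree = "PhD", Salary = 83.8)
)

# Encoding channels shared by both frames.
base_encoding <- list(
  x = list(field = "Degree", type = "nominal"),
  color = list(field = "Degree", type = "nominal")
)

# Frame 1: one point per person.
frame_1 <- list(
  data = list(values = salary_values),
  mark = "point",
  encoding = c(base_encoding, list(
    y = list(field = "Salary", type = "quantitative")
  ))
)

# Frame 2: one point per degree, at the mean salary.
frame_2 <- list(
  data = list(values = salary_values),
  mark = "point",
  encoding = c(base_encoding, list(
    y = list(field = "Salary", type = "quantitative", aggregate = "mean")
  ))
)

str(frame_2$encoding$y)
```

The lists could then be serialized to JSON (for example with jsonlite::toJSON(frame_1, auto_unbox = TRUE)) and handed to whatever links the two frames with a transition.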

Allow more than one summarized value

Right now we only support one summarized value, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary))

Maybe in the future could think about how multiple operations (or summarizing multiple variables) could work, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))

Use gemini to animate all frames exported from R

Commit json and/or html to generate visualizations in sandbox/[subdir]

Fix small salary data

Looks like there are actually two small salary data sets, which give different results:

library(dplyr)
library(datamations)

small_salary
#> # A tibble: 100 x 6
#>       ID Degree  Work     Salary i     order
#>    <int> <fct>   <fct>     <dbl> <chr> <int>
#>  1    22 Masters Academia   81.9 id        1
#>  2    96 PhD     Academia   84.5 id        2
#>  3    10 Masters Academia   82.9 id        3
#>  4    42 PhD     Academia   83.8 id        4
#>  5    55 PhD     Academia   83.8 id        5
#>  6    14 PhD     Academia   85.3 id        6
#>  7    33 PhD     Industry   91.4 id        7
#>  8   100 PhD     Academia   85.3 id        8
#>  9    57 Masters Academia   83.3 id        9
#> 10     2 PhD     Industry   92.3 id       10
#> # … with 90 more rows

small_salary %>% 
  group_by(Degree) %>%
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <fct>   <dbl>
#> 1 Masters  90.2
#> 2 PhD      88.2

small_salary_data
#> # A tibble: 30 x 3
#>    Degree  Work     Salary
#>    <chr>   <chr>     <dbl>
#>  1 Masters Industry     86
#>  2 Masters Academia     71
#>  3 PhD     Industry    104
#>  4 Masters Industry     94
#>  5 Masters Academia     93
#>  6 Masters Academia     96
#>  7 PhD     Academia    100
#>  8 Masters Industry     86
#>  9 PhD     Academia     80
#> 10 Masters Industry     85
#> # … with 20 more rows

small_salary_data %>%
  group_by(Degree) %>% 
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <chr>   <dbl>
#> 1 Masters  90.6
#> 2 PhD      92.1

@jhofman can you confirm that the one we want is the first, with means 90.2 and 88.2?

Link to js dependencies instead of storing locally

Need to figure out a way to link out to the dependencies instead of having them locally - there is an issue indicating this wasn't possible at some point, but I'll keep digging to see if it is now or if we can find some other creative solution.

Downsample data

As per #50, we can't quite handle 3000 points - so we should warn people to downsample (and if they don't, downsample ourselves with a warning).

I'll do some experimenting to see what the cutoff for downsampling is first.
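
For reference, the "downsample ourselves with a warning" behaviour could look roughly like this in base R. The helper downsample_with_warning() is hypothetical, and max_points = 1000 is a placeholder, not the measured cutoff.

```r
# Hypothetical helper: return `data` unchanged if it is small enough,
# otherwise warn and keep a random subset of rows.
downsample_with_warning <- function(data, max_points = 1000, seed = NULL) {
  n <- nrow(data)
  if (n <= max_points) {
    return(data)
  }
  warning(
    sprintf("Data has %d rows; downsampling to %d rows for animation.", n, max_points),
    call. = FALSE
  )
  if (!is.null(seed)) set.seed(seed)
  keep <- sort(sample.int(n, max_points))  # sort to preserve original row order
  data[keep, , drop = FALSE]
}

big <- data.frame(id = seq_len(3000), value = rnorm(3000))
small <- suppressWarnings(downsample_with_warning(big, max_points = 1000, seed = 1))
print(nrow(small))
```

Simple random sampling is the easiest option; sampling within groups might be preferable if some groups are small, but that is a separate design decision.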

Step through README examples and comment code to clarify

Presumably there's a lot of coordinate hacking to get animations to line up. Would be good to know how much of this is happening and where.

Investigate vegalite in a Shiny app

Related to #27, how feasible is it to render a vegalite plot inside of a Shiny app?

To check this out, let's try a simple Shiny app with a dropdown that lets you choose between a bar chart and a colored bar chart.

The JSON can be hard-coded in. If successful hopefully we can port to a working datamation version that mirrors something like this.

(How) can we parse and handle a ggplot command at the end of a pipeline?

Right now we're sort of implicitly assuming that grouping variables become faceting variables, which is reasonable and will generalize. But what if someone wants control over this? More generally, we want to "respect" the final plot that they generate and have the steps leading up to that reflect this.

To illustrate, imagine the same data analysis pipeline, but with three different plotting commands at the end. Right now we'd show the same datamation for each, but in theory they should end in different frames (and so should also contain different frames leading up to that).

Degree on the x, Work as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)

Degree on the x, Work and Degree as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_grid(Degree ~ Work)

Degree on the x, Work as (dodged) color, no facet

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary, color = Work)) +
  geom_point(position = position_dodge(width = 0.25))

This will require a bunch of thinking and probably some hacking of ggproto objects, but let's do the thinking before the hacking.

Possible to have dynamic width of slider bar / description?

Curious if it's possible to have the width of the slider bar and description adjust to the width of the actual plot (which isn't fixed):

e.g. instead of spanning the whole container

only being as wide as the final plot is? @giorgi-ghviniashvili is this possible?

Clone repo and run R examples

See the README and feel free to update if anything is broken / needs explaining.

Decide where calculation of infogrid and jitter coordinates happens

Should this be on the side of the "base language" (R) or the "rendering language" (javascript)?

If we can push this to the rendering language we'd save duplicated effort when porting to new base languages (like Python), but it doesn't seem like there's a natural way to do this in vega, so we'd have to roll our own?
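
To give a sense of how much logic is at stake, here is a minimal sketch of computing grid coordinates on the R side. The function infogrid_coords() and its roughly-square layout are assumptions for illustration; the actual infogrid layout in the package may differ.

```r
# Hypothetical helper: place n points row by row in a grid with `ncol` columns.
infogrid_coords <- function(n, ncol = ceiling(sqrt(n))) {
  idx <- seq_len(n) - 1L
  data.frame(
    id = seq_len(n),
    x = idx %% ncol + 1,   # column position, cycling 1..ncol
    y = idx %/% ncol + 1   # row position, increasing every `ncol` points
  )
}

grid <- infogrid_coords(5)
print(grid)
```

For five points this gives three columns, with x = 1, 2, 3, 1, 2 and y = 1, 1, 1, 2, 2. Logic this small would be cheap to duplicate per base language, though jitter and grouped layouts are likely to be more involved.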

Shiny app illustrating widget

Again, once #23 is closed out and the widget is functional, should create a Shiny app that allows you to choose a data set, grouping variables, summary variable + operation, and shows the code and widget/animation, similar to the app here.

I had a live prototype of this that I wrote over (!) but the code is here for reusing.

Figure out where d3.js files can go in current repo

Make sure this is a spot that doesn't create conflicts with the current R package structure.

Identify integration points for d3.js in current code

Right now the code uses ggplot2, gganimate, and its lower-level cousin tweenr to render animations.

We'll need to figure out the right place to hand off rendering to d3.js. We can do this narrowly, but we want to strike a good balance of ease of use and modularity with flexibility.

What should multi-variable grouping look like in the general case?

We currently have something that looks good for degree and work in the salary example.

Should we put limits on the number of grouping variables and number of levels for each grouping variable? Can we handle 3 binary grouping variables, for instance?
Do we want the behavior that we currently have (spatial break vs. underlining)?
Should the order in which the grouping variables are specified in the code be reflected in the visualization?

Pipeline cases that don't work

Just keeping track of some examples of pipelines that don't work, for fixing/testing with later:

more than 1 value summarised

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))

(the second one is just ignored, gif includes mean only)
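
Until this is supported, one option would be to detect the case up front instead of silently ignoring it. The sketch below parses the pipeline string with base R and counts the arguments passed to summarize()/summarise(); count_summary_args() is a hypothetical helper, not an existing function in the package.

```r
# Hypothetical helper: count the arguments of the summarize()/summarise() call
# in a pipeline given as a string. Returns 0 if there is no such call.
count_summary_args <- function(pipeline) {
  found <- 0L
  walk <- function(e) {
    if (!is.call(e)) return(invisible(NULL))
    fn <- e[[1]]
    if (is.name(fn) && as.character(fn) %in% c("summarize", "summarise")) {
      found <<- length(e) - 1L  # a call's length is the function plus its arguments
    }
    for (arg in as.list(e)[-1]) walk(arg)  # recurse into nested calls
    invisible(NULL)
  }
  # parse() only needs valid syntax, so %>% does not have to be loaded.
  walk(parse(text = pipeline)[[1]])
  found
}

two_values <- "small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))"
n_args <- count_summary_args(two_values)
if (n_args > 1) {
  message("Pipeline summarizes ", n_args, " values; only the first would be animated.")
}
```

This does not handle namespaced calls such as dplyr::summarize(), which would need an extra check.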

Multiple axis in fake faceted views

I have an idea: render two vega specs, one for the axes that is not touched by Gemini, and a second for the fake faceted view that is animated by Gemini.

Allow pipeline without quoting

Would be very slick to allow the pipeline without needing to pass it as a character vector, e.g.

small_salary_data %>%
  group_by(Degree) %>%
  summarize(mean = mean(Salary)) %>%
  datamation_sanddance()

instead of

"small_salary_data %>% group_by(Degree) %>% summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

Have been looking into this a bit but writing it down here to track!
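
As a note on one possible building block: base R's substitute() can capture an unevaluated expression, which deparse() then turns back into a string. The sketch below shows this for a pipeline passed as an argument; pipeline_to_string() is a hypothetical helper, and this does not by itself solve the harder case of ending the pipe in datamation_sanddance().

```r
# Hypothetical helper: turn an unquoted pipeline into a single string.
pipeline_to_string <- function(expr) {
  # substitute() captures the expression without evaluating it, so neither the
  # data set nor the %>% operator needs to exist for this to work.
  paste(deparse(substitute(expr), width.cutoff = 500L), collapse = " ")
}

pipeline_string <- pipeline_to_string(
  small_salary_data %>%
    group_by(Degree) %>%
    summarize(mean = mean(Salary))
)
print(pipeline_string)
```

The resulting string could then be passed to the existing string-based interface.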

Axes and legend appearing too late

@giorgi-ghviniashvili just wanted to track here the issue that axes are appearing too late - they should appear as soon as the infogrid -> jitter transition starts, but they appear when jitter -> summary starts:

axes_legend_too_late.mov

I also just noticed that the legend doesn't appear until jitter -> summary! Ideally it should appear as soon as the faceted infogrid starts to animate into the colored version. Maybe has to do with the fake axes and when those show up?

You can see this on the app with dataset: penguins and group by: species, island, sex

Thanks!

Get a basic d3.js example working with r2d3

This package, r2d3, should allow you to execute d3.js files in R.

Fix gemini axes, labels and other component issues

Even with gemini recommendations there are some errors, and axes, labels, etc. are not drawn correctly. We need to fix this by learning the structure of gemini animation specs and building them ourselves.

Use slider to change tab in Shiny app

As a next step on #30, looking into whether it's possible (and how difficult it is) to have the slider in the datamation control which tab of data is shown in the app.

Create datamation-editor

Create a web based tool called datamation-editor (name suggestions are welcome).

With this tool, we can add a json spec (either vega-lite or vega) or import one directly from raw github.

We can add multiple specs and adjust their sequence as we want; pressing the play button starts the animation and transitions between any valid spec pairs.

Also add an input to paste or import gemini specs.

Resolve warnings that happen from README demo code

Running this:

library(tidyverse)
library(datamations)

mean_salary_by_degree_pipe <- "small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary))"

degree_title_step1 <- "Step 1: Each dot shows one person\n and each group shows degree type"
degree_title_step2 <- "Step 2: Next you plot the salary of each person\n within each group"
degree_title_step3 <- "Step 3: Lastly you plot the average salary \n of each group and zoom in"

datamation_sanddance(
  pipeline = mean_salary_by_degree_pipe,
  output = "mean_salary_group_by_degree.gif",
  titles = c(degree_title_step1, degree_title_step2, degree_title_step3),
  nframes = 30
)

gives the following warning messages:

Warning messages:
1: Unknown or uninitialised column: `.id`. 
2: Unknown or uninitialised column: `.frame`. 
3: Unknown or uninitialised column: `.id`. 
4: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
5: Unknown or uninitialised column: `.id`. 
6: Unknown or uninitialised column: `.frame`. 
7: Unknown or uninitialised column: `.id`. 
8: Unknown or uninitialised column: `.id`.

Some question about R

Hi @jhofman and @sharlagelfand, I will ask some questions about R here; if you have some time, please answer.

Some of them might be very dumb, sorry about that.