
gtsummary's Introduction


gtsummary

The {gtsummary} package provides an elegant and flexible way to create publication-ready analytical and summary tables using the R programming language. The {gtsummary} package summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.

  • Summarize data frames or tibbles easily in R. Perfect for presenting descriptive statistics, comparing group demographics (e.g., creating a Table 1 for medical journals), and more. Automatically detects continuous, categorical, and dichotomous variables in your data set, calculates appropriate descriptive statistics, and reports the amount of missingness in each variable.

  • Summarize regression models in R and include reference rows for categorical variables. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically identified and the tables are pre-filled with appropriate column headers (e.g., Odds Ratio and Hazard Ratio).

  • Customize gtsummary tables using a growing list of formatting/styling functions. Bold labels, italicize levels, add p-value to summary tables, style the statistics however you choose, merge or stack tables to present results side by side… there are so many possibilities to create the table of your dreams!

  • Report statistics inline from summary tables and regression summary tables in R markdown. Make your reports completely reproducible!

By leveraging {broom}, {gt}, and {labelled} packages, {gtsummary} creates beautifully formatted, ready-to-share summary and result tables in a single line of R code!

Check out the examples below, review the vignettes for a detailed exploration of the output options, and view the gallery for various customization examples.

Installation

The {gtsummary} package was written as a companion to the {gt} package from RStudio. You can install {gtsummary} with the following code.

install.packages("gtsummary")

Install the development version of {gtsummary} with:

remotes::install_github("ddsjoberg/gtsummary")

Examples

Summary Table

Use tbl_summary() to summarize a data frame.


Example basic table:

library(gtsummary)

# summarize the data with our package
table1 <- 
  trial %>%
  tbl_summary(include = c(age, grade, response))

There are many customization options to add information (like comparing groups) and format results (like bold labels) in your table. See the tbl_summary() tutorial for many more options, or below for one example.

table2 <-
  tbl_summary(
    trial,
    include = c(age, grade, response),
    by = trt, # split table by group
    missing = "no" # don't list missing data separately
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  modify_header(label = "**Variable**") %>% # update the column header
  bold_labels()

Regression Models

Use tbl_regression() to easily and beautifully display regression model results in a table. See the tutorial for customization options.

mod1 <- glm(response ~ trt + age + grade, trial, family = binomial)

t1 <- tbl_regression(mod1, exponentiate = TRUE)

Side-by-side Regression Models

You can also present side-by-side regression model results using tbl_merge().

library(survival)

# build survival model table
t2 <-
  coxph(Surv(ttdeath, death) ~ trt + grade + age, trial) %>%
  tbl_regression(exponentiate = TRUE)

# merge tables
tbl_merge_ex1 <-
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("**Tumor Response**", "**Time to Death**")
  )

Review even more output options in the table gallery.

gtsummary + R Markdown

The {gtsummary} package was written to be a companion to the {gt} package from RStudio. But not all output types are supported by the {gt} package. Therefore, we have made it possible to print {gtsummary} tables with various engines.

Review the gtsummary + R Markdown vignette for details.

Save Individual Tables

{gtsummary} tables can also be saved directly to file as an image or as an HTML, Word, RTF, or LaTeX file.

tbl %>%
  as_gt() %>%
  gt::gtsave(filename = ".") # use extensions .png, .html, .docx, .rtf, .tex, .ltx

Additional Resources

Cite gtsummary

> citation("gtsummary")

To cite gtsummary in publications use:

  Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible summary tables with the gtsummary package.
  The R Journal 2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.

A BibTeX entry for LaTeX users is

  @Article{gtsummary,
    author = {Daniel D. Sjoberg and Karissa Whiting and Michael Curry and Jessica A. Lavery and Joseph Larmarange},
    title = {Reproducible Summary Tables with the gtsummary Package},
    journal = {{The R Journal}},
    year = {2021},
    url = {https://doi.org/10.32614/RJ-2021-053},
    doi = {10.32614/RJ-2021-053},
    volume = {13},
    issue = {1},
    pages = {570-580},
  }

Contributing

Big thank you to @jeffreybears for the hex sticker!

Please note that the {gtsummary} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Thank you to all contributors!

gtsummary's People

Contributors

bx259, corradolanera, davisvaughan, ddsjoberg, dereksonderegger, emilyvertosick, gorkang, hughjonesd, indrajeetpatil, jalavery, jemus42, jennybc, jflynn264, jongretar, karissawhiting, larmarange, ltin1214, mfansler, michaelcurry1123, mljaniczek, msberends, oranwutang, pedersebastian, shannonpileggi, shaunporwal, slobaugh, storopoli, szimmer, yoursdearboy, zabore


gtsummary's Issues

tbl_survival does not handle escape characters in variable levels

library(survival)
library(gtsummary)
library(dplyr)
lung2 <- mutate(lung, s2 = ifelse(sex == 1, "1\2", "2/3"))
tbl_survival(lung2, Surv(time,status)~s2, times = c(500,540))

In the output, the function does not print "1\2"; it prints 1 followed by a special character.
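One possible contributing factor, which the issue does not confirm, is how R itself parses the string literal: inside a quoted string, a backslash followed by a digit is read as an octal escape. The base R sketch below only inspects the two spellings of the string; it does not call tbl_survival(), and the variable names are chosen for illustration.

```r
# "\2" inside an R string literal is an octal escape (character code 2),
# so this string holds two characters: "1" and a control character.
unescaped <- "1\2"

# A doubled backslash is how a single literal backslash is written.
escaped <- "1\\2"

nchar(unescaped)      # number of characters in the unescaped spelling
utf8ToInt(unescaped)  # code points: 49 is "1", 2 is the control character
nchar(escaped)        # number of characters with a literal backslash
cat(escaped, "\n")    # displays the string as a reader would expect it
```

If the level is meant to contain a literal backslash, writing it as "1\\2" in the mutate() call would be the first thing to try before looking for a bug in the table code.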


tbl_uvregression argument inputs

Decide on inputs to tbl_uvregression. Currently, the method and outcome are being passed as strings, but perhaps non-standard evaluation is more consistent with other functions? Check out the geom_smooth function and match it?

use S3 everywhere

Make every function that takes in a tbl_summary, tbl_regression, or tbl_uvregression object an S3 generic function. In the help file for each of these functions, we'll list every subsequent function that can be called.

update .$gt_calls code

Each {gtsummary} function that assists in creating/modifying tables adds {gt} code to a list called gt_calls. This list contains one command per entry. For example:

> t = 
+   trial %>%
+   tbl_summary()
> t$gt_calls[1:2]
$gt
[1] "gt(data = x$table_body)"

$cols_label_label
[1] "cols_label(label = md('**Characteristic**'))"

Each command is stored as a string, but I think this is not best practice. Each command should probably be stored as a quoted expression using quote() or rlang::expr(). Look into updating this.
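As a rough illustration of the difference, the base R sketch below stores the same call once as a string and once as a quoted expression. The function make_table() and the object x are hypothetical stand-ins for gt() and a {gtsummary} object, not the package's actual internals.

```r
# Hypothetical stand-ins: `x` mimics an object with a table_body element,
# and make_table() mimics a table-building function such as gt().
x <- list(table_body = data.frame(label = c("Age", "Grade")))
make_table <- function(data) nrow(data)

# Current approach: the command is stored as a string and must be parsed.
call_as_string <- "make_table(data = x$table_body)"
result_string <- eval(parse(text = call_as_string))

# Alternative: the command is stored as a quoted expression (a call object).
call_as_expr <- quote(make_table(data = x$table_body))
result_expr <- eval(call_as_expr)

# A call object can be inspected without string manipulation.
function_name <- as.character(call_as_expr[[1]])
argument_names <- names(as.list(call_as_expr))[-1]
```

Both forms evaluate to the same result, but the quoted expression is checked by the parser when it is written and can be examined or modified as a structured object.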

Missing levels across models makes it difficult to determine reference group tbl_merge

I wanted to see what would happen if you ran a model with all levels and then removed one of the levels in a separate model and then used tbl_merge to look at both models. Both the missing level and the reference group get a dashed line which makes it difficult to determine which one is the reference and which is the missing level. Should the missing level just be blank instead of dashed?

library(survival)
library(gtsummary)
library(dplyr)


lung2 <- mutate(lung, stat = ifelse(status==2,0,1),
                ph.ecog = as.character(ph.ecog))

moda <-   glm(stat ~ ph.ecog,family = binomial, data =lung2) %>% 
          tbl_regression()

lung3 <- lung2 %>% 
         filter(ph.ecog != 3) %>% 
         mutate(ph.ecog = as.character(ph.ecog))

modb <-   glm(stat ~ ph.ecog,family = binomial, data =lung3) %>% 
  tbl_regression()

gtsummary::tbl_merge(tbls = list(moda,modb), tab_spanner = c("1", "s"))


tbl_summary - improve dichotomous value passing

The entire dichotomous value portion needs to be revisited.

  • the user should be able to pass a level that will be presented
  • the function makes assumptions about which levels should be presented (which is fine), but sometimes these are not observed in the data set, and the entire variable is silently omitted from the output.
  • Variables that should be assumed to be dichotomous and their value to present
    • logical, TRUE
    • yes, no factors, yes
    • yes, no characters, yes
    • 0/1 numeric, 1
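The rules in the list above could be sketched in base R as follows. The function name guess_dichotomous_value() is hypothetical, and this is only one way to read the rules, not the package's implementation. The key point is that the value to present is decided from the variable's type and allowed values, so it is still returned when that value is never observed.

```r
# Hypothetical helper: returns the value to present for a dichotomous
# variable, or NULL when the variable does not look dichotomous.
guess_dichotomous_value <- function(x) {
  observed <- unique(x[!is.na(x)])

  # logical: present TRUE, even if only FALSE was observed
  if (is.logical(x)) {
    return(TRUE)
  }

  # yes/no factor: decide from the levels, not from the observed values
  if (is.factor(x)) {
    lvls <- levels(x)
    if (setequal(tolower(lvls), c("no", "yes"))) {
      return(lvls[tolower(lvls) == "yes"])
    }
    return(NULL)
  }

  # yes/no character: reuse the observed spelling of "yes" when available
  if (is.character(x) && length(observed) > 0 &&
      all(tolower(observed) %in% c("no", "yes"))) {
    yes_value <- observed[tolower(observed) == "yes"]
    return(if (length(yes_value) > 0) yes_value[1] else "yes")
  }

  # 0/1 numeric: present 1, even if only 0 was observed
  if (is.numeric(x) && length(observed) > 0 && all(observed %in% c(0, 1))) {
    return(1)
  }

  NULL
}

guess_dichotomous_value(c(FALSE, NA))
guess_dichotomous_value(c(0, 0, NA))
guess_dichotomous_value(c(1, 2, 3))
```

A user-supplied value, as requested in the first bullet, would simply take precedence over whatever this helper returns.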

tbl_merge supports binomial glm but not lm models

moda <-   glm(vs ~ cyl,family = binomial, data =mtcars ) %>% 
          tbl_regression()


modb <-   glm(vs ~ cyl, binomial,data = mtcars ) %>% 
           tbl_regression()
#works
gtsummary::tbl_merge(tbls = list(moda,modb), tab_spanner = c("1", "s"))
#can fit lm with tbl_regression
mod1 <-   lm(vs ~ cyl, data = mtcars ) %>% 
  tbl_regression()

mod2 <- gtsummary::tbl_regression(lm(vs ~ cyl , mtcars))
#breaks 
gtsummary::tbl_merge(tbls = list(mod1,mod2), tab_spanner = c("1", "s"))

Primary Function Prefixes

  1. In gt, the fmt_* prefix is used for formatting columns.
  2. We need to change the names of fmt_table1, fmt_regression, and fmt_uni_regression.
  3. One suggestion is tbl_summary, tbl_regression, and tbl_uregression.
  4. I've never liked fmt_uni_regression and don't really like tbl_uregression either.

Formatting Summary Statistics

  1. The fmt_* functions in gt format columns. We need to rename fmt_percent, fmt_pvalue, and fmt_beta. Perhaps we can call them str_percent, etc., as they convert items to strings? But str_ is already a common prefix to operate ON strings, not for making strings.
  2. I don't think it is possible to use fmt_* functions from gt to format the summary statistics. I am not sure what the best method is.
  • One possibility is to keep the code as it is and pass an integer in the digits argument (e.g., digits = list(age = 1)). But this assumes we want every statistic rounded to the same number of places.
  • We can generalize the above option slightly to accept arguments like digits = list(age = c(0, 1)). If the statistics being presented were "{mean} ({sd})", the mean would be rounded to the nearest integer and the SD to one decimal place.
  • Or maybe there is another option that allows for any kind of formatting (scientific, currency, dates, etc.)
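The second option, one digits value per statistic, might look like the base R sketch below. The function format_mean_sd() is hypothetical and hard-codes the "{mean} ({sd})" pattern for brevity; a single digits value is recycled so that the first option remains a special case.

```r
# Hypothetical helper: formats "{mean} ({sd})" with one digits value per
# statistic. A single value is recycled to both statistics.
format_mean_sd <- function(x, digits = c(0, 1)) {
  digits <- rep_len(digits, 2)
  mean_txt <- formatC(mean(x, na.rm = TRUE), digits = digits[1], format = "f")
  sd_txt <- formatC(stats::sd(x, na.rm = TRUE), digits = digits[2], format = "f")
  paste0(mean_txt, " (", sd_txt, ")")
}

# Illustrative values only
age <- c(31, 42, 58, 49, NA)

per_statistic <- format_mean_sd(age, digits = c(0, 1)) # mean to 0 places, SD to 1
same_for_all <- format_mean_sd(age, digits = 1)        # both to 1 place
```

This does not address the third option (scientific, currency, or date formatting), which would need a formatting function per statistic rather than a number of digits.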

tbl_merge - merging regression models with unmatched variables

The function currently allows you to merge two regression models that don't have the same model covariates. The results look good, but not perfect.

library(survival)
t1 <-
  glm(response ~ marker + grade + age, trial, family = binomial) %>%
  tbl_regression(
    label = list(trt = "Treatment", grade = "Grade", age = "Age"),
    exponentiate = TRUE
  )
t2 <-
  coxph(Surv(ttdeath, death) ~ stage + age, trial) %>%
  tbl_regression(
    label = list(trt = "Treatment", grade = "Grade", age = "Age"),
    exponentiate = TRUE
  )
t3 <-
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("Tumor Response", "Time to Death")
  )
t3


The dashes being added are due to the code that adds "---" for the reference groups of categorical variables. This means that "---" is being differentially added for missing variables depending on whether they are categorical.

I need to update the missing gt code to only add "---" if the variable is present in a particular model. All other blanks can merely be "".

To do/To think about

What needs to get done for the initial conversion from biostatR to gtsummary.

Priority

  • What prefix should we use for our primary functions? #2
    • I like tbl_. I think it can work.
  • What should we name our primary functions?
    • fmt_table1 is tbl_summary
    • fmt_regression is tbl_regression
    • fmt_uni_regression is tbl_uvregression
  • What prefix should we use for our numeric formatting functions? #3
    • fmt_pvalue is style_pvalue
    • fmt_percent is style_percent
    • fmt_beta is style_sigfig
  • Convert fmt_table1 to tbl_summary utilizing gt
  • Convert fmt_regression to tbl_regression utilizing gt
  • Convert fmt_uni_regression to tbl_uvregression utilizing gt

New Functions

  • Convert inline_text to work with updated function outputs. Need to discuss and decide on best input options.
    • Keep input as they were in biostatR, cell = "<var name>:<var level (if applicable)>:<by level (if applicable)>".
    • Change to variable = "<varname>", level = <var level (if applicable)>, by = <by level (if applicable)>. I think this is more transparent for the users compared to the first option implemented in biostatR. The variable and by levels are stored as strings in the tables. A user may pass level = 2, and in the background, we'd be looking for the level that equals "2". I don't think it should be an issue, however.
      • These constructs don't allow the user to select the p-value, q-value, or n (or anything else that comes later). Perhaps we could change by = to column = . The user could pass the name of the column they want to select (e.g. column = "stat_1" or column = "pvalue"). We can create a helper function that converts by variable levels to the appropriate column name, and users could pass something like column = level("Drug"), which would be equivalent to column = "stat_1".
  • Convert bold_labels, bold_levels, italicize_labels, and italicize_levels to gt compatible output
  • Convert bold_p to gt compatible output
  • Convert add_global to gt compatible output
  • Convert add_q to gt compatible output
  • Convert add_comparison to gt compatible output
  • Convert add_overall to gt compatible output
  • Convert add_n to gt compatible output
  • Write as_gt function to convert output into gt objects
  • Write S3 print and knit_print method functions

Function Updates

  • Convert tbl_summary vignette
    • Largely complete. Needs editing and the inline_text() section when that function is ready
  • Convert tbl_regression vignette
    • Rough draft complete. Model outline after the tbl_summary() vignette
  • Add message to knit_print functions. If the output type is word, inform users the correct output type is rtf_document
  • Decide on inputs to tbl_uvregression. Currently, the method and outcome are being passed as strings, but perhaps non-standard evaluation is more consistent with other functions? Check out the geom_smooth function and match it?
  • Update the cols_label_summary function that modifies headers for tbl_summary
    • The function still needs some work. I would like users to be able to pass literal strings (e.g. stat_overall = "Overall (N = {N})") as well as interpreted strings, like stat_overall = md("**Overall** (N = {N})")
    • Look into adding line breaks in the column label and test for all output types. Perhaps we can make a line break function that adds appropriate line breaks depending on output type.
  • Update add_global.tbl_uvregression to pass the dots to car::Anova
  • Re-write all package functions and utility functions not using deprecated tidyverse functions
  • Load gt package when gtsummary is loaded. I added gt to Depends in DESCRIPTION and it does indeed load. But the messages are not as cute as when tidyverse loads packages. Perhaps update?
  • Add smart default headers for tbl_regression, e.g. display "Odds Ratio" in column header, when logistic regression model is passed
  • Add default footnote to add_comparison indicating tests performed.
  • Add default footnote to tbl_summary denoting which statistics are being displayed
  • Move all gt formatting calls to their own file. Save the calls for each function in separate lists. Or move the list to the bottom of the function in the same file
    • Only for the tbl_* functions, the gt_calls were moved to below the function. The other helper functions' reference to gt code wasn't as intrusive
  • Add gt code to include --- for reference groups in tbl_regression
  • Add gt code to include --- for reference groups in tbl_uvregression
  • Include default headers for tbl_summary

tbl_survival level_label not working appropriately?

library(survival)
library(gtsummary)
library(dplyr)
lung2 <- mutate(lung, s2 = ifelse(sex == 1, "1\2", "2/3"))

tbl_survival(lung2, Surv(time,status)~sex,times = c(500,550)
             ,level_label = "male female"
             )

How is this supposed to work? When you try to add your own level labels, the subheadings are removed.


consistent input for tbl_summary modifications

Some arguments for tbl_summary() take named lists where each name is a variable in the dataset, and others take named lists where the names are continuous or categorical. It would be preferable if these naming conventions were consistent. Each argument can take a named list of variable names, and ..continuous.. and ..categorical.. will be reserved names that apply the modification to all variables of that type.
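A minimal base R sketch of how such a list could be resolved is shown below. The helper resolve_by_type() and the var_types vector are hypothetical; the sketch assumes that a variable-specific entry should override a type-wide entry, which the issue does not state.

```r
# Hypothetical helper: expands reserved type names into per-variable
# settings, then lets variable-specific entries override them.
resolve_by_type <- function(spec, var_types) {
  reserved <- c("..continuous..", "..categorical..")
  resolved <- list()

  # Apply type-wide entries to every variable of that type.
  for (type_name in intersect(names(spec), reserved)) {
    type <- gsub("..", "", type_name, fixed = TRUE)
    for (variable in names(var_types)[var_types == type]) {
      resolved[[variable]] <- spec[[type_name]]
    }
  }

  # Variable-specific entries take precedence (an assumption).
  for (variable in setdiff(names(spec), reserved)) {
    resolved[[variable]] <- spec[[variable]]
  }

  resolved
}

var_types <- c(age = "continuous", marker = "continuous", grade = "categorical")
statistic <- list(
  "..continuous.." = "{mean} ({sd})",
  age = "{median} ({p25}, {p75})"
)

resolved <- resolve_by_type(statistic, var_types)
```

Here marker picks up the type-wide setting, age keeps its own, and grade is left untouched because no categorical entry was supplied.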

line break function

I would like to write a function that knows the environment it's being run in and creates the appropriate line break. I want to use this in the column labels, so that N=200 is on a second line.
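A minimal sketch of the idea, in base R, is below. The function line_break() is hypothetical, and the output format is passed explicitly rather than detected; in a knitr document, detection could plausibly be built on knitr::is_html_output() and knitr::is_latex_output(), but that part is left out here.

```r
# Hypothetical helper: returns the line-break markup for an output format.
# The format is an explicit argument here; automatic detection is not shown.
line_break <- function(output = c("html", "latex", "text")) {
  output <- match.arg(output)
  switch(output,
    html = "<br>",
    latex = "\\\\", # two backslashes, the LaTeX line break
    text = "\n"
  )
}

# Put the N on a second line of a column label.
header_html <- paste0("Drug", line_break("html"), "N = 200")
header_text <- paste0("Drug", line_break("text"), "N = 200")
```

Whether each markup is honored inside a column label would still need to be tested for every output type.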

{gt} changing name of cell_styles() function

The {gt} package is renaming one of the functions used in {gtsummary}, from cell_styles() to cell_text() (rstudio/gt#258).

When this update goes through, we'll need to update our code. As of this moment, the cell_styles() function is used in:

  • the functions that bold/italicize labels and levels
  • tab_style_bold_p()
  • tbl_merge()
  • tbl_regression()
  • tbl_summary()
  • tbl_uvregression()

add_comparison - change re option to lme4?

Consider changing the re test in add_comparison to lme4. Perhaps this will signal to the users exactly how the p-value is calculated, much like how t.test signals the t.test() function was used.

inline_text.tbl_summary - improve passing column names

At the moment, the function accepts the literal column names from .$table_body. Secondarily, the function assesses whether the input matches one of the by variable levels and converts that to the appropriate column name.

I think there is probably a better way to pass these arguments.

Pass 'data' to tbl_regression for the column labels

Some regression model functions strip the labels when model.frame(x) is created (e.g. coxph does this for categorical variables, but not continuous variables). A simple fix to getting the labels would be to pass the data as an optional argument. If the label is stripped by model.frame() the function can secondarily look for the label in data.

I think the update will be as easy as changing

var_label = map_chr(
  .data$variable, ~ 
    label[[.x]] %||% 
    attr(stats::model.frame(x)[[.x]], "label") %||%  
    .x
)

to

var_label = map_chr(
  .data$variable, ~ 
    label[[.x]] %||% 
    attr(stats::model.frame(x)[[.x]], "label") %||%  
    attr(data[[.x]], "label") %||%
    .x
)

where data is NULL by default.
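The fallback chain can be tried out in isolation with the base R sketch below. The helper get_label() is hypothetical, `%||%` is defined locally so that the example does not depend on {rlang} or {purrr}, and a plain data frame with its label removed stands in for a model frame that has lost the attribute.

```r
# Null-default operator, defined locally for a self-contained example.
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

# Hypothetical helper mirroring the proposed lookup order:
# user-supplied label, then model frame, then optional data, then the name.
get_label <- function(variable, label = NULL, model_frame, data = NULL) {
  label[[variable]] %||%
    attr(model_frame[[variable]], "label") %||%
    attr(data[[variable]], "label") %||%
    variable
}

# Original data with a label attribute on trt.
data <- data.frame(trt = c("A", "B"), age = c(50, 60))
attr(data$trt, "label") <- "Treatment"

# Stand-in for a model frame whose label attribute was stripped.
stripped <- data
attr(stripped$trt, "label") <- NULL

without_data <- get_label("trt", model_frame = stripped)
with_data <- get_label("trt", model_frame = stripped, data = data)
user_label <- get_label("trt", label = list(trt = "Arm"),
                        model_frame = stripped, data = data)
```

Because data defaults to NULL, attr(data[[.x]], "label") evaluates to NULL and the chain falls through to the variable name, so existing calls would behave as before.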

Deprecated functions and tidy evaluation

Many deprecated functions were used to create the package. Re-write with the non-deprecated versions using tidy evaluation.

Completed conversions:

  • mutate_()
  • filter_()
  • group_by_()
  • arrange_()
  • count_()
  • nest_()
  • unnest_()
  • spread_()
  • gather_()
  • complete_()

tidy evaluation

Re-write all package functions and utility functions not using deprecated tidyverse functions

left justify label header in tbl_summary

The left column header becomes centered (rather than left justified) after adding a spanning header.

I think this may happen in {gt} rather than something I did in {gtsummary}?

library(gtsummary)
#> Loading required package: gt
trial %>%
  dplyr::select(grade, age, response, marker) %>%
  tbl_summary(by = "grade") %>%
  as_gt() %>%
  tab_spanner(label = "Tumor Grade", columns = starts_with("stat_"))


add smart headers for lme4 objects

Add code to have headers like "OR" and "IRR" for logistic and Poisson regression models for random effects models, as we do for GLM models.

library(gtsummary)
library(lme4)
mod_glmer <-
  glmer(am ~ hp + (1 | gear), mtcars, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)


mod_glm <-
  glm(am ~ hp, mtcars, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)


include figures in {gtsummary} help files

The gt output tables don't display in the online help files.

Look at the gt package help pages to see how they display their examples. It looks like they are displaying image files.
Perhaps one solution is to add all the code to create the tables in the data-raw folder, and use code to save images of the tables. The help files will then import the images.

{gt} tables in help files (cont.)

@karissawhiting putting the comments from #83 in this issue to keep track.

  1. We should shorten the tbl_summary table examples (and perhaps others). Also, two examples should be sufficient (rather than the three we have). With fewer examples and shorter tables, the space between example code and the rest of the help file will be shorter.
  2. Rather than call the section "Figures", perhaps we should call it "Example Output"? That may help people understand more easily that these are the output tables.
  3. Is there a way to make a separation between two example tables?

What do you think?

tbl_survival results in wide format

By default the tbl_survival results are printed in long format (one line per stratum per time point). We want to add functionality for users to have the table print in wide format. Results can be made in two ways:

  1. With each stratum variable as a column, or
  2. With each time point as a column

We need to decide whether to incorporate these as options in tbl_survival or by making the modification via a new piped function. I am leaning towards making them options in the function.

We will need to return a long tibble in the results (default printing format) regardless of the chosen output format, so that the inline_text() function can identify the correct statistics to return.

Whatever we decide, it should be compatible with the yet-to-be-completed function tbl_survival.cuminc (#64) AND with the survival quantile updates to tbl_survival.survfit (#61).
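To make the two wide layouts concrete, the base R sketch below reshapes a small long table both ways while keeping the long table for lookups. The column names and the estimates are made-up placeholders, not output from tbl_survival(), and stats::reshape() is used only to keep the example dependency-free.

```r
# Long format: one row per stratum per time point.
# The estimates are made-up placeholder numbers, not real results.
long <- data.frame(
  strata = c("Male", "Male", "Female", "Female"),
  time = c(500, 550, 500, 550),
  estimate = c(0.20, 0.15, 0.35, 0.30),
  stringsAsFactors = FALSE
)

# Layout 1: each stratum as a column, one row per time point.
wide_by_strata <- stats::reshape(
  long, idvar = "time", timevar = "strata", direction = "wide"
)

# Layout 2: each time point as a column, one row per stratum.
wide_by_time <- stats::reshape(
  long, idvar = "strata", timevar = "time", direction = "wide"
)

# The long table is kept, so a single statistic can still be looked up
# no matter which layout is printed.
male_500 <- long$estimate[long$strata == "Male" & long$time == 500]
```

Keeping the long table as the stored result and treating the wide layouts as print-time views is one way to satisfy the inline_text() requirement above.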

attach gt package

I added gt to Depends in DESCRIPTION and it does indeed load. But the messages are not as cute as when tidyverse loads packages. Perhaps update?
