ddsjoberg / gtsummary

Presentation-Ready Data Summary and Analytic Result Tables

Home Page: http://www.danieldsjoberg.com/gtsummary

License: Other

r summary-tables statistics rstats r-package gt html5 easy-to-use summary-statistics reproducible-research

gtsummary's Introduction

gtsummary

The {gtsummary} package provides an elegant and flexible way to create publication-ready analytical and summary tables using the R programming language. The {gtsummary} package summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.

Summarize data frames or tibbles easily in R. Perfect for presenting descriptive statistics, comparing group demographics (e.g., creating a Table 1 for medical journals), and more. Automatically detects continuous, categorical, and dichotomous variables in your data set, calculates appropriate descriptive statistics, and reports the amount of missingness in each variable.
Summarize regression models in R and include reference rows for categorical variables. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically identified and the tables are pre-filled with appropriate column headers (i.e. Odds Ratio and Hazard Ratio).
Customize gtsummary tables using a growing list of formatting/styling functions. Bold labels, italicize levels, add p-value to summary tables, style the statistics however you choose, merge or stack tables to present results side by side… there are so many possibilities to create the table of your dreams!
Report statistics inline from summary tables and regression summary tables in R markdown. Make your reports completely reproducible!

By leveraging {broom}, {gt}, and {labelled} packages, {gtsummary} creates beautifully formatted, ready-to-share summary and result tables in a single line of R code!

Check out the examples below, review the vignettes for a detailed exploration of the output options, and view the gallery for various customization examples.

Installation

The {gtsummary} package was written as a companion to the {gt} package from RStudio. You can install {gtsummary} with the following code.

install.packages("gtsummary")

Install the development version of {gtsummary} with:

remotes::install_github("ddsjoberg/gtsummary")

Examples

Summary Table

Use tbl_summary() to summarize a data frame.

Example basic table:

library(gtsummary)

# summarize the data with our package
table1 <- 
  trial %>%
  tbl_summary(include = c(age, grade, response))

There are many customization options to add information (like comparing groups) and format results (like bold labels) in your table. See the tbl_summary() tutorial for many more options, or below for one example.

table2 <-
  tbl_summary(
    trial,
    include = c(age, grade, response),
    by = trt, # split table by group
    missing = "no" # don't list missing data separately
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  modify_header(label = "**Variable**") %>% # update the column header
  bold_labels()

Regression Models

Use tbl_regression() to easily and beautifully display regression model results in a table. See the tutorial for customization options.

mod1 <- glm(response ~ trt + age + grade, trial, family = binomial)

t1 <- tbl_regression(mod1, exponentiate = TRUE)

Side-by-side Regression Models

You can also present side-by-side regression model results using tbl_merge()

library(survival)

# build survival model table
t2 <-
  coxph(Surv(ttdeath, death) ~ trt + grade + age, trial) %>%
  tbl_regression(exponentiate = TRUE)

# merge tables
tbl_merge_ex1 <-
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("**Tumor Response**", "**Time to Death**")
  )

Review even more output options in the table gallery.

gtsummary + R Markdown

The {gtsummary} package was written to be a companion to the {gt} package from RStudio. But not all output types are supported by the {gt} package. Therefore, we have made it possible to print {gtsummary} tables with various engines.

Review the gtsummary + R Markdown vignette for details.

Save Individual Tables

{gtsummary} tables can also be saved directly to file as an image, HTML, Word, RTF, or LaTeX file.

tbl %>%
  as_gt() %>%
  gt::gtsave(filename = ".") # use extensions .png, .html, .docx, .rtf, .tex, .ltx
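
As a concrete illustration of the snippet above, the following sketch saves a small table as HTML. The object name `tbl`, the temporary output path, and the choice of the `.html` extension are assumptions made for the example; it also assumes that both {gtsummary} and {gt} are installed.

```r
library(gtsummary)

# Build a small summary table from the `trial` data shipped with {gtsummary}
tbl <- tbl_summary(trial, include = c(age, grade))

# Illustrative output location: a file in the session's temporary directory
out_file <- file.path(tempdir(), "table1.html")

# Convert to a {gt} object, then let the file extension pick the output type
tbl %>%
  as_gt() %>%
  gt::gtsave(filename = out_file)

# TRUE if the file was written
file.exists(out_file)
```

Swapping the extension for one of the others listed above should select a different output type; some of them may need additional software installed.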

Additional Resources

The best resources are the gtsummary vignettes: table gallery, tbl_summary() tutorial, tbl_regression() tutorial, inline_text() tutorial, gtsummary themes, gtsummary+R markdown.
The R Journal Article Reproducible Summary Tables with the gtsummary Package.
The RStudio Education Blog includes a post with a brief introduction to the package.
A recording of a presentation given to the Weill Cornell Biostatistics Department and the Memorial Sloan Kettering R Users Group: https://www.youtube.com/embed/tANo9E1SYJE

Cite gtsummary

> citation("gtsummary")

To cite gtsummary in publications use:

  Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible summary tables with the gtsummary package.
  The R Journal 2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.

A BibTeX entry for LaTeX users is

  @Article{gtsummary,
    author = {Daniel D. Sjoberg and Karissa Whiting and Michael Curry and Jessica A. Lavery and Joseph Larmarange},
    title = {Reproducible Summary Tables with the gtsummary Package},
    journal = {{The R Journal}},
    year = {2021},
    url = {https://doi.org/10.32614/RJ-2021-053},
    doi = {10.32614/RJ-2021-053},
    volume = {13},
    issue = {1},
    pages = {570-580},
  }

Contributing

Big thank you to @jeffreybears for the hex sticker!

Please note that the {gtsummary} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Thank you to all contributors!

gtsummary's Issues

tbl_survival - provide surv quantile and get time back

Should you be able to provide a probability estimate and get a time point back? For example, provide median survival and get the time at which it occurred?

update inline_text.tbl_survival help file to reference all columns available to print via {}

tbl_regression - interactions don't print properly

function to merge results from two multivariable model results with same predictors

Previous note from Emily: For example, model results for OS and PFS could be placed side by side

tbl_summary() - add option to sort levels by frequency

Complete the following:

Add checks for the input
Add sorting type to .$meta_analysis
Sort the levels by requested type
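
The three steps above could look roughly like the base R sketch below. The function name `sort_levels()` and its `sort_type` argument are hypothetical and only illustrate the idea; they are not the package's implementation.

```r
# Hypothetical helper: return the levels of `x` in the requested order
sort_levels <- function(x, sort_type = c("alphanumeric", "frequency")) {
  # Check the input: only the two listed sorting types are accepted
  sort_type <- match.arg(sort_type)

  # Count each observed level (missing values are dropped by table())
  counts <- table(x)

  # Sort the levels by the requested type
  if (sort_type == "frequency") {
    counts <- sort(counts, decreasing = TRUE)
  }
  names(counts)
}

grade <- c("II", "I", "III", "II", "II", "III", NA)
levels_by_frequency <- sort_levels(grade, sort_type = "frequency")
levels_alphanumeric <- sort_levels(grade)
```

Here `match.arg()` doubles as the input check: an unsupported sorting type raises an error rather than being silently ignored.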

tbl_regression does not support glm when family = gaussian

library(gtsummary)

mod1 <- glm(vs ~ cyl, family = binomial, data = mtcars) %>%
  tbl_regression()

mod2 <- glm(mpg ~ cyl, family = gaussian, data = mtcars) %>%
  tbl_regression()

The first model runs fine; the second does not.

tbl_merge - support merging tbl_summary object with regression

Look into supporting this feature.

tbl_survival does not handle escape characters in variable levels

library(survival)
library(gtsummary)
library(dplyr)
lung2 <- mutate(lung, s2 = ifelse(sex == 1, "1\2", "2/3"))
tbl_survival(lung2, Surv(time,status)~s2, times = c(500,540))

If you look at the output you will see that the function does not print "1\2" but prints 1 and a special character.

knit_print word_document warning

Add a message to the knit_print functions. If the output type is word, inform users that the correct output type is rtf_document.

tbl_uvregression argument inputs

Decide on inputs to tbl_uvregression. Currently, the method and outcome are being passed as strings, but perhaps non-standard evaluation is more consistent with other functions? Check out the geom_smooth function and match it?

use S3 everywhere

Make every function that takes in a tbl_summary, tbl_regression, or tbl_uvregression object an S3 generic function. In the help file for each of these functions, we'll list every subsequent function that can be called.
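
A minimal sketch of the S3 pattern described here, in base R. The generic `bold_labels_sketch()` and the placeholder object are hypothetical stand-ins, not the package's actual functions or object structure.

```r
# Hypothetical generic: dispatches on the class of `x`
bold_labels_sketch <- function(x, ...) {
  UseMethod("bold_labels_sketch")
}

# Method for tbl_summary objects: record that labels should be bold
bold_labels_sketch.tbl_summary <- function(x, ...) {
  x$bold_labels <- TRUE
  x
}

# Method for tbl_regression objects: same behavior in this sketch
bold_labels_sketch.tbl_regression <- function(x, ...) {
  x$bold_labels <- TRUE
  x
}

# Fallback: give a clear error for unsupported inputs
bold_labels_sketch.default <- function(x, ...) {
  stop("`x` must be a tbl_summary or tbl_regression object.", call. = FALSE)
}

# Placeholder object carrying only the class attribute needed for dispatch
fake_tbl <- structure(list(table_body = data.frame()), class = "tbl_summary")
result <- bold_labels_sketch(fake_tbl)
```

With a default method in place, passing an unsupported object fails with an informative message instead of a generic dispatch error.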

update .$gt_calls code

Each {gtsummary} function that assists in creating/modifying tables adds {gt} code to a list called gt_calls. This list contains one command per entry. For example:

> t = 
+   trial %>%
+   tbl_summary()
> t$gt_calls[1:2]
$gt
[1] "gt(data = x$table_body)"

$cols_label_label
[1] "cols_label(label = md('**Characteristic**'))"

Each command is stored as a string, but I think this is not best practice. Each command should probably be stored as a quoted expression using quote() or rlang::expr(). Look into updating this.
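
To make the trade-off concrete, here is a small base R comparison of the two storage styles. The call to `mean()` and the variable `values` are stand-ins chosen for the example; they are not the package's {gt} calls.

```r
values <- c(4, 8, 15)

# Stored as a string: must be parsed before it can be evaluated
call_as_string <- "mean(values)"
result_from_string <- eval(parse(text = call_as_string))

# Stored as a quoted expression: can be evaluated directly
call_as_expr <- quote(mean(values))
result_from_expr <- eval(call_as_expr)

# A quoted call can also be modified without any string manipulation
modified_call <- call_as_expr
modified_call$na.rm <- TRUE
modified_text <- deparse(modified_call)
```

Both forms evaluate to the same result; the quoted form avoids re-parsing and lets arguments be inspected or changed programmatically.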

Missing levels across models makes it difficult to determine reference group tbl_merge

I wanted to see what would happen if you ran a model with all levels and then removed one of the levels in a separate model and then used tbl_merge to look at both models. Both the missing level and the reference group get a dashed line which makes it difficult to determine which one is the reference and which is the missing level. Should the missing level just be blank instead of dashed?

library(survival)
library(gtsummary)
library(dplyr)


lung2 <- mutate(lung, stat = ifelse(status==2,0,1),
                ph.ecog = as.character(ph.ecog))

moda <-   glm(stat ~ ph.ecog,family = binomial, data =lung2) %>% 
          tbl_regression()

lung3 <- lung2 %>% 
         filter(ph.ecog != 3) %>% 
         mutate(ph.ecog = as.character(ph.ecog))

modb <-   glm(stat ~ ph.ecog,family = binomial, data =lung3) %>% 
  tbl_regression()

gtsummary::tbl_merge(tbls = list(moda,modb), tab_spanner = c("1", "s"))

tbl_summary - improve dichotomous value passing

The entire dichotomous value portion needs to be revisited.

the user should be able to pass a level that will be presented
the function makes assumptions about which levels should be presented (which is fine), but sometimes these are not observed in the data set, and the entire variable is silently omitted from the output.
Variables that should be assumed to be dichotomous and their value to present
- logical, TRUE
- yes, no factors, yes
- yes, no characters, yes
- 0/1 numeric, 1
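
The defaults listed above could be sketched in base R as follows. `guess_dichotomous_value()` is a hypothetical helper written only to illustrate the rules, including a warning in place of the silent omission described above.

```r
# Hypothetical helper: return the value to present for a dichotomous variable,
# or NULL when the variable should not be treated as dichotomous
guess_dichotomous_value <- function(x) {
  # logical -> TRUE
  if (is.logical(x)) return(TRUE)

  # 0/1 numeric -> 1
  if (is.numeric(x)) {
    observed <- unique(x[!is.na(x)])
    if (length(observed) > 0 && all(observed %in% c(0, 1))) return(1)
    return(NULL)
  }

  # yes/no factors and characters -> yes (keeping the original capitalization)
  if (is.factor(x) || is.character(x)) {
    candidates <- if (is.factor(x)) levels(x) else unique(x[!is.na(x)])
    if (length(candidates) > 0 && all(tolower(candidates) %in% c("yes", "no"))) {
      yes <- candidates[tolower(candidates) == "yes"]
      if (length(yes) == 1) return(yes)
      # Looks like yes/no but "yes" is absent: warn rather than drop silently
      warning("Variable looks like yes/no, but no 'yes' value was found.")
    }
  }
  NULL
}

value_logical <- guess_dichotomous_value(c(TRUE, FALSE, NA))
value_numeric <- guess_dichotomous_value(c(0, 1, 1, NA))
value_factor  <- guess_dichotomous_value(factor(c("No", "Yes", "No")))
value_other   <- guess_dichotomous_value(c("low", "high"))
```

A user-supplied level, as proposed in the first point, would simply take precedence over whatever this kind of helper guesses.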

tbl_merge supports binomial glm but not lm models

library(gtsummary)

moda <- glm(vs ~ cyl, family = binomial, data = mtcars) %>%
  tbl_regression()

modb <- glm(vs ~ cyl, binomial, data = mtcars) %>%
  tbl_regression()

# works
gtsummary::tbl_merge(tbls = list(moda, modb), tab_spanner = c("1", "s"))

# lm models can be passed to tbl_regression
mod1 <- lm(vs ~ cyl, data = mtcars) %>%
  tbl_regression()

mod2 <- gtsummary::tbl_regression(lm(vs ~ cyl, mtcars))

# breaks
gtsummary::tbl_merge(tbls = list(mod1, mod2), tab_spanner = c("1", "s"))

Primary Function Prefixes

In gt, the fmt_* prefix is used for formatting columns.
We need to change the names of fmt_table1, fmt_regression, and fmt_uni_regression.
One suggestion is tbl_summary, tbl_regression, and tbl_uregression.
I've never liked fmt_uni_regression and don't really like tbl_uregression either.

Formatting Summary Statistics

The fmt_* functions in gt format columns. We need to rename fmt_percent, fmt_pvalue, and fmt_beta. Perhaps we can call them str_percent, etc., as they convert items to strings? But str_ is already a common prefix to operate ON strings, not for making strings.
I don't think it is possible to use fmt_* functions from gt to format the summary statistics. I am not sure what the best method is.

One possibility is to keep the code as it is and pass an integer in the digits argument (i.e. digits = list(age = 1)). But this assumes we want every statistic rounded to the same number of places.
We can generalize the above option slightly to accept arguments like digits = list(age = c(0, 1)). If the statistics being presented were "{mean} ({sd})", the mean would be rounded to the nearest integer and the SD to one decimal place.
Or maybe there is another option that allows for any kind of formatting (scientific, currency, dates, etc.)
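
The second option can be illustrated with a short base R sketch. `format_mean_sd()` is a hypothetical helper that hard-codes the "{mean} ({sd})" pattern; it only shows how a vector such as c(0, 1) could map to one rounding value per statistic.

```r
# Hypothetical helper: round each statistic to its own number of decimal places
format_mean_sd <- function(x, digits = c(0, 1)) {
  stats <- c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))

  # Pair each statistic with its own digits value, in order
  formatted <- mapply(
    function(value, d) formatC(value, digits = d, format = "f"),
    stats, digits
  )

  # Assemble the "{mean} ({sd})" display string
  paste0(formatted[["mean"]], " (", formatted[["sd"]], ")")
}

age <- c(10, 20, 30, 45)
age_summary <- format_mean_sd(age, digits = c(0, 1))
```

With these inputs the mean (26.25) is shown as an integer and the SD (about 14.93) to one decimal place, giving "26 (14.9)".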

tbl_regression - sample size incorrect with time-dependent covariates

Imported biostatR issue

If you use a long dataset for analysis of time-dependent covariates and pass the model results to fmt_regression, the sample size is the number of rows in the dataset rather than the number of unique observations in the data.
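
The discrepancy is easy to reproduce with a toy long-format data set. The data frame and its column names (`id`, `tstart`, `tstop`, `event`) are made up for illustration.

```r
# Toy long-format data: subjects 1 and 3 contribute several rows each
long_data <- data.frame(
  id     = c(1, 1, 2, 3, 3, 3),
  tstart = c(0, 5, 0, 0, 4, 9),
  tstop  = c(5, 12, 8, 4, 9, 15),
  event  = c(0, 1, 0, 0, 0, 1)
)

# What is reported when counting rows
n_rows <- nrow(long_data)

# What should be reported: the number of unique subjects
n_subjects <- length(unique(long_data$id))
```

Here the row count is 6 while only 3 subjects are present, which is the gap described above.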

cols_label_summary force evaluation of arguments to make strings

I would like users to be able to pass literal strings (e.g. stat_overall = "Overall (N = {N})") as well as interpreted strings, like stat_overall = md("Overall (N = {N})")

tbl_merge - merging regression models with unmatched variables

The function currently allows you to merge two regression models that don't have the same model covariates. The results look good, but not perfect.

library(survival)
t1 <-
  glm(response ~ marker + grade + age, trial, family = binomial) %>%
  tbl_regression(
    label = list(trt = "Treatment", grade = "Grade", age = "Age"),
    exponentiate = TRUE
  )
t2 <-
  coxph(Surv(ttdeath, death) ~ stage + age, trial) %>%
  tbl_regression(
    label = list(trt = "Treatment", grade = "Grade", age = "Age"),
    exponentiate = TRUE
  )
t3 <-
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("Tumor Response", "Time to Death")
  )
t3

The dashes being added are due to the code that adds "---" for the reference groups of categorical variables. This means that "---" is being differentially added for missing variables depending on whether they are categorical.

I need to update the missing gt code to only add "---" if the variable is present in a particular model. All other blanks can merely be "".

To do/To think about

What needs to get done for the initial conversion from biostatR to gtsummary.

Priority

What prefix should we use for our primary functions? #2
- I like tbl_. I think it can work.
What should we name our primary functions?
- fmt_table1 is tbl_summary
- fmt_regression is tbl_regression
- fmt_uni_regression is tbl_uvregression
What prefix should we use for our numeric formatting functions? #3
- fmt_pvalue is style_pvalue
- fmt_percent is style_percent
- fmt_beta is style_sigfig
Convert fmt_table1 to tbl_summary utilizing gt
Convert fmt_regression to tbl_regression utilizing gt
Convert fmt_uni_regression to tbl_uvregression utilizing gt

New Functions

Function Updates

tbl_summary vignette

complete tbl_summary vignette

Create table of Kaplan-Meier estimates

Previous note from Emily: x-year estimates and confidence intervals from Kaplan-Meier, either overall or by a predictor

tbl_survival to handle cuminc from cmprsk package

This will allow people to get cumulative incidence for competing risks for each event.

tbl_survival level_label not working appropriately?

library(survival)
library(gtsummary)
library(dplyr)
lung2 <- mutate(lung, s2 = ifelse(sex == 1, "1\2", "2/3"))

tbl_survival(lung2, Surv(time,status)~sex,times = c(500,550)
             ,level_label = "male female"
             )

How is this supposed to work? When you try to add your own level labels, it removes the subheadings.

consistent input for tbl_summary modifications

Some arguments for tbl_summary() take named lists where each name is a variable in the dataset, and others take named lists where the names are continuous or categorical. It would be preferable if these naming conventions were consistent. Each argument can take a named list of variable names, and ..continuous.. and ..categorical.. will be reserved names that apply the modification to all variables of that type.

line break function

I would like to write a function that knows the environment it's being run in and creates the appropriate line break. I want to use this in the column labels, so the N=200 is on a second line.
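
One possible shape for such a function is sketched below. The names `detect_output_format()` and `line_break()` are hypothetical, the choice of `<br>` and `\\` as break markers is an assumption, and the detection relies on {knitr}'s is_html_output() and is_latex_output() when {knitr} is installed.

```r
# Hypothetical helper: guess the output format of the current session
detect_output_format <- function() {
  if (!requireNamespace("knitr", quietly = TRUE)) return("text")
  if (knitr::is_html_output()) return("html")
  if (knitr::is_latex_output()) return("latex")
  "text"
}

# Hypothetical helper: return a line break suited to the output format.
# `format` can be supplied directly, which bypasses the detection step.
line_break <- function(format = NULL) {
  if (is.null(format)) format <- detect_output_format()
  switch(format,
    html  = "<br>",
    latex = "\\\\",  # two backslashes in the resulting string
    "\n"             # plain-text fallback
  )
}

# Example: a two-line column header for HTML output
header_html <- paste0("Drug A", line_break("html"), "N = 200")
```

Allowing the format to be passed explicitly keeps the function usable outside of a knitting session.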

Modify missing text in tbl_summary

When missing values are printed in tbl_summary objects, the text display is "Unknown".

@karissawhiting brought up some time ago the possibility of changing the default text.

Option added in 9c730f0

change style_coef function to style_ratio

this is a more descriptive name for what the function does

Add methods for coxph and survreg to tbl_survival

{gt} changing name of cell_styles() function

The {gt} package is renaming one of the functions used in {gtsummary}. (from cell_styles() to cell_text()) rstudio/gt#258

When this update goes through, we'll need to update our code. As of this moment, the cell_styles() function is used in:
The bolding/italicizing of labels/levels functions,
tab_style_bold_p(),
tbl_merge(),
tbl_regression(),
tbl_summary(),
tbl_uvregression()

write function to add n.event to Cox regression $table_body output

add_comparison - change re option to lme4?

Consider changing the re test in add_comparison to lme4. Perhaps this will signal to the users exactly how the p-value is calculated, much like how t.test signals the t.test() function was used.

inline_text.tbl_summary - improve passing column names

At the moment, the function accepts the literal column names from .$table_body. Secondarily, the function assesses whether the input matches one of the by variable levels and converts that to the appropriate column name.

I think there is probably a better way to pass these arguments.

Pass 'data' to tbl_regression for the column labels

Some regression model functions strip the labels when model.frame(x) is created (e.g. coxph does this for categorical variables, but not continuous variables). A simple fix to getting the labels would be to pass the data as an optional argument. If the label is stripped by model.frame() the function can secondarily look for the label in data.

I think the update will be as easy as changing the first block below to the second:

var_label = map_chr(
  .data$variable, ~ 
    label[[.x]] %||% 
    attr(stats::model.frame(x)[[.x]], "label") %||%  
    .x
)

var_label = map_chr(
  .data$variable, ~ 
    label[[.x]] %||% 
    attr(stats::model.frame(x)[[.x]], "label") %||%  
    attr(data[[.x]], "label") %||%
    .x
)

where data is NULL by default
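
The fallback order can be tried out in isolation with base R. `get_var_label()` is a hypothetical wrapper around the same chain of lookups, and `%||%` is defined locally so the sketch does not depend on any package.

```r
# Null-coalescing operator: use `rhs` only when `lhs` is NULL
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

# Hypothetical wrapper: user label, then model frame label, then data label,
# then the variable name itself
get_var_label <- function(variable, label = NULL, model_frame = NULL, data = NULL) {
  label[[variable]] %||%
    attr(model_frame[[variable]], "label") %||%
    attr(data[[variable]], "label") %||%
    variable
}

# Original data carries a label attribute
trial_data <- data.frame(age = c(50, 60))
attr(trial_data$age, "label") <- "Age, yrs"

# Stand-in for a model frame in which the label has been stripped
stripped_frame <- data.frame(age = c(50, 60))

label_from_data <- get_var_label("age", model_frame = stripped_frame, data = trial_data)
label_from_user <- get_var_label("age", label = list(age = "Patient Age"),
                                 model_frame = stripped_frame, data = trial_data)
label_fallback  <- get_var_label("age", model_frame = stripped_frame)
```

Because `data` defaults to NULL, the extra lookup is skipped whenever the data are not supplied, so behavior without the new argument is unchanged.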

add_global - move car package to suggests

Print a message if add_global is used without first installing the car package. The car package has many dependencies that most users will not need.

Deprecated functions and tidy evaluation

Many deprecated functions were used to create the package. Re-write with the non-deprecated versions using tidy evaluation.

Completed conversions:

add tbl_regression smart headers for {geepack} regression models

unobserved levels not printing

mtcars %>%
  dplyr::select(vs, carb) %>%
  tbl_summary(by = "vs")
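
For context, base R shows the kind of zero-count rows one might expect to see in this table. This sketch only tabulates the data; it makes no claim about how tbl_summary() handles the case internally.

```r
# Treat carb as a factor so every level is known up front
carb <- factor(mtcars$carb)

# Tabulate carb within the vs == 1 group; factor levels that are
# unobserved in this group are kept with a count of zero
counts_vs1 <- table(carb[mtcars$vs == 1])

# Levels that never occur when vs == 1
unobserved_levels <- names(counts_vs1)[counts_vs1 == 0]
```

In mtcars, carb takes six distinct values overall, and three of them (3, 6, and 8) never occur when vs is 1.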

tidy evaluation

Re-write all package functions and utility functions not using deprecated tidyverse functions

left justify label header in tbl_summary

The left column header becomes centered (rather than left justified) after adding a spanning header.

I think this may happen in {gt} rather than something I did in {gtsummary}?

library(gtsummary)
#> Loading required package: gt
trial %>%
  dplyr::select(grade, age, response, marker) %>%
  tbl_summary(by = "grade") %>%
  as_gt() %>%
  tab_spanner(label = "Tumor Grade", columns = starts_with("stat_"))

add_global.tbl_uvregression pass ... to car::Anova

Update add_global.tbl_uvregression to pass the dots (...) to car::Anova

add smart headers for lme4 objects

Add code to have headers like "OR" and "IRR" for logistic and Poisson regression models for random effects models, as we do for GLM models.

library(gtsummary)
library(lme4)
mod_glmer <-
  glmer(am ~ hp + (1 | gear), mtcars, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

mod_glm <-
  glm(am ~ hp, mtcars, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

include figures in {gtsummary} help files

The gt output tables don't display in the online help files.

Look at the gt package help pages to see how they display their examples. It looks like they are displaying image files.
Perhaps one solution is to add all the code to create the tables in the data-raw folder, and use code to save image of the tables. The help files will then import the images.

{gt} tables in help files (cont.)

@karissawhiting putting the comments from #83 in this issue to keep track.

We should shorten the tbl_summary table examples (and perhaps others). Also, two examples should be sufficient (rather than the three we have). With fewer examples and shorter tables, the space between example code and the rest of the help file will be shorter.
Rather than call the section "Figures", perhaps we should call it "Example Output"? That may help people understand more easily that these are the output tables.
Is there a way to make a separation between two example tables?

What do you think?

tables in README not rendering properly

tbl_summary vignette

First draft complete. Needs editing

tbl_survival results in wide format

By default the tbl_survival results are printed in long format (one line per stratum per time point). We want to add functionality for users to have the table print in wide format. Results can be made in two ways:

With each stratum variable as a column, or
With each time point as a column

We need to decide whether to incorporate these as options in tbl_survival or by making the modification via a new piped function. I am leaning towards making them options in the function.

We will need to return a long tibble in the results (the default printing format) regardless of the chosen output format, so that the inline_text() function can identify the correct statistics to return.

Whatever we decide, it should be compatible with the yet-to-be-completed function tbl_survival.cuminc (#64) AND with the survival quantile updates to tbl_survival.survfit (#61).

tbl_regression vignette

finish tbl_regression vignette

attach gt package

I added gt to Depends in DESCRIPTION and it does indeed load. But the messages are not as cute as when tidyverse loads packages. Perhaps update?