Is your feature request related to a problem? Please describe. Qu

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Feature request: Quantile regression as p-value/test about gtsummary HOT 7 CLOSED

spiralparagon commented on June 2, 2024

Feature request: Quantile regression as p-value/test

from gtsummary.

Comments (7)

ddsjoberg commented on June 2, 2024

Thanks for the post. Can you please provide a minimal reproducible example calculating the pvalue?

Can you also link to a manuscript using the test in the way you're suggesting?

from gtsummary.

spiralparagon commented on June 2, 2024

Hi Daniel! Thanks for replying!

I have the r-bloggers page as an example: https://www.r-bloggers.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/

But this question is based on a false premise: that the the Wilcoxon rank-sum test is used to compare medians. The premise is based on a misunderstanding of the null hypothesis of the test. The actual null hypothesis is that there is a 50% probability that a random value from one population exceeds an random value from the other population. The practical value of this is hard to see, and thus in many places, including textbooks, the null hypothesis is presented as “the two populations have equal medians”. The actual null hypothesis can be expressed as the latter median hypothesis, but only under the additional assumption that the shapes of the distributions are identical in each group.

In other words, our interpretation of the test as comparing the medians of the distributions requires the location-shift-only alternative to be the case. Since this is rarely true, and never assessed, we should probably use extreme caution in using, and especially in interpreting, the Wilcoxon rank-sum test.

I've paraphrased the code on to the following:

#Import libraries
library(tidyverse)
library(gtsummary)
library(quantreg)

#Create a reproducible dataset
set.seed(123)

y1 = as_tibble(rexp(10000))
y2 = as_tibble(rnorm(10000) + log(2))

y1$group = "Y1"
y2$group = "Y2"

df = rbind(y1,y2)

#Run wilcox.test manually
wilcox.test(value ~ group, data=df)

#Show using gtsummary
df %>% 
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list(all_continuous() ~ c(1,1)),
    ) %>% 
  bold_labels() %>% 
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2))

#Run a quantile regression, for the median (tau=0.5)
qr <- rq(value ~ group, data=df, tau = 0.5)
qr

#Calculate standard errors using bootstrapping with 100 replications, and respective p-values
set.seed(123)
summary(qr, se="boot", R=100)

from gtsummary.

spiralparagon commented on June 2, 2024

Maybe I'm misunderstanding the syntax for add_p?
For example, when I attempt to replace wilcoxons test with kolmorov-smirnov, which is called by "ks.test" with identical syntax, the p-value is omitted in the output

df %>% 
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list(all_continuous() ~ c(1,1)),
    ) %>% 
  bold_labels() %>% 
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2),
        test = list(all_continuous() ~ "ks.test"))

Similarily, I don't see where you could add parameters to the test-call for example, which would be needed in the case for quantreg (specifying the bootstrap estimator and number of replications)

from gtsummary.

ddsjoberg commented on June 2, 2024

Thanks @spiralparagon for the additional context!

Frist, regarding adding a quantile regression test: unfortunately, I don't think we will add it natively in the package as it depends on the {quantreg} package and I don't want to take on more dependencies than we already have. But that doesn't mean you can't include these results in your gtsummary tables. You can include any p-value by creating a custom function: https://www.danieldsjoberg.com/gtsummary/reference/tests.html#custom-functions-1

here's an example:

library(gtsummary)
#> #BlackLivesMatter

add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  quantreg::rq(
    reformulate(variable, response = by), 
    data = data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate("{by}" := factor(.data[[by]]) |> as.numeric()), 
    tau = 0.5
  ) |> 
    # use summary to get p-value
    summary(se = "boot") |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}

add_p_median_reg(mtcars, variable = "mpg", by = "am")
#>       estimate  std.error statistic     p.value            method
#> mpg 0.05649718 0.01887302  2.993542 0.005478198 Median Regression
add_p_median_reg(trial, variable = "age", by = "trt")
#>     estimate   std.error statistic p.value            method
#> age        0 0.007789238         0       1 Median Regression

tbl <-
  trial |> 
  tbl_summary(
    by = trt, 
    missing = "no",
    include = c(age, marker)
  ) |> 
  add_p(test = all_continuous() ~ add_p_median_reg)

^{Created on 2024-04-06 with reprex v2.1.0}

Second, we have not yet included support for the ks.test (no one has requested it before. Since it's included in the base R, we can include it. To keep things clean, if you'd like, can you request support in a separate issue?

from gtsummary.

spiralparagon commented on June 2, 2024

Hi Daniel!

Thank you so much for the help. I understand not wanting to increase dependencies and I appreciate your thorough example on how to implement this despite the fact.

Strangely, I don't see the expected behaviour with my previous code sample - the output of add_p_median_reg is correct on all accounts except for the p-value where it gives a 0.

library(tidyverse)
library(gtsummary)
library(quantreg)

#Create a reproducible dataset
set.seed(123)

y1 = as_tibble(rexp(10000))
y2 = as_tibble(rnorm(10000) + log(2))

y1$group = "Y1"
y2$group = "Y2"

df = rbind(y1,y2)

qr <- rq(value ~ group, data=df, tau = 0.5)
set.seed(123)
summary(qr, se="boot", R=100)

add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  quantreg::rq(
    reformulate(variable, response = by), 
    data = data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate("{by}" := factor(.data[[by]]) |> as.numeric()), 
    tau = 0.5
  ) |> 
    # use summary to get p-value, added 200 reps
    summary(se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}

set.seed(123)
add_p_median_reg(df, variable = "value", by = "group")

I can't see what's wrong in the code - even running a subset of the function directly gives expected result

set.seed(123)
summary(qr, se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))

I'm comparing the output of age to trt in the trial dataset

qr <- rq(age ~ trt, data=trial, tau = 0.5)
set.seed(123)
summary(qr, se="boot", R=100)

vs the function output

I'll create a seperate thread for ks.test. Thank you again!

from gtsummary.

spiralparagon commented on June 2, 2024

After exploring this a bit looking at the model, the problem seems to be in reformulate not giving expected behaviour to the rq function.

Attempted to use paste instead, aided with the magic of chatgpt (which also had an issue with the factor to numeric conversion, not sure how accurate it is). Caveat emptor as I'm not experienced :(

add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  data <- data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate(!!sym(by) := factor(.data[[by]]))
  
  # Construct the formula directly
  formula_str <- paste(variable, "~", by)
  formula <- as.formula(formula_str)
  
  # Build regression model using the constructed formula
  quantreg::rq(
    formula, 
    data = data,
    tau = 0.5
  ) |> 
    # use summary to get p-value
    summary(se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}

seems to give expected behaviour

from gtsummary.

ddsjoberg commented on June 2, 2024

Ahh, i had flipped the dependent and independent variables in the reformulate 😱 glad you have a solution

from gtsummary.

Feature request: Quantile regression as p-value/test about gtsummary HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent