Giter Club home page Giter Club logo

Comments (7)

ddsjoberg avatar ddsjoberg commented on June 2, 2024

Thanks for the post. Can you please provide a minimal reproducible example calculating the pvalue?

Can you also link to a manuscript using the test in the way you're suggesting?

from gtsummary.

spiralparagon avatar spiralparagon commented on June 2, 2024

Hi Daniel! Thanks for replying!

I have the r-bloggers page as an example: https://www.r-bloggers.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/

But this question is based on a false premise: that the the Wilcoxon rank-sum test is used to compare medians. The premise is based on a misunderstanding of the null hypothesis of the test. The actual null hypothesis is that there is a 50% probability that a random value from one population exceeds an random value from the other population. The practical value of this is hard to see, and thus in many places, including textbooks, the null hypothesis is presented as “the two populations have equal medians”. The actual null hypothesis can be expressed as the latter median hypothesis, but only under the additional assumption that the shapes of the distributions are identical in each group.

In other words, our interpretation of the test as comparing the medians of the distributions requires the location-shift-only alternative to be the case. Since this is rarely true, and never assessed, we should probably use extreme caution in using, and especially in interpreting, the Wilcoxon rank-sum test.

I've paraphrased the code on to the following:

#Import libraries
library(tidyverse)
library(gtsummary)
library(quantreg)
#Create a reproducible dataset
set.seed(123)

y1 = as_tibble(rexp(10000))
y2 = as_tibble(rnorm(10000) + log(2))

y1$group = "Y1"
y2$group = "Y2"

df = rbind(y1,y2)
#Run wilcox.test manually
wilcox.test(value ~ group, data=df)

#Show using gtsummary
df %>% 
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list(all_continuous() ~ c(1,1)),
    ) %>% 
  bold_labels() %>% 
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2)) 
#Run a quantile regression, for the median (tau=0.5)
qr <- rq(value ~ group, data=df, tau = 0.5)
qr
#Calculate standard errors using bootstrapping with 100 replications, and respective p-values
set.seed(123)
summary(qr, se="boot", R=100)

from gtsummary.

spiralparagon avatar spiralparagon commented on June 2, 2024

Maybe I'm misunderstanding the syntax for add_p?
For example, when I attempt to replace wilcoxons test with kolmorov-smirnov, which is called by "ks.test" with identical syntax, the p-value is omitted in the output

df %>% 
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list(all_continuous() ~ c(1,1)),
    ) %>% 
  bold_labels() %>% 
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2),
        test = list(all_continuous() ~ "ks.test"))

Similarily, I don't see where you could add parameters to the test-call for example, which would be needed in the case for quantreg (specifying the bootstrap estimator and number of replications)

from gtsummary.

ddsjoberg avatar ddsjoberg commented on June 2, 2024

Thanks @spiralparagon for the additional context!

Frist, regarding adding a quantile regression test: unfortunately, I don't think we will add it natively in the package as it depends on the {quantreg} package and I don't want to take on more dependencies than we already have. But that doesn't mean you can't include these results in your gtsummary tables. You can include any p-value by creating a custom function: https://www.danieldsjoberg.com/gtsummary/reference/tests.html#custom-functions-1

here's an example:

library(gtsummary)
#> #BlackLivesMatter

add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  quantreg::rq(
    reformulate(variable, response = by), 
    data = data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate("{by}" := factor(.data[[by]]) |> as.numeric()), 
    tau = 0.5
  ) |> 
    # use summary to get p-value
    summary(se = "boot") |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}

add_p_median_reg(mtcars, variable = "mpg", by = "am")
#>       estimate  std.error statistic     p.value            method
#> mpg 0.05649718 0.01887302  2.993542 0.005478198 Median Regression
add_p_median_reg(trial, variable = "age", by = "trt")
#>     estimate   std.error statistic p.value            method
#> age        0 0.007789238         0       1 Median Regression

tbl <-
  trial |> 
  tbl_summary(
    by = trt, 
    missing = "no",
    include = c(age, marker)
  ) |> 
  add_p(test = all_continuous() ~ add_p_median_reg)
image

Created on 2024-04-06 with reprex v2.1.0

Second, we have not yet included support for the ks.test (no one has requested it before. Since it's included in the base R, we can include it. To keep things clean, if you'd like, can you request support in a separate issue?

from gtsummary.

spiralparagon avatar spiralparagon commented on June 2, 2024

Hi Daniel!

Thank you so much for the help. I understand not wanting to increase dependencies and I appreciate your thorough example on how to implement this despite the fact.

Strangely, I don't see the expected behaviour with my previous code sample - the output of add_p_median_reg is correct on all accounts except for the p-value where it gives a 0.

library(tidyverse)
library(gtsummary)
library(quantreg)
#Create a reproducible dataset
set.seed(123)

y1 = as_tibble(rexp(10000))
y2 = as_tibble(rnorm(10000) + log(2))

y1$group = "Y1"
y2$group = "Y2"

df = rbind(y1,y2)
qr <- rq(value ~ group, data=df, tau = 0.5)
set.seed(123)
summary(qr, se="boot", R=100)
add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  quantreg::rq(
    reformulate(variable, response = by), 
    data = data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate("{by}" := factor(.data[[by]]) |> as.numeric()), 
    tau = 0.5
  ) |> 
    # use summary to get p-value, added 200 reps
    summary(se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}
set.seed(123)
add_p_median_reg(df, variable = "value", by = "group")

bild

I can't see what's wrong in the code - even running a subset of the function directly gives expected result

set.seed(123)
summary(qr, se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))

bild

I'm comparing the output of age to trt in the trial dataset

qr <- rq(age ~ trt, data=trial, tau = 0.5)
set.seed(123)
summary(qr, se="boot", R=100)

bild

vs the function output
bild

I'll create a seperate thread for ks.test. Thank you again!

from gtsummary.

spiralparagon avatar spiralparagon commented on June 2, 2024

After exploring this a bit looking at the model, the problem seems to be in reformulate not giving expected behaviour to the rq function.

Attempted to use paste instead, aided with the magic of chatgpt (which also had an issue with the factor to numeric conversion, not sure how accurate it is). Caveat emptor as I'm not experienced :(

add_p_median_reg <- function(data, variable, by, ...) {
  # build regression model
  data <- data |> 
      tidyr::drop_na(all_of(c(variable, by))) |> 
      dplyr::mutate(!!sym(by) := factor(.data[[by]]))
  
  # Construct the formula directly
  formula_str <- paste(variable, "~", by)
  formula <- as.formula(formula_str)
  
  # Build regression model using the constructed formula
  quantreg::rq(
    formula, 
    data = data,
    tau = 0.5
  ) |> 
    # use summary to get p-value
    summary(se = "boot", R=200) |> 
    # format results like you'd see from `broom::tidy()`
    purrr::pluck(coefficients) |> 
    as.data.frame() |> 
    tail(n = 1) |> 
    setNames(c("estimate", "std.error", "statistic", "p.value"))  |> 
    dplyr::mutate(method = "Median Regression")
}

seems to give expected behaviour

from gtsummary.

ddsjoberg avatar ddsjoberg commented on June 2, 2024

Ahh, i had flipped the dependent and independent variables in the reformulate 😱 glad you have a solution

from gtsummary.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.