I am working on a simulation study for which I need to know the treatment generating

Use formula to simulate categorical variable about simstudy HOT 9 CLOSED

lorenzoFabbri commented on August 24, 2024

Use formula to simulate categorical variable

from simstudy.

Comments (9)

kgoldfeld commented on August 24, 2024

You need to be careful in your data generating process. If you add a couple of extra lines, you can see the probabilities that you are generating - you generally don't want to create a scenario where the probabilities are extremely high. Looking at the values your formulas are generating, you would see that the linear predictors (the log odds) exceed 50, so the implied probabilities are essentially 1! Here is a simpler example:

library(simstudy)
library(data.table)

def <- 
 defData(varname = "sex", dist = "binary", link = "logit", formula = 0.5) |>
 defData(varname = "p1", dist = "nonrandom", link = "logit", formula = "-0.3 + .2*sex") |>
 defData(varname = "p2", dist = "nonrandom", link = "logit", formula = "-1.3 + 0.3*sex") |>
 defData(varname = "exposure", dist = "categorical", formula = "p1;p2")

# Simulate data

set.seed(123)

ret <- genData(n = 10000, dtDefs = def)
#> Warning: Probabilities do not sum to 1. Adding category to all rows!

ret
#> Key: <id>
#>           id   sex        p1        p2 exposure
#>        <int> <int>     <num>     <num>    <int>
#>     1:     1     1 0.4750208 0.2689414        2
#>     2:     2     0 0.4255575 0.2141650        1
#>     3:     3     1 0.4750208 0.2689414        2
#>     4:     4     0 0.4255575 0.2141650        2
#>     5:     5     0 0.4255575 0.2141650        1
#>    ---                                         
#>  9996:  9996     1 0.4750208 0.2689414        3
#>  9997:  9997     0 0.4255575 0.2141650        1
#>  9998:  9998     1 0.4750208 0.2689414        2
#>  9999:  9999     1 0.4750208 0.2689414        2
#> 10000: 10000     0 0.4255575 0.2141650        1

ret[, table(exposure)]
#> exposure
#>    1    2    3 
#> 4539 2520 2941
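
As a quick sanity check on the output above, the `p1` and `p2` columns can be reproduced in base R with the inverse logit (`plogis`), and the probability of the automatically added third category is whatever remains. This is a minimal sketch that only mirrors the formulas in the definition; it does not call simstudy, and the object names (`probs`, `p3`) are illustrative.

```r
# Inverse logit of the two linear predictors, evaluated at sex = 0 and sex = 1
sex <- c(0, 1)
p1 <- plogis(-0.3 + 0.2 * sex)
p2 <- plogis(-1.3 + 0.3 * sex)

# Probability left over for the third category (p1 + p2 < 1 in both rows)
p3 <- 1 - p1 - p2

probs <- data.frame(sex, p1, p2, p3)
print(round(probs, 4))
```

Because `p1 + p2` stays below 1 for both values of `sex`, the leftover probability is positive, which is consistent with the warning and with the third level appearing in the table.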

kgoldfeld commented on August 24, 2024

Lorenzo - I am not fully clear on what the treatment assignment mechanism is. Can you describe the probabilities of treatment assignment with an equation (as a function of variable 1 and variable 2)? Or in words. Feel free to e-mail me directly at [email protected].

lorenzoFabbri commented on August 24, 2024

I apologize!

Let's assume that we have two confounding variables, $C1$ and $C2$. The treatment variable is $X$ and it has 3 levels, and the treatment generating mechanism is given by $X \sim C1 + log(C2)$. So if we wanted to model $X$, and if we were to know the treatment generating mechanism, we would fit a model $X \sim C1 + log(C2)$ (for instance with a multinomial logistic model). So I'd like to generate a categorical variable using the formula $C1 + log(C2)$. I hope now it's clear! Thank you.

kgoldfeld commented on August 24, 2024

You have written the probability of $X$ as a binary outcome. There need to be at least two probabilities specified. In this post, I generate data from a multinomial distribution (before fitting a model). Take a look and see if it helps.

lorenzoFabbri commented on August 24, 2024

This is probably what I need, thank you. I do have some questions now. In the example, you:

1. Define a variable with a binary distribution and an identity link, and one with a logit link. I do know the differences between the links, but it is not clear to me what type of variable you want to simulate with a binary distribution and an identity link.
2. Define a categorical variable with a logit link. From the documentation, this does not seem to be supported.

I am trying to replicate your example but with my own variables:

# Confounders
def <- simstudy::defData(
  varname = "sex",
  dist = "binary",
  link = "logit",
  formula = 0.5
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "age",
  dist = "normal",
  formula = 10,
  variance = 2
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "eth",
  dist = "categorical",
  formula = "0.4;0.3;0.2;0.1",
  variance = "1;2;3;4"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "edu",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "sep",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "weight",
  dist = "normal",
  formula = 35,
  variance = 15
)

# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
             -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)

# Simulate data (num_obs, the sample size, is not defined in this snippet)
ret <- simstudy::genData(
  n = num_obs,
  dtDefs = def
)

It is not clear to me why in my case, exposure has only two levels rather than 3.

assignUser commented on August 24, 2024

# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
             -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)

Your categorical formula provides only 2 probabilities, whereas the definition in the blog post has 3:

defY <- defDataAdd(defY, varname = "y", 
  formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
             b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
             -0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ", 
  dist = "categorical", link = "logit")

lorenzoFabbri commented on August 24, 2024

Thank you, but in the example, Y has four levels. I understood that we have to provide `K-1` formulas if the variable has `K` levels.
It is also not clear how to make sure the probabilities sum to 1 when using complex data generating mechanisms.

kgoldfeld commented on August 24, 2024

Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.

And good catch on the documentation. Categorical data can be defined with the logit link.
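
To see why both probabilities are effectively 1, it helps to evaluate the two linear predictors by hand. The sketch below uses base R only; the covariate values are assumptions chosen for illustration (`age = 10` and `weight = 35` are the means in the definitions above, and the categorical covariates are set to their first level), not values taken from the simulated data.

```r
# Illustrative covariate values (assumed, not drawn from the simulation)
age <- 10
weight <- 35
sex <- 1
eth <- 1
edu <- 1
sep <- 1

# The two linear predictors from the exposure definition
lp1 <- 1 + 0.1 * sex * age + 0.5 * age^2 - 1.3 * eth + 0.7 * edu +
  0.2 * sep + 0.3 * log(weight)
lp2 <- -1.3 + 0.3 * sex * age + 0.5 * age^2 - 1.7 * eth + 0.7 * edu +
  0.2 * sep + 0.9 * log(weight)

# Both log odds are above 50, driven by the 0.5 * age^2 term
print(c(lp1 = lp1, lp2 = lp2))

# The inverse logit of a value this large is indistinguishable from 1
print(plogis(c(lp1, lp2)))
```

The `0.5 * age^2` term alone contributes about 50 on the log-odds scale, so rescaling or centering `age` (or shrinking that coefficient) would be one way to bring the probabilities back into a sensible range.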

lorenzoFabbri commented on August 24, 2024

> Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.
>
> And good catch on the documentation. Categorical data can be defined with the logit link.

Okay makes sense. I apologize for this basic question, but with such complex generating mechanisms (especially for variables with many levels), how can we make sure that the sum of the probabilities is indeed 1?
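
One common way to guarantee that the probabilities sum to 1, however complex the formulas, is to treat each formula as the log odds of a category relative to a reference category and then normalize, as in a multinomial logit (softmax) model. The sketch below does this in base R. `multinom_probs` is a hypothetical helper name, and this illustrates the general technique under that assumption; it is not a description of how simstudy handles the logit link internally.

```r
# lp: n x (K-1) matrix of linear predictors (log odds vs. a reference category)
# Returns an n x K matrix of probabilities; column 1 is the reference category.
multinom_probs <- function(lp) {
  lp <- as.matrix(lp)
  # Prepend a zero column for the reference category
  z <- cbind(0, lp)
  # Subtract each row's maximum for numerical stability before exponentiating
  z <- z - apply(z, 1, max)
  ez <- exp(z)
  ez / rowSums(ez)
}

set.seed(123)
sex <- rbinom(5, 1, 0.5)
lp <- cbind(-0.3 + 0.2 * sex, -1.3 + 0.3 * sex)

probs <- multinom_probs(lp)
print(rowSums(probs))  # each row sums to 1, up to floating-point error

# Draw one category per row from the normalized probabilities
exposure <- apply(probs, 1, function(p) sample(seq_along(p), 1, prob = p))
```

With this parameterization, `K-1` formulas always yield `K` categories and the row sums are 1 by construction, so only the relative sizes of the linear predictors need to be checked for plausibility.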
