Comments (9)

kgoldfeld commented on July 22, 2024

You need to be careful in your data generating process. If you add a couple of extra lines, you can see the probabilities that you are generating; you generally don't want to create a scenario where the probabilities are extremely high. Looking at the values your formulas are generating, you would see that they exceed 50 on the logit scale, which corresponds to probabilities of essentially 1! Here is a simpler example:

library(simstudy)
library(data.table)

def <- 
 defData(varname = "sex", dist = "binary", link = "logit", formula = 0.5) |>
 defData(varname = "p1", dist = "nonrandom", link = "logit", formula = "-0.3 + .2*sex") |>
 defData(varname = "p2", dist = "nonrandom", link = "logit", formula = "-1.3 + 0.3*sex") |>
 defData(varname = "exposure", dist = "categorical", formula = "p1;p2")

# Simulate data

set.seed(123)

ret <- genData(n = 10000, dtDefs = def)
#> Warning: Probabilities do not sum to 1. Adding category to all rows!

ret
#> Key: <id>
#>           id   sex        p1        p2 exposure
#>        <int> <int>     <num>     <num>    <int>
#>     1:     1     1 0.4750208 0.2689414        2
#>     2:     2     0 0.4255575 0.2141650        1
#>     3:     3     1 0.4750208 0.2689414        2
#>     4:     4     0 0.4255575 0.2141650        2
#>     5:     5     0 0.4255575 0.2141650        1
#>    ---                                         
#>  9996:  9996     1 0.4750208 0.2689414        3
#>  9997:  9997     0 0.4255575 0.2141650        1
#>  9998:  9998     1 0.4750208 0.2689414        2
#>  9999:  9999     1 0.4750208 0.2689414        2
#> 10000: 10000     0 0.4255575 0.2141650        1

ret[, table(exposure)]
#> exposure
#>    1    2    3 
#> 4539 2520 2941

from simstudy.

kgoldfeld commented on July 22, 2024

Lorenzo - I am not fully clear on what the treatment assignment mechanism is. Can you describe the probabilities of treatment assignment with an equation (as a function of variable 1 and variable 2), or in words? Feel free to e-mail me directly at [email protected].


lorenzoFabbri commented on July 22, 2024

I apologize!

Let's assume that we have two confounding variables, $C_1$ and $C_2$. The treatment variable is $X$, which has 3 levels, and the treatment generating mechanism is given by $X \sim C_1 + \log(C_2)$. So if we wanted to model $X$ and knew the treatment generating mechanism, we would fit a model $X \sim C_1 + \log(C_2)$ (for instance, a multinomial logistic model). So I'd like to generate a categorical variable using the formula $C_1 + \log(C_2)$. I hope it's clear now! Thank you.


kgoldfeld commented on July 22, 2024

You have written the probability of $X$ as a binary outcome. There need to be at least two probabilities specified. In this post, I generate data from a multinomial distribution (before fitting a model). Take a look and see if it helps.


lorenzoFabbri commented on July 22, 2024

This is probably what I need, thank you. I do have some questions now. In the example, you:

  • Define a variable with a binary distribution with an identity link and one with a logit link. I do know the differences between the links, but it is not clear to me what type of variable you want to simulate with a binary distribution and an identity link.
  • Define a categorical variable with a logit link. From the documentation it does not seem to be supported.

I am trying to replicate your example but with my own variables:

def <- simstudy::defData(
  varname = "sex",
  dist = "binary",
  link = "logit",
  formula = 0.5
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "age",
  dist = "normal",
  formula = 10,
  variance = 2
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "eth",
  dist = "categorical",
  formula = "0.4;0.3;0.2;0.1",
  variance = "1;2;3;4"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "edu",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "sep",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "weight",
  dist = "normal",
  formula = 35,
  variance = 15
)

# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
             -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)

# Simulate data
ret <- simstudy::genData(
  n = num_obs,
  dtDefs = def
)

It is not clear to me why in my case, exposure has only two levels rather than 3.


assignUser commented on July 22, 2024

# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
             -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)

Your definition for the categorical formula only provides 2 probabilities, whereas the definition in the blog post has 3:

defY <- defDataAdd(defY, varname = "y", 
  formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
             b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
             -0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ", 
  dist = "categorical", link = "logit")

See also https://kgoldfeld.github.io/simstudy/articles/simstudy.html#categorical


lorenzoFabbri commented on July 22, 2024

Thank you, but in the example, Y has four levels. I understood that we have to provide K-1 formulas if the variable has K levels.
It is also not clear how to make sure the probabilities sum to 1 when using complex data generating mechanisms.


kgoldfeld commented on July 22, 2024

Let me jump back in. If you provide 2 probabilities that sum to less than 1, a third category is automatically added, so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation is actually assigning 50% probability to each of the two outcomes.

And good catch on the documentation. Categorical data can be defined with the logit link.
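
The inverse-logit point can be checked by hand with base R. This is only a rough sketch: the covariate values below are hypothetical "typical" values picked near the means and modes of the definitions in the question, and the final rescaling step is an assumption that simply mirrors the 50% split described above rather than a statement about simstudy's internals.

```r
# Hypothetical typical covariate values (assumed for illustration only)
sex <- 1; age <- 10; eth <- 1; edu <- 2; sep <- 2; weight <- 35

# Linear predictors copied from the two formulas in the question
eta1 <-  1   + 0.1*sex*age + 0.5*age^2 - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight)
eta2 <- -1.3 + 0.3*sex*age + 0.5*age^2 - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)

# Inverse logit; the 0.5*age^2 term alone contributes about 50 on the logit scale
p1 <- plogis(eta1)
p2 <- plogis(eta2)
print(c(eta1 = eta1, eta2 = eta2, p1 = p1, p2 = p2))

# Assumed rescaling: two probabilities that are both ~1 split evenly
print(c(p1, p2) / sum(c(p1, p2)))
```

With a quadratic term in age and a mean age of 10, almost any realistic draw pushes both linear predictors far above the range where the inverse logit is distinguishable from 1, which would leave no probability mass for a third category.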


lorenzoFabbri commented on July 22, 2024

Okay makes sense. I apologize for this basic question, but with such complex generating mechanisms (especially for variables with many levels), how can we make sure that the sum of the probabilities is indeed 1?
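
One general way to guarantee that the probabilities sum to 1, whatever the linear predictors look like, is the inverse multinomial logit (softmax) with a reference category. The base R sketch below is not taken from simstudy; the helper name `multinom_probs` is hypothetical, and the coefficients are borrowed from the simpler `p1`/`p2` example in this thread.

```r
# Hypothetical helper: maps K-1 linear predictors (columns of `eta`) to
# K probabilities per row. The last category is the reference (eta = 0).
multinom_probs <- function(eta) {
  eta <- cbind(eta, ref = 0)          # append the reference category
  eta <- eta - apply(eta, 1, max)     # subtract row maxima for numerical stability
  expd <- exp(eta)
  expd / rowSums(expd)                # normalise so each row sums to 1
}

set.seed(123)
n <- 5
sex <- rbinom(n, 1, 0.5)

# Two linear predictors, so three categories
eta <- cbind(
  e1 = -0.3 + 0.2 * sex,
  e2 = -1.3 + 0.3 * sex
)

p <- multinom_probs(eta)
print(p)
print(rowSums(p))  # each row sums to 1 up to floating-point error
```

Because the normalisation happens across all categories at once, the rows sum to 1 by construction, unlike applying the inverse logit to each formula separately. Whether simstudy applies the same normalisation internally for `link = "logit"` is not confirmed in this thread, so checking the generated probabilities, as in the `p1`/`p2` example, remains a sensible safeguard.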

