Comments (9)
You need to be careful in your data generating process. If you add a couple of extra lines, you can see the probabilities that you are generating; you generally don't want to create a scenario where the probabilities are extremely high. Looking at the values your formulas are generating, you would see that they exceed 50! Here is a simpler example:
library(simstudy)
library(data.table)
def <-
  defData(varname = "sex", dist = "binary", link = "logit", formula = 0.5) |>
  defData(varname = "p1", dist = "nonrandom", link = "logit", formula = "-0.3 + .2*sex") |>
  defData(varname = "p2", dist = "nonrandom", link = "logit", formula = "-1.3 + 0.3*sex") |>
  defData(varname = "exposure", dist = "categorical", formula = "p1;p2")
# Simulate data
set.seed(123)
ret <- genData(n = 10000, dtDefs = def)
#> Warning: Probabilities do not sum to 1. Adding category to all rows!
ret
#> Key: <id>
#> id sex p1 p2 exposure
#> <int> <int> <num> <num> <int>
#> 1: 1 1 0.4750208 0.2689414 2
#> 2: 2 0 0.4255575 0.2141650 1
#> 3: 3 1 0.4750208 0.2689414 2
#> 4: 4 0 0.4255575 0.2141650 2
#> 5: 5 0 0.4255575 0.2141650 1
#> ---
#> 9996: 9996 1 0.4750208 0.2689414 3
#> 9997: 9997 0 0.4255575 0.2141650 1
#> 9998: 9998 1 0.4750208 0.2689414 2
#> 9999: 9999 1 0.4750208 0.2689414 2
#> 10000: 10000 0 0.4255575 0.2141650 1
ret[, table(exposure)]
#> exposure
#> 1 2 3
#> 4539 2520 2941
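As a rough cross-check of the output above, the two probabilities and the remainder that goes to the automatically added third category can be recomputed in base R. This is only a sketch: `plogis()` is the base R inverse logit, and the names `p3` and `probs` are introduced here for illustration.

```r
# Inverse logit of the two linear predictors, evaluated at sex = 0 and sex = 1
sex <- c(0, 1)
p1 <- plogis(-0.3 + 0.2 * sex)
p2 <- plogis(-1.3 + 0.3 * sex)

# Whatever is left over is what the added third category receives
p3 <- 1 - p1 - p2

probs <- cbind(sex, p1, p2, p3)
print(round(probs, 4))
```

The `p1` and `p2` columns should match the values printed by `genData()` above, and each row of `p1`, `p2`, `p3` sums to 1.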
from simstudy.
Lorenzo - I am not fully clear on what the treatment assignment mechanism is. Can you describe the probabilities of treatment assignment with an equation (as a function of variable 1 and variable 2), or in words? Feel free to e-mail me directly at [email protected].
from simstudy.
I apologize!
Let's assume that we have two confounding variables, formula
from simstudy.
You have written the probability of
from simstudy.
This is probably what I need, thank you. I do have some questions now. In the example, you:
- Define a variable with a binary distribution with an identity link and one with a logit link. I do know the differences between the links, but it is not clear to me what type of variable you want to simulate with a binary distribution and an identity link.
- Define a categorical variable with a logit link. From the documentation it does not seem to be supported.
I am trying to replicate your example but with my own variables:
def <- simstudy::defData(
  varname = "sex",
  dist = "binary",
  link = "logit",
  formula = 0.5
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "age",
  dist = "normal",
  formula = 10,
  variance = 2
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "eth",
  dist = "categorical",
  formula = "0.4;0.3;0.2;0.1",
  variance = "1;2;3;4"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "edu",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "sep",
  dist = "categorical",
  formula = "0.2;0.5;0.3",
  variance = "1;2;3"
)
def <- simstudy::defData(
  dtDefs = def,
  varname = "weight",
  dist = "normal",
  formula = 35,
  variance = 15
)
# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
  -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)
# Simulate data
num_obs <- 10000  # assumed sample size; not defined in the original snippet
ret <- simstudy::genData(
  n = num_obs,
  dtDefs = def
)
It is not clear to me why, in my case, exposure has only two levels rather than 3.
from simstudy.
# Generate exposure given confounders
def <- simstudy::defData(
  dtDefs = def,
  varname = "exposure",
  dist = "categorical",
  link = "logit",
  formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
  -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)
Your definition for the categorical formula only provides 2 probabilities, whereas the definition in the blog post has 3:
defY <- defDataAdd(defY, varname = "y",
formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
-0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ",
dist = "categorical", link = "logit")
See also https://kgoldfeld.github.io/simstudy/articles/simstudy.html#categorical
from simstudy.
Thank you, but in the example, Y has four levels. I understood we have to provide K-1 formulas if the variable has K levels.
It is also not clear how to make sure the probabilities sum to 1 when using complex data generating mechanisms.
from simstudy.
Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.
And good catch on the documentation. Categorical data can be defined with the logit link.
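To see why the inverse logit of both formulas comes out as 1, the two linear predictors can be evaluated by hand. This is a minimal sketch in base R; the covariate values below are hypothetical, picked near the centre of the distributions defined earlier in the thread (age around 10, weight around 35).

```r
# Hypothetical covariate values, chosen for illustration only
sex <- 1; age <- 10; eth <- 1; edu <- 2; sep <- 2; weight <- 35

# The two exposure formulas from the definition, on the logit scale
lp1 <-  1.0 + 0.1 * sex * age + 0.5 * age^2 - 1.3 * eth + 0.7 * edu + 0.2 * sep + 0.3 * log(weight)
lp2 <- -1.3 + 0.3 * sex * age + 0.5 * age^2 - 1.7 * eth + 0.7 * edu + 0.2 * sep + 0.9 * log(weight)

# Both are above 50, dominated by the 0.5 * age^2 term
print(c(lp1 = lp1, lp2 = lp2))

# Inverse logit: at this magnitude both values are numerically indistinguishable from 1
print(plogis(c(lp1, lp2)))
```

Shrinking the coefficient on the squared age term, or centring age before squaring it, would be one way to bring the linear predictors back into a range where the probabilities are not saturated.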
from simstudy.
Okay makes sense. I apologize for this basic question, but with such complex generating mechanisms (especially for variables with many levels), how can we make sure that the sum of the probabilities is indeed 1?
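One general way to guarantee this, sketched below in base R, is to treat the K-1 formulas as log-odds against a reference category and normalise them (a multinomial logit). The helper `multinom_probs()` is hypothetical and not part of simstudy; whether simstudy applies the same normalisation internally for `link = "logit"` should be checked against its documentation.

```r
# Convert K-1 linear predictors (one column each) into K probabilities per row.
# The last column is the reference category.
multinom_probs <- function(lp) {
  e <- exp(as.matrix(lp))
  denom <- 1 + rowSums(e)           # one denominator per row
  cbind(e / denom, ref = 1 / denom) # division recycles denom down each column
}

# Two rows of example linear predictors (illustrative values)
lp <- cbind(c(-0.3, -0.1), c(-1.3, -1.0))
probs <- multinom_probs(lp)

print(round(probs, 4))
print(rowSums(probs))
```

By construction every row sums to 1, however complex the formulas are. Very large linear predictors can still overflow `exp()`, so it remains worth inspecting their range first.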
from simstudy.