
simstudy's Introduction

simstudy


The simstudy package is a collection of functions that allow users to generate simulated data sets in order to explore modeling techniques or better understand data generating processes. The user defines the distributions of individual variables, specifies relationships between covariates and outcomes, and generates data based on these specifications. The final data sets can represent randomized controlled trials, repeated measure designs, cluster randomized trials, or naturally observed data processes. Other complexities that can be added include survival data, correlated data, factorial study designs, stepped wedge designs, and missing data processes.

Simulation using simstudy has two fundamental steps. The user (1) defines the data elements of a data set and (2) generates the data based on these definitions. Additional functionality exists to simulate observed or randomized treatment assignment/exposures, to create longitudinal/panel data, to create multi-level/hierarchical data, to create datasets with correlated variables based on a specified covariance structure, to merge datasets, to create data sets with missing data, and to create non-linear relationships with underlying spline curves.

The overarching philosophy of simstudy is to create data generating processes that mimic the typical models used to fit those types of data. So, the parameterization of some of the data generating processes may not follow the standard parameterizations for the specific distributions. For example, in simstudy gamma-distributed data are generated based on the specification of a mean μ (or log(μ)) and a dispersion d, rather than the shape α and rate β parameters that more typically characterize the gamma distribution. When we estimate the parameters, we are modeling μ (or some function of μ), so we should be able to explicitly recover the simstudy parameters used to generate the data, thus illuminating the relationship between the underlying data generating processes and the models. For more details on the package, use cases, examples, and function reference, see the documentation page.
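For example, here is a minimal sketch of the gamma parameterization: the mean and dispersion are specified directly, and the helper gammaGetShapeRate (exported by the package) is assumed here to map them to the conventional shape and rate.

library(simstudy)

def <- defData(varname = "g", formula = 10, variance = 0.5, dist = "gamma")
dg <- genData(10000, def)

mean(dg$g)                   # should be close to the specified mean of 10
gammaGetShapeRate(10, 0.5)   # the implied shape and rate parameters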

Installation

You can install the released version of simstudy from CRAN with:

install.packages("simstudy")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("kgoldfeld/simstudy")

Example

Here is some simple sample code; there is much more in the vignettes:

library(simstudy)
set.seed(1965)

def <- defData(varname="x", formula = 10, variance = 2, dist = "normal")
def <- defData(def, varname="y", formula = "3 + 0.5 * x", variance = 1, dist = "normal")
dd <- genData(250, def)

dd <- trtAssign(dd, nTrt = 4, grpName = "grp", balanced = TRUE)

dd
#>       id         x        y grp
#>   1:   1 11.191960 8.949389   4
#>   2:   2 10.418375 7.372060   4
#>   3:   3  8.512109 6.925844   3
#>   4:   4 11.361632 9.850340   4
#>   5:   5  9.928811 6.515463   4
#>  ---                           
#> 246: 246  8.220609 7.898416   2
#> 247: 247  8.531483 8.681783   2
#> 248: 248 10.507370 8.552350   3
#> 249: 249  8.621339 6.652300   1
#> 250: 250  9.508164 7.083845   3

Contributing & Support

If you find a bug or need help, please file an issue with a reprex on GitHub. We are happy to accept contributions to simstudy. More information on how to propose changes or fix bugs can be found here.

Code of Conduct

Please note that the simstudy project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


simstudy's Issues

question about using simulated correlated data

I have a question about using the simulated correlated data. Suppose I generate three variables a, b, and c with a particular correlation matrix using genCorData. I would like to simulate a new variable d with d = 0.2*a + 0.3*b + 0.5*c and variance 1. However, I haven't figured out a way to do that, since addCorData doesn't have an option to impose a variance, and the correlated variables a, b, and c are not part of a data definition table.

Thanks!
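One possible workaround - a sketch, not an official recommendation: treat d as a normally distributed outcome whose mean is the linear combination of a, b, and c, defined via defDataAdd and appended with addColumns. Note that variance = 1 here is the residual variance of d given a, b, and c, not the marginal variance of d.

library(simstudy)

# generate correlated a, b, and c
dt <- genCorData(1000, mu = c(0, 0, 0), sigma = 1, rho = 0.5,
                 corstr = "cs", cnames = c("a", "b", "c"))

# define d as a normal outcome with the desired mean formula
defA <- defDataAdd(varname = "d", formula = "0.2*a + 0.3*b + 0.5*c",
                   variance = 1, dist = "normal")
dt <- addColumns(defA, dt)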

Testing of random/statistical functions

My goal is to cover each function we work on during other issues (like #23) with tests while fixing them. This is pretty easy for deterministic functions like trimData, but requires some thought when the function in question has a random element (i.e., all gen* functions, etc.).

The most "basic" way to test these would be to create sample output using a seed, saving the output and during test use the same seed and compare if the function output and sample output are identical.

I am not in favor of this approach, as it limits testing (e.g., just one input; no property-based testing, just simple comparison) and does not really do what tests should do. We want to test the API (or behavior) of the function, not its internals. When we use exact output to check against, we bind ourselves to the implementation, as any change in the generation function (either on our end or in packages we depend on) would break the test. And we would not know if something is actually wrong with the generated data or if just some other equally valid values were generated.

So I think the best way would be to actually analyse the output for the specified characteristics. Sounds easy enough but at this point I will need your help in some (most? 😁 ) cases.

For genOrdCat it is easy enough: just generate a large number of observations, calculate frequencies, and compare to the baseprobs. But for e.g. genCorOrdCat I am not sure what specifically needs to be checked (in the example you show how to check the baseprobs vs. generated data, so that's a start). So it would be great if I could depend on you in these cases to write some code to check for the relevant characteristics of the functions in question (currently: genCorOrdCat), and I will use that code to create proper full-featured tests.
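To make the genOrdCat case concrete, here is a rough sketch of that frequency-based check as a testthat-style test (the outcome column name "cat" is assumed here to be the default):

library(simstudy)
library(testthat)

set.seed(123)
baseprobs <- c(.2, .3, .5)

def <- defData(varname = "z", formula = 0, dist = "nonrandom")
dd <- genData(1e5, def)
dd <- genOrdCat(dd, adjVar = "z", baseprobs = baseprobs)

# for large n, the observed frequencies should approximate baseprobs
obs <- as.vector(prop.table(table(dd$cat)))
expect_equal(obs, baseprobs, tolerance = 0.02)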

unify defData/defDataAdd

While getting the new evalDef working (I need to pass link from defData), I noticed that the first defined variable is not checked by evalDef, only tested to see if it is a scalar.

Refactor .evalDef to be easily extendable

Specifications for .evalDef

Feel free to add or change anything as you see fit!
The goal is to refactor .evalDef to be easily testable, maintainable and extensible.

  • test the new .evalDef function thoroughly

General Checks

newVar, defVars:

  • [err] newVar is NOT previously defined and
  • [warn] is a valid R name -> make.names

formula:

  • [err] dist dependent format checks
  • [err] if no defVars: check that formula == scalar

dist:

  • [err] is valid dist

Distribution Specific Checks

general arithmetic expression

  • [err] all used vars are previously defined or
  • are escaped with ..var and exist in .GlobalEnv
  • [err] is a valid expression built from: numbers, previously defined vars, base R operators and functions (e.g. log), and ..vars.
  • [err] or scalar

beta

  • [err] formula see g.a.e
  • [err] variance/dispersion value see g.a.e
  • [err] link %in% c("identity","logit")

binary

  • [err] formula = [0,1] not possible until after genData -> g.a.e
  • [err] link %in% c("identity","logit")
  • variance = NULL

binomial

  • [err] formula = [0,1] not possible until after genData -> g.a.e
  • [err] variance = integer >= 0 not possible until after genData -> g.a.e
  • [err] link %in% c("identity","logit")

categorical

  • [err] formula format == "p1;...;pn"
  • allow external vars (..var) - no real benefit, they can just be pasted into the formula
  • [err] all.equal(sum(probs),1) == TRUE
  • link/variance = NULL (in the future may be used as category data input)

exponential

  • [err] formula see g.a.e
  • [err] link %in% c("identity","log")
  • variance = NULL

gamma

  • [err] formula see g.a.e
  • [err] variance/dispersion value see g.a.e
  • [err] link %in% c("identity","log")

mixture

  • [err] all used vars %in% defVars
  • allow external vars (..var)
  • [err] all.equal(sum(probs),1) == TRUE
  • [err] formula should have form "x1 | p1 + ... + xn | pn" or "numeric | p1 + ... + numeric | pn"
  • link/var = NULL

negBinomial

  • [err] formula see g.a.e
  • [err] variance/dispersion value see g.a.e
  • [err] link %in% c("identity","log")

nonrandom

  • [err] formula see g.a.e
  • variance/link = NULL

normal

  • [err] formula see g.a.e
  • [err] variance see g.a.e
  • link = NULL

noZeroPoisson

  • [err] formula see g.a.e
  • [err] link %in% c("identity","log")
  • variance = NULL

poisson

  • [err] formula see g.a.e
  • [err] link %in% c("identity","log")
  • variance = NULL

uniform

  • [err] formula format == "min;max"
  • allow external vars (..var) - no real benefit, they can just be pasted into the formula
  • [err] is.numeric(c(min, max))
  • [err] max > min
  • [warn] max == min
  • link/var = NULL

uniformInt

  • see uniform
  • is.integer(c(min, max))

Hello!

During the implementation of the pseudorandom distribution I came across .evalDef and noticed that it needs to be changed in several places to support a new distribution. It also applies certain checks to all formula arguments irrespective of their use in the chosen distribution (I had to switch from "1,2,3;4" to "1+2+3;4").

I think the function can be refactored to be easily extendable with new distributions and to allow for more possibilities when parsing arguments.

The function could use templates/patterns/helper functions (naming is hard) to check the applicable arguments but with tests dependent on the chosen distribution.
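As a sketch of what "easily extendable" could mean in practice (all names hypothetical): keep one small checker per distribution in a named list and dispatch on the dist argument, so supporting a new distribution means adding a single entry rather than touching .evalDef in several places.

.checkDists <- list(
  normal = function(args) {
    stopifnot(!is.null(args$formula), !is.null(args$variance))
  },
  uniform = function(args) {
    range <- as.numeric(strsplit(args$formula, ";")[[1]])
    stopifnot(length(range) == 2, range[2] >= range[1])
  }
)

.evalDef <- function(dist, args) {
  checker <- .checkDists[[dist]]
  if (is.null(checker)) stop("Invalid distribution: ", dist)
  checker(args)
  invisible(TRUE)
}

.evalDef("uniform", list(formula = "0;10"))  # passes all checks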

Add CITATION file

After the JOSS paper is published we should create a proper CITATION file for the package.

genCorOrdCat fails for rho < 0 despite doc stating otherwise.

library(simstudy)
baseProbs1 <- matrix(0.25, 5, 4)

dT <- genData(1000)

dX <- genCorOrdCat(dT,
  adjVar = NULL, baseprobs = baseProbs1,
  prefix = "q", rho = -.5, corstr = "cs"
)
#> Error in mvnfast::rmvn(n = n, mu = mu, sigma = varMatrix): chol(): decomposition failed

Created on 2020-09-19 by the reprex package (v0.3.0.9001)
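A likely explanation (an inference from the error, not confirmed in the thread): a compound-symmetry correlation matrix with k variables is positive definite only when rho > -1/(k-1), so with the 5 variables implied by baseProbs1 the bound is -0.25, and rho = -0.5 cannot yield a valid matrix. A quick check:

k <- 5
rho <- -0.5
sigma <- matrix(rho, k, k)
diag(sigma) <- 1
min(eigen(sigma)$values)  # negative, so chol() must fail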

Use R-expression for formula arguments

I guess a more R-like way (which is a matter of taste) would be to specify formula = -2 + age * 0.1, i.e., as an R expression; see ?deparse and ?substitute. The same goes for varname and dist. See the base R functions transform() and subset() for inspiration.

As brought up by gagolews in openjournals/joss-reviews/issues/2763, we should think about this long term. It is a breaking change, so not a small issue.
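A sketch of what such an interface might look like (defDataExpr is hypothetical, not part of simstudy): capture the unevaluated expression and hand it to the existing string-based machinery.

defDataExpr <- function(dtDefs = NULL, varname, formula, dist = "normal", ...) {
  f <- deparse(substitute(formula))  # capture the expression as a string
  simstudy::defData(dtDefs, varname = varname, formula = f, dist = dist, ...)
}

def <- defDataExpr(varname = "age", formula = 10, dist = "nonrandom")
def <- defDataExpr(def, varname = "risk", formula = -2 + age * 0.1, dist = "nonrandom")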

double-dot notation for formula

The double-dot notation for external variable reference is awesome. A couple of things.

(1) In my workflow, I often put the data definitions at the top, and it seems natural to sometimes reference the external variable first and define it below:

 #--- this doesn't work but is the way I like to do things

def <- defData(varname = "age", formula=10, dist = "nonrandom")
def <- defData(def, varname="agemult", 
               formula="age * ..age_effect", dist="nonrandom")
#> Error: Escaped variables referenced not defined (or not numeric): age_effect

age_effect <- 3
genData(2, def)
#>    id age
#> 1:  1  10
#> 2:  2  10

  #--- this does work as I am sure you know

age_effect <- 3

def <- defData(varname = "age", formula=10, dist = "nonrandom")
def <- defData(def, varname="agemult", 
               formula="age * ..age_effect", dist="nonrandom")

genData(2, def)
#>    id age agemult
#> 1:  1  10      30
#> 2:  2  10      30

(2) The strength of the double-dot notation is how it makes dynamic data definition so easy. To generate different data sets in R using different assumptions, there are at least two ways. The first approach, using for loops, works fine, but I find it a little clunky and a bit less amenable to parallelization. (And, by the way, this is the perfect kind of case where it doesn't make sense to introduce age_effect before creating def. In fact, it is just not possible - except to create a dummy version of age_effect, which is not so appealing. But, maybe there's no way around this.)

  #--- this works but is a little clunky

list_of_data <- list()

age_effects <- c(0, 5, 10)
for (i in seq_along(age_effects)) {
  age_effect <- age_effects[i]
  list_of_data[[i]] <- genData(2, def)  
}

list_of_data
#> [[1]]
#>    id age agemult
#> 1:  1  10       0
#> 2:  2  10       0
#> 
#> [[2]]
#>    id age agemult
#> 1:  1  10      50
#> 2:  2  10      50
#> 
#> [[3]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100

I prefer to use lapply or mclapply - it seems more compact and faster than the for loop, especially when doing 1000's of replications. But it doesn't work - and I am not sure how hard it would be to make it work.

  #--- this doesn't work but is how I like to do things

myreplicate <- function(x, def) {
  age_effect <- x
  genData(2, def)
}

lapply(c(0, 5, 10), function(x) myreplicate(x, def))
#> [[1]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100
#> 
#> [[2]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100
#> 
#> [[3]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100
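Until the scoping is sorted out, one possible stopgap - a sketch, assuming the ..var lookup is resolved in the global environment - is to assign the value there explicitly inside the replicate function (with the obvious caveat that mutating the global environment is not ideal):

myreplicate <- function(x, def) {
  # make age_effect visible where the ..var lookup happens
  assign("age_effect", x, envir = .GlobalEnv)
  genData(2, def)
}

lapply(c(0, 5, 10), function(x) myreplicate(x, def))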

Question: Generate fixed effects in nested structure given the outcome

Hi,
Is there any chance you could help me out with the following?

Is it possible to add x1 and x2 predictors to a dataset storing the outcome y with subject and observation variables defining nesting?

I'm looking for this because, when simulating data, I want to control the amount of variance in y that is specific to the between and within levels. So that running a NULL model of lmer(y ~ 1 + (1 | subject)) shows that the variances of the random intercepts and residuals are, for example, 1 and 1 - a 50/50 ratio (or any other of my choice). This part I know how to do using base R.

What I'm looking for is a way to extend the null model dataset (leaving y as it was) with x1 and x2 predictors that can (but don't have to) be measured at different levels (subject level or observation level), while controlling their correlations and overall effect on y - as in running an exemplary model of lmer(y ~ x1 + x2 + (x1 | subject), data).

All the best,
blazej

addCondition check for condition definition fails incorrectly if definition is part of a list

addCondition takes a "condition definition" as its first argument, which is a data.table of conditional definitions, like this:

defS <- defCondition(defS, condition = "ss==2",  
                       formula = "(0.14 + a) * (C_rv==1) + (0.15 + a) * (C_rv==2) + (0.16 + a) * (C_rv==3)", 
                       dist = "nonrandom")
defS <- defCondition(defS, condition = "ss==3",  
                       formula = "(0.19 + a) * (C_rv==1) + (0.20 + a) * (C_rv==2) + (0.21 + a) * (C_rv==3)", 
                       dist = "nonrandom")

addCondition(defS, ...)

addCondition checks to see if defS exists, which in this case it does, and everything works fine.

However, if we have created several definitions and included them in a list of definitions, so that def_list <- list(defA = defA, defB = defB, defS = defS), this should work:

addCondition(def_list[["defS"]], ...)

but the check fails given the way it is set up.

addColumns under this scenario works fine, and I suspect that is because the check is part of the robust new checking system. addCondition looks like it is still under the old system, and it fails.

genOrdCat with two responses but single adjustment variable generates error

The following snippet generates an error. In the previous multivariate version, do you recall if there was the ability to shift the probabilities using the adjustment variable? I am guessing not. If not, maybe we should consider fixing this, and if it was, then maybe we should definitely fix.

library(simstudy)

d1 <- defData(varname = "z1", formula = 0, variance = .25)

dd <- genData(10, d1)

baseprobs <- matrix(c(.4, .3, .2, .1,
                      .5, .4, .05, .05), byrow = T, nrow = 2)

dd <- genOrdCat(dtName = dd, adjVar = "z1", baseprobs = baseprobs)
#> [1] 2
#> Error in rep(var, n): invalid 'times' argument

Add trtAssignment as a "distribution" to defData to ease flow?

Currently, the treatment assignment process using trtAssign breaks the flow of the creation of a data set. Usually there is an outcome variable that is a function of the treatment assignment - so that we need to add a column to the table after the treatment assignment is made.

# Data definitions (requires two definitions for the same data set)

d1 <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d1 <- defData(d1, varname = "x1", formula = "0;10", dist = "uniform")
d1 <- defData(d1, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")

d2 <- defDataAdd(varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")

# Data generation (requires three function calls)

dd <- genData(500, d1)
dd <- trtAssign(dd, strata = "g", grpName = "rx")
dd <- addColumns(d2, dd)

What if we added a "trtAssign" distribution to the data def table so that the treatment assignment can be part of a single data generation process? It would look like this:

# Only one data definition

d <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d <- defData(d, varname = "x1", formula = "0;10", dist = "uniform")
d <- defData(d, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")
d <- defData(d, varname = "rx", formula = "1;1", variance = "g", dist = "trtAssign")
d <- defData(d, varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")

# Only one function call to generate the data

dd <- genData(500, d)

The formula for trtAssign represents the treatment assignment ratio and defaults to "1;1", but it could be of any length - so "1;1;1;2" would mean four groups. The variance parameter represents the stratification. Multiple levels of stratification would be represented as "a;b;c", where a, b, and c are variable names (they really need to be categorical or factors). The functionality is exactly as it is in the function trtAssign.

Ready for R 4.0

https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

Do we need to change anything?

Functions rbinom(), rgeom(), rhyper(), rpois(), rnbinom(), rsignrank() and rwilcox() which have returned integer since R 3.0.0 and hence NA when the numbers would have been outside the integer range, now return double vectors (without NAs, typically) in these cases.

This sounds (minorly) interesting?

stopifnot() now allows customizing error messages via argument names,

Allow user defined functions in formulas

Hey!

While using simstudy I needed certain data that were dependent on variables previously defined through defData but not easily translatable to one of the default distributions. I solved this "outside" of simstudy with a function and mapply.

As this introduces additional elements in the script and breaks the definition tables' role as a complete overview of the defined data, I feel it would be nice to extend the definition tables/arguments to make that possible.

I have only had a glimpse at the responsible code, so I am not sure how much work this would be. But if you are open to the idea I would like to try. Maybe functions can be added to definition tables to have all needed information in one place? Let me know what you think! 😃
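For reference, here is a hypothetical sketch of the kind of "outside simstudy" workaround described above: generate the covariates with simstudy, then add the custom variable by hand with mapply.

library(simstudy)
library(data.table)

def <- defData(varname = "x", formula = 5, variance = 1, dist = "normal")
dd <- genData(10, def)

# a custom generator not expressible as one of the built-in distributions
myfun <- function(x) x^2 + rnorm(1, sd = 0.1)
dd[, y := mapply(myfun, x)]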

genData with dist = "categorical" is extremely slow

This code is now running very, very, very slow:

d <- defData(varname = "grp", formula = "0.125;0.25;0.375;0.25", dist = "categorical")
genData(100000, d)

Under the release version, this is pretty instantaneous. Under development, it takes about 10 seconds. Other distributions seem fine. It doesn't look like you made any changes that would affect this - probably an Rcpp issue.

Non-proportional odds assumption for genOrdCat

I noticed the problem resolved in Issue #83 in the first place because I need to generate ordinal categorical data without the assumption of proportional cumulative odds, which is what genOrdCat currently assumes. I did create a modified function genOrdCatNP in the non-proportional-odds branch (in the future I will do as you do and name the branch after the issue).

I set it up as a different function because I am not sure of the best way to proceed; I could see just adding it to the current function. Also, the argument names are not final by any means, and obviously the appropriate checks would need to be added for these arguments.

I definitely don't want to merge this without your approval and input. If it is not clear exactly what this minor addition is doing, I can explain. Maybe (probably) there is a better way to do this.

I would definitely do a blog entry about this to explain, which we can add to the current vignette. No rush, as I can use this function outside of simstudy for now, but just wanted to alert you.

Add genMixFormula

I had another idea for a useful function, particularly if you implement the vectorization. It could be nice to have a function genMixFormula(functions, probs) that takes two arguments: a vector of function names (or a double-dot variable that is itself a vector) and a vector of probabilities (which could be NULL, defaulting to equal probability 1/n with n functions being mixed).

Originally posted by @kgoldfeld in #44 (comment)
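A minimal sketch of what such a helper might do (the name and behavior are taken from the proposal above; this is not an existing simstudy function):

genMixFormula <- function(vars, probs = NULL) {
  if (is.null(probs)) probs <- rep(1 / length(vars), length(vars))
  stopifnot(isTRUE(all.equal(sum(probs), 1)))
  # build the mixture formula string expected by dist = "mixture"
  paste(vars, probs, sep = " | ", collapse = " + ")
}

genMixFormula(c("x1", "x2"))             # "x1 | 0.5 + x2 | 0.5"
genMixFormula(c("x1", "x2"), c(.3, .7))  # "x1 | 0.3 + x2 | 0.7"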

catProbs should warn when adjusting probabilities

I know it is clearly stated in the documentation, but it has caused silent bugs/issues in my simulations when catProbs adds a category or normalizes the values due to a typo. A warning would have saved me time debugging 😄

Sequential dependent simulation

I just added some code in the non-prop-odds-assumption branch that shows it is possible to generate data that is sequentially dependent over time. The user would specify a data definition that looks like this:

   varname                formula variance   dist     link
1:      y0                      0        2 normal identity
2:      x0                      0        4 normal identity
3:    y<t>  .5*x<t-1> + .8*y<t-1>        2 normal identity
4:    x<t> .6*x<t-1> + 0.4*x<t-2>        4 normal identity

where t represents a time period. The new function essentially converts this definition table into an expanded definition table that builds y1, x1, y2, x2, etc., depending on how many periods the user specifies. Here is what the new table would look like based on t=5:

    varname          formula variance   dist     link
 1:      y0                0        2 normal identity
 2:      x0                0        4 normal identity
 3:      y1    .5*x0 + .8*y0        2 normal identity
 4:      x1 .6*x0 + 0.4*x0*0        4 normal identity
 5:      y2    .5*x1 + .8*y1        2 normal identity
 6:      x2   .6*x1 + 0.4*x0        4 normal identity
 7:      y3    .5*x2 + .8*y2        2 normal identity
 8:      x3   .6*x2 + 0.4*x1        4 normal identity
 9:      y4    .5*x3 + .8*y3        2 normal identity
10:      x4   .6*x3 + 0.4*x2        4 normal identity
11:      y5    .5*x4 + .8*y4        2 normal identity
12:      x5   .6*x4 + 0.4*x3        4 normal identity

Once this wide file is generated, a long version can easily be created, so the final data set looks like this:

      id period          y          x
  1:   1      0 -0.6700170  0.1641139
  2:   1      1 -2.0767351  0.5040484
  3:   1      2 -1.4610563 -0.4516962
  4:   1      3 -2.4996284 -3.6780939
  5:   1      4 -4.3509236 -2.2907568
 ---                                 
596: 100      1  2.5804474  0.4800748
597: 100      2  1.5806715  1.8110704
598: 100      3  2.6266267 -1.5400280
599: 100      4  1.3507070 -2.9591094
600: 100      5 -0.8595617 -2.9844094

Obviously, much remains to be done or improved, but this is the idea.
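For illustration, a rough sketch of the core template substitution (expandFormula is hypothetical; the real implementation would also need to handle the t-2 boundary at t = 1, as in row 4 of the expanded table above):

expandFormula <- function(formula, t) {
  # replace the lagged placeholders first, then the current-period one
  f <- gsub("<t-1>", t - 1, formula, fixed = TRUE)
  f <- gsub("<t-2>", t - 2, f, fixed = TRUE)
  gsub("<t>", t, f, fixed = TRUE)
}

expandFormula(".5*x<t-1> + .8*y<t-1>", 3)  # ".5*x2 + .8*y2"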

Set name of id field in genData

At the moment, genData() includes an argument id to specify the name of the id/index field in the data set being generated. This only works if no definition table created by defData is being used. This should be changed so that the argument works all the time. It would override any definition provided in the first defData statement.

Rownames on probability matrix breaks genCorOrdCat

When adding row names to the base probability matrix, the function does not work anymore. Reproducible example:

baseprobs.pre <- matrix(c(
 .336,.121,.543,0,0,
 .259,.237,.504,0,0,
 .462,.161,.377,0,0,
 .024,.165,.329,.412,.070,
 .141,.278,.395,.167,.019,
 .569,.197,.234,0,0,
 .727,.167,.106,0,0,
 .172,.155,.354,.314,.005,
 .513,.273,.178,.034,.002,
 .085,.090,.283,.478,.064,
 .152,.281,.567,0,0,
 .349,.188,.463,0,0,
 .504,.423,.071,.002,0,
 .385,.371,.197,.045,.002,
 .093,.386,.345,.168,.008,
 .536,.247,.181,.036,0,
 .979,.014,.007,0,0
),ncol = 5, byrow = T)

genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = .2, corstr = "cs")

        id grp1 grp10 grp11 grp12 grp13 grp14 grp15 grp16 grp17 grp2 grp3 grp4 grp5 grp6 grp7 grp8 grp9
   1:    1    2     4     3     1     1     2     3     2     1    3    3    2    1    1    1    1    3

rownames(baseprobs.pre) <- c("H04","H05","H06","H01","H10","H12","H16","H02","H03","H07","H13","H14","H08","H09","H11","H15","H17") 

genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = .2, corstr = "cs")

Error in genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = 0.2,  : 
  Probabilities are not properly specified: each row must sum to one
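A possible stopgap until the check handles named rows: simply drop the row names before the call.

bp <- baseprobs.pre
rownames(bp) <- NULL
genCorOrdCat(data_pre, baseprobs = bp, rho = .2, corstr = "cs")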

trtAssign fails when strata has count of one

trtAssign generates an error when a strata level has a count of one. The following snippet will generate an error:

library(simstudy)
library(data.table)

dx <- data.table(x = rep(c(0, 1, 0, 1), times = c(7, 1, 4, 4)),
                 y = rep(c(0, 1), each = 8))

dx <- trtAssign(dx, strata = c("x", "y"))

Surprisingly, this does not generate an error:

dx <- data.table(x = rep(c(0, 1, 0, 1), times = c(1, 7, 4, 4)),
                 y = rep(c(0, 1), each = 8))

dx <- trtAssign(dx, strata = c("x", "y"))

pass-by-value vs. pass-by-reference

I was just looking through the code and I noticed something changed that probably shouldn't have. Whenever we pass a data.table as an argument to a function, it is imperative to copy the data table to another data.table. So you might see something like:

foo <- function(dtName, ...) {
   dT <- copy(dtName)
   .
   .
}

This addresses a quirk of data.table's reference semantics - with :=, the original data.table gets changed by the function call even if that was not the intention (and maybe there is a better workaround, but this was the only way I could figure out). For example:

library(data.table)

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

# doesn't do what you think it does

chk <- function(d) {
  d[, z := c(9, 10, 11)]
  d[]
}

chk(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11

# this does

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

chk2 <- function(d) {
  d2 <- copy(d)
  d2[, z := c(9, 10, 11)]
  d2[]
}

chk2(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

Originally posted by @kgoldfeld in #49 (comment)

User defined distributions

This is of course already possible by using functions in the nonrandom distribution. But how about formalizing user-defined distributions by allowing the user to pass either one of the usual dist strings or a double-dot prefixed function name that is then used to generate the data? What do you think @kgoldfeld?

Something like

newDistribution <- function(n, formula, variance, link, dtSim, envir) {
    # some funky new data generation function
}
def <- defData(varname = "ud", dist = "..newDistribution", formula = 5)

link = "logit" causes additional category

Hey,

just to make sure, if I understand you correctly here: https://www.rdatagen.net/page/ordinal/
then it is intended to have n+1 categories for n "probabilities" when using the logit link, as they are thresholds rather than probabilities, as with the identity link.

Is this correct? If so, we should add this to the documentation (maybe it's in there and I'm just blind :D).

specifying "id" field for a new data set

The current approach to providing the name for the "id" field is not ideal. It can be done either in the first call to defData or, if no definition table is provided, in genData. I would like to think about making this consistent.

def <- defData(varname = "x1", dist = "binary", formula = .5, id = "cid")

genData(5, id = "xid")
genData(5, def, id = "xid")

new branch structure

I would suggest that we follow a model similar to this with:

  • stable/release branch - always the current released version (the one on CRAN/in submission)
  • main (formerly master) - latest version with the newest features, etc. (but not with in-progress code; should be runnable as is)
  • development/feature branches - e.g. "formula" for integrating the new formula system or "restructure" for the new .R file structure, etc. We could use prefixes as shown in the linked document, but I think that is unnecessary given the size of this project :)

Release simstudy 0.2.0

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • submit to rweekly

Validate genOrdCat corstr

I know this has been merged - but in the future, we should be checking for a valid corstr (currently "ind", "cs", and "ar1"). Right now, if it isn't one of those, it defaults to "ind", which is fine, but there should be a warning.

Originally posted by @kgoldfeld in #47 (comment)

Unify genOrdCat & genCorOrdCat

Currently genOrdCat requires a variable to be specified in the adjVar argument - there is no real reason this needs to be the case. If the value of adjVar is NULL, then the adjustment should be assumed to be 0 in all cases.

Merge genCorOrdCat into genOrdCat (with no correlation as the default) and allow creation of 1 or many new categorical variables. Deprecate genCorOrdCat (e.g. like this).

genCorOrdCat

Sincerest apologies - I figured out why this isn't working for me: the probabilities in baseprobs don't add to one (they are off in the fifth or sixth decimal place).

make double dot work with mclapply

as mentioned in #41, this does not work but should:

  #--- this doesn't work but is how I like to do things

myreplicate <- function(x, def) {
  age_effect <- x
  genData(2, def)
}

lapply(c(0, 5, 10), function(x) myreplicate(x, def))
#> [[1]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100
#> 
#> [[2]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100
#> 
#> [[3]]
#>    id age agemult
#> 1:  1  10     100
#> 2:  2  10     100

testing

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Generate data from correlation matrix with mixed variable types

Hi, thanks for your package!

I am trying to simulate correlated data from a correlation matrix that contains mixed variable types. I have 8 variables with 5 being categorical and 3 numeric ones.

I was trying to do that using genCorGen(), but I can only specify distributions for one type of variable at a time (e.g. binary for categorical, or gamma for numeric).

Is it possible to use my one correlation matrix with 8 variables to simulate data with mixed distribution types?
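One avenue worth checking - a sketch, assuming your version of simstudy exports genCorFlex, which generates correlated data with per-variable marginal distributions defined via defData (whether every distribution you need, e.g. categorical, is supported would have to be verified):

library(simstudy)

def <- defData(varname = "x1", formula = 10, variance = 0.5, dist = "gamma")
def <- defData(def, varname = "x2", formula = "0;10", dist = "uniform")
def <- defData(def, varname = "x3", formula = 0, variance = 1, dist = "normal")

dd <- genCorFlex(1000, def, rho = 0.4, corstr = "cs")
round(cor(dd[, .(x1, x2, x3)]), 2)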

consolidate .R files into new structure

Changing the file structure should not have any impact on the compiled package, as the function names etc. remain the same (because there are no .R files in a packed package anyway).

In light of this, I have written up a new structure that could serve as a basis. After implementing something like this, we can see how the workflow feels and whether we want to adjust further.

int_util.R
- all general utility functions like .isError etc.

util.R
- utility functions like trimData

generate_data.R
- all gen* functions related to data generation

int_generate_data.R
- all internal gen* functions related to data generation

generate_util.R
- gen* related helper functions like e.g. betaGetShapes

define.R
- def* functions

add.R
- add* functions

update.R
- update* functions

treatments.R
- trt* functions
- maybe all "treatment" simulation related functions should move here too? like addPeriods, genObs, defSurv etc.?

splines.R
- all spline related functions

Vignettes

I just uploaded a new version of the "Overview" vignette to the Vignette branch. We need to see if it can be generated with the package. Tell me what I need to do in order to facilitate that - such as generate the html file. And, I want to make sure that it passes all the rhub checks, because that was giving me a problem a couple years ago, and that is why I moved everything to my website.

If this one works, I can move all the other vignettes over as well, and improve on them, of course, over time.

And if you think the overview is still not good, let me know. One thing I considered was including R examples for each of the distributions, but thought that might be too much. But, it would be easy to add. And any other suggestions would be welcome.

Generating categorical outcome causes R to crash

I just noticed that in your latest version of simstudy in the "more-tests" branch, generating data using the "categorical" distribution causes R to crash. Also, I saw that genOrdCat also caused R to crash - I suspect the cause is the same. Here is a simple case that leads to problems:

defS <- defData(varname = "size", formula = "0.3;0.5;0.2", dist = "categorical")
ds <- genData(12, defS)

Another double-dot feature

I am adding this not because we necessarily need to decide whether it makes sense before the release of 0.2.0 (maybe this is for 0.3.0). (In general, do we have a list of things that we must complete before 0.2.0 is ready - as opposed to things that might be nice to include?)

The new idea would be to allow for vectors to be included in double-dot notation. Here is a non-vectorized case where it might be nice:

library(simstudy)

size1 <- 2
size2 <- 4

def <- defData(varname = "blksize", 
    formula = "..size1 | .5 + ..size2 | .5", dist = "mixture")
genData(10, def)
#>     id blksize
#>  1:  1       4
#>  2:  2       4
#>  3:  3       4
#>  4:  4       2
#>  5:  5       2
#>  6:  6       2
#>  7:  7       4
#>  8:  8       4
#>  9:  9       4
#> 10: 10       2

However, it would be more compact to write this as a vector, but obviously, it doesn't work:

sizes <- c(2, 4)

def <- defData(varname = "blksize", 
               formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")
#> Error in .checkMixture(newform): Invalid variable(s): 
#> Probabilities can only be numeric or numeric ..vars (not arrays). See ?distribution
