kgoldfeld / simstudy
simstudy: Illuminating research methods through data generation
Home Page: https://kgoldfeld.github.io/simstudy/
License: GNU General Public License v3.0
library(simstudy)
baseProbs1 <- matrix(0.25, 5, 4)
dT <- genData(1000)
dX <- genCorOrdCat(dT,
adjVar = NULL, baseprobs = baseProbs1,
prefix = "q", rho = -.5, corstr = "cs"
)
#> Error in mvnfast::rmvn(n = n, mu = mu, sigma = varMatrix): chol(): decomposition failed
Created on 2020-09-19 by the reprex package (v0.3.0.9001)
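One likely explanation for the failed decomposition, offered as a sketch rather than a confirmed diagnosis: a compound-symmetry correlation matrix for k variables is positive definite only when rho > -1/(k-1). The 5-row baseprobs matrix implies k = 5 variables, so rho = -0.5 falls below the bound of -0.25. The helpers csMatrix and isPosDef are hypothetical base R illustrations and are not part of simstudy.

```r
# Build a k x k compound-symmetry correlation matrix (base R only).
csMatrix <- function(k, rho) {
  m <- matrix(rho, nrow = k, ncol = k)
  diag(m) <- 1
  m
}

# A symmetric matrix is positive definite when all eigenvalues are > 0.
isPosDef <- function(m, tol = 1e-8) {
  min(eigen(m, symmetric = TRUE, only.values = TRUE)$values) > tol
}

k <- 5
lowerBound <- -1 / (k - 1)            # -0.25 for five variables
badRho  <- isPosDef(csMatrix(k, -0.5)) # FALSE: smallest eigenvalue is 1 + 4 * (-0.5) = -1
goodRho <- isPosDef(csMatrix(k, -0.2)) # TRUE: smallest eigenvalue is 1 + 4 * (-0.2) = 0.2
```

A check of this kind before the call to the multivariate normal generator would let the function fail with a message about rho instead of a chol() error.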
I know it is clearly stated in the documentation, but it has caused silent bugs in my simulations when catProbs adds a category or normalizes the values because of a typo. A warning would have saved me time debugging 😄
Currently there is no check for negative probabilities in .adjustProbs.
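A minimal sketch of the kind of check the two reports above ask for. The function checkProbs is hypothetical (it is not the package's catProbs or .adjustProbs): it rejects negative values and warns whenever it has to normalize or add a remainder category.

```r
# Hypothetical validator: stop on negative values, warn on silent adjustments.
checkProbs <- function(probs, tol = 1e-8) {
  if (any(probs < 0)) {
    stop("Probabilities must be non-negative.")
  }
  total <- sum(probs)
  if (total > 1 + tol) {
    warning("Probabilities sum to more than 1 and were normalized.")
    probs <- probs / total
  } else if (total < 1 - tol) {
    warning("Probabilities sum to less than 1; a remainder category was added.")
    probs <- c(probs, 1 - total)
  }
  probs
}

ok <- checkProbs(c(0.2, 0.3, 0.5))                     # returned unchanged, no warning
padded <- suppressWarnings(checkProbs(c(0.2, 0.3)))    # gains a third category of 0.5
negative <- try(checkProbs(c(0.5, -0.1, 0.6)), silent = TRUE)  # error
```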
Currently, the treatment assignment process using trtAssign breaks the flow of the creation of a data set. Usually there is an outcome variable that is a function of the treatment assignment, so we need to add a column to the table after the treatment assignment is made.
# Data definitions (requires two definitions for the same data set)
d1 <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d1 <- defData(d1, varname = "x1", formula = "0;10", dist = "uniform")
d1 <- defData(d1, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")
d2 <- defDataAdd(varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")
# Data generation (requires three function calls)
dd <- genData(500, d1)
dd <- trtAssign(dd, strata = "g", grpName = "rx")
dd <- addColumns(d2, dd)
What if we added a "trtAssign" distribution to the data def table so that the treatment assignment can be part of a single data generation process? It would look like this:
# Only one data definition
d <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d <- defData(d, varname = "x1", formula = "0;10", dist = "uniform")
d <- defData(d, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")
d <- defData(d, varname = "rx", formula = "1;1", variance = "g", dist = "trtAssign")
d <- defData(d, varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")
# Only one function call to generate the data
dd <- genData(500, d)
The formula for trtAssign represents the treatment assignment ratio. It defaults to "1;1" but could be of any length, so "1;1;1;2" would define four groups. The variance parameter represents the stratification. Multiple levels of stratification would be represented as "a;b;c", where a, b, and c are variable names (these really need to be categorical or factors). The functionality is exactly as it is in the function trtAssign.
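To make the ratio and stratification semantics concrete, here is a minimal base R sketch. The helper assignWithinStrata is hypothetical and is not how trtAssign is implemented; it repeats the ratio pattern within each stratum and shuffles it. Groups are numbered from 1 here, which is an assumption of the sketch.

```r
# Hypothetical stratified assignment: ratio c(1, 1, 1, 2) means four groups,
# with the fourth group receiving twice as many units as each of the others.
assignWithinStrata <- function(strata, ratio = c(1, 1)) {
  grp <- integer(length(strata))
  pattern <- rep(seq_along(ratio), times = ratio)
  for (s in unique(strata)) {
    idx <- which(strata == s)
    # Recycle the pattern to the stratum size, then permute it by index
    alloc <- rep(pattern, length.out = length(idx))
    grp[idx] <- alloc[sample.int(length(idx))]
  }
  grp
}

set.seed(1)
g <- rep(1:3, times = c(4, 4, 4))
rx <- assignWithinStrata(g, ratio = c(1, 1))
balance <- table(g, rx)   # two units per group in every stratum
```

When a stratum size is not a multiple of the pattern length, this sketch gives the leftover units to the earliest groups in the pattern; a real implementation would probably randomize that as well.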
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
rhub::check(platform = 'ubuntu-rchk')
rhub::check_with_sanitizers()
cran-comments.md
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
I just uploaded a new version of the "Overview" vignette to the Vignette branch. We need to see if it can be generated with the package. Tell me what I need to do in order to facilitate that - such as generate the html file. And, I want to make sure that it passes all the rhub checks, because that was giving me a problem a couple years ago, and that is why I moved everything to my website.
If this one works, I can move all the other vignettes over as well, and improve on them, of course, over time.
And if you think the overview is still not good, let me know. One thing I considered was including R examples for each of the distributions, but thought that might be too much. But, it would be easy to add. And any other suggestions would be welcome.
After the JOSS paper is published we should create a proper CITATION file for the package.
I have a question about using the simulated correlated data. Suppose I generate three variables a, b and c with a particular correlation matrix using genCorData. I would like to simulate a new variable d with d = 0.2*a+0.3*b+0.5*c and variance 1. However, I haven't figured out a way to do that, since addCorData doesn't have an option to impose variance and the correlated variables a, b, c are not part of a data definition table.
Thanks!
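One way to read this request is that d should have total variance 1. Under that reading, the variance contributed by the linear combination is w' Sigma w, and the remainder can be added as independent noise. The sketch below uses base R only; the correlation matrix (0.3 off the diagonal, unit variances) is an assumed example, not taken from the question.

```r
set.seed(123)
w <- c(a = 0.2, b = 0.3, c = 0.5)

# Assumed covariance matrix of (a, b, c), for illustration only
Sigma <- matrix(0.3, nrow = 3, ncol = 3)
diag(Sigma) <- 1

varLin <- drop(t(w) %*% Sigma %*% w)  # variance of 0.2*a + 0.3*b + 0.5*c
residVar <- 1 - varLin                # noise variance needed so that Var(d) = 1
stopifnot(residVar >= 0)

# Correlated normals via the Cholesky factor: Z %*% chol(Sigma) has covariance Sigma
n <- 10000
X <- matrix(rnorm(n * 3), nrow = n, ncol = 3) %*% chol(Sigma)
colnames(X) <- names(w)

d <- drop(X %*% w) + rnorm(n, mean = 0, sd = sqrt(residVar))
```

With these assumed inputs varLin is 0.566, so the noise term carries a variance of 0.434. If the intent was instead a residual variance of 1, the noise term would simply use sd = 1.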
I just added some code in the non-prop-odds-assumption branch that shows it is possible to generate data that is sequentially dependent over time. The user would specify a data definition that looks like this:
   varname                formula variance   dist     link
1:      y0                      0        2 normal identity
2:      x0                      0        4 normal identity
3:    y<t>  .5*x<t-1> + .8*y<t-1>        2 normal identity
4:    x<t> .6*x<t-1> + 0.4*x<t-2>        4 normal identity
where t represents a time period. The new function essentially converts this definition table into an expanded definition table that builds y1, x1, y2, x2, etc., depending on how many periods the user specifies. Here is what the new table would look like based on t=5:
    varname          formula variance   dist     link
 1:      y0                0        2 normal identity
 2:      x0                0        4 normal identity
 3:      y1    .5*x0 + .8*y0        2 normal identity
 4:      x1 .6*x0 + 0.4*x0*0        4 normal identity
 5:      y2    .5*x1 + .8*y1        2 normal identity
 6:      x2   .6*x1 + 0.4*x0        4 normal identity
 7:      y3    .5*x2 + .8*y2        2 normal identity
 8:      x3   .6*x2 + 0.4*x1        4 normal identity
 9:      y4    .5*x3 + .8*y3        2 normal identity
10:      x4   .6*x3 + 0.4*x2        4 normal identity
11:      y5    .5*x4 + .8*y4        2 normal identity
12:      x5   .6*x4 + 0.4*x3        4 normal identity
Once this wide file is generated, a long version can easily be created, so the final data set looks like this:
id period y x
1: 1 0 -0.6700170 0.1641139
2: 1 1 -2.0767351 0.5040484
3: 1 2 -1.4610563 -0.4516962
4: 1 3 -2.4996284 -3.6780939
5: 1 4 -4.3509236 -2.2907568
---
596: 100 1 2.5804474 0.4800748
597: 100 2 1.5806715 1.8110704
598: 100 3 2.6266267 -1.5400280
599: 100 4 1.3507070 -2.9591094
600: 100 5 -0.8595617 -2.9844094
Obviously, much remains to be done or improved, but this is the idea.
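The expansion step can be sketched with plain string substitution. The helper expandTime below is hypothetical and is not the code in the branch: it replaces <t> and <t-k> placeholders with concrete period numbers. It does not handle lags that reach before period 0 (the x1 row above), which the real function evidently treats specially.

```r
# Hypothetical helper: substitute a concrete period for <t> and <t-k>.
expandTime <- function(template, t) {
  # Find every lagged placeholder such as <t-1> or <t-2>
  lags <- regmatches(template, gregexpr("<t-([0-9]+)>", template))[[1]]
  for (lag in unique(lags)) {
    k <- as.integer(sub("<t-([0-9]+)>", "\\1", lag))
    template <- gsub(lag, as.character(t - k), template, fixed = TRUE)
  }
  # Then replace the plain <t> placeholder
  gsub("<t>", as.character(t), template, fixed = TRUE)
}

name5 <- expandTime("x<t>", 5)                      # "x5"
form5 <- expandTime(".6*x<t-1> + 0.4*x<t-2>", 5)    # ".6*x4 + 0.4*x3"
form3 <- expandTime(".5*x<t-1> + .8*y<t-1>", 3)     # ".5*x2 + .8*y2"
```

Applying a helper like this to each template row for t = 1, ..., 5 reproduces rows 3 through 12 of the expanded table, apart from the special handling of the first period.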
While getting the new evalDef working (need to pass link from defData), I noticed that the first defined variable is not checked by evalDef, only tested to see if it is scalar.
The following snippet generates an error. In the previous multivariate version, do you recall if there was the ability to shift the probabilities using the adjustment variable? I am guessing not. If not, maybe we should consider fixing this; if there was, then we should definitely fix it.
library(simstudy)
d1 <- defData(varname = "z1", formula = 0, variance = .25)
dd <- genData(10, d1)
baseprobs <- matrix(c(.4, .3, .2, .1,
.5, .4, .05, .05), byrow = T, nrow = 2)
dd <- genOrdCat(dtName = dd, adjVar = "z1", baseprobs = baseprobs)
#> [1] 2
#> Error in rep(var, n): invalid 'times' argument
I was just looking through the code and I noticed something changed that probably shouldn't have. Whenever we pass a data.table as an argument to a function, it is imperative to copy the data table to another data.table. So you might see something like:
foo <- function(dtName, ...) {
dT <- copy(dtName)
.
.
}
This addresses a weird R problem (and maybe there is a better work around, but this was the only way I could figure out) where the original data.table gets changed by the function call even if that was not the intention. For example:
library(data.table)
dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#> x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5
# doesn't do what you think it does
chk <- function(d) {
d[, z := c(9, 10, 11)]
d[]
}
chk(dd)
#> x y z
#> 1: 1 3 9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#> x y z
#> 1: 1 3 9
#> 2: 2 4 10
#> 3: 3 5 11
# this does
dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#> x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5
chk2 <- function(d) {
d2 <- copy(d)
d2[, z := c(9, 10, 11)]
d2[]
}
chk2(dd)
#> x y z
#> 1: 1 3 9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#> x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5
Originally posted by @kgoldfeld in #49 (comment)
trtAssign generates an error when a strata level has a count of one. The following snippet will generate an error:
dx <- data.table(x = rep(c(0, 1, 0 ,1), times = c(7, 1, 4, 4)),
y = rep(c(0, 1), each = 8))
dx <- trtAssign(dx, strata = c("x", "y"))
Surprisingly, this does not generate an error:
dx <- data.table(x = rep(c(0, 1, 0 ,1), times = c(1, 7, 4, 4)),
y = rep(c(0, 1), each = 8))
dx <- trtAssign(dx, strata = c("x", "y"))
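The source does not identify the cause, but behaviour that depends on which stratum has a single member is reminiscent of a well-known base R pitfall: sample(x) treats a single number x as "sample from 1:x". The sketch below only illustrates that pitfall and a safe alternative; whether it explains the trtAssign error is an assumption. safeShuffle is a hypothetical helper.

```r
# sample() on a length-one numeric vector does not return that element:
x <- 5
lenUnsafe <- length(sample(x))   # 5, because sample(5) permutes 1:5

# Permuting by index avoids the special case
safeShuffle <- function(v) v[sample.int(length(v))]
single <- safeShuffle(x)         # 5
```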
I just noticed that in your latest version of simstudy in the "more-tests" branch, generating data using the "categorical" distribution causes R to crash. Also, I saw that genOrdCat also caused R to crash - I suspect the cause is the same. Here is a simple case that leads to problems:
defS <- defData(varname = "size", formula = "0.3;0.5;0.2", dist = "categorical")
ds <- genData(12, defS)
There is no error; it just produces NAs.
def <- defData( varname = "x1", dist = "uniformInt", formula = "30;20")
genData(5,def)
id x1
1: 1 NA
2: 2 NA
3: 3 NA
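A sketch of the validation this report implies: fail early when the lower bound exceeds the upper bound. The helper checkRange is hypothetical and only handles numeric literals in a "min;max" string; bounds given as expressions that reference other variables would need to be evaluated first.

```r
# Hypothetical check for a "min;max" range formula with numeric bounds.
checkRange <- function(formula) {
  bounds <- suppressWarnings(
    as.numeric(strsplit(formula, ";", fixed = TRUE)[[1]])
  )
  if (length(bounds) != 2 || anyNA(bounds)) {
    stop("Formula must have the form 'min;max' with numeric bounds.")
  }
  if (bounds[1] > bounds[2]) {
    stop("Lower bound ", bounds[1], " is greater than upper bound ", bounds[2], ".")
  }
  bounds
}

valid <- checkRange("20;30")                        # c(20, 30)
reversed <- try(checkRange("30;20"), silent = TRUE) # error instead of NAs
```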
Hi,
Is there any chance you could help me out with the following?
Is it possible to add x1 and x2 predictors to a dataset storing the outcome y, with subject and observation variables defining nesting?
I'm looking for this because, when simulating data, I want to control the amount of variance in y that is specific to the between and within levels, so that running a null model of lmer(y ~ 1 + (1 | subject)) shows that the variances of the random intercepts and residuals are, for example, 1 and 1 - a 50/50 ratio (or any other of my choice). This part I know how to do using base R.
What I'm looking for is a way to extend the null model dataset (leaving y as it was) with x1 and x2 predictors that can (but don't have to) be measured at different levels (subject level or observation level) while controlling for their correlations and overall effect on y, as in running an example model of lmer(y ~ x1 + x2 + (x1 | subject), data).
All the best,
blazej
I had another idea for a useful function, particularly if you implement the vectorization. It could be nice to have a function genMixFormula(functions, probs) that takes two arguments: a vector of function names (or a double-dot variable that is itself a vector) and a vector of probabilities (which could be NULL, defaulting to equal probability 1/n with n functions being mixed).
Originally posted by @kgoldfeld in #44 (comment)
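A minimal sketch of what such a helper could look like, using the "var | prob + var | prob" mixture syntax that appears elsewhere in these issues. The name mixFormulaSketch is hypothetical, chosen to avoid implying anything about the signature the package eventually adopts.

```r
# Hypothetical helper: build a mixture formula string from names and probabilities.
mixFormulaSketch <- function(vars, probs = NULL) {
  if (is.null(probs)) {
    probs <- rep(1 / length(vars), length(vars))  # equal weights by default
  }
  stopifnot(length(vars) == length(probs), abs(sum(probs) - 1) < 1e-8)
  paste(paste(vars, probs, sep = " | "), collapse = " + ")
}

f <- mixFormulaSketch(c("..size1", "..size2"), c(0.5, 0.5))
# "..size1 | 0.5 + ..size2 | 0.5"
```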
I noticed the problem resolved in Issue #83 in the first place because I need to generate ordinal categorical data without the assumption of proportional cumulative odds, which is what is currently assumed in genOrdCat. I did create a modified function genOrdCatNP in the non-proportional-odds branch (I will start doing it as you do by naming it after an issue).
I set it up as a different function because I am not sure of the best way to proceed. I could see just adding it to the current function. Also, the argument names are not final by any means, and we would obviously need to add the appropriate checks to these arguments.
I definitely don't want to merge this without your approval and input. If it is not clear exactly what this minor addition is doing, I can explain. Maybe (probably) there is a better way to do this.
I would definitely do a blog entry about this to explain, which we can add to the current vignette. No rush, as I can use this function outside of simstudy for now, but just wanted to alert you.
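For readers unfamiliar with the assumption being relaxed: under proportional cumulative odds, an adjustment shifts every cumulative logit by the same amount. The base R sketch below illustrates that idea. The helper shiftProbs is hypothetical, and the sign convention (subtracting z) is an assumption of the sketch rather than a statement about genOrdCat's internals.

```r
# Hypothetical helper: shift all cumulative logits by the same adjustment z.
shiftProbs <- function(baseprobs, z) {
  cumBase <- cumsum(baseprobs)[-length(baseprobs)]  # cumulative probs, last cut dropped
  cumShift <- plogis(qlogis(cumBase) - z)           # same shift at every cut point
  diff(c(0, cumShift, 1))                           # back to category probabilities
}

base <- c(0.4, 0.3, 0.2, 0.1)
unchanged <- shiftProbs(base, 0)    # equals base
shifted <- shiftProbs(base, 0.7)    # mass moves toward the higher categories

# The cumulative odds ratio is the same at every cut point: exp(-0.7)
cumOdds <- function(p) { cp <- cumsum(p)[-length(p)]; cp / (1 - cp) }
oddsRatio <- cumOdds(shifted) / cumOdds(base)
```

A non-proportional version would allow a different shift at each cut point, which is presumably what the separate function adds.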
Changing the file structure should not have any impact on the compiled package, as the function names etc. remain the same (because there are no .R files in a packed package anyway).
In light of this I have written up a new structure that could serve as a basis. After implementing something like this we can see how work feels and if we want to adjust further.
int_util.R
- all general utility functions like .isError etc.
util.R
- utility functions like trimData
generate_data.R
- all gen* functions related to data generation
int_generate_data.R
- all internal gen** functions related to data generation
generate_util.R
- gen* related helper functions like e.g. betaGetShapes
define.R
- def* functions
add.R
- add* functions
update.R
- update* functions
treatments.R
- trt* functions
- maybe all "treatment" simulation related functions should move here too? like addPeriods, genObs, defSurv etc.?
splines.R
- all spline related functions
addCondition takes a "condition definition" as its first argument, which is a data.table of conditional definitions, like this:
defS <- defCondition(defS, condition = "ss==2",
formula = "(0.14 + a) * (C_rv==1) + (0.15 + a) * (C_rv==2) + (0.16 + a) * (C_rv==3)",
dist = "nonrandom")
defS <- defCondition(defS, condition = "ss==3",
formula = "(0.19 + a) * (C_rv==1) + (0.20 + a) * (C_rv==2) + (0.21 + a) * (C_rv==3)",
dist = "nonrandom")
addCondition(defS, ...)
addCondition checks to see if defS exists, which in this case it does, and everything works fine.
However, if we have created several definitions and included them in a list of definitions, so that def_list <- list(defA = defA, defB = defB, defS = defS), this should work:
addCondition(def_list[["defS"]], ...)
but the check fails given the way it is set up.
addColumns under this scenario works fine, and I suspect that is because the check is part of the robust new checking system. addCondition looks like it is still under the old system, and it fails.
I am not adding this so that we necessarily think about whether or not it makes it in before the release of 0.2.0 (maybe this is for 0.3.0). (In general, do we have a list of things that we must complete before 0.2.0 is ready, as opposed to things that might be nice to include?)
The new idea would be to allow for vectors to be included in double-dot notation. Here is a non-vectorized case where it might be nice:
library(simstudy)
size1 <- 2
size2 <- 4
def <- defData(varname = "blksize",
formula = "..size1 | .5 + ..size2 | .5", dist = "mixture")
genData(10, def)
#> id blksize
#> 1: 1 4
#> 2: 2 4
#> 3: 3 4
#> 4: 4 2
#> 5: 5 2
#> 6: 6 2
#> 7: 7 4
#> 8: 8 4
#> 9: 9 4
#> 10: 10 2
However, it would be more compact to write as a vector, but obviously, it doesn't work:
sizes <- c(2, 4)
def <- defData(varname = "blksize",
formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")
#> Error in .checkMixture(newform): Invalid variable(s):
#> Probabilities can only be numeric or numeric ..vars (not arrays). See ?distribution
I would suggest that we follow a model similar to this with:
use bench::cb with GitHub Actions to assert simstudy's good performance.
The current approach to providing the name for the "id" field is not ideal. It can be done either in the first call to defData or, if no definition table is provided to genData, it can be done there. I would like to think about making this consistent.
def <- defData(varname = "x1", dist = "binary", formula = .5, id = "cid")
genData(5, id = "xid")
genData(5, def, id = "xid")
As mentioned in #41, this does not work but should:
#--- this doesn't work but is how I like to do things
myreplicate <- function(x, def) {
age_effect <- x
genData(2, def)
}
lapply(c(0, 5, 10), function(x) myreplicate(x, def))
#> [[1]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
#>
#> [[2]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
#>
#> [[3]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
The example in the README is missing the dist argument -- including it would improve readability.
Hi, thanks for your package!
I am trying to simulate correlated data from a correlation matrix that contains mixed variable types. I have 8 variables with 5 being categorical and 3 numeric ones.
I was trying to do that using genCorGen() but I can only specify distributions for one type of variable at a time, (e.g. binary for categorical or gamma for numeric).
Is it possible to use my one correlation matrix with 8 variables that would simulate data with mixed distribution types?
Currently genOrdCat requires a variable to be specified in argument adjVar - there is no real reason this needs to be the case. If the value of adjVar is NULL, then the adjustment should be assumed to be 0 in all cases.
Merge genCorOrdCat into genOrdCat (with no correlation as default) and allow creation of one or many new categorical variables. Deprecate genCorOrdCat (e.g. like this)
https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
Do we need to change anything?
Functions rbinom(), rgeom(), rhyper(), rpois(), rnbinom(), rsignrank() and rwilcox() which have returned integer since R 3.0.0 and hence NA when the numbers would have been outside the integer range, now return double vectors (without NAs, typically) in these cases.
This sounds (minorly) interesting?
stopifnot() now allows customizing error messages via argument names,
Hey!
While using simstudy I needed certain data that were dependent on variables previously defined through defData but not easily translatable to one of the default distributions. I solved this "outside" of simstudy with a function and mapply.
As this introduces additional elements in the script and breaks definition tables as a complete overview of the defined data, I feel that it would be nice to extend the definition tables/arguments to make that possible.
I have only had a glimpse at the responsible code, so I am not sure how much work this would be. But if you are open to the idea I would like to try. Maybe functions can be added to definition tables to have all needed information in one place? Let me know what you think! 😃
Feel free to add or change anything as you see fit!
The goal is to refactor .evalDef to be easily testable, maintainable and extensible.
..var and exist in .GlobalEnv..vars.
Hello!
During the implementation of the pseudorandom distribution I came across .evalDef and noticed that it needs to be changed in several places to support a new distribution. It also applies certain checks to all formula arguments irrespective of their use in the chosen distribution (I had to switch from "1,2,3;4" to "1+2+3;4").
I think the function can be refactored to be easily extendable with new distributions and to allow for more possibilities when parsing arguments.
The function could use templates/patterns/helper functions (naming is hard) to check the applicable arguments but with tests dependent on the chosen distribution.
Btw, should we rename catProbs to genCatFormula to be in line with genFormula and genMixFormula?
Originally posted by @assignUser in #51 (comment)
At the moment, genData() includes an argument id to specify the name of the id/index field in the data set being generated. This only works if no definition table created by defData is being used. This should be changed so that the argument works all the time. It would override any definition provided in the first defData statement.
Hey,
just to make sure: if I understand you correctly here: https://www.rdatagen.net/page/ordinal/
then it is intended to have n+1 categories for n "probabilities" when using the logit link, as they are thresholds and not probabilities as when used with identity.
Is this correct? If so, we should add this to the documentation (maybe it's in there and I'm just blind :D).
When adding row names to the base probability matrix, the function does not work anymore. Reproducible example:
baseprobs.pre <- matrix(c(
.336,.121,.543,0,0,
.259,.237,.504,0,0,
.462,.161,.377,0,0,
.024,.165,.329,.412,.070,
.141,.278,.395,.167,.019,
.569,.197,.234,0,0,
.727,.167,.106,0,0,
.172,.155,.354,.314,.005,
.513,.273,.178,.034,.002,
.085,.090,.283,.478,.064,
.152,.281,.567,0,0,
.349,.188,.463,0,0,
.504,.423,.071,.002,0,
.385,.371,.197,.045,.002,
.093,.386,.345,.168,.008,
.536,.247,.181,.036,0,
.979,.014,.007,0,0
),ncol = 5, byrow = T)
genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = .2, corstr = "cs")
id grp1 grp10 grp11 grp12 grp13 grp14 grp15 grp16 grp17 grp2 grp3 grp4 grp5 grp6 grp7 grp8 grp9
1: 1 2 4 3 1 1 2 3 2 1 3 3 2 1 1 1 1 3
rownames(baseprobs.pre) <- c("H04","H05","H06","H01","H10","H12","H16","H02","H03","H07","H13","H14","H08","H09","H11","H15","H17")
genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = .2, corstr = "cs")
Error in genCorOrdCat(data_pre, baseprobs = baseprobs.pre, rho = 0.2, :
Probabilities are not properly specified: each row must sum to one
The double-dot notation for external variable reference is awesome. A couple of things.
(1) In my workflow, I often put the data definitions at the top, and it seems natural to sometimes first reference the external below
#--- this doesn't work but is the way I like to do things
def <- defData(varname = "age", formula=10, dist = "nonrandom")
def <- defData(def, varname="agemult",
formula="age * ..age_effect", dist="nonrandom")
#> Error: Escaped variables referenced not defined (or not numeric): age_effect
age_effect <- 3
genData(2, def)
#> id age
#> 1: 1 10
#> 2: 2 10
#--- this does work as I am sure you know
age_effect <- 3
def <- defData(varname = "age", formula=10, dist = "nonrandom")
def <- defData(def, varname="agemult",
formula="age * ..age_effect", dist="nonrandom")
genData(2, def)
#> id age agemult
#> 1: 1 10 30
#> 2: 2 10 30
(2) The strength of the double-dot notation is how it makes dynamic data definition so easy. To generate different data sets in R using different assumptions, there are at least two ways. The first approach, using for loops, works fine, but I find it a little clunky and a bit less amenable to parallelization. (And, by the way, this is the perfect kind of case where it doesn't make sense to introduce age_effect before creating def. In fact, it is just not possible, except by creating a dummy version of age_effect, which is not so appealing. But maybe there's no way around this.)
#--- this works but is a little clunky
list_of_data <- list()
age_effects <- c(0, 5, 10)
for (i in seq_along(age_effects)) {
age_effect <- age_effects[i]
list_of_data[[i]] <- genData(2, def)
}
list_of_data
#> [[1]]
#> id age agemult
#> 1: 1 10 0
#> 2: 2 10 0
#>
#> [[2]]
#> id age agemult
#> 1: 1 10 50
#> 2: 2 10 50
#>
#> [[3]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
I prefer to use lapply or mclapply, which seem more compact and faster than the for loop, especially when doing thousands of replications. But it doesn't work, and I am not sure how hard it would be to make it work.
#--- this doesn't work but is how I like to do things
myreplicate <- function(x, def) {
age_effect <- x
genData(2, def)
}
lapply(c(0, 5, 10), function(x) myreplicate(x, def))
#> [[1]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
#>
#> [[2]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
#>
#> [[3]]
#> id age agemult
#> 1: 1 10 100
#> 2: 2 10 100
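The lapply result above is consistent with the external variable being looked up in the global environment instead of in the calling function's frame, where the leftover age_effect of 10 from the for loop is still visible. That lookup mechanism is an assumption about the package internals; the base R sketch below only demonstrates the difference between the two lookups, using hypothetical helpers evalGlobal and evalCaller.

```r
# Two hypothetical evaluators that differ only in where they look up variables.
evalGlobal <- function(expr) eval(parse(text = expr), envir = globalenv())
evalCaller <- function(expr) eval(parse(text = expr), envir = parent.frame())

age_effect <- 10  # left over in the global environment, as after the for loop

viaGlobal <- function(x) {
  age_effect <- x               # local value, ignored by evalGlobal
  evalGlobal("10 * age_effect")
}
viaCaller <- function(x) {
  age_effect <- x               # local value, found by evalCaller
  evalCaller("10 * age_effect")
}

resGlobal <- sapply(c(0, 5, 10), viaGlobal)  # 100 100 100
resCaller <- sapply(c(0, 5, 10), viaCaller)  # 0 50 100
```

Resolving double-dot variables relative to the caller's environment is one way the lapply pattern could be made to work.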
This code is now running very, very, very slow:
d <- defData(varname = "grp", formula = "0.125;0.25;0.375;0.25", dist = "categorical")
genData(100000, d)
Under the release version, this is pretty instantaneous. Under development, it takes about 10 seconds. Other distributions seem fine. It doesn't look like you made any changes that would affect this - probably an Rcpp issue.
I guess a more R-way (which is a matter of taste) would be to specify formula = -2+age*0.1, i.e., as an R expression; see ?deparse and ?substitute. The same goes for varname and dist. See the base R functions transform() and subset() for inspiration.
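A minimal base R sketch of the mechanism being suggested: substitute() captures the unevaluated argument and deparse() turns it into the string form the definition table currently stores. The wrapper captureFormula is hypothetical.

```r
# Hypothetical wrapper: accept an unquoted expression and store it as a string.
captureFormula <- function(formula) {
  deparse(substitute(formula))
}

f <- captureFormula(-2 + age * 0.1)   # "-2 + age * 0.1"

# The stored string can later be evaluated against generated data
vals <- eval(parse(text = f), list(age = c(30, 40)))   # 1 2
```

The trade-off is that a non-standard-evaluation interface is harder to use programmatically, for example when formulas are assembled from strings.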
As brought up by gagolews in openjournals/joss-reviews/issues/2763 we should think about this long term. It is a breaking change so not a small issue.
I know this has been merged, but in the future we should be checking for a valid corstr (currently "ind", "cs", and "ar1"). Right now, if it isn't one of those, it defaults to "ind", which is fine, but there should be a warning.
Originally posted by @kgoldfeld in #47 (comment)
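A sketch of the requested behaviour: keep the fallback to "ind" but make it audible. The helper checkCorstr is hypothetical; the three valid values are the ones listed in the comment above.

```r
# Hypothetical validator: warn and fall back to "ind" on an unknown structure.
checkCorstr <- function(corstr) {
  valid <- c("ind", "cs", "ar1")
  if (!is.character(corstr) || length(corstr) != 1 || !corstr %in% valid) {
    warning("Unknown corstr; using \"ind\". Valid values: ",
            paste(valid, collapse = ", "))
    return("ind")
  }
  corstr
}

kept <- checkCorstr("ar1")                        # "ar1"
fallback <- suppressWarnings(checkCorstr("exch")) # "ind"
```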
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
rhub::check(platform = 'ubuntu-rchk')
rhub::check_with_sanitizers()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
I would propose that we use the tidyverse style guide just with camelCase instead of snake_case.
Let me know what you think @kgoldfeld :)
Sincerest apologies, but I am not sure why this isn't working for me. I have noticed that in the fifth or sixth decimal place the probabilities in baseprobs don't add to one.
((-21 - -52) - -75) is a valid expression and can be successfully eval'd, but .evalDef will flag it (pretty sure because of the - -).
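The claim is easy to confirm in base R: the parser accepts a binary minus followed by a unary minus, so any rejection comes from the formula checks and not from R itself.

```r
# Parse and evaluate the expression exactly as written in the report
expr <- "((-21 - -52) - -75)"
value <- eval(parse(text = expr))   # (-21 + 52) + 75 = 106
```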
My goal is to cover each function we work on during other issues (like #23) with tests while fixing these. This is pretty easy for deterministic functions like trimData but requires some thought when the function in question has a random element (i.e. all gen* functions etc.).
The most "basic" way to test these would be to create sample output using a seed, saving the output and during test use the same seed and compare if the function output and sample output are identical.
I am not in favor of this approach as it limits testing (e.g. just one input, no property-based testing, just simple comparison) and does not really do what tests should do. We want to test the API (or behavior) of the function, not its internals. When we use exact output to check against, we bind ourselves to the implementation, as any change in the generation function (either on our end or in packages we depend on) would break the test. And we would not know if something is actually wrong with the generated data or if just some other equally valid values were generated.
So I think the best way would be to actually analyse the output for the specified characteristics. Sounds easy enough but at this point I will need your help in some (most? 😁 ) cases.
For genOrdCat it is easy enough: just generate a large number of observations, calculate frequencies and compare them to the baseprobs. But for e.g. genCorOrdCat I am not sure what specifically needs to be checked (in the example you show how to check the baseprobs vs. generated data, so that's a start). So it would be great if I could depend on you in these cases to write some code to check for the relevant characteristics of the functions in question (currently: genCorOrdCat) and I will use that code to create proper full-featured tests.
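The frequency-based check described for genOrdCat can be sketched as follows. To keep the example self-contained, base R's sample() stands in for the generator under test; the sample size and the tolerance of 0.01 are assumptions chosen so that a correct generator passes with a wide margin.

```r
set.seed(2020)
probs <- c(0.4, 0.3, 0.2, 0.1)
n <- 100000

# Stand-in generator; in a real test this would be the function under test
draws <- sample(seq_along(probs), size = n, replace = TRUE, prob = probs)

# Compare observed category frequencies with the target probabilities
observed <- as.numeric(table(factor(draws, levels = seq_along(probs)))) / n
maxDiff <- max(abs(observed - probs))

# The standard error of each frequency is at most about 0.0016 here,
# so 0.01 leaves a wide margin while still catching real mistakes
stopifnot(maxDiff < 0.01)
```

A test of this shape checks a property of the output and does not depend on the exact random stream, so it survives changes in the generator's implementation.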
defData, defDataAdd, defRead, defReadAdd should all use the same code/ show same behaviour.
This is of course already possible through using functions in nonrandom distribution. But how about formalizing user defined distributions by allowing to pass either one of the usual dist strings or a double-dot prefixed function name that is then used to generate the data. What do you think @kgoldfeld?
Something like
newDistribution <- function(n, formula, variance, link, dtSim, envir) {
# some funky new data generation function
}
def <- defData( varname = "ud", dist = "..newDistribution", formula =5)