squidgroup / squidsim

squidSim is a tool for simulating data from multi-level/hierarchical models, including genetic, phylogenetic, temporal and spatial effects

License: Other

R 100.00%

squidsim's Issues

covariance structures

We need to work out how best to let a user simulate phylogenetic effects or breeding values and, more broadly, supply a correlation matrix (to allow for spatial effects, for example).
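A minimal base R sketch of the general idea, assuming the user supplies a positive-definite correlation matrix: effects with that correlation structure can be drawn through its Cholesky factor. `sim_correlated_effects` and the toy matrix `A` are hypothetical and not part of squidSim.

```r
# Hypothetical helper (not a squidSim function): draw one effect per level
# with covariance equal to variance * cor_mat.
sim_correlated_effects <- function(cor_mat, variance = 1) {
  stopifnot(is.matrix(cor_mat), nrow(cor_mat) == ncol(cor_mat),
            isSymmetric(unname(cor_mat)))
  # chol() returns an upper-triangular U with t(U) %*% U equal to cor_mat
  U <- chol(cor_mat)
  z <- rnorm(nrow(cor_mat))
  effects <- sqrt(variance) * drop(t(U) %*% z)
  names(effects) <- rownames(cor_mat)
  effects
}

# Toy 3-species correlation matrix (made-up values)
sp <- c("sp1", "sp2", "sp3")
A <- matrix(c(1,    0.5,  0.25,
              0.5,  1,    0.25,
              0.25, 0.25, 1),
            nrow = 3, dimnames = list(sp, sp))

set.seed(1)
u <- sim_correlated_effects(A, variance = 2)
```

The same function would cover breeding values (relatedness matrix), phylogenetic effects or spatial effects, since only the matrix changes.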

add in simulating non-stochastic environments

work out how to put temporal effects in

create function to generate missing data

This would be based on the simulated phenotype and cover three missingness processes: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
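One possible shape for such a function, sketched in base R. `make_missing` and its arguments are hypothetical. It assumes a logistic model for the probability of a value being missing, so `prop` is only the approximate missing proportion under MAR and MNAR.

```r
# Hypothetical helper (not part of squidSim): set values of y to NA.
#   MCAR: missingness is independent of the data
#   MAR:  missingness depends on an observed covariate x
#   MNAR: missingness depends on the value of y itself
make_missing <- function(y, x = NULL, mechanism = c("MCAR", "MAR", "MNAR"),
                         prop = 0.2, slope = 1) {
  mechanism <- match.arg(mechanism)
  driver <- switch(mechanism,
    MCAR = rep(0, length(y)),
    MAR  = {
      stopifnot(!is.null(x), length(x) == length(y))
      as.numeric(scale(x))
    },
    MNAR = as.numeric(scale(y))
  )
  # Probability of being missing; equals prop when the driver is at its mean
  p_miss <- plogis(qlogis(prop) + slope * driver)
  y[runif(length(y)) < p_miss] <- NA
  y
}

set.seed(123)
y <- rnorm(10000)
x <- rnorm(10000)
y_mcar <- make_missing(y, mechanism = "MCAR")
y_mar  <- make_missing(y, x = x, mechanism = "MAR")
y_mnar <- make_missing(y, mechanism = "MNAR")
```

With a positive `slope`, larger values of `y` are more likely to be missing under MNAR, so the mean of the observed values is biased downwards.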

example in vignette produces warning

The example on this page: http://squidgroup.org/squidSim_vignette/2.3-simulating-predictors-at-different-hierarchical-levels.html
produces the warning:

Warning message:
In length(x) == 1 && x == "" :
 'length(x) = 1000 > 1' in coercion to 'logical(1)'.
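One way this warning can arise, assuming `x` is a one-column data frame with 1000 rows (whether that is the actual cause inside squidSim is a guess): `length()` of a data frame is its number of columns, so `length(x) == 1` is `TRUE` and `x == ""` is then evaluated with one element per row. R 4.2.0 warns when `&&` receives a condition longer than one, and from R 4.3.0 this is an error. The guard `is_blank` below is a hypothetical replacement for the check.

```r
x <- data.frame(rain = rnorm(1000))
length(x)        # 1: number of columns, not rows
length(x == "")  # 1000: one comparison per cell

# Hypothetical guard: only a single empty string counts as blank.
# is.character() is FALSE for a data frame, so && stops before comparing.
is_blank <- function(x) {
  is.character(x) && length(x) == 1 && identical(unname(x), "")
}

is_blank(x)    # FALSE, without a warning
is_blank("")   # TRUE
```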

named factors in data_structure

Potential bug: if something is input as a fixed factor (fixed = TRUE in the parameter list), the user can give names. Even if these are the same names as in the data_structure, the code never refers back to them, so a user could easily supply them in the wrong order without the function detecting it, and the effects would be assigned to the wrong levels in the generated data.

The default naming should also be better. At the moment it comes out as e.g. sex_effect1, sex_effect2. It would be much better if it linked back to the names in the data_structure, giving sex_male and sex_female, for example.
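A sketch of the name-based matching this would need, in base R. `align_effects` is hypothetical and only illustrates reordering by level name rather than by position, and building names such as sex_male from the factor levels.

```r
# Hypothetical helper: match user-supplied effects to factor levels by name.
align_effects <- function(levels, beta, factor_name) {
  if (is.null(names(beta))) {
    stop("beta must be named by factor level")
  }
  missing_levels <- setdiff(levels, names(beta))
  if (length(missing_levels) > 0) {
    stop("no effect given for level(s): ",
         paste(missing_levels, collapse = ", "))
  }
  out <- beta[levels]  # reorder by name, not by position
  names(out) <- paste(factor_name, levels, sep = "_")
  out
}

# Effects given in the "wrong" order are still assigned correctly
effects <- align_effects(levels = c("female", "male"),
                         beta = c(male = 0.5, female = -0.5),
                         factor_name = "sex")
```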

simulating indirect genetic effects

I think this can be done through indexing in the model formula, which is already implemented in sim_population

sampling ideas

sampling to create survival data - binomial data, remove records after the first 1 (or 0) within a certain group
sampling to create structure that realistically creates number of observations per individual if you measure them across lifespan (geometric distribution)
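Both ideas can be sketched in base R on a toy data frame. The column names and probabilities are made up, and nothing here uses squidSim's own sampling arguments.

```r
set.seed(42)
# Toy data: 5 individuals with 6 yearly records each; died == 1 marks the event
dat <- data.frame(
  individual = rep(1:5, each = 6),
  year       = rep(1:6, times = 5)
)
dat$died <- rbinom(nrow(dat), size = 1, prob = 0.3)

# Survival data: keep records up to and including the first 1 per individual.
# cumsum(cumsum(d)) stays <= 1 until just after the first event.
keep <- ave(dat$died, dat$individual,
            FUN = function(d) as.integer(cumsum(cumsum(d)) <= 1)) == 1
survival_dat <- dat[keep, ]

# Observations per individual across a lifespan: rgeom() counts failures
# before the first success, so adding 1 gives at least one record each.
n_obs <- rgeom(5, prob = 0.3) + 1
```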

speed up internal matrix functions

":" in make_structure

Good for specifying, but shouldn't be in the names in the resulting data.frame - year:month should end up as something like year_month
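A minimal sketch of the renaming step, using a toy data frame in place of real make_structure() output (the column names are assumed):

```r
# Toy stand-in for a structure with a crossed "year:month" factor
ds <- data.frame(year = c(1, 1, 2), `year:month` = c(1, 2, 3),
                 check.names = FALSE)

# Replace ":" with "_" in the column names only
names(ds) <- gsub(":", "_", names(ds), fixed = TRUE)
names(ds)
```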

allow zero variance? or NA variance

allow naming in make_structure

At the moment the names of the levels of each factor are just 1:n. It would be nice if we could add names.

Consistency in parameter naming

Here are some thoughts on naming parameters:

The population id in the output dataset is named squid_pop. None of the other parameters is preceded by squid_. Maybe name it just pop or population.

All parameter names are in lower case except N and N_pop. What about n and n_pop?

if betas in known_predictors are named, this affects the names of the responses

wrapper functions

Is there an easy way to generate code for common model types? For example, random regression is a little complex, but maybe we could have a function that would generate the code structure for a random regression.

Sampling from population

I'm trying to sample 25 individuals from simulated data of 100 individuals. However, the number of individuals sampled is different in each simulation.

library(squidSim)

sim_res <- simulate_population(
  data_structure = make_structure("individual(100)", repeat_obs = 1),
  response_name = "Female",
  parameters = list(
    intercept = 0.19360,
    observation = list(
      names = c("Elo_score"),
      beta  = c(-0.11368)
    ),
    residual = list(
      vcov = 6.95e-10
    )
  ),
  family = "binomial",
  link   = "logit",
  n_pop = 1,
  sample_type  = "nested",
  sample_param = cbind(individual = 25, observation = 1)
)

sample_data <- get_sample_data(sim_res, sample_set = 1)

different sampling regimes for different predictors/responses

I think the best way to do this would be to specify different sampling regimes and then have a way of combining them, by specifying which variables have which sampling regime.

This could be done in get_sample_data() by passing a list such as
list("1" = c("rain"), "2" = c("y"))
which would indicate that the first sampling regime is used for "rain" and the second for "y". This would also allow different response variables to be sampled differently, for example
list("1" = c("y1"), "2" = c("y2"))

It would also be a good idea to allow the sampled data to be returned with NAs in it, for example, have an argument include_NA=TRUE (with FALSE as default)
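A rough base R sketch of how combining regimes might behave. `combine_samples` and its arguments are hypothetical and not the current get_sample_data() interface. It assumes each sampling regime is available as a vector of sampled row indices.

```r
# Hypothetical helper.
#   data       : full simulated data frame
#   samples    : list of row-index vectors, one per sampling regime
#   vars       : list of character vectors, the variables under each regime
#   include_NA : if FALSE, keep only rows where all listed variables were sampled
combine_samples <- function(data, samples, vars, include_NA = FALSE) {
  stopifnot(length(samples) == length(vars))
  out <- data
  for (i in seq_along(samples)) {
    unsampled <- setdiff(seq_len(nrow(data)), samples[[i]])
    for (v in vars[[i]]) {
      out[[v]][unsampled] <- NA  # blank out values this regime did not sample
    }
  }
  if (!include_NA) {
    sampled_vars <- unlist(vars)
    complete <- stats::complete.cases(out[, sampled_vars, drop = FALSE])
    out <- out[complete, , drop = FALSE]
  }
  out
}

dat <- data.frame(id = 1:6,
                  rain = c(2.1, 0.3, 1.8, 0.9, 1.2, 2.5),
                  y = c(5, 3, 6, 4, 4, 7))
regimes <- list(c(1, 2, 3, 4), c(3, 4, 5, 6))

with_na    <- combine_samples(dat, regimes, vars = list("rain", "y"), include_NA = TRUE)
without_na <- combine_samples(dat, regimes, vars = list("rain", "y"))
```

Here `with_na` keeps all six rows with NAs where a variable was not sampled, while `without_na` keeps only the rows sampled under both regimes.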

Known predictors need to be a data frame and names are lost

Known predictors need to be a matrix or data frame. Variable names need to be entered manually, because they are lost if the predictors are supplied as vectors.

Covariance between known predictors and simulated predictors

When simulating predictors, squidSim allows the user to specify the covariance matrix between all of these predictors. squidSim also allows the user to supply their own data as known predictors. However, it is not currently possible to specify the covariance between simulated and known predictors. This could be a very useful feature.
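For a single known predictor, one way to get a target correlation is to mix the standardised known values with independent noise. This is a sketch under that assumption, not a description of anything squidSim does. `sim_given_known` is hypothetical.

```r
# Hypothetical helper: simulate z so that cor(x, z) is close to rho
sim_given_known <- function(x, rho, sd_z = 1) {
  stopifnot(abs(rho) <= 1)
  x_std <- as.numeric(scale(x))  # mean 0, sd 1
  # rho^2 + (1 - rho^2) = 1, so z has standard deviation close to sd_z
  sd_z * (rho * x_std + sqrt(1 - rho^2) * rnorm(length(x)))
}

set.seed(7)
x <- runif(2000, 10, 30)             # stands in for a known predictor
z <- sim_given_known(x, rho = 0.6)   # simulated predictor
cor(x, z)
```

The realised correlation varies around `rho` from sample to sample. Several known predictors would need the conditional multivariate normal distribution instead.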

Function to estimate the expected (co)variances

I think it would be nice to have a function that summarizes the expected patterns of variance and covariance at the different levels due to known and unknown sources, as well as the proportion of (co)variance explained by each process. This is already implemented in the squidApp.
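As an illustration of the calculation involved, assuming a linear Gaussian model with independent levels: the variance contributed by a set of predictors with slopes beta and covariance matrix vcov is t(beta) %*% vcov %*% beta. `expected_var` and the numbers below are hypothetical.

```r
# Hypothetical helper: expected variance contributed by one level
expected_var <- function(beta, vcov) {
  beta <- as.numeric(beta)
  vcov <- as.matrix(vcov)
  stopifnot(nrow(vcov) == length(beta), ncol(vcov) == length(beta))
  drop(t(beta) %*% vcov %*% beta)
}

# Made-up parameter values for three levels
components <- c(
  individual  = expected_var(1, 0.5),
  observation = expected_var(c(0.5, 0.3), matrix(c(1, 0.2, 0.2, 1), 2)),
  residual    = expected_var(1, 0.8)
)

# Proportion of the total expected variance due to each level
proportions <- components / sum(components)
```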
