brandmaier / semtree Goto Github PK

Recursive Partitioning for Structural Equation Models

Home Page: https://brandmaier.github.io/semtree/

License: GNU General Public License v3.0

R 99.82% Makefile 0.18%

forest randomforest decision-tree multivariate sem statistical-modeling structural-equation-modeling structural-equation-models bigdata r

semtree's People

Contributors

Stargazers

Watchers

Forkers

trbrick aaronpeikert manuelarnold tiger-liang weibokong27 octomike cjvanlissa stjordanis manuelarnold2 emstruong

semtree's Issues

semtree does not appear to recognise interaction terms in lavaan

Hi there!

Love the package
Works well for me, except when I specify an interaction term while writing my model in lavaan
E.g.
lav_obj <- 'y ~ x +z + x:z'
run_lav_obj <- sem(model = lav_obj , data = Data)

Above terminates fine. But when I run:
tree <- semtree(model=run_lav_obj , data=Data,
verbose = T, predictors=c('c'))

I get an error where it looks as though it is searching for the column x:z - which is created in the lavaan object.

re-structuring semtree.control()

This is a discussion on whether we want to improve the accessibility of the semtree.control-object based on ideas of @manuelarnold. He suggested to restructure as following:

method: "LR" is default or "score"
LR_tests: "maxLR" (default) or "naive","bonferroni","fair"
score.tests: list(nominal = 'LMuo', ordinal = 'maxLMo', metric = 'maxLM') (default settings)
fast_tree: FALSE (default), TRUE

I like the separation of the score vs LR and a second argument to specify how the LR tests are done. However, I suggest that instead of fast_tree as an argument, we provide a separate constructor for the control object, that is, you call semtree.control.fast() which then returns a control-object that has score.tests=list(nominal = 'LMuo', ordinal = 'WDM', metric = 'DM' and method="score".

use.maxlm should be use_maxLR

Argument name is misleading.

Make package imports leaner

The package dependencies on stringr and bitops should be removed with base R code.

Compute CSP only once for the evaluation of a split

Move calculation of CSP from scoreTest.R to naiveSplitScoreTest.R

Remove all cats 🐱

Remove all cat and print commands and replace by ui_* commands.

toTable appears to issue an error when applied to tree with multiple covariates

#See code below that recreates the issue.
library(psychTools)
data(affect)

affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))

tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1

knitr::kable(head(tree.data))

library(OpenMx)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests, free=c(TRUE), value=c(1.0) , arrows=1, label=c("mu") ),
mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(1.0) , arrows=2, label=c("sigma2") ),
mxData(tree.data, type = "raw")
);

result <- mxRun(model)

library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)

toTable(tree,round.param=12)

remove debug output from toLaTeX()

create package vignette

CRAN check fail: table_results() missing

The package fails the CRAN checks because of an undefined function:

> checking R code for possible problems ... NOTE
  plot_growth.MxModel: no visible global function definition for
    'table_results'
  Undefined global functions or variables:
    table_results

@cjvanlissa, would you mind fixing this?

Demo semforestLGCModelOpenMxParallel.R throws error:

I see this:

> bigforest <- merge(forest, forest2)
Error in merge.internal(list(x, y)) : 
  Cannot merge forests! Models differ.

Install package wrong in read me

devtools::install_github("brandmaier/semtree")

even better: install with package vignette (extra documentation)

devtools::install_github("brandmaier/semtree",force=TRUE, build_opts = c())

in the first line in the readme the name of repo and author are mixed up, which returns an error in R

OpenMx GMM are incompatible with semtree

OpenMx GMMs have a meta-model that doesn't have any manifest variables. semtree assumes the existence of manifest variables and exits with an error if GMMs are used.

definition variables are predictors

semtree does not detect that definition variables are part of the model and uses them as splitters.

Add more documentation to control object

@manuelarnold suggested (in 2021) to improve documentation in semtree control for:

Linear: Boolean. Identifies non-linear constraints. Can we derive this automatically?
strucchange.from, strucchange.to: Those are important for maxLR and maxLM.
strucchange.nrep: add more info
scaled_scores: Should probably always be TRUE and be removed from control object.

CRAN Warnings

Semtree 0.9.16 has warnings on
r-devel-linux-x86_64-fedora-clang
r-devel-linux-x86_64-debian-gcc
r-devel-linux-x86_64-fedora-gcc
r-patched-linux-x86_64

semtree returns an error for lavaan models

semtree gives the following error message for lavaan models:

Error occured!
Error in fitSubmodels(model, subset1, subset2, control, invariance = NULL): no slot of name "data" for this object of class "lavaan"
No traceback available
✔ Tree construction finished [took less than a second].

The resulting tree consists of only one node, even though semtree would split the sample for an equivalent OpenMx model.

Here is an example:

library(semtree)
library(lavaan)
library(OpenMx)

set.seed(1808)

Data <- data.frame(y = c(rnorm(150, mean = -1), rnorm(150, mean = 1)),
z = sort(rnorm(300)))

m_lav <- '
y ~~ y
y ~ 1
'

####Testing semtree with lavaan models ####
fit_lav <- lavaan(model = m, data = Data)

tree_lav <- semtree::semtree(model = fit_lav, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
'# => returns an error

tree_lav <- semtree::semtree(model = fit_lav, data = Data)
'# => also returns an error

plot(tree_lav)
'# Only one node

Testing semtree with OpenMx models

m_mx <- mxModel(manifestVars = "y", type = "RAM",
mxData(observed = Data, type = "raw"),
mxPath(from = "y", arrows = 2, free = TRUE, values = 1, labels = "sigma2"),
mxPath(from = "one", to = "y", arrows = 1, free = TRUE, values = 0, labels = "mu"))

fit_mx <- mxTryHard(model = m_mx)

tree_mx <- semtree::semtree(model = fit_mx, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
plot(tree_mx)
'# => OpenMx models work

boruta - percentile

add a percentile threshold

btn.matrix format is bad

btn.matrix encodes everything as strings since mixed types are used here. Better, store in transposed data.frame to preserve variable types.

renaming of stripTree

The package has a function to remove information from trees to reduce their memory footprint, e.g., it may removes the fitted model, the datasets, etc. This makes operations on SEM forests more efficient in particular. This function is currently called stripTree(). @cjvanlissa re-implemented this function and called it clear_underbrush(). I want to open the discussion on how to best name this function. I am happy to have a cute name here, which still needs to be informative about what's going on. Clearing the underbrush is neat but my image here is more that we do something to every branch and every leaf. The metaphor of pruning is already used for cutting back entire branches. So, what about a function called exfoliate()? Or do we just stick with the old name? Specifically asking for opinons of @manuelarnold and @cjvanlissa. Thanks!

Score-based statistics: add check whether sum of scores is zero

To prevent problems with model specification (mean structure) and score-based tests, @manuelarnold suggested to add a routine check whether the sum of scores is (close to) zero.

add semtree example to readme file

varimp() fails for lavaan forest

The error below suggests that error catching is necessary, as sometimes the number of cases in terminal nodes is too low to compute the covariance matrix. This even happens though

> controls$semtree.control$min.bucket
[1] 10

> vim <- semtree::varimp(res_rf)
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge.eps, x.idx = x.idx[[g]],  : 
  lavaan ERROR: sample covariance matrix is not positive-definite
In addition: Warning message:
In lav_data_full(data = data, group = group, cluster = cluster,  :
  lavaan WARNING: small number of observations (nobs < nvar)
  nobs = 3 nvar = 5

Is bootstrap / subsampling split stored anywhere?

I see the bootstrapped datasets, but I do not see which rows of the original data these are. It is desirable to have this information, for example to obtain oob predictions of observed scores per case across all trees.

dynamically adjust min_N and min_bucket?

maybe add a heuristic for those values if they are not set by the user.

Say, count the number of free parameters and multiply by 5 to set min_bucket, then set min_N as 2*min_bucket.

import error in CRAN checks

semtree checks currently produce the following problem:

N  checking dependencies in R code (4s)
   '::' oder ':::' importieren unangemeldet aus:
     'ctsemOMX' 'expm'
   Unexported object imported by a ':::' call: 'OpenMx:::vcov.MxModel'

Please fix, @manuelarnold

move ctsem dependency to suggests

This entails adding a couple of if(require(...)) conditions at various places.
Offering ctsem tree is not core functionality.

toTable doesn't appear to process round.param

#See example below
#library(githubinstall)
#githubinstall("semtree",force=TRUE)

library(psychTools)
data(affect)

affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))

tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1

knitr::kable(head(tree.data))

result <- mxRun(model)

library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)

toTable(tree,round.param=0)
toTable(tree,round.param=12)

Run error in SEM forests using lavaan

Hi Andreas. I found a run error that appears when lavaan-based SEM forests are parallelized using the plan(multisession) function from the package future. It works fine when future is not used.

issue_future_lavaan.txt

add documentation of futures

Add documentation explaining how to use parallelization with semforest.

surrogate splits

Score-based tests without labelled parameters in OpenMx broken

The following code does not work with current semtree and score-based tests because the model has no labelled parameters. Thus, the score computation returns only zeros and the tree will find no split:

library(OpenMx)
library(semtree)

set.seed(2349)

manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model", 
                 type="RAM",
                 manifestVars = manifests,
                 latentVars = latents,
                 mxPath(from="one",to=manifests,
                        free=c(TRUE), 
                        value=c(1.0) , arrows=1#, 
                       # label=c("mu") 
                        ),
                 mxPath(from=manifests,to=manifests,
                        free=c(TRUE), value=c(1.0) , 
                        arrows=2#, 
                       # label=c("sigma2") 
                        ),
                 mxData(tree.data, type = "raw")
);
fitted <- mxRun(model)
ctrl = semtree.control(
  method="score", 
  bonferroni = TRUE)
tree = semtree( model = model, 
                data = tree.data, 
                control=ctrl)

inefficient LR computation

@manuelarnold reported that semtree currently computes LR that are ignored by maxLR. This unnecessarily increases computation time of the maxLR tests.

Improve package vignettes

apply and transpose behaviour in R

Is there a more elegant way in R to handle the transpose behaviour of apply() that depends on whether it is a single or more dimensions? Related to changes in 4246751.

bug in partialDependence

The following minimal example produces an error in the new implementation of partialDependence, @cjvanlissa

library(semtree)
library(lavaan)
library(tictoc)

# sample some univariate data in x and two covariates y,z
n<-1000
data <- data.frame(x=rnorm(n),
                   y=ordered(rbinom(n = n,1,.25)),
                   z=ordered(sample(c(0,1),n,TRUE))
                   )

model <- "x~~x; x~0"
model <- lavaan::lavaan(model, data)

forest <- semforest(model, data,control = semforest.control(num.trees=20))


pd <- partialDependence(forest, reference.var = "y")

Correct name of test statistics

With the introduction of score-based tests, more split criteria are used in semtree than the likelihood ratio test statistic. Currently, all test statistics are incorrectly labeled as LR in outputs and plots.
I suggest the following changes:
We split tree$result$LL.max into tree$result$test.type (character value; e.g., LR or maxLM and so on) and tree$result$test.value (numeric value). I could then display the information stored in tree$result$test.type in the plot function.
Could we change that by copying and pasting, or will we break something?

score tests ignore minimum bucket size

@manuelarnold , the score-tests seem to ignore the min.bucket parameter. This leads to trees with quite unstable models in the leafs (sometimes N=1). Please fix this asap! Thanks!

Semtree fails when using tibble instead of data.frame

Hi Andreas,

I just tried using semtree with a data set that I had imported into R using haven. The resulting data.frame is a tibble data.frame. When running semtree based on a lavaan model, I was surprised that there were no splits for my data set. Only when casting the tibble to a base R data.frame, the model was partitioned correctly. When using OpenMx models, I found that tibbles result in errors. Here is an example:

library(semtree)
library(tibble)
library(OpenMx)
library(lavaan)

# For lavaan, semtree does not throw any warnings / errors when using tibble,
# but also does not run correctly:

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

HolzingerSwineford1939 <- HolzingerSwineford1939[complete.cases(HolzingerSwineford1939), ]

fit <- cfa(HS.model, data = tibble::as_tibble(HolzingerSwineford1939))
# Using tibble does not work:
tree <- semtree(model = fit, 
                data  = tibble::as_tibble(HolzingerSwineford1939))
print(tree)

# With data.frames, the tree is built correctly
tree <- semtree(model = fit, 
                data  = HolzingerSwineford1939,
                control = semtree.control(method = "score"))
print(tree)

## For OpenMx, I get an error, but one that reports issues with P1 not being
## a factor:

set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
p1 <- sample(c(0, 1), size = N, replace = TRUE)
loadings <- 0:4
x <- (slope + p1 * 5) %*% t(loadings) + 
  matrix(rep(icept, each = M), byrow = TRUE, ncol = M) + 
  rnorm(N * M, sd = .08)
colnames(x) <- paste0("X", 1:M)

# Use tibble
growth.data    <- tibble::as_tibble(x)
growth.data$P1 <- as.factor(p1)


manifests <- names(growth.data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
                            type="RAM",
                            manifestVars=manifests,
                            latentVars=c("intercept","slope"),
                            mxData(growth.data, type="raw"),
                            # residual variances
                            mxPath(
                              from=manifests,
                              arrows=2,
                              free=TRUE,
                              values = c(.1, .1, .1, .1, .1),
                              labels=c("residual","residual","residual","residual","residual")
                            ),
                            # latent variances and covariance
                            mxPath(
                              from=c("intercept","slope"),
                              arrows=2,
                              connect="unique.pairs",
                              free=TRUE,
                              values=c(2, 0, 1),
                              labels=c("vari", "cov", "vars")
                            ),
                            # intercept loadings
                            mxPath(
                              from="intercept",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(1, 1, 1, 1, 1)
                            ),
                            # slope loadings
                            mxPath(
                              from="slope",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 1, 2, 3, 4)
                            ),
                            # manifest means
                            mxPath(
                              from="one",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 0, 0, 0, 0)
                            ),
                            # latent means
                            mxPath(
                              from="one",
                              to=c("intercept", "slope"),
                              arrows=1,
                              free=TRUE,
                              values=c(1, 1),
                              labels=c("meani", "means")
                            )
) # close model

# fit the model to the entire dataset
growthCurveModel <- mxRun(growthCurveModel)


# Using tibble does not work:
tree <- semtree(model = growthCurveModel, 
                data  = growth.data)
# Using data.frame does work:
tree <- semtree(model = growthCurveModel, 
                data  = as.data.frame(growth.data))

Best,
Jannik

DESCRIPTION: URL and BugReports

Please consider adding URL and BugReports pointing to this GitHub repo and issue tracker to your DESCRIPTION file. That way, it'll be easier to find this repo, e.g. from https://cran.r-project.org/package=semtree.

Different cur.type for metric and ordinal variables

Currently, the cur.type is 1 for categorical variables and 2 for metric and ordinal variables. Since the distinction between ordinal and metric variables is important for both maxLR test statistics and score-based tests, it would make sense to use different cur.type values for both types of variables.
1: categorical
2: ordinal
3: metric