brandmaier / semtree Goto Github PK
View Code? Open in Web Editor NEWRecursive Partitioning for Structural Equation Models
Home Page: https://brandmaier.github.io/semtree/
License: GNU General Public License v3.0
Recursive Partitioning for Structural Equation Models
Home Page: https://brandmaier.github.io/semtree/
License: GNU General Public License v3.0
Hi there!
Love the package
Works well for me, except when I specify an interaction term while writing my model in lavaan
E.g.
lav_obj <- 'y ~ x +z + x:z'
run_lav_obj <- sem(model = lav_obj , data = Data)
Above terminates fine. But when I run:
tree <- semtree(model=run_lav_obj , data=Data,
verbose = T, predictors=c('c'))
I get an error where it looks as though it is searching for the column x:z - which is created in the lavaan object.
This is a discussion on whether we want to improve the accessibility of the semtree.control
-object based on ideas of @manuelarnold. He suggested to restructure as following:
method
: "LR" is default or "score"LR_tests
: "maxLR" (default) or "naive","bonferroni","fair"score.tests
: list(nominal = 'LMuo', ordinal = 'maxLMo', metric = 'maxLM') (default settings)fast_tree
: FALSE (default), TRUEI like the separation of the score vs LR and a second argument to specify how the LR tests are done. However, I suggest that instead of fast_tree
as an argument, we provide a separate constructor for the control object, that is, you call semtree.control.fast()
which then returns a control-object that has score.tests=list(nominal = 'LMuo', ordinal = 'WDM', metric = 'DM'
and method="score"
.
Argument name is misleading.
The package dependencies on stringr
and bitops
should be removed with base R code.
Move calculation of CSP from scoreTest.R to naiveSplitScoreTest.R
Remove all cat
and print
commands and replace by ui_* commands.
#See code below that recreates the issue.
library(psychTools)
data(affect)
affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))
tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1
knitr::kable(head(tree.data))
library(OpenMx)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests, free=c(TRUE), value=c(1.0) , arrows=1, label=c("mu") ),
mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(1.0) , arrows=2, label=c("sigma2") ),
mxData(tree.data, type = "raw")
);
result <- mxRun(model)
library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)
toTable(tree,round.param=12)
The package fails the CRAN checks because of an undefined function:
> checking R code for possible problems ... NOTE
plot_growth.MxModel: no visible global function definition for
'table_results'
Undefined global functions or variables:
table_results
@cjvanlissa, would you mind fixing this?
I see this:
> bigforest <- merge(forest, forest2)
Error in merge.internal(list(x, y)) :
Cannot merge forests! Models differ.
devtools::install_github("brandmaier/semtree")
devtools::install_github("brandmaier/semtree",force=TRUE, build_opts = c())
in the first line in the readme the name of repo and author are mixed up, which returns an error in R
OpenMx GMMs have a meta-model that doesn't have any manifest variables. semtree assumes the existence of manifest variables and exits with an error if GMMs are used.
semtree does not detect that definition variables are part of the model and uses them as splitters.
@manuelarnold suggested (in 2021) to improve documentation in semtree control for:
Linear: Boolean. Identifies non-linear constraints. Can we derive this automatically?
strucchange.from, strucchange.to: Those are important for maxLR and maxLM.
strucchange.nrep: add more info
scaled_scores: Should probably always be TRUE and be removed from control object.
Semtree 0.9.16 has warnings on
r-devel-linux-x86_64-fedora-clang
r-devel-linux-x86_64-debian-gcc
r-devel-linux-x86_64-fedora-gcc
r-patched-linux-x86_64
semtree gives the following error message for lavaan models:
Error occured!
Error in fitSubmodels(model, subset1, subset2, control, invariance = NULL): no slot of name "data" for this object of class "lavaan"
No traceback available
โ Tree construction finished [took less than a second].
The resulting tree consists of only one node, even though semtree would split the sample for an equivalent OpenMx model.
Here is an example:
library(semtree)
library(lavaan)
library(OpenMx)
set.seed(1808)
Data <- data.frame(y = c(rnorm(150, mean = -1), rnorm(150, mean = 1)),
z = sort(rnorm(300)))
m_lav <- '
y ~~ y
y ~ 1
'
####Testing semtree with lavaan models ####
fit_lav <- lavaan(model = m, data = Data)
tree_lav <- semtree::semtree(model = fit_lav, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
'# => returns an error
tree_lav <- semtree::semtree(model = fit_lav, data = Data)
'# => also returns an error
plot(tree_lav)
'# Only one node
m_mx <- mxModel(manifestVars = "y", type = "RAM",
mxData(observed = Data, type = "raw"),
mxPath(from = "y", arrows = 2, free = TRUE, values = 1, labels = "sigma2"),
mxPath(from = "one", to = "y", arrows = 1, free = TRUE, values = 0, labels = "mu"))
fit_mx <- mxTryHard(model = m_mx)
tree_mx <- semtree::semtree(model = fit_mx, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
plot(tree_mx)
'# => OpenMx models work
add a percentile threshold
btn.matrix encodes everything as strings since mixed types are used here. Better, store in transposed data.frame to preserve variable types.
The package has a function to remove information from trees to reduce their memory footprint, e.g., it may removes the fitted model, the datasets, etc. This makes operations on SEM forests more efficient in particular. This function is currently called stripTree()
. @cjvanlissa re-implemented this function and called it clear_underbrush()
. I want to open the discussion on how to best name this function. I am happy to have a cute name here, which still needs to be informative about what's going on. Clearing the underbrush is neat but my image here is more that we do something to every branch and every leaf. The metaphor of pruning is already used for cutting back entire branches. So, what about a function called exfoliate()
? Or do we just stick with the old name? Specifically asking for opinons of @manuelarnold and @cjvanlissa. Thanks!
To prevent problems with model specification (mean structure) and score-based tests, @manuelarnold suggested to add a routine check whether the sum of scores is (close to) zero.
The error below suggests that error catching is necessary, as sometimes the number of cases in terminal nodes is too low to compute the covariance matrix. This even happens though
> controls$semtree.control$min.bucket
[1] 10
> vim <- semtree::varimp(res_rf)
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge.eps, x.idx = x.idx[[g]], :
lavaan ERROR: sample covariance matrix is not positive-definite
In addition: Warning message:
In lav_data_full(data = data, group = group, cluster = cluster, :
lavaan WARNING: small number of observations (nobs < nvar)
nobs = 3 nvar = 5
I see the bootstrapped datasets, but I do not see which rows of the original data these are. It is desirable to have this information, for example to obtain oob predictions of observed scores per case across all trees.
maybe add a heuristic for those values if they are not set by the user.
Say, count the number of free parameters and multiply by 5 to set min_bucket, then set min_N as 2*min_bucket.
semtree checks currently produce the following problem:
N checking dependencies in R code (4s)
'::' oder ':::' importieren unangemeldet aus:
'ctsemOMX' 'expm'
Unexported object imported by a ':::' call: 'OpenMx:::vcov.MxModel'
Please fix, @manuelarnold
This entails adding a couple of if(require(...)) conditions at various places.
Offering ctsem tree is not core functionality.
#See example below
#library(githubinstall)
#githubinstall("semtree",force=TRUE)
library(psychTools)
data(affect)
affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))
tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1
knitr::kable(head(tree.data))
library(OpenMx)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests, free=c(TRUE), value=c(1.0) , arrows=1, label=c("mu") ),
mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(1.0) , arrows=2, label=c("sigma2") ),
mxData(tree.data, type = "raw")
);
result <- mxRun(model)
library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)
toTable(tree,round.param=0)
toTable(tree,round.param=12)
Hi Andreas. I found a run error that appears when lavaan-based SEM forests are parallelized using the plan(multisession) function from the package future. It works fine when future is not used.
Add documentation explaining how to use parallelization with semforest.
The following code does not work with current semtree
and score-based tests because the model has no labelled parameters. Thus, the score computation returns only zeros and the tree will find no split:
library(OpenMx)
library(semtree)
set.seed(2349)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests,
free=c(TRUE),
value=c(1.0) , arrows=1#,
# label=c("mu")
),
mxPath(from=manifests,to=manifests,
free=c(TRUE), value=c(1.0) ,
arrows=2#,
# label=c("sigma2")
),
mxData(tree.data, type = "raw")
);
fitted <- mxRun(model)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = model,
data = tree.data,
control=ctrl)
@manuelarnold reported that semtree currently computes LR that are ignored by maxLR. This unnecessarily increases computation time of the maxLR tests.
Is there a more elegant way in R to handle the transpose behaviour of apply() that depends on whether it is a single or more dimensions? Related to changes in 4246751.
The following minimal example produces an error in the new implementation of partialDependence, @cjvanlissa
library(semtree)
library(lavaan)
library(tictoc)
# sample some univariate data in x and two covariates y,z
n<-1000
data <- data.frame(x=rnorm(n),
y=ordered(rbinom(n = n,1,.25)),
z=ordered(sample(c(0,1),n,TRUE))
)
model <- "x~~x; x~0"
model <- lavaan::lavaan(model, data)
forest <- semforest(model, data,control = semforest.control(num.trees=20))
pd <- partialDependence(forest, reference.var = "y")
With the introduction of score-based tests, more split criteria are used in semtree than the likelihood ratio test statistic. Currently, all test statistics are incorrectly labeled as LR in outputs and plots.
I suggest the following changes:
We split tree$result$LL.max into tree$result$test.type (character value; e.g., LR or maxLM and so on) and tree$result$test.value (numeric value). I could then display the information stored in tree$result$test.type in the plot function.
Could we change that by copying and pasting, or will we break something?
@manuelarnold , the score-tests seem to ignore the min.bucket
parameter. This leads to trees with quite unstable models in the leafs (sometimes N=1). Please fix this asap! Thanks!
Hi Andreas,
I just tried using semtree with a data set that I had imported into R using haven. The resulting data.frame is a tibble data.frame. When running semtree based on a lavaan model, I was surprised that there were no splits for my data set. Only when casting the tibble to a base R data.frame, the model was partitioned correctly. When using OpenMx models, I found that tibbles result in errors. Here is an example:
library(semtree)
library(tibble)
library(OpenMx)
library(lavaan)
# For lavaan, semtree does not throw any warnings / errors when using tibble,
# but also does not run correctly:
HS.model <- ' visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
HolzingerSwineford1939 <- HolzingerSwineford1939[complete.cases(HolzingerSwineford1939), ]
fit <- cfa(HS.model, data = tibble::as_tibble(HolzingerSwineford1939))
# Using tibble does not work:
tree <- semtree(model = fit,
data = tibble::as_tibble(HolzingerSwineford1939))
print(tree)
# With data.frames, the tree is built correctly
tree <- semtree(model = fit,
data = HolzingerSwineford1939,
control = semtree.control(method = "score"))
print(tree)
## For OpenMx, I get an error, but one that reports issues with P1 not being
## a factor:
set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
p1 <- sample(c(0, 1), size = N, replace = TRUE)
loadings <- 0:4
x <- (slope + p1 * 5) %*% t(loadings) +
matrix(rep(icept, each = M), byrow = TRUE, ncol = M) +
rnorm(N * M, sd = .08)
colnames(x) <- paste0("X", 1:M)
# Use tibble
growth.data <- tibble::as_tibble(x)
growth.data$P1 <- as.factor(p1)
manifests <- names(growth.data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
type="RAM",
manifestVars=manifests,
latentVars=c("intercept","slope"),
mxData(growth.data, type="raw"),
# residual variances
mxPath(
from=manifests,
arrows=2,
free=TRUE,
values = c(.1, .1, .1, .1, .1),
labels=c("residual","residual","residual","residual","residual")
),
# latent variances and covariance
mxPath(
from=c("intercept","slope"),
arrows=2,
connect="unique.pairs",
free=TRUE,
values=c(2, 0, 1),
labels=c("vari", "cov", "vars")
),
# intercept loadings
mxPath(
from="intercept",
to=manifests,
arrows=1,
free=FALSE,
values=c(1, 1, 1, 1, 1)
),
# slope loadings
mxPath(
from="slope",
to=manifests,
arrows=1,
free=FALSE,
values=c(0, 1, 2, 3, 4)
),
# manifest means
mxPath(
from="one",
to=manifests,
arrows=1,
free=FALSE,
values=c(0, 0, 0, 0, 0)
),
# latent means
mxPath(
from="one",
to=c("intercept", "slope"),
arrows=1,
free=TRUE,
values=c(1, 1),
labels=c("meani", "means")
)
) # close model
# fit the model to the entire dataset
growthCurveModel <- mxRun(growthCurveModel)
# Using tibble does not work:
tree <- semtree(model = growthCurveModel,
data = growth.data)
# Using data.frame does work:
tree <- semtree(model = growthCurveModel,
data = as.data.frame(growth.data))
Best,
Jannik
Please consider adding URL
and BugReports
pointing to this GitHub repo and issue tracker to your DESCRIPTION file. That way, it'll be easier to find this repo, e.g. from https://cran.r-project.org/package=semtree.
Currently, the cur.type is 1 for categorical variables and 2 for metric and ordinal variables. Since the distinction between ordinal and metric variables is important for both maxLR test statistics and score-based tests, it would make sense to use different cur.type values for both types of variables.
1: categorical
2: ordinal
3: metric
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.