Giter Club home page Giter Club logo

semtree's People

Contributors

brandmaier avatar manuelarnold avatar manuelarnold2 avatar octomike avatar opecheese avatar yihui avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

semtree's Issues

semtree does not appear to recognise interaction terms in lavaan

Hi there!

Love the package
Works well for me, except when I specify an interaction term while writing my model in lavaan
E.g.
lav_obj <- 'y ~ x +z + x:z'
run_lav_obj <- sem(model = lav_obj , data = Data)

Above terminates fine. But when I run:
tree <- semtree(model=run_lav_obj , data=Data,
verbose = T, predictors=c('c'))

I get an error where it looks as though it is searching for the column x:z - which is created in the lavaan object.

re-structuring semtree.control()

This is a discussion on whether we want to improve the accessibility of the semtree.control-object based on ideas of @manuelarnold. He suggested to restructure as following:

  • method: "LR" is default or "score"
  • LR_tests: "maxLR" (default) or "naive","bonferroni","fair"
  • score.tests: list(nominal = 'LMuo', ordinal = 'maxLMo', metric = 'maxLM') (default settings)
  • fast_tree: FALSE (default), TRUE

I like the separation of the score vs LR and a second argument to specify how the LR tests are done. However, I suggest that instead of fast_tree as an argument, we provide a separate constructor for the control object, that is, you call semtree.control.fast() which then returns a control-object that has score.tests=list(nominal = 'LMuo', ordinal = 'WDM', metric = 'DM' and method="score".

toTable appears to issue an error when applied to tree with multiple covariates

#See code below that recreates the issue.
library(psychTools)
data(affect)

affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))

tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1

knitr::kable(head(tree.data))

library(OpenMx)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests, free=c(TRUE), value=c(1.0) , arrows=1, label=c("mu") ),
mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(1.0) , arrows=2, label=c("sigma2") ),
mxData(tree.data, type = "raw")
);

result <- mxRun(model)

library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)

toTable(tree,round.param=12)

CRAN check fail: table_results() missing

The package fails the CRAN checks because of an undefined function:

> checking R code for possible problems ... NOTE
  plot_growth.MxModel: no visible global function definition for
    'table_results'
  Undefined global functions or variables:
    table_results

@cjvanlissa, would you mind fixing this?

Install package wrong in read me

devtools::install_github("brandmaier/semtree")

even better: install with package vignette (extra documentation)

devtools::install_github("brandmaier/semtree",force=TRUE, build_opts = c())

in the first line in the readme the name of repo and author are mixed up, which returns an error in R

OpenMx GMM are incompatible with semtree

OpenMx GMMs have a meta-model that doesn't have any manifest variables. semtree assumes the existence of manifest variables and exits with an error if GMMs are used.

Add more documentation to control object

@manuelarnold suggested (in 2021) to improve documentation in semtree control for:

Linear: Boolean. Identifies non-linear constraints. Can we derive this automatically?
strucchange.from, strucchange.to: Those are important for maxLR and maxLM.
strucchange.nrep: add more info
scaled_scores: Should probably always be TRUE and be removed from control object.

CRAN Warnings

Semtree 0.9.16 has warnings on
r-devel-linux-x86_64-fedora-clang
r-devel-linux-x86_64-debian-gcc
r-devel-linux-x86_64-fedora-gcc
r-patched-linux-x86_64

semtree returns an error for lavaan models

semtree gives the following error message for lavaan models:

Error occured!
Error in fitSubmodels(model, subset1, subset2, control, invariance = NULL): no slot of name "data" for this object of class "lavaan"
No traceback available
โœ” Tree construction finished [took less than a second].

The resulting tree consists of only one node, even though semtree would split the sample for an equivalent OpenMx model.

Here is an example:

library(semtree)
library(lavaan)
library(OpenMx)

set.seed(1808)

Data <- data.frame(y = c(rnorm(150, mean = -1), rnorm(150, mean = 1)),
z = sort(rnorm(300)))

m_lav <- '
y ~~ y
y ~ 1
'

####Testing semtree with lavaan models ####
fit_lav <- lavaan(model = m, data = Data)

tree_lav <- semtree::semtree(model = fit_lav, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
'# => returns an error

tree_lav <- semtree::semtree(model = fit_lav, data = Data)
'# => also returns an error

plot(tree_lav)
'# Only one node

Testing semtree with OpenMx models

m_mx <- mxModel(manifestVars = "y", type = "RAM",
mxData(observed = Data, type = "raw"),
mxPath(from = "y", arrows = 2, free = TRUE, values = 1, labels = "sigma2"),
mxPath(from = "one", to = "y", arrows = 1, free = TRUE, values = 0, labels = "mu"))

fit_mx <- mxTryHard(model = m_mx)

tree_mx <- semtree::semtree(model = fit_mx, data = Data,
control = semtree.control(use.maxlm = TRUE, method = "naive"))
plot(tree_mx)
'# => OpenMx models work

btn.matrix format is bad

btn.matrix encodes everything as strings since mixed types are used here. Better, store in transposed data.frame to preserve variable types.

renaming of stripTree

The package has a function to remove information from trees to reduce their memory footprint, e.g., it may removes the fitted model, the datasets, etc. This makes operations on SEM forests more efficient in particular. This function is currently called stripTree(). @cjvanlissa re-implemented this function and called it clear_underbrush(). I want to open the discussion on how to best name this function. I am happy to have a cute name here, which still needs to be informative about what's going on. Clearing the underbrush is neat but my image here is more that we do something to every branch and every leaf. The metaphor of pruning is already used for cutting back entire branches. So, what about a function called exfoliate()? Or do we just stick with the old name? Specifically asking for opinons of @manuelarnold and @cjvanlissa. Thanks!

varimp() fails for lavaan forest

The error below suggests that error catching is necessary, as sometimes the number of cases in terminal nodes is too low to compute the covariance matrix. This even happens though

> controls$semtree.control$min.bucket
[1] 10
> vim <- semtree::varimp(res_rf)
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge.eps, x.idx = x.idx[[g]],  : 
  lavaan ERROR: sample covariance matrix is not positive-definite
In addition: Warning message:
In lav_data_full(data = data, group = group, cluster = cluster,  :
  lavaan WARNING: small number of observations (nobs < nvar)
  nobs = 3 nvar = 5

Is bootstrap / subsampling split stored anywhere?

I see the bootstrapped datasets, but I do not see which rows of the original data these are. It is desirable to have this information, for example to obtain oob predictions of observed scores per case across all trees.

dynamically adjust min_N and min_bucket?

maybe add a heuristic for those values if they are not set by the user.

Say, count the number of free parameters and multiply by 5 to set min_bucket, then set min_N as 2*min_bucket.

import error in CRAN checks

semtree checks currently produce the following problem:

N  checking dependencies in R code (4s)
   '::' oder ':::' importieren unangemeldet aus:
     'ctsemOMX' 'expm'
   Unexported object imported by a ':::' call: 'OpenMx:::vcov.MxModel'

Please fix, @manuelarnold

toTable doesn't appear to process round.param

#See example below
#library(githubinstall)
#githubinstall("semtree",force=TRUE)

library(psychTools)
data(affect)

affect$Film <- factor(affect$Film, ordered = FALSE,
labels=c("Frontline", "Halloween", "Nat. Geographic","Parenthood"))

tree.data <- affect[,c("Film","neur","ext","soc","traitanx","NA1","PA1")]
tree.data$DeltaPA <- affect$PA2-affect$PA1

knitr::kable(head(tree.data))

library(OpenMx)
manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model",
type="RAM",
manifestVars = manifests,
latentVars = latents,
mxPath(from="one",to=manifests, free=c(TRUE), value=c(1.0) , arrows=1, label=c("mu") ),
mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(1.0) , arrows=2, label=c("sigma2") ),
mxData(tree.data, type = "raw")
);

result <- mxRun(model)

library(semtree)
ctrl = semtree.control(
method="score",
bonferroni = TRUE)
tree = semtree( model = result,
data = tree.data,
control=ctrl)

toTable(tree,round.param=0)
toTable(tree,round.param=12)

Score-based tests without labelled parameters in OpenMx broken

The following code does not work with current semtree and score-based tests because the model has no labelled parameters. Thus, the score computation returns only zeros and the tree will find no split:

library(OpenMx)
library(semtree)

set.seed(2349)

manifests<-c("DeltaPA")
latents<-c()
model <- mxModel("Simple Model", 
                 type="RAM",
                 manifestVars = manifests,
                 latentVars = latents,
                 mxPath(from="one",to=manifests,
                        free=c(TRUE), 
                        value=c(1.0) , arrows=1#, 
                       # label=c("mu") 
                        ),
                 mxPath(from=manifests,to=manifests,
                        free=c(TRUE), value=c(1.0) , 
                        arrows=2#, 
                       # label=c("sigma2") 
                        ),
                 mxData(tree.data, type = "raw")
);
fitted <- mxRun(model)
ctrl = semtree.control(
  method="score", 
  bonferroni = TRUE)
tree = semtree( model = model, 
                data = tree.data, 
                control=ctrl)

bug in partialDependence

The following minimal example produces an error in the new implementation of partialDependence, @cjvanlissa

library(semtree)
library(lavaan)
library(tictoc)

# sample some univariate data in x and two covariates y,z
n<-1000
data <- data.frame(x=rnorm(n),
                   y=ordered(rbinom(n = n,1,.25)),
                   z=ordered(sample(c(0,1),n,TRUE))
                   )

model <- "x~~x; x~0"
model <- lavaan::lavaan(model, data)

forest <- semforest(model, data,control = semforest.control(num.trees=20))


pd <- partialDependence(forest, reference.var = "y")

Correct name of test statistics

With the introduction of score-based tests, more split criteria are used in semtree than the likelihood ratio test statistic. Currently, all test statistics are incorrectly labeled as LR in outputs and plots.
I suggest the following changes:
We split tree$result$LL.max into tree$result$test.type (character value; e.g., LR or maxLM and so on) and tree$result$test.value (numeric value). I could then display the information stored in tree$result$test.type in the plot function.
Could we change that by copying and pasting, or will we break something?

Semtree fails when using tibble instead of data.frame

Hi Andreas,

I just tried using semtree with a data set that I had imported into R using haven. The resulting data.frame is a tibble data.frame. When running semtree based on a lavaan model, I was surprised that there were no splits for my data set. Only when casting the tibble to a base R data.frame, the model was partitioned correctly. When using OpenMx models, I found that tibbles result in errors. Here is an example:

library(semtree)
library(tibble)
library(OpenMx)
library(lavaan)

# For lavaan, semtree does not throw any warnings / errors when using tibble,
# but also does not run correctly:

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

HolzingerSwineford1939 <- HolzingerSwineford1939[complete.cases(HolzingerSwineford1939), ]

fit <- cfa(HS.model, data = tibble::as_tibble(HolzingerSwineford1939))
# Using tibble does not work:
tree <- semtree(model = fit, 
                data  = tibble::as_tibble(HolzingerSwineford1939))
print(tree)

# With data.frames, the tree is built correctly
tree <- semtree(model = fit, 
                data  = HolzingerSwineford1939,
                control = semtree.control(method = "score"))
print(tree)

## For OpenMx, I get an error, but one that reports issues with P1 not being
## a factor:

set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
p1 <- sample(c(0, 1), size = N, replace = TRUE)
loadings <- 0:4
x <- (slope + p1 * 5) %*% t(loadings) + 
  matrix(rep(icept, each = M), byrow = TRUE, ncol = M) + 
  rnorm(N * M, sd = .08)
colnames(x) <- paste0("X", 1:M)

# Use tibble
growth.data    <- tibble::as_tibble(x)
growth.data$P1 <- as.factor(p1)


manifests <- names(growth.data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
                            type="RAM",
                            manifestVars=manifests,
                            latentVars=c("intercept","slope"),
                            mxData(growth.data, type="raw"),
                            # residual variances
                            mxPath(
                              from=manifests,
                              arrows=2,
                              free=TRUE,
                              values = c(.1, .1, .1, .1, .1),
                              labels=c("residual","residual","residual","residual","residual")
                            ),
                            # latent variances and covariance
                            mxPath(
                              from=c("intercept","slope"),
                              arrows=2,
                              connect="unique.pairs",
                              free=TRUE,
                              values=c(2, 0, 1),
                              labels=c("vari", "cov", "vars")
                            ),
                            # intercept loadings
                            mxPath(
                              from="intercept",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(1, 1, 1, 1, 1)
                            ),
                            # slope loadings
                            mxPath(
                              from="slope",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 1, 2, 3, 4)
                            ),
                            # manifest means
                            mxPath(
                              from="one",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 0, 0, 0, 0)
                            ),
                            # latent means
                            mxPath(
                              from="one",
                              to=c("intercept", "slope"),
                              arrows=1,
                              free=TRUE,
                              values=c(1, 1),
                              labels=c("meani", "means")
                            )
) # close model

# fit the model to the entire dataset
growthCurveModel <- mxRun(growthCurveModel)


# Using tibble does not work:
tree <- semtree(model = growthCurveModel, 
                data  = growth.data)
# Using data.frame does work:
tree <- semtree(model = growthCurveModel, 
                data  = as.data.frame(growth.data))

Best,
Jannik

Different cur.type for metric and ordinal variables

Currently, the cur.type is 1 for categorical variables and 2 for metric and ordinal variables. Since the distinction between ordinal and metric variables is important for both maxLR test statistics and score-based tests, it would make sense to use different cur.type values for both types of variables.
1: categorical
2: ordinal
3: metric

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.