Note that adjust (<a href="https://github.com/easysta

circular dependencies,about easystats/datawizard

Comments (23)

bwiernik commented on August 17, 2024 2

Another option if there is a small number of utility functions (like the console coloring, etc.) that both insight and datawizard need, we could use github submodules to copy them over from insight to datawizard (or vice versa) at build time so that they can exist internally in both, but only be exported by one.

from datawizard.

IndrajeetPatil commented on August 17, 2024 2

In light of all this discussion, I think the most pragmatic solution for now (since we want to get this package on CRAN asap) would be that datawizard relies on insight, making the latter de facto the lowest level dependency in easystats.

from datawizard.

DominiqueMakowski commented on August 17, 2024 1

Actually, I was about to say that we should also consider moving down some of the essential helpers, like check_if_installed (and indeed, following @bwiernik ~~add an argument to check version~~ [edit: u did this already]) and format_table(). So that we can use them in datawizard (and regarding the formatting/printing of dataframes, it makes perfect sense to have them in the "data preparing" package). These would be used by insight...

Conceptually I think the hierarchy is logical and meaningful too: first we start by functions to work with data, then functions to retrieve data from models, then functions to analyze/postprocess specific aspects of models, then functions to report/visualize them. So I'd vote for datawizard as the lowest level (in the long run, tho we could start by having them independent I suppose).

The only reasons why we might want to refrain from (that that I can see) are

it might deter other developers from depending on insight if they see that it has a dependency: imo it's always the same story, we just have to explain that all the easystats packages could in fact well be just one package and that they were split for maintenance reasons but there is minimum risk of breakage due to some obscure upstream change (since the stream is actually closed) - which is the main problem with dependencies
because insight is such a key foundational stone that you do not want to depend on another package in case it breaks or something. But I'd say that the risk is mitigated since insight would actually depend on only a few robust functions (check_required, format_table and things like that), so I don't see how it can go wrong... And then we have to trust @IndrajeetPatil with his maintaining skills 😋 but so far he's been pretty aware/reactive when it comes to compatibility and stability so 🤷‍♂️
any other reasons you have in mind?

from datawizard.

bwiernik commented on August 17, 2024 1

See https://github.blog/2016-02-01-working-with-submodules/

For example, https://github.com/zotero/zotero here tracks the https://github.com/citation-style-language/styles repository and pulls in the latest versions of its built-in styles at build time.

I think that simulate_simpson() could live in either correlation or bayestestR. It's a pedagogical function and I think it reasonable can live with the other correlation analysis functions.

from datawizard.

strengejacke commented on August 17, 2024

But only those functions that are affected. The "new" data management functions can be safely moved to datawizard, I guess. But yes, we should prefer to not include circular dependencies here.

from datawizard.

IndrajeetPatil commented on August 17, 2024

I will list currently included functions in datawizard that have this issue:

adjust

datawizard/R/adjust.R

Lines 193 to 201 in d757fdc

 adjusted <- insight::get_residuals(model) 

 # Re-add intercept if need be 

 if (keep_intercept) { 

 intercept <- insight::get_intercept(model) 

 if (length(intercept) > 1) intercept <- stats::median(intercept) # For bayesian model 

 if (is.na(intercept)) intercept <- 0 

 adjusted <- adjusted + intercept 

 }

~~center (this can be solved by also moving format_message from insight to datawizard)~~

datawizard/R/center.R

Lines 191 to 201 in d757fdc

 if (length(unique(x)) == 2 && !is.factor(x) && !is.character(x)) { 

 if (verbose) { 

 if (is.null(name)) { 

 message(insight::format_message("The variable contains only two different values. Consider converting it to a factor.")) 

 } else { 

 message(insight::format_message(paste0("Variable `", name, "` contains only two different values. Consider converting it to a factor."))) 

 } 

 } 

 } 

 x 

 }

check_clusterstructure

datawizard/R/check_clusterstructure.R

Lines 31 to 45 in d757fdc

 if (H < 0.5) { 

 text <- paste0( 

 "The dataset is suitable for clustering (Hopkins' H = ", 

 insight::format_value(H), 

 ").\n" 

 ) 

 color <- "green" 

 } else { 

 text <- paste0( 

 "The dataset is not suitable for clustering (Hopkins' H = ", 

 insight::format_value(H), 

 ").\n" 

 ) 

 color <- "red" 

 }

demean (this is easy to solve; just not use color to print the message)

datawizard/R/demean.R

Lines 281 to 288 in d757fdc

 if (isTRUE(verbose)) { 

 insight::print_color( 

 sprintf( 

 "Categorical predictors (%s) have been coerced to numeric values to compute de- and group-meaned variables.\n", 

 paste0(names(categorical_predictors)[categorical_predictors], collapse = ", ") 

 ), 

 "yellow" 

 )

factor_analysis

datawizard/R/factor_analysis.R

Lines 70 to 88 in d757fdc

 if (!is.null(cor)) { 

 out <- model_parameters( 

 psych::fa( 

 cor, 

 nfactors = n, 

 rotate = rotation, 

 n.obs = nrow(x), 

 ... 

 ), 

 sort = sort, 

 threshold = threshold 

 ) 

 } else { 

 out <- model_parameters( 

 psych::fa(x, nfactors = n, rotate = rotation, ...), 

 sort = sort, 

 threshold = threshold 

 ) 

 }

n_factors (this is easy to solve; just not use color to print the message)

datawizard/R/n_factors.R

Line 348 in d757fdc

insight::print_color("# Method Agreement Procedure:\n\n", "blue")

reduce_parameters

datawizard/R/reduce_parameters.R

Lines 146 to 153 in d757fdc

 data <- reduce_parameters(convert_data_to_numeric(insight::get_predictors(x, ...), ...), method = method, n = n, distance = distance) 

 y <- data.frame(.row = 1:length(insight::get_response(x))) 

 y[insight::find_response(x)] <- insight::get_response(x) 

 y$.row <- NULL 

 formula <- paste(insight::find_response(x), "~", paste(paste0("`", names(data), "`"), collapse = " + ")) 

 stats::update(x, formula = formula, data = cbind(data, y))

~~rescale_weight (this can be solved by also moving format_message from insight to datawizard)~~

datawizard/R/rescale_weights.R

Line 110 in d757fdc

warning(insight::format_message(sprintf("Only one group variable selected, no nested structure possible. Rescaling weights for grout '%s' now.", group)), call. = FALSE)
~~simulate_simpson~~

datawizard/R/simulate_simpson.R

Line 38 in d757fdc

dat <- bayestestR::simulate_correlation(n = n, r = r)

from datawizard.

strengejacke commented on August 17, 2024

Does insight actually need to depend on datawizard?

from datawizard.

strengejacke commented on August 17, 2024

I'd vote for insight still being the package with lowest dependencies. I don't see where insight needs datawizard.

from datawizard.

IndrajeetPatil commented on August 17, 2024

Hmm, I thought datawizard is going to be the lowest level dependency with all the "helper" functions needed by other packages. But I see no reason why we can't have two such low-level dependencies:
one-containing data-wrangling related helpers, while the other containing modeling-related helpers.

So, no, insight won't need to depend on datawizard.

That still leaves open the issue of what to do with datawizard functions that rely on parameters and bayestestR.
Not include them here?

from datawizard.

bwiernik commented on August 17, 2024

Where did simulate_simpson() come from?

from datawizard.

DominiqueMakowski commented on August 17, 2024

correlation

from datawizard.

DominiqueMakowski commented on August 17, 2024

github submodules

what is this dark magic?

from datawizard.

IndrajeetPatil commented on August 17, 2024

I think that simulate_simpson() could live in either correlation or bayestestR. It's a pedagogical function and I think it reasonable can live with the other correlation analysis functions.

Good point. I think it can continue to stay in correlation. Removing it from datawizard now.

from datawizard.

DominiqueMakowski commented on August 17, 2024

I would prefer to have all the data simulation function together, mostly because it's tidy (besides making sense to be in the data package). I don't see any benefit of having its implementation in correlation it's not like the Simpson's paradox is inherently (or even particularly) related to correlations, we just had it there to illustrate a specific subtype of correlation method, but actually the being able to generate such data could be a useful thing beyond the scope of the correlation package

from datawizard.

IndrajeetPatil commented on August 17, 2024

Okay, retaining it then with the other simulate functions.

from datawizard.

IndrajeetPatil commented on August 17, 2024

Do you think that check_clusterstructure might sit better with other check_ functions in performance, although it's a bird of a different feather?

That'd solve our issue since performance already relies on insight.

from datawizard.

DominiqueMakowski commented on August 17, 2024

conceptually, it makes sense in datawizard, as it's about the structure in dataframe. Now, I'm far from being a stubborn radical idealist, so I'm okay with exceptions due to practical considerations :p

from datawizard.

IndrajeetPatil commented on August 17, 2024

haha, okay

But I think the most problematic is this one. datawizard can't for sure rely on parameters:

datawizard/R/factor_analysis.R

Lines 70 to 88 in d757fdc

 if (!is.null(cor)) { 

 out <- model_parameters( 

 psych::fa( 

 cor, 

 nfactors = n, 

 rotate = rotation, 

 n.obs = nrow(x), 

 ... 

 ), 

 sort = sort, 

 threshold = threshold 

 ) 

 } else { 

 out <- model_parameters( 

 psych::fa(x, nfactors = n, rotate = rotation, ...), 

 sort = sort, 

 threshold = threshold 

 ) 

 }

from datawizard.

strengejacke commented on August 17, 2024

I think functions like format_table() etc. are neither perfectly fitting into insight nor datawizard, these would actually require a new formatting package - but we shouldn't add another package for this. So, I still see no needs for insight to depend on datawizard. We should try to design those packages in a way that both are on the lowest level, independent from each other. To do this, we can indeed look at what @bwiernik suggested, importing certain functions like format_value() internally in datawizard. Currently, there are 5 packages that rely on insight, which are not from the easystats team. Therefore, I would vote against a) too radical breaking changes in insight, b) adding more package dependencies.

from datawizard.

IndrajeetPatil commented on August 17, 2024

@DominiqueMakowski I am removing simulate_ functions from datawizard because they require distributio_* functions from bayestestR.

Let them stay for now in correlation.

from datawizard.

IndrajeetPatil commented on August 17, 2024

I think these functions should actually be in bayestestR and then we can import and re-export them in correlation to not cause a breaking change.

from datawizard.

DominiqueMakowski commented on August 17, 2024

ok 😞

from datawizard.

IndrajeetPatil commented on August 17, 2024

Closing now since the builds are finally passing and we no longer have circular dependency issues.

from datawizard.

circular dependencies about datawizard HOT 23 CLOSED

Comments (23)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	adjusted <- insight::get_residuals(model)

	# Re-add intercept if need be
	if (keep_intercept) {
	intercept <- insight::get_intercept(model)
	if (length(intercept) > 1) intercept <- stats::median(intercept) # For bayesian model
	if (is.na(intercept)) intercept <- 0
	adjusted <- adjusted + intercept
	}

	if (length(unique(x)) == 2 && !is.factor(x) && !is.character(x)) {
	if (verbose) {
	if (is.null(name)) {
	message(insight::format_message("The variable contains only two different values. Consider converting it to a factor."))
	} else {
	message(insight::format_message(paste0("Variable `", name, "` contains only two different values. Consider converting it to a factor.")))
	}
	}
	}
	x
	}

	if (H < 0.5) {
	text <- paste0(
	"The dataset is suitable for clustering (Hopkins' H = ",
	insight::format_value(H),
	").\n"
	)
	color <- "green"
	} else {
	text <- paste0(
	"The dataset is not suitable for clustering (Hopkins' H = ",
	insight::format_value(H),
	").\n"
	)
	color <- "red"
	}

	if (isTRUE(verbose)) {
	insight::print_color(
	sprintf(
	"Categorical predictors (%s) have been coerced to numeric values to compute de- and group-meaned variables.\n",
	paste0(names(categorical_predictors)[categorical_predictors], collapse = ", ")
	),
	"yellow"
	)

	if (!is.null(cor)) {
	out <- model_parameters(
	psych::fa(
	cor,
	nfactors = n,
	rotate = rotation,
	n.obs = nrow(x),
	...
	),
	sort = sort,
	threshold = threshold
	)
	} else {
	out <- model_parameters(
	psych::fa(x, nfactors = n, rotate = rotation, ...),
	sort = sort,
	threshold = threshold
	)
	}

	data <- reduce_parameters(convert_data_to_numeric(insight::get_predictors(x, ...), ...), method = method, n = n, distance = distance)

	y <- data.frame(.row = 1:length(insight::get_response(x)))
	y[insight::find_response(x)] <- insight::get_response(x)
	y$.row <- NULL

	formula <- paste(insight::find_response(x), "~", paste(paste0("`", names(data), "`"), collapse = " + "))
	stats::update(x, formula = formula, data = cbind(data, y))