Giter Club home page Giter Club logo

Comments (23)

bwiernik avatar bwiernik commented on August 17, 2024 2

Another option if there is a small number of utility functions (like the console coloring, etc.) that both insight and datawizard need, we could use github submodules to copy them over from insight to datawizard (or vice versa) at build time so that they can exist internally in both, but only be exported by one.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024 2

In light of all this discussion, I think the most pragmatic solution for now (since we want to get this package on CRAN asap) would be that datawizard relies on insight, making the latter de facto the lowest level dependency in easystats.

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024 1

Actually, I was about to say that we should also consider moving down some of the essential helpers, like check_if_installed (and indeed, following @bwiernik add an argument to check version [edit: u did this already]) and format_table(). So that we can use them in datawizard (and regarding the formatting/printing of dataframes, it makes perfect sense to have them in the "data preparing" package). These would be used by insight...

Conceptually I think the hierarchy is logical and meaningful too: first we start by functions to work with data, then functions to retrieve data from models, then functions to analyze/postprocess specific aspects of models, then functions to report/visualize them. So I'd vote for datawizard as the lowest level (in the long run, tho we could start by having them independent I suppose).

The only reasons why we might want to refrain from (that that I can see) are

  1. it might deter other developers from depending on insight if they see that it has a dependency: imo it's always the same story, we just have to explain that all the easystats packages could in fact well be just one package and that they were split for maintenance reasons but there is minimum risk of breakage due to some obscure upstream change (since the stream is actually closed) - which is the main problem with dependencies
  2. because insight is such a key foundational stone that you do not want to depend on another package in case it breaks or something. But I'd say that the risk is mitigated since insight would actually depend on only a few robust functions (check_required, format_table and things like that), so I don't see how it can go wrong... And then we have to trust @IndrajeetPatil with his maintaining skills 😋 but so far he's been pretty aware/reactive when it comes to compatibility and stability so 🤷‍♂️
  3. any other reasons you have in mind?

from datawizard.

bwiernik avatar bwiernik commented on August 17, 2024 1

See https://github.blog/2016-02-01-working-with-submodules/

For example, https://github.com/zotero/zotero here tracks the https://github.com/citation-style-language/styles repository and pulls in the latest versions of its built-in styles at build time.

I think that simulate_simpson() could live in either correlation or bayestestR. It's a pedagogical function and I think it reasonable can live with the other correlation analysis functions.

from datawizard.

strengejacke avatar strengejacke commented on August 17, 2024

But only those functions that are affected. The "new" data management functions can be safely moved to datawizard, I guess. But yes, we should prefer to not include circular dependencies here.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

I will list currently included functions in datawizard that have this issue:

  1. adjust

    datawizard/R/adjust.R

    Lines 193 to 201 in d757fdc

    adjusted <- insight::get_residuals(model)
    # Re-add intercept if need be
    if (keep_intercept) {
    intercept <- insight::get_intercept(model)
    if (length(intercept) > 1) intercept <- stats::median(intercept) # For bayesian model
    if (is.na(intercept)) intercept <- 0
    adjusted <- adjusted + intercept
    }

  2. center (this can be solved by also moving format_message from insight to datawizard)

    datawizard/R/center.R

    Lines 191 to 201 in d757fdc

    if (length(unique(x)) == 2 && !is.factor(x) && !is.character(x)) {
    if (verbose) {
    if (is.null(name)) {
    message(insight::format_message("The variable contains only two different values. Consider converting it to a factor."))
    } else {
    message(insight::format_message(paste0("Variable `", name, "` contains only two different values. Consider converting it to a factor.")))
    }
    }
    }
    x
    }

  3. check_clusterstructure

    if (H < 0.5) {
    text <- paste0(
    "The dataset is suitable for clustering (Hopkins' H = ",
    insight::format_value(H),
    ").\n"
    )
    color <- "green"
    } else {
    text <- paste0(
    "The dataset is not suitable for clustering (Hopkins' H = ",
    insight::format_value(H),
    ").\n"
    )
    color <- "red"
    }

  4. demean (this is easy to solve; just not use color to print the message)

    datawizard/R/demean.R

    Lines 281 to 288 in d757fdc

    if (isTRUE(verbose)) {
    insight::print_color(
    sprintf(
    "Categorical predictors (%s) have been coerced to numeric values to compute de- and group-meaned variables.\n",
    paste0(names(categorical_predictors)[categorical_predictors], collapse = ", ")
    ),
    "yellow"
    )

  5. factor_analysis

    if (!is.null(cor)) {
    out <- model_parameters(
    psych::fa(
    cor,
    nfactors = n,
    rotate = rotation,
    n.obs = nrow(x),
    ...
    ),
    sort = sort,
    threshold = threshold
    )
    } else {
    out <- model_parameters(
    psych::fa(x, nfactors = n, rotate = rotation, ...),
    sort = sort,
    threshold = threshold
    )
    }

  6. n_factors (this is easy to solve; just not use color to print the message)

    insight::print_color("# Method Agreement Procedure:\n\n", "blue")

  7. reduce_parameters

    data <- reduce_parameters(convert_data_to_numeric(insight::get_predictors(x, ...), ...), method = method, n = n, distance = distance)
    y <- data.frame(.row = 1:length(insight::get_response(x)))
    y[insight::find_response(x)] <- insight::get_response(x)
    y$.row <- NULL
    formula <- paste(insight::find_response(x), "~", paste(paste0("`", names(data), "`"), collapse = " + "))
    stats::update(x, formula = formula, data = cbind(data, y))

  8. rescale_weight (this can be solved by also moving format_message from insight to datawizard)

    warning(insight::format_message(sprintf("Only one group variable selected, no nested structure possible. Rescaling weights for grout '%s' now.", group)), call. = FALSE)

  9. simulate_simpson

    dat <- bayestestR::simulate_correlation(n = n, r = r)

from datawizard.

strengejacke avatar strengejacke commented on August 17, 2024

Does insight actually need to depend on datawizard?

from datawizard.

strengejacke avatar strengejacke commented on August 17, 2024

I'd vote for insight still being the package with lowest dependencies. I don't see where insight needs datawizard.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

Hmm, I thought datawizard is going to be the lowest level dependency with all the "helper" functions needed by other packages. But I see no reason why we can't have two such low-level dependencies:
one-containing data-wrangling related helpers, while the other containing modeling-related helpers.

So, no, insight won't need to depend on datawizard.

That still leaves open the issue of what to do with datawizard functions that rely on parameters and bayestestR.
Not include them here?

from datawizard.

bwiernik avatar bwiernik commented on August 17, 2024

Where did simulate_simpson() come from?

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024

correlation

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024

github submodules

what is this dark magic?

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

I think that simulate_simpson() could live in either correlation or bayestestR. It's a pedagogical function and I think it reasonable can live with the other correlation analysis functions.

Good point. I think it can continue to stay in correlation. Removing it from datawizard now.

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024

I would prefer to have all the data simulation function together, mostly because it's tidy (besides making sense to be in the data package). I don't see any benefit of having its implementation in correlation it's not like the Simpson's paradox is inherently (or even particularly) related to correlations, we just had it there to illustrate a specific subtype of correlation method, but actually the being able to generate such data could be a useful thing beyond the scope of the correlation package

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

Okay, retaining it then with the other simulate functions.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

Do you think that check_clusterstructure might sit better with other check_ functions in performance, although it's a bird of a different feather?

That'd solve our issue since performance already relies on insight.

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024

conceptually, it makes sense in datawizard, as it's about the structure in dataframe. Now, I'm far from being a stubborn radical idealist, so I'm okay with exceptions due to practical considerations :p

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

haha, okay

But I think the most problematic is this one. datawizard can't for sure rely on parameters:

if (!is.null(cor)) {
out <- model_parameters(
psych::fa(
cor,
nfactors = n,
rotate = rotation,
n.obs = nrow(x),
...
),
sort = sort,
threshold = threshold
)
} else {
out <- model_parameters(
psych::fa(x, nfactors = n, rotate = rotation, ...),
sort = sort,
threshold = threshold
)
}

from datawizard.

strengejacke avatar strengejacke commented on August 17, 2024

I think functions like format_table() etc. are neither perfectly fitting into insight nor datawizard, these would actually require a new formatting package - but we shouldn't add another package for this. So, I still see no needs for insight to depend on datawizard. We should try to design those packages in a way that both are on the lowest level, independent from each other. To do this, we can indeed look at what @bwiernik suggested, importing certain functions like format_value() internally in datawizard. Currently, there are 5 packages that rely on insight, which are not from the easystats team. Therefore, I would vote against a) too radical breaking changes in insight, b) adding more package dependencies.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

@DominiqueMakowski I am removing simulate_ functions from datawizard because they require distributio_* functions from bayestestR.

Let them stay for now in correlation.

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

I think these functions should actually be in bayestestR and then we can import and re-export them in correlation to not cause a breaking change.

from datawizard.

DominiqueMakowski avatar DominiqueMakowski commented on August 17, 2024

ok 😞

from datawizard.

IndrajeetPatil avatar IndrajeetPatil commented on August 17, 2024

Closing now since the builds are finally passing and we no longer have circular dependency issues.

from datawizard.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.