We could call the package b-sides, which implies no narrow scope but rather a loose collection of cool ("underground") functions that didn't make it into the main packages. :-)
from easystats.
I'd say first let us stabilize the current ecosystem (release all the current packages), so people can finally have an easily installable ecosystem.
Then yes, it'd probably make sense, for visibility and consistency, to move them into a new home. The question is what would be the scope of that home, as data reduction might be a bit too specific (?)
@IndrajeetPatil was mentioning something along the lines of "multivariate", which indeed makes sense as these functions deal with multiple variables. Or it could be dedicated to functions dealing with data frames, "easydata", to summarise, describe, and reduce variables and observations. Or something else...
brainstorming call turned on
from easystats.
I think that everything from `performance` under *Check Items and Scales* should also have a place in such a package. There are also a number of functions for data prep in general that might have a place in such a package? I like the idea of an `easydata` package ^_^
`parameters`:
- `check_heterogeneity()`
- `demean()` | Compute group-meaned and de-meaned variables
- `rescale_weights()` | Rescale design weights for multilevel analysis
- `data_partition()` | Partition data into a test and a training set
- `convert_data_to_numeric()` / `data_to_numeric()` | Convert data to numeric

`effectsize`:
- `adjust()` | Adjust data for the effect of other variable(s)
- `change_scale()` | Rescale a numeric variable
- `normalize()` | Normalization
- `ranktransform()` | (Signed) rank transformation
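Most of these take a data frame (or vector) and return transformed data. As a flavor of what a `demean()`-style function computes, here is a hedged base-R sketch (illustrative only, not the actual `parameters` implementation): the variable is split into group means (the "between" component) and within-group deviations (the "within" component).

```r
# Hedged sketch of group-mean centering ("demeaning") -
# not the parameters::demean() code, just the underlying idea:
d <- data.frame(
  id = rep(1:3, each = 4),
  x  = c(1, 2, 3, 4, 2, 4, 6, 8, 1, 1, 2, 2)
)
d$x_between <- ave(d$x, d$id)     # group means ("between" component)
d$x_within  <- d$x - d$x_between  # de-meaned values ("within" component)
```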
from easystats.
@strengejacke I really like the idea of a data reduction package. My two cents: from FA, PCA, and clustering, to other data reduction techniques (e.g., autoencoders and multidimensional scaling), more and more people are incorporating data reduction into their broader research projects these days, as people are dealing with bigger and bigger data. Further, in the parametric world, techniques like regularization are also useful in reducing the complexity of a feature space. So I think a package dedicated to data reduction is a good idea; however, we would need a very clear philosophy underlying the package, as there are so many ways to conceptualize, and thus implement, data reduction.
from easystats.
ok, I see your point. `bayestestR` has the trailing upper-case R for historical reasons, but in general, I think that `dataprep` would fit the "package naming conventions" better than `dataprepR`. Furthermore, `prepR` sounds similar to "prepper", which has a touch of conspiracy theories ;-)
from easystats.
Yeah, moving functions from `performance` and `parameters` that are related to PCA/FA/cluster analysis/items/scales and related checks into a separate package makes sense, indeed.
Now we need a name for that package...
from easystats.
We could also put some of the functions to simulate data there; those are currently in `bayestestR` and `modelbased`.
As for the name...
"easydata", "data", "datamanip", "datacleaner", "datacleanR", ...
We can scope it as "the package to clean, manipulate, transform, summarise or reduce your data"
from easystats.
I'm thinking in terms of "analysis type". Thus, "item analysis" etc. covers these PCA/FA/cluster functions, which are not regression models but rather "summarise" or "reduce" the data somehow, and these can go into a separate package - however, I would focus on those functions, and not necessarily include data-cleaning functions in such a package as well. What do you say?
I would also argue that very focused data preparation functions like `demean()` or `rescale_weights()` are so closely connected to regression modelling that I would consider them "tool" functions belonging in the related package(s), rather than in a separate "data preparation" package - although it might also make sense to generally collect data-preparation functions in a package of their own.
So, one or two packages? And, which name(s)?
from easystats.
I see what you mean - you would divide it like "item analysis" vs. "regression analysis"? For me it's more based on the main input. For instance, most of our other packages take in a model and work on it, but there are also functions that take data in (or out) and really work on the "data level".
At the same time, I am worried that two packages would spread the butter too thin (and we'll become our own dependency nightmare). Also, data preparation (i.e., for regression), reduction (PCA), summary (potentially clustering stuff), and structure exploration (FA, cluster) are all about data, so they fit within a data-oriented package. Same for `demean()`, even though its scope is a bit narrower. I'm not sure about `rescale_weights()`, but if there are a few exceptions for now, I'd say it's okay.
"easydata", "dataprep", "datacleaner", "datacleanR", "datawork", "dataworker"...
from easystats.
> For me it's more like based on the main input. For instance most of our other packages take in a model and work on it. But there are also functions that take data in (or out) and that really work on the "data level".

I definitely agree with this idea!
The package would take data as input, and return data that can then be used in models etc. So both `demean()` and all the item/reduction/PCA/clustering stuff (a lot of which is an easy wrapper around other packages) would also go here.
I like `easydata` or `dataprepR` - "the package to clean, manipulate, transform, summarise or reduce your data".
from easystats.
How about easyprepR (as in “easily prepare your data for analysis”)?
from easystats.
If the data is the focus of the package, then having "data" in the name would be clear imo, but `dataprepR` or `dataprepare` or `dataprep` is rather transparent indeed :)
from easystats.
I agree we should avoid the trailing upper-case R (it's true that `bayestestR` is a historical exception). I like `easydata`, but it would be a bit of an exception because no other easystats package has "easy" in the name... so maybe `dataprep`, or `dataready` (get your data ready for analysis), or `dataperfect` (make your data perfect before analysis), or `datamanage` (manage your data before analysis), or...
from easystats.
This is a great conversation, all. I have just a couple of thoughts, as I am currently working on a couple of books related to a few of these topics. From @mattansb:

> the package to clean, manipulate, transform, summarise or reduce your data

I worry that this is much too broad for a single package, or even a couple of packages. For example, what is meant by "cleaning"? At a strictly data-level, the answer is pretty much all covered by `dplyr` already. For modeling-specific tasks, data cleaning varies so widely that a single package would be hard-pressed to make a lot of people happy (or at the very least be beneficial to a wide range of users). The same could be said for each of the other utilities listed above. E.g., the definition and use of "reducing" data (manifold learning, dimension reduction, representation/embedding, etc.) varies so widely by domain and function that including some sort of `pca_*()` function might be at best redundant, or at worst limited. And so on for feature engineering in the cleaning, summary, and transformation realms, etc.

I like the idea of thinking about data explicitly. But I am not sure such a wide net makes theoretical sense. Instead, perhaps, we could focus (at least at the beginning) on just very basic mathematical transformations of features/data, such as scaling, weighting schemes, demeaning, centering, and so on. But I am a bit hesitant to push further to clustering, dimension reduction, etc.

Not to be a naysayer... but does anyone have any thoughts here? Thanks all!
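For concreteness, the "very basic mathematical transformations" in question can each be sketched in a line of base R (illustrative only; the easystats functions add argument handling, data-frame support, and many more options):

```r
# Base-R sketches of the basic transformations discussed above
# (illustrative only, not the easystats implementations):
x <- c(2, 4, 6, 10)

x_centered   <- x - mean(x)                      # centering
x_scaled     <- (x - mean(x)) / sd(x)            # standardizing (z-score)
x_normalized <- (x - min(x)) / (max(x) - min(x)) # rescale to [0, 1]
x_ranked     <- rank(x)                          # rank transformation
```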
from easystats.
Come to think of it, I agree with @pdwaggoner. Changing the scope of easystats to such a package is a big task to undertake...

If I may pivot completely: much of the functionality we're talking about here, which to some extent falls under "statistical variable engineering" (?), is done by `recipes`, which I love. Perhaps we can offer custom recipes, such as `demean`, `adjust`, `change_scale`, `normalize`, `rescale_weights`, in a mini package with custom recipes (`easyrecipes`?).

This is completely outside the easystats low-dependency philosophy, but (so is `see`), and if the package is small and focused, I think it might work... (This is what I would probably do if I were working with a blank canvas...)
from easystats.
Great thoughts @mattansb! I too love and use `recipes` all the time. I had no idea, though, that `tidymodels` supported a democratized effort like this. Very cool and super useful! I definitely think there is something here...
from easystats.
> Come to think of it, I agree with @pdwaggoner. Changing the scope of easystats to such a package is a big task to undertake... If I may pivot completely: much of the functionality we're talking about here, which to some extent falls under "statistical variable engineering" (?), is done by `recipes`, which I love. Perhaps we can offer custom recipes, such as `demean`, `adjust`, `change_scale`, `normalize`, `rescale_weights`, in a mini package with custom recipes (`easyrecipes`?). This is completely outside the easystats low-dependency philosophy, but (so is `see`), and if the package is small and focused, I think it might work... (This is what I would probably do if I were working with a blank canvas...)

I personally would refrain from connecting a package with our functions to the tidymodels API, since it is indeed very "bloated" in terms of dependencies (and requires tibbles (?)). I also think the API we use for functions such as `demean()` (with a data frame as input and `select`, `group`, etc. as arguments) is quite beginner-friendly and more "natural".
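To illustrate the "data frame in, data frame out" style described above (a hedged sketch: the argument names follow the thread, and the actual signature in `parameters` may differ slightly between versions):

```r
# Hypothetical call in the beginner-friendly style described above:
# a plain data frame as input, plus select/group arguments
# (argument names as discussed in this thread; check ?demean).
library(parameters)
iris_dm <- demean(iris, select = "Sepal.Length", group = "Species")
head(iris_dm)
```

A `recipes`-style equivalent would instead require defining a recipe and a step, then calling `prep()` and `bake()` before the transformed data is available, which is the extra ceremony (and dependency weight) being weighed here.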
from easystats.
closing in favor of #97
from easystats.