We could call the package b-sides, which implies no narrow scope but rather a loose collection of cool ("underground") functions that didn't make it into the main packages. :-)
from easystats.
I'd say first let us stabilize the current ecosystem (release all the current packages), so people can finally have an easily installable ecosystem.
Then yes, it'd probably make sense, for visibility and consistency, to move them into a new home. The question is what would be the scope of that home, as data reduction might be a bit too specific (?)
@IndrajeetPatil was mentioning something along the lines of "multivariate", which indeed makes sense as these functions deal with multiple variables. Or it could be dedicated to functions dealing with data frames, "easydata", to summarise, describe, and reduce variables and observations. Or something else...
brainstorming call turned on
from easystats.
I think that everything from `performance` under *Check Items and Scales* should also have a place in such a package. There are also a number of functions for data prep in general that might have a place in such a package? I like the idea of an `easydata` package ^_^
`parameters`:
- `check_heterogeneity()`
- `demean()` | Compute group-meaned and de-meaned variables
- `rescale_weights()` | Rescale design weights for multilevel analysis
- `data_partition()` | Partition data into a test and a training set
- `convert_data_to_numeric()` / `data_to_numeric()` | Convert data to numeric

`effectsize`:
- `adjust()` | Adjust data for the effect of other variable(s)
- `change_scale()` | Rescale a numeric variable
- `normalize()` | Normalization
- `ranktransform()` | (Signed) rank transformation
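Most of these take a data frame (or vector) and return transformed data. As a flavor of what a `demean()`-style function computes, here is a hedged base-R sketch (illustrative only, not the actual `parameters` implementation): the variable is split into group means (the "between" component) and within-group deviations (the "within" component).

```r
# Hedged sketch of group-mean centering ("demeaning") -
# not the parameters::demean() code, just the underlying idea:
d <- data.frame(
  id = rep(1:3, each = 4),
  x  = c(1, 2, 3, 4, 2, 4, 6, 8, 1, 1, 2, 2)
)
d$x_between <- ave(d$x, d$id)     # group means ("between" component)
d$x_within  <- d$x - d$x_between  # de-meaned values ("within" component)
```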
from easystats.
@strengejacke I really like the idea of a data reduction package. My two cents: from FA, PCA, and clustering, to other data reduction techniques (e.g., autoencoders and multidimensional scaling), more and more people are incorporating data reduction into their broader research projects these days, as people are dealing with bigger and bigger data. Further, in the parametric world, techniques like regularization are also useful in reducing the complexity of a feature space. So I think a package dedicated to data reduction is a good idea; however, we would need a very clear philosophy underlying the package, as there are so many ways to conceptualize, and thus implement, data reduction.
from easystats.
ok, I see your point. `bayestestR` has the trailing upper-case R for historical reasons, but in general, I think that `dataprep` would fit the "package naming conventions" better than `dataprepR`. Furthermore, `prepR` sounds similar to "prepper", which has a touch of conspiracy theories ;-)
from easystats.
Yeah, moving functions from `performance` and `parameters` that are related to PCA/FA/cluster analysis/items/scales and related checks into a separate package makes sense, indeed.
Now we need a name for that package...
from easystats.
We could also put some of the functions to simulate data there; those are currently in `bayestestR` and `modelbased`.
As for the name...
"easydata", "data", "datamanip", "datacleaner", "datacleanR", ...
We can scope it as "the package to clean, manipulate, transform, summarise or reduce your data"
from easystats.
I'm thinking in terms of "analysis type". Thus, "item analysis" etc. covers these PCA/FA/cluster functions, which are not regression models but rather "summarise" or "reduce" the data somehow, and these can go into a separate package - however, I would focus on those functions, and not necessarily include data-cleaning functions in such a package as well. What do you say?
I would also argue that very focused data preparation functions like `demean()` or `rescale_weights()` are so closely connected to regression modelling that I would consider them "tool" functions belonging in the related package(s), rather than in a separate "data preparation" package - although it might also make sense to generally collect data-preparation functions in a package of their own.
So, one or two packages? And, which name(s)?
from easystats.
I see what you mean - you would divide it like "item analysis" vs. "regression analysis"? For me it's more based on the main input. For instance, most of our other packages take in a model and work on it, but there are also functions that take data in (or out) and really work on the "data level".
At the same time, I am worried that two packages would spread the butter too thin (and we'll become our own dependency nightmare). Also, data preparation (i.e., for regression), reduction (PCA), summary (potentially clustering stuff), and structure exploration (FA, cluster) are all about data, so they fit within a data-oriented package. Same for `demean()`, even though its scope is a bit narrower. I'm not sure about `rescale_weights()`, but if there are a few exceptions for now, I'd say it's okay.
"easydata", "dataprep", "datacleaner", "datacleanR", "datawork", "dataworker"...
from easystats.
> For me it's more like based on the main input. For instance most of our other packages take in a model and work on it. But there are also functions that take data in (or out) and that really work on the "data level".

I definitely agree with this idea!
The package would take data as input, and return data that can then be used in models etc. So both `demean()` and all the item/reduction/PCA/clustering stuff (a lot of which is an easy wrapper around other packages) would also go here.
I like `easydata` or `dataprepR` - "the package to clean, manipulate, transform, summarise or reduce your data".
from easystats.
How about easyprepR (as in “easily prepare your data for analysis”)?
from easystats.
If the data is the focus of the package, then having "data" in the name would be clear imo, but `dataprepR` or `dataprepare` or `dataprep` is rather transparent indeed :)
from easystats.
I agree we should avoid the trailing upper-case R (it's true that `bayestestR` is a historical exception). I like `easydata`, but it would be a bit of an exception because no other easystats package has "easy" in the name... so maybe `dataprep`, or `dataready` (get your data ready for analysis), or `dataperfect` (make your data perfect before analysis), or `datamanage` (manage your data before analysis), or...
from easystats.
This is a great conversation, all. I have just a couple of thoughts, as I am currently working on a couple of books related to a few of these topics. From @mattansb:

> the package to clean, manipulate, transform, summarise or reduce your data

I worry that this is much too broad for a single package, or even a couple of packages. For example, what is meant by "cleaning"? At a strictly data-level, the answer is pretty much all covered by `dplyr` already. For modeling-specific tasks, data cleaning varies so widely that a single package would be hard-pressed to make a lot of people happy (or at the very least be beneficial to a wide range of users). The same could be said for each of the other utilities listed above. E.g., the definition and use of "reducing" data (manifold learning, dimension reduction, representation/embedding, etc.) varies so widely by domain and function that including some sort of `pca_*()` function might be at best redundant, or at worst limited. And so on for feature engineering in the cleaning, summary, and transformation realms, etc.

I like the idea of thinking about data explicitly. But I am not sure such a wide net makes theoretical sense. Instead, perhaps, we could focus (at least at the beginning) on just very basic mathematical transformations of features/data, such as scaling, weighting schemes, demeaning, centering, and so on. But I am a bit hesitant to push further to clustering, dimension reduction, etc.

Not to be a naysayer... but does anyone have any thoughts here? Thanks all!
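For concreteness, the "very basic mathematical transformations" in question can each be sketched in a line of base R (illustrative only; the easystats functions add argument handling, data-frame support, and many more options):

```r
# Base-R sketches of the basic transformations discussed above
# (illustrative only, not the easystats implementations):
x <- c(2, 4, 6, 10)

x_centered   <- x - mean(x)                      # centering
x_scaled     <- (x - mean(x)) / sd(x)            # standardizing (z-score)
x_normalized <- (x - min(x)) / (max(x) - min(x)) # rescale to [0, 1]
x_ranked     <- rank(x)                          # rank transformation
```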
from easystats.
Come to think of it, I agree with @pdwaggoner. Changing the scope of easystats to such a package is a big task to undertake...

If I may pivot completely: much of the functionality we're talking about here, which to some extent falls under "statistical variable engineering" (?), is done by `recipes`, which I love. Perhaps we can offer custom recipes, such as `demean`, `adjust`, `change_scale`, `normalize`, `rescale_weights`, in a mini package with custom recipes (`easyrecipes`?).

This is completely outside the easystats low-dependency philosophy, but (so is `see`), and if the package is small and focused, I think it might work... (This is what I would probably do if I were working with a blank canvas...)
from easystats.
Great thoughts @mattansb! I too love and use `recipes` all the time. I had no idea, though, that `tidymodels` supported a democratized effort like this. Very cool and super useful! I definitely think there is something here...
from easystats.
> Come to think of it, I agree with @pdwaggoner. Changing the scope of easystats to such a package is a big task to undertake... If I may pivot completely: much of the functionality we're talking about here, which to some extent falls under "statistical variable engineering" (?), is done by `recipes`, which I love. Perhaps we can offer custom recipes, such as `demean`, `adjust`, `change_scale`, `normalize`, `rescale_weights`, in a mini package with custom recipes (`easyrecipes`?). This is completely outside the easystats low-dependency philosophy, but (so is `see`), and if the package is small and focused, I think it might work... (This is what I would probably do if I were working with a blank canvas...)

I personally would refrain from connecting a package with our functions to the tidymodels API, since it is indeed very "bloated" in terms of dependencies (and requires tibbles (?)). I also think the API we use for functions such as `demean()` (with a data frame as input and `select`, `group`, etc. as arguments) is quite beginner-friendly and more "natural".
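To illustrate the "data frame in, data frame out" style described above (a hedged sketch: the argument names follow the thread, and the actual signature in `parameters` may differ slightly between versions):

```r
# Hypothetical call in the beginner-friendly style described above:
# a plain data frame as input, plus select/group arguments
# (argument names as discussed in this thread; check ?demean).
library(parameters)
iris_dm <- demean(iris, select = "Sepal.Length", group = "Species")
head(iris_dm)
```

A `recipes`-style equivalent would instead require defining a recipe and a step, then calling `prep()` and `bake()` before the transformed data is available, which is the extra ceremony (and dependency weight) being weighed here.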
from easystats.
closing in favor of #97
from easystats.