
Comments (9)

aeturrell commented on July 4, 2024

from lost-stats.github.io.

NickCH-K commented on July 4, 2024

It's not just a different syntax: data.table is also much, much faster than dplyr overall, both at reading in data and at manipulating it. There are syntax differences too, but those are cosmetic of course - there's actually a dtplyr package that converts most dplyr code into data.table syntax.

As for usage, my understanding is that it's considerably less popular than dplyr overall, especially in academia, but is perhaps more popular than dplyr among the data science crowd, since they work with big data sets. Neither dplyr nor data.table is base R though (although data.table is way closer). So it doesn't really resolve the non-tidyverse thing.

A page on big-data ingestion and storage methods would be great. There's already a page under Other on importing foreign files generally, but one that focuses on large data would be different enough to justify its own page, I think.


aeturrell commented on July 4, 2024

I see, thanks for explaining (and I might check it out next time I'm using R!).

It's definitely not my call but if the main difference is speed of manipulation then my view would be that it might be better on a different page addressing data manipulation of large datasets (/computationally intensive methods) because that seems to be the main use case.

I say this as I think the typical user looking for info on data manipulation might find the extra detail covering special cases (here, scaling to bigger problems) a bit overwhelming. Essentially, I wonder if it could detract from the clarity of the pages.

And then one could have a page explicitly addressing scaling/speed/big data issues, which are not trivial to assess and explain. (Lazy execution versus on-the-fly and in-memory for a start.)

However, I'm not familiar with the packages in question and I can see it both ways so 100% supportive if you decide to go another way.
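As a rough illustration of the lazy versus in-memory distinction raised above, here is a minimal Python sketch using only the standard library. It is not how data.table, dplyr, or any other package named in this thread is implemented; `make_rows`, `eager_total`, and `lazy_total` are hypothetical names invented for the example, and the toy records stand in for a large file.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, float]  # hypothetical row type: one record per dict


def make_rows(n: int) -> Iterator[Row]:
    """Yield n toy records one at a time (stand-in for reading a large file)."""
    for i in range(n):
        yield {"id": i, "group": i % 3, "value": float(i)}


def eager_total(rows: Iterable[Row]) -> float:
    """In-memory style: every intermediate result is materialised as a list."""
    table = list(rows)                            # whole dataset held in memory
    kept = [r for r in table if r["group"] == 0]  # second list: the filtered rows
    doubled = [r["value"] * 2 for r in kept]      # third list: transformed values
    return sum(doubled)


def lazy_total(rows: Iterable[Row]) -> float:
    """Lazy style: chained generators; no work happens until sum() pulls rows."""
    kept = (r for r in rows if r["group"] == 0)
    doubled = (r["value"] * 2 for r in kept)
    return sum(doubled)


if __name__ == "__main__":
    print(eager_total(make_rows(10)))
    print(lazy_total(make_rows(10)))
```

Both functions return the same total. The difference is that the eager version builds three lists along the way, while the lazy version only ever holds one row at a time, which is the kind of trade-off a page on scaling would need to explain.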


khwilson commented on July 4, 2024

R deeply confuses me on these points because I'm pretty sure that, with the addition of data.table, we're up to 4 major competing syntaxes for manipulating data in R: http://www.amelia.mn/Syntax-cheatsheet.pdf

There are many other distinctions about the internals of data.table or dask or d[tb]plyr or pandas or... but the original idea of this repository was to show you how to do the same things in many languages. In that way, the syntax issue probably overpowers the niceties of "when should I choose what" for this particular repository (though I look to the BDFL Nick's determination on that :-) ).

Perhaps the solution here is to actually lean into that conception of LOST? It may make sense, for instance, to set up every page as an L x E matrix where L is the number of languages LOST supports and E is the number of examples we have for a particular method? This would also make it pretty clear to newcomers what work there is to be accomplished. :-)

To Arthur's other point, having a "philosophy" page on "when you should use what" would, I think, also be great. It would help people answer the question, "Should I invest now in learning technology X instead of just hacking around with what I know?"


NickCH-K commented on July 4, 2024

It's more like three competing syntaxes - the formula syntax mentioned there is very rarely used for data manipulation, it's more for model creation (although it does sometimes pop up for stuff like reshaping wide/long). And two of those syntaxes - base (called "dollar sign syntax" there) and data.table - are pretty similar in a lot of ways (and base is IMO pretty clearly the weakest of the three).

I like the idea of "when you should use what" as a page - it would also help set apart the examples here more effectively from something like StackExchange, which answers a question and provides some code, but usually isn't great at comparing different approaches (or languages!).

The other alternative is treating R as a special case and, for the purposes of data manipulation pages, basically treating R dplyr and R data.table as two separate languages. They nearly are!


grantmcdermott commented on July 4, 2024

Quick 5c:

I've been thinking of doing this for quite a while, so I'm certainly in favour. Nick, we can divide and conquer if you'd like.

I agree tyranny of choice is a problem. But I think most R users would agree that dplyr (tidyverse) and data.table provide the canonical data manipulation methods. So having both makes sense to me. Yes, base R can do nearly everything too, but it's slower and more cumbersome to type.

On a less structured note, I'd like to see data.table included if for no other reason than that I'm such a fan of it, personally. The syntax is not to everyone's liking (I don't see a problem), but it really is the fastest and most powerful game in town for an astonishing number of applications.


grantmcdermott commented on July 4, 2024

Oh, should also add: We're really lacking on the Julia examples atm. I'll try to add some when I get time (ha!).


grantmcdermott commented on July 4, 2024

I think we can close this now. I've added one example here and will be encouraging my students to submit data.table equivalents (i.e. as part of their OSS contribution requirement in my class).


aeturrell commented on July 4, 2024

In case you're interested, there's now also Py datatable, polars, and even cuDF in the Python alternatives-to-pandas ecosystem. (Not sure adding examples with these is high value for now though.)
