matloff / r-vs.-python-for-data-science

r-vs.-python-for-data-science's Issues

Graphical User Interfaces

An area where R has a clear edge is in the availability of point-and-click Graphical User Interfaces. I prefer to code, but I'm generally part of a research team where at least half the members prefer to use menus & dialog boxes. I like that I can put my R code behind their dialogs, or take the code their GUI writes and modify it. For my comparison of the GUI options for R, see http://r4stats.com/articles/software-reviews/r-gui-comparison/. I'm unaware of any GUIs for Python yet, but if you know of any, please let me know.

Some comments

Elegance

I appreciate Python's elegance too, but it's also true that spaces-vs-tabs issues are a real pain.
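As a small illustration of the spaces-vs-tabs pain mentioned above, here is a hedged sketch. The helper name `check_indentation` is made up for this example. It compiles a snippet whose two body lines look aligned in an editor with 8-column tabs, but where one line uses a tab and the other uses eight spaces.

```python
# The first body line is indented with one tab, the second with eight spaces.
mixed_source = "if True:\n\tx = 1\n        y = 2\n"


def check_indentation(source: str) -> str:
    """Return 'ok' if the source compiles, else the name of the error raised."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:  # TabError and IndentationError are subclasses
        return type(err).__name__
    return "ok"


# Same snippet with the tab replaced by eight spaces.
clean_source = mixed_source.replace("\t", " " * 8)

result_mixed = check_indentation(mixed_source)
result_clean = check_indentation(clean_source)
print(result_mixed, result_clean)
```

Python 3 refuses the mixed version rather than guessing, so the problem surfaces as an error and not as silently wrong behaviour.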

Learning curve

I wouldn't call it a huge win for R. It's true that for data science, base R gives you almost everything you need, which Python does not. But how many people work with plain R?

And putting aside the tools that are specific to data science, i.e., talking about the language itself (which is the first thing you need to master to start learning data science), that's a win for Python, because I think it's far more intuitive and easier to learn. R has many strange things that are unique to it, such as the ability to modify itself, NSE, etc. These are versatile features, but hard to understand and master.

All in all, I would call it a tie.

Available libraries

I don't see many Python data science libraries backed by an academic publication, and that's a small win for R, in my opinion.

Machine learning

The big actors are pushing for Python here, that's the truth. R tries to follow, but it's still behind.

Statistical correctness

I reaffirm what I said before about academic publications. I think it's important to highlight this point.

Object orientation, metaprogramming

I think that these categories deserve separate comments. I also like R's metaprogramming capabilities very much (they are great, but they make the language harder to learn, as I argued before). But I don't think it's fair to defend R's seriousness in treating functions as objects and, at the same time, to defend R's OOP mess over Python's seriousness in that regard.

Language unity

I don't know whether the RStudio people see data.table as a competitor, but I doubt it, for a few reasons. I don't think that dplyr and data.table are in the same league, or serve the same purpose, because dplyr does not provide a new data frame backend (that would be tibble, but tibbles are just data frames with attributes, so it's not competing either). dplyr's purpose is to define a standard data wrangling interface that is independent of the source: a data frame, a database... or even a data table, because there's even the dtplyr package, a data.table backend for dplyr, developed by Hadley himself.

I don't understand what exactly The Tidyverse Curse is. Is it the pipe? (Which was there before the tidyverse, BTW.) Because you can use tidyverse functions without the pipe, and the look and feel would be very similar to the subset/transform/aggregate/reshape workflow you could do with base R. And you could use base R with the pipe too.

Linked data structures

Many times I need this, probably due to my CS background, and I would call it a big win for Python.
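To make the point concrete, here is a minimal sketch of a singly linked list in Python. The class and method names (`Node`, `LinkedList`, `push_front`, `reverse`) are illustrative choices, not from the article. The sketch relies on Python objects being mutable references, so a node can be relinked in place.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Node:
    value: Any
    next: Optional["Node"] = None


class LinkedList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def push_front(self, value: Any) -> None:
        # The new node points at the old head and becomes the new head.
        self.head = Node(value, self.head)

    def __iter__(self) -> Iterator[Any]:
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def reverse(self) -> None:
        # Relink every node in place; no node is copied.
        prev = None
        node = self.head
        while node is not None:
            following = node.next
            node.next = prev
            prev = node
            node = following
        self.head = prev


lst = LinkedList()
for item in (1, 2, 3):
    lst.push_front(item)
before = list(lst)
lst.reverse()
after = list(lst)
print(before, after)
```

R's copy-on-modify semantics make this kind of in-place relinking less direct, which is presumably part of what the comment has in mind.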

Packages

Package development and the CRAN infrastructure are a huge win for R. I was surprised that there's no mention of this.

R elegance

No R style guide I know of suggests putting the opening curly brace on its own line; all the guides I've read say it should go at the end of the opening line. The current version is not representative of R practice and would be more elegant with one line fewer.

Could you recommend libraries and books for learning data science in R and Python?

Hi Prof Matloff and other friends,

I am not a beginner, but not far from one. I studied a bit (though not much) of statistics and machine learning a while ago, before switching to computer science (the general, non-machine-learning side). I am thinking about coming back to statistics and machine learning, or data science (a popular term nowadays that I don't fully understand), and starting with some self-study.

I was wondering if any of you could provide your opinions/recommendations about:

Which libraries in R and Python should I learn? (I am not afraid of learning many, but I am unsure how to choose among so many libraries.)
Which books on R and Python would you recommend? (Mainly on the pragmatic side of statistics, machine learning, or data science. Some books on programming and languages would also be appreciated.)
Which applications of statistics, machine learning or data science are popular in industry? NLP, computer vision, ...? (In academia, I guess biostatistics, bioinformatics, econometrics?)

Thanks.

Data.table being sidelined by RStudio?

Thanks for the great write-up; it certainly provided some food for thought.

I have a question though. Under language unity you write:

"For instance, consider the lightning-fast data.table package. Rather than welcoming it as a hugely valuable contribution to R, RStudio has treated it as a competitor, downplaying it and promoting their own product, dplyr. This is simply not healthy for an open-source language."

I wonder if you could give examples or elaborate on that. I've actually seen some evidence to the contrary in the dtplyr package which is actively being developed by Hadley Wickham and is designed to integrate both packages.

Different Python versions

I'm a CRAN package developer and I have had many people call for a Python implementation of one of my packages. I got a student to create this and we have it all working in base Python but I wanted to take advantage of the same C code that the R package uses.

This is when I ran into Python problems. I've tried several recommended ways of interfacing existing C code to Python, but for each way there are Python installs that cannot download and use my package! You have to be using the "right" version of Python in order to be able to use the package. This is just my understanding, so please do correct me if I'm wrong on this.

This, to me, is a huge downside of Python right now. Why should a user have to change their software to run my package? I can see an argument for forcing users to upgrade, but most Python packages that run on 2.7 don't run on 2.6 or 3.0. At least with R, packages get checked and removed if authors don't update them.
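For reference, one way of calling existing C code from Python is the standard-library `ctypes` module, which loads a shared library at run time instead of compiling an extension against one specific Python version. The sketch below is only an illustration. It assumes a POSIX system and calls `strlen` from the C standard library as a stand-in for the package's own C code. Whether this approach would have solved the problem described above is not known.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the symbols already
# loaded into the process, including the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Length in bytes of the UTF-8 encoding of text, computed by C strlen."""
    return libc.strlen(text.encode("utf-8"))


length = c_strlen("data.table")
print(length)
```

A real package would load its own compiled shared library by path instead of `None`, and that library would still have to be built for each platform.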

You didn't show any benchmarks

Facts are better than opinion.

The entire draft is just what you feel. You didn't mention OOP in R, from S3 to S4, R6 and proto.

And there are no benchmarks for parallel programming.

It must have taken you a lot of time, but it's just not satisfactory for any programmer to read unless you show hard facts and proof.

Thanks again for such an article.

Available libraries a tie?

Thank you very much for your overview!

However, you call the race about "available libraries" a tie even though you mention that quite basic statistical procedures are not available (or hard to find) in Python:

The following searches in PyPI turned up nothing: log-linear model; Poisson regression; instrumental variables; spatial data; familywise error rate; etc.

I would say that this is a huge minus for Python?! In my own experience, the package availability for time series modeling is even worse.

In general, a search for statistics at PyPI (https://pypi.org/search/?q=statistics) returns only 2,541 packages (and what do I care about packages like https://pypi.org/project/plone.event/? 😄).

(Sidenote: I am a fan of the tidyverse and would say that it is also beneficial for professional users but I recognize that this is open for debate)

Indents in Python and maintainability

Perhaps mention something about the relative advantages and disadvantages of Python's indenting. Simple syntax may be an advantage for educational purposes but for actual production use in the real world, the more structured syntax is far clearer in intent and much easier to maintain in the long term. I have found (and probably caused) numerous errors in production Python code because of stray indents that cause unexpected behaviour. The lack of start/end tokens around code blocks, instead relying on indents for parsing syntax is an abomination in my view :-)
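A hedged illustration of the kind of stray-indent error described above, with made-up function names: the two functions differ only in the indentation of the `return` line, and both are valid Python.

```python
def total_intended(values):
    """Sum the positive values."""
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


def total_stray_indent(values):
    """Same code, but the return line was indented one level too far."""
    total = 0
    for v in values:
        if v > 0:
            total += v
        return total  # now inside the loop: exits after the first item


data = [1, 2, 3]
good = total_intended(data)
bad = total_stray_indent(data)
empty = total_stray_indent([])  # loop body never runs, so nothing is returned
print(good, bad, empty)
```

No error is raised in either case. The second function simply returns after the first element, and returns `None` for an empty list.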

graphs

Great article! Short and very concise.
I think that one of the strong points of R is its great capacity to generate amazing graphs. Is it the same for Python?

Nearest neighbor in python

For example, I once needed code to do fast calculation of nearest-neighbors of a given data point

The first Google hit for "nearest neighbor" is scikit-learn's k-nearest-neighbor classifier, which links to the BallTree and KDTree classes it uses for neighbor finding:

A little more googling about those algorithms gives you:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html

Also, researching nearest neighbors in general will tell you that a KD-tree is one of the most efficient algorithms for this, and googling
kdtree python will yield the SciPy implementation as the first hit.
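To show the idea behind the KD-tree approach mentioned above, here is a minimal pure-Python sketch. It is not SciPy's or scikit-learn's implementation, and the names `KDNode`, `build_kdtree` and `nearest` are chosen for this example only.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode:
    def __init__(self, point: Point, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]) -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Split on the median point, cycling through the coordinate axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def nearest(node: Optional[KDNode], target: Point, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    dist = math.dist(node.point, target)
    if best is None or dist < best[0]:
        best = (dist, node.point)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
distance, closest = nearest(tree, (9.0, 2.0))
print(distance, closest)
```

In practice one would use the library class from the link above. As far as I know, `scipy.spatial.cKDTree(points).query(target)` returns the distance and the index of the nearest point.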

Nice writeup, but you should learn the Tidyverse

Yes, there could be 60+ functions, but rectangling data is complex. I am also sure it is long-tailed, i.e., 5-10 will cover 80-90% of your cases. Also, moving away from base R will help you address your very first point: syntax aesthetics.

R is a data-first language while Python is general-purpose

This is a very nice Q&A, quite relevant to this repo:
https://stackoverflow.com/questions/57490815/simple-data-operations-r-vs-python

Your R example does look more succinct, but Python is much more general-purpose, so one-liners like that don't necessarily fit within its design goals. You're right that it takes more characters to represent certain operations, but that is because pandas was designed for Python, which is not a "data-first" type of language.

Cython IS a C/C++ interface for Python

The answer that references Cython suggests a lack of understanding of what Cython is.

Tidyverse vs 'base' R (Language Unity)

I have been learning R using a mix of base R and the tidyverse (mostly dplyr). I got concerned about Language Unity, because I was only able to start using R thanks to the tidyverse (just my experience, of course); before using dplyr it was just too hard.
Have you ever written about this issue in detail, so I can read more?

Discussion on Hacker News

This is being discussed on Hacker News as well.

Include meta-programming capabilities?

I think this is a great comparison that maintains a level of objectivity we don't usually see with these types of comparisons.

Would it be worth including meta-programming facilities? I think R is far and away the winner and I think it's one of the core reasons R is able to facilitate the construction of DSLs (like a large portion of the tidyverse) in ways that are currently not possible in Python.

TechRepublic

A "journalist" on TechRepublic has basically copy-pasted your article and added a liberal sprinkling of "wrote Matloff" to make this mess:

https://www.techrepublic.com/article/r-vs-python-which-is-a-better-programming-language-for-data-science/

enjoy....