I don't want to say "advantage", so much as preference. But a few things come to mind.
- Lots of high quality statistical libraries, for one thing.
- RStudio's RMarkdown is great; I prefer it to Jupyter Notebook.
- I personally found the syntax more intuitive and easier to pick up. I don't usually find myself confused about the structure of the objects I'm looking at. For whatever reason, the "syntax" of pandas doesn't square well (in my opinion) with Python generally. I'd certainly like to just use Python. But, shrug.
- The tidyverse package, especially the pipe operator %>%, which afaik doesn't have an equivalent in Python. E.g.
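Something like the following sketch (the data frame and column names here are invented for illustration):

    library(dplyr)

    kept <- trials %>%
      group_by(participant_id, session) %>%
      summarise(n_trials = n(), .groups = "drop") %>%
      mutate(completed = n_trials > 40) %>%   # flag sessions with > 40 of the 50 trials
      filter(n_trials > 40) %>%
      count(participant_id, name = "n_sessions") %>%
      filter(n_sessions >= 6)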
Here I'm filtering participants in an MTurk study down to those who completed more than 40 trials in each of at least six sessions. It's not that I couldn't do the same transformation in pandas, but doing it this way feels very intuitive to me.
- ggplot2 for plotting; it's a really powerful data visualization package (quick sketch below).
Truthfully, I often do my data/text parsing in Python and then switch over to R for analysis; e.g., Python's JSON parsing works really well.
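As a minimal illustration of ggplot2's layered style, using R's built-in mtcars data:

    library(ggplot2)

    # scatter plot with a per-group linear fit, built up in layers
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")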
There were 50 trials in each session, so I counted a session as completed if they did more than 40 trials in that session. They needed to have completed at least six sessions.
The mutate is unnecessary. I forget why I did that.
Tabular data manipulation packages are better; it's easier to make nontrivial charts; many R stats packages have no counterparts in Python; there's less bureaucracy; and it's more batteries-included.
R is a language by and for statisticians. Python is a programming language that can do some statistics.
For me, I use R's data.table a lot, and I see the main advantages as performance and the terse syntax. The terse syntax does come with a steep learning curve, though.
Indeed, data.table is just awesome for productivity. When you're manipulating data for exploration, you want the fewest keystrokes to bring an idea to life, and data.table gives you that.
The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special built-in variables); and data.table just takes a different approach from SQL and other dataframe libraries.
But once you get used to it, data.table makes a lot of sense: every operation can be broken down into filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table:
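# take the first two rows of each group (.SD is the Subset of Data for the current group)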
flights[, head(.SD, 2), by = month]
That data.table has significantly better performance than any other dataframe library in any language is a nice bonus!
Not only does the pandas equivalent (something like flights.groupby('month').head(2)) use all the same keywords, but it is organized in a way that is much clearer to newcomers and labels things to look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quoted and unquoted references to columns. You even admit the last was confusing to learn. This can all be crammed into your head, but it's not what I would call thoughtfully designed.
Anyway, I don't understand why terseness is even desirable. We're doing DS and ML; no project ever comes down to keystrokes, but the ability to search the docs and debug does matter.
It helps you quickly improve your understanding of the data by letting you answer simple but important questions faster. In this contrived example I would want to know (sketch after the list):
- How many events are there by type?
- When did they happen?
- Are there any breaks in the count, and why?
- Some statistics on these events, like average, min, max
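For instance, with a hypothetical events data.table holding type, timestamp, and value columns (all invented names for illustration), each of those is roughly a one-liner:

    library(data.table)

    events[, .N, by = type]                                                 # events by type
    events[, .(first = min(timestamp), last = max(timestamp)), by = type]   # when they happened
    events[, .(avg = mean(value), min = min(value), max = max(value)),
           by = type]                                                       # summary statistics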
On top of what has been said: if you want to do some more advanced statistical analyses (in the inference area, not the ML/predictive field), then chances are that these algorithms are published as R or Stata packages (usually R).
In Python, there is statsmodels. There you'll find a lot of GLM stuff, which is a somewhat older approach. Modern inferential statistics, when it isn't outright Bayesian, usually comes in the flavor of semi-parametric models that rely on asymptotics.
As R is used by professional researchers, it is simply closer to the cutting edge. Python has most of the "statistics course" schoolbook methods, but not much beyond that.
For example, it has become very common to have dynamic panel data, which require dynamic models.
Now if you want to fit a Blundell-Bond type model in Python, you have to... code it yourself using GMM, if an implementation exists at all.
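In R, by contrast, this is already packaged. A sketch using pgmm() from the plm package, following the pattern of that package's own documentation example for system GMM:

    library(plm)

    data("EmplUK", package = "plm")
    # Blundell-Bond system GMM: lagged dependent variable as a regressor,
    # deeper lags as instruments; transformation = "ld" combines the
    # level and difference equations
    bb <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) + lag(log(capital), 0:1) |
                 lag(log(emp), 2:99) + lag(log(wage), 2:99) + lag(log(capital), 2:99),
               data = EmplUK, effect = "twoways", model = "onestep",
               transformation = "ld")
    summary(bb)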
For statistics, that's pretty much like having a deep learning package that maybe has GRUs but no transformer modules at all.
So yeah, you can code it yourself. Or you use the other one.
> What's the advantage of Python if you already know R?
AFAIK, in statistical modelling Python is better only for neural networks, so if you do not need to do fancy things with images, text, etc., you do not need Python. R is still the king.
In terms of charting and dashboards, I would say that if you work at a high level, R and Python are both pleasant. R has ggplot2, but Python has Plotly Express. R has Shiny, but Python has Dash and Streamlit. You can do great work with both.
One difference I've noticed is that R libraries are usually authored and maintained by academics in the associated field; the same can't always be said about the equivalent Python libraries. This means that R library authors generally use their own libraries for publication and have an academic stake in their correctness.