
What's the advantage, if you already know Python? (genuine interest)


I don't want to say "advantage" so much as preference. But a few things come to mind.

- Lots of high quality statistical libraries, for one thing.

- RStudio's RMarkdown is great; I prefer it to Jupyter Notebook.

- I personally found the syntax more intuitive and easier to pick up. I don't usually find myself confused about the structure of the objects I'm looking at. For whatever reason, the "syntax" of pandas doesn't square well (in my opinion) with Python generally. I'd certainly like to just use Python. But, shrug.

- The tidyverse package, especially the pipe operator %>%, which afaik doesn't have an equivalent in Python. E.g.

    with_six_visits <- task_df %>%
      group_by(turker_id, visit) %>%
      summarise(n_trials = n_distinct(trial_num)) %>%
      mutate(completed_visit = n_trials>40) %>%
      filter(completed_visit) %>%
      summarise(n_visits = n_distinct(visit)) %>%
      mutate(six_visits = n_visits >= 6) %>%
      filter(six_visits) %>%
      ungroup()
Here I'm filtering participants in an mturk study by those who have completed more than 40 trials at least six times across multiple sessions. It's not that I couldn't do the same transformation in pandas, but it feels very intuitive to me doing it this way.

- ggplot2 for plotting; it's a really powerful data visualization package.

Truthfully, I often do my text/data parsing in Python and then switch over to R for analysis; e.g. Python's JSON parsing works really well.


I can see how this is more intuitive. In pandas I'd assign the output of groupby to a variable, and then add the new column in a separate statement.
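
Something like the below is roughly how I'd write your dplyr chain in pandas (a rough sketch; I'm assuming task_df is a pandas DataFrame with the same turker_id, visit, and trial_num columns as your R example):

  # per-participant, per-visit count of distinct trials
  per_visit = (task_df
    .groupby(["turker_id", "visit"])["trial_num"]
    .nunique()
    .reset_index(name="n_trials"))
  # keep only visits with more than 40 trials completed
  completed = per_visit[per_visit["n_trials"] > 40]
  # count how many such visits each participant has
  per_turker = (completed
    .groupby("turker_id")["visit"]
    .nunique()
    .reset_index(name="n_visits"))
  # keep participants with at least six completed visits
  with_six_visits = per_turker[per_turker["n_visits"] >= 6]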

(The below is off topic, but I don't use R so I'd love to know whether I'm reading the code correctly)

"Here I'm filtering participants in an mturk study by those who have completed more than 40 trials at least six times across multiple sessions."

A user with this pattern of trials seems like they would fit the above definition:

Session 1: 82 trials
Session 2: 82 trials
Session 3: 82 trials

But the code seems to want 6 distinct sessions with >40 trials each. Have I misunderstood?

Also, is 'mutate' necessary before 'filter' or is that just to make the intent of the code clearer to your future self?


My initial wording was sloppy.

There were 50 trials in each session; so I counted a session completed if they did more than 40 in that session. They needed to have completed at least six sessions.

The mutate is unnecessary. I forget why I did that.


What it would take to recreate dplyr in Python:

https://mchow.com/posts/2020-02-11-dplyr-in-python/


Didn’t R introduce the native pipe operator?

%>% is now simply |>


They did. I just haven't gotten around to using it yet!


Tabular data manipulation packages are better, it's easier to make nontrivial charts, many R stats packages have no counterparts in Python, there's less bureaucracy, and it's more batteries-included.

R is a language by and for statisticians. Python is a programming language that can do some statistics.


For me, I use R's data.table a lot, and I see the main advantages as performance and the terse syntax. The terse syntax does come with a steep learning curve though.


Indeed, data.table is just awesome for productivity. When you're manipulating data for exploration you want the fewest keystrokes to bring an idea to life, and data.table gives you that.


I totally agree. I often find myself wanting data.table as a standalone database platform or ORM-type interface for non-statistical programming too.


What is the terse syntax like? I can parse Lisp and C; how would this be different or challenging?


The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special builtin variables); and data.table is just a different approach from SQL and other dataframe libraries.

Here's an example from the docs

  flights[carrier == "AA",
    lapply(.SD, mean),
    by = .(origin, dest, month),
    .SDcols = c("arr_delay", "dep_delay")]
that's admittedly less clear than SQL

  SELECT
    origin, dest, month,
    MEAN(arr_delay), MEAN(dep_delay)
  FROM flights
  WHERE carrier == "AA"
  GROUP BY arr_delay, dep_delay
or pandas

  flights[flights.carrier == 'AA'].groupby(['arr_delay', 'dep_delay']).mean()

But once you get used to it, data.table makes a lot of sense: every operation can be broken down into filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table

  flights[, head(.SD, 2), by = month]
That data.table has significantly better performance than any other dataframe library in any language is a nice bonus!


Taking the first two rows is a mess in pandas?

flights.groupby("month").head(2)

Not only does this have all the same keywords, but it is organized in a much clearer way for newcomers and labels things you can look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quoted and unquoted references to columns. You even admit the last was confusing to learn. This can all be crammed into your head, but it's not what I would call thoughtfully designed.


I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |
and I want to find the count, and the first and last date of an event of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]
Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. What would your Pandas approach look like?


To do that, the code would look something like:

mask = events["Date"].year == 2020 events[mask].groupby("EventType").agg(first_date=("Date", min), last_date=("Date", max), count=("Date", len))

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML; no project ever comes down to keystrokes, but the ability to search the docs and debug does matter.


It helps you quickly improve your understanding of the data by letting you answer simple but important questions faster. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any breaks in the counts, and if so, why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me do this fast.


You mean something like

    SELECT
    origin, dest, month, AVG(arr_delay), AVG(dep_delay)
    FROM flights
    WHERE carrier = 'AA'
    GROUP BY origin, dest, month
and

    flights[flights.carrier == 'AA'].groupby(['origin', 'dest', 'month'])[['arr_delay', 'dep_delay']].mean()


Yep thanks, you can tell I use a "guess and check" approach to writing SQL and pandas...


On top of what has been said, if you want to do more advanced statistical analyses (in the inference area, not the ML/predictive field), then chances are that those algorithms are published as either R or Stata packages (usually R).

In Python, there is statsmodels. Here, you'll find a lot of GLM stuff, which is sort of an older approach. Modern inferential statistics, if not just Bayesian, is usually in the flavor of semi-parametric models that rely on asymptotics.
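
Just to illustrate the kind of thing statsmodels does cover, a GLM fit is only a few lines (a toy sketch with made-up data, purely to show the API):

  import pandas as pd
  import statsmodels.api as sm
  import statsmodels.formula.api as smf

  # made-up count data, just for illustration
  df = pd.DataFrame({"y":  [1, 0, 2, 3, 1, 0],
                     "x1": [0.1, 0.5, 0.2, 0.9, 0.4, 0.3],
                     "x2": [1, 2, 1, 3, 2, 1]})
  # Poisson GLM via the formula interface
  model = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Poisson()).fit()
  print(model.summary())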

As R is used by professional researchers, it is simply closer to the cutting edge. Python has most of the "statistics course" textbook methods, but not much beyond that.

For example, it has become very common to have dynamic panel data, which requires dynamic models. Now if you want to do a Blundell-Bond type model in Python you have to... code it yourself using GMM, if an implementation even exists.

For statistics, that's pretty much like saying you have a Deep Learning package that maybe has GRU but no transformer modules at all. So yeah, you can code it yourself. Or you use the other one.


I have the opposite question. I have been programming in R since I was 19. I know no other programming languages.

Hence my question:

What's the advantage of Python if you already know R?

I've heard they have similarities. Is there anything Python does better than R in terms of statistical analysis, charting, etc.?


> What's the advantage of Python if you already know R?

AFAIK in statistical modelling Python is better only in neural networks, so if you do not need to do fancy things with images, text, etc. you do not need Python. R is still the king.

In terms of charting and dashboards, I would say that if you work high level R and Python are both pleasant. R has ggplot, but Python has Plotly Express. R has Shiny, but Python has Dash and Streamlit. You can do great with both.
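
E.g. a quick interactive scatter with Plotly Express is just a few lines (a minimal sketch using its bundled sample data):

  import plotly.express as px

  # built-in iris sample data, just for illustration
  df = px.data.iris()
  fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
  fig.show()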


One difference I've noticed is that R libraries are usually authored and maintained by academics in the associated field; the same can't always be said about equivalent Python libraries. This means that R library authors generally use their own libraries for publication and have an academic stake in their correctness.


R is used by many researchers and consequently has many more statistical libraries (e.g. try doing dynamic panel modelling in Python).



