
The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special built-in variables); and data.table is just a different approach from SQL and other dataframe libraries.

Here's an example from the docs

  flights[carrier == "AA",
    lapply(.SD, mean),
    by = .(origin, dest, month),
    .SDcols = c("arr_delay", "dep_delay")]
That's less clear than the equivalent SQL

  SELECT
    origin, dest, month,
    MEAN(arr_delay), MEAN(dep_delay)
  FROM flights
  WHERE carrier == "AA"
  GROUP BY arr_delay, dep_delay
or pandas

  flights[flights.carrier == 'AA'].groupby(['arr_delay', 'dep_delay']).mean()

But once you get used to it, data.table makes a lot of sense: every operation breaks down into filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL (sketched below) or pandas, but is super simple in data.table:

  flights[, head(.SD, 2), by = month]
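For comparison, here's roughly what "first two rows per group" takes in SQL; a sketch run through Python's sqlite3 on a toy frame (column names assumed from the flights example above, and window functions need SQLite 3.25+):

    import sqlite3

    import pandas as pd

    # Toy stand-in for the flights table; columns assumed from the example above.
    flights = pd.DataFrame({
        "month": [1, 1, 1, 2, 2],
        "dep_delay": [5, 7, 2, 11, 3],
    })

    con = sqlite3.connect(":memory:")
    flights.to_sql("flights", con, index=False)

    # "First two rows per group" needs a window function plus a subquery.
    # Without an ORDER BY inside OVER(), "first" means an unspecified row order.
    print(pd.read_sql_query("""
        SELECT month, dep_delay
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY month) AS rn
            FROM flights
        )
        WHERE rn <= 2
    """, con))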
That data.table also benchmarks among the fastest dataframe libraries in any language is a nice bonus!


Taking the first two rows is a mess in pandas?

    flights.groupby("month").head(2)

Not only does this have all the same keywords, but it is organized in a way that is much clearer to newcomers and labels things you can look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quoted and unquoted column references. You even admit the last of these was confusing to learn. This can all be crammed into your head, but it's not what I would call thoughtfully designed.


I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |
and I want to find the count and the first and last date of events of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]
Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. What would your pandas approach look like?


To do that, the code would look something like:

    mask = events["Date"].dt.year == 2020
    events[mask].groupby("EventType").agg(
        first_date=("Date", "min"),
        last_date=("Date", "max"),
        count=("Date", "size"))
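Since the events are already ordered, pandas' "first"/"last" named aggregations arguably map more directly (a sketch against the same assumed frame, with Date parsed as datetime):

    # Grouped first/last on ordered data, mirroring data.table's first()/last()
    (events[events["Date"].dt.year == 2020]
        .groupby("EventType")
        .agg(first_date=("Date", "first"),
             last_date=("Date", "last"),
             count=("Date", "size")))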

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML; no project ever comes down to keystrokes, but the ability to search the docs and debug does matter.


It helps you improve your understanding of the data quickly, by letting you answer simple but important questions faster. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any gaps in the counts, and why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me do this fast.
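For instance, a quick pass in pandas might look like this (column names follow the table upthread; the frame itself is made up):

    import pandas as pd

    # Hypothetical ordered events table, with a gap in March.
    events = pd.DataFrame({
        "Date": pd.to_datetime(["2020-01-03", "2020-01-20", "2020-02-11", "2020-04-02"]),
        "EventType": ["a", "b", "a", "a"],
    })

    # How many events by type
    print(events.groupby("EventType").size())
    # When did they happen (first/last date per type)
    print(events.groupby("EventType")["Date"].agg(["min", "max"]))
    # Monthly counts, to spot gaps in the series
    print(events["Date"].dt.to_period("M").value_counts().sort_index())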


You mean something like

    SELECT
      origin, dest, month,
      AVG(arr_delay), AVG(dep_delay)
    FROM flights
    WHERE carrier = 'AA'
    GROUP BY origin, dest, month
and

    flights[flights.carrier == 'AA'].groupby(['origin', 'dest', 'month'])[['arr_delay', 'dep_delay']].mean()
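Or, with named aggregations, if you want output columns that mirror the SQL (a sketch against the same assumed frame):

    # Named aggregation gives the same grouped means with explicit column names
    (flights[flights.carrier == "AA"]
        .groupby(["origin", "dest", "month"])
        .agg(arr_delay=("arr_delay", "mean"),
             dep_delay=("dep_delay", "mean")))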


Yep, thanks. You can tell I use a "guess and check" approach to writing SQL and pandas...



