Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |
and I want to find the count, and the first and last date of an event of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]
Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. How would your Pandas approach look like?



To do that, the code would look something like:

mask = events["Date"].year == 2020 events[mask].groupby("EventType").agg(first_date=("Date", min), last_date=("Date", max), count=("Date", len))

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML, no project never comes down to keystrokes but ability to search the docs and debug does matter.


It helps in quickly improving your understanding of the data by being able to answer simple but important questions quicker. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any breaks in the count, why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me in doing this fast.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: