The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special builtin variables); and data.table simply takes a different approach from SQL and other dataframe libraries.
But once you get used to it, data.table makes a lot of sense: every operation can be broken down into filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table:
flights[, head(.SD, 2), by = month]
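For anyone who wants to run that one-liner, here is a minimal sketch. The tiny flights table and its column names are made up for illustration (it is not the real flights dataset), and the comments spell out my reading of the three slots in the brackets.

```r
library(data.table)

# Hypothetical stand-in for the flights table; columns and values are assumptions.
flights <- data.table(
  month     = c(1L, 1L, 1L, 2L, 2L, 3L),
  carrier   = c("AA", "UA", "DL", "AA", "DL", "UA"),
  dep_delay = c(5, -2, 30, 0, 12, 7)
)

# General form: DT[i, j, by]
#   i  = which rows to keep (left empty here, hence the leading comma)
#   j  = what to compute; .SD is the subset of the data for the current group
#   by = the grouping columns
first_two <- flights[, head(.SD, 2), by = month]
print(first_two)
```

Month 3 has only one row in this toy table, so head() just returns that single row for it.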
That data.table has significantly better performance than any other dataframe library in any language is a nice bonus!
Not only does this have all the same keywords, but it is organized in a much clearer way for newcomers and labels things to look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quoted and unquoted references to columns. You even admit the last was confusing to learn. This can all be crammed into your head, but it is not what I would call thoughtfully designed.
Anyway, I don't understand why terseness is even desirable. We're doing DS and ML; no project ever comes down to keystrokes, but the ability to search the docs and debug does matter.
It helps you understand the data quickly, because you can answer simple but important questions faster. In this contrived example I would want to know:
- How many events by type
- When did they happen
- Are there any breaks in the count, why?
- Some statistics on these events like average, min, max
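A rough sketch of how those four questions might look in data.table. The events table, its columns (type, day, value), and all the values are hypothetical, invented only to have something to run against.

```r
library(data.table)

# Hypothetical event log; names and values are made up for illustration.
events <- data.table(
  type  = c("login", "login", "error", "login", "error"),
  day   = as.Date(c("2024-01-01", "2024-01-02", "2024-01-02",
                    "2024-01-05", "2024-01-05")),
  value = c(1.5, 2.5, 10, 3.5, 20)
)

# How many events by type (.N is the number of rows in each group)
counts <- events[, .N, by = type]

# When did they happen: first and last day per type
span <- events[, .(first = min(day), last = max(day)), by = type]

# Breaks in the count: days inside the overall range with no events at all
all_days <- seq(min(events$day), max(events$day), by = "day")
missing_days <- all_days[!all_days %in% events$day]

# Some statistics per type
stats <- events[, .(avg = mean(value), lo = min(value), hi = max(value)), by = type]
```

Each question is one short expression, which is the point about quick iteration; whether that is worth the cryptic symbols is the disagreement here.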
Here's an example from the docs
that's clearly less clear than SQL or pandas.