I *want* to love Julia. The core language is excellently designed and better tha...

johnmyleswhite · on Dec 6, 2014

FWIW, I think the DataFrames package and its dependencies have consistently operated at the boundaries of what we know how to do efficiently in Julia. The package has had lackluster performance in many contexts primarily because it adopted many idioms from R and Python that were sharply at odds with Julia's type inference system. We're starting to clear those problems up, but there are still lots of unsolved challenges we need to resolve.

If you have any ideas about how we should modify the basic data types and functions defined in DataFrames, those ideas would go a long way toward making Julia a better language.

xaa · on Dec 19, 2014

Sorry, really late reply.

I fully appreciate that the type system imposes constraints that don't exist in Python or R. For my purposes in particular, and I think for many other people's, I don't actually need a full-fledged data frame with heterogeneous types. What I actually want is a numeric matrix with labels on both axes and good methods for querying, group-by operations, etc., plus an equivalent numeric Series type. Big bonus for memory mapping and/or fast I/O.
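To make the request concrete, here is a minimal sketch of that kind of labeled numeric matrix, written in plain Python only for illustration. The `LabeledMatrix` class and its `get` and `group_rows_mean` methods are hypothetical names invented here. They are not part of DataFrames.jl, pandas, or any other library, and the sketch ignores performance, memory mapping, and I/O entirely.

```python
from collections import defaultdict


class LabeledMatrix:
    """Hypothetical sketch: a numeric matrix with labels on both axes."""

    def __init__(self, rows, cols, data):
        if len(data) != len(rows) or any(len(r) != len(cols) for r in data):
            raise ValueError("data shape must match the row and column labels")
        self.rows = list(rows)
        self.cols = list(cols)
        # Every cell is a float, so the whole matrix has one element type.
        self.data = [[float(x) for x in r] for r in data]
        # Labels can be any hashable value: strings, ints, and so on.
        self._row_pos = {label: i for i, label in enumerate(self.rows)}
        self._col_pos = {label: j for j, label in enumerate(self.cols)}

    def get(self, row, col):
        """Look up a single cell by its row and column labels."""
        return self.data[self._row_pos[row]][self._col_pos[col]]

    def group_rows_mean(self, key):
        """Group rows by key(row_label) and average each column per group."""
        groups = defaultdict(list)
        for label, values in zip(self.rows, self.data):
            groups[key(label)].append(values)
        return {
            g: [sum(col) / len(col) for col in zip(*members)]
            for g, members in groups.items()
        }


m = LabeledMatrix(
    rows=["ctrl_1", "ctrl_2", "treated_1"],
    cols=[1000, 2000],  # numeric column labels, kept as ints
    data=[[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]],
)
means = m.group_rows_mean(lambda label: label.split("_")[0])
print(m.get("ctrl_2", 2000))  # 4.0
print(means)  # {'ctrl': [2.0, 3.0], 'treated': [10.0, 20.0]}
```

Because every value shares one numeric type, the storage could in principle be a single contiguous array, which is what would make memory mapping and fast I/O plausible.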

I think this is an easier problem to solve, especially since factors and ordinals can be considered as a special type of numeric.
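As a rough illustration of treating factors as numerics, a categorical column can be stored as integer codes plus a list of levels. The `encode_factor` helper below is a hypothetical name used only for this sketch, in plain Python, and is not an API from any library.

```python
def encode_factor(values, levels=None):
    """Map categorical values to integer codes; return (codes, levels).

    If levels is given, its order defines the codes, which also covers
    ordinals. Otherwise the levels are the sorted unique values.
    """
    if levels is None:
        levels = sorted(set(values))
    code_of = {level: i for i, level in enumerate(levels)}
    codes = [code_of[v] for v in values]
    return codes, list(levels)


codes, levels = encode_factor(
    ["low", "high", "mid", "low"],
    levels=["low", "mid", "high"],  # explicit order makes this an ordinal
)
decoded = [levels[c] for c in codes]
print(codes)  # [0, 2, 1, 0]
print(decoded)  # ['low', 'high', 'mid', 'low']
```

The codes fit in an ordinary numeric column, and the levels list is the only extra metadata needed to recover the original labels.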

It has been too long since I've looked at the internal code structure of DataFrames.jl, but I think the biggest design flaws at the time were the requirement that index names be symbols (they should probably either be a flat String, or a choice between String and Int64), and having axes on columns only. I can only assume the symbol decision was made for performance, but you have surely worked with datasets from investigators who use all kinds of random conventions for index names that don't fit the constraints of a symbol. Not to mention the very common case of numeric index names. I find it very annoying to read such a file in R and get "X1000" or whatever as my index names.

I actually tried briefly to dive in and fix the I/O problems, but the code style was daunting: a few very large functions. If it hasn't been done already, I would suggest breaking them up a little.

Anyway, I didn't mean to be overly critical. I think you're doing a very important task, and I meant this as an honest assessment of why I, as a busy scientist, found Julia to be more trouble than it was worth.