Parsing R code: Freedom of expression is not always a good idea (2012) (shape-of-code.com)
50 points by pxeger1 on June 14, 2022 | hide | past | favorite | 36 comments


This seems very much like an outsider's critique. It's super interesting, but if the goal is actually improving R I'm almost sure a better approach would be to lend a hand with development, fix some bugs, gain respect and push for incremental - or even less-than-incremental - changes. Core R are definitely trying to encourage new developers to join in, so the opportunity is there.


> This seems very much like an outsider's critique

Spot on! This same sentiment was expressed very well in a paper evaluating the R language [1]:

"This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular."

It would be nice to see a post about R that analyses _why_ it's so hugely popular with data scientists. It's easy to write "R doesn't do what we computer scientists think languages should do, so it's no good". It's harder to analyse what R gets right (for its domain) that other languages get wrong. Personally, I think it's not just that R has the best data-handling libraries (ggplot2, plyr, data.table), it's its "unlikely linguistic cocktail" that is perfectly suited for data exploration.

I think that maybe we hear the views of software engineers who get handed a messy R script and are asked to make it run in production, or make it run on big datasets, and so they only ever see the downsides of R. R wasn't designed to make life easy for production! It's designed to make it easy to explore datasets, which often means one-off code, 99% of which you run and then delete because your hypothesis about the data was wrong.

[1] https://www.researchgate.net/publication/240040602_Evaluatin...


I wholly disagree with this sentiment. I have seen, time and again, awful language design choices. Granted, they might have been the best option at the time, or maybe no one could think of a better method.

But things change with time. And quite often, the response to suggestions ends up being "it's too late to change it", "it's good enough", or any number of comments to minimize the obvious negatives. Too often people get stuck in "local maxima", thinking their way is best, until years later when time has proven them wrong.


I didn't say the guy was wrong. I said he wasn't going to change anything by standing outside and complaining. I think your argument supports that idea. Maybe in an ideal world people would be very open to drive-by critique. In reality, not so.


From experience, I find the better option is to just move on.

You can only shout at a wall for so long, before you realize that you're wasting your time. So you make your concerns heard (which this guy did), and you move on. If people want to take the advice to heart, great, but it's not likely. Plenty of other languages to use, and plenty of other software to use.


APL grammar has its flaws, but one thing it gets really _right_ is operator precedence. In an expression like this:

    A op B op C op D
No matter what _op_ is, it parses right-to-left:

    A op (B op (C op D))
Much nicer than having to learn a subtly-different version of C's already convoluted operator precedence[0] with each new language that comes out.

[0]: https://en.cppreference.com/w/c/language/operator_precedence
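For contrast, here's a quick Python sketch of the inconsistency the APL rule avoids: Python (like C) mixes associativities, so subtraction groups left-to-right while exponentiation groups right-to-left, and you simply have to memorize which is which.

```python
# Subtraction is left-associative: parsed as (10 - 3) - 2
print(10 - 3 - 2)    # 5

# Exponentiation is right-associative: parsed as 2 ** (3 ** 2)
print(2 ** 3 ** 2)   # 512, not (2 ** 3) ** 2 == 64

# Under APL's uniform right-to-left rule, 10 - 3 - 2 would
# instead mean 10 - (3 - 2):
print(10 - (3 - 2))  # 9
```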


I genuinely really like it - strict left-to-right or strict right-to-left is so much more predictable, at least for arithmetic-style expressions (I might still prefer specific precedences for =/==/etc.).

However at this point something PEMDAS-like seems to be substantially easier to understand for most people since it's AFAICT the common rule taught in (high school) mathematics these days.

Trade-offs all the way down, as ever.


I've just started reading about APL, so maybe I'm wrong, but I think there is one caveat: you need to know which operators are monadic, and which are dyadic, because if op is monadic, then it may be:

  A (op (B (op (C (op D)))))


Awesome stuff.

As someone who occasionally writes parsers for real languages, and as someone who was really into R in university, I am happy that I stepped back on this one. ;-)

R's syntax belongs in the same category as Ruby and JavaScript:

Too much freedom of expression makes the meaning of a program highly dependent on its execution. It is hard to say concise things about a program without running it.

It is the murky side of (untyped) Lisp, if you ask me.


Freedom of expression helps you think though. Lisp is helping the thoughts, other languages obstruct them (that certainly includes Python).

For a scientific language like R this quality is important.

Perhaps the ideal data science language would be a Lisp/R with excellent embedding qualities like Lua for the scientific parts. People could then choose their favorite language for shoveling data around.


Why would say Python obstructs the thought?


Because with original thinking there is more than one way to do it. Pythonic conformity might make long-term maintenance easier, but its rigidity exacts a cost on expressing new thoughts. R is basically the epitome of Greenspun's tenth rule: it's really their implementation of a Common Lisp that looks like C, with metaprogramming and conditions and restarts and all. They tried standardizing on Common Lisp first (XLisp-Stat) but S from Bell Labs was too popular. In short, today R is a Lisp with access to modern numeric libraries.


Duck typing allows for freedom of expression/thinking.

You can declare a new property on an object simply by assigning, and consistency is not required either.

It also potentially leaves a mess over time.


Python doesn't, but pandas definitely does.


I'm constantly discovering oddities about the R language. Since I use it interactively, it's extremely rare that such oddities cause any problems. Here's an example I found yesterday (lines 1, 2, 3, and 4 make sense, 5 is interesting, and 6 is perplexing!):

  1 == TRUE        # TRUE
  as.logical(1)    # TRUE
  0 == FALSE       # TRUE
  as.logical(0)    # FALSE
  2 == TRUE        # FALSE
  as.logical(2)    # TRUE


I think it makes sense.

TRUE and FALSE are 1 and 0, while `as.logical` will transform your value to the "closest" of those two.

If you are used to Python's truthiness, that is what `as.logical` is similar to.
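Python shows the same split between coercion-to-bool and equality that the R example above does - a quick sketch:

```python
# bool() coerces by truthiness: any nonzero number is True...
print(bool(2))    # True
print(bool(0))    # False

# ...but == compares values, and True is numerically 1,
# so 2 == True is False even though bool(2) is True.
print(2 == True)  # False
print(1 == True)  # True
```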


Truthiness runs through most C-style languages, Python included: bool(2) is True.


Python goes even further beyond.

  >>> True + True
  2

  >>> True * 13 + (1 - False) * 17
  30


What you show there is that int(True) is one and int(False) is zero.

(The same happens in R, for what it's worth. as.integer(TRUE) is one and as.integer(FALSE) is zero, and the operations you wrote work just the same.)


That's history coming back to bite it though. There was a period where the symbols True and False existed in Python but the bool class did not. True was _literally_ 1, and False was 0. Because of this, for backward compat, bool is a subclass of int.
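The backward-compat subclassing described above is easy to verify from a Python prompt:

```python
# bool is a subclass of int, kept for backward compatibility
# with early Python, where True and False were plain integers.
print(issubclass(bool, int))     # True
print(isinstance(True, int))     # True

# Arithmetic therefore treats True as 1 and False as 0.
print(True + True)               # 2
print(sum([True, False, True]))  # 2 - a common idiom for counting matches
```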


TIL. Also, in Python bool("a") is True, whereas in R as.logical("a") is NA.
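The difference is that Python's bool() on strings tests emptiness and never parses the content, while R's as.logical tries to parse the string and yields NA when it can't - a quick sketch of the Python side:

```python
# Python's bool() on strings only checks whether the string is empty:
print(bool("a"))      # True  (non-empty)
print(bool("false"))  # True  (still non-empty - the content is ignored!)
print(bool(""))       # False (empty)
```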


It makes sense if you know C or any architecture's assembly.

But it is an issue, and high level languages shouldn't just replicate it. In a high-level language `1 == TRUE` should be either an error or false.


If you like these oddities, then check out The R Inferno [1]. It’s from 2011, but still holds up since the language hasn’t changed that much.

[1] https://www.burns-stat.com/pages/Tutor/R_inferno.pdf


My favourite odd R snippet is this, which prints the numbers 0 to 100 without using any digits in the code:

    F:volcano
(from https://codegolf.stackexchange.com/a/219617)


As someone who actually likes R I think this makes sense. Line 5 is checking for equality, whereas in R as.* functions actually convert types or structures. The documentation on as.logical is also pretty clear on what would happen.

https://www.rdocumentation.org/packages/raster/versions/3.5-...

> Change values of a Raster* object to logical or integer values. With as.logical, zero becomes FALSE, all other values become TRUE. With as.integer values are truncated.


Yeah. There's a difference between "This language isn't being internally consistent" and "Due to my experience using language X, I find this confusing". As you say, this is well documented behavior, albeit perhaps unexpected to a novice R user.


Seems perfectly fine to me? It's R, for interactive data analysis and statistics. I love it, but would not use it for anything else.


as.logical(2) being TRUE is perplexing only if you interpreted

2 == TRUE

as

as.logical(2)==TRUE

rather than as

2==as.integer(TRUE)

[Edit: "If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw."]


I'm sad that R didn't go closer to JavaScript style for lambdas instead of going toward Haskell, e.g., ‘\(x) x + 1’ vs ‘x => x + 1’.


I've always liked that lambda and functions are the same thing in R and write `function(x){x + 1}`

I dislike both of your examples and am glad to have never seen that in any R code I've met.


It's new as of R 4.1, along with the new pipe operator |>.

https://www.r-bloggers.com/2021/05/the-new-r-pipe/


I prefer the R formula syntax, i.e.

  ~ . + 1
  ~ .x + .y


I like this syntax better than the new lambda syntax, but I think it's good that a proper lambda syntax exists now. Not all higher-order functions accept formulas (they have to wrap their function arguments with rlang::as_function or equivalent), and there are probably some obscure cases where the distinction between a formula and a function matters.


What's sad about it?


The JS version is more readable for most people.

(I'm no big fan of JS in general, unless it is TS but I consider that a different language.)


Exactly and it could bring R to a larger audience.



