The other day akkartik wrote an implementation of the program Knuth used to introduce literate programming to the CACM readers: https://basiclang.solarpunk.au/d/7-don-knuths-original-liter...
It just tells you the top N words by frequency in its input (default N=100), with words of the same frequency ordered alphabetically and all words converted to lowercase. Knuth's version was about 7 pages of Pascal, maybe 3 pages without comments. It took akkartik 50 lines of idiomatic, simple Lua. I tried doing it in Perl; it was 6 lines, or 13 without relying on any of the questionable Perl shorthands. Idiomatic and readable Perl would be somewhere in between.
#!/usr/bin/perl -w
use strict;
# The last command-line argument is N if more than one argument is given.
my $n = @ARGV > 1 ? pop @ARGV : 100;
my %freq;
while (my $line = <>) {
    for my $w ($line =~ /(\w+)/g) {
        $freq{lc $w}++;
    }
}
# Sort by descending count, breaking ties alphabetically, and print the top N.
for my $w (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    print "$w\t$freq{$w}\n";
    last unless --$n;
}
I think Python, Ruby, or JS would be about the same.
Then I tried writing a Common Lisp version. Opening a file, iterating over lines, hashing words and getting 0 as default, and sorting are all reasonably easy in CL, but splitting a line into words is a whole project on its own. And getting a command-line argument requires implementation-specific facilities that aren't standardized by CL! At least string-downcase exists. It was a lark, so I didn't finish.
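Just to give a flavor of it, here is a rough sketch of only the word-splitting step in portable CL (my own throwaway attempt, untested against the full problem, and SPLIT-WORDS is just a name I made up):

(defun split-words (line)
  ;; Collect the maximal alphabetic runs in LINE, downcased, in order.
  (let ((words '())
        (len (length line))
        (i 0))
    (loop
      (setf i (or (position-if #'alpha-char-p line :start i) len))
      (when (>= i len) (return (nreverse words)))
      (let ((end (or (position-if-not #'alpha-char-p line :start i) len)))
        (push (string-downcase (subseq line i end)) words)
        (setf i end)))))

That's already a dozen lines for what the Perl version does with a single /(\w+)/g match, and the command-line argument still needs something implementation-specific on top of it (sb-ext:*posix-argv* on SBCL, for example).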
(In Forth you'd almost have to write something equivalent to Knuth's Pascal, because it doesn't even come with hash tables and case conversion.)
My experience with Smalltalk is more limited but similar. You can do anything you want in it, it's super flexible, the tooling is great, but almost everything requires you to just write quite a bit more code than you would in Perl, Python, Ruby, JS, etc. And that means you have more bugs, so it takes you longer. And it doesn't really want to talk to the rest of the world—you can forget about calling a Squeak method from the Unix command line.
Smalltalk and CL have native code compilers available, which ought to be a performance advantage over things like Perl. Often enough, though, it's not. Part of the problem is that their compilers don't produce very fast code, but they certainly ought to beat a dumb bytecode interpreter, right? Well, maybe not if the program's hot loop is inside a regular expression match or a NumPy array operation.
And a decent native code compiler (GCC, HotSpot, LuaJIT, the Golang compilers, even ocamlopt) will beat any CL or Smalltalk compiler I have tried by a large margin. This is a shame because a lot of the extra hassle in Smalltalk and CL seems to be aimed at efficiency.
(Scheme might actually deliver the hoped-for efficiency in the form of Chez, but not Chicken. But Chicken can build executables and easily call C. Still, you'd need more code to solve this problem in Scheme than in Lua, much less Ruby.)
—·—
One of the key design principles of the WWW was the "principle of least power", which says that you should do each job with the least expressive language that you can. So the URL is a very stupid language, just some literal character strings glued together with delimiters. HTML is slightly less stupid, but you still can't program in it; you can only mark up documents. HTTP messages are similarly unexpressive. As much as possible of the Web is built out of these very limited languages, with only small parts being written in programming languages, where these limited DSLs can't do the job.
Lisp, Smalltalk, and Forth people tend to think this is a bad thing, because it makes some things—important things—unnecessarily hard to write. Alan Kay has frequently deplored the WWW being built this way. He would have made it out of mobile code, not dead text files with markup.
But the limited expressivity of these formats makes them easier to read and to edit.
I have two speech synthesis programs, eSpeak and Festival. Festival is written in Scheme, a wonderful, liberating, highly expressive language. eSpeak is in C++, which is a terrible language, so as much as possible of its functionality is in dumb data files that list pronunciations for particular letter sequences or entire words and whatnot. Festival does all of this configuration in Scheme files, and consequently I have no idea where to start. Fixing problems in eSpeak is easy, as long as they aren't in the C++ core; fixing problems in Festival is, so far, beyond my abilities.
(I'm not an expert in Scheme, but I don't think that's the problem—I mean, my Scheme is good enough that I wrote a compiler in it that implements enough of Scheme to compile itself.)
—·—
SQL is, or until recently was, non-Turing-complete, but expressive enough that 6 lines of SQL can often replace a page or three of straightforward procedural code—much like Perl in the example above, but more readable rather than less.
Similarly, HTML (or JSX) is often many times smaller than the code to produce the same layout with, say, GTK. And when it goes wrong, you can inspect the CSS rules applying to your DOM elements in a way that relies on them being sort of dumb, passive data. It makes them much more tractable in practice than Turing-complete layout systems like LaTeX and Qt3.
—·—
Perl and Forth both have some readability problems, but I think their main difficulty is that they are too error-prone. Forth, aside from being as typeless as conventional assembly, is one of the few languages where you can accidentally pass a parameter to the wrong call.
This sort of rhymes with what I was saying in 02001 in https://paulgraham.com/redund.html, that often we intentionally include redundancy in our expressions of programs to make them less error-prone, or to make the errors easily detectable.
The article in CACM that presents Knuth's solution [1] also includes some criticism of Knuth's approach, and provides an alternative that uses a shell pipeline.
With great respect to Doug McIlroy (in the CACM article), the shell pipeline has a serious problem that Knuth's Pascal program doesn't have. (I'm assuming Knuth's program is written in standard Pascal.) You could have compiled and run Knuth's program on an IBM PC XT running MS-DOS; indeed on any computer having a standard Pascal compiler. Not so the shell pipeline, where you must be running under an operating system with pipes and 4 additional programs: tr, sort, uniq, and sed.
McIlroy also discusses how a program "built for the ages" should have "a large factor of safety". McIlroy was worried about how Knuth's program would scale up to larger bodies of text. Also, Bentley and McIlroy's critique was published in 1986, which I think was well before there was any serious scrutiny of Unix tools and their susceptibility to buffer overruns, etc. In 1986, could people have determined the limits of tr, sort, uniq, sed, and pipes--both individually and collectively--when handling large bodies of text? With a lot of effort, yes, but if there was a problem, Knuth at least only had one program to look at. With the shell pipeline, one would have to examine the 4 programs plus the shell's implementation of pipes.
(I'm not defending Pascal; Knuth, Bentley, and McIlroy are always worth reading on any topic -- thanks for posting the link!)
Bringing this back to Forth, Bernd Paysan, who needs no introduction to the people in the Forth community, wrote "A Web-Server in Forth": https://bernd-paysan.de/httpd-en.html . It only took him a few hours, but in fairness to us mortals, it's an HTTP request processor that reads a single HTTP request from stdin, processes it, and writes its output to stdout. In other words, it's not really a full web server, because it depends on an operating system with an inetd daemon for all the networking. As with McIlroy's shell pipeline, there is a lot of heavy lifting done by operating system tools. (Paysan's article is highly recommended for people learning Forth, like me when I read it back in the 2000s.)
> splitting a line into words is a whole project on its own
Is it[1]? My version below accumulates alphabetical characters until it encounters a non-alphabetical one, then increments the count for the accumulated word and resets the accumulator.
It does look a lot like what I was thinking would be necessary. About 9 of the 19 lines are concerned with splitting the input into words. Also, I think you have omitted the secondary key sort (alphabetical ascending), although that's only about one more line of code, something like
#'(lambda (a b)
    (or (< (car a) (car b))
        (and (= (car a) (car b))
             (string> (cadr a) (cadr b)))))
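For context, here's my guess at how that comparator would slot in, assuming the counts get dumped out of the hash table into a list of (count word) lists first (FREQ and ENTRIES are my made-up names, not anything from your program):

(let ((entries '()))
  (maphash (lambda (word count) (push (list count word) entries)) freq)
  ;; Sorted ascending this way, the most frequent words end up at the
  ;; end of the list, so the top N would be read off from the back.
  (sort entries
        #'(lambda (a b)
            (or (< (car a) (car b))
                (and (= (car a) (car b))
                     (string> (cadr a) (cadr b)))))))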
Because the lines of code are longer, it's about 3× as much code as the verbose Perl version.
In SBCL on my phone it's consistently slower than Perl on my test file (the King James Bible), but only slightly: 2.11 seconds to Perl's 2.05–2.07. It's pretty surprising that they are so close.
Were I trying to optimise this, I would test to see if a hash table of alphabetical characters is better, or just checking (or (and (char>= c #\A) (char<= c #\Z)) (and (char>= c #\a) (char<= c #\z))). The accumulator would probably be better as an adjustable array with a fill pointer allocated once, filled with VECTOR-PUSH-EXTEND and reset each time. It might be better to use DO, initializing C and declaring its type.
Also worth giving it a shot with (optimize (speed 3) (safety 0)) just to see if it makes a difference.
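Concretely, the inner loop I have in mind is something like this sketch (untested; COUNT-WORDS and FREQ are made-up names, and FREQ is assumed to be an EQUAL hash table):

(declaim (optimize (speed 3) (safety 0)))

(defun count-words (stream freq)
  ;; One adjustable string with a fill pointer, reused as the word
  ;; accumulator instead of consing a fresh string per word.
  (let ((word (make-array 16 :element-type 'character
                             :adjustable t :fill-pointer 0)))
    (flet ((alpha-p (c)
             ;; The explicit range check mentioned above.
             (or (char<= #\A c #\Z) (char<= #\a c #\z)))
           (flush ()
             (when (plusp (fill-pointer word))
               (incf (gethash (string-downcase word) freq 0))
               (setf (fill-pointer word) 0))))
      (loop for c = (read-char stream nil)
            while c
            do (if (alpha-p c)
                   (vector-push-extend c word)
                   (flush))
            finally (flush)))))

Whether any of that actually beats the Perl on the King James Bible is exactly the kind of thing I'd want to measure rather than guess.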
Yes, definitely more verbose. Perl is good at this sort of task!
> You can do anything you want in it, it's super flexible, the tooling is great, but almost everything requires you to just write quite a bit more code than you would in Perl, Python, Ruby, JS, etc.
Given that Smalltalk predates JS by many years: if that's true, it was not always true.
Given that Smalltalk was early to the GUI WIMP party: if that's true, it was not always true for GUI WIMP use.