I really wish that when counting code people would use one of the more modern code counters, or at least use cloc. Tokei, loc, scc, polyglot, loccount or gocloc give a much better idea than wc because they account for comments and blank lines and, in the case of scc and Tokei, strings.
I had a similar experience in university. Class had to implement a modified Turing machine. We could use whatever language we wanted. One person did it in C++ and it was several hundred lines. Another did it in Java, which was slightly smaller. I implemented mine in Python and it was small enough to print on a single piece of paper. I think it was something like 30 lines or so. I exploited break statements to make it terse but readable. It did mean I was marked down a bit for them but it was by far the shortest program in the class.
A few groups did give me output from a modern line counter, often Tokei or loc. I mention in the post that the ratios between source lines, word count lines and bytes were pretty similar across all projects. If people want to get a sense of how that compares to their intuitions in source lines they can use the ratio from our Rust project to convert.
I ended up using raw lines for comparison because for relative measurement it didn't matter for the above reason, and I knew everyone had the same `wc` and I could get them to send me the output of the `wc {glob}` version that lists lines and bytes of all files so that I could drill down into which parts were different.
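For anyone who wants to do that conversion, here is a rough sketch of the arithmetic. The function name and the numbers are made up for illustration and are not taken from the post; plug in the source and raw line counts of whatever reference project you trust.

```python
def estimate_source_lines(raw_lines: int, ref_source: int, ref_raw: int) -> int:
    """Scale a raw `wc -l` count by a reference project's source/raw ratio."""
    if ref_raw <= 0:
        raise ValueError("reference raw line count must be positive")
    return round(raw_lines * ref_source / ref_raw)


# Hypothetical reference project: 7,000 source lines out of 9,000 raw lines.
print(estimate_source_lines(10_000, ref_source=7_000, ref_raw=9_000))
```

This only holds to the extent that the ratio really is similar across the projects being compared.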
> I really wish that when counting code people would use one of the more modern code counters, or at least use cloc. Tokei, loc, scc, polyglot, loccount or gocloc give a much better idea than wc because they account for comments and blank lines and, in the case of scc and Tokei, strings.
And even that only tells some of the story e.g. do code counters count separators ({ or } alone on a line) as blank or as code? Are multiline strings (e.g. python docstrings) counted as code or comments?
Usually they count { on a single line as code. Multi-line strings are code in all but scc and Tokei. Tokei counts them as comments or strings depending on user settings. Scc will count them as comments in the next few months.
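To make the "{ counts as code" rule concrete, here is a deliberately naive sketch of a line classifier for C-like source. It is not how scc or Tokei work internally: it ignores strings and nested comments entirely, and it treats any line inside a `/* ... */` block as a comment. The function name and sample are invented for illustration.

```python
def count_lines(source: str) -> dict:
    """Classify each line of C-like source as blank, comment, or code."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    in_block = False  # True while inside a /* ... */ comment
    for raw in source.splitlines():
        line = raw.strip()
        if in_block:
            # Naive: everything up to and including the closing line is comment.
            counts["comment"] += 1
            if "*/" in line:
                in_block = False
        elif not line:
            counts["blank"] += 1
        elif line.startswith("//"):
            counts["comment"] += 1
        elif line.startswith("/*"):
            counts["comment"] += 1
            in_block = "*/" not in line[2:]
        else:
            # Anything else is code, including a lone "{" or "}".
            counts["code"] += 1
    return counts


sample = """\
/* header
   comment */
int main()
{
    // say hi

    return 0;
}
"""

print(count_lines(sample))
```

On this sample `wc -l` would report 8 lines, while the sketch splits them into 4 code, 3 comment and 1 blank, with both braces landing in the code bucket.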
For anyone interested in this sort of thing, there's also Unified Code Count (UCC) [1]. It has a lot of interesting design goals like being open and explicit about the counting rules, which is really useful if you want to predict things like cost and reliability.
I actually lump UCC in the not-so-great category. Counters like scc and Tokei are getting close to having the same accuracy as a compiler when it comes to code, while being much faster.
They also support more languages and are updated far more often. Very much second-generation tools that learnt from the first.
Your use cases seem to prioritize language support, update frequency, and speed (what do you mean by the accuracy part?). For this, Scc and Tokei would of course be better than UCC.
The (admittedly niche) use cases I described require understanding the counting rules very well, and keeping those rules stable. For this, scc and Tokei are as useless as anything else, while UCC does exactly what's needed.
Have a look at the tests in scc and Tokei. They support nested multi-line comments, string escapes and other odd language features. As such, both get very close to counting lines the way a full tokeniser used by the compiler or interpreter does, making them very accurate.
I see your point. I’d argue however the rules for counting should be language rules not some higher level generic set.