I really wish when counting code that people would use one of the more modern co...

trishume · on June 15, 2019

A few groups did give me output from a modern line counter, often Tokei or loc. I mention in the post that the ratios between source lines, word count lines and bytes were pretty similar across all projects. If people want to get a sense of how that compares to their intuitions in source lines they can use the ratio from our Rust project to convert.

I ended up using raw lines for comparison because, for relative measurement, it didn't matter for the above reason. I also knew everyone had the same `wc`, so I could get them to send me the output of the `wc {glob}` version that lists lines and bytes of all files, which let me drill down into which parts were different.
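For readers without `wc` to hand, the per-file raw counts described above can be approximated with a short sketch. The function name `raw_counts` and its parameters are made up for illustration; it assumes that a "line" means a newline byte, which is how `wc -l` counts.

```python
from pathlib import Path


def raw_counts(root: str, pattern: str) -> list[tuple[int, int, str]]:
    """Return (newline_count, byte_count, path) for each file matching pattern.

    Hypothetical helper: it mirrors the lines and bytes columns of `wc`,
    counting newline bytes rather than logical source lines.
    """
    rows = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            data = path.read_bytes()
            rows.append((data.count(b"\n"), len(data), str(path)))
    return rows


if __name__ == "__main__":
    # Print one row per Rust file under the current directory.
    for lines, size, name in raw_counts(".", "**/*.rs"):
        print(f"{lines:8d} {size:10d} {name}")
```

Because every file gets its own row, two projects can be compared directory by directory rather than only by total.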

boyter · on June 15, 2019

Ah! I assumed you had access to the code. That makes more sense.

masklinn · on June 16, 2019

> I really wish when counting code that people would use one of the more modern code counters, or at least use cloc. Tokei, loc, scc, polyglot, loccount or gocloc give a much better idea over wc because they account for comments and blank lines and in the case of scc and tokei strings.

And even that only tells some of the story e.g. do code counters count separators ({ or } alone on a line) as blank or as code? Are multiline strings (e.g. python docstrings) counted as code or comments?

boyter · on June 16, 2019

Usually they count { on a single line as code. Multiline strings are code in all but scc and Tokei. Tokei counts them as comments or strings depending on user settings. Scc will count them as comments in the next few months.
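As a rough illustration of the "lone brace counts as code" rule, here is a minimal sketch. It is not how any of the tools named here are implemented; it assumes a C-like language with only `//` line comments, and the names `classify_line` and `count_lines` are invented for the example.

```python
def classify_line(line: str) -> str:
    """Classify one source line as 'blank', 'comment', or 'code'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("//"):
        return "comment"
    # Anything else, including a lone "{" or "}", counts as code.
    return "code"


def count_lines(source: str) -> dict:
    """Tally blank, comment, and code lines in a source string."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        counts[classify_line(line)] += 1
    return counts


sample = "// entry point\nint main()\n{\n\n    return 0;\n}\n"
print(count_lines(sample))
```

Under this rule the two brace-only lines in the sample land in the code bucket, so a brace style that puts `{` on its own line inflates the code count relative to one that does not.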

OnlyOneCannolo · on June 16, 2019

For anyone interested in this sort of thing, there's also Unified Code Count (UCC) [1]. It has a lot of interesting design goals like being open and explicit about the counting rules, which is really useful if you want to predict things like cost and reliability.

[1] https://csse.usc.edu/ucc_new/wordpress/

boyter · on June 16, 2019

I actually put UCC in the not so great category. Counters like scc and Tokei are getting close to having the same accuracy as a compiler when it comes to code while being much faster.

They also support more languages and are updated way more often. Very much second generation tools that learnt from the first.

OnlyOneCannolo · on June 16, 2019

Different tools for different use cases.

Your use cases seem to prioritize language support, update frequency, and speed (what do you mean by the accuracy part?). For this, Scc and Tokei would of course be better than UCC.

The (admittedly niche) use cases I described require understanding the counting rules very well, and keeping those rules stable. For this, scc and Tokei are as useless as anything else, while UCC does exactly what's needed.

boyter · on June 16, 2019

Have a look at the tests in scc and tokei. They support nested multiline comments, string escapes and other odd language features. As such, both get very close to counting lines the way a full tokeniser used by the compiler or interpreter does, making them very accurate.
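To make the nested-comment and string-escape cases concrete, here is a small state-machine sketch. It is not taken from scc or tokei; it assumes Rust-like syntax (nestable `/* */`, `//` line comments, double-quoted strings with backslash escapes) and ignores character literals and raw strings. The name `count_code_lines` is invented for the example.

```python
def count_code_lines(source: str) -> int:
    """Count lines that contain at least one character of code.

    Simplified, assumed rules: block comments nest, // starts a line
    comment, and text inside a double-quoted string is always code.
    """
    depth = 0          # current block-comment nesting level
    in_string = False  # True while inside a string literal
    code_lines = 0
    for line in source.splitlines():
        has_code = in_string  # a string carried over from the previous line
        i = 0
        while i < len(line):
            two = line[i:i + 2]
            if in_string:
                has_code = True
                if line[i] == "\\":
                    i += 2  # skip the escaped character, e.g. \"
                    continue
                if line[i] == '"':
                    in_string = False
            elif depth > 0:
                if two == "/*":
                    depth += 1
                    i += 2
                    continue
                if two == "*/":
                    depth -= 1
                    i += 2
                    continue
            elif two == "//":
                break  # rest of the line is a comment
            elif two == "/*":
                depth = 1
                i += 2
                continue
            elif line[i] == '"':
                in_string = True
                has_code = True
            elif not line[i].isspace():
                has_code = True
            i += 1
        if has_code:
            code_lines += 1
    return code_lines


sample = "\n".join([
    '/* outer /* inner */ still comment */',
    'let s = "not a // comment \\" still string";',
    '// plain comment',
    'let x = 1; /* trailing */',
])
print(count_code_lines(sample))
```

A prefix-only check would misread the first two sample lines: the first needs a nesting depth to know the comment has not ended, and the second needs string state to know that `//` and the escaped quote are not syntax.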

I see your point. I’d argue however the rules for counting should be language rules not some higher level generic set.

OnlyOneCannolo · on June 16, 2019

Ah, now I see what you mean by 'accuracy'.

Each language does have its own counting rules. There's a separate PDF for each one describing the rules for that language, which is the cool part.