It's kind of mindblowing to see how much code floating point formatting needs. I learned that when I was doing some work with Zig recently.
Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime). But this time the binary was much larger than anything I'd seen Zig generate before, and that didn't make sense given how little the tool was actually doing. Dropping it into Binary Ninja revealed that the majority of the code was there to support floating point formatting. So I changed the code to cast the floating point number to an integer before printing it, and that change brought the binary back down to the size I had been expecting.
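For illustration, here's the same trick in C terms rather than the original Zig (a minimal sketch; the size win assumes a toolchain that only links in the formatting code you actually use, e.g. a statically linked or embedded libc):

```c
#include <stdio.h>

int main(void) {
    double elapsed_ms = 1234.567;   /* hypothetical value to report */

    /* printf("%.0f ms\n", elapsed_ms);  <- would drag in the float-to-decimal code */

    /* Casting first means only the (much smaller) integer formatter is needed. */
    printf("%lld ms\n", (long long)elapsed_ms);
    return 0;
}
```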
> Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime)
MSVC defaults to linking against the UCRT, just as Clang and GCC on Linux default to linking against the system libc. This is to provide a reasonably useful C environment as a sane default.
If you don't want UCRT under MSVC, supply `/MT /NODEFAULTLIB /ENTRY:<function-name>` in the command-line invocation (or in the Visual Studio MSBuild options).
It is perfectly possible to build a Win32-only binary that is fully self-contained and only around 1 KiB.
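For anyone curious what that looks like, here is a rough sketch of a CRT-free Win32-only program (the exact flags vary by toolchain version, so treat the build line as approximate; with no CRT startup code you supply your own entry point and stick to Win32 calls):

```c
/* tiny.c — one possible build line:
 *   cl /O1 /GS- tiny.c /link /NODEFAULTLIB /ENTRY:entry /SUBSYSTEM:CONSOLE kernel32.lib
 * /GS- is needed because the stack-cookie support normally lives in the CRT.
 */
#include <windows.h>

/* Custom entry point: no CRT initialization runs, so avoid CRT functions. */
void entry(void) {
    DWORD written;
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), "hello\n", 6, &written, NULL);
    ExitProcess(0);
}
```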
Yep, I've done that before; it's how I know that linking in the C runtime is what bloats the binary. For most real projects I wouldn't disable linking it, but it's fun to see the output get so small.
You may or may not be able to pay the protection money to get around the warnings, but it is not at all orthogonal to how the binary is built: the scareware industry (Microsoft as well as third parties) absolutely despises executables that deviate from the default MSVC output.
We have been doing some experiments on optimizing for size, and currently it can be reduced to ~3 KB on 8-bit AVR. That only covers the implementation/tables for single-precision binary32; double precision requires quite a bit more. Much of the bloat is due to how limited AVR is, though; on platforms like x64 it should be much smaller.
> It's kind of mindblowing to see how much code floating point formatting needs.
If you want it to be fast. The baseline implementation isn’t terrible[1,2] even if it is still ultimately an implementation of arbitrary-precision arithmetic.
The Dragonbox author reports[1] about 25 ns/conversion, while Cox reports 1e5 conversions/s, so that’s a factor of 400. We can probably knock off half an order of magnitude for CPU differences if we’re generous (midrange performance-oriented Kaby Lake laptop CPU from 2017 vs Cox’s unspecified laptop CPU ca. 2010), but that’s still a factor of 100. Still a performance chasm.
You can likely get some of the performance back by picking the low-hanging fruit, e.g. switching from dumb one-byte bigint limbs in [0,10) to somewhat less dumb 32-bit limbs in [0,1e9). But generally, yes, this looks like a teaching- and microcontroller-class algorithm more than anything I’d want to use on a modern machine.
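To make the limb point concrete: a limb in [0,1e9) carries nine decimal digits per 32-bit word, so the digit-generation loop does roughly a ninth of the work (each limb then prints as nine digits with `%09u`, the top limb without padding). A hedged sketch of the multiply-by-small primitive such a bignum needs (names and code are mine, not from the cited implementations):

```c
#include <stdint.h>
#include <stddef.h>

#define LIMB_BASE 1000000000u   /* 1e9: nine decimal digits per 32-bit limb */

/* Multiply a little-endian bignum (limbs[0] = least significant) by a small
 * factor in place. The caller must reserve up to two extra limbs of growth.
 * Returns the new limb count. */
static size_t bignum_mul_small(uint32_t *limbs, size_t n, uint32_t factor) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)limbs[i] * factor + carry;
        limbs[i] = (uint32_t)(t % LIMB_BASE);
        carry = t / LIMB_BASE;
    }
    while (carry) {              /* append new high limbs as needed */
        limbs[n++] = (uint32_t)(carry % LIMB_BASE);
        carry /= LIMB_BASE;
    }
    return n;
}
```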
I'm guessing the majority of use cases limit the number of decimal points that are printed. I wonder if it would be more efficient to multiply by the number of decimals, convert to int, itoa(), and insert the decimal point where it belongs...
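Whichever count is meant (say, digits after the decimal point), the idea would look roughly like this naive sketch; it overflows for large magnitudes and isn't correctly rounded in every case, which is part of why the real algorithms are more involved:

```c
#include <stdio.h>
#include <math.h>

/* Print 'value' with 'digits' digits after the decimal point by scaling,
 * rounding to an integer, and re-inserting the dot. Sketch only. */
static void print_fixed(double value, int digits) {
    double scale = pow(10.0, digits);
    long long scaled = llround(fabs(value) * scale);
    long long whole = scaled / (long long)scale;
    long long frac  = scaled % (long long)scale;
    printf("%s%lld.%0*lld\n", value < 0 ? "-" : "", whole, digits, frac);
}

int main(void) {
    print_fixed(3.14159, 3);   /* prints 3.142 */
    print_fixed(-2.5, 1);      /* prints -2.5 */
    return 0;
}
```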
Not sure what you mean by decimal points. Did you mean the number of decimal digits to be printed in total, or the number of digits after the decimal dot, or something else?
In any case, what Dragonbox and other modern floating-point formatting algorithms do is already roughly what you describe: they compute the integer consisting of digits to be printed, and then print those digits, except:
- Dragonbox and some other algorithms have totally different requirements than `printf`. The user does not request the precision; rather, the algorithm determines the number of digits to print. So `1.2` is printed as `1.2` and `1.199999999999` is printed as `1.199999999999` (see the sketch after this list). You can read about the exact requirements in the Dragonbox README.
- The core of modern floating-point formatting algorithms is how to compute the needed multiplication by a power of 10 without doing it with plain bignum arithmetic (which is incredibly slow). Note that a `float` (assuming it's IEEE-754 binary32) instance can be as large as 2^100 or as small as 2^-100. It's nontrivial to deal with such numbers without incorporating bignum arithmetic, and even if you give up on avoiding it, bignum arithmetic itself is quite nontrivial in terms of the code size it requires.
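To make the first point concrete, here is a rough way to see the "shortest digits that round-trip" requirement using only the standard library. This brute-force loop is emphatically not how Dragonbox computes the result; it just shows what the target output is:

```c
#include <stdio.h>
#include <stdlib.h>

/* Find the shortest %.*g representation of x that parses back to the same
 * double. Dragonbox and friends compute this answer directly, without a
 * parse-back loop. */
static void print_shortest(double x) {
    char buf[64];
    for (int prec = 1; prec <= 17; prec++) {      /* 17 digits always suffice */
        snprintf(buf, sizeof buf, "%.*g", prec, x);
        if (strtod(buf, NULL) == x) break;
    }
    puts(buf);
}

int main(void) {
    print_shortest(1.2);        /* prints 1.2 */
    print_shortest(0.1 + 0.2);  /* prints 0.30000000000000004 */
    return 0;
}
```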
The linked Dragonbox[1] project is also worth a read; it's pretty well optimized, even in the least-used branches.
[1] https://github.com/jk-jeon/dragonbox