It's kind of mindblowing to see how much code floating point formatting needs. I learned that when I was doing some work with Zig recently.
Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime). But this time the binary was much larger than anything I'd seen Zig generate before, and that didn't make sense given how little the tool was actually doing. Dropping it into Binary Ninja revealed that the majority of the code was there to support floating point formatting. So I changed the code to cast the floating point number to an integer before printing it, and that change brought the binary back down to the size I had been expecting.
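For illustration, here's the same trick in C terms rather than the original Zig (a minimal sketch; the size win assumes a toolchain that only links in the formatting code you actually use, e.g. a statically linked or embedded libc):

```c
#include <stdio.h>

int main(void) {
    double elapsed_ms = 1234.567;   /* hypothetical value to report */

    /* printf("%.0f ms\n", elapsed_ms);  <- would drag in the float-to-decimal code */

    /* Casting first means only the (much smaller) integer formatter is needed. */
    printf("%lld ms\n", (long long)elapsed_ms);
    return 0;
}
```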
> Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime)
MSVC defaults to linking against the UCRT, just as Clang and GCC on Linux default to linking against the system libc. This is to provide a reasonably useful C environment as a sane default.
If you don't want UCRT under MSVC, supply `/MT /NODEFAULTLIB /ENTRY:<function-name>` in the command-line invocation (or in the Visual Studio MSBuild options).
It is perfectly possible to build a Win32-only binary that is fully self-contained and only around 1 KiB.
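For anyone curious what that looks like, here is a rough sketch of a CRT-free Win32-only program (the exact flags vary by toolchain version, so treat the build line as approximate; with no CRT startup code you supply your own entry point and stick to Win32 calls):

```c
/* tiny.c — one possible build line:
 *   cl /O1 /GS- tiny.c /link /NODEFAULTLIB /ENTRY:entry /SUBSYSTEM:CONSOLE kernel32.lib
 * /GS- is needed because the stack-cookie support normally lives in the CRT.
 */
#include <windows.h>

/* Custom entry point: no CRT initialization runs, so avoid CRT functions. */
void entry(void) {
    DWORD written;
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), "hello\n", 6, &written, NULL);
    ExitProcess(0);
}
```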
Yep, I've done that before; it's how I know that linking in the C runtime is what bloats the binary. For most real projects I wouldn't disable linking it, but it's fun to see the output get so small.
You may or may not be able to pay the protection money to get around the warnings, but it is not at all orthogonal to how the binary is built: the scareware industry (Microsoft as well as third parties) absolutely despises executables that deviate from the default MSVC output.
We have been doing some experiments on optimizing for size, and currently it can be reduced to ~3 KB on 8-bit AVR. That only covers the implementation/tables for single-precision binary32; double precision requires quite a bit more. Much of the bloat is due to how limited AVR is, though; on platforms like x64 it should be much smaller.
> It's kind of mindblowing to see how much code floating point formatting needs.
If you want it to be fast. The baseline implementation isn’t terrible[1,2] even if it is still ultimately an implementation of arbitrary-precision arithmetic.
The Dragonbox author reports[1] about 25 ns/conversion, while Cox reports 1e5 conversions/s, so that’s a factor of 400. We can probably knock off half an order of magnitude for CPU differences if we’re generous (midrange performance-oriented Kaby Lake laptop CPU from 2017 vs Cox’s unspecified laptop CPU ca. 2010), but that’s still a factor of 100. Still a performance chasm.
You can likely get some of the performance back by picking the low-hanging fruit, e.g. switching from dumb one-byte bigint limbs in [0,10) to somewhat less dumb 32-bit limbs in [0,1e9). But generally, yes, this looks like a teaching- and microcontroller-class algorithm more than anything I’d want to use on a modern machine.
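To make the limb point concrete: a limb in [0,1e9) carries nine decimal digits per 32-bit word, so the digit-generation loop does roughly a ninth of the work (each limb then prints as nine digits with `%09u`, the top limb without padding). A hedged sketch of the multiply-by-small primitive such a bignum needs (names and code are mine, not from the cited implementations):

```c
#include <stdint.h>
#include <stddef.h>

#define LIMB_BASE 1000000000u   /* 1e9: nine decimal digits per 32-bit limb */

/* Multiply a little-endian bignum (limbs[0] = least significant) by a small
 * factor in place. The caller must reserve up to two extra limbs of growth.
 * Returns the new limb count. */
static size_t bignum_mul_small(uint32_t *limbs, size_t n, uint32_t factor) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)limbs[i] * factor + carry;
        limbs[i] = (uint32_t)(t % LIMB_BASE);
        carry = t / LIMB_BASE;
    }
    while (carry) {              /* append new high limbs as needed */
        limbs[n++] = (uint32_t)(carry % LIMB_BASE);
        carry /= LIMB_BASE;
    }
    return n;
}
```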
I'm guessing the majority of use cases limit the number of decimal points that are printed. I wonder if it would be more efficient to multiply by the number of decimals, convert to int, itoa(), and insert the decimal point where it belongs...
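Whichever count is meant (say, digits after the decimal point), the idea would look roughly like this naive sketch; it overflows for large magnitudes and isn't correctly rounded in every case, which is part of why the real algorithms are more involved:

```c
#include <stdio.h>
#include <math.h>

/* Print 'value' with 'digits' digits after the decimal point by scaling,
 * rounding to an integer, and re-inserting the dot. Sketch only. */
static void print_fixed(double value, int digits) {
    double scale = pow(10.0, digits);
    long long scaled = llround(fabs(value) * scale);
    long long whole = scaled / (long long)scale;
    long long frac  = scaled % (long long)scale;
    printf("%s%lld.%0*lld\n", value < 0 ? "-" : "", whole, digits, frac);
}

int main(void) {
    print_fixed(3.14159, 3);   /* prints 3.142 */
    print_fixed(-2.5, 1);      /* prints -2.5 */
    return 0;
}
```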
Not sure what you mean by decimal points. Did you mean the number of decimal digits to be printed in total, or the number of digits after the decimal dot, or something else?
In any case, what Dragonbox and other modern floating-point formatting algorithms do is already roughly what you describe: they compute the integer consisting of digits to be printed, and then print those digits, except:
- Dragonbox and some other algorithms have totally different requirements than `printf`. The user does not request the precision; rather, the algorithm determines the number of digits to print. So `1.2` is printed as `1.2` and `1.199999999999` is printed as `1.199999999999` (see the sketch after this list). You can read about the exact requirements in the Dragonbox README.
- The core of modern floating-point formatting algorithms is how to compute the needed multiplication by a power of 10 without doing it with plain bignum arithmetic (which is incredibly slow). Note that a `float` (assuming it's IEEE-754 binary32) instance can be as large as 2^100 or as small as 2^-100. It's nontrivial to deal with such numbers without incorporating bignum arithmetic, and even if you give up on avoiding it, bignum arithmetic itself is quite nontrivial in terms of the code size it requires.
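To make the first point concrete, here is a rough way to see the "shortest digits that round-trip" requirement using only the standard library. This brute-force loop is emphatically not how Dragonbox computes the result; it just shows what the target output is:

```c
#include <stdio.h>
#include <stdlib.h>

/* Find the shortest %.*g representation of x that parses back to the same
 * double. Dragonbox and friends compute this answer directly, without a
 * parse-back loop. */
static void print_shortest(double x) {
    char buf[64];
    for (int prec = 1; prec <= 17; prec++) {      /* 17 digits always suffice */
        snprintf(buf, sizeof buf, "%.*g", prec, x);
        if (strtod(buf, NULL) == x) break;
    }
    puts(buf);
}

int main(void) {
    print_shortest(1.2);        /* prints 1.2 */
    print_shortest(0.1 + 0.2);  /* prints 0.30000000000000004 */
    return 0;
}
```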
The linked Dragonbox[1] project is also worth a read; it's pretty well optimized, even in the least-used branches.
[1] https://github.com/jk-jeon/dragonbox