Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).
I mean, he's comparing 2024 Zen 5 and M4 against two generations behind 2022 Intel Raptor Lake. The Lion Cove should be roughly on par with the M4 on this test.
For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.
There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.
There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.
These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.
The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.
Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.
Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).
When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no software that currently used them, this was "fine".
And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.
However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.
Take a look at the "rsqrt_rcp" section of reference [6] in the accuracy report by Gladman et al referenced above. I did that work 10 years ago because some people at CERN had reported getting different results from certain programs depending on whether the exact same executables were run on Intel or AMD cpus. The result of the investigation was that the differing results were due to different implementations of the rsqrt instruction on the different cpus.
For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.
We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.
Mechanically, you're probably right, but the screen-centric controls of the newer generation are _awful_ by comparison to the F generation's physical buttons and dials (this isn't BMW though, it's the whole industry).
My wife and I both have F31s, which we will drive until we can no longer source replacement parts unless the industry comes to its senses first (unlikely). Any time we've ever looked at plausible replacements, the screen-based controls are an immediate hard no.
Our (~2015) 3-series controls are just about perfect. Where they differ from Honda/Toyota's controls that I am also very familiar with, they're noticeably better now that I'm familiar with them. Everything is really well thought-out.
Of course, now they (and almost every other manufacturer) have followed Tesla off the cliff and made everything a screen, so the current generation cars have abysmal controls.
Schubfach's table is quite large compared to some alternatives with similar performance characteristics. swiftDtoa's code and tables combined are smaller than just Schubfach's table in the linked implementation. Ryu and Dragonbox are larger than swiftDtoa, but also use smaller tables than Schubfach, IIRC.
If I$ is all you care about, then table size may not matter, but for constrained systems, other algorithms in general, and swiftDtoa in particular, may be better choices.
IEEE 754 is a floating point standard. It has a few warts that would be nice to fix if we had tabula rasa, but on the whole is one of the most successful standards anywhere. It defines a set of binary and decimal types and operations that make defensible engineering tradeoffs and are used across all sorts of software and hardware with great effect. In the places where better choices might be made knowing what we know today, there are historical reasons why different choices were made in the past.
DEC64 is just some bullshit one dude made up, and has nothing to do with “floating-point standards.”
It is important to remember that IEEE 754 is, in practice, aspirational. It is very complex and nobody gets it 100% correct. There are so many end cases around the sticky bit, quiet vs. signaling NaNs, etc, that a processor that gets it 100% correct for every special case simply does not exist.
One of the most important things that IEEE 754 mandates is gradual underflow (denormals) in the smallest binate. Otherwise you have a giant non-monotonic jump between the smallest normalizable float and zero. Which plays havoc with the stability of numerical algorithms.
Sorry, no. IEEE 754 is correctly implemented in pretty much all modern hardware [1], save for the fact that optional operations (e.g., the suggested transcendental operations) are not implemented.
The problem you run into is that the compiler generally does not implement the IEEE 754 model fully strictly, especially under default flags--you have to opt into strict IEEE 754 conformance, and even there, I'd be wary of the potential for bugs. (Hence one of the things I'm working on, quite slowly, is a special custom compiler that is designed to have 100% predictable assembly output for floating-point operations so that I can test some floating-point implementation things without having to worry about pesky optimizations interfering with me).
[1] The biggest stumbling block is denormal support: a lot of processors opted to support denormals only by trapping on it and having an OS-level routine to fix up the output. That said, both AMD and Apple have figured out how to support denormals in hardware with no performance penalty (Intel has some way to go), and from what I can tell, even most GPUs have given up and added full denormal support as well.
reply