(author here) they try for all instructions, just that it's a prediction w/replay because inevitably some instructions like memory loads are variable latency. It's not like Nvidia where fixed latency instructions are statically scheduled, then memory loads/other variable latency stuff is handled dynamically via scoreboarding.
It does clock ramp from 800 MHz idle to 3.2 GHz under load, with 900, 1000, 1100, 1300, 1500, 1800, 2200, and 2700 MHz steps in between, hitting 3.2 GHz after 71.6 ms. The article was getting long enough, so I just left it at: it reaches 3.2 GHz and stays there, even though the spec sheet says it should go higher.
I remoted into the system for testing (Cheese/George had it), and he said it took 3-4 cold reboots for it to come up, and suspected memory wasn't training correctly. So I did all the testing without ever rebooting the system, because it might not come back up if I tried.
Yeah, Wordpress was a terrible platform and Substack is also a terrible platform. I don't know why every platform wants to take a simple uploaded PNG and apply TAA to it. And don't get me started on how Substack has no native table support, when HTML has had it since prehistoric times.
If I had more time I'd roll my own site with basic HTML/CSS. It's not even hard, just time consuming.
(author here) When I checked the 7600 XT was much more expensive.
Right now it's still $360 on eBay, vs the B580's $250 MSRP, though yeah I guess it's hard to find the B580 in stock
Yeah, I guess regional availability really factors into it... bummer
I wonder if the B580 will drop to MSRP at all, or if retailers will just keep it slotted into the greater GPU line-up the way it is now and pocket the extra money.
"don't run any faster than a sequence of simpler instructions"
This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through 10h family has 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.
For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need at least a shift and a dependent add. That's two extra cycles of latency.
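To make that concrete, here's the address arithmetic spelled out in C: the fused computation a single indexed load performs in its AGU, versus the shift-then-dependent-add sequence you'd need from simpler instructions (illustrative function names, not from any ISA manual; assumes 8-byte elements):

```c
#include <stdint.h>

/* What one x86/aarch64 indexed load computes in its AGU,
 * e.g. mov rax, [base + index*8]. */
uint64_t indexed_addr(uint64_t base, uint64_t index) {
    return base + (index << 3);
}

/* The "simple instruction" decomposition: a shift, then a dependent
 * add - two back-to-back ALU ops, hence two extra cycles of latency
 * before the load can even begin. */
uint64_t decomposed_addr(uint64_t base, uint64_t index) {
    uint64_t scaled = index << 3;   /* cycle 1: shift */
    return base + scaled;           /* cycle 2: dependent add */
}
```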
Ok, there exist cores that don't have a penalty for scaled indexed addressing (though many do). Or is it that they don't see any benefit from non-indexed addressing? Do they simply take a clock speed hit?
But that is all missing the point of "true but irrelevant".
You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed load/store.
You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.
A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.
But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is enough simpler that it can clock slightly higher. Or enough smaller that you can use less energy or put more cores in the same area of silicon.
And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.
It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.
So your argument isn't that it's irrelevant, but rather that it might be irrelevant, if you happen to have a core where the extra latency of a 64-bit adder on the load/store AGU pushes it just over to the next cycle.
Though I'd imagine that just having the extra cycle conditionally for indexed load/store instrs would still be better than having a whole extra instruction take up decode/ROB/ALU resources (and the respective power cost), or the mess that comes with instruction fusion.
And with RISC-V already requiring a 12-bit adder for loads/stores, and thus an increment/decrement for the top 52 bits, the extra latency of going to a full 64-bit adder is presumably quite a bit less than that of a full separate 64-bit adder. (And if the mandatory 64+12-bit adder already pushed the latency up by a cycle, a separate shNadd will result in two cycles of latency over the hypothetical adderless case, despite 1 cycle clearly being feasible!)
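A rough sketch of the adder structure being described: base plus a sign-extended 12-bit immediate computed as a narrow low-bits add plus a carry/borrow adjustment of the top 52 bits, rather than one full 64-bit carry chain. This is a functional model of the arithmetic only, not how any particular core implements it:

```c
#include <stdint.h>

/* base + imm, where imm is a sign-extended 12-bit immediate
 * (-2048..2047), computed as: a 12-bit add of the low bits, plus an
 * increment (carry out of the low add) and/or decrement (negative imm)
 * of the top 52 bits. */
uint64_t split_add(uint64_t base, int64_t imm) {
    uint64_t low   = (base & 0xFFF) + ((uint64_t)imm & 0xFFF);
    uint64_t carry = low >> 12;                     /* 0 or 1 */
    uint64_t high  = (base >> 12) + carry
                   + (imm < 0 ? (uint64_t)-1 : 0);  /* borrow for imm < 0 */
    return (high << 12) | (low & 0xFFF);
}
```

The point being: the high 52 bits only ever see a +1/-1 adjustment, which is cheaper than a full-width add with a general second operand.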
Even if the RISC-V way might be fine for tight loops, most code isn't such. And ideally most tight loops doing consecutive loads would vectorize anyway.
We're in a world where the latest Intel cores can do small immediate adds at rename, usually materializing them in consuming instructions, which I'd imagine is quite a bit of overhead for not that much benefit.
No, my argument is that even if load with scaled indexed addressing takes a cycle longer, it's a rare enough thing given a good compiler and, yes, in many cases vector/SIMD processing, that you are very unlikely to actually be able to measure a difference on a real-world program.
I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.
My point with vectorization was that the one case where indexed loads/stores are most defensibly unnecessary is also the case where you shouldn't want scalar mem ops in the first place. That means many scalar mem ops would be outside of tight loops, and outside of tight loops is also where unrolling/strength reduction/LICM to reduce the need for indexed loads is least applicable.
Just ran a quick benchmark - seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there are no chained instructions (5c latency in all other cases, including an indexed load without chained instructions; "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" is always 5c). So even with existing cases where the indexed load pushes it over to the next cycle, there are cases where the indexed load is free too.
And outside of tight loops is where a cycle here or there is irrelevant to the overall speed of the program. All the more so if you're going to have cache or TLB misses on those loads.
I quite heavily disagree. Perhaps that applies to programs which spend ~90% of their time in a couple of tight loops, but there's tons of software that isn't that simple (especially web... well, everything, but also compilers, video game logic, whatever bits of kernel logic happen in syscalls, etc.), instead spending a ton of time whizzing around a massive mess. And you want that mess to run as fast as possible, regardless of how much the mess being a mess makes low-level devs cry. If there's headroom in the AGU for a 64-bit adder, I'd imagine it's an essentially free couple-percent boost; though the cost of extra register port(s) (or the logic of sharing some with an ALU) might be annoying.
And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.
This is not a discussion that can be resolved in the abstract. It requires actual experimentation and data: pointing at actual physical CPUs differing only in this respect and comparing the silicon area, energy use, MHz achieved, and cycles per program.
It's certainly not a thing to be resolved in the abstract, but it's also far from a thing to be dismissed as irrelevant in the abstract.
But I have a hard time imagining that my general point of "if there's headroom for a full 64-bit adder in the AGU, adding one is very cheap and can provide a couple percent boost in applicable programs" is far from true. Though the register file port requirement might make that less trivial than I'd like it to be.
Note that Zba's sh1add/sh2add/sh3add take care of the problem of separate shift+add.
But yeah, modern x86-64 doesn't have any difference between indexed and plain loads[0], nor Apple M1[1] (nor even cortex-a53, via some local running of dougallj's tests; though there's an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).
Of course, one has to wonder whether that's ended up costing the plain loads something. It kinda saddens me seeing unrolled loops on x86 result in a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each access, when typically the index could be moved out of the loop (though at the cost of needing to bump multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. Of course, that's not applicable to loads outside of tight loops.
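In C terms, the two codegen strategies being contrasted look roughly like this (both loops compute the same thing; which is faster depends on whether the core's AGU makes the indexed form free):

```c
#include <stddef.h>

/* The indexed form: every access computes base + i*8 in the addressing
 * mode, i.e. the [r1 + r2*8 + const] spam after unrolling. */
long sum_indexed(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The strength-reduced form: plain loads off a pointer bumped once per
 * iteration - simpler addresses, but with several arrays in the loop,
 * each needs its own bumped pointer (and its own register). */
long sum_ptr_bump(const long *a, size_t n) {
    long s = 0;
    for (const long *p = a, *end = a + n; p != end; p++)
        s += *p;
    return s;
}
```

Compilers pick between these forms depending on the target's addressing-mode costs and register pressure, which is exactly the trade-off at issue in this thread.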
I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.
(author here) Just a 32 entry BTB is technically a possibility from microbenchmark results, but the EIC7700X datasheet straight up says:
"a branch prediction unit that is composed of a 32-entry Branch Target Buffer (BTB), a 9.1 KiB-entry Branch History Table (BHT), a 16-entry Return Address Stack (RAS), 512-entry Indirect Jump Target Predictor (IJTP), and a 16-entry Return Instruction Predictor"
(author here) I compared it to the A75 on the Snapdragon 670, not the 845. I chose that comparison because I have a Pixel 3a (my previous daily driver cell phone), and that's the only A75 core I had access to.
Note - I saw the article through from start to finish. For power measurements I modified my memory bandwidth test to read AMD's core energy status MSR, and modified the instruction bandwidth testing part to create a loop within the test array. (https://github.com/clamchowder/Microbenchmarks/commit/6942ab...)
Remember most of the technical analysis on Chips and Cheese is a one person effort, and I simply don't have infinite free time or equipment to dig deeper into power. That's why I wrote "Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out."
Sorry, I totally didn't mean this as a slight to your work--I've been a fan for quite a while :)
More so that estimating power when you don't have access to post synthesis simulations or internal gas gauges is very hard. For something so small, I can easily see this being a massive pain to measure in the field and the kind of thing that would easily vanish into the noise on a real system.
But in the absence of any clear answer, I do think it's reasonable to assume that the feature does in fact have the power advantages AMD intended, even if small.