As far as I know, SIMD operations are constant time, since the whole vector needs to be computed, not just a single element of it.
ChaCha20 or Kyber might be relying on that. Although, I don’t know of any cryptographic algorithm that depends on vector-only calculation as a basis for its strength.
Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time execution.
> SIMD operations are constant time, since the whole vector needs to be computed
Is that actually guaranteed? Having to process the whole vector doesn't necessarily mean the time can't vary with the data for complex operations (like division).
I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform...", so it sounds like SIMD is at least situation-specific here.
I wonder if you could have a compiler that intentionally pads every instruction to use a vector register pair of <value, slow-const>, so that every value gets paired with whatever constant or other expression causes the slowest execution of that vector instruction, with the maximum latency chain. Otherwise, it does look like the VDIV instructions can have variable latency based on the data itself (and not just Spectre-like speculation issues on memory access patterns).
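A minimal sketch of the padding idea, using SSE intrinsics: the real operand sits in lane 0, and lane 1 is loaded with operands chosen to produce a denormal quotient, which is a classic slow path for the divider on many x86 cores. Whether this actually flattens VDIVPS latency is entirely microarchitecture-specific (and the "slow-const" values here are my guess, not from any vendor guidance), so treat this as the shape of the idea rather than a working mitigation.

```c
#include <immintrin.h>

/* Hypothetical "slow-const" padding for a vector divide.
 * Lane 0 carries the real data; lane 1 carries operands picked to
 * hit a slow divider path (a denormal result, ~1e-40f).
 * Note: _mm_set_ps takes lanes in reverse order (e3..e0). */
float padded_div(float a, float b) {
    __m128 num = _mm_set_ps(0.0f, 0.0f, 1e-38f, a);  /* lane 1: tiny numerator */
    __m128 den = _mm_set_ps(1.0f, 1.0f, 100.0f, b);  /* lane 1: quotient underflows to denormal */
    __m128 q   = _mm_div_ps(num, den);               /* one VDIVPS covers both lanes */
    return _mm_cvtss_f32(q);                         /* only the real lane's result is used */
}
```

The hope is that since the whole vector retires together, the slow lane's latency dominates regardless of what `a` and `b` are; whether hardware actually behaves that way would need per-CPU measurement.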
You don't have any guarantee that the SIMD operations are actually done in parallel (which is one of the assumptions needed for “the latency matches that of the slowest element”). E.g., I believe VPGATHERDD is microcoded (and serial) on some CPUs, NEON (128-bit) is designed to be efficiently implementable by running a 64-bit ALU twice, AVX2 and AVX-512 are likewise double-pumped on some (many) CPUs…
In MD5, there are calculations that, if they produce certain data ranges in certain bytes, trigger a secondary calculation that spreads the change into other bytes. Like a non-Euclidean carry operation.
So I would think: if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that the third operation fires for at least one of the vector values every time, then I think you’d have your solution.
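The usual way to get that "always fires" guarantee is not to make the conditional operation fire for at least one lane, but to make it fire unconditionally and blend the result in with a mask derived from the compare. A scalar sketch of the pattern (the SIMD analogue is compare + blend, e.g. PCMPEQD + PBLENDVB on x86):

```c
#include <stdint.h>

/* Turn a 0/1 condition into an all-zeros / all-ones mask. */
static uint32_t ct_mask(uint32_t cond) {
    return (uint32_t)(-(int32_t)cond);
}

/* Constant-time select: both inputs are already computed, so no
 * data-dependent branch or skipped work remains; the mask just
 * picks which one survives. Equivalent to cond ? a : b. */
uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t m = ct_mask(cond);
    return (a & m) | (b & ~m);
}
```

Both sides of the "if" always execute, which is exactly the property you want; the cost is that you always pay for the secondary calculation, needed or not.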
Hmm. I could have sworn the rotation had a component of the calculation in it, but you're right, it's just the loop counter. SHA-1 and AES also vary by the round, but none of SHA-2, DES, or MD4 has either round or state variance, so if I'm misremembering a different algorithm, it's pretty obscure. False memory, I guess.
I know the 5 algorithms you mentioned well.
> SHA-1 and AES also vary by the round
Vary what by the round? SHA-1 ( https://www.nayuki.io/res/cryptographic-primitives-in-plain-... ) is structured very similarly to MD5, with one big difference being that SHA-1 doesn't vary the rotation by the round number. AES has a public sequence of round constants that gets applied to the key expansion, but otherwise doesn't do anything special to the ciphertext per round.
The logic that can cause timing issues includes using a memory lookup table for the S-box and finite-field multiplication in AES, as well as data-dependent rotations in rare ciphers like https://en.wikipedia.org/wiki/RC5 and RC6.
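For the AES finite-field case specifically, the textbook fix is the branch-free "xtime" (multiply by {02} in GF(2^8)): instead of a table lookup or an `if` on the high bit, the 0x1B reduction is applied through an arithmetic mask, so the same instructions run for every input byte.

```c
#include <stdint.h>

/* Constant-time GF(2^8) doubling from AES: shift left, then
 * conditionally XOR the reduction polynomial 0x1B using a mask
 * (0x00 or 0xFF) instead of a data-dependent branch or table. */
uint8_t xtime_ct(uint8_t x) {
    uint8_t hi = x >> 7;  /* 1 iff reduction is needed */
    return (uint8_t)((x << 1) ^ (0x1B & (uint8_t)-hi));
}
```

Repeated application plus XORs gives any constant multiplication AES needs (e.g. {03}·x = xtime(x) ^ x), all without a data-dependent memory access.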
The tables have always made people nervous anyway. And the compromised EC entry just sealed the deal. How do we know they aren’t chosen by the NSA for mischief, even now that they should know better, since someone else will eventually figure out their trick too?