As far as I know, SIMD operations are constant time, since the whole vector needs to be computed, not just a single element of it.
ChaCha20 or Kyber might be relying on that. Although, I don’t know of any cryptographic algorithm that depends on vector-only calculation as a basis for its strength.
Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time execution.
> SIMD operations are constant time, since the whole vector needs to be computed
Is that actually guaranteed? Having to process the whole vector doesn't necessarily mean the time can't vary with the data for complex operations (like division).
I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform...", so it sounds like SIMD is at least situation-specific here.
I wonder if you could have a compiler that intentionally pads every instruction to use a vector register pair of <value, slow-const>, so that every value gets paired with whatever constant or other expression causes the slowest execution of that vector instruction, with the maximum latency chain. Otherwise, it does look like the VDIV instructions can have variable latency based on the data itself (and not just Spectre-like speculation issues on memory access patterns).
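A minimal sketch of the padding idea, using SSE intrinsics: the real operand sits in lane 0, and lane 1 is loaded with operands chosen to produce a denormal quotient, which is a classic slow path for the divider on many x86 cores. Whether this actually flattens VDIVPS latency is entirely microarchitecture-specific (and the "slow-const" values here are my guess, not from any vendor guidance), so treat this as the shape of the idea rather than a working mitigation.

```c
#include <immintrin.h>

/* Hypothetical "slow-const" padding for a vector divide.
 * Lane 0 carries the real data; lane 1 carries operands picked to
 * hit a slow divider path (a denormal result, ~1e-40f).
 * Note: _mm_set_ps takes lanes in reverse order (e3..e0). */
float padded_div(float a, float b) {
    __m128 num = _mm_set_ps(0.0f, 0.0f, 1e-38f, a);  /* lane 1: tiny numerator */
    __m128 den = _mm_set_ps(1.0f, 1.0f, 100.0f, b);  /* lane 1: quotient underflows to denormal */
    __m128 q   = _mm_div_ps(num, den);               /* one VDIVPS covers both lanes */
    return _mm_cvtss_f32(q);                         /* only the real lane's result is used */
}
```

The hope is that since the whole vector retires together, the slow lane's latency dominates regardless of what `a` and `b` are; whether hardware actually behaves that way would need per-CPU measurement.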
You don't have any guarantee that the SIMD operations are actually done in parallel (which is one of the assumptions needed for “the latency matches that of the slowest element”). E.g., I believe VPGATHERDD is microcoded (and serial) on some CPUs, NEON (128-bit) is designed to be efficiently implementable by running a 64-bit ALU twice, AVX2 and AVX-512 are likewise double-pumped on some (many) CPUs…
In MD5, there are calculations that, if they produce certain data ranges in certain bytes, trigger a secondary calculation that spreads the change into other bytes. Like a non-Euclidean carry operation.
So I would think: if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that the third operation fires for at least one of the vector values every time, then I think you’d have your solution.
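The usual way to get that "always fires" guarantee is not to make the conditional operation fire for at least one lane, but to make it fire unconditionally and blend the result in with a mask derived from the compare. A scalar sketch of the pattern (the SIMD analogue is compare + blend, e.g. PCMPEQD + PBLENDVB on x86):

```c
#include <stdint.h>

/* Turn a 0/1 condition into an all-zeros / all-ones mask. */
static uint32_t ct_mask(uint32_t cond) {
    return (uint32_t)(-(int32_t)cond);
}

/* Constant-time select: both inputs are already computed, so no
 * data-dependent branch or skipped work remains; the mask just
 * picks which one survives. Equivalent to cond ? a : b. */
uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t m = ct_mask(cond);
    return (a & m) | (b & ~m);
}
```

Both sides of the "if" always execute, which is exactly the property you want; the cost is that you always pay for the secondary calculation, needed or not.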
Hmm. I could have sworn the rotation had a component of the calculation in it, but you're right, it's just the loop counter. SHA-1 and AES also vary by the round, but none of SHA-2, DES, or MD4 has either round or state variance, so if I'm misremembering a different algorithm, it's pretty obscure. False memory, I guess.
I know the 5 algorithms you mentioned well.
> SHA-1 and AES also vary by the round
Vary what by the round? SHA-1 ( https://www.nayuki.io/res/cryptographic-primitives-in-plain-... ) is structured very similarly to MD5, with one big difference being that SHA-1 doesn't vary the rotation by the round number. AES has a public sequence of round constants that gets applied to the key expansion, but otherwise doesn't do anything special to the ciphertext per round.
The logic that can cause timing issues includes using a memory lookup table for the S-box and finite-field multiplication in AES, as well as data-dependent rotations in rare ciphers like https://en.wikipedia.org/wiki/RC5 and RC6.
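For the AES finite-field case specifically, the textbook fix is the branch-free "xtime" (multiply by {02} in GF(2^8)): instead of a table lookup or an `if` on the high bit, the 0x1B reduction is applied through an arithmetic mask, so the same instructions run for every input byte.

```c
#include <stdint.h>

/* Constant-time GF(2^8) doubling from AES: shift left, then
 * conditionally XOR the reduction polynomial 0x1B using a mask
 * (0x00 or 0xFF) instead of a data-dependent branch or table. */
uint8_t xtime_ct(uint8_t x) {
    uint8_t hi = x >> 7;  /* 1 iff reduction is needed */
    return (uint8_t)((x << 1) ^ (0x1B & (uint8_t)-hi));
}
```

Repeated application plus XORs gives any constant multiplication AES needs (e.g. {03}·x = xtime(x) ^ x), all without a data-dependent memory access.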
The tables have always made people nervous anyway. And the compromised EC entry just sealed the deal. How do we know they aren’t chosen by the NSA for mischief, even now that they should know better, since someone else will eventually figure out their trick too?