The reason why the behavior here is bad is because Python uses probing to implement hash collision detection. Essentially, once you've looked up the hash, you have to check to make sure the value stored at this hash is actually the key you are looking for. However, floating point equality is different from bit-wise equality, and so once a NaN is stored as a key in the table it's unretrievable, since it will always report that it's not equal to any key you may be looking for.
Maybe it's April Fool's, but only in the fine tradition of WTF but true.
I'm completely uninterested in the specifics of how Python hash tables are implemented; in other words, the code is irrelevant. What concerns me is the logic of hash functions as applied to hash tables, as a language independent concern.
Hash tables require certain things from the key hash and compare functions. If language defaults are used for comparing, particularly with NaNs, then hash tables break down. The fix isn't to make your hash function random; it's to make your compare function conform to the requirements of hash tables.
I'm not the biggest fan of the solution in the article; but my biggest problem is that you're using floating points as keys for your hash table.
Fix comparison for floating point? Well, now you're violating the IEEE standard.
Introduce a new 'equality-for-hash-tables' magic function? Well, now you have another special case that contradicts normal equality and doesn't even really have well defined semantics in many cases.
I'm not using floating point as keys in my hash table; Russ is.
If you want a hash table with meaningful semantics, you need certain guarantees from your hash and comparison functions. If you're trying to store values that break those comparisons, you either need to change the functions or not store those values.
You're advocating not storing those values; fine. But some people will, not even directly; for example, a 'memoize()' function that creates a cache behind the scenes, using a hash table with keys formed from function arguments, whose types may be floats. Here, you do want NaN = NaN; you want hash table rules to be followed; and you don't want a random hash function.
Just saying "sorry, I can't memoize this function" on pedantry grounds isn't good enough.
What you should be doing, if you're doing numerics, is back up and find out why there's a NaN in the arguments in the first place.
Not all NaN's are created equal, in some contexts you'll want to always raise an exception whenever you see them, and in others you'll want to provide an appropriate closure or limit ... that depends on how you arrived at the Nan in the first place.
For what it's worth, you need to always know your libraries, for any one using Go to hash floats the behaviour is to return random for has( NaN ) :
// NOTE: Because NaN != NaN, a map can contain any
// number of (mostly useless) entries keyed with NaNs.
// To avoid long hash chains, we assign a random number
// as the hash value for a NaN.
void
runtime·f32hash(uintptr *h, uintptr s, void *a)
{
uintptr hash;
float32 f;
USED(s);
f = *(float32*)a;
if(f == 0)
hash = 0; // +0, -0
else if(f != f)
hash = runtime·fastrand1(); // any kind of NaN
else
hash = *(uint32*)a;
*h ^= (*h ^ hash ^ 2860486313U) * 3267000013U;
}
The reason why the behavior here is bad is because Python uses probing to implement hash collision detection. Essentially, once you've looked up the hash, you have to check to make sure the value stored at this hash is actually the key you are looking for. However, floating point equality is different from bit-wise equality, and so once a NaN is stored as a key in the table it's unretrievable, since it will always report that it's not equal to any key you may be looking for.
Maybe it's April Fool's, but only in the fine tradition of WTF but true.