Random Hash Functions

ezyang · on April 1, 2012

Which is distinct from a "totally random hash function", which is a hash function for which the hash associated with every value is uniformly randomly selected from the key space. Totally random hash functions have very good properties, but they take a lot of space to store. (Exercise for the reader: if the hash function is {0,1}^n -> {0,1}^m (so the input is an n-bit string, and the output is an m-bit string), how much space do you need? Why isn't this compressible?)

avar · on April 1, 2012

This is off-topic but is there an official Google API + snippet for embedding Google+ comments in a non-Google+ page like that? He seems to have some hardcoded custom JavaScript code for it, is there no easy official way to do so?

0xbadcafebee · on April 1, 2012

....not sure if april fools joke or if i'm just really dumb

barrkel · on April 1, 2012

I had to AdBlock typekit.com in order to disable the custom font on this site; normally Firefox config option gfx.downloadable_fonts.enabled disables custom fonts, but not here.

(I think the "idea" presented here is silly; I think it's an April Fools. Any given NaN (IEEE NaNs have lots of different representations) should hash consistently and compare consistently with functions specific to the requirements of hash tables, i.e. one for which NaN = NaN for any given NaN. You shouldn't be able to store multiple NaNs in the hash table (unless you're specifically using distinct NaNs) no more than you can store multiple keys for '1.0', so there should be no quadratic explosion.)

(Oh and furthermore: I only know rudimentary Python, but timeit taking a string parameter for the code to time rather than a lambda is revolting.)

ezyang · on April 1, 2012

Did you run the code?

The reason why the behavior here is bad is because Python uses probing to implement hash collision detection. Essentially, once you've looked up the hash, you have to check to make sure the value stored at this hash is actually the key you are looking for. However, floating point equality is different from bit-wise equality, and so once a NaN is stored as a key in the table it's unretrievable, since it will always report that it's not equal to any key you may be looking for.

Maybe it's April Fool's, but only in the fine tradition of WTF but true.

barrkel · on April 1, 2012

I'm completely uninterested in the specifics of how Python hash tables are implemented; in other words, the code is irrelevant. What concerns me is the logic of hash functions as applied to hash tables, as a language independent concern.

Hash tables require certain things from the key hash and compare functions. If language defaults are used for comparing, particularly with NaNs, then hash tables break down. The fix isn't to make your hash function random; it's to make your compare function conform to the requirements of hash tables.

ezyang · on April 1, 2012

I'm not the biggest fan of the solution in the article; but my biggest problem is that you're using floating points as keys for your hash table.

Fix comparison for floating point? Well, now you're violating the IEEE standard.

Introduce a new 'equality-for-hash-tables' magic function? Well, now you have another special case that contradicts normal equality and doesn't even really have well defined semantics in many cases.

barrkel · on April 2, 2012

I'm not using floating point as keys in my hash table; Russ is.

If you want a hash table with meaningful semantics, you need certain guarantees from your hash and comparison functions. If you're trying to store values that break those comparisons, you either need to change the functions or not store those values.

You're advocating not storing those values; fine. But some people will, not even directly; for example, a 'memoize()' function that creates a cache behind the scenes, using a hash table with keys formed from function arguments, whose types may be floats. Here, you do want NaN = NaN; you want hash table rules to be followed; and you don't want a random hash function.

Just saying "sorry, I can't memoize this function" on pedantry grounds isn't good enough.

defrost · on April 2, 2012

What you should be doing, if you're doing numerics, is back up and find out why there's a NaN in the arguments in the first place.

Not all NaN's are created equal, in some contexts you'll want to always raise an exception whenever you see them, and in others you'll want to provide an appropriate closure or limit ... that depends on how you arrived at the Nan in the first place.

For what it's worth, you need to always know your libraries, for any one using Go to hash floats the behaviour is to return random for has( NaN ) :

    // NOTE: Because NaN != NaN, a map can contain any
    // number of (mostly useless) entries keyed with NaNs.
    // To avoid long hash chains, we assign a random number
    // as the hash value for a NaN.
    
    void
    runtime·f32hash(uintptr *h, uintptr s, void *a)
    {
        uintptr hash;
        float32 f;

        USED(s);
        f = *(float32*)a;
        if(f == 0)
                hash = 0;  // +0, -0
        else if(f != f)
                hash = runtime·fastrand1();  // any kind of NaN
        else
                hash = *(uint32*)a;
        *h ^= (*h ^ hash ^ 2860486313U) * 3267000013U;
    }

http://code.google.com/p/go/source/browse/src/pkg/runtime/al...

tantalor · on April 1, 2012

> Since the inputs are identical

They aren't identical because NaN != NaN. That's what "identity" means.