Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yeah, you've pretty much got it. A common alternative to masking bits off of the hash is to take hash modulo the size of the table as the index (although you have to be careful with a modulo strategy, so as not to introduce a bias towards certain table slots).

There are strategies to make the resize not be so expensive. Wikipedia's page on Hash Tables covers this at http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing

[edit: clarity in the first paragraph]



If the hash function is any good, simply bitwise-and enough bits from the hash and use that as the index. (It's a modulo too, but simply modulo(2^x).)

Using 2's powers generally fits nicely with programming on binary computers. It also has the nice quality of producing number of slots that are always 2's powers and you can shove those naturally for example into a single memory page.


iirc, there is a major advantage to using a prime number of slots that prevents the biasing


It's to protect against bad hashing algorithms:

http://www.concentric.net/~Ttwang/tech/primehash.htm


Does the modulo operation add a negligible cost to these (very cheap) hashing functions?

i.e. if you're hashing a 10 character (non-unicode) word with FNV-1a you're using 10 multiplications and 10 xors. Adding a modulo (by a prime†) operation in there could feasibly double the time taken.

†Not 2...


The modolo function is pretty much required unless you have 2^32 bytes of RAM available for each hashtable. Modolo is also ridiculously cheap, especially compared to the hashing function.


Certainly, if you are hashing a large amount of data then a single modulo operation isn't going to cost a lot. But if you're just hashing an English word, possibly even a URL, I'm not so sure.

You could shift or mask the result (as described above) to use a table size of any power of 2 to avoid the modulo.


Doing constant modulo on an integer is very fast - we're talking a few CPU operations.


It won't be a compile-time constant unless your hash table is incapable of resizing. And modulo arithmetic on arbitrary values is slow (20-94 cycles of latency on Sandy Bridge).


Hmm, good point. Is that also true for JIT compilers?


Yes, but bitwise AND is faster.


Cross that bridge when you get there. If you're really finding that your hash function is the slowest piece of code in a tightly bound for-loop, and it's not all the collisions you're having (from having a bad hashing function), then look into alternatives. Before that, you're doing premature optimization.


Please read the whole thread before replying with the "premature optimization" thing. We are talking about hash table optimizations; this commenter -- http://news.ycombinator.com/item?id=4037320 -- is asking about the cost of modulo operation.


IIrc changing a modulo to an and roughly doubled the performance of my hashtable.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: