Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

For short patterns this is probably right. Longer patterns will be faster using a modern sublinear search algorithm.


As the author of ripgrep, I wouldn't necessarily buy this. I suppose I might agree with it in the extremes, but SIMD prefilters are quite exceptionally difficult to beat with scalar code in the common cases. Longer patterns are great for the SIMD algorithms in ripgrep because of its use of background frequency distribution heuristics. That is, the longer the pattern, the less likely your candidate detection is to produce a false positive in its hot path. (I say "less likely" here intentionally. One can of course pick specific needles and haystacks where a false positive is reported at every position for any length N.)

I don't mean to 100% disagree with you, but I think it's misleading to suggest a sort of one dimensional view of things where, as the pattern gets larger, SIMD gets worse compared to sublinear search algorithms. There are other factors at play here, and, importantly, what "long" means in this context.

In many practical circumstances, "short" might be a few bytes, while "long" is 16 bytes. But maybe your idea of "long" is actually much longer.

If you're curious how your own algorithm stacks up to ripgrep's, you can plug your implementation into the `memchr` crate's benchmark harness: https://github.com/BurntSushi/memchr

It uses rebar: https://github.com/BurntSushi/rebar


Interesting, thanks for the detailed response. I'll have a look at the benchmark; I'm doing some work on algorithm benchmarking right now by coincidence.

I'd say a long pattern is more like 64 bytes (the benchmarking suite I use defines short patterns as 32 or under).

Edit: will also check out the frequency approach used in ripgrep, sounds fascinating.


You can read more about it at various points in these two blogs I've written:

https://blog.burntsushi.net/ripgrep/

https://blog.burntsushi.net/regex-internals/

64 bytes is decently long, but I'd still put my money on SIMD for "common" cases.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: