
It seems like this approach could be adapted as a dimensionality reduction technique: select a set of k random hyperplanes denoted p_i, then map each point to a new vector r such that r_i is the distance to the hyperplane p_i. My gut tells me it may be more efficient to reuse existing nearest neighbor approaches in that embedding space than to bucket them and compute the partial intersection of 20 sets.
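For concreteness, a minimal sketch of that embedding (hypothetical names; it assumes hyperplanes through the origin, so the signed distance to each is just a dot product with a unit normal):

    import numpy as np

    def random_hyperplane_embedding(points, k, seed=0):
        """Map points of shape (n, d) to shape (n, k), where coordinate i
        is the signed distance to a random hyperplane p_i through the origin."""
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        normals = rng.normal(size=(k, d))  # random orientations
        normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # unit normals
        # signed distance to hyperplane with unit normal n is x . n
        return points @ normals.T

Any off-the-shelf nearest-neighbor index (k-d tree, HNSW, etc.) could then be run in that k-dimensional space.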

That having been said, this was a wonderful read, and beautifully presented.



Yes, given that the dimensionality of the input is at least the dimensionality of the hash value, locality-sensitive hashing is always a dimensionality reduction operation.

Conversely, any dimensionality reduction technique with a fixed number of finite-range output dimensions can be used for locality-sensitive hashing, as long as a couple of extra constraints hold: (1) nearby inputs (under an appropriate distance metric) produce nearby outputs (under a second appropriate distance metric), and (2) far-apart inputs are likely to produce far-apart outputs.
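As a sketch of that converse direction (hypothetical: `embed` stands in for any dimensionality reduction with finite-range outputs, and grid quantization is just one way to bucket them):

    import numpy as np

    def dr_to_lsh(embed, cell_width):
        # Quantize each output dimension of the DR map into a grid cell.
        # Constraints (1) and (2) make nearby inputs likely to share a
        # cell and far-apart inputs likely to land in different cells.
        def hash_fn(x):
            return tuple(np.floor(embed(x) / cell_width).astype(int))
        return hash_fn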


That seems to imply that LSHs are a subset of DRs. Are there LSH techniques that are not applicable to DR?


As with so many of these algorithms, there are a few perspectives, and the most useful way to contextualize it depends on the problem you're trying to solve.

I would say that it's both a dimensionality reduction and an approximate nearest neighbor search technique.

One example of a scenario where I think the ANN perspective is more useful than the DR one is near-duplicate detection. There, the buckets tend to be treated not as dimensions so much as filtered lists of items to consider for more detailed comparison. Assuming you're trying to do a high-recall search for everything within some distance threshold, you may just look at the union of all the matching buckets.
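A rough sketch of that union-of-buckets filtering (hypothetical structure: one dict per hash table, mapping bucket keys to sets of item ids):

    def candidate_set(tables, query_keys):
        # Union of every bucket the query falls into, one per hash table;
        # candidates are then verified with an exact distance computation.
        candidates = set()
        for table, key in zip(tables, query_keys):
            candidates |= table.get(key, set())
        return candidates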

On the other hand, if you're doing a k-nearest-neighbors search (rather than a dragnet search for all items within some threshold), you might treat the buckets more like dimensions, since items that share more buckets with the search vector are likely to also be the most similar. So perhaps you start by ranking by Hamming distance in the LSH space, and then refine using exact comparisons in the higher-dimensional space.
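Something like this two-stage ranking (a sketch, assuming binary LSH signatures and Euclidean distance in the original space; all names are made up):

    import numpy as np

    def knn_via_lsh(query_sig, sigs, query_vec, vecs, k, shortlist=100):
        # Stage 1: rank everything by Hamming distance in the LSH space.
        hamming = np.count_nonzero(sigs != query_sig, axis=1)
        idx = np.argsort(hamming)[:shortlist]
        # Stage 2: refine the shortlist with exact high-dimensional distances.
        exact = np.linalg.norm(vecs[idx] - query_vec, axis=1)
        return idx[np.argsort(exact)[:k]]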


> Are there LSH techniques that are not applicable to DR?

They're all applicable, but perhaps not useful for your particular problem. It all depends on how well the distance metric implied by the LSH matches the one that matters in your domain.




