
It seems like this approach could be adapted as a dimensionality reduction technique: select a set of k random hyperplanes denoted p_i, then map each point to a new vector r such that r_i is the distance to the hyperplane p_i. My gut tells me it may be more efficient to reuse existing nearest neighbor approaches in that embedding space than to bucket them and compute the partial intersection of 20 sets.
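For concreteness, a minimal sketch of that embedding (hypothetical names; it assumes hyperplanes through the origin, so the signed distance to each is just a dot product with a unit normal):

    import numpy as np

    def random_hyperplane_embedding(points, k, seed=0):
        """Map points of shape (n, d) to shape (n, k), where coordinate i
        is the signed distance to a random hyperplane p_i through the origin."""
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        normals = rng.normal(size=(k, d))  # random orientations
        normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # unit normals
        # signed distance to hyperplane with unit normal n is x . n
        return points @ normals.T

Any off-the-shelf nearest-neighbor index (k-d tree, HNSW, etc.) could then be run in that k-dimensional space.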

That having been said, this was a wonderful read, and beautifully presented.



Yes, given that the dimensionality of the input is at least the dimensionality of the hash value, locality-sensitive hashing is always a dimensionality reduction operation.

Conversely, any dimensionality reduction technique with a fixed number of finite-range output dimensions can be used for locality-sensitive hashing, as long as a couple of extra constraints hold: (1) nearby inputs (under an appropriate distance metric) produce nearby outputs (under a second appropriate distance metric), and (2) far-apart inputs are likely to produce far-apart outputs.
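As a sketch of that converse direction (hypothetical: `embed` stands in for any dimensionality reduction with finite-range outputs, and grid quantization is just one way to bucket them):

    import numpy as np

    def dr_to_lsh(embed, cell_width):
        # Quantize each output dimension of the DR map into a grid cell.
        # Constraints (1) and (2) make nearby inputs likely to share a
        # cell and far-apart inputs likely to land in different cells.
        def hash_fn(x):
            return tuple(np.floor(embed(x) / cell_width).astype(int))
        return hash_fn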


That seems to imply that LSHs are a subset of DRs. Are there LSH techniques that are not applicable to DR?


As with so many of these algorithms, there are a few perspectives, and the most useful way to contextualize it depends on the problem you're trying to solve.

I would say that it's both a dimensionality reduction and an approximate nearest neighbor search technique.

One example of a scenario where I think the ANN perspective is more useful than the DR one is near-duplicate detection. There, the buckets tend to be treated not as dimensions so much as filtered lists of items to consider for more detailed comparison. Assuming you're trying to do a high-recall search for everything within some distance threshold, you may just look at the union of all the matching buckets.
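A rough sketch of that union-of-buckets filtering (hypothetical structure: one dict per hash table, mapping bucket keys to sets of item ids):

    def candidate_set(tables, query_keys):
        # Union of every bucket the query falls into, one per hash table;
        # candidates are then verified with an exact distance computation.
        candidates = set()
        for table, key in zip(tables, query_keys):
            candidates |= table.get(key, set())
        return candidates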

On the other hand, if you're doing a k-nearest-neighbors search (rather than a dragnet search for all items within some threshold), you might treat the buckets more like dimensions, since items that share more buckets with the search vector are likely to also be the most similar. So perhaps you start by ranking by Hamming distance in the LSH space, and then refine using exact comparisons in the higher-dimensional space.
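Something like this two-stage ranking (a sketch, assuming binary LSH signatures and Euclidean distance in the original space; all names are made up):

    import numpy as np

    def knn_via_lsh(query_sig, sigs, query_vec, vecs, k, shortlist=100):
        # Stage 1: rank everything by Hamming distance in the LSH space.
        hamming = np.count_nonzero(sigs != query_sig, axis=1)
        idx = np.argsort(hamming)[:shortlist]
        # Stage 2: refine the shortlist with exact high-dimensional distances.
        exact = np.linalg.norm(vecs[idx] - query_vec, axis=1)
        return idx[np.argsort(exact)[:k]]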


> Are there LSH techniques that are not applicable to DR?

They're all applicable, but perhaps not useful for your particular problem. It all depends on how well the distance metric implied by the LSH matches the one that matters in your domain.




