Retrieval using CNN embeddings requires computing a cosine similarity matrix, so for 'n' images a matrix of size n x n must be held in memory. As a result, the storage requirement grows quadratically with 'n'. We have already made some optimizations to reduce the memory footprint, but there are clear upper limits (we haven't experimented to find them). As for concrete numbers, the CIFAR-10 example present in the repo was run on a Google Colab notebook: the dataset has 60k images and ended up consuming about 6 GB of RAM using CNN. However, since no principled benchmarking has yet been done on the package, I would treat these numbers as only marginally indicative of its capabilities. A better way to figure out whether it works for you is to try it on your own dataset, starting at a small scale and increasing the dataset size gradually.
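To get a feel for the quadratic growth before trying a large dataset, you can estimate the naive upper bound for a dense n x n similarity matrix. The sketch below is an illustration only (the helper name and the float32 assumption are mine, not part of the package), and it ignores the memory optimizations mentioned above, so actual usage may be lower:

```python
def similarity_matrix_bytes(n, dtype_bytes=4):
    """Bytes needed for a dense n x n similarity matrix.

    Assumes float32 entries (4 bytes each); this is a naive upper
    bound that does not account for any memory optimizations.
    """
    return n * n * dtype_bytes

# Estimate for a few dataset sizes, in GiB.
for n in (10_000, 60_000, 100_000):
    gib = similarity_matrix_bytes(n) / 1024**3
    print(f"{n:>7} images -> ~{gib:.1f} GiB")
```

Note that for 60k images the naive bound is already well above the ~6 GB observed in the CIFAR-10 run, which reflects the optimizations applied in practice.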