Yeah, the title is misleading because really the problem described in the blog post is to store some unknown number of ints between 1 and 10,000. A better solution than a straight bitmap of 10,000 bits is to use all 4KB and devote three bits to each slot: then you can remember up to 7 occurrences of each integer.
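For concreteness, a minimal sketch of that 3-bit-per-slot scheme (my own illustration, not the blog post's code): 10,000 values x 3 bits = 30,000 bits, which fits in the 4KB page with room to spare, and each counter saturates at 7.

    #include <stdint.h>

    #define PAGE_BYTES 4096
    static uint8_t page[PAGE_BYTES];   /* the whole 4KB budget */

    /* Read the 3-bit counter for value v (1..10000). */
    static unsigned get_count(int v) {
        unsigned bit  = (unsigned)(v - 1) * 3;
        unsigned word = page[bit / 8] | ((unsigned)page[bit / 8 + 1] << 8);
        return (word >> (bit % 8)) & 7u;
    }

    /* Record one occurrence of v, saturating at 7. */
    static void record(int v) {
        unsigned c = get_count(v);
        if (c == 7) return;                          /* counter already saturated */
        unsigned bit  = (unsigned)(v - 1) * 3;
        unsigned word = page[bit / 8] | ((unsigned)page[bit / 8 + 1] << 8);
        word = (word & ~(7u << (bit % 8))) | ((c + 1) << (bit % 8));
        page[bit / 8]     = (uint8_t)(word & 0xff);
        page[bit / 8 + 1] = (uint8_t)(word >> 8);
    }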
There's no need for bloom filters here because you can already cover all the possibilities using a bitmap.
Additionally, you could decide to store only up to 6 occurrences of each integer in the bit array, use the value 7 as an indicator that there are 7 or more occurrences, and store the actual counts for those cases in a sorted array occupying the remaining 346 bytes of the 4KB page.
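A sketch of what that overflow table might look like (again my own layout, just to illustrate the idea): once a value's 3-bit counter hits the sentinel 7, its exact count lives in a small sorted array in the 346 bytes left after the 3,750-byte counter array. Shown here as standalone statics rather than carved out of the page.

    #include <stdint.h>
    #include <string.h>

    #define TABLE_BYTES 346
    #define MAX_ENTRIES ((TABLE_BYTES - 2) / 4)     /* 86 four-byte entries */

    struct entry { uint16_t value; uint16_t count; };

    static uint16_t     n_entries;                  /* would live in the page tail */
    static struct entry table[MAX_ENTRIES];

    /* Called once the 3-bit counter for `value` has hit the sentinel 7:
     * track (or keep bumping) the exact count here. Returns 0 if full. */
    static int bump_overflow(uint16_t value) {
        int lo = 0, hi = n_entries;
        while (lo < hi) {                           /* binary search by value */
            int mid = (lo + hi) / 2;
            if (table[mid].value < value) lo = mid + 1; else hi = mid;
        }
        if (lo < (int)n_entries && table[lo].value == value) {
            table[lo].count++;                      /* already tracked */
            return 1;
        }
        if (n_entries == MAX_ENTRIES) return 0;     /* out of spare bytes */
        memmove(&table[lo + 1], &table[lo], (n_entries - lo) * sizeof table[0]);
        table[lo] = (struct entry){ value, 7 };     /* the 7th occurrence just happened */
        n_entries++;
        return 1;
    }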
This looks like a very simplified bloom filter made with a very bad hash function. Mind you, though, I am pretty sure I'd never rediscover that particular data structure by myself :)
Basically it depends. If you expect to have to record no more than about 32,768 / 14 ≈ 2,300 values, simply keep a sorted list of the actual integers using a sorting/insertion algorithm of your choice; if you expect more than that but fewer than 8 occurrences of any given integer, use a 10,000 x 3-bit-deep bitmap; and if you don't care how many occurrences of each integer there were, use a 10,000 x 1-bit-deep bitmap.
In any event the unintelligible and buggy proposed solution doesn't seem like a good choice.
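For the first of those options, the sorted list could be as simple as the following sketch (mine, using plain 16-bit slots for simplicity, which caps it at 4096 / 2 = 2,048 values rather than the ~2,300 you'd get by packing values into 14 bits):

    #include <stdint.h>
    #include <string.h>

    #define MAX_VALUES (4096 / 2)        /* 2,048 two-byte slots in the page */

    static uint16_t list[MAX_VALUES];
    static int      n_values;

    /* Insert v keeping the list sorted; returns 0 once the page is full. */
    static int insert_sorted(uint16_t v) {
        if (n_values == MAX_VALUES) return 0;
        int lo = 0, hi = n_values;
        while (lo < hi) {                /* binary search for insertion point */
            int mid = (lo + hi) / 2;
            if (list[mid] < v) lo = mid + 1; else hi = mid;
        }
        memmove(&list[lo + 1], &list[lo], (n_values - lo) * sizeof list[0]);
        list[lo] = v;
        n_values++;
        return 1;
    }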
I'm confused. Is 10000 the number of integers or the maximum integer that has to be stored?
I would have thought it (in general) impossible to store more than 2^15/log2(10000)~2500 integers between 1 and 10000 in 4kB, but I look forward to hearing more.
If you know the maximum integer you can receive, you can do a sort of bucket sort, where the Nth bit in the page represents whether you have received number N. So you can store as many numbers as you have bits available.
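In code, that bucket/bitmap idea is just a couple of lines (my sketch):

    #include <stdint.h>

    static uint8_t seen[4096];                        /* 4KB = 32,768 bits */

    /* Bit n-1 of the page records whether the number n has been seen. */
    static void mark(int n)     { seen[(n - 1) / 8] |= (uint8_t)(1u << ((n - 1) % 8)); }
    static int  was_seen(int n) { return (seen[(n - 1) / 8] >> ((n - 1) % 8)) & 1; }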
Well obviously. That idea is not (IMO) worthy of a news post. But it doesn't say that the integers are distinct. I'd be interested to hear whether it's possible to do better than the above for numbers in general.
If the integers cannot be guaranteed to be distinct, and all of them must be presented at the end, then no algorithm which restricts itself to O(1) memory usage can be correct. Since the problem restricts you to using 4KB of RAM (O(1)), and there is presumably a solution, you can conclude that you need not count non-distinct inputs - but yeah, it's kind of a bad problem description.
I'm not sure you can change the question just because there is no answer :P
4KB of RAM is not the same as an O(1) memory requirement, since we have already restricted the number and range of integers to 10,000 (i.e. constants); there is nothing left to vary. The page size just presents a target ratio for compression, not an asymptotic restriction on memory usage.
In general, do that then run some compression on it. That will do better if the data wasn't random.
Or if you're expecting random data, invent a biased compression algorithm and use that (e.g. one that has shortcuts for storing any multiples of 3). Most of the time it will do nothing and a few times you'll randomly get data it's good at.
Actually on second thought I'm not sure if that works or not. You'll need a header to identify if the compression was used or not. So the compression has to save more on average than this header info costs.
This header problem becomes clearer if you try to chain thousands of special-case compression algorithms that each map a single input to one bit and are otherwise unused. It seems to save space at first (at the cost of CPU time, and of space to store the algorithms themselves), but identifying which ones were used becomes the problem. Since they all need unique headers, you need just as many header possibilities as there are numbers in the range, so you might as well use the numbers themselves as the headers, at which point you're not actually doing anything, since the header is the data.
You can't do a bitmap in general b/c random integers are really, really big.
If there is a compression algorithm that works on 10% of data sets and does massive harm on the other 90%, you can use it or not use it, at the cost of a small amount of header information and a bunch of CPU time, and it doesn't matter that in general, on average, it's quite bad. All that matters is whether the times it's good beat the header info cost. I think. I'm not sure this is helpful, but it doesn't require the compression algorithm to be a net gain on average.
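Spelled out with made-up numbers (mine, not the parent's): if the compressor applies to a fraction p of inputs, saves s bits when it does, and the header is a single always-present bit, then

    % expected net saving with a 1-bit "was the compressor used?" header,
    % where p = fraction of inputs it helps on, s = bits saved when it does
    \mathbb{E}[\text{net}] = p \cdot s - 1 > 0 \iff s > \tfrac{1}{p}

so with p = 0.1 it has to save more than 10 bits on the inputs it does handle just to pay for its header bit.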
I would have thought it (in general) impossible to store more than 2^15/log2(10000)~2500 integers between 1 and 10000 in 4kB, but I look forward to hearing more.
Yeah, I was thinking about that, too. Something theoretically screams 'No' to me. You have 2^15 bits of storage available, and each number takes log2(10,000) bits, so I can't see storing more than the one divided by the other, if each number is equally likely and random.
But! Since we have flexibility in the order in which the integers are stored in the array, could we store a couple "meta integers"? e.g. Could we put our 2,500 integers in the array in such a way that the bits in log2(10,000) locations spell out a 2,501st integer between 1 and 10,000?
I feel like it can't always work: maybe our 2500 integers are just such that there's no way to put them in to spell out the meta integer. OTOH, that's just one scheme I thought of, and perhaps there is a way I'm not thinking of that we can leverage the order in which we put the integers in the array. There might be some free "information" that way that is not being counted by the simple number of bits of storage.
Another thought: Maybe the default ordering of the integers is ascending, and we can break that known ordering in certain places to convey information about the 2501 integer. But then -- what if all the integers given are the same? Sigh.
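If it helps to make the 'free information in the ordering' idea concrete, here's my own back-of-the-envelope (not from the thread): storing an ordered list of k = 2,500 values from 1..10,000 needs k * log2(10,000) ≈ 33,000 bits, but if you only care about the multiset (which values, and how many of each), stars-and-bars counting says you only need log2(C(10,000 + k - 1, k)) ≈ 9,000 bits. That gap is, roughly, the information you're paying for by recording the order, and it's why schemes like the 3-bit counters upthread can hold far more than 2,500 numbers.

    /* Rough comparison of the two bit counts above, using lgamma for the
     * log of the binomial coefficient (assumes C99 <math.h>, link with -lm). */
    #include <math.h>
    #include <stdio.h>

    /* log2 of C(a, b) */
    static double log2_choose(double a, double b) {
        return (lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)) / log(2);
    }

    int main(void) {
        double n = 10000, k = 2500;
        printf("ordered list : %.0f bits\n", k * log2(n));               /* ~33,000 */
        printf("multiset only: %.0f bits\n", log2_choose(n + k - 1, k)); /* ~9,000  */
        return 0;
    }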