Well obviously. That idea is not (IMO) worthy of a news post. But it doesn't say that the integers are distinct. I'd be interested to hear whether it's possible to store the numbers in general any better than the above.
If the integers cannot be guaranteed to be distinct, and all of them must be reproduced at the end, then no algorithm that restricts itself to O(1) memory can be correct (the number of possible multisets grows without bound, while constant memory has only a constant number of states). Since the problem restricts you to 4KB of RAM (O(1)), and there is presumably a solution, you can conclude that you don't need to count duplicate inputs - but yeah, it's kind of a bad problem description.
I'm not sure you can change the question just because there is no answer :P
4KB of RAM is not the same as an O(1) memory requirement, since we have already restricted the number and range of integers to 10000 (i.e. constants), so there is nothing left to vary. The page size just sets a target compression ratio, not an asymptotic restriction on memory usage.
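For concreteness, a rough sketch of the straightforward approach (Python, and assuming the inputs are distinct values in 0..9999, which is what the constants above imply): a 10000-bit bitmap is only 1250 bytes, well inside 4KB.

    def encode_bitmap(numbers, universe=10000):
        """One bit per present value; 10000 bits -> 1250 bytes."""
        bits = bytearray((universe + 7) // 8)
        for n in numbers:
            bits[n // 8] |= 1 << (n % 8)
        return bytes(bits)

    def decode_bitmap(bits):
        """Recover the values (in sorted order) from the bitmap."""
        return [i for i in range(len(bits) * 8) if bits[i // 8] & (1 << (i % 8))]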
In general, do that, then run some compression on it. That will do better if the data isn't random.
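To sketch that (zlib here is just a stand-in for "some compression"): compressing the 1250-byte bitmap helps a lot when the set has structure, and barely at all when it's uniformly random.

    import zlib

    # Build the 1250-byte bitmap for a very structured set: every third value in 0..9999.
    bits = bytearray(1250)
    for n in range(0, 10000, 3):
        bits[n // 8] |= 1 << (n % 8)

    packed = zlib.compress(bytes(bits), level=9)
    print(len(bits), len(packed))  # the structured bitmap shrinks a lot; a random one barely would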
Or if you're expecting random data, invent a biased compression algorithm and use that (e.g. one that has shortcuts for storing any multiples of 3). Most of the time it will do nothing and a few times you'll randomly get data it's good at.
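A toy version of that, purely illustrative: if the whole set happens to be multiples of 3, store n//3 in a bitmap a third of the size; otherwise store the ordinary bitmap. The flag byte recording which branch was taken is exactly the header cost discussed in the next comments.

    def encode_biased(numbers, universe=10000):
        nums = list(numbers)
        if nums and all(n % 3 == 0 for n in nums):
            # Shortcut: store n // 3, so the bitmap needs only ~universe/3 bits (~417 bytes).
            reduced = (universe + 2) // 3
            bits = bytearray((reduced + 7) // 8)
            for n in nums:
                bits[(n // 3) // 8] |= 1 << ((n // 3) % 8)
            return b"\x01" + bytes(bits)
        # General case: the full 1250-byte bitmap.
        bits = bytearray((universe + 7) // 8)
        for n in nums:
            bits[n // 8] |= 1 << (n % 8)
        return b"\x00" + bytes(bits)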
Actually, on second thought, I'm not sure whether that works. You'd need a header to identify whether the compression was used, so the compression has to save more on average than the header costs.
This header problem becomes clearer if you try to chain thousands of special-case compression algorithms, each of which maps a single input to one bit and is otherwise unused. That seems to save space at first (at the cost of CPU time, plus the space to store the algorithms themselves), but identifying which one was actually used is the problem: since they all need unique headers, you need as many header possibilities as there are numbers in the range, so you might as well just use the numbers as the headers, at which point you're not doing anything at all, since the header is the data.
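Roughly, the arithmetic for the 10000-value range discussed above:

    import math

    codecs = 10000                              # one special-case codec per possible value
    header_bits = math.ceil(math.log2(codecs))  # bits needed just to name which codec fired
    value_bits = math.ceil(math.log2(10000))    # bits needed to store the value directly
    print(header_bits, value_bits)              # 14 and 14: the header is as big as the data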
You can't do a bitmap in general because arbitrary integers can come from an enormously large range, so the bitmap itself would be huge.
If there is a compression algorithm that works well on 10% of data sets and does massive harm on the other 90%, you can choose whether or not to use it, at the cost of a small amount of header information and a bunch of CPU time, and it doesn't matter that on average it's quite bad. All that matters is whether the savings in the good cases beat the header cost. I think. I'm not sure this is helpful, but it doesn't require the compression algorithm to be a net gain on average.
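That's basically the "compress, but keep an escape hatch" pattern; a sketch, with zlib standing in for the algorithm that's only good some of the time:

    import zlib

    def pack(data):
        """Use the compressor only when it actually wins; one header byte records the choice."""
        compressed = zlib.compress(data, level=9)
        return b"\x01" + compressed if len(compressed) < len(data) else b"\x00" + data

    def unpack(blob):
        return zlib.decompress(blob[1:]) if blob[0] == 1 else blob[1:]

Worst case you lose exactly one byte to the header; best case you keep whatever the compressor saved.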