import math
import random

def streaming_algorithm(A, epsilon, delta):
    # Variable names follow the paper: p is the current sampling
    # probability, X is the sample set, thresh is its size cap.
    p = 1
    X = set()
    thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))
    # Process the stream
    for ai in A:
        # Refresh ai's membership so it ends up in X with probability p
        if ai in X:
            X.remove(ai)
        if random.random() < p:
            X.add(ai)
        if len(X) == thresh:
            # Subsample: keep each element with probability 1/2, halve p
            X = {x for x in X if random.random() >= 0.5}
            p /= 2
            if len(X) == thresh:
                return '⊥'  # failure: subsampling removed nothing
    return len(X) / p

# Example usage
A = [1, 2, 3, 1, 2, 3]
epsilon = 0.1
delta = 0.01
output = streaming_algorithm(A, epsilon, delta)
print(output)
I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.
Well, the paper also contains the code, so I doubt anyone who looked at the paper cares about this paste. For folks who did not read the paper, though, it is not very readable.
OP is following the same variable names as the article. I prefer that over changing the variable names and then having to figure out which variable in the code maps to which one in the article.
Speaking of, one of my favorite discoveries with Unicode is that there are a ton of code points acceptable in identifiers in various languages that I just can't wait to abuse.
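For instance (a quick sketch; Python 3 is one of the languages that accepts many Unicode letters in identifiers, though not arbitrary symbols):

# Greek letters are valid Python identifiers, so the paper's
# notation can be kept verbatim:
ε = 0.1
δ = 0.01
print(ε, δ)
# ⊥, however, is a math symbol rather than a letter, so `⊥ = None`
# would be a SyntaxError.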
Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878
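A minimal sketch of that back-solve, assuming the threshold formula from the paste above, thresh = ceil((12 / eps**2) * log(8 * m / delta)); epsilon_for is my name for it, not the paper's:

import math

def epsilon_for(thresh, m, delta):
    # Invert thresh = (12 / eps**2) * log(8 * m / delta) for eps
    # (ignoring the ceil): the worst-case relative error at
    # confidence 1 - delta.
    return math.sqrt(12 * math.log(8 * m / delta) / thresh)

# e.g. a fixed budget of 10,000 samples over a billion-item stream:
print(epsilon_for(10_000, 10**9, 0.01))  # ≈ 0.18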
Best case error is of course zero, but if you look at my output you will see, as I did, that the worst case is a very conservative bound (i.e., 15x bigger than what might "tend to happen"). That matters a lot for space usage, since error ≈ 1/sqrt(space) implies you need a lot more space for lower errors: 15^2 = 225x more space. Space optimization is usually well attended to for this kind of problem. And, hey, maybe you know something about the input data distribution?
So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or, even better, "measuring as you go" enough distributional metadata to get a tighter error bound. The latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign Your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.
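To make the 15x/225x arithmetic concrete, here is a quick check of the error ≈ 1/sqrt(space) scaling, again assuming the inverted threshold formula from the paste:

import math

log_term = 12 * math.log(8 * 10**9 / 0.01)  # stream of 1e9 items, delta = 0.01
for space in (10_000, 225 * 10_000):
    print(space, math.sqrt(log_term / space))
# 10000    ≈ 0.181
# 2250000  ≈ 0.0121  -> 15x less error costs 15**2 = 225x the space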
You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.
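A sketch of that sizing calculation; threshold_for is a hypothetical helper built from the formula in the paste, and the printed values show the log(m) growth:

import math

def threshold_for(epsilon, delta, m):
    # Sample-set size needed for a target epsilon/delta, with m a
    # generous upper bound on the stream length.
    return math.ceil((12 / epsilon ** 2) * math.log(8 * m / delta))

# Over-provisioning m by a factor of a million costs < 2x the space:
for m in (10**6, 10**9, 10**12):
    print(m, threshold_for(0.05, 0.01, m))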
You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing (or, I guess, at whatever checkpoints you want if it's continuous).
In this example, they have the length of the list and choose the threshold to give them a desired margin of error.
An error condition. I decided to do away with it and take a small hit on the error: the chances of the trimmed set still sitting at the threshold are very small, so the error condition was effectively doing nothing.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
from random import random

def estimate_uniques(iterable, window_size=100):
    p = 1
    seen = set()
    for i in iterable:
        # Inverted logic: make sure i is present, then evict it with
        # probability 1 - p, so each occurrence leaves i in the sample
        # with probability exactly p.
        if i not in seen:
            seen.add(i)
        if random() > p:
            seen.remove(i)
        if len(seen) >= window_size:
            # Subsample: keep each element with probability 1/2, halve p
            seen = {s for s in seen if random() < 0.5}
            p /= 2
    return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
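A quick usage example (my own toy data; the estimate is randomized, so it only tends to land near the exact count):

from random import seed, randrange

seed(42)  # reproducible demo
stream = [randrange(50_000) for _ in range(1_000_000)]
print(len(set(stream)))                # exact distinct count, for reference
print(estimate_uniques(stream, 1000))  # randomized estimate, should be close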
My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters. So it would be regrettable if logic teachers actually did use ⊥ to mean false, because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.
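For illustration, a small wrapper that treats the '⊥' sentinel from the paste above as an exception instead (the names here are mine, not the paper's; it assumes streaming_algorithm, A, epsilon, and delta as defined earlier):

class EstimatorFailure(Exception):
    """Raised instead of returning the '⊥' sentinel."""

def streaming_algorithm_strict(A, epsilon, delta):
    result = streaming_algorithm(A, epsilon, delta)
    if result == '⊥':
        raise EstimatorFailure("subsampling failed to shrink the sample set")
    return result

try:
    print(streaming_algorithm_strict(A, epsilon, delta))
except EstimatorFailure:
    print("estimator failed; retry or increase the threshold")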