import math
import random

def streaming_algorithm(A, epsilon, delta):
    # Variable names follow the paper: p is the current sampling
    # probability, X is the sample set, thresh is its size cap.
    p = 1
    X = set()
    thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))
    # Process the stream
    for ai in A:
        # Refresh ai's membership so it ends up in X with probability p
        if ai in X:
            X.remove(ai)
        if random.random() < p:
            X.add(ai)
        if len(X) == thresh:
            # Subsample: keep each element with probability 1/2, halve p
            X = {x for x in X if random.random() >= 0.5}
            p /= 2
            if len(X) == thresh:
                return '⊥'  # failure: subsampling removed nothing
    return len(X) / p

# Example usage
A = [1, 2, 3, 1, 2, 3]
epsilon = 0.1
delta = 0.01
output = streaming_algorithm(A, epsilon, delta)
print(output)
I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.
Well, the paper also contains the code, so I doubt anyone who looked at the paper cares about this paste. For folks who did not read the paper, though, it is not very readable.
OP is following the same variable names as the article. I prefer that over changing the variable names and then having to figure out which variable in the code maps to which one in the article.
Speaking of, one of my favorite discoveries with Unicode is that there are a ton of code points acceptable in identifiers in various languages that I just can't wait to abuse.
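For instance (a quick sketch; Python 3 is one of the languages that accepts many Unicode letters in identifiers, though not arbitrary symbols):

# Greek letters are valid Python identifiers, so the paper's
# notation can be kept verbatim:
ε = 0.1
δ = 0.01
print(ε, δ)
# ⊥, however, is a math symbol rather than a letter, so `⊥ = None`
# would be a SyntaxError.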
Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878
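A minimal sketch of that back-solve, assuming the threshold formula from the paste above, thresh = ceil((12 / eps**2) * log(8 * m / delta)); epsilon_for is my name for it, not the paper's:

import math

def epsilon_for(thresh, m, delta):
    # Invert thresh = (12 / eps**2) * log(8 * m / delta) for eps
    # (ignoring the ceil): the worst-case relative error at
    # confidence 1 - delta.
    return math.sqrt(12 * math.log(8 * m / delta) / thresh)

# e.g. a fixed budget of 10,000 samples over a billion-item stream:
print(epsilon_for(10_000, 10**9, 0.01))  # ≈ 0.18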
Best case error is of course zero, but if you look at my output you will see, as I did, that the worst case is a very conservative bound (i.e., 15x bigger than what might "tend to happen"). That matters a lot for space usage, since error ≈ 1/sqrt(space) implies you need a lot more space for lower errors: 15^2 = 225x more space. Space optimization is usually well attended to for this kind of problem. And, hey, maybe you know something about the input data distribution?
So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or, even better, "measuring as you go" enough distributional metadata to get a tighter error bound. The latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign Your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.
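To make the 15x/225x arithmetic concrete, here is a quick check of the error ≈ 1/sqrt(space) scaling, again assuming the inverted threshold formula from the paste:

import math

log_term = 12 * math.log(8 * 10**9 / 0.01)  # stream of 1e9 items, delta = 0.01
for space in (10_000, 225 * 10_000):
    print(space, math.sqrt(log_term / space))
# 10000    ≈ 0.181
# 2250000  ≈ 0.0121  -> 15x less error costs 15**2 = 225x the space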
You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.
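A sketch of that sizing calculation; threshold_for is a hypothetical helper built from the formula in the paste, and the printed values show the log(m) growth:

import math

def threshold_for(epsilon, delta, m):
    # Sample-set size needed for a target epsilon/delta, with m a
    # generous upper bound on the stream length.
    return math.ceil((12 / epsilon ** 2) * math.log(8 * m / delta))

# Over-provisioning m by a factor of a million costs < 2x the space:
for m in (10**6, 10**9, 10**12):
    print(m, threshold_for(0.05, 0.01, m))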
You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing (or, I guess, at whatever checkpoints you want if it's continuous).
In this example, they have the length of the list and choose the threshold to give them a desired margin of error.
An error condition. I decided to do away with it and take a small hit on the error: the chances of the trimmed set still sitting at the threshold are very small, so the error condition was effectively doing nothing.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
from random import random

def estimate_uniques(iterable, window_size=100):
    p = 1
    seen = set()
    for i in iterable:
        # Inverted logic: make sure i is present, then evict it with
        # probability 1 - p, so each occurrence leaves i in the sample
        # with probability exactly p.
        if i not in seen:
            seen.add(i)
        if random() > p:
            seen.remove(i)
        if len(seen) >= window_size:
            # Subsample: keep each element with probability 1/2, halve p
            seen = {s for s in seen if random() < 0.5}
            p /= 2
    return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
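A quick usage example (my own toy data; the estimate is randomized, so it only tends to land near the exact count):

from random import seed, randrange

seed(42)  # reproducible demo
stream = [randrange(50_000) for _ in range(1_000_000)]
print(len(set(stream)))                # exact distinct count, for reference
print(estimate_uniques(stream, 1000))  # randomized estimate, should be close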
My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters. So it would be regrettable if logic teachers actually did use ⊥ to mean false, because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.
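For illustration, a small wrapper that treats the '⊥' sentinel from the paste above as an exception instead (the names here are mine, not the paper's; it assumes streaming_algorithm, A, epsilon, and delta as defined earlier):

class EstimatorFailure(Exception):
    """Raised instead of returning the '⊥' sentinel."""

def streaming_algorithm_strict(A, epsilon, delta):
    result = streaming_algorithm(A, epsilon, delta)
    if result == '⊥':
        raise EstimatorFailure("subsampling failed to shrink the sample set")
    return result

try:
    print(streaming_algorithm_strict(A, epsilon, delta))
except EstimatorFailure:
    print("estimator failed; retry or increase the threshold")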