Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Python implementation:

  def streaming_algorithm(A, epsilon, delta):
      # Initialize parameters
      p = 1
      X = set()
      thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))

      # Process the stream
      for ai in A:
          if ai in X:
              X.remove(ai)
          if random.random() < p:
              X.add(ai)
          if len(X) == thresh:
              X = {x for x in X if random.random() >= 0.5}
              p /= 2
              if len(X) == thresh:
                  return '⊥'

      return len(X) / p


  # Example usage
  A = [1, 2, 3, 1, 2, 3]
  epsilon = 0.1
  delta = 0.01

  output = streaming_algorithm(A, epsilon, delta)
  print(output)


I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.


The names are literally taken from the paper.


well the paper also contains the code so I doubt anyone who looked at the paper cares about this paste - for folks who did not read the paper this is not very readable


> Name stuff well

OP is following the same variable names of the article. I prefer that over changing the variable names and then figuring out what variable name maps in code to the article.


Speaking of, one of my favorite discoveries with Unicode is that there is a ton of code points acceptable for symbol identifiers in various languages that I just can't wait to abuse.

>>> ᚨ=3

>>> ᛒ=6

>>> ᚨ+ᛒ

9


You're even using a and b, very good.


Ah yes, programming like the vikings intended.


That's not streaming if you're already aware of the length of the iterable.


In python, you can simply substitute `A` with an iterable or generator object, which can be a of unknown length.


But for this algorithm, you need to know the total length ("m") to set the threshold for the register purges.

Does it still work if you update m as you go?


Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878

Best case error is of course zero, but if you look at my output then you will see as I did that the worst case is a very conservative bound (i.e. 15X bigger than what might "tend to happen". That matters a lot for "space usage" since the error =~ 1/sqrt(space) implying you need a lot more space for lower errors. 15^2 = 225X more space. Space optimization is usually well attended for this kind of problem. And, hey, maybe you know something about the input data distribution?

So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or even better "measuring as you go" enough distributional meta data to get a tighter error bound. That latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.


You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.


You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing. (or i guess at whatever checkpoints you want if it's continuous)

In this example, they have the length of the list and choose the threshold to give them a desired margin of error.




  return '⊥'
what's this?


An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.

I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:

    from random import random

    def estimate_uniques(iterable, window_size=100):
        p = 1
        seen = set()

        for i in iterable:
            if i not in seen:
                seen.add(i)
            if random() > p:
                seen.remove(i)
            if len(seen) >= window_size:
                seen = {s for s in seen if random() < 0.5}
                p /= 2
        return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.


In some symbolic logic classes, that character "bottom" represents "false" ad flipped "top" means true.

Don't know what they're getting at in the code, though.


>In some symbolic logic classes, that character "bottom" represents "false"

That's unfortunate, because in the study of computer programming languages, it means "undefined" (raise an error).


Not always. It is also the uninhabited bottom type.


My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters, so it would be regrettable if logic teachers actually did use ⊥ to mean false because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.

I have no idea what your point is.


Once again proving the need for comments in code. Especially for comments that are more useful than "initialize parameters"


An easy way to identify who copies code without understanding it.


You can just replace it with something like: print ('Invalid thresh or something')


This however looks scary so an innocent copy/paste programmer wouldn't touch it.


New leetcode hard question just dropped.


Is this ChatGPT? Also, I feel like this would be more useful if it included import statements.


  import math
  import random




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: