toomanybits · on Dec 1, 2021

Now that Kafka has had ZK taken out and tiered storage is coming in, these factors are becoming moot points. With tiered storage, Kafka has two layers and hence the benefits of broker/storage decoupling. By comparison, Pulsar has broker, BookKeeper, S3, and ZK layers to contend with. This is why a second layer was never added to Kafka directly.

toomanybits · on March 30, 2021

I think that's one of the main points. Now you can run it as a single process more like a traditional broker (although it's obviously still a log).

toomanybits · on July 10, 2020

You don't need a CS degree or Martin Kleppmann's book to work out it's a GPITA.

papaf · on July 10, 2020

I am also not a fan of Zookeeper but I have come to respect it. In defense of Zookeeper, distributed systems are a GPITA. If you need to select some components, why not go for ones that are solid [1]?

[1] https://aphyr.com/posts/291-jepsen-zookeeper

toomanybits · on July 10, 2020

I'm lost too. Kafka auto-creates topics by default. Maybe you're referring to being able to create more topics? But that seems to be unproven. Kafka's limit is metadata, and Pulsar is more metadata-dependent than Kafka.

toomanybits · on July 10, 2020

> Pulsar will happily do with just 2x.

This is just wrong. Pulsar provides weaker guarantees than Kafka. It's a quorum-based system. If you run with two replicas, Pulsar can't provide F-1 guarantees, which Kafka can.

toomanybits · on July 10, 2020

Nope. It sucks. RIP.

toomanybits · on July 10, 2020

Show me one

toomanybits · on July 10, 2020

1. is true, but if you want that data to move to a new node, it still needs to be replicated. Kafka's approach is to use tiered storage (which I believe is close to completion).

2. Kafka can read from a replica node. It's relatively new but it's there.

majidazimi · on July 10, 2020

That's true, but the limitation is still not fully resolved. In order to increase the consumption rate, we need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper. Adding more brokers is trivial in Pulsar.

kevstev · on July 10, 2020

How does Pulsar get around the fact that when you add a new broker, data needs to be moved over before that broker can start serving data? This seems like a basic law-of-physics type of limitation to me.

addisonj · on July 10, 2020

Hey, I work on Pulsar, will try and answer this :)

Topics (actually bundles of topics, called bundles) are what is assigned to Brokers. Topic assignment is dynamic, so when a new broker is added, the system will try and shed load from the busiest brokers to even it out on the system.

But unlike Kafka, when a topic is assigned to a broker, it doesn't have much state to move. Mostly it just gets metadata added to it and opens a new "ledger" (which is just a chunk of the topic's data over a time window; only one ledger is ever open at once). When it needs to serve data, it pulls that from BookKeeper nodes from previous ledgers, so the process of redistributing load is pretty quick. It also doesn't eagerly pull in a cache.
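To make the "shed load without moving data" idea concrete, here is a toy sketch in Python. It is not Pulsar's actual load manager: the `Broker` class, the `shed_load` function, and the load numbers are all made up for illustration, and the only assumption carried over from the description above is that reassigning a bundle changes ownership metadata rather than copying messages.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical broker: it owns bundles but stores no message data itself."""
    name: str
    bundles: dict = field(default_factory=dict)  # bundle name -> load (arbitrary units)

    @property
    def load(self):
        return sum(self.bundles.values())


def shed_load(brokers):
    """Greedy rebalancing: move bundles from the busiest broker to the least busy one.

    A move is only a change of ownership. Nothing is copied, because in this
    model the data lives in a separate storage layer.
    """
    moves = []
    while True:
        busiest = max(brokers, key=lambda b: b.load)
        idlest = min(brokers, key=lambda b: b.load)
        gap = busiest.load - idlest.load
        # Only consider bundles whose move strictly narrows the gap.
        candidates = [(load, name) for name, load in busiest.bundles.items()
                      if 0 < load < gap]
        if not candidates:
            return moves
        load, name = min(candidates)  # move the smallest qualifying bundle
        del busiest.bundles[name]
        idlest.bundles[name] = load
        moves.append((name, busiest.name, idlest.name))


if __name__ == "__main__":
    a = Broker("a", {"b1": 5, "b2": 3, "b3": 2})
    b = Broker("b", {"b4": 4})
    c = Broker("c")  # the newly added, empty broker
    for bundle, src, dst in shed_load([a, b, c]):
        print(f"reassign {bundle}: {src} -> {dst}")
```

A real balancer would weigh throughput, memory, and connection counts rather than a single number, but the point stands: the cost of a move is a metadata update, not a data transfer.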

Now, as far as the cache goes, that is primarily for "tailing reads", meaning that as writes occur, clients who are close to the tip of the recent data will just get it from the broker, without needing to pull it from BookKeeper. This is one of the key parts of how Pulsar has multiple tiers of storage that help it have such good, consistent latency.
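A minimal sketch of that tailing-read behaviour, again in Python and again purely illustrative: `ToyBroker`, its `cache_size` parameter, and the plain list standing in for the durable storage layer are assumptions for the example, not Pulsar or BookKeeper APIs.

```python
from collections import OrderedDict


class ToyBroker:
    """Hypothetical broker with a small cache of the most recent writes."""

    def __init__(self, storage, cache_size=3):
        self.storage = storage        # stands in for the durable storage layer
        self.cache = OrderedDict()    # offset -> message, recent writes only
        self.cache_size = cache_size
        self.storage_reads = 0        # counts reads that had to go to storage

    def write(self, message):
        offset = len(self.storage)
        self.storage.append(message)  # durable write
        self.cache[offset] = message  # keep the tip of the topic in memory
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the oldest cached entry
        return offset

    def read(self, offset):
        if offset in self.cache:      # tailing read: served from the broker
            return self.cache[offset]
        self.storage_reads += 1       # catch-up read: fetched from storage
        return self.storage[offset]


if __name__ == "__main__":
    broker = ToyBroker(storage=[], cache_size=2)
    for i in range(5):
        broker.write(f"m{i}")
    broker.read(4)  # near the tip, no storage access
    broker.read(0)  # old data, goes to storage
    print("storage reads:", broker.storage_reads)
```

Consumers that keep up with producers never touch the storage layer in this model; only readers that have fallen behind do, and their reads are not pulled into the cache.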

Beyond processing writes, the biggest thing brokers do is handle "tailing reads", i.e., clients consuming right near the tip of the topic; this is the cache referred to. That means that when a new broker is three purposes:

1. Handling writes

miguno · on July 13, 2020

(copying this text from another comment of mine elsewhere)

Well, the Pulsar broker is (kinda) stateless, because it is essentially a caching layer in front of BookKeeper. But where's your data actually stored then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a BookKeeper node requires the same redistribution of data as required in Kafka's case. (Additionally, BookKeeper needs a separate data recovery daemon to be run and operated, https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)

So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, like if you'd compare memcached (Pulsar broker) vs. Postgres (Kafka broker).

majidazimi · on July 10, 2020

Network is faster than disk. Once cached, then you are only bound by network IO for subsequent uses.

kevstev · on July 10, 2020

Sure, but how is this different from Kafka's caching?

toomanybits · on July 10, 2020

This smacks of being heavily one-product-focussed to me. Being a Kafka user, it's hard enough managing and understanding one system, never mind three or four joined together.

Maybe it's a bit faster or a bit more elastic, or whatever, who knows. What I really care about is whether I get called at 3am and in that regard the argument seems pretty weak. Kafka for all its woes is a solid system you know you can count on.

I'd much rather see someone come up with a truly innovative alternative that actually pushes the boundaries, rather than just copying what's there already, and adding a few window dressings.

Jedd · on July 10, 2020

What kind of (lower level) surrogate metrics would you be interested in that could translate to '3am phone calls' when comparing messaging systems?

toomanybits · on July 10, 2020

Being used by at least one company of significant size that (a) I've heard of and (b) isn't directly connected to the project would be a good start.

Jedd · on July 12, 2020

'Directly connected' to a project might mean a user of - but I assume you mean a major contributor to (as even small time users of free software often contribute something - bug reports, feature requests, code contributions, money, etc).

The page: https://pulsar.apache.org/powered-by/ suggests there's quite some number of corporate users who are happy to confirm they use this suite. I don't know how many of those you've heard of, though.

I suspect many private & government agencies around the world would decline to formally attach their name to any list like this, lest it be (mis)interpreted as an endorsement.

toomanybits · on July 13, 2020

I've heard of Comcast, but that's Yahoo. Not heard of the others.

toomanybits · on Dec 17, 2018

Are you sure? He seems to be using the past tense: "written so that software products like Landoop were free to embed our open source"