Now that Kafka has had ZK taken out and tiered storage is coming in, these factors are becoming moot points. With tiered storage, Kafka has two layers and hence the benefits of broker/storage decoupling.
By comparison, Pulsar has broker, BookKeeper, S3, and ZK layers to contend with. This is why a second layer was never added to Kafka directly.
I am also not a fan of ZooKeeper, but I have come to respect it. In defense of ZooKeeper, distributed systems are a GPITA. If you need to select some components, why not go for ones that are solid [1]?
I'm lost too. Kafka auto-creates topics by default. Maybe you're referring to being able to create more topics? But that seems to be unproven. Kafka's limit is metadata, and Pulsar is more metadata-dependent than Kafka.
> Pulsar will happily do with just 2x.
This is just wrong. Pulsar provides weaker guarantees than Kafka. It's a quorum based system. If you run with two replicas Pulsar can't provide F-1 guarantees which Kafka can.
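To make the arithmetic behind that claim concrete, here is a rough sketch. It assumes the simplified model the comment implies: a strict majority quorum on one side and a Kafka-style in-sync replica set on the other. The function names are made up for illustration, and this is not a description of how BookKeeper's ensemble, write quorum, and ack quorum settings actually behave.

```python
def majority_quorum_tolerance(replicas: int) -> int:
    """Failures survivable when every write needs a strict majority of replicas.

    Assumption: a plain majority-quorum model, used only to illustrate the argument.
    """
    return (replicas - 1) // 2


def in_sync_replica_tolerance(replicas: int, min_insync: int = 1) -> int:
    """Failures survivable when writes only need `min_insync` in-sync replicas.

    Assumption: models a Kafka-style ISR where the set of in-sync replicas may shrink.
    """
    return replicas - min_insync


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, majority_quorum_tolerance(n), in_sync_replica_tolerance(n))
```

Under this model, two replicas survive no failures with a majority quorum but one failure with an in-sync replica set, which is the gap the comment is pointing at.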
1. is true, but if you want that data to move to a new node, it still needs to be replicated. Kafka's approach is to use tiered storage (which I believe is close to completion).
2. Kafka can read from a replica node. It's relatively new but it's there.
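For point 2, follower fetching (KIP-392, Kafka 2.4+) is switched on through configuration rather than code. A minimal sketch of the relevant keys, written as Python dicts purely for illustration; the rack label "us-east-1a" is a made-up example value, and `to_properties` is a hypothetical helper, not part of any Kafka client.

```python
# Broker-side settings (normally in server.properties).
broker_config = {
    "broker.rack": "us-east-1a",  # example rack/zone label (assumption)
    "replica.selector.class": "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer-side setting: tells the broker which rack the client is in,
# so a replica in the same rack can be chosen to serve fetches.
consumer_config = {
    "client.rack": "us-east-1a",  # should match the rack of a nearby replica
}


def to_properties(config: dict[str, str]) -> str:
    """Render a config dict in Java .properties form (hypothetical helper)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


if __name__ == "__main__":
    print(to_properties(broker_config))
    print(to_properties(consumer_config))
```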
That's true, but the limitation is still not fully resolved. In order to increase the consumption rate, we need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper. Adding more brokers is trivial in Pulsar.
How does Pulsar get around the fact that when you add a new broker, data needs to be moved over before that broker can start serving it? This seems like a basic law-of-physics type of limitation to me.
Hey, I work on Pulsar, will try and answer this :)
Topics (actually groups of topics, called bundles) are what is assigned to brokers. Topic assignment is dynamic, so when a new broker is added, the system will try to shed load from the busiest brokers to even it out across the system.
But unlike Kafka, when a topic is assigned to a broker, there isn't much state to move. Mostly it just gets metadata added to it and opens a new "ledger" (which is just a chunk of the topic's data over a time window; only one ledger is ever open at once). When it needs to serve data, it pulls that from BookKeeper nodes from previous ledgers, so the process of redistributing load is pretty quick. It also doesn't eagerly pull data into a cache.
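To illustrate why reassignment is cheap, here is a toy sketch of load shedding. It is not Pulsar's actual load manager: `Broker`, `shed_load`, the load numbers, and the greedy strategy are all assumptions for illustration. The point is only that a "move" changes bundle ownership (metadata) and copies no topic data.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    name: str
    bundles: dict[str, float] = field(default_factory=dict)  # bundle name -> load

    @property
    def load(self) -> float:
        return sum(self.bundles.values())


def shed_load(brokers: list[Broker], tolerance: float = 0.1) -> list[tuple[str, str, str]]:
    """Greedily reassign bundles from the busiest broker to the idlest one.

    Returns the moves as (bundle, from_broker, to_broker). Only ownership
    changes here; the data itself is assumed to stay in the storage layer.
    """
    moves = []
    while True:
        busiest = max(brokers, key=lambda b: b.load)
        idlest = min(brokers, key=lambda b: b.load)
        gap = busiest.load - idlest.load
        # Only consider bundles whose move strictly narrows the gap.
        candidates = [(n, l) for n, l in busiest.bundles.items() if 0 < l < gap]
        if gap <= tolerance or not candidates:
            return moves
        # Prefer the bundle whose load is closest to half the gap.
        name, load = min(candidates, key=lambda c: abs(gap / 2 - c[1]))
        del busiest.bundles[name]
        idlest.bundles[name] = load
        moves.append((name, busiest.name, idlest.name))


if __name__ == "__main__":
    a = Broker("broker-a", {f"a{i}": 1.0 for i in range(6)})
    b = Broker("broker-b", {f"b{i}": 1.0 for i in range(6)})
    c = Broker("broker-c")  # newly added, owns nothing yet
    for move in shed_load([a, b, c]):
        print(move)
    print([(x.name, x.load) for x in (a, b, c)])
```

In a broker-holds-the-log design, each of those moves would imply copying the partition's data to the new node first, which is the difference being described.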
Now, as far as the cache goes, that is primarily for "tailing reads", meaning that as writes occur, clients who are close to the tip of the recent data will just get it from the broker, without needing to pull it from BookKeeper. This is one of the key parts of how Pulsar has multiple tiers of storage that help it have such good, consistent latency.
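A minimal sketch of that tailing-read idea, assuming a plain dict as a stand-in for BookKeeper and a bounded in-memory window on the broker. `TailCache` and its fields are hypothetical names, not Pulsar classes.

```python
from collections import OrderedDict


class TailCache:
    """Keeps the most recent entries in memory; older reads go to the store."""

    def __init__(self, store: dict, capacity: int = 1000):
        self.store = store            # stand-in for BookKeeper (assumption)
        self.capacity = capacity      # how many recent entries the broker keeps
        self.recent = OrderedDict()   # entry_id -> payload, oldest first
        self.hits = 0
        self.misses = 0

    def append(self, entry_id: int, payload: bytes) -> None:
        self.store[entry_id] = payload    # durable write to the storage layer
        self.recent[entry_id] = payload   # keep the tip available in memory
        while len(self.recent) > self.capacity:
            self.recent.popitem(last=False)  # evict the oldest cached entry

    def read(self, entry_id: int) -> bytes:
        if entry_id in self.recent:       # tailing read: served by the broker
            self.hits += 1
            return self.recent[entry_id]
        self.misses += 1                  # catch-up read: fetched from storage
        return self.store[entry_id]


if __name__ == "__main__":
    cache = TailCache(store={}, capacity=3)
    for i in range(10):
        cache.append(i, f"msg-{i}".encode())
    print(cache.read(9), cache.read(0), cache.hits, cache.misses)
```

Consumers that keep up with the producers only ever touch the in-memory window, while consumers replaying old data fall through to the storage layer.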
Beyond processing writes, the biggest thing brokers do is handle "tailing reads", i.e., clients consuming right near the tip of the topic; this is the cache referred to. That means that when a new broker is ...
(copying this text from another comment of mine elsewhere)
Well, the Pulsar broker is (kinda) stateless, because brokers are essentially a caching layer in front of BookKeeper. But where's your data actually stored then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a BookKeeper node requires the same redistribution of data as required in Kafka’s case. (Additionally, BookKeeper needs a separate data recovery daemon to be run and operated, https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)
So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, like if you'd compare memcached (Pulsar broker) vs. Postgres (Kafka broker).
This smacks of being heavily one-product-focussed to me. As a Kafka user, I find it hard enough managing and understanding one system, never mind three or four joined together.
Maybe it's a bit faster or a bit more elastic, or whatever, who knows. What I really care about is whether I get called at 3am and in that regard the argument seems pretty weak. Kafka for all its woes is a solid system you know you can count on.
I'd much rather see someone come up with a truly innovative alternative that actually pushes the boundaries, rather than just copying what's there already and adding some window dressing.
'Directly connected' to a project might mean a user of - but I assume you mean a major contributor to (as even small time users of free software often contribute something - bug reports, feature requests, code contributions, money, etc).
The page https://pulsar.apache.org/powered-by/ suggests there are quite a number of corporate users who are happy to confirm they use this suite. I don't know how many of those you've heard of, though.
I suspect many private & government agencies around the world would decline to formally attach their name to any list like this, lest it be (mis)interpreted as an endorsement.