Yes, and this is an important point! This is the reason for our current approach for sqlite derivations. You can absolutely just store all the data in the sqlite database, as long as it actually fits. And there are cases where people actually do this on our platform, though I don't think we have an example in our docs.
A lot of people just learning about streaming systems don't come in with useful intuitions about when they can and can't use that approach, or even that it's an option. We're hoping to build up to some documentation that can help new people learn what their options are, and when to use each one.
I agree completely! We've always talked about this, but we haven't really seen a clear way to package it into a good developer UX. We've got some ideas, though, so maybe one day we'll take a stab at it. For now we've been more focused on integrations and just building out the platform.
The main benefit isn't necessarily that it's _streaming_ per se, but that it's _incremental_. We typically see people start by incrementally materializing their data to a destination in more or less the same set of tables that exist in the source system. Then they develop downstream applications on top of the destination tables, and they start to identify queries that could be sped up by incrementally pre-computing part of the result before materializing it.
There are also cases where you just want real-time results. For example, if you want to take action based on a joined result set, then in the RDBMS world you might periodically run a query that joins the tables and check whether you need to act. But polling becomes increasingly inefficient as the polling interval shrinks. So it can work better to incrementally compute the join results, so you can take action immediately upon seeing something appear in the output. Think use cases like monitoring, fraud detection, etc.
To my knowledge, nobody's implemented parquet fragment files. But it supports compression of JSONL out of the box. JSON compresses very well, and compression ratios approaching 10:1 are not uncommon.
But more to the point, journals are meant for things that are written _and read_ sequentially. Parquet wasn't really designed for sequential reads, so it's unclear to me whether there would be much benefit. IMHO it's better to use journals for sequential data (think change events) and other systems (e.g. RDBMS or parquet + pick-your-compute-flavor) for querying it. I don't think there's yet a storage format that works equally well for both.
I don't think it's correct to say that JSONL is any more vulnerable to invalid data than other message framings. There's literally no system out there that can fully protect you from bugs in your own application. But the client libraries do validate the framing for you automatically, so in practice the risk is low. I've been running decently large Gazette clusters for years now using the JSONL framing, and have never seen a consumer write invalid JSON to a journal.
The choice of message framing is left to the writers/consumers, so there's also nothing preventing you from using a message framing that you like better. Similarly, there's nothing preventing you from adding metadata that identifies the writer. Having this flexibility can be seen as either a benefit or a pain. If you see it as a pain and want something that's more high-level but less flexible, then you can check out Estuary Flow, which builds on Gazette journals to provide higher-level "Collections" that support many more features.
Newton was a member of the British elite, he had income from Cambridge University, and he was first the Warden and later the Master of the Royal Mint. He was also experienced in investments.
Nevertheless, he also got sucked into the infamous South Sea mania of 1720, which was basically the monkey JPG of its day, and consequently he lost a great deal of his wealth.
In contrast to jebarker's comment, I actually think it's really interesting that a concept coming from game engine development actually seems quite applicable in some very different domains.
To paraphrase, each derivation produces a collection of data by reading from one or more source collections (DOD calls these "streams"), optionally updating some internal state (sqlite), and emitting zero or more documents to add to the collection. We've been experimenting with this paradigm for a few years now in various forms, and I've found it surprisingly capable and expressive. One nice property of this system is that every transform becomes testable by just providing an ordered list of inputs and expectations of outputs. Another nice property is that it's relatively easy to apply generic and broadly applicable scale-out strategies. For example, we support horizontal scaling using consistent hashing of one or more values extracted from each input.
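To make that concrete, here's a minimal sketch of the shape of a derivation (hypothetical types and names, not the actual Flow API): a function that takes internal state plus one input document and returns zero or more output documents, which makes it trivially testable with an ordered list of inputs.

```typescript
// Hypothetical sketch of a derivation: state + input document -> output documents.
interface ClickEvent { userId: string; url: string; }
interface CountDoc { userId: string; total: number; }
interface State { counts: Map<string, number>; }

// Each input updates internal state and emits zero or more documents.
function derive(state: State, doc: ClickEvent): CountDoc[] {
  const total = (state.counts.get(doc.userId) ?? 0) + 1;
  state.counts.set(doc.userId, total);
  return [{ userId: doc.userId, total }];
}

// Testable with just an ordered list of inputs and expected outputs.
const state: State = { counts: new Map() };
const inputs: ClickEvent[] = [
  { userId: "a", url: "/x" },
  { userId: "b", url: "/y" },
  { userId: "a", url: "/z" },
];
const outputs = inputs.flatMap((doc) => derive(state, doc));
console.log(outputs.map((o) => `${o.userId}=${o.total}`).join(","));
// → a=1,b=1,a=2
```

Because `derive` is deterministic given its state, the same property also makes it easy to shard: route each input to a partition by hashing `userId`, and each shard's state stays self-contained.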
Putting it all together, it's not hard to imagine building real-world web applications using this. Our system is more focused on analytics pipelines, so you probably don't want to build a whole application out of Flow derivations. But it would be really interesting to see a more generic DOD-based web application platform, as I'd bet it could be quite a nice way to build web apps.
I feel like "data products" was a great idea, but difficult to implement in practice. There's kind of a paradox where you need a platform in order to host your data products, but any data products that are tied to a specific platform are almost by definition _not_ data products. Our solution was to focus on _delivery_ of data products to the systems that you're already using instead of making consumers of data products use our platform. I think it's turning out pretty well, so I thought I'd share and see what y'all think.
Post explaining why we chose TypeScript for realtime data transformations in Flow, and how it enables end-to-end static type checking of streaming data pipelines.
> But it is binary, so can’t be viewed or edited with standard tools, which is a pain.
I've heard this sentiment expressed multiple times before, and a minor quibble I have with it is that the fact that it's binary has nothing to do with whether or not it's a pain. It's a pain because the tools aren't ubiquitous, so you can't count on them always being installed everywhere. But I'd argue that sqlite _is_ ubiquitous at this point and, as others have mentioned, it's a _great_ format for storing tabular data.
JSON is also a fine choice, if you want it to be human readable, and I'm not sure why this is claiming it's "highly sub-optimal" (which I read as dev-speak for 'absolute trash'). JSON is extremely flexible, compresses very well, has great support for viewing in lots of editors, and even has a decent schema specification. Oh, and line-delimited JSON is used in lots of places, and allows readers to begin at arbitrary points in the file.
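That last property is worth spelling out. A reader landing at an arbitrary byte offset just skips forward to the next newline and parses whole lines from there (a sketch, assuming no literal newlines inside the serialized values):

```typescript
// Sketch: recover whole JSONL records starting from an arbitrary byte offset.
function parseFromOffset(data: string, offset: number): unknown[] {
  // Skip the (possibly partial) record we landed in the middle of.
  const start = data.indexOf("\n", offset);
  if (start === -1) return [];
  return data
    .slice(start + 1)
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

const jsonl = '{"id":1}\n{"id":2}\n{"id":3}\n';
// Landing mid-way through record 1, we still recover records 2 and 3.
console.log(parseFromOffset(jsonl, 3));
```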
JSON is good for structured data, but I prefer TSV for simple human-readable tabular data. In situations where it's the right choice, a TSV file consists of data and whitespace and nothing else. You can view and edit it with any imaginable tool, and there is no overhead in the form of delimiters and encodings distracting you from the data.
I really like LTSV. (That stands for labeled tab-separated values.)
LTSV is basically equivalent to a JSON object per line. Each column consists of a label, a colon, then a value, and the columns are separated by tabs. The value can be quoted, and if you need a tab in the value, it goes inside the quotes.
As http://ltsv.org/ suggests, I use it for logging, too, so that a log line is easily parseable and a log file is basically a table. There are parsers for many languages, and several tools support it, including fluentd.
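A minimal parser shows how little there is to the format (a simplified sketch; real parsers also handle quoting and escapes, and the function name here is just illustrative):

```typescript
// Sketch: parse one LTSV line ("label:value" columns separated by tabs).
function parseLTSV(line: string): Record<string, string> {
  const record: Record<string, string> = {};
  for (const column of line.split("\t")) {
    // Split on the FIRST colon only, so colons in values (e.g. timestamps) survive.
    const i = column.indexOf(":");
    if (i > 0) record[column.slice(0, i)] = column.slice(i + 1);
  }
  return record;
}

// A log line becomes one row of a table:
const row = parseLTSV("time:2024-01-01T00:00:00Z\tstatus:200\tpath:/index.html");
console.log(row.status, row.path); // → 200 /index.html
```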