More

eliasdejong · 2026-02-10T00:22:47 1770682967

Checkout DataStar, it's also a hypermedia web framework but it subsumes the functionality of htmx, is faster, more efficient and fits inside 11 KiB. It uses SSE + brotli streaming compression to achieve compression ratios up to 4000x. It makes the web client completely stateless. Building a webapp cannoy be simpler really.

solomonb · 2026-02-10T00:36:53 1770683813

that sounds really cool but i've already made a commitment to HTMX, at least for the foreseeable future. My main goal here is building a physical radio station.

eliasdejong · 2026-02-08T18:56:43 1770577003

To me this sentiment is silly. Programming was never about the act of writing, but about making the computer do something. Now we can ask computers to helps us write instructions for them. Great if you ask me. And no, your job is not going away because human maintainers will always be required to review changes, communicate with stakeholders and provide a vision for the projects. I have yet to see a chatbot that can REDUCE entropy inside a codebases rather than continuously increase it with more and more slop.

eliasdejong · 2026-02-03T22:31:55 1770157915

Very cool project! Being able to replay history is huge and makes it possible to look back in time without having to make full copies of the database. This is something that is very much lacking in many SQL systems where you need 'temporal tables' to achieve the same effect, but those are really limited as they have to be setup specifically and often duplicate data unnecessarily. If you are interested in this topic, I suggest you study Datomic and the EAVT data model. This is likely where database architecture in the future will be headed.

> The database is stored in memory. So it must be small enough to fit in RAM, and the full journal has to be replayed from scratch when opening a file.

For larger datasets, you really want disk support. Using something like SQLite or DuckDB as an append-only store is another way to achieve this effect.

Also lack of a proper query language will be a problem for long term serious use. A simple hand-rolled program API can only get you so far, until you need more advanced querying.

> Unlike XML or JSON, joedb is a binary file format that does not require any parsing. So, joedb files are much smaller, and processing data is much faster.

Some time ago I created a JSON-compatible serialization format that is zero-copy (no parsing required): https://github.com/fastserial/lite3

It doesn't do transactions or history versioning, but it is also very fast in memory. Something like jq or JSONPath on a disk-file version of this format could be interesting.

eliasdejong · 2026-01-23T17:23:21 1769189001

Performance of Protobuf is a joke. Why not use a zero copy format so that serialization is free? For example, my format Lite³ which outperforms Google Flatbuffers by 242x: https://github.com/fastserial/lite3

tucnak · 2026-01-23T18:04:17 1769191457

Mmmm, I don't know maybe because your library DIDN'T EXIST before November 2025? Or perhaps for any other million reasons why people use Protobuf, and don't use Cap'n'proto and other 0-serialise libraries, like requiring a schema, established tooling for language of their choice, etc?

eliasdejong · 2026-01-17T20:26:20 1768681580

See also, RUM Conjecture: https://www.codementor.io/@arpitbhayani/the-rum-conjecture-1...

Conceptually similar to CAP, but with storage trade-offs. The idea is you can only pick 2 out of 3.

eliasdejong · 2026-01-06T21:40:40 1767735640

Really excellent research and well written, congrats. Shows that io_uring really brings extra performance when properly used, and not simply as a drop-in replacement.

> With IOPOLL, completion events are polled directly from the NVMe device queue, either by the application or by the kernel SQPOLL thread (cf. Section 2), replacing interrupt-based signaling. This removes interrupt setup and handling overhead but disables non-polled I/O, such as sockets, within the same ring.

> Treating io_uring as a drop-in replacement in a traditional I/O-worker design is inadequate. Instead, io_uring requires a ring-per-thread design that overlaps computation and I/O within the same thread.

1) So does this mean that if you want to take advantage of IOPOLL, you should use two rings per thread: one for network and one for storage?

2) SQPoll is shown in the graph as outperforming IOPoll. I assume both polling options are mutually exclusive?

3) I'd be interested in what the considerations are (if any) for using IOPoll over SQPoll.

4) Additional question: I assume for a modern DBMS you would want to run this as thread-per core?

mjasny · 2026-01-06T21:58:22 1767736702

Thanks a lot for the kind words, we really appreciate it!

Regarding your questions:

1) Yes. If you want to take advantage of IOPOLL while still handling network I/O, you typically need two rings per thread: an IOPOLL-enabled ring for storage and a regular ring for sockets and other non-polled I/O.

2) They are not mutually exclusive. SQPOLL was enabled in addition to IOPOLL in the experiments (+SQPoll). SQPOLL affects submission, while IOPOLL changes how completions are retrieved.

3) The main trade-off is CPU usage vs. latency. SQPOLL spawns an additional kernel thread that busy spins to issue I/O requests from the ring. With IOPOLL interrupts are not used and instead the device queues are polled (this does not necessarily result in 100% CPU usage on the core).

4) Yes. For a modern DBMS, a thread-per-core model is the natural fit. Rings should not be shared between threads; each thread should have its own io_uring instance(s) to avoid synchronization and for locality.

eliasdejong · 2026-01-06T02:17:54 1767665874

Increasingly the performance limit for modern CPUs is the amount of data you can feed through a single core: basically memcpy() speed. On most x86 cores the limit is around 6 GB/s and about 20 GB/s for Apple M chips.

When you see advertised numbers like '200 GB/s' that is total memory bandwidth, or all cores combined. For individual cores, the limit will still be around 6 GB/s.

This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.

If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.

The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3

lunixbochs · 2026-01-06T02:46:11 1767667571

your single core numbers seem way too low for peak throughput on one core, unless you stipulate that all cores are active and contending with each other for bandwidth

e.g. dual channel zen 1 showing 25GB/s on a single core https://stackoverflow.com/a/44948720

I wrote some microbenchmarks for single-threaded memcpy

    zen 2 (8-channel DDR4)
    naive c:
      17GB/s
    non-temporal avx:
      35GB/s

    Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
    naive c:
      9GB/s
    non-temporal avx:
      13.5GB/s

    apple silicon tests
    (warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

    m3
    naive c:
      17GB/s cold, 41GB/s warm
    non-temporal neon:
      78GB/s cold+warm

    m3 max 
    naive c:
      25GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    m4 pro
    naive c:
      13.8GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    (I'm not actually sure offhand why asi warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/linux didn't have any kind of cold/warm test difference. my guess would be that it's something about kernel page accounting and not related to the cpu)

I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on apple silicon

nine_k · 2026-01-06T03:03:16 1767668596

As much as I can understand a Zen 5 CPU core can run two AVX512 operations per clock (1024 bits) + 4 integer operations per clock (which use up FPU circuitry in the process), so additional 256 bits. At 4 GHz, this is 640 GB/s.

I suppose that in real life such ideal condition do not occur, but it shows how badly the CPU is limited by its memory bandwidth for streaming tasks. Its maximum memory-read bandwidth is 768 bits per clock. only 60% of its peak bit-crunching performance. DRAM bandwidth is even more limiting. And this is a single core of at least 12 (and at most 64).

kvemkon · 2026-01-06T03:27:42 1767670062

> CPU is limited by its memory bandwidth for streaming tasks

That must be the reason, why EPYC 9175F exists. It is only 16-core CPU, but all 16 8-core CCDs are populated and only one core on each is active.

The next gen EPYC is rumored to have 16 instead of 12 memory channels (which were 8 only 4-5 years ago).

tanelpoder · 2026-01-06T03:53:38 1767671618

This also leaves more power & thermal allowance for the IO Hub on the CPU chip and I guess the CPU is cheaper too.

If your workload is mostly about DMAing large chunks of data around between devices and you still want to examine the chunk/packet headers (but not touch all payload) on the CPU, this could be a good choice. You should have the full PCIe/DRAM bandwidth if all CCDs are active.

Edit: Worth noting that a DMA between PCIe and RAM still goes through the IO Hub (Uncore on Intel) inside the CPU.

vjerancrnjak · 2026-01-06T11:56:44 1767700604

It is interesting that despite this we still have programming languages and libraries that cannot exploit pipelining to actually demonstrate IO is the bottleneck and not CPU

torginus · 2026-01-06T11:12:01 1767697921

Thanks for the detailed writeup! This made me think of an interesting conundrum - with how much RAM modern computers come with (16GB is considered to be on the small side), having the CPU read the entire contents of RAM takes a nontrivial amount of time.

A single threaded Zen2 program could very well take 1 second to scan through your RAM, during which it's entirely trivial to read stuff from disk, so the modern practice of keeping a ton of stuff in RAM might be actually hurting performance.

Algorithms, such as garbage collection, which scan the entire heap, where the single-threaded version is probably slower than the naive zen 2 memcpy might run for more than a second even on a comparatively modest 16GB heap, which might not be even acceptable.

zozbot234 · 2026-01-06T11:42:07 1767699727

It's true that GC is a huge drain on memory throughput, even though it doesn't have to literally scan the entire heap (only potential references have to be scanned). The Golang folks are running into this issue, it's reached a point where the impact of GC traffic is itself a meaningful bottleneck on performance, and GC-free languages have more potential for most workloads.

jauntywundrkind · 2026-01-06T08:39:39 1767688779

I agree, I don't think these numbers check out. IIRC people were a bit down on manycore Clearwater Forest in August for core complexes of 4x cores each sharing a 35GB/s link. (And also sharing 4MB of pretty alright 400GB/s L2 cache among them). This is a server chip so expectations are higher, but 6GB/s per core seems very unlikely.

https://chipsandcheese.com/p/intels-clearwater-forest-e-core...

eliasdejong · 2026-01-06T04:15:49 1767672949

> A key feature of this code is that it skips CPU cache when copying

Are those numbers also measured while skipping the CPU cache?

lunixbochs · 2026-01-06T04:17:10 1767673030

naive c is just a memcpy. non-temporal uses the streaming instructions.

johncolanduoni · 2026-01-06T02:29:52 1767666592

What is the nature of the architectural limit here? The bus between an individual core and the caches and/or memory controller?

eliasdejong · 2026-01-06T03:03:49 1767668629

The limit is the number of outstanding cache line requests to the memory controller. CPUs have a fixed number of slots for this, around 10-12 usually. Intel calls them LFBs (Line Fill Buffers) and AMD MSHRs (Miss Status Holding Registers). When the slots are filled, the CPU can issue no more requests and has to wait for them to complete. Apple M chips (probably) have more slots and the memory is physically packaged together with the CPU, so they get better numbers.

foota · 2026-01-06T03:36:09 1767670569

I assume these must be really expensive? Otherwise it seems like a great way to improve throughput on low concurrency tasks.

IgorPartola · 2026-01-06T06:15:13 1767680113

At least in older CPUs the caches were SRAM (static RAM). It is complicated but requires no refreshing. DRAM is basically just a capacitor per bit and capacitors leak so you constantly have to refresh the entire memory space. When the CPU sends a request to RAM, the memory controller might be too busy refreshing the soon to decay parts to actually respond right away. And if I recall correctly when you read from DRAM you destroy what was there so the process is to read it, then write it back, then send the answer to the CPU which is just a lot of steps. But the price and die size difference is huge so we use GB or TB levels of DRAM and MB levels of SRAM.

foota · 2026-01-06T20:02:08 1767729728

Wouldn't this bound the overall memory bandwidth, not the per core bandwidth? I've sort of assumed that just providing more line fill buffers wouldn't be sufficient, and that the number of LFB is chosen in tandem with a number of other things, but I'm not sure what the other things are (that is, just increasing the # of LFB might not be meaningful without also increasing XYZ).

IgorPartola · 2026-01-06T23:21:58 1767741718

I can’t speak to that. Last time I looked at this stuff was when I was taking an electrical engineering class and we were talking about constructing RAM out of flip flops.

memoriuaysj · 2026-01-06T06:44:57 1767681897

bus wires. you can route only so many of them on a motherboard.

it's why GPUs have their memory chips in a circle around the GPU chip.

foota · 2026-01-06T20:01:39 1767729699

Wouldn't this be the limiting factor moreso for overall throughput, not per core? I believe with Zen 4 for instance it goes through a central memory controller.

memoriuaysj · 2026-01-06T22:23:30 1767738210

sure, but you can have more memory controllers (I/O dies), like threadripper

tiffanyh · 2026-01-06T03:42:25 1767670945

> On most x86 cores the limit is around 6 GB/s and about 20 GB/s for Apple M chips.

What makes M-series have 3x the bandwidth (per core), over x86?

MindSpunk · 2026-01-06T04:02:30 1767672150

M-series have a substantially wider memory bus allowing much higher throughput. It's not really an x86/M-series thing, rather it's a packaging limitation. Apple integrates the memory into the same package as the CPU, the vast majority of x86 CPUs are socketed with socketed memory.

Apple are able to push a wider bus at higher frequencies because they aren't limited by signal integrity problems you encounter trying to do the same over sockets and motherboard traces. x86 CPUs like the Ryzen AI Max 395+, when packaged without socketed memory, are able to push equally wide busses at high frequencies.

Analemma_ · 2026-01-06T04:09:50 1767672590

The sibling comment has the correct, more detailed answer, but the high-level answer is that the M-series chips are SoCs with all the RAM on-die. That lets you push way more data than you can over a bus out to socketed memory.

The tradeoff is that it's non-upgradeable, but (contra some people who claim this is only a cash-grab by Apple to prevent RAM upgrades) it's worth it for the bandwidth.

fanf2 · 2026-01-06T08:12:00 1767687120

It isn’t on-die, the ram is separate chips in the same package. See the pics about halfway down https://www.apple.com/de/newsroom/2023/10/apple-unveils-m3-m...

Scaevolus · 2026-01-06T04:18:02 1767673082

Apple had soldered DDR ram for a long time that was no faster than any other laptop. It's only with the Apple Silicon M1 that it started being notably higher bandwidth.

vlovich123 · 2026-01-06T08:16:58 1767687418

There were always technical benefits like lower power consumption iirc

eru · 2026-01-06T04:20:28 1767673228

> The tradeoff is that it's non-upgradeable, but (contra some people who claim this is only a cash-grab by Apple to prevent RAM upgrades) it's worth it for the bandwidth.

That, and if you come at it from the phone / tablet or even laptop angle: most people are quite ok just buying their computing devices pre-assembled and not worrying about upgrading them. You just buy a new one when the old one fails or you want an upgrade.

Similar to how cars these days are harder to repair for the layman, but they also need much less maintenance. The guy who was always tinkering away with his car in old American sitcoms wasn't just a trope, he was truth-in-television. Approximately no one has to do that anymore with modern cars.

15155 · 2026-01-06T12:19:12 1767701952

The memory is in-package, not on-die - on-die would mean that the DRAM is being manufactured on the same 1-2-3-4-whatever nanometer process - DDR is much larger.

zozbot234 · 2026-01-06T09:44:27 1767692667

On quite a few recent chips (including, AIUI, Apple M series) you can only saturate memory bandwidth by resorting to the iGPU (which has access to unified memory), CPU cores on their own won't do it. It means that using the iGPU as a blitter for huge in-memory transfers and for all throughput-limited computation (including such things as parallel parsing or de/compression workloads) is now the technically advisable choice, provided that this can be arranged.

> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.

Of course the "skipping" is by cachelines. A cacheline is effectively a self-contained block of data from a memory throughput perspective, once you've read any part of it the rest comes for free.

dehrmann · 2026-01-06T02:29:10 1767666550

> 6 GB/s

Samsung is selling NVMe SSDs claiming 14 GB/s sequential read speed.

eliasdejong · 2026-01-06T02:39:24 1767667164

> 14 GB/s

Yes, those numbers are real but only in very short bursts of strictly sequential reads, sustained speeds will be closer to 8-10 GB/s. And real workloads will be lower than that, because they contain random access.

Most NVMe drivers on Linux actually DMA the pages directly into host memory over the PCIe link, so it is not actually the CPU that is moving the data. Whenever the CPU is involved in any data movement, the 6 GB/s per core limit still applies.

jauntywundrkind · 2026-01-06T08:46:03 1767689163

I feel like you are pulling all sorts of nonsense out of nowhere. Your numbers seem all made up. 6GB/s seems outlandishly tiny. Your justifications are not really washing. Zen4 here shows single core as, at absolute worst behavior, dropping to 57GB/s. Basically 10x what you are spinning. You are correct in that memory limits are problematic, but we also have had technology like Intel's Direct Data IO (2011) that lets the CPU talk to peripherals without having to go through main memory at all (big security disclosure on that in 2019, yikes). AMD is making what they call "Smart Data Cache Injection" which similarly makes memory speed not gating. So even if you do divide the 80GB/s memory speed across 16 chips on desktop and look at 5GB/s, that still doesn't have to tell the whole story. https://chipsandcheese.com/p/amds-zen-4-part-2-memory-subsys... https://nick-black.com/dankwiki/index.php/DDIO

As for SSD, for most drives, it's true true that they cannot sustain writes indefinitely. They often write in SLC mode then have to rewrite, re-pack things into denser storage configurations that takes more time to write. They'll do that in the background, given the chance, so it's often not seen. But write write write and the drive won't have the time.

Thats very well known, very visible, and most review sites worth a salt test for it and show that sustained write performance. Some drives are much better than others. Even still, an Phison E28 will let you keep writing at 4GB/s until just before the drive is full full full. https://www.techpowerup.com/review/phison-e28-es/6.html

Drive reads don't have this problem. When review sites benchmark, they are not benchmarking some tiny nanosliver of data. Common benchmark utilities will test sustained performance, and it doesn't suddenly change 10 seconds in or 90 seconds in or whatever.

These claims just don't feel like they're straight to me.

zozbot234 · 2026-01-06T10:16:24 1767694584

6 GB/s may well be in the ballpark for throughput per core whenever all cores are pegged to their max. Yes, that's equivalent to approx. 2-3 bytes per potential CPU cycle; a figure reminiscent of old 8-bit processors! Kind of puts things in perspective when phrased like that: it means that for wide parallel workloads, CPU cycle-limited compute really only happens with data already in cache; you're memory bandwidth limited (and should consider the iGPU) as soon as you go out to main RAM.

yusyusyus · 2026-01-06T04:17:06 1767673026

DMAing as opposed to what?

loeg · 2026-01-06T04:23:37 1767673417

Issuing extremely slow PCIe reads? GP probably meant “all” rather than “most.”

yusyusyus · 2026-01-06T14:53:29 1767711209

yep that makes more sense.

saidnooneever · 2026-01-06T08:07:31 1767686851

i think he means simply the cpu speed bottleneck doesnt apply as it related to dma which doesnt 'transfer data' via the cpu directly.

its phrased a bit weird.

imtringued · 2026-01-06T11:10:41 1767697841

Letting the CPU burn cycles by accessing memory mapped device memory instead of using DMA like in the good old days? What even is this question?

wtallis · 2026-01-06T14:56:50 1767711410

> accessing memory mapped device memory instead of using DMA like in the good old days

That's not actually an option for NVMe devices, is it? The actual NAND storage isn't available for direct memory mapping, and the only way to access it is to set up DMA transfers.

yxhuvud · 2026-01-06T12:39:24 1767703164

What? Nvme dont care about sequential access. If that slows you down then it is the fault of the operating system and the APIs it provide.

In Linux you can use direct IO or RWF_UNCACHED to avoid paying extra for unwanted readahead.

johncolanduoni · 2026-01-06T02:34:52 1767666892

Sequential read speed is attainable while still having a (small) number of independent sequential cursors. The underlying SSD translation layer will be mapping to multiple banks/erase blocks anyway, and those are tens of megabytes each at most (even assuming a 'perfect' sequential mapping, which is virtually nonexistent). So you could be reading 5 files sequentially, each only producing blocks at 3GB/s. A not totally implausible access pattern for e.g. a LSM database, or object store.

wmf · 2026-01-06T03:58:24 1767671904

Any code that's reading/writing to SSD needs to use multiple cores. The SSD is faster than a single CPU core.

vlovich123 · 2026-01-06T06:35:38 1767681338

That doesn’t sound right. A single core should more than fast enough to saturate IOPs (particularly with iouring) unless you’re doing something insane like a lot of small writes. A write of 16mib or 32mob should still be about 1 ssd iop - more CPUs shouldn’t help(and in fact should be slower if you have 2 16mib IOPs vs 1 32mib iop)

wmf · 2026-01-06T17:19:11 1767719951

Do you want to process that data, or just let it hang out in memory?

15155 · 2026-01-06T12:21:45 1767702105

SSDs are not faster than a DMA core.

auselen · 2026-01-06T12:09:08 1767701348

How do you measure/calculate 6GB/s?

woooooo · 2026-01-06T10:53:09 1767696789

> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.

You still have to load a 64-byte cache line at a time, and most CPUs do some amount of readahead, so you'll need a pretty large "blank" space to see these gains, larger than typical protobufs.

squirrellous · 2026-01-06T04:29:10 1767673750

Would you mind sharing what problems motivated Lite? Curious what are the typical use cases for selective reading / in place modification of serialized data. My understanding is that for cases that really want all of the fields, the zero-copy solutions aren’t much better than JSON / protobuf, so these are solutions to different problems.

eliasdejong · 2026-01-06T04:49:49 1767674989

The primary motivations were performance requirements and frustration with schema formats.

The ability to mutate serialized data allows for 2 things:

1) Services exchanging messages and making small changes each time e.g. adding a timestamp without full reserialization overhead.

2) A message producer can keep a message 'template' ready and only change the necessary fields each time before sending. As a result, serializing becomes practically free from a performance perspective.

squirrellous · 2026-01-06T10:01:23 1767693683

Thanks. These are actually achievable with a custom protocol buffer implementation, but I agree at that point one may as well create a new format.

FWIW the other use case I was expecting is something like database queries where you’d select one particular column out of a json-like blob and want to avoid deserialization.

Nathanba · 2026-01-06T02:37:23 1767667043

cool, do you think it's possible to add a schema mode to lite3 to remove the message size tradeoff? I think most people will still want to use lite3 with hard schemas during both serialization and deserialization. It's nice that it also works in a schemaless mode though.

eliasdejong · 2026-01-06T02:50:15 1767667815

Being schemaless is deliberate design decision as it eliminates the need for managing and building schema files. By not requiring schema, messages are always readable to arbitrary consumers.

If you want schema, it must be done by the application through runtime type checking. All messages contain type information. Though I do see the value of adding pydantic-like schema checking in the future.

EDIT: Regarding message size, Lite³ does demand a message size penalty for being schemaless and fully indexed at the same time. Though if you are using it in an RPC / streaming setting, this can be negated through brotli/zstd/dict compression.

Nathanba · 2026-01-06T07:20:39 1767684039

The thing about schemaless is that it's great for usability and I like it with JSON but as with JSON when we develop applications in reality at the end of the day you always have some kind of schema, whether it's written down or not. Like you alluded with pydantic, the application is going to rely on the data being in some sort of shape, even if it's very defensively written and practically everything is optional, you still end up relying something. That would be the informal schema so in my mind if I have a schema anyway no matter what... then maybe this should be supported in the serialization format/library to give me the benefit of size reductions.

yetihehe · 2026-01-06T13:52:12 1767707532

I very often use the best of both worlds. Have schema for messages, but some messages have additional "info" field, which is typically json for other data not defined elsewhere or not added to every message.

digdugdirk · 2026-01-06T05:17:48 1767676668

Pydantic was the first thing I thought of when I saw this. The possibilities are very intriguing.

Do you have any thoughts/recommendations for someone if they were to try making a pydantic interface layer for lite3?

eliasdejong · 2026-01-06T16:02:37 1767715357

The library contains iterator functions that can be used to recursively traverse a message from the root and discover the entire structure. Then every value can be checked for defined constraints appropriately while at the same time validating the traversal paths. Also a schema description format is needed, perhaps something like JSONSchema can be used.

Doing this by definition requires processing an entire message which can be a little slow, so it is best done in C and exposed through CPython bindings / CFFI.

eru · 2026-01-06T04:24:22 1767673462

> By not requiring schema, messages are always readable to arbitrary consumers.

That sounds a bit silly..

> All messages contain type information.

That would (partially) enable what you were suggesting in the sentence I quoted first. But that's orthogonal to being schema-full or schema-less.

tignaj · 2026-01-06T03:40:12 1767670812

If you are looking for a fast format with schema support see STEF: https://www.stefdata.net/

Disclosure: I am the author.

mgaunard · 2026-01-06T12:09:08 1767701348

Quite easy to outperform a parsing library when you're not actually doing any parsing work and just memory-mapping pre-parsed data...

That being said storing trees as serializable flat buffers is definitely useful, if only because you can release them very cheaply.

eliasdejong · 2026-01-06T12:21:56 1767702116

Imagine if you measured the speed of beer delivery by the rate at which beer cans can be packed/unpacked from truck pallets. But then somebody shows up with a tanker truck and starts pumping beer directly in and out. You might argue this is 'unfair' because the tanker is not doing any packing or unpacking. But then you realize it was never about packing speed in the first place. It was about delivering beer.

StilesCrisis · 2026-01-06T13:17:36 1767705456

This is actually a good analogy; the beer cans are self-contained and ready for the customer with zero extra work. The beer delivered by the tanker still needs to be poured into individual glasses by the bartender, which is slow and tedious.

cake-rusk · 2026-01-06T14:55:10 1767711310

He probably meant beer kegs. Memory map-able data is closer to sending beer cans since both are in a ready to use format.

brunoborges · 2026-01-06T13:12:29 1767705149

> This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.

Which file formats allow partial parsing?

mort96 · 2026-01-06T15:41:38 1767714098

You just need to encode the size of values in bytes to make it possible to partially parse the format.

Imagine the following object:

    {
        "attributes": [ .. some really really long array of whatever .. ],
        "name": "Bob"
    }

In JSON, if you want to extract the "name" property, you need to parse the whole "attributes" array. However, if you encoded the size of the "attributes" array in bytes, a parser could look at the key "attributes", decide that it's not interested, and jump past it without parsing anything.

You'd typically want some kind of binary format, but for illustration purposes, here's an imaginary XML representation of the JSON data model which achieves this:

    <object content-size-in-bytes="8388737">
      <array key="attributes" content-size-in-bytes="8388608">
        .. some 8 MiB large array of values ..
      </array>
      <string key="name" content-size-in-bytes="3">Bob</string>
    </object>

If this was stored in a file, you could use fseek to seek past the "attributes" array. If it was compressed, or coming across a socket, you'd need more complicated mechanisms to seek past irrelevant parts of the object.

brunoborges · 2026-01-09T20:18:09 1767989889

Yeah, I was thinking of binary formats as the only solution, but your XML example is perfect. Thank you.

perching_aix · 2026-01-06T15:16:47 1767712607

Anything that is both streamable and seekable?

1vuio0pswjnm7 · 2026-01-06T17:47:03 1767721623

Pardon the ignorance, but is there a reason, or reasons, that netstrings/bencode is not included in the list of formats against which Lite^3 is tested

hamandcheese · 2026-01-06T03:25:57 1767669957

Lite claims that it can be modified in-place, but I'm curious how that works with variable-length structures like strings?

eliasdejong · 2026-01-06T03:36:49 1767670609

If the new value is equal size or smaller, it will overwrite the old value in-place. If it is larger, then the new value is appended to the buffer and the index structure is updated to point to the new location.

In the case of append, the old value still lives inside the buffer but is zeroed out. This means that if you keep replacing variable-sized elements, over time the buffer will fragment. You can vacuum a message by recursively writing it from the root to a new buffer. This will clear out the unused space. This operation can be delayed for as long as you like.

If you are only setting fixed-size values like integers or floats, then the buffer never grows as they are always updated in-place.

eru · 2026-01-06T04:21:34 1767673294

Interesting. Sounds like you are getting copy-on-write-with-sharing for growing sizes and in-place updates when your data shrinks?

quadrature · 2026-01-06T15:14:59 1767712499

For what it’s worth simdjson now has an on demand api that lets you skip over keys that you don’t need.

mort96 · 2026-01-06T15:28:17 1767713297

Which is great, but a JSON parser fundamentally can't avoid looking at every byte. You can't jump to the next key, you have to parse your way to the next key.

rattray · 2026-01-06T03:40:04 1767670804

Does capn proto have similar properties?

eliasdejong · 2026-01-06T03:51:39 1767671499

Yes, any zero-copy format in general will have this advantage because reading a value is essentially just a pointer dereference. Most of the message data can be completely ignored, so the CPU never needs to see it. Only the actual data accessed counts towards the limit.

Btw: in my project README I have benchmarks against Cap'N Proto & Google Flatbuffers.

eru · 2026-01-06T04:22:27 1767673347

Have you benchmarked against Rust's rkyv, too?

eliasdejong · 2026-01-06T04:30:07 1767673807

No, I have not benchmarked yet against other languages. Rkyv is Rust only.

One primary difference is that Rkyv does not support in-place mutation. So any modification of a message requires full reserialization, unlike Lite³.

eru · 2026-01-06T06:10:19 1767679819

Rkyv supports a limited amount of in-place mutation. But not in general like your approach does.

eliasdejong · 2025-12-19T09:33:48 1766136828

(author here)

The documentation page is out of date, the format now resolves collisions through quadratic probing.

eliasdejong · 2025-12-19T09:12:44 1766135564

Author here,

First of all, hello Hacker News :)

Many of the comments seem to address the design of key hashing. The reason for using hashed keys inside B-tree nodes instead of the string keys directly is threefold:

1) The implementation is simplified.

2) When performing a lookup, it is faster to compare fixed-sized elements than it is to do variable length string comparison.

3) The key length is unlimited.

I should say the documentation page is out of date regarding hash collisions. The format now supports probing thanks to a PR merged yesterday. So inserting colliding keys will actually work.

It is true that databases and other formats do store string keys directly in the nodes. However as a memory format, runtime performance is very important. There is no disk or IO latency to 'hide behind'.

Right now the hash function used is DJB2. It has the interesting property of somewhat preserving the lexicographical ordering of the key names. So hashes for keys like "item_0001", "item_0002" and "item_0003" are actually more likely to also be placed sequentially inside the B-tree nodes. This can be useful when doing a sequential scan on the semantic key names, otherwise you are doing a lot more random access. Also DJB2 is so simple that it can be calculated entirely by the C preprocessor at compile time, so you are not actually paying the runtime cost of hashing.

We will be doing a lot more testing before DJB2 is finalized in the spec, but might later end up with a 'better' hash function such as XXH32.

Finally, TRON/Lite³ compared to other binary JSON formats (BSON, MsgPack, CBOR, Amazon Ion) is different in that:

1) none of the formats mentioned provide direct zero-copy indexed access to the data

2) none of the formats mentioned allow for partial mutation of the data without rewriting most of the document

This last point 2) is especially significant. For example, JSONB in Postgres is immutable. When replacing or inserting one specific value inside an object or array, with JSONB you will rewrite the entire document as a result of this, even if it is several megabytes large. If you are performing frequent updates inside JSONB documents, this will cause severe write amplification. This is the case for all current Postgres versions.

TRON/Lite³ is designed to blur the line between memory and serialization format.

sushisource · 2025-12-19T23:37:23 1766187443

Really cool project. Any plans to port the API to other languages? My use case for something like this is to represent types that are used on either side of FFI (Rust <--> Some other language). Pairing this with a code generator for the shared types would be great. That's something flatbuffers/capnproto do well that isn't just pure speed.

eliasdejong · 2025-12-19T23:53:23 1766188403

Yes, language ports are being worked on :)

You will be able to access the same data in different languages using APIs specific to your language.

Right now someone has already made a (private) Go port. And Rust is also in the works.

simonw · 2025-12-20T03:44:54 1766202294

I vibe-coded a prototype Python library just to see how it could work - left you an issue comment about that here: https://github.com/fastserial/lite3/issues/6#issuecomment-36... - code is at https://github.com/simonw/lite3/tree/conformance-suite/lite3...

(Update: I used the same process to add a JavaScript library as well: https://github.com/simonw/lite3/tree/conformance-suite/lite3...)

flakiness · 2025-12-20T00:27:23 1766190443

> And Rust is also in the works.

About to troll you on this point. I'm glad I wasn't too hasty. Looking forward to trying it out once it's there!

p0w3n3d · 2025-12-19T22:08:58 1766182138

That's really impressive. As you wrote it in C it gets automatically compilable to webasm and usable in js. I wonder how Java would behave here... As JNI is not the fastest (used to be not the fastest?)

andreyvit · 2025-12-19T12:16:10 1766146570

Hey, I'm sorry, but your Postgres example is completely wrong: because of MVCC, a new version of the data will be stored on every update regardless of the choice of data representation, making the in-place mutability much less of an advantage. (It might be faster than a pair of a compact immutable format + mutable patch layer on top, or it might be slower; the answer ain't immediately obvious to me!)

What you should be imagining instead is a document database entirely built around Lite³-encoded documents, using something like rollback journals instead of MVCC.

We're doing something similar in my company, storing zero-serialization immutable [1] docs in a key-value store (which are read via mmap with zero copying disk-to-usage) and using a mutable [2] overlay patch format for updates. In our analytics use cases, compact storage is very important, in-place mutability is irrelevant (again because of Copy-on-Write at the key-value store level), and the key advantage is zero serialization overhead.

What I'm saying is that Lite³ is a very timely and forward-looking format, but the merging of immutable and mutable formats into one carries tradeoffs that you probably want to discuss, and the discussion into the appropriate use cases is very much worth having.

[1] https://github.com/andreyvit/edb/blob/main/kvo/immutable.go [2] https://github.com/andreyvit/edb/blob/main/kvo/mutable.go

eliasdejong · 2025-12-19T13:12:02 1766149922

Hi, you are right in calling out the Postgres example in the context of DBs/MVCC. The purpose of JSONB is to be an indexable representation of JSON inside a Postgres database. It is not trying to be a standalone format for external interchange and therefore it is fulfilling very different requirements.

A serialization format does not care about versioning or rollbacks. It is simply trying to organize data such that it can be sent over a network. If updates can be made in-place without requiring re-serialization, then that is always a benefit.

Write amplification is still a fact however that I think deserves to be mentioned. To tackle this problem in the context of DBs/MVCC, you would have to use techniques other than in-place mutation like you mention: overlay/COW. Basically, LMDB-style.

And yes I think databases is where this technology will eventually have the greatest potential, so that is where I am also looking.

eliasdejong · 2025-12-14T23:33:20 1765755200

Lite³: a binary format encoding JSON documents as a serialized B-tree, making it possible to construct iterators on them directly and query internal fields at indexed speeds. It is still a serialized document (possible to send over a network), though now you don't need to do any parsing, since the document itself is already indexed.

https://github.com/fastserial/lite3

Retr0id · 2025-12-14T23:39:51 1765755591

Very interesting. How does it deal with "holes" left by deletions/edits, is there an equivalent of VACUUM?

eliasdejong · 2025-12-14T23:45:39 1765755939

When overwriting fixed-sized values like integers or floats, the old value can simply be overwritten. Holes are only left if the overriding values do not fit in the old location (strings, byte arrays).

To clean up unused space, you start an iterator at the root of the document and recursively write to a new buffer. This will clean up all the unused space. This operation can be delayed by the application for as long as they wish, until the size trade-off outweighs the cost of rebuilding.