Hacker News | _willmanning's comments

Vs Apache Parquet: 100x faster random access, 10-20x faster scans, 5x faster writes, similar compression ratio

And it's nearly as fast as DuckDB's native format on ClickBench when queried from DuckDB


I'll chime in (as a Vortex maintainer) to say that we're greatly indebted to Azim's work on FastLanes & ALP. Vortex heavily utilizes his work to achieve state-of-the-art performance.

I would add that Vortex doesn't have standalone ClickBench results. Azim is presumably referring to the duckdb-vortex results, which were run on an older version of DuckDB (1.2) than the duckdb-parquet ones (1.3). We'll get those updated shortly; we just released a new version of Vortex & the DuckDB extension. Meanwhile, I believe the DataFusion-Vortex vs. DataFusion-Parquet results show substantial speedups across the board.

The folks over at TUM (who originally authored BtrBlocks) did a reasonable amount of micro-benchmarking of Vortex vs Parquet in their recent "Anyblox" paper for VLDB 2025: https://gienieczko.com/anyblox-paper

They essentially say in the paper that Vortex is much faster than the original BtrBlocks because it uses better encodings (specifically citing FastLanes & ALP).
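For intuition, the core trick in ALP (Adaptive Lossless floating-Point) is that most stored doubles are really decimals, so they can be rewritten losslessly as integers times a power of ten and then handed to fast integer encodings. A heavily simplified Python sketch of that idea (the real algorithm samples per vector, picks exponent pairs, and patches exceptions; this toy version just scans exponents):

```python
def alp_encode(values, max_exp=10):
    """Toy ALP-style encoding: find a power of ten that turns every
    double into an exact integer; fall back to the raw doubles
    ("exceptions") when no exponent works. Real ALP is per-vector,
    sampling-based, and far more engineered than this."""
    for e in range(max_exp + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        if all(i / scale == v for i, v in zip(ints, values)):
            return e, ints  # integers now suit bit-packing / FOR / delta
    return None, list(values)  # exceptions: keep raw doubles

def alp_decode(exp, ints):
    scale = 10 ** exp
    return [i / scale for i in ints]

exp, ints = alp_encode([1.25, 3.50, 7.75])
assert (exp, ints) == (2, [125, 350, 775])
assert alp_decode(exp, ints) == [1.25, 3.5, 7.75]
```

The payoff is that `[125, 350, 775]` compresses extremely well with the integer schemes FastLanes specializes in, while the round-trip stays bit-exact.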

I'm looking forward to seeing the FastLanes Clickbench results when they're ready, and Azim, we should work together to benchmark FastLanes against Vortex!


As we discuss in the FastLanes paper, the way BtrBlocks (and now Vortex) implements cascaded encodings is essentially a return to block-based compression such as Zstd, which we're trying to avoid as much as possible. This design doesn't work well with modern vectorized execution engines or GPUs: the decompression granularity is too large to fit in CPU caches or GPU shared memory. So Vortex ends up being yet another Parquet-like file format, repeating the same mistakes. And if it still underperforms Parquet, what's the point?
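A toy sketch of the granularity argument, with zlib standing in for a codec (real FastLanes uses data-parallel bit-packed encodings, not general-purpose compression): when each fixed-size vector is compressed independently, any one vector can be decoded into cache without touching the rest, whereas a single large Zstd-style block forces decompressing everything to read anything.

```python
import zlib

def pack_vectors(values, width=1024):
    """Compress each fixed-size vector independently (toy version:
    zlib stands in for data-parallel bit-packed codecs). Any single
    vector can then be decoded without touching its neighbours."""
    vecs = [values[i:i + width] for i in range(0, len(values), width)]
    return [zlib.compress(bytes(v)) for v in vecs]

def unpack_vector(packed, idx):
    # Decompression granularity = one 1024-value vector, small enough
    # for CPU caches or GPU shared memory; a whole-column block is not.
    return list(zlib.decompress(packed[idx]))

data = list(range(256)) * 32          # 8192 small ints, values 0..255
packed = pack_vectors(data)
assert unpack_vector(packed, 5) == data[5 * 1024:6 * 1024]
```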

We just released FastLanes v0.1, and more results, including ClickBench, are coming soon. Please do benchmark FastLanes and keep us posted!


Compression! Vortex can easily be 10x smaller than the equivalent Arrow representation (and decompresses very quickly into Arrow)
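A rough illustration (plain Python with made-up numbers, not Vortex code or its actual layout) of why a compressed columnar representation can be an order of magnitude smaller than flat Arrow string buffers: a low-cardinality column dictionary-encodes down to a tiny set of uniques plus one small code per row.

```python
# Toy dictionary encoding of a low-cardinality string column.
# (Illustrative only; sizes count payload bytes, ignoring offsets.)
cities = ["Amsterdam", "Berlin", "Copenhagen"] * 10_000

flat_bytes = sum(len(s) for s in cities)          # flat string payload

uniques = sorted(set(cities))
code_of = {s: i for i, s in enumerate(uniques)}
codes = [code_of[s] for s in cities]              # one tiny int per row

dict_bytes = sum(len(s) for s in uniques) + len(codes)  # 1 byte per code
assert flat_bytes / dict_bytes > 8                # ~8x here; bit-packing
                                                  # the 2-bit codes gains more
```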


Nice!


Perhaps that verbiage is just confusing. "On-disk" sort of implies "file format" but could be more explicit.

That said, the immediate next line in the README perhaps clarifies a bit?

"Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."


“Vortex is […] a highly extensible & extremely fast framework for building a modern columnar file format.”

It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.


Will and I actually work on Vortex :wave:

Perhaps we should clean up the wording in the intro, but yes, there is in fact a file format!

We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.

This is nice for a couple of reasons:

(a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.

(b) We spend a lot of energy on efficient pushdown. Many compute functions, such as slicing and cloning, are zero-cost, and all compute operations can execute directly over compressed data.

I highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!

