Building a Database in the 2020s (2022)

carom · on April 9, 2023

Memory safety is something that needs to be mentioned. I was integrating DuckDB into a project and ended up ripping it out after running into memory corruption issue in practice. Upon investigation they had a massive issue of fuzzer found bugs on their GitHub. While I am glad they are fuzzing and finding issues, I cannot ship that onto customer systems.

We have a few very good memory safe programming languages at this point. Please do not start a project in C/C++ unless you are truly exceptional and understand memory management and exploitation inside and out. I switched to SQLite on the project since it is one of the more fuzzed applications out there that fit the need. The next embeddable database I use (bonus if it works on cloud) will need to be in a memory safe language.

vbezhenar · on April 9, 2023

There're more than one embedded database in Java. They were old and well tested 10 years ago. Why people keep missing this ecosystem is beyond my comprehension.

marginalia_nu · on April 9, 2023

I say this as someone who does low-level db-adjacent stuff in Java:

Java's in a weird spot. On the one hand it almost definitely gets way more shit than it deserves. On the other hand there's a kernel of truth in a lot of the shit that comes its way especially in this area.

Low level programming in Java is unbelievably awkward. From 2Gb limits on bytebuffers to the absence of unsigned types. Everything is just a bit too far removed from the actual hardware and OS to do this stuff really well.

It's like coding with oven mitts, and undeniably there's a bit of overhead almost everywhere as a result, beyond what the JVM already introduces (which is not a lot).

Another big cause of awkwardness in this space is the lack of generic algorithms. Yeah you have <Generics> but they come at a cost of boxing all your primitives, which is just untenable if you need to go fast. So because you're undoubtedly dealing with some awkward custom workaround to the 2Gb byte buffer limit, you end up having to build and maintain a custom standard library of N different sort functions, N different binary searches, N different merge functions, N different bisect functions, N different B*-tree (or whatever) implementations, etc. to support different N widths of data types (probably at least short, int, long, float, double in a dbms).

Although it should be noted a lot of this is getting better with JEPs that are in the pipe and ought to arrive in a few years.

crabbone · on April 9, 2023

I worked on a distributed filesystem which had its administration tools written in Java. The system itself was written mostly in C, and in some weird Haskell / C++ generated templates that compiled to C source.

I've known about a bunch of "sister" storage projects which were written in a similar combo: typically the product itself would be C (or eventually, D, more recently -- Rust), but a language like Java would be inconceivable for a system like that.

Few things that these projects had in common:

* Home-made concurrency (i.e. they didn't use pthreads, instead they used some green threads designed specifically for this project).

* Home-made allocator. Again, similar to the threads, the default allocator options available to system programming aren't good for mission-critical software.

* Home-made logging. Due to high volume, speed and storage requirement it's typically a very involved system that is far more sophisticated than any Java built-in or 3rd party logging library.

* Home-made collections / containers. The reasons to have this is very similar to the reason to have custom allocators -- improved ability to deal with memory-related errors.

* Home-made drivers and IP stack, or at least partial re-implementation of IP and higher-level protocols. Again, because of memory-related concerns.

* Virtually no 3rd-party libraries. For close to 2M lines of C code for the core part of the system, we had like... some compression algorithm library as a dependency... and that was it, iirc.

I mean, these kinds of products need to rebuild something rather minimal like C almost from the ground up... just the mere thought of using Java in this kind of environment would've been laughed out of the room.

I know that databases, like relational databases are often written in Java, but in my experience, those are mostly used by Java applications and rarely leave the boundaries of the Java ecosystem. People coming from storage background rather than pure Java background wouldn't think about Java databases more than as toy projects really.

jiggawatts · on April 9, 2023

C# fixes most of these issues. It has unsigned types, and interop with native calls is generally much more ergonomic than in Java. There are a range of useful built-in types for handling this type of system code, such as SafeHandle[1] and Span<T>[2] where T can be 'byte', etc...

Similarly, it has good support for struct (value) types, and it boxes native types far less often than Java does, for massive performance benefits. Last but not least, NET 8 will come with an ahead-of-time native compiler for faster startup times. It's about to get AVX-512 support! [3]

[1] https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

[2] https://learn.microsoft.com/en-us/dotnet/api/system.span-1

[3] https://github.com/dotnet/runtime/issues/35773

crabbone · on April 9, 2023

No... if Java isn't used in system programming, which is what storage, and databases sort of is, C# has no hope of doing it. At least due to tradition. Same way how nobody will write Linux utilities in C++, even though it's technically possible. It's just a tradition.

Storage people have a very strong allergy to anything that comes from Microsoft. You probably would have a better chance of writing storage applications in Common Lisp than C#. Only Microsoft uses its own storage products in places that matter. Even just having to integrate with Microsoft's products is seen as a huge pain that's not worth the effort.

And, truth be told... I want it to stay that way. Companies like Microsoft should be kept as far away as possible from critical infrastructure. We failed on few fronts, and are paying through the nose for it. It's nice that at least on this front, this is not a problem.

Also, on a personal level, no other language makes me want to puke as much as C# does. I mean, to the point that if the only programming jobs available in the world were the ones that ask for programming in C#, I'd go farming or cleaning... whatever. Just not touch that steaming pile ever again.

carom · on April 9, 2023

Hm, I don't think anyone is missing the Java ecosystem. It is widely used. In our case there were much better choices for interfacing with eBPF hooks.

I am happy to use Java if I am doing graph database work or forced to interact with an Apache project. It is fine language and the developers are cheap. Often a good choice.

dunefox · on April 9, 2023

DuckDB is designed and used for data science, ml, etc. and Java isn't. It's as simple as that.

marginalia_nu · on April 9, 2023

JEPs are in the pipe to fix that though.

tyingq · on April 9, 2023

If you're writing an embeddable database, exposing it as a C library maximizes the potential user base. Since most languages have some way to do ffi with C, most platforms have C compilers, etc.

josephg · on April 9, 2023

C FFI goes both ways. It’s also pretty easy to compile rust or zig code into a static or dynamic C library. And then that library can be used directly from C or from anything that has a C FFI (Python, Ruby, javascript, Java, etc etc).

I’ve been working on an iOS app lately with some of the core code in rust (shared with other platforms) and the UI in native swift. Changing the API at the boundary is a bit annoying, but the resulting app works great.

plagiarist · on April 9, 2023

What are the ergonomics of compiling that binary library? I think this sounds really neat, if you have a build system that produces something like an .xcframework and corresponding package file it wouldn't even be a pain point.

josephg · on April 13, 2023

Yeah; thats exactly what I have. I have a script which compiles the rust code for all the combinations of hardware (arm & x86_64 on ios & macos) and produces an .xcframework. Then I just recompile my xcode project. Its a bit hacky, but it works fine.

I'm using the swift_bridge crate to generate a lot of the bindings and generate the .xcframework.

https://github.com/josephg/replica/blob/master/build_swift.s...

KrugerDunnings · on April 9, 2023

I've been building a Postgresql extension in the last months for some functionality that was needed and have learned a ton about the internal workings of this database. All very scary and complicated sounding stuff but I feel privileged to be able to do this because the things you learn are just pure gold. My attitude before this was that of the ideal customer of a cloud database, someone who was scared of sql and preferred to hide behind the complexity of a ORM. Not anymore, now I write thousands of lines of sql and laugh to myself like a maniac.

nonethewiser · on April 9, 2023

That’s a big jump - hiding behind ORMs to learnings some inner workings of Postgres.

Do you have an example of something that you previously saw as a black box but now understand? Perhaps something simpler than you anticipated, or just something you never even knew existed before digging into the details?

KrugerDunnings · on April 9, 2023

There is just so much, I've thought of writing some blog post about it but there is already a lot of content out there because pg is a big community with a lot of people doing interesting things and I don't know if what I know is unique and not just parroting others peoples stuff.

But ok sure I can give you some pointers to cool stuff. Sometimes it are just architectural patterns that are easy to do like upsert statements in sql using `on conflict`, or writing a task queue on the cheap using `for update skip locked limit 1`. Replace 90% of your crud api controllers with `postgrest` and use row level security for everything, AI is not going to steal your job category theory is. From a operations and scaling perspective it pays dividend to learn about `explain analyse` but did you know they also have `stats` that can help the analyser optimise queries based on statistical relations of different columns. I've also been looking into how to "branch" a database instead if just doing a backup, it does require some ZFS tricks but that is also just pure power and not something you find with a cloud file system. There are also just a ton of extension for timeseries or vector embeddings, or write your own in Rust using `pgx`. Like I said way too much to write here, I don't use all of this in production but I just keep finding these gems while working on my own project. There is a strong OSS ecosystem with lots of teams giving there own spin on pg and that is welcomed if you have a opinion of your own and still want to learn from others. You need motivation to go this deep but in contrast to other esoteric knowledge I have mastered companies are also willing to pay for it.

kiwicopple · on April 9, 2023

for what it's worth, what you describe is the architecture philosophy of supabase (disclosure: i'm the ceo)

supabase is essentially a Postgres database with PostgREST on top, and we recommend pushing down a lot of the logic and security into the database. We took this philosophy with our pg_graphql extension (which uses pgx) and it is faster than other graphql implementations, simply it's co-located with your data, solving the n+1 problem.

pl_rust just reached 1.0, and it is now a "trusted language" so you can expect to see it arriving on a few cloud providers soon. We are releasing something this week with the RDS team which will make it easier to write key parts of your application code in trusted languages. There are certainly trade-offs, and I don't know if _everything_ should be in the database. But in data-intensive cases it makes a lot of sense.

plagiarist · on April 9, 2023

What's the security model in PostgREST? I'm imagining it is called from your backend as a convenience vs. having a database connection library, so not typically exposed to public users of a website?

steve-chavez · on April 9, 2023

It's usually exposed to public users. The security model is mostly based on two things:

- JWT is used to authenticate API requests. The JWT contains a `role` claim which is a PostgreSQL role that is then used for the duration of the request. This role is subject to regular PostgreSQL security, be it table, column or row-level security[1].

- You expose a subset of your database objects for your API schema. This schema consists of views and functions(or only functions) to hide internal details to API users[2].

[1]: https://postgrest.org/en/stable/auth.html#authentication-seq...

[2]: https://postgrest.org/en/stable/schema_structure.html#schema...

jrumbut · on April 9, 2023

Yeah a number of pieces of software people use after a 5 minute tutorial are like this, a ton of depth that we frequently recreate crappy versions of within our apps.

The Apache webserver (and most other webservers) is another example like Postgres where it has a ton of useful features no one uses.

Some of these less popular features don't work at massive scale but in my experience they work great for smaller teams where adding a new tech to the stack is more painful.

bjornsing · on April 9, 2023

> I've also been looking into how to "branch" a database instead if just doing a backup, it does require some ZFS tricks but that is also just pure power and not something you find with a cloud file system.

In the cloud you can just snapshot/clone the underlying EBS volume and you’ll get a “branch” of any database on any file system, right?

muskmusk · on April 9, 2023

Assuming no-one is using the database yes. Also not trivial to set up and use plus vm startup is somewhat slow.

If your want a full featured product (in the postgres space) that does it then look at neon.tech (no affiliation).

Foobar8568 · on April 9, 2023

Can you explain what you mean by "how to branch a database" and the ZFS trick?

adamcharnock · on April 9, 2023

When I read this is thought it was a cool idea. I suspect it involves using ZFS’s file deduplication/copy-on-write mechanism to maintain a lightweight copy of a database on disk. Memory will still be required for holding the indexes etc, but in theory minimal extra disk space should be needed.

ZFS can also perform compression, which could be a big win as long as you have the spare cycles for lz4 (which is normally the case but may be an issue if you’re maxing out your NVMe disks’ bandwidth, say 3GiB/s+)

juancn · on April 9, 2023

Databases are worth the time to understand deeply. Postgres in particular is a magnificent beast.

quanticle · on April 9, 2023

Is it just me, or does Ed Huang skip over the most important part of database design: actually making sure the database has stored the data?

I read to the end of the article, and while having a database as a serverless collection of microservices deployed to a cloud provider might be useful, it ultimately will be useless if this swarm approach doesn't give me any guarantees about how or if my data actually makes it onto persistent storage at some point. I was expecting a discussion of the challenges and pitfalls involved in ensuring that a cloud of microservices can concurrently access a common data store (whether that's a physical disk on a server or a S3 bucket), without stomping on each other, but that seemed to be entirely missing from the post.

Performance and scalability are fine, but when it comes to databases, they're of secondary importance to ensuring that the developer has a good understanding of when and if their data has been safely stored.

audioheavy · on April 12, 2023

Excellent point. Many discussions here do not emphasize transactional guarantees enough, and most developers writing front-ends should not have to worry about programming to address high write contention and concurrency to avoid data anomalies.

As an industry, we've progressed quite a bit from accepting data isolation level compromises like "eventual consistency" in NoSQL, cloud, and serverless databases. The database I work with (Fauna) implements a distributed transaction engine inspired by the Calvin consensus protocol that guarantees strictly serializable writes over disparate globally deployed replicas. Both Calvin and Spanner implement such guarantees (in significantly different ways) but Fauna is more of a turn-key, low Ops service.

Again, to disclaim, I work for Fauna, but we've proven that you can accomplish this without having to worry about managing clusters, replication, partitioning strategies, etc. In today's serverless world, spending time managing database clusters manually involves a lot of undifferentiated, costly heavy lifting. YMMV.

rockwotj · on April 9, 2023

I agree that actually persisting data reliability is tablestakes for a database, which I would assume Ed takes for granted this needs to work. Obviously lots of non trivial stuff there but this post seems to be more about database product direction than the nitty gritty technical details talking about fsync, filesystems, etc

mamcx · on April 9, 2023

Also, most of the "action" on this sphere is for the "super-rich" customer: Assume it has more than 1 machine, lots of RAM, fast i/o & fast networks, etc. And this means: It run on AWS or other "super-rich" environment.

There, you can $$$ your way out of data corruption. You can even loss all the data if you have enough replicas and backups.

Not many are in the game of Sqlite.

This is the space I wish to work more: I think not only mean you can do better the high-end but is more practical all around: If you commit to a DB that depends of running in the cloud (to mask is not that fast, to mask is not that reliable, for extract more $$$ from customers, mostly) then when you NEED to have a portion of that data locally, you are screwed and then, you use sqlite!

quanticle · on April 9, 2023

    There, you can $$$ your way out of data corruption. You can even loss all the data if you have enough replicas and backups.

That's absolutely not true. All the money and all the backups and redundancy in the world won't save you if the data doesn't make it to persistent storage. Even in a totally closed AWS environment, the fallacies of distributed computing [1] still hold. Was there a network connectivity glitch? A latency spike? What happens when two connections attempt to write to the common data store at the same time?

You can't buy your way out of having to deal with the fundamental problem of, "How do I provide the illusion of a single unified system for a highly distributed swarm of microservices?"

[1]: https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

mamcx · on April 9, 2023

> That's absolutely not true.

In the absolute case, yes. But large companies have lost a lot of data, and thanks of $$$ they survive that.

In that scenarios, you pay a lot to not get in problems, but you can pay to overcome them too ("pay bigly for lawyers to follow the law and not get in trouble, and to broke the law and get away with it").

Is not the ideal and exist a point in where that could break badly, but with certain size fatal software failures that will destroy a small company are just "Thursday" for somebody big.

BTW: I don't like this, I prefer to make software solid, but everybody run on C, Js, MongoDb, etc, and this show you can survive a massive crash...

vkakade · on April 9, 2023

I would also add that the databases in 2020s will be written in Rust, rather than C/C++. The safety guarantees Rust provides makes the development process faster, as well as results is clean code that is easier to understand and extend.

jandrewrogers · on April 9, 2023

Modern database architectures have memory safety models that the Rust compiler currently can’t reason about. Hence why new database kernels are still written in C++. It isn’t because C++ is great (it is not) but because there are proven high-performance safety models in databases that Rust has difficulty expressing without a lot of “unsafe”.

Generally speaking, modern database architectures have minimal dynamic memory allocation, no multi-threaded code to speak of, or even buffer overflow issues (everything internally is built for paged memory). Where are these advantages from Rust supposed to come from? It has a lot of disadvantages for databases today, and I say that as someone who has used Rust in this space. People who build this kind of software are pragmatic and Rust doesn’t deliver for database developers currently.

indogooner · on April 9, 2023

Can you point to some resources (discussions/blogs/papers) on these? I was under impression that the reason recent database kernels are in C++ is because the authors are more proficient in it which should change as Rust becomes more popular.

jandrewrogers · on April 10, 2023

There is an element of truth that C++ is used because it is the default language for people that work on database kernels professionally but that doesn't tell the whole story. I've been party to multiple attempts to port or implement modern database kernel designs in Rust by people that also work on the C++ kernels. There are ordinary language friction issues since Rust is not as expressive as C++ for the kind of low-level memory manipulation commonly found inside database kernels, but that's not what's limiting use.

Two core design elements of modern database kernels create most of the real challenges. First, all references to most objects are implicitly mutable and some references are not observable by the compiler; safety can only be guaranteed dynamically at runtime by the scheduler. Major performance optimizations rely on the implications of this. Second, your runtime data structures don't have lifetimes in the ordinary sense because they all live in explicitly paged memory -- they aren't even guaranteed to have a memory address, never mind a stable one. And you want this to all be zero-copy, because performance. There are elegant ways of doing this transparently in C++ with a bit of low-level trickery, but it is antithetical to the way Rust wants you to manipulate memory. There are workaround options in Rust of course but they are strictly worse than just doing it in C++.

Database kernels are an edge case. Most systems software doesn't break most assumptions about ownership and lifetimes so pervasively. I expect Rust will add features over time that help in these cases.

breck · on April 9, 2023

I have seen some data that makes me think you may be right. I haven't looked at database projects yet but looking at the code bases of other programming languages and big systems (such as Linux), I see the Rust file count going up. That being said, I did recently look at SQLite and that is still all C. It's on my todo list to look this up for all the major open source DBs.

xiphias2 · on April 9, 2023

I was thinking of giving AutoGPT a try to convert SQLite (or Redis) to Rust, because it has a lot of tests anyways, so it would be fun. I'm still waiting for GPT-4 API access though :(

gxt · on April 9, 2023

Working on it. Just wish LLVM was natively written in rust too, it's easy to segfault when doing something unexpected.

mamcx · on April 9, 2023

Same.

Even if some aspect of Rust safety can't cover the complications of the low-level stuff, is easier to do that on Rust than with C/c++.

studmuffin650 · on April 9, 2023

My 2 cents is that modern C++ (c++11 and greater) removes most of the memory safety problems that people historically have used as example for the language being dangerous or bad. I would actually predict that modern databases will probably continue to be built in C++ but using a modern version of the language, providing memory safety and high performance.

perfectspiral · on April 9, 2023

I dont agree with the premise that running transactional and analytical workloads on the same database is architecturally “simpler”. In my experience this is only true at very low scale and those contexts are already sufficiently well served by existing database tech.

davidgomes · on April 9, 2023

Can you elaborate? I am curious :)

perfectspiral · on April 9, 2023

At a certain amount of scale its net “simpler” if the analytics people are free to do their jobs without worrying they will bring down “the app” and vice versa.

OLAP has certainly become overcomplicated but collapsing everything into big monolith dbs again is an over correction IMO.

davidgomes · on April 9, 2023

Yeah, that’s interesting. I think both SingleStoreDB and Rockset have come up with solutions to this problem (by separating the compute). But I understand that most database providers do not have a mechanism for the analytics people to not bring down the app accidentally.

[1]: https://www.singlestore.com/blog/distributed-sql-workspaces-...

[2]: https://www.rockset.com/blog/introducing-compute-compute-sep...

FridgeSeal · on April 10, 2023

To be fair, the OP is part of PingCap, who build TiKV/TiDB, which is built from the ground up to support both workloads, they call their database “Hybrid Transactional Analytical” database, which is explicitly designed to support both workloads without either side “treading on each other”, and without needing ETL pipelines gluing stuff together.

breck · on April 9, 2023

I'm working on a new Git backed file based Database for knowledge bases. Not designed for domains where you can confidently predict your schemas ahead of time but instead designed for use cases where you have large complex schemas that are frequently changing.

It simply wasn't possible a few years ago (SSDs weren't fast enough and Git was too slow for projects with huge number of files).

I'm having fun with it.

Current version is just written in Javascript but if there demand hits a higher level would likely write a version in Rust or Go.

If anyone has any pointers to similar projects I'm all ears.

FridgeSeal · on April 10, 2023

Have you looked at Dolt? Or LakeFS? Both of these are taking the “hot for data” approach too.

breck · on April 10, 2023

I had not seen LakeFS. Thanks of the pointer!

I have been following Dolt. Been a while since I took a look. Seems like they are making progress!

refset · on April 9, 2023

Interesting read! I would like to add:

* databases need to get better yet at schema management and workload isolation to enable multiple applications to properly integrate through the database (as traditionally envisioned)

* HTAP seems inevitable but needs built-in support for row-level history/versioning to get maximum benefit

* databases should abstract over raw compute infrastructure efficiently enough that you don't need k8s to run your application logic and APIs elsewhere. The database should be a decent all-in-one place to build & ship stuff

paulddraper · on April 9, 2023

I'm surprised there isn't a "serverless" PostgreSQL. That seems like it would get more bank for buck then writing a cloud native DB from scratch.

(Or maybe there is one but I don't know about it.)

AWS made a serverless MySQL.

esperent · on April 9, 2023

A quick Google search suggests there are many serverless PostGres implementations, including an AWS version.

antifa · on April 10, 2023

I wish GCP would implement one.

paulddraper · on April 9, 2023

Ah I'm out of date then

dragonwriter · on April 9, 2023

> I'm surprised there isn't a "serverless" PostgreSQL

There are many.

> AWS made a serverless MySQL.

AWS Aurora Serverless has both MySQL and Postgres flavors.

samieljabali · on April 12, 2023

planetscale.com

adam_gyroscope · on April 9, 2023

bit.io

jansommer · on April 9, 2023

neon.tech