How's D1 meant to be used, since it has a very small maximum size (100 MB, I believe)? Should I create one database per user, for example? Genuine question.
The maximum size will increase to at least 1 GB in the near future, enabled by our recent rewrite of the underlying storage layer. Perhaps we can push it even further; we'll see.
But yes, I think the next step is then some sort of sharding. Sharding by user would be an obvious approach for many apps. I think we should build a framework to help manage this, so apps would only need to provide some callbacks e.g. to compute shard key for a particular query.
Alternatively, apps that want full control will be able to use Durable Objects directly. We'll soon (in a few months, probably?) enable the new storage engine for all Durable Objects which means every object will have a private SQLite database.
(I'm the tech lead for Workers in general, and currently focused on this project in particular.)
do you think you will have any API-driven access to the data held in these services? I.e., I don't necessarily want to have to use a Worker to mediate between the Cloudflare platform and HTTP access to the data within Cloudflare systems.
Much like R2 has an S3-compatible API as well as a way to retrieve files outside of Workers (IIRC), will this be true of all forms of storage on the platform?
Sometimes it'd be nice to prime the cache in KV, for example, without having to invoke a Worker.
> do you think you will have any API-driven access to the data held in these services? I.e., I don't necessarily want to have to use a Worker to mediate between the Cloudflare platform and HTTP access to the data within Cloudflare systems.
For D1, yes — a native HTTP API is on the roadmap. KV has an HTTP API as well, so you can write directly.
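For example, a minimal sketch of a direct KV write using Python's requests library; the account, namespace, and token values are placeholders, and the endpoint shape is Cloudflare's v4 REST API:

    import requests  # third-party HTTP client

    ACCOUNT_ID = "your-account-id"         # placeholder
    NAMESPACE_ID = "your-kv-namespace-id"  # placeholder
    API_TOKEN = "your-api-token"           # placeholder

    def kv_put(key: str, value: bytes) -> None:
        # Write a KV pair directly over HTTP, no Worker in the path.
        url = (f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
               f"/storage/kv/namespaces/{NAMESPACE_ID}/values/{key}")
        resp = requests.put(url, data=value,
                            headers={"Authorization": f"Bearer {API_TOKEN}"})
        resp.raise_for_status()

    kv_put("greeting", b"written without invoking a Worker")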
Not really sure how you distinguish between runtime and something else. It seems to me that you could provision a "shard" as simply as executing in a worker:
What I mean is that workers cannot create databases. You have to create each database using wrangler (or the UI) and then edit the configuration file to add a binding. The binding becomes a variable in the JavaScript environment, which means you have to change the source code every time a new database is created.
Therefore, creating a database per user as part of the user signup flow is not possible.
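For readers unfamiliar with the setup: a D1 binding lives in the Worker's wrangler.toml and looks roughly like this, so provisioning a new database means editing config and redeploying:

    [[d1_databases]]
    binding = "DB"                # becomes env.DB inside the Worker
    database_name = "customer-db"
    database_id = "..."           # ID printed by `wrangler d1 create`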
Ah, I see. I thought it would be possible to dynamically set a secret which would effectively be a database uri. But upon further investigation, it does indeed need a redeploy.
I suppose you could make something to write the bindings to the toml and trigger a redeploy, but that's definitely not pretty.
Thanks for sharing this; that makes a lot of sense to me. The focus on running everything on the edge has been confusing to me for a while due to the increased db latency. It makes so much sense to dynamically move processing to the best location in the network instead of the binary choice between "close to db" and "close to users".
For our product (SQL your cloud & SaaS resources as a service), we want to consider D1, but for security we would want a database per tenant. Would that model work for Cloudflare? 1000 databases for 1000 tenants. Or should we use something else with shared tenancy?
Not quite sure I understand "SQL your cloud" (using their AWS account?), but from what I can guess given that you're considering D1: I'd wager you want something like Neon, but self-hosted. I am building a similar product where I give each customer a full-blown Postgres (without individual extensions), and that is the solution I landed on with the most flexibility and security after testing out SQLite VM per user (via Fly.io), SQLite on the edge (Worker + S3, before D1 existed), and Postgres cluster per user (via Stackgres).
The nice part is the decoupling. User not using their DB for 5 minutes? Time to idle them and use that compute for another customer.
And it's also really nice that storage becomes way way way easier. Storage is the biggest hurdle when your database count scales with your user count so you can instead build out the compute system once, fix a few issues along the way and from then on just babysit a single big storage cluster.
And if that becomes too big at some point, replicate it, shard the tenants by region or whatever, and you "only" have to manage storage a dozen times or so. It's obviously always annoying, but at least the setup is similar enough that you can park one FTE per storage cluster and serve enough customers to make it worth it.
Edge does make a lot of sense, but only if you're actually talking about physical computers that need to exist at physical locations to do no-network/embedded-systems/RTOS work without the cloud, paired with a cloud metrics or aggregation layer (or other apps) that explicitly does not need to be on a nearby computer to help the users, and often serves as a control plane for a set of edge resources. Think Tesla wanting to do BI on every car vs. having the car actually work for the user when they decide to drive through the network-disconnected wilderness.
Server side rendering or “single optimization” as the “edge” is kind of an abuse of terms.
I extensively used SQLite in a telemetry system for an electric race car. The car has an onboard computer, first a Raspberry Pi, then a dual-core Arm processor. Onboard code logs ~4000 messages a second into three SQLite databases. After a drive session a script merges the three databases into a single SQLite session log. The session log is decoded on a different computer to ~400 columns of time-series data, again stored in SQLite. This data is then finally converted into a proprietary format for visual analysis.
Prior to SQLite I tried CSV files and raw binary formats. None of them could match the flexibility, ease of use, and throughput of SQLite. My setup has been in use for several weeks now and has processed numerous rows of data. As the SQLite project outlines on its website:
> SQLite does not compete with client/server databases. SQLite competes with fopen().
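The merge step described above maps naturally onto SQLite's ATTACH; a minimal sketch in Python, assuming each per-bus database has an identically shaped messages table (all file and table names here are made up):

    import sqlite3

    dst = sqlite3.connect("session.db")
    dst.execute("CREATE TABLE IF NOT EXISTS messages"
                "(ts REAL, bus TEXT, msg_id INTEGER, payload BLOB)")
    for path in ("can0.db", "can1.db", "uart.db"):
        dst.execute("ATTACH DATABASE ? AS src", (path,))
        dst.execute("INSERT INTO messages SELECT * FROM src.messages")
        dst.commit()                        # must commit before DETACH
        dst.execute("DETACH DATABASE src")
    dst.close()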
Others have asked, but I'd also be very interested in knowing more if you can share more details.
I had to spend a non-trivial amount of time convincing others to use SQLite instead of dumb files in similar situations; stories like yours are pure gold.
There are a lot of SQLite praises in this thread, on the internet, and on SQLite's website, so here are a couple of the reasons that drove me to choose it. My experience with traditional formats like CSV or raw binary is that they have a somewhat lower overhead to "get working" but are quite difficult to scale and adapt to changing needs, a phase most projects are bound to go through.
SQLite's reputation and adoption make me comfortable assuming its operational correctness, so I can focus on the data itself, not reader/writer logic for it. Even though a SQLite database's schema may change, I'm glad knowing that the database file can still be easily opened and read many years into the future, something that custom formats may not guarantee.
On CSV: Syntax and library differences aside, SQLite almost works like a drop-in replacement for CSV, since a single-table database can essentially be treated as a glorified spreadsheet/CSV if you wish. My post-processing data pipeline is built with pandas, so it is virtually the same for me to read a SQLite database vs. a CSV file.
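To illustrate the "virtually the same" point (file and table names here are hypothetical):

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("session.db")
    df = pd.read_sql_query("SELECT * FROM messages", con)  # SQLite route
    # df = pd.read_csv("session.csv")                      # CSV route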
I've had a huge amount of success using SQLite to collect and analyze logs. The obvious approach would have been to dump enormous CSV or JSON files, but doing it in SQLite gave me (a) much better reliability, and (b) the ability to come up with post-hoc analysis approaches without changing the main process.
> Onboard code logs ~4000 messages a second into three SQLite databases. After a drive session a script merges the three databases into a single SQLite session log.
A bit unrelated, but curious as to why you wrote to three separate databases only later to merge them.
Could be three different processes interacting with different systems. Say one process listening to a CAN bus, another to local Ethernet[1] and a third to some I2C/SPI sensors. If they don't interact otherwise, which they probably wouldn't given they're just logging telemetry, it's more flexible to just have them as separate processes. The I2C/SPI might be a Python script, the others in C/C++ or Rust say, whatever is easier.
Precisely so! There are two CAN buses and one UART port to log from, and the Python code spawns subprocesses for each of them. They each create or open a SQLite database named after their bus/port and get to work.
This setup is easy to put together but might not be optimal. I don't have a lot of experience with SQLite performance tuning, and I wonder whether it would be faster to have worker threads pass everything through IPC to a writer thread, which batches rows and writes to a centralized database.
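A rough sketch of that writer-thread idea, reusing the hypothetical messages table from earlier: worker threads enqueue rows, and one thread drains the queue in batches so each commit covers many rows.

    import queue
    import sqlite3
    import threading

    q = queue.Queue()  # workers call q.put((ts, bus, msg_id, payload))

    def writer(db_path, batch_size=500):
        con = sqlite3.connect(db_path)
        con.execute("PRAGMA journal_mode=WAL")
        con.execute("CREATE TABLE IF NOT EXISTS messages"
                    "(ts REAL, bus TEXT, msg_id INTEGER, payload BLOB)")
        while True:
            rows = [q.get()]                     # block for the first row
            while len(rows) < batch_size:
                try:
                    rows.append(q.get_nowait())  # drain whatever else is waiting
                except queue.Empty:
                    break
            con.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)
            con.commit()                         # one commit per batch, not per row

    threading.Thread(target=writer, args=("session.db",), daemon=True).start()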
Good question. As a non-racer who enjoys watching racing, I'd imagine at least some accelerometers/gyros and load cells if they've got them, and possibly Pitot tubes or similar for measuring aerodynamics when testing, to get a detailed understanding of the forces on the car and how it behaves. Motor currents are probably up there too, to catch current spikes.
> detailed understanding of the forces on the car and how it behaves
But at 4000 samples per second? You'd think twenty would be plenty.
NyQuist: 4000 samples per second means we are looking to get information about an up to 1800 Hz signal. A6 on the piano keyboard (excluding harmonics).
Sorry, I mean Nyquist. I was confused by NyQuil, the night time flu medicine.
A sound wave moves 85 mm in that time at Mach 1, around 340 m/s. That doesn't mean we need to sample a sound that has no components above 20 Hz, using 4000 samples per second.
The propagation speed can be misleading.
All that matters is whether there are oscillations in the suspension that go up to 1800 Hz, not how fast the car is going forward. If there are, those would have to be harmonics, which you'd think would be beyond the frequency response of the suspension. Suspensions are heavy, bulky components and are heavily damped (which is a kind of low-pass filtering).
If the suspension moves anywhere near 17 mm per sample (250 km/h vertically), that car is in serious trouble.
> All that matters is whether there are oscillations in the suspension that go up to 1800 Hz
Surely you care about more than just oscillations, for example how exactly the suspension compresses during braking. If you've seen slow-mo shots of high-speed race cars riding a curb, it can be quite violent with lots of movement.
In this article[1] about Formula 1, they state they sample vibration data at 200 kHz. This is then filtered to a lower rate for logging, but getting, say, 5 kHz out of 200 kHz raw sensor data doesn't seem unreasonable to me.
They also mention they collect about 30 MB per lap of sensor data from more than 250 sensors, and laps are typically around 1.5-2 minutes long. If one assumes a 2-minute lap and 250 sensors, that's 1 kB/s per sensor on average.
> how exactly the suspension compresses during braking
The entire graph of suspension vs. time should be well below that frequency range. It doesn't matter whether we are looking for frequency-domain features or time-domain features.
(Like someone said upthread, it's probably 4000 events per second as an aggregate from numerous sensors, multiplexed into one SQLite database. Or maybe even a total across the three databases.)
I found this page[1] which mentions dynoing shocks at 3 inches per second, as that's what one could see in the pits. If you want some spatio-temporal resolution for that, then you'll want a fairly decent sample rate, no?
Maybe not quite 4 kSps is needed, but for vibration and such it'd make sense to me.
3 inches per second is very slow. I think shocks compress quite a bit faster when you hit a speedbump in a parking lot.
The sample rate you need to reconstruct the motion of shocks is entirely determined by Nyquist, just like sampling audio or any other signal.
You need some 2.2x the highest frequency you want to capture, and make sure you filter out anything above that (if it exists).
If there is nothing above 20 Hz in the suspension's movement, or nothing you're interested in, then you need a 44 Hz sample rate. (More if you implement oversampling, but not 10 times more let alone 100.)
All that and more! Sensors like shock potentiometers, strain gauges, temperatures from all over the car, IMU readings ([ag][xyz]), and electrical measurements like HV/LV voltage, current, power, energy, as well as controls like motor commands and speeds, controller states, etc.
Lots of folks put work into the car, and they all have metrics of interest to tell how their systems perform. An ideal telemetry system for us captures all this information and makes it available both in real time and for later analysis.
I’m going to take a guess that the rationale is file locking. If you have three separate processes using three separate database files, you can write all the data you want, as fast as you can, without worrying about file locking. SQLite can handle three processes writing at the same time, but it can be faster to do it split into separate files and then merge after the fact.
One of SQLite's weaknesses is it's not client-server, so it's ill-advised to have multiple applications try to use a single sqlite server. Concurrency is another issue.
One of SQLite's strengths is that it's not client-server, so there's no single sqlite server. It is simply not possible to have multiple applications try to use a single sqlite server, because you cannot use what does not exist. No sane person would ever say "it's ill-advised to have multiple ambassadors try to speak to the King of the USA", and likewise no sane person should ever say "it's ill-advised to have multiple applications try to use a single sqlite server".
It is however possible to have multiple applications access the same sqlite database on the same filesystem (i.e. locally, not via network access, not via remounts or symlinks or other forms of synonyms) and it will sort out the locking for you. Depending on what you need to do, there's I think three possible strategies you can use so it can probably handle your concurrency needs faster than Postgres via the network can. For instance, this particular task is write-heavy, and it would benefit from a different setting than a read-heavy application.
There are certainly occasions when Postgres or MySQL or MS SQL or Oracle is a better choice than SQLite, and sometimes concurrency contributes to that decision. But it's unlikely to be the case that you had an SQLite solution that worked really well with one process reading and writing from the database, and now all of a sudden you need a second or third application to use the database concurrently with the first, and you find you need to switch to a client-server model. And even if you sometimes need to take specific steps (like adopting a different locking model), you also have to take specific steps to run Postgres (e.g. using a pool to speed up connections and limit the number of concurrent connections, because each connection is a separate process). It's a matter of knowing your tools.
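For what it's worth, the write-heavy case usually comes down to turning on WAL and setting a busy timeout; a minimal sketch (settings are illustrative, not a recommendation):

    import sqlite3

    con = sqlite3.connect("app.db", timeout=5.0)  # wait up to 5 s on a locked db
    con.execute("PRAGMA journal_mode=WAL")        # readers stop blocking the writer
    con.execute("PRAGMA synchronous=NORMAL")      # common WAL pairing; trades a little durability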
> None of them could match the flexibility, ease of use, and throughput of SQLite.
It sounds unbelievable that you couldn't append a CSV or binary record to a file via an open file descriptor more efficiently than inserting into SQLite.
(It's clear that to have the data already in SQLite form saves post-race steps.)
The onboard code processes binary data: CAN bus messages and the like, along with their non-binary metadata. The first version of the code used CSV because it is so trivial to read and write, but CSV as a text format required me to either hex- or base64-encode the bytes, which incurred a non-trivial waste of time and space. For a custom binary record I would need to write custom (de)serializers, which I wasn't willing to do given frequent changes in the types of data I needed to record during development.
SQLite is the only solution for me so far that provides a structured data store for mixed data types, is power-loss safe (important for an embedded system), and is a high-quality and portable industry standard. Let alone the SQL language itself, which has proven hugely useful for first-pass post-processing given the amount of data I ingest.
Perhaps there are other technologies that would work just as well or better: maybe Protocol Buffers and the like? Yet SQLite currently ticks all my boxes and I'm quite happy with it.
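The "no hex/base64 detour" point is just that SQLite has a first-class BLOB type; a tiny sketch, with a schema invented for illustration:

    import sqlite3
    import time

    con = sqlite3.connect("telemetry.db")
    con.execute("CREATE TABLE IF NOT EXISTS can_log"
                "(ts REAL, can_id INTEGER, payload BLOB)")
    con.execute("INSERT INTO can_log VALUES (?, ?, ?)",
                (time.time(), 0x123, b"\x01\x02\x03\x04"))  # raw bytes, stored as-is
    con.commit()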
It's an AND statement, not an OR statement. You can append to a CSV file efficiently, but flexibility is context-dependent and ease of use is very subjective, and we can assume the GP knows their own use case.
Sadly my team is pretty light on public tech blogs, but there are plenty of projects like this one! Copying from my comment above:
> The system is for a Formula SAE (FSAE) Electric style race car. FSAE is a collegiate competition, and I'm part of a US Pacific Northwest team. Our car uses a 550 V battery and is 4WD. Most of our electronics and firmware/software are in-house besides motors and inverters. I lead the telemetry project and work with 1-2 other members on the system.
> The system consists of an on-vehicle computer (Raspberry Pi or NXP i.MX) and a ground station (Rockchip RK3588S). We use Python heavily and the real-time dashboard part is done with InfluxDB and Grafana. Both computers and any user devices are on a Wi-Fi network made with a long-range WISP access point.
> Our team is unfortunately very light on public tech blogs, and how I wish I could change that... There is plenty of information on both FSAE electric cars and vehicle telemetry on the internet, and it was a source of inspiration as I designed my own solution.
Interesting! I'm currently working on a system that writes time-series data to raw binary files, but we're considering switching to a different file format for the same reasons. Have you considered any other formats, such as HDF5?
The one potential requirement I'll caution you on is resilience against write failures, in case you're collecting time-series data and can't afford to lose a "session" or spend time messing with recovery options. HDF5 is not made for that. Binary and SQLite are better in that respect. SQLite wins on usability against binary and HDF5.
I'm not familiar with HDF5 but agree on resiliency. We are a collegiate racing team and our car's power rails aren't stable and redundant at all times, so power loss failure is something I've kept in mind from day one on my telemetry project. SQLite is generally equipped to handle power losses and write thread crashes:
> An SQLite database is highly resistant to corruption. If an application crash, or an operating-system crash, or even a power failure occurs in the middle of a transaction...
Quite often after a run the entire car is turned off and, on next power up, the databases are left as .db and .db-journal files. The code has no problem processing or even continuing on logging with DBs in this state.
HDF5 is more for storing and exchanging numerical simulation data. It doesn't have to be resilient to write failures because worst-case scenario you rerun the simulation or try again to copy the data into the file.
SQLite commits are atomic transactions: changes are staged via a rollback journal or WAL so that the commit takes effect at a single well-defined moment. That's why interrupting it in the middle of a write doesn't result in partial data or a corrupted index.
I'd love to know more about this electric race car, the telemetry system, the technology behind it, etc. Do you have a blog (or can you recommend a place) where you go deeper on this subject matter?
The system is for a Formula SAE (FSAE) Electric style race car. FSAE is a collegiate competition, and I'm part of a US Pacific Northwest team. Our car uses a 550 V battery and is 4WD. Most of our electronics and firmware/software are in-house besides motors and inverters. I lead the telemetry project and work with 1-2 other members on the system.
The system consists of an on-vehicle computer (Raspberry Pi or NXP i.MX) and a ground station (Rockchip RK3588S). We use Python heavily and the real-time dashboard part is done with InfluxDB and Grafana. Both computers and any user devices are on a Wi-Fi network made with a long-range WISP access point.
Our team is unfortunately very light on public tech blogs, and how I wish I could change that... There is plenty of information on both FSAE electric cars and vehicle telemetry on the internet, and it was a source of inspiration as I designed my own solution.
Awesome. If you don't mind me asking, what sort of dashboarding/visualization software do you use? Highchart-style homebrewed web apps? Some closed-source auto software? What's that world look like? Always been curious about it from a front-end perspective.
Real-time data is fed into an InfluxDB instance and shown on Grafana hosted on the ground station. This data is downsampled to cope with the volume of raw traffic. We also tried some solutions used by other teams, like KX [0], but decided to go with a FOSS stack. A post-processing pipeline converts the original data to the (unfortunately proprietary) format used by AEMdata [1], a data analysis tool from our ECU vendor that is used somewhat commonly in the racing scene.
For a tenanted SaaS app, SQLite at the edge is a really compelling architecture. You have a single DB per customer/company/group with all their users working against that. It can operate at the edge, closest to where the majority of the customer's users are.
SQLite could scale to even quite large customers with this.
Another really compelling architecture is a DB per user, with partial/selective sync between the nodes. If you then couple this with a "local first" design, the "edge" just becomes another local deployment that the user's db can sync against. Collaborative apps, where users have their own documents but can also share/fork them, would align well with this.
I believe starting with "local first" and using the edge for sync and "online only" modes is going to become the default for a significant number of apps moving forward. SQLite with CRDT-based syncing and merge-conflict resolution is the way to do this. There are a couple of exciting projects working on this:
In broad strokes, having 100k SQLite dbs is a bit like having 100k git repositories. It's doable, and it's mostly a matter of having tooling to manage it.
It actually makes a few things simpler, say, progressively rolling out schema upgrades. You need to write schema migrations anyway; this lets you upgrade one customer at a time.
Yea, tooling and stuff is pretty much what I doubt scales. I'm sure a machine can handle it. It will just be very hard to manage because of the lack of tooling.
> Yea, but does it scale? Yes, for 100 users maybe. But you have 100 000 small sqlite db:s
If it's a case of 1 node per customer (node being vm, lambda, cloudflare worker, whatever) then there should be no limit.
Maybe you're thinking of a more "traditional" approach with a server process managing 1 sqlite db per customer, which might make sense in order to keep cloud costs down. Even in that case it should be trivially easy to distribute the load.
So store SQLite on S3. Start a new lambda (or similar) per user, download the db from S3, write to it, and upload it again. And make sure there can never be more than one function running per user (to avoid two lambdas editing the same database). Seems a bit fragile, but I might be wrong.
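Roughly, that flow would look like this (a sketch using boto3; the bucket, key, and table names are made up, and the "only one function per user" locking is the hard part that's hand-waved here):

    import sqlite3
    import tempfile
    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")
    BUCKET = "per-user-dbs"  # hypothetical bucket

    def bump_counter(user_id):
        key = f"{user_id}.db"
        with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
            s3.download_file(BUCKET, key, tmp.name)  # pull the user's database
            con = sqlite3.connect(tmp.name)
            con.execute("UPDATE counters SET n = n + 1")
            con.commit()
            con.close()
            s3.upload_file(tmp.name, BUCKET, key)    # push it back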
> It’s borderline impossible to compare it against networked database management systems like MySQL or Postgres, because SQLite is a library that operates on a local file — it bypasses all the costs incurred by the network, layers of serialization and deserialization, authentication, authorization, and more.
Postgres can run locally, communicating via a Unix socket. You should try benchmarking this before stating that it's so much slower than SQLite.
Even with a Unix socket there will be at least one additional copy across the kernel-userspace boundary compared to SQLite (and likely at least one copy inside the kernel too, from one socket buffer to another). Yes, a Unix socket has lower overhead than a TCP one, but it's still not free:
kernel (filesystem) -> app with SQLite library
kernel (filesystem) -> Postgres -> kernel -> app with Postgres client
If Postgres reads data from its own cache (not from disk), the chain will be one step shorter: Postgres -> kernel -> client, but the same is true for SQLite's own cache.
If mmap is used (makes sense for small data sets cached in RAM), then reading data with SQLite can be zero-copy [1].
Benchmark results depend heavily on the use case: neither is better for all workloads, but at least for some of them SQLite is faster.
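(For reference, the mmap mode mentioned above is a per-connection pragma; a one-liner sketch:)

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("PRAGMA mmap_size = 268435456")  # memory-map up to 256 MiB of the file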
No matter what, PostgreSQL can't (trivially) run in-memory only, and even Unix sockets have a performance cost compared to a library running in the same address space.
I've done the SQLite benchmark a few times. You can insert somewhere in the range of 10-20k rows per second if you are using NVMe (i.e. ~50 µs per row). Requires some pragmas (e.g. WAL). This is 100% serialized throughput.
No clue what Postgres would manage, but I suspect it would be about an order of magnitude higher latency in the happy case.
> No clue what Postgres would manage, but I suspect it would be about an order of magnitude higher latency in the happy case.
Unless you’re talking 1 versus 10 microseconds (or less), I don’t think Postgres will have an order of magnitude higher latency. And if we are talking this range, why would it matter for a web app where the client’s latency is almost certainly >1 millisecond?
I'm pretty sure SQLite is capable of significantly higher throughput than 10-20k rows per second, depending on the workload. Inserts can be much faster if they are batched in large transactions and prepared statements are used to avoid SQL parsing for each row. This only works, of course, if your workload can be batched.
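A quick way to see the batching effect for yourself (throughput numbers will vary wildly with hardware and pragmas):

    import sqlite3
    import time

    con = sqlite3.connect("bench.db")
    con.execute("PRAGMA journal_mode=WAL")
    con.execute("CREATE TABLE IF NOT EXISTS t (a INTEGER, b TEXT)")

    rows = [(i, "x" * 16) for i in range(100_000)]
    start = time.perf_counter()
    with con:  # one transaction for the whole batch
        con.executemany("INSERT INTO t VALUES (?, ?)", rows)  # prepared once, reused
    print(f"{len(rows) / (time.perf_counter() - start):,.0f} rows/s")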
I don’t disagree with this. I disagree with whether these microseconds of difference matter when you’re serving a web page with milliseconds of latency.
Someone should benchmark that. Unix sockets in my mind should be faster as in theory you have less overhead. But the flow using TCP over loopback might have received many more optimizations over the last decades.
$ pgbench -h localhost -p 5432 -b select-only -T 60 -c 10 -j 2 bench
Password:
pgbench (15.3 (Ubuntu 15.3-1.pgdg22.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 10370147
number of failed transactions: 0 (0.000%)
latency average = 0.058 ms
initial connection time = 54.881 ms
tps = 172993.008029 (without initial connection time)
$ pgbench -h /var/run/postgresql -p 5432 -b select-only -T 60 -c 10 -j 2 bench
pgbench (15.3 (Ubuntu 15.3-1.pgdg22.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 16890415
number of failed transactions: 0 (0.000%)
latency average = 0.036 ms
initial connection time = 7.186 ms
tps = 281540.260418 (without initial connection time)
YMMV depending on your workload, but Unix sockets should always be significantly faster.
I understood "networked database" to mean a client-server DB architecture. So that Postgres for example is running locally, but communicating with a local server.
It's so insane that he links to a whole other post to clarify what he means by "the edge", and he still struggles to state it in the first 5 paragraphs of it.
Well, the post "about the edge" also links to some announcement post where they "tried" to talk about the edge in more detail. Yet it's not until something like paragraph 67 of that third post that they mention anything about "the edge".
No foreign keys by default. Using them means that every connecting session must toggle it on. Forget one and it will be free to violate referential integrity.
One writer. Any session that issues BEGIN TRANSACTION and then hangs halts all DML.
WAL mode confusion. WAL cannot safely be used on network filesystems, and it breaks ACID on ATTACHed databases, among other problems.
Date and time types don't really exist. There are functions to assemble your own, but it does require some thought. The ODBC driver for SQLite does have options to "emulate" this.
Length specifications on a column are ignored. CHAR(2) will allow the insertion of a blob. I think that check constraints could be used to enforce this.
Type affinity means that any data type can be inserted into columns declared as any other data type. Rigid type enforcement can be done, but it is not the default.
I believe there is a recent (optional) strict mode that will enforce types. I haven't used it so I don't know how full-featured it is. Your other points are of course correct
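For reference, it looks like this (STRICT needs SQLite 3.37+, and the CHECK constraint covers the CHAR(2) point from above):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE codes (
        code TEXT CHECK (length(code) = 2)  -- enforce what CHAR(2) merely suggests
    ) STRICT""")
    con.execute("INSERT INTO codes VALUES ('ok')")
    # con.execute("INSERT INTO codes VALUES (x'0102')")  # STRICT rejects a BLOB in a TEXT column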
Unfortunately, strict mode is incompatible with name-based type mapping: if you have a column that you want to be an RFC 3339 date-time (e.g. "2023-06-06T23:25:27+11:00"), for example, the only spelling permitted for the column type is “TEXT”, whereas without strict mode you can name it “DATETIME” or similar and popular libraries in various languages will automatically map that to the language’s datetime type (according to local conventions, not anything universally defined—e.g. https://docs.rs/sqlx/latest/sqlx/sqlite/types/index.html shows the built-in mappings the Rust sqlx library has decided to support). This destroys the ergonomics of strict mode for me, rendering strict mode quite undesirable, since it’s basically fighting with a more useful sort of strictness.
These are all fair points, though they mostly can be addressed at the application layer. One could, for example, use a wrapper that opens the sqlite library and immediately turns foreign keys on. One does this once per language in your project and never thinks about it again. Similarly, type constraints can be enforced at the application layer.
I agree that it would be nice if some of these things were built in but at least sqlite is reliable when configured and used in particular ways.
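That wrapper really is tiny; a minimal sketch in Python:

    import sqlite3

    def connect(path):
        # One chokepoint for connection setup, so no call site can forget it.
        con = sqlite3.connect(path)
        con.execute("PRAGMA foreign_keys = ON")  # off by default, per-connection
        return con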
> I want to hear "What's not great using SQLite compared to PostgreSQL" instead.
The context of this thread's article is about "edge computing" which means resource-constrained endpoints like IoT, mobile devices, cloud "workers" on CDN, etc.
Those scenarios will inherently favor a lightweight database like SQLite over a full-blown heavyweight RDBMS like PostgreSQL/MySQL. RDBMS engines have extra code and CPU/RAM requirements for handling multi-user concurrency, locking, etc. You don't need all that machinery for edge computing.
Another example of the above tradeoffs is smartphones like iPhone and Android. Their default persistence API frameworks in iOS/Android use embedded SQLite as the local backing database. It doesn't make sense for a battery-powered device to waste CPU and RAM on a multi-user PostgreSQL/MySQL engine when there's only a single user of a smartphone.
Personally I'm not yet convinced that the normal SQL representation of what a 'date' object is matches real-world use cases that well, or at least it doesn't cover all of them.
As a programmer, what I find I want is a 'moment' that retains the input specification. Possibly in a sanitized binary format that's not the literal text, but also isn't a single numeric value. E.g. a timezone specifier/city (not just one per TZ!), plus a human timespec (5 PM Tuesday, the whatever-th day of month in year), AND a pair of an evaluation-rules version and an 'effective UTC||TAI time' for comparisons. The evaluation version needn't be a full version; it might even just be a couple of bits at the top or bottom of a long-range 64- or 128-bit time number in some units. Something that can be used to determine whether a value still needs to be updated with the latest rules set, rather than using the cached representation.
> Offhand, Date types are really strings in SQLite.
It’s more that there’s no such thing as a date type, but that date and time functions can work with text or numbers, in a few different interpretations.
If you load the ODBC driver version of SQLite, there is an option to transparently convert the ODBC date time expressions into Oracle style dates.
It's an option for those who want easier handling.
"When the DSN Option "JDConv" (Julian Day conversion) is enabled
the SQLite 3 driver translates floating point column data
interpreted as Julian Day to/from SQL_DATE, SQL_TIME, and
SQL_TIMESTAMP data types (supported since May 2013)."
Good point / ideas about SQL date objects... but you wrote about date in response to a comment about datatypes. Is your point that datatypes more generally may be less useful than parent implied?
SQLite is not a good choice if you want to access a DB over the network from different hosts, or even from many processes on the same host (the latter may work well in some cases, but not all). Large write (insert/update) volume is also not the best load for SQLite. And of course, where replication is needed, I would prefer PostgreSQL or MySQL with mature replication support to things like SQLite+litestream.
If the best engineers from 20+ years ago were told that the future of computing would be running weird packed-up assembly-javascript in colos close to users, I wonder if they would just quit immediately and move to a farm
The “edge” for data storage doesn’t really make sense.
With SQLite, the data is stored very local to where it’s being accessed.
Either move it local to the user (SQLite with WASM/OPFS does this — this is mentioned at the end of the article, but that’s not the edge), or centralize it — for SQLite this means centralizing the execution of the code accessing the data too, since it’s designed for high granularity, low latency access, which, again, isn’t the edge.
SQLite is so good and convenient that you might still use it for some cases at the edge (e.g. some read-only or precomputed, predistributed chunk of data you want to access in a flexible way) though it would never be the only option for that kind of thing.
I think the read-only use case you mention is a compelling one. I’m working on a developer product that makes it easy to prematerialize data in the way you describe.
It works by consuming data out of upstream APIs and then publishes a new, faster version of that API in edge locations. The underlying data there is stored in SQLite (plus some custom bits), but that’s just an implementation detail.
This is where we are at. SQLite took care of us for ~8 years, and will certainly continue to do so in many cases going forward. But we are finding that as we scale up beyond customer #5, the idea of having tons of small databases scattered across our customer base becomes concerning.
I am looking at moving us towards 1 gigantic SQL Server Hyperscale DB for most things. Force our clients into a PaaS solution over time, enforce one standardized product model, etc.
Where you probably don't want to be is somewhere in the middle where half your data lives in some centralized place and the other half is in your edge/client nodes. Synchronizing across these domains can rapidly become a nightmare.
You are unfairly associating the inconvenience of "tons of small databases" with SQLite only because SQLite is good enough to make such an architecture practically plausible; it shouldn't be assumed that it is a good architecture in the first place, or that problems are SQLite's fault.
If you start having trouble "beyond customer #5", you have a problem of labor-intensive manual changes and insufficient automation, not a task that a different DBMS could do better.
I use it to provide persistence to an in-memory counter server. It is written to in batches of 4096 (25k/second total). To restart the server, the server gets a signal, stops listening, flushes the remaining writes to SQLite, and exits. On startup, it reads the SQLite database back and then starts processing requests again. It was a couple hundred lines of Go and makes the server much better operationally. It is not involved in request servicing, just this backup persistence.
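A toy sketch of that shutdown path (in Python rather than Go, with the listener logic elided and the upsert table invented for illustration):

    import signal
    import sqlite3
    import sys
    from collections import Counter

    pending = Counter()  # in-memory counts not yet persisted
    con = sqlite3.connect("counters.db")
    con.execute("CREATE TABLE IF NOT EXISTS counters (key TEXT PRIMARY KEY, n INTEGER)")

    def flush():
        with con:  # single transaction per batch
            con.executemany(
                "INSERT INTO counters VALUES (?, ?) "
                "ON CONFLICT(key) DO UPDATE SET n = n + excluded.n",
                pending.items())
        pending.clear()

    def shutdown(signum, frame):
        # stop listening (elided), flush the remaining writes, then exit
        flush()
        sys.exit(0)

    signal.signal(signal.SIGTERM, shutdown)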
Recently I got curious about SQLite in serverless settings and eventually figured out a way to combine GCP Cloud Run with Seafowl, an early stage database. SQLite has been quite helpful!
Did a write up [0] in case anyone is interested. (also about to submit it for my first Show HN.)
Interesting how little TFA says about libSQL, even though TFA is ostensibly about libSQL, and how little the comments here say (so far nothing) about libSQL. Does anyone use or care about libSQL?
Well, not only lately; SQLite has always been very popular.
My personal opinion is that the relational model is the best option for modeling data; any attempt to avoid it, replace it, or mimic it adds more complexity than it solves.
Linked tables (relations) are the ultimate data modeling tool.
"Arguably the main advantage of SQLite is that you don’t need to spend dozens of hours trying to set up a distributed system in order to even start using it."
Last time I set up MySQL it took me around 10 minutes. I don't know what this person is smoking.
Also, I think the idea of “edge” doesn’t make a ton of sense. What we really need is code and data that move around as needed for the best performance. See: https://blog.cloudflare.com/announcing-workers-smart-placeme...
What people call “edge” is a single optimization bringing code near the user. I think we should go far beyond that. Sure, use something like Cloudflare Workers for your “edge” needs (ie bringing your React app close to the end user/doing server side rendering). But don’t stop there because where data resides, what APIs you call are all going to matter.
That's the vision of something I called the "Supercloud": https://blog.cloudflare.com/welcome-to-the-supercloud-and-de...