"A while" is underselling it. As long as you have people who are half-decent with SQL, "just put Postgres on a big db server" will get you to 50 million row tables before you have to start thinking about even hiring a real DBA.
50M is something that can easily be handled by a single dev who understands that sql is more than select, insert and update, armed with the manual, google and chatgpt.
You can get really damn far with a fat postgres box.
The problem with these discussions is that you can always get a lot out of any given architecture, structure, db approach or whatnot. Can. There's just a lot of daylight between "can" and "likely will."
Ultimately, everything has its limitations and tradeoffs. If we respect them, it's generally a smooth ride. Problem is that we rarely do... within a company (startup or otherwise) under real conditions. There's also a dynamic where we build until the point where something stops us. Tech debt, complexity, over-engineering, under-engineering, feature bloat or antagonism between early decisions and current goals.
There's a self-regulating aspect to this: if the architecture is spot on, perfect for the task at hand, we just move faster toward the point where it no longer is.
My point is that postgres is, compared to almost everything else, easy to get from "can" to "likely will" with just somebody with a brain, a manual and google. In absolute terms it of course depends, but the point is relative.
A brain, a manual and google is generally a pretty good way to make solid decisions and respect the limits of your chosen stack.
What happens when there is >1 brain involved... or when brainless, manualless decisions eventually get made... or two pivots from now...
I'm not disagreeing with your approach. I agree with it, especially as starting point. I'm cautioning that resilience against complexity isn't about how easy it is to make good decisions when you understand the spec, read the manual and calmly proceed. Complexity and fragility accumulate when one or all of these are absent. How easily you can (and thus inevitably will) make a mess... not how easily you can keep it clean.
In IRL situations with regular rdbmses, a very common trend seems to be long-term drift between schema and spec. The flexibility and approachability of postgres eventually enables a lot of kludge.
Data stores have this dichotomy between "look how easy" and "is limiting factor" that speaks to difficulties we don't know how to articulate or isolate.
Of course if there are lots of bozos in the startup then even getting to and maintaining 50 million rows is going to be very difficult.
But if you have a small group of folks that are at least as competent as the folks at WhatsApp pre-acquisition, then there really shouldn't be any doubt whatsoever.
> a small group of folks that are at least as competent as the folks at WhatsApp pre-acquisition
That's a success case... beware survivorship bias.
> if there are lots of bozos in the startup
No arguing that quality engineers are fundamental to quality engineering. That said... by this standard, there's no point in having this entire discussion. Every good db/store out there is good. They all work very well if used as they should be, with due respect to tradeoffs. Yet, almost everyone has db problems. Almost every one of these problems occurs well within the technical limits of postgres or whatnot.
"It shouldn't be a problem" when it usually is irl is tunnel vision. There is an empirical reality disagreeing with you. Walking into it with "this shouldn't be a problem unless everyone is a moron" is bad strategy. If you can't think of reasons why architecture can and will become a problem, then just assume that you (or some of you, some of the time) are morons, and try to make it moron proof.
I've been on the sysadmin, development, and hiring sides, and with data models at scales of 50M+ records, devs who "understand that sql is more than select [...]" are rare in my experience; they're effectively a cross between db admins and developers.
Administering databases (in particular, query planning and production operations) with table sizes on the order of 10M records and up is challenging, and requires a skill set that is very different from pure development.
One won't get "really damn far with a fat postgres box", unless they're doing very simple SELECTs, which is not the case with modern web apps.
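To make that concrete, here's a made-up sketch of the kind of query and index work I mean (schema and numbers invented for illustration):

    -- Schema invented purely for illustration.
    CREATE TABLE orders (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint NOT NULL,
        status      text NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now(),
        total_cents bigint NOT NULL
    );

    -- A "very simple SELECT" by primary key stays fast at any table size:
    SELECT * FROM orders WHERE id = 42;

    -- The queries a real app runs are a different story. Without a matching
    -- index this is a sequential scan over every row, and reading the plan
    -- below is exactly the DBA-ish skill set I'm talking about:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT customer_id, count(*) AS n_orders, sum(total_cents) AS revenue
    FROM orders
    WHERE status = 'pending'
      AND created_at > now() - interval '30 days'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 20;

    -- The usual fix is a targeted (here, partial) index, not a new database:
    CREATE INDEX orders_pending_recent_idx
        ON orders (created_at, customer_id)
        WHERE status = 'pending';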
> Administering databases (in particular, query planning and production operations) with table sizes on the order of 10M records and up is challenging, and requires a skill set that is very different from pure development.
Is it more or less challenging than the alternatives? Is it less challenging enough to add a new tech to your stack, add the required knowledge to the team, etc?
I mean, knowing "enough-to-perform-CRUD" SQL is table stakes for developing on the back-end, but knowing $CURRENT-FLAVOUR-NOSQL (of which there are multiple products, all with such substantial differences that there is no knowledge transfer between using them) isn't, so there's going to be ramp-up time for every dev, and then every dev that is added to the team.
I'm not disputing your argument, I'm just pointing out that, sometimes, it's easier and faster to upskill your PostgreSQL developer to "scale the DB" than it is to teach them how to properly use, maintain, architect and code for DynamoDB and others.
It's not just $CURRENT-FLAVOUR-NOSQL. It's also doing custom transactions and locking on top and/or thinking in terms of eventual consistency. It's so, so much more complex than just SELECT FOR UPDATE/BEGIN TRANSACTION.
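For anyone who hasn't had to build it by hand, this is the entirety of what you'd otherwise be reimplementing (and getting subtly wrong) on top of an eventually consistent store. Schema made up for illustration:

    BEGIN;

    -- Lock both rows up front so concurrent transfers can't interleave:
    SELECT id, balance FROM accounts WHERE id IN (1, 2) FOR UPDATE;

    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;

    -- Both changes become visible atomically; locks are released.
    -- Atomicity, isolation and rollback are the database's problem, not yours.
    COMMIT;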
It's not even funny how many software engineers just don't know that SQL is crazy fast and performant if you have a basic understanding of it. I was once refactoring (or, rather, getting rid of) a microservice that was just JSON blob storage on top of Postgres - no schema for the blobs, hundreds of thousands of them, no indices - and the main complaint was that it was slow.
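A sketch of what the fix boils down to in that kind of situation (table and key names invented, not the actual service):

    -- jsonb instead of text/json enables indexing and fast operators:
    ALTER TABLE blobs ALTER COLUMN payload TYPE jsonb USING payload::jsonb;

    -- GIN index for arbitrary containment queries on the blob:
    CREATE INDEX blobs_payload_gin ON blobs USING gin (payload);
    SELECT * FROM blobs WHERE payload @> '{"customer_id": 123}';

    -- Or a plain btree expression index on the one key you always filter by:
    CREATE INDEX blobs_customer_idx ON blobs ((payload ->> 'customer_id'));
    SELECT * FROM blobs WHERE payload ->> 'customer_id' = '123';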
Unpopular opinion: if you're skimping on ops/DBA resources (as you may need to do in a startup), then MySQL is a better default. By all means use postgres if your use case demands it, but personally I find the ops story for MySQL takes less engineering overhead.
Yes, and most successful companies who started in the last ~20 years started (and many continue) with a monolith and a MySQL database.
Only the mega-cap ones started to pursue other options, mostly due to their type of business and bucket-loads of "free" VC money with explicit orders to burn it and get "unicorn" status - which involves hiring thousands of developers in record time, at which point the whole thing turns into a zoo. That's an organizational problem, mostly not a tech one.
Other than the ones we pretend are the whole Universe, there are thousands and thousands of medium to big companies with billions of revenue who started their product with a monolith and a MySQL database and many still do just that.
I agree with this "unpopular opinion". I've worked with both MySQL- and postgres-based mid-scale apps with several thousand users. Postgres is so deeply lauded here on HN, yet it requires two more orders of magnitude of operations work to keep up and running. Vacuuming sucks hard.
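To give a flavour of what that ops work looks like: per-table autovacuum tuning like the below becomes routine once tables get hot (table name and values are illustrative, not a recommendation):

    -- Defaults wait for ~20% dead rows before vacuuming; on a hot 50M-row
    -- table that's 10M dead tuples of bloat. Typical per-table override:
    ALTER TABLE events SET (
        autovacuum_vacuum_scale_factor  = 0.02,  -- kick in at ~2% dead rows
        autovacuum_analyze_scale_factor = 0.02,
        autovacuum_vacuum_cost_limit    = 2000   -- let the worker do more I/O per round
    );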
That is a very specific use case, and might only be a small subset of your actual data. If you don't have these specific requirements (e.g. CRUD apps), you can save yourself a lot of unnecessary headaches by defaulting to MySQL.
My main point is attempting to counter the narrative popular on HN that postgres should be an automatic default. For sure there are many aspects in which postgres is superior, I absolutely do not debate that, especially when it comes to developer experience. But there is much more to it than that when it comes to delivering business value. That's where ops and DBA concerns start to matter, and IMO MySQL is so far ahead in this regard that it outweighs all the other hideous warts of working with it, when you consider the bigger picture of the business as a whole.
The problem isn't storing or inserting 50M rows, it's querying 50M rows in non-trivial ways. And the difference in performance between doing that 'right' and 'wrong' is orders of magnitude.
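A trivial illustration of 'right' vs 'wrong' on the same 50M rows (table and column names invented):

    -- 'Wrong': the function call on the column defeats a plain index,
    -- so every one of the 50M rows gets scanned.
    SELECT count(*) FROM events
    WHERE date_trunc('day', created_at) = '2024-01-15';

    -- 'Right': a range predicate on the raw column can use an ordinary
    -- btree index and touches only the matching rows.
    CREATE INDEX events_created_at_idx ON events (created_at);

    SELECT count(*) FROM events
    WHERE created_at >= '2024-01-15'
      AND created_at <  '2024-01-16';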
Eh, intelligent table design should knock most of that out. If you've got an 8-page query implementing a naïve solution to the knapsack problem (I've seen this in the wild), several mistakes have been made.
I've served 1M users off a database where just one of many similarly sized tables had 300M rows... in 2007... on a single box with spinning rust in it. Heck, my laptop could handle 100x the production load.
It amazes me that my comment (while admittedly flippant) got voted down.
It really is true that your phone can update a 50M row table about 10K times per second!
That people are incredulous of this is in itself a stunning admission that developers these days don't have the faintest idea what computers can or cannot actually do.
Just run the numbers: 50M rows with a generous 1 KB per row is 50 GB. My iPhone has 1TB of flash storage with a random access latency of something like 50 microseconds, which at even modest queue depths works out to on the order of 200K IOPS. An ordinary NVMe laptop SSD can now do 2M. Writing even 10K random locations every second is well within mobile device capability, with 50% headroom to "scale". At 1 KB per row, this is just 10 MB/s, which is hilariously low compared to the device peak throughput of easily a few GB/s.
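If you'd rather measure than downvote, it's an easy experiment on any box with psql (rough sketch only; your numbers will vary with hardware, fsync and checkpoint settings):

    -- 50M rows at roughly 1 KB each, as in the estimate above:
    CREATE TABLE t (id bigint PRIMARY KEY, payload text);
    INSERT INTO t
    SELECT i, repeat('x', 1000)
    FROM generate_series(1, 50000000) AS i;

    \timing on

    -- One batch of ~10K random-row updates (duplicates possible, so slightly
    -- fewer in practice). A single transaction is the optimistic case;
    -- 10K separate commits will cost more fsyncs.
    UPDATE t SET payload = repeat('y', 1000)
    WHERE id IN (SELECT 1 + (random() * 49999999)::bigint
                 FROM generate_series(1, 10000));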
In practice it's usually not that good: PostgreSQL writes data in pages (8 KB by default), and changing 10K random rows in a 50M-row table can be quite close to the worst case of 1 changed page per changed row, so 8x your estimate. You also need to multiply by 2x to account for WAL writes, plus indexes. It's not hard to hit a throughput limit, especially with HDDs or networked storage. Although local SSDs are crazy fast indeed.
Agreed: 80MB/s for the random 8K page updates. However, transaction logs in modern databases are committed to disk in batches, and each log entry is smaller than a page size. So a nice round number would be 100 MB/s for both.[1]
For comparison, that's about 1 gigabit per second, in an era when 200 Gbps networking is becoming common. It's also a small fraction of the SSD write throughput of any modern device, mobile or not. Nobody in their right mind would use HDD storage if scaling was in any way a concern.
[1] Indexes add some overhead to this, obviously, but tend to be smaller than the underlying tables.