I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses int32 for transaction IDs, and IDs are only reclaimed by a vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down several days waiting for Postgres to clean things up.
Even though it’s a well documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern could accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.
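For anyone curious what "keeping an eye on it" looks like in practice, here's a minimal monitoring sketch (the connection string is made up and psycopg2 is just the driver I'm assuming; the query itself is stock Postgres):

```python
# Rough monitoring sketch, assuming psycopg2 and a hypothetical DSN.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# age(datfrozenxid) = how many transactions old the oldest unfrozen xid is.
# Autovacuum starts aggressive anti-wraparound vacuums once this passes
# autovacuum_freeze_max_age (200 million by default); at roughly 2 billion
# Postgres stops accepting new write transactions.
cur.execute("""
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC
""")
for datname, xid_age in cur.fetchall():
    pct = 100 * xid_age / 2_000_000_000
    print(f"{datname}: xid age {xid_age:,} ({pct:.1f}% of the way to trouble)")
```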
Whatever the issue is, I’m wishing the engineers working on it all the best.
I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?
I know we're already in the weeds, but Malcolm Gladwell’s “Revisionist History” did an excellent episode on the stuck accelerator problem, basically showing it almost certainly didn't happen.
OP specifically pointed out that it's well documented. Every program makes tradeoffs, and I don't think anyone in this thread implied that because Postgres made this one it's unstable or bad.
The specific behavior of Postgres isn't the FUD, the FUD is the speculation that this is what brought down Roblox for multiple days. As others have mentioned, we don't even know if Roblox uses Postgres, and yet we're diving deep on how an edge case of Postgres brought down Roblox. It's the speculation that I think is FUD.
Roblox has 43+ million daily active users. The issue you are wildly speculating about is small potatoes at their size and scale. I guarantee you they dealt with that potential issue (if they even use Postgres) years ago.
I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured corrupted critical configuration data, which then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All the other services would then have to restore their systems in a cascaded fashion while sorting out complex dependencies and constraints, which could take days.
It's an edge case: it only happens _if_ you either have very long-running transactions holding xids open, or you haven't vacuumed your database enough and it has to do a blocking vacuum to advance the oldest xid before it wraps around.
Most Postgres databases wrap around with no issues.
The problem with increasing the size is that a pair of xids (xmin and xmax) is stored in every row header, so you're doubling that overhead if you go to 64 bits.
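To put rough numbers on that (back-of-envelope only; the 4-byte xmin/xmax figure is the current tuple header layout, and this ignores alignment and padding):

```python
# Every heap row header stores two 4-byte xids (xmin and xmax); widening both
# to 8 bytes adds 8 bytes per row before any alignment/padding effects.
extra_bytes_per_row = 2 * (8 - 4)

for rows in (10**6, 10**9, 10**11):
    extra_gib = rows * extra_bytes_per_row / 1024**3
    print(f"{rows:,} rows -> ~{extra_gib:,.1f} GiB extra header overhead")
```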
IDK about Postgres internals, but typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted.
That's assuming a lot, including that the binaries aren't 64 bit already (a bit unlikely nowadays), and the database wouldn't just use a 32 bit datatype for this specific purpose in this specific configuration. (If this issue has anything to do with transaction IDs at all, as covered elsewhere.)
They're not wrong. It's not the software that's the issue, it's the data stored on disk. The transaction ID is stored in the data for various reasons.
In theory it wouldn't be too hard to change nowadays if you use logical replication to upgrade the database, but it'd be a huge undertaking for a lot of companies.
But you wrote "that's assuming a lot, including that the binaries aren't 64 bit already" as a response to "typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted". Personally I definitely didn't read the latter as "recompile binaries from 32 bits to 64 bits". (I'm assuming that's what you meant by "not 64 bit already"?) More like "recompile 64 bit binaries to use different internal and external data structures".
No, I can't be positive, but I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86, or armv8/arm64 instead of armv7... At least that's how I took it, because not many things are compile-time constants nowadays, especially for databases. It doesn't distribute well.
The size of Postgres XIDs is in fact a compile-time constant.
If you’re wondering why it’s one: 1. it affects the size of a core data structure (that also gets serialized to disk and read back directly into said structure), and 2. basically nobody who has this problem (already a vanishingly small number of people) solves it by changing this constant, since you only have this problem at scale, and changing it would make your scale problems worse (and also make all your existing data unreadable by the new daemon.)
Also, FYI, Postgres uses compile-time constants for things that one might want to change much more frequently (though still in the “vanishingly unlikely” realm), e.g. WAL segment sizes. When you change internal constants like this, it’s expected that you’re changing them for every member of the cluster, or more likely building a new cluster from the ground up that is tuned in this strange way, and importing your data into it. “Doesn’t distribute well” never really enters into it.
Well, I stand corrected. I thought OP did not know that (given they said they didn't know anything about postgres and were talking about "recompiling all binaries"), but reading their posting history they probably do.
With "doesn't distribute well" by the way I meant that it doesn't distribute well as program binaries, not across a cluster. It used to be extremely common to recompile e.g. your Linux kernel, nowadays almost nobody does that unless there are some very specific needs. Of course, building a specialized postgres cluster for exceptional scale would easily qualify.
Yes, you are right. I was not talking about switching to x86-64, but the fact that most code on 64 bit platforms still uses 32 bit integers and that you'll have to recompile to get 64 bit ints.
This shows up sometimes also in numerical programming, e.g. when your meshing ends up producing more than ~4.2 billion grid points. But it is quite rare, precisely because UINT32_MAX is a fairly huge number.
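A quick back-of-envelope shows how fast a structured 3D mesh gets there (nothing here is specific to any particular code):

```python
# How big a structured 3D grid has to be before 32-bit node indices overflow.
UINT32_MAX = 2**32 - 1  # 4_294_967_295

for n in (1024, 1625, 1626, 2048):
    points = n**3
    verdict = "fits in uint32" if points <= UINT32_MAX else "overflows uint32"
    print(f"{n}^3 grid = {points:,} points -> {verdict}")
```

A cube of roughly 1626 points per side is already past the limit, which is why big meshing and CFD runs end up needing 64-bit index builds.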
Because OP said they didn't know anything about Postgres and were speaking in generic terms, and the way they said "switching to int64" and recompiling all binaries just struck me as "if you want a 64 bit system, you need to recompile it" (which is true). I'm surprised it's a compile-time constant, as I just learned.
Reading their post history now they probably do know what they're talking about, though.
There's a reason for even database servers with 32-bit transaction IDs to be 64-bit processes - performance and memory. AFAIK until Firebird 3.0, Firebird also had 64-bit server builds with 32-bit transaction IDs - it would have broken many external tools had the old on-disk format been changed.
Off topic: why does something that everyone here recommends all the time need something like vacuum, which is heavy and can fail? Why don't people complain about that as much as they do about the Python GIL? It felt like a hack 15+ years ago and it feels just weird now. I'm curious why that was never changed; it's obviously hard, but is it not regarded as a great pain and a priority to resolve?
This is a hell of a comparison to try to make. How many people do you think regularly run into this kind of issue in Postgres? How many people do you think regularly run into the Python GIL?
Even if we assume your premise that vacuum is at least as bad as the GIL is correct, this would easily explain it.
Add in the sheer difference in expectations between a programming language and a database... and well, I don't think there's any mystery here at all. There's easily multiple great explanations, even if your premise is correct.
A company's internal PKI infrastructure wouldn't be responsible for issuing a public-facing certificate. They literally can't sign those -- a real CA has to do it.
You are of course correct, but usually public and private would reuse some core components of the infra (e.g. you still need to store the signed key pair somewhere safe). I'm speculating here, but given how long it's been down, some very core and very difficult-to-recover service must have failed. Security infra tends to have those properties.
Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.
That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution, once things are back up find the root cause, fix the root cause, and perform a thorough analysis of events to ensure it won't happen again.
Depending on the level of automation, that may not be possible. That's like saying that if a factory line robot fails, "you just bypass the line and manually weld those car bodies".
I’m just speculating. I didn’t do any in depth research - none of the articles or tweets by Roblox I saw offered anything more than “an internal issue”.
It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.
CockroachDB is an example of what a modern database should be like.
I doubt their homepage is being served from the same DB as their game platform.
I think throwing together a static page better than "we're making the game more awesome" would be simple. It kinda makes me wonder if it's an internal auth/secret issue as has been speculated. That could theoretically make it harder to update the website, especially if it's deployed by CI/CD.
This is a different issue. You ran out of IDs for your row identifier. The PostgreSQL issue is that every transaction has an ID; normally this is a fairly invisible internal thing you don't need to worry about, but in some edge cases this transaction ID can cause problems. You could, in theory, run into this issue with just one row.
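If it helps to see the contrast, here's a sketch of the "row identifier" version of the problem (table and sequence names are hypothetical; assumes psycopg2):

```python
# Check how much headroom a plain serial (int4) primary key has left.
import psycopg2

INT32_MAX = 2**31 - 1

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT last_value FROM orders_id_seq")  # hypothetical sequence
(last_value,) = cur.fetchone()
print(f"{INT32_MAX - last_value:,} ids left before the serial column overflows")
```

The xid issue is about the transaction counter rather than any column, which is why even a near-empty database can hit it.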
I don't know how MySQL or MariaDB handles this; AFAIK it doesn't have this issue.
The lack of communication for an outage this big is absolutely shameful. I put this on the leadership, not any of the engineers working round the clock. Having been in the middle of a critical service outage that lasted over 24 hours I totally get the craziness of the situation, but Roblox seriously needs to revisit their incident management and customer update process. Even though kids are the main consumers of the app, the near total silence speaks volumes about their business's lack of preparedness for disaster scenarios. If nothing else, I hope they'll see this as an opportunity to learn and do better next time.
What else would you expect when it comes to communication? They posted a status page and said they're working on it. Do you want to be part of their internal chat and see exactly what they're investigating?
Would posting "still working on it" every 2h really have made it better?
I think you have a kernel of truth in your statement, but perhaps you are overlooking how silly it is for someone's well-being to rely on a couple of weeks of video game microtransactions.
My heart goes out to the Devs in this category, but hopefully this is further impetus not to stake too much on tech services.
It is business relationship like any other. There's always some risk with investing and relying on a business partner, but it's also reasonable to have expectations of cooperation and good faith.
100%. Really, I'm more expressing my desire to live in a society which doesn't feel a constant (simulated) risk just to exist. This economy is just a game we play, and I think a lot of people don't know this, and also don't want to be part of it.
I think that to be working at the top of the hierarchy of needs (in game development, for instance), people should demonstrably master the lower levels of the pyramid. We really can live in a way where this is possible. We just have to collectively want it.
In my mind, that reinforces my point. One ought to mitigate risk to oneself by minimizing the amount of human infrastructure one needs to keep life going.
It might be a game for kids, but their company is publicly traded and valued higher than Electronic Arts (more than Ubisoft and Take-Two Interactive combined).
How much growth and revenue would they have missed had they allocated more resources towards better security and recovery plans before this, instead of better features?
As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.
> As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.
Huh? So shareholders who literally own a piece of the company don't have the right to complain about poor management, operational incompetence, company strategy, etc.?
That's not how it works. Shareholders actually have greater legal rights than customers even though both, in practice, are usually fairly limited when it comes to situations like this.
There are multiple studios and independent game developers who rely on a consistent and predictable player count as a means of income. Multiple days of lost revenue, whilst not the end of the world, is certainly not ideal, and Roblox Corp should be providing regular updates.
Roblox made their developer tooling require you to be always online and signed in, even though it doesn't actually need this to function. This means that all development workflows have been bricked during this outage too. https://twitter.com/RBXStatus/status/1454815143607607300/pho...
This guy's too humble. He's Matheus Valadares, and he's the creator of Agar.io [1]. This is the game that pretty much started the ".io game" genre [2].
[2]: "Around 2015 a multiplayer game, Agar.io, spawned many other games with a similar playstyle and .io domain". See https://en.wikipedia.org/wiki/.io
Absolutely? How many people are in these Discords, and where do their players fall in the database sharding key? Because thundering herd is definitely a problem when letting people back onto a system, and, oh, also we have no idea what the underlying issue actually is because Roblox has been basically silent this entire time. (Unsubstantiated internet rumor about it being the secrets store also doesn't count.)
That's a very long outage, I wonder how this happened.
- Perhaps it's internal systems they've developed, and the people who created them have left. So it's not just "fix the thing", but first understand what the thing is doing and then fix it.
- Data recovery can take forever if you run into edge cases with your databases
Anyone found any articles about their architecture?
Oof (from the link). It's a debugging nightmare to have a fairly inexperienced team trying to diagnose stuff in these incredibly complex systems.
> 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes across 22 clusters, serving 420+ internal developers
> "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."
When I saw "secret store" I guessed it had to be Vault. Vault is amazing, but it lets you configure things that can blow up on you in X time. For example, issuing 50 secrets per second but having every secret expire after a week (or never). It would mean (multiple) goroutines per secret checking status on the lease.
This kind of thing, unfortunately, is easy to miss and easy to hit in Vault.
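Taking those hypothetical numbers at face value (this says nothing about how Vault actually tracks leases internally), the back-of-envelope is sobering:

```python
# 50 leases issued per second, each with a one-week TTL.
leases_per_second = 50
ttl_seconds = 7 * 24 * 3600  # one week = 604,800 seconds

steady_state_leases = leases_per_second * ttl_seconds
print(f"{steady_state_leases:,} leases alive at steady state")  # 30,240,000
```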
Secret expiration seems like a better thing to do from the application: check the token every time it's used, and if it's past a week mark it as expired. Combine this with caching, etc. Is there an advantage to having such a system in the database?
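As a purely illustrative sketch of that application-side approach (no particular secret store assumed):

```python
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # "past a week", per the comment above

def is_expired(issued_at, now=None):
    """True once a token is more than a week old."""
    if now is None:
        now = time.time()
    return now - issued_at > MAX_AGE_SECONDS

# At each use: look the token up (local cache or database), call
# is_expired(token.issued_at), and reject/delete it if True.
```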
The person mentioning the secret store, and the blog post they put up, do seem to somewhat point to Vault, but there's no name drop.
> A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load. This was not due to any peak in external traffic or any particular experience. Rather the failure was caused by the growth in the number of servers in our datacenters. The result was that most services at Roblox were unable to effectively communicate and deploy.
As long as we're speculating... One of the few things I can think of that can't reasonably be sped up is data integrity recovery. Say some data got into an inconsistent state and now they have to manually restore a whole bunch of financial transactions or something before opening the game up again, because otherwise customers would get very mad about missing stuff they've paid for, traded, etc.
If they were to resume the game before fixing these issues, they would only be exacerbated, with state moving even further from where it was originally.
Been checking the #roblox hashtag on Twitter and the two main themes are addicts going through withdrawal and devs saying how they wouldn't let their llama appreciation fan site be down this long, let alone their core business.
Saudi Aramco is estimated to be worth $2t-$10t (it's privately held, so no one knows) and 35,000 of their computers got their MBRs and MFTs (extrapolating) nuked in a political sting operation in 2012. It's handwavily rumored that the HDD supply chain shortages at the time (remember the floods in Thailand?) were partially from Saudi Aramco momentarily redirecting chunks of the world's HDD manufacturing capacity to fix it.
Colonial Pipeline was down for five days after a ransomware attack; you may remember the gas hoarding that occurred back in May as a result... and the many ridiculous videos on the internet of people doing things like trying to fill up the back of pickup trucks with gas.
Adding to the speculation here, I'm willing to bet some component of their issue is not entirely technical. Regardless of the underlying cause (PKI was mentioned), for downtime to last this long it almost definitely means some persistent data was lost or corrupted. Of course they can recover from a backup (I'm confident they have clean backups) but what does that mean for the business? "We irrecoverably lost 12 hours of data" could have severe implications, for example legal or compliance risks.
Why are you confident they have clean backups? It's been my experience that backup infrastructure is usually not given much thought, and engineers infrequently test that recovery from backups works as expected. Not saying that's what it is, but I'm not sure it can be ruled out.
I wonder how the market is going to react when it opens tomorrow. I am thinking a quick scalp with weekly $RBLX puts, then when it recovers double up on cheap long call options.
There are trading bots that react to news articles, so it's not unreasonable to see significant swings, especially in options activity, this week. They also have earnings coming up.
In the immediate short term, yes, obviously - stress, long hours, etc. In the immediate aftermath, probably too: Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".
But if the organization is functional, in the medium term, this may also mean staffing understaffed teams, hiring SREs, etc. - which can mean less stress, no more 24/7 pager duty, better pay etc.
> Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".
When something like this threatens to end the entire party (no one pays or get paid) you god damn want to figure out why and not make it happen again. That’s not burdensome, that’s business.
Who wants to bet there will be a Kubernetes versus Nomad/Consul debate coming to a Roblox meeting soon? I'd like to hear what HashiCorp has to say here.
From TFA - “We believe we have identified an underlying internal cause of the outage with no evidence of an external intrusion,” says a Roblox spokesperson.
Thanks, but can we trust any company in this situation to be transparent, given that such an attack would also mean that user data could have been leaked?
Reports of IT incidents from public companies tend to be obscure, vague, missing, or technically true but misleading, but they almost universally stop short of outright lying. (Mainly because, for public company officers, lying to shareholders, i.e. the general public, carries harsher personal risks and consequences than the company losing lots of money or even shutting down; being incompetent is completely legal, but lying to your shareholders about material aspects of company finances is a crime and can also make them personally financially liable.) So if they say that they have seen no evidence of an external intrusion, I would presume it's definitely not ransomware, where the signs of external intrusion - namely, the ransom demand - would be obvious and undeniable.
Perhaps (though IMHO not that likely) it's some other kind of attack, e.g. one intended to secretly steal customer data, which doesn't leave signs of external intrusion if the company doesn't look for them much (and it might have motivation to not look very hard). But if it was ransomware, I'm quite sure they would not say what they said.
It's possible, but that'd be a theory entirely founded in paranoia, as it both needs no evidence (we have no sign it's an attack) and accepts no counter-evidence (they have said it is not an attack).
So while it's always a possibility, it's also kinda pointless to wonder about until there's supporting evidence.
Two day outage of a product the size of Roblox is unusual too. I hope they publish a postmortem - my guess is some kind of database issue to have stayed down this long.
It feels like some kind of catastrophic data loss. I can’t imagine that app servers or network infrastructure could have been the root cause, especially because they are running on AWS and there haven’t been any reports of outages or other customers being impacted. Restoring an old backup and rebuilding data from logs seems like the only thing that could take so long. That, or an entirely dysfunctional IT org that can’t get out of its own way in a crisis.
"Roblox is very popular, especially with kids — more than 50 percent of Roblox players are under the age of 13. More than 40 million people play it daily" ....from misleading logical non sequitur to parroting Roblox marketing numbers in under 30 words, nice. Verge is such trash