Roblox has been down for days and it’s not because of Chipotle (theverge.com)
243 points by Terretta on Oct 31, 2021 | 169 comments


I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses 32-bit transaction IDs, and IDs are only reclaimed by a vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down several days waiting for Postgres to clean things up.
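
For anyone curious, the usual way to keep an eye on this is to watch age(datfrozenxid) per database. Here's a minimal sketch of that check (assuming psycopg2; the connection string and warning threshold are illustrative, not anything from our actual tooling):

    # Minimal sketch: see how far each database is from xid wraparound trouble.
    # Threshold and DSN are illustrative; tune to your own autovacuum settings.
    import psycopg2

    WARN_AGE = 1_000_000_000  # roughly halfway to the ~2.1B hard limit

    conn = psycopg2.connect("dbname=postgres")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT datname, age(datfrozenxid) AS xid_age
            FROM pg_database
            ORDER BY xid_age DESC
        """)
        for datname, xid_age in cur.fetchall():
            flag = "WARN" if xid_age > WARN_AGE else "ok"
            print(f"{flag:4} {datname}: {xid_age:,} xids old")
    conn.close()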

Even though it's a well documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern can accelerate things into the danger zone quite quickly. At Notion we had a scarily close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.

Whatever the issue is, I’m wishing the engineers working on it all the best.


I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?


Maybe their load balancer got hit by a 2009 Toyota Camry with a sticky accelerator pedal?


I know we're already in the weeds, but Malcolm Gladwell’s “Revisionist History” did an excellent episode on the stuck accelerator problem, basically showing it almost certainly didn't happen.

https://www.pushkin.fm/episode/blame-game/


It is a very unlikely Y2K bug


It's not odd at all!

Speculation is a useful intellectual exercise, and the sign of a healthy, intelligent and curious mind!

Speculation is also fun!


No, it's just FUD directed at Postgres.


OP specifically pointed out that it's well documented. Every program makes tradeoffs, and I don't think anyone in this thread implied that because Postgres made this one it's unstable or bad.


The specific behavior of Postgres isn't the FUD, the FUD is the speculation that this is what brought down Roblox for multiple days. As others have mentioned, we don't even know if Roblox uses Postgres, and yet we're diving deep on how an edge case of Postgres brought down Roblox. It's the speculation that I think is FUD.


I did too but it's pretty easy to find out which database(s) a company uses:

https://corp.roblox.com/careers/listing/?gh_jid=3363389


Somebody mentioned a secret store issue that affected all of their services.

https://news.ycombinator.com/item?id=29044500


Roblox has 43+ million daily active users. The issue you are wildly speculating about is minuscule compared to their size and scale. I guarantee you they dealt with that potential issue (if they're even using Postgres) years ago.


Do you have any reason to believe this is the case?


I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured led to corruption of critical configuration data, which then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All the other services would then have to restore their systems in a cascaded fashion while sorting out complex dependencies and constraints, which could take days.


If it's a known issue, is there no way to increase the transaction ID size?

Quite surprising a seemingly battle-tested database can choke in such a manner.


It's an edge case: it only happens _if_ you have either very long-running transactions holding old xids, or you haven't vacuumed your database enough and it has to do a blocking vacuum to move the oldest xid forward before the counter can safely wrap around.

Most postgres databases wrap around with no issues.

The problem with increasing the size is that a pair of xids is stored in every row header, so you'd double that overhead by going to 64 bits.
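
To put a rough number on that (the table size is hypothetical, just for scale, not something from the Postgres docs):

    # Back-of-the-envelope: extra storage if the per-row xid pair went 64-bit.
    # Real tuple headers have more fields and alignment padding; this is only
    # the delta from widening the two xids themselves.
    ROWS = 10_000_000_000            # a hypothetical 10-billion-row table
    XID_PAIR_32 = 2 * 4              # two 32-bit xids per row today
    XID_PAIR_64 = 2 * 8              # two 64-bit xids per row instead

    extra_bytes = ROWS * (XID_PAIR_64 - XID_PAIR_32)
    print(f"Extra storage just from wider xids: {extra_bytes / 1e9:.0f} GB")
    # -> Extra storage just from wider xids: 80 GB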


IDK about Postgres internals, but typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted.


That's assuming a lot, including that the binaries aren't 64 bit already (a bit unlikely nowadays), and the database wouldn't just use a 32 bit datatype for this specific purpose in this specific configuration. (If this issue has anything to do with transaction IDs at all, as covered elsewhere.)


They're not wrong. It's not the software that's the issue, it's the data stored on disk. The transaction ID is stored in the data for various reasons.

In theory it wouldn't be too hard to change nowadays if you use logical replication to upgrade the database, but it'd be a huge undertaking for a lot of companies.


Right, but the "you need to recompile binaries if you switch something to 64 bit" part was a bit too general.


You seem to confuse 64 bit integers with 64 bit programs


I didn't.


But you wrote "that's assuming a lot, including that the binaries aren't 64 bit already" as a response to "typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted". Personally I definitely didn't read the latter as "recompile binaries from 32 bits to 64 bits". (I'm assuming that's what you meant by "not 64 bit already"?) More like "recompile 64 bit binaries to use different internal and external data structures".


No, I can't be positive, but I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86, or armv8/arm64 instead of armv7... At least that's how I took it, because not many things are compile-time constants nowadays, especially for databases. It doesn't distribute well.


The size of Postgres XIDs is in fact a compile-time constant.

If you’re wondering why it’s one: 1. it affects the size of a core data structure (that also gets serialized to disk and read back directly into said structure), and 2. basically nobody who has this problem (already a vanishingly small number of people) solves it by changing this constant, since you only have this problem at scale, and changing it would make your scale problems worse (and also make all your existing data unreadable by the new daemon.)

Also, FYI, Postgres uses compile-time constants for things that one might want to change much more frequently (though still in the “vanishingly unlikely” realm), e.g. WAL segment sizes. When you change internal constants like this, it’s expected that you’re changing them for every member of the cluster, or more likely building a new cluster from the ground up that is tuned in this strange way, and importing your data into it. “Doesn’t distribute well” never really enters into it.


Well, I stand corrected. I thought OP did not know that (given they said they didn't know anything about postgres and were talking about "recompiling all binaries"), but reading their posting history they probably do.

With "doesn't distribute well" by the way I meant that it doesn't distribute well as program binaries, not across a cluster. It used to be extremely common to recompile e.g. your Linux kernel, nowadays almost nobody does that unless there are some very specific needs. Of course, building a specialized postgres cluster for exceptional scale would easily qualify.


Well, I'm pretty sure of the exact opposite. Maybe we'll get adjudicated by semi-extrinsic if we're lucky. :)


Yes, you are right. I was not talking about switching to x86-64, but the fact that most code on 64 bit platforms still uses 32 bit integers and that you'll have to recompile to get 64 bit ints.

This shows up sometimes also in numerical programming, e.g. when your meshing ends up producing more than ~4.2 billion grid points. But it is quite rare, precisely because UINT32_MAX is a fairly huge number.


> I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86

I don't see why you should think that given that the discussion has been pretty much about the uint32 nature of TransactionId.


Because OP said they didn't know anything about Postgres and were speaking in generic terms, and the way they said "switching to int64" and recompiling all binaries just struck me as "if you want a 64 bit system, you need to recompile it" (which is true). I'm surprised it's a compile-time constant, as I just learned.

Reading their post history now they probably do know what they're talking about, though.


There's a reason for even database servers with 32-bit transaction IDs to be 64-bit processes - performance and memory. AFAIK until Firebird 3.0, Firebird also had 64-bit server builds with 32-bit transaction IDs - it would have broken many external tools had the old on-disk format been changed.


Offtopic: why does something that everyone here recommends all the time need something like vacuum which is heavy and can fail? What is the reason people do not cry about that as much as the Python GIL? It felt like a hack 15+ years ago and it feels just weird now. I am curious why that was never changed; it is obviously hard but is it not regarded as a great pain and priority to resolve?


This is a hell of a comparison to try to make. How many people do you think regularly run into this kind of issue in Postgres? How many people do you think regularly run into the Python GIL?

Even if we assume your premise that vacuum is at least as bad as the GIL is correct, this would easily explain it.

Add in the sheer difference in expectations between a programming language and a database... and well, I don't think there's any mystery here at all. There's easily multiple great explanations, even if your premise is correct.


> why does something that everyone here recommends all the time need something like vacuum which is heavy and can fail?

Because it's better than the alternatives? What would you suggest?


I thought it was a certificate issue? I am looking at robox.com and the issuing CA is GoDaddy...


Certificate issues don't take that long to resolve.


They do if your entire PKI infra is down too


A company's internal PKI infrastructure wouldn't be responsible for issuing a public-facing certificate. They literally can't sign those -- a real CA has to do it.


You are of course correct but usually public and private would reuse some core components of the infra (eg still need to store signed key pair somewhere safe). I’m speculating here but given how long it’s been down some very core and very difficult to recover service must have failed. Security infra tends to have those properties


Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.


That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution, once things are back up find the root cause, fix the root cause, and perform a thorough analysis of events to ensure it won't happen again.


Depending on the level of automation that may not be possible. That's like saying that if a factory line robot fails, "you just bypass the line and manually weld those car bodies".


Wait. You can sign your own. They are just not trusted by the wider world. Your devices have an OS provided set of trusted root-CA.


I’m just speculating. I didn’t do any in depth research - none of the articles or tweets by Roblox I saw offered anything more than “an internal issue”.


Yeah but wouldn’t the impact be more widespread then?


robox or roblox?


I think he must have meant robox right? Cmon man get outta here.. :)


Roblox.com


It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.

CockroachDB is an example of what a modern database should be like.


Funnily enough it looks like Roblox actually uses CockroachDB https://corp.roblox.com/careers/listing/?gh_jid=3363389


Maybe so, but my comment was a tangent about Postgres, not Roblox. The parent was speculating. We don't know what their problem is.


The latest version of Postgres addresses this issue in various ways; although it's not entirely solved, it should be significantly mitigated.


That’s not a bad theory. Even the homepage is down so that suggests their entire database was taken down.


I doubt their homepage is being served from the same DB as their game platform.

I think throwing together a static page better than "we're making the game more awesome" would be simple. It kinda makes me wonder if it's an internal auth/secret issue as has been speculated. That could theoretically make it harder to update the website, especially if it's deployed by CI/CD.


Do any other databases have similar issues to Postgres? Or this is specific to Postgres?


Years ago I ran into this issue with MySQL, storing four billion rows with a unique ID.


This is a different issue. You ran out of IDs for your row identifier. The PostgreSQL issue is that every transaction has an ID; normally this is a fairly invisible internal thing you don't need to worry about, but in some edge cases this transaction ID can cause problems. You could, in theory, run into this issue with just one row.

I don't know how MySQL or MariaDB handle this; AFAIK they don't have this issue.


Was it because your primary key was an unsigned 32-bit integer? That's not an "issue".


True. Not the same thing but an equally disastrous outcome.


The lack of communication for an outage this big is absolutely shameful. I put this on the leadership, not any of the engineers working round the clock. Having been in the middle of a critical service outage that lasted over 24 hours I totally get the craziness of the situation, but Roblox seriously needs to revisit their incident management and customer update process. Even though kids are the main consumers of the app, the near total silence speaks volumes about their business's lack of preparedness for disaster scenarios. If nothing else, I hope they'll see this as an opportunity to learn and do better next time.


What else would you expect when it comes to communication? They posted a status page and said they're working on it. Do you want to be part of their internal chat and see what exactly they're investigating?

Would posting "still working on it" every 2h really have made it better?


The article says they didn't post to the status page for 24 hours.


Good grief. It's a game for kids. Nobody needs updates. They could even take the weekend off.


>It's a game for kids

There are companies that rely on Roblox for hosting their games so they can make money. This is the equivalent of cloud hosting going down.


I thought they were just making money for Roblox and other big tech companies (who take like three fourths of the total pie).


You both make money


I think you have a kernel of truth in your statement, but perhaps you are overlooking how silly it is for someone's well-being to rely on a couple weeks of video game microtransactions.

My heart goes out to the Devs in this category, but hopefully this is further impetus not to stake too much on tech services.


It is a business relationship like any other. There's always some risk in investing in and relying on a business partner, but it's also reasonable to have expectations of cooperation and good faith.


100%. Really, I'm more expressing my desire to live in a society which doesn't feel a constant (simulated) risk just to exist. This economy is just a game we play, and I think a lot of people don't know this, and also don't want to be part of it.

I think that to be working at the top of the hierarchy of needs (in game development, for instance), people should demonstrably master the lower levels of the pyramid. We really can live in a way where this is possible. We just have to collectively want it.


Things in the physical world break down sometimes. It’s not a result of some calculated collective choice.


In my mind, that reinforces my point. One ought to try to mitigate risk to oneself by minimizing the amount of human infrastructure one needs to keep life going.


It might be a game for kids, but their company is publicly traded and valued higher than Electronic Arts (more than Ubisoft and Take-Two Interactive combined).


As with all businesses Roblox has a paying customer base and the company has a responsibility to those customers.


I would also argue a responsibility to their shareholders.

How much revenue would they have missed out on during this time? And how much will it affect their longer term growth?

If I were a Roblox shareholder, I would be pretty annoyed by the silence.


How much growth and revenue would they have missed had they allocated more resources towards better security and recovery plans before this, instead of better features?

As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.


> As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.

Huh? So shareholders who literally own a piece of the company don't have the right to complain about poor management, operational incompetence, company strategy, etc.?

That's not how it works. Shareholders actually have greater legal rights than customers even though both, in practice, are usually fairly limited when it comes to situations like this.


There are multiple studios and independent game developers who rely on a consistent and predictable player count as a means of income. Multiple days of lost revenue, whilst not the end of the world, is certainly not ideal, and Roblox Corp should be providing regular updates.


>Good grief. It's a game for kids. Nobody needs updates. They could even take the weekend off.

It's also a $50bn company, with thousands of shareholders and businesses that rely on their platform.


You mean a weekend - in plenty of countries, an extended weekend due to holidays - in which many kids hoped they could play some Roblox?


I can't tell if this aggressively dismissive comment is a joke or not.


It’s an actually a platform and ecosystem in which some people make their entire living.


Meh. This is part of playing a game where all the multiplayer goes through one service.


Roblox made their developer tooling require you to be always online and signed in, even though it doesn't actually need this to function. This means that all development workflows have been bricked during this outage too. https://twitter.com/RBXStatus/status/1454815143607607300/pho...


Development has been bloxxed


Fun fact: I operate many web MMO games (or "io games", as people like to call them), and traffic is up around 20-100% since the Roblox outage started.


This guy's too humble. He's Matheus Valadares, and he's the creator of Agar.io [1]. This is the game that pretty much started the ".io game" genre [2].

[1]: https://en.wikipedia.org/wiki/Agar.io

[2]: "Around 2015 a multiplayer game, Agar.io, spawned many other games with a similar playstyle and .io domain". See https://en.wikipedia.org/wiki/.io


Similar to John Carmack coming into a thread and saying he’s worked on a game or two.


Love agar.io and HN for this


That's cool, I love io games. Which ones?


Most well known one is probably agar.io. I made a couple of other ones like diep.io, and most recently digdig.io


Hi M28!


https://twitter.com/Bloxy_News/status/1454861081021587456

"STATUS UPDATE: Roblox is incrementally opening the website to groups of users and will continue to open up to more over the course of the day..."


Confirmation on the official account: https://twitter.com/roblox/status/1454900890180063238


They write that, but according to various game Discords it's absolutely not true. No one is allowed to log in.


absolutely? how many people are in these discords, and where do their players fall in the database sharding key? because thundering herd is definitely a problem when letting people back on to a system, and oh also we have no idea what the underlying issue actually is because Roblox has been basically silent this entire time. (Unsubstantiated Internet rumor about it being the secrets store also doesn't count.)


My friend's kid got in half an hour ago.


You have confirmation 0 Roblox users are able to log in?


That's a very long outage, I wonder how this happened.

- Perhaps internal systems they've developed, and the people who created them left. So it's not just fix the thing, but first understand what the thing is doing and then fix it.

- Data recovery can take forever if you run into edge cases with your databases

Anyone found any articles about their architecture?


HashiCorp has a case study with some details of what they use:

https://www.hashicorp.com/case-studies/roblox


oof (from the link). Debugging nightmare to have a fairly inexperienced team trying to diagnose stuff in these incredibly complex systems.

> 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes across 22 clusters, serving 420+ internal developers

> "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."


I know everyone has to start somewhere, but this just sounds like a sure way to have a platform blow up impressively.


I bet it is some sort of Vault/Consul shenanigans that's going on.


When I saw "secret store" I guessed it had to be Vault. Vault is amazing, but it lets you configure things that can blow up on you in X time. For example, issuing 50 secrets per second but having every secret expire after a week (or never). It would mean (multiple) goroutines per secret checking the status of the lease. This kind of thing, unfortunately, is easy to miss and does occur with Vault.
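
Back-of-the-envelope on that example (the rate and TTL are hypothetical, just to show how it sneaks up on you):

    # How many live leases pile up at a steady issuance rate before any expire.
    # Numbers are the hypothetical ones from the comment above.
    ISSUE_RATE = 50                  # secrets issued per second
    LEASE_TTL = 7 * 24 * 3600        # one-week lease, in seconds

    live_leases = ISSUE_RATE * LEASE_TTL
    print(f"Steady-state live leases: {live_leases:,}")
    # -> Steady-state live leases: 30,240,000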


Secret expiration seems like a better thing to do from the application: check the token every time it's used, and if it's past a week mark it as expired. Combine this with caching, etc. Is there an advantage to having such a system in the database?


What you suggest scales with performance but not with organization; that's why Vault is used in the first place.


The person mentioning the secret store, and the blog post Roblox put up, do seem to somewhat point to Vault, but there's no name drop.

> A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load. This was not due to any peak in external traffic or any particular experience. Rather the failure was caused by the growth in the number of servers in our datacenters. The result was that most services at Roblox were unable to effectively communicate and deploy.

https://blog.roblox.com/2021/10/update-on-our-outage/


As long as we're speculating... One of the few things I can think of that can't reasonably be sped up is data integrity recovery. Say some data got into an inconsistent state and now they have to manually restore a whole bunch of financial transactions or something before opening the game up again, because otherwise customers would get very mad at missing stuff they've paid for, traded, etc.

If they were to resume the game before fixing these issues, they would only be exacerbated, with state moving even further from where it was originally.


Wouldn't it be cheaper to directly compensate such customers rather than keeping the whole website down for 3 days?


Maybe, but wouldn't you have to recover the data to be able to figure out which customers you owed and how much?


Well you can get that from your transaction processor, doubt they do that themselves


Just ask people to send in support tickets


Right, but as soon as people get wind of what you're doing, you'll be drowning in fraudulent tickets.


This is a possible way to get free items. They have millions of players; checking every ticket doesn't seem scalable. No?


Been checking the #roblox hashtag on Twitter and the two main themes are addicts going through withdrawal and devs saying how they wouldn't let their llama appreciation fan site be down this long, let alone your core business.


Has any billion dollar company ever been down for 2 days?

Or is this the highscore?


Saudi Aramco is estimated to be worth $2t-$10t (it's privately held, so no one knows) and 35,000 of their computers got their MBRs and MFTs (extrapolating) nuked in a political sting operation in 2012. It's handwavily rumored that the HDD supply chain shortages at the time (remember the floods in Thailand?) were partially from Saudi Aramco momentarily redirecting chunks of the world's HDD manufacturing capacity to fix it.

https://darknetdiaries.com/transcript/30/ "Shamoon" (the episode about this, which says it depends on the previous two episodes)

https://darknetdiaries.com/transcript/28/ "Unit 8200"

https://darknetdiaries.com/transcript/29/ "Stuxnet"


Maersk had their entire computer infrastructure down for 9 days.

https://www.i-cio.com/management/insight/item/maersk-springi...


The supermarket chain Coop in Sweden was down longer than 2 days. They closed their shops during the ransomware attack. https://www.bbc.com/news/technology-57707530


Tesco online grocery shopping in the UK was down for about 2 days recently, and Tesco as a whole has a market cap of $29.1B.


I assume you could still go into the store and get groceries though, right? Not sure if they're online-only right now.


Yes, Tesco was basically open the whole time.


Did they publish what happened?


They claimed it was an attack (an "attempt to interfere with our systems") but haven't given any details.


Colonial Pipeline was down for five days after a ransomware attack; you may remember the gas hoarding that was occurring back in May as a result... and the many ridiculous videos on the internet of people doing things like trying to fill up the back of pickup trucks with gas.


I don't remember how long Garmin was down, but I feel like it had to have been close!


How long was the PlayStation Network down for in like 2011? Around a week?


According to Wikipedia (and from my memory of how long it felt like it lasted) 23 days.

https://en.wikipedia.org/wiki/2011_PlayStation_Network_outag...


I was reminded of the early days of Twitter and the FailWhale!

But I don't think they ever went down for two days?

But certainly Twitter were notorious for regular outages!


NotPetya really hindered operations for some companies of that size. Depends on how you define an outage.


What about Currys? They always seem to be down when I need to use their website.


Adding to the speculation here, I'm willing to bet some component of their issue is not entirely technical. Regardless of the underlying cause (PKI was mentioned), for downtime to last this long it almost definitely means some persistent data was lost or corrupted. Of course they can recover from a backup (I'm confident they have clean backups) but what does that mean for the business? "We irrecoverably lost 12 hours of data" could have severe implications, for example legal or compliance risks.


Why are you confident they have clean backups? It's been my experience that backup infrastructure is usually not given much thought, and engineers infrequently test that recovery from backups works as expected. Not saying that's what it is, but not sure it can be ruled out.


I wonder how the market is going to react when it opens tomorrow. I am thinking a quick scalp with weekly $RBLX puts, then when it recovers double up on cheap long call options.


I like slots myself.


It went up when it was known that the site was down, so it can be a bit hard to predict.


Down 3% since opening. Roblox still has not recovered. Might be a free fall if they have not recovered by mid week.


Probably a bunch of bots that bought on "any news is good news".


Pretty remarkable to think the market prices things like this, or that public information everyone is privy to gives you an edge.


There are trading bots that react to news articles, so it's not unreasonable to see significant swings, especially in options activity, this week. They also have earnings coming up.



Internal cause does not necessarily mean a technical mishap. Read: rogue sysadmin or other employee-initiated event.


I've been getting recruiting emails from Roblox recently.... Maybe they really do need my help.


I feel for the people in the trenches on this one. It’s got to suck bad.


In the immediate short term, yes, obviously - stress, long hours, etc. In the immediate aftermath, probably too: Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".

But if the organization is functional, in the medium term, this may also mean staffing understaffed teams, hiring SREs, etc. - which can mean less stress, no more 24/7 pager duty, better pay etc.


> Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".

When something like this threatens to end the entire party (no one pays or get paid) you god damn want to figure out why and not make it happen again. That’s not burdensome, that’s business.


> management that isn't always aware of reality

This is the burdensome part.


They are still a public company so I guess we will see the damage on Monday. Might be an opportunity to buy low.


Who wants to bet there will be a Kubernetes versus Nomad/Consul debate coming to a Roblox meeting soon? I'd like to hear what Hashicorp has to say here.


Is there a possibility this is caused by ransomware?


From TFA - “We believe we have identified an underlying internal cause of the outage with no evidence of an external intrusion,” says a Roblox spokesperson.


Thanks, but can we trust any company in this situation to be transparent, given that such an attack would also mean that user data could have been leaked?


Reports of IT incidents from public companies tend to be obscure, vague, missing, or technically true but misleading, but they almost universally stop short of outright lying. (Mainly because, for public company officers, lying to shareholders, i.e. the general public, has harsher personal risks and consequences than the company losing lots of money or even shutting down; being incompetent is completely legal, but lying to your shareholders about material aspects of company finances is a crime and also makes them personally financially liable.) So if they say that they have seen no evidence of an external intrusion, then I would presume it's definitely not ransomware, where signs of external intrusion - namely, the ransom demand - would be obvious and undeniable.

Perhaps (though IMHO not that likely) it may be some other kind of attack e.g. one intended to secretly steal customer data, which does not give signs of external intrusion if the company doesn't look for them much (and it might have motivation to not look very hard), but if it was ransomware, I'm quite sure they would not say what they said.


It's possible, but that'd be a theory entirely founded in paranoia, as it both needs no evidence (we have no sign it's an attack) and accepts no counter-evidence (they have said it is not an attack).

So while it's always a possibility, it's also kinda pointless to wonder about until there's supporting evidence.


A ransomware intrusion into a modern large tech company like Roblox would be unusual and impressive.


Two day outage of a product the size of Roblox is unusual too. I hope they publish a postmortem - my guess is some kind of database issue to have stayed down this long.


At the same time, Twitch and Kaseya both got broken into recently.


Twitch was not breached by ransomware, and Kaseya is not the caliber of company I am discussing.


Well, I'm glad I'm finally off the hook.


Apparently, according to roblox.com, some players are able to play: "We are incrementally opening to groups of players and will continue rolling out."

https://i.imgur.com/KgDxNsg.png


They write that, but according to various game Discords it's absolutely not true. No one is allowed to log in.


Like when Camelcamelcamel was down for a week in 2019

https://news.ycombinator.com/item?id=19038198


Seems to work at least partly now. I just jumped into one of the more popular games and there were a bunch of people playing.


I see one of their DBs is Mongo. I wonder if they ran into some sharding related nightmare.



A couple of possibilities...

I know they said they weren't hacked but they were hacked.

or

They are completely inept and have no disaster recovery plan in place, etc.


It feels like some kind of catastrophic data loss. I can't imagine that app servers or network infrastructure could have been the root cause, especially because they are running on AWS and there haven't been any reports of outages or other customers impacted. Restoring an old backup and rebuilding data from logs seems like the only thing that could take so long. That, or an entirely dysfunctional IT org that can't get out of its own way in a crisis.

Best to them.


This article from 2019 suggests they use a mix of cloud and a dedicated data center.

https://portworx.com/blog/architects-corner-roblox-runs-plat...


"Roblox is very popular, especially with kids — more than 50 percent of Roblox players are under the age of 13. More than 40 million people play it daily" ....from misleading logical non sequitur to parroting Roblox marketing numbers in under 30 words, nice. Verge is such trash



