I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses int32 for transaction IDs, and IDs are only reclaimed by a vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down several days waiting for Postgres to clean things up.
Even though it’s a well documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern could accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.
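For anyone curious what "keeping an eye on it" looks like in practice, here's a minimal monitoring sketch (the connection string is made up and psycopg2 is just the driver I'm assuming; the query itself is stock Postgres):

```python
# Rough monitoring sketch, assuming psycopg2 and a hypothetical DSN.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# age(datfrozenxid) = how many transactions old the oldest unfrozen xid is.
# Autovacuum starts aggressive anti-wraparound vacuums once this passes
# autovacuum_freeze_max_age (200 million by default); at roughly 2 billion
# Postgres stops accepting new write transactions.
cur.execute("""
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC
""")
for datname, xid_age in cur.fetchall():
    pct = 100 * xid_age / 2_000_000_000
    print(f"{datname}: xid age {xid_age:,} ({pct:.1f}% of the way to trouble)")
```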
Whatever the issue is, I’m wishing the engineers working on it all the best.
I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?
I know we're already in the weeds, but Malcolm Gladwell’s “Revisionist History” did an excellent episode on the stuck accelerator problem, basically showing it almost certainly didn't happen.
OP specifically pointed out that it's well documented. Every program makes tradeoffs, and I don't think anyone in this thread implied that because Postgres made this one it's unstable or bad.
The specific behavior of Postgres isn't the FUD, the FUD is the speculation that this is what brought down Roblox for multiple days. As others have mentioned, we don't even know if Roblox uses Postgres, and yet we're diving deep on how an edge case of Postgres brought down Roblox. It's the speculation that I think is FUD.
Roblox has 43+ million daily active users. The issue you are wildly speculating about is small potatoes at their size and scale. I guarantee you they dealt with that potential issue (if they even use Postgres) years ago.
I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured corrupted critical configuration data, which then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All the other services would then have to restore their systems in a cascaded fashion while sorting out complex dependencies and constraints, which could take days.
It's an edge case: it only happens _if_ you either have very long-running transactions holding xids open, or you haven't vacuumed your database enough and it has to do a blocking vacuum to advance the oldest xid before it wraps around.
Most Postgres databases wrap around with no issues.
The problem with increasing the size is that a pair of xids (xmin and xmax) is stored in every row header, so you're doubling that overhead if you go to 64 bits.
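To put rough numbers on that (back-of-envelope only; the 4-byte xmin/xmax figure is the current tuple header layout, and this ignores alignment and padding):

```python
# Every heap row header stores two 4-byte xids (xmin and xmax); widening both
# to 8 bytes adds 8 bytes per row before any alignment/padding effects.
extra_bytes_per_row = 2 * (8 - 4)

for rows in (10**6, 10**9, 10**11):
    extra_gib = rows * extra_bytes_per_row / 1024**3
    print(f"{rows:,} rows -> ~{extra_gib:,.1f} GiB extra header overhead")
```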
IDK about Postgres internals, but typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted.
That's assuming a lot, including that the binaries aren't 64 bit already (a bit unlikely nowadays), and the database wouldn't just use a 32 bit datatype for this specific purpose in this specific configuration. (If this issue has anything to do with transaction IDs at all, as covered elsewhere.)
They're not wrong. It's not the software that's the issue, it's the data stored on disk. The transaction ID is stored in the data for various reasons.
In theory it wouldn't be too hard to change nowadays if you use logical replication to upgrade the database, but it'd be a huge undertaking for a lot of companies.
But you wrote "that's assuming a lot, including that the binaries aren't 64 bit already" as a response to "typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted". Personally I definitely didn't read the latter as "recompile binaries from 32 bits to 64 bits". (I'm assuming that's what you meant by "not 64 bit already"?) More like "recompile 64 bit binaries to use different internal and external data structures".
No, I can't be positive, but I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86, or armv8/arm64 instead of armv7... At least that's how I took it, because not many things are compile-time constants nowadays, especially for databases. It doesn't distribute well.
The size of Postgres XIDs is in fact a compile-time constant.
If you’re wondering why it’s one: 1. it affects the size of a core data structure (that also gets serialized to disk and read back directly into said structure), and 2. basically nobody who has this problem (already a vanishingly small number of people) solves it by changing this constant, since you only have this problem at scale, and changing it would make your scale problems worse (and also make all your existing data unreadable by the new daemon.)
Also, FYI, Postgres uses compile-time constants for things that one might want to change much more frequently (though still in the “vanishingly unlikely” realm), e.g. WAL segment sizes. When you change internal constants like this, it’s expected that you’re changing them for every member of the cluster, or more likely building a new cluster from the ground up that is tuned in this strange way, and importing your data into it. “Doesn’t distribute well” never really enters into it.
Well, I stand corrected. I thought OP did not know that (given they said they didn't know anything about postgres and were talking about "recompiling all binaries"), but reading their posting history they probably do.
With "doesn't distribute well" by the way I meant that it doesn't distribute well as program binaries, not across a cluster. It used to be extremely common to recompile e.g. your Linux kernel, nowadays almost nobody does that unless there are some very specific needs. Of course, building a specialized postgres cluster for exceptional scale would easily qualify.
Yes, you are right. I was not talking about switching to x86-64, but the fact that most code on 64 bit platforms still uses 32 bit integers and that you'll have to recompile to get 64 bit ints.
This shows up sometimes also in numerical programming, e.g. when your meshing ends up producing more than ~4.2 billion grid points. But it is quite rare, precisely because UINT32_MAX is a fairly huge number.
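A quick back-of-envelope shows how fast a structured 3D mesh gets there (nothing here is specific to any particular code):

```python
# How big a structured 3D grid has to be before 32-bit node indices overflow.
UINT32_MAX = 2**32 - 1  # 4_294_967_295

for n in (1024, 1625, 1626, 2048):
    points = n**3
    verdict = "fits in uint32" if points <= UINT32_MAX else "overflows uint32"
    print(f"{n}^3 grid = {points:,} points -> {verdict}")
```

A cube of roughly 1626 points per side is already past the limit, which is why big meshing and CFD runs end up needing 64-bit index builds.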
Because OP said they didn't know anything about Postgres and were speaking in generic terms, and the way they said "switching to int64" and recompiling all binaries just struck me as "if you want a 64 bit system, you need to recompile it" (which is true). I'm surprised it's a compile-time constant, as I just learned.
Reading their post history now they probably do know what they're talking about, though.
There's a reason for even database servers with 32-bit transaction IDs to be 64-bit processes - performance and memory. AFAIK until Firebird 3.0, Firebird also had 64-bit server builds with 32-bit transaction IDs - it would have broken many external tools had the old on-disk format been changed.
Off topic: why does something that everyone here recommends all the time need something like vacuum, which is heavy and can fail? Why don't people complain about that as much as they do about the Python GIL? It felt like a hack 15+ years ago and it feels just weird now. I'm curious why that was never changed; it's obviously hard, but is it not regarded as a great pain and a priority to resolve?
This is a hell of a comparison to try to make. How many people do you think regularly run into this kind of issue in Postgres? How many people do you think regularly run into the Python GIL?
Even if we assume your premise that vacuum is at least as bad as the GIL is correct, this would easily explain it.
Add in the sheer difference in expectations between a programming language and a database... and well, I don't think there's any mystery here at all. There's easily multiple great explanations, even if your premise is correct.
A company's internal PKI infrastructure wouldn't be responsible for issuing a public-facing certificate. They literally can't sign those -- a real CA has to do it.
You are of course correct, but usually public and private would reuse some core components of the infra (e.g. you still need to store the signed key pair somewhere safe). I'm speculating here, but given how long it's been down, some very core and very difficult-to-recover service must have failed. Security infra tends to have those properties.
Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.
That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution, once things are back up find the root cause, fix the root cause, and perform a thorough analysis of events to ensure it won't happen again.
Depending on the level of automation, that may not be possible. That's like saying that if a factory line robot fails, "you just bypass the line and manually weld those car bodies".
I’m just speculating. I didn’t do any in depth research - none of the articles or tweets by Roblox I saw offered anything more than “an internal issue”.
It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.
CockroachDB is an example of what a modern database should be like.
I doubt their homepage is being served from the same DB as their game platform.
I think throwing together a static page better than "we're making the game more awesome" would be simple. It kinda makes me wonder if it's an internal auth/secret issue as has been speculated. That could theoretically make it harder to update the website, especially if it's deployed by CI/CD.
This is a different issue. You ran out of IDs for your row identifier. The PostgreSQL issue is that every transaction has an ID; normally this is a fairly invisible internal thing you don't need to worry about, but in some edge cases this transaction ID can cause problems. You could, in theory, run into this issue with just one row.
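If it helps to see the contrast, here's a sketch of the "row identifier" version of the problem (table and sequence names are hypothetical; assumes psycopg2):

```python
# Check how much headroom a plain serial (int4) primary key has left.
import psycopg2

INT32_MAX = 2**31 - 1

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT last_value FROM orders_id_seq")  # hypothetical sequence
(last_value,) = cur.fetchone()
print(f"{INT32_MAX - last_value:,} ids left before the serial column overflows")
```

The xid issue is about the transaction counter rather than any column, which is why even a near-empty database can hit it.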
I don't know how MySQL or MariaDB handles this; AFAIK it doesn't have this issue.
The lack of communication for an outage this big is absolutely shameful. I put this on the leadership, not any of the engineers working round the clock. Having been in the middle of a critical service outage that lasted over 24 hours I totally get the craziness of the situation, but Roblox seriously needs to revisit their incident management and customer update process. Even though kids are the main consumers of the app, the near total silence speaks volumes about their business's lack of preparedness for disaster scenarios. If nothing else, I hope they'll see this as an opportunity to learn and do better next time.
What else would you expect when it comes to communication? They posted a status page and said they're working on it. Do you want to be part of their internal chat and see exactly what they're investigating?
Would posting "still working on it" every 2h really have made it better?
I think you have a kernel of truth in your statement, but perhaps you are overlooking how silly it is for someone's well-being to rely on a couple of weeks of video game microtransactions.
My heart goes out to the Devs in this category, but hopefully this is further impetus not to stake too much on tech services.
It is business relationship like any other. There's always some risk with investing and relying on a business partner, but it's also reasonable to have expectations of cooperation and good faith.
100%. Really, I'm more expressing my desire to live in a society which doesn't feel a constant (simulated) risk just to exist. This economy is just a game we play, and I think a lot of people don't know this, and also don't want to be part of it.
I think that to be working at the top of the hierarchy of needs (in game development, for instance), people should demonstrably master the lower levels of the pyramid. We really can live in a way where this is possible. We just have to collectively want it.
In my mind, that reinforces my point. One ought to mitigate risk to oneself by minimizing the amount of human infrastructure one needs to keep life going.
It might be a game for kids, but their company is publicly traded and valued higher than Electronic Arts (more than Ubisoft and Take-Two Interactive combined).
How much growth and revenue would they have missed had they allocated more resources towards better security and recovery plans before this, instead of better features?
As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.
> As a shareholder, you own the company including its problems. You don't get to complain like a customer, you did not buy any product or service from them, it's the other way around.
Huh? So shareholders who literally own a piece of the company don't have the right to complain about poor management, operational incompetence, company strategy, etc.?
That's not how it works. Shareholders actually have greater legal rights than customers even though both, in practice, are usually fairly limited when it comes to situations like this.
There are multiple studios and independent game developers who rely on a consistent and predictable player count as a means of income. Multiple days of lost revenue, whilst not the end of the world, is certainly not ideal, and Roblox Corp should be providing regular updates.
Roblox made their developer tooling require you to be always online and signed in, even though it doesn't actually need this to function. This means that all development workflows have been bricked during this outage too. https://twitter.com/RBXStatus/status/1454815143607607300/pho...
This guy's too humble. He's Matheus Valadares, and he's the creator of Agar.io [1]. This is the game that pretty much started the ".io game" genre [2].
[2]: "Around 2015 a multiplayer game, Agar.io, spawned many other games with a similar playstyle and .io domain". See https://en.wikipedia.org/wiki/.io
Absolutely? How many people are in these Discords, and where do their players fall in the database sharding key? Because thundering herd is definitely a problem when letting people back onto a system, and, oh, also we have no idea what the underlying issue actually is because Roblox has been basically silent this entire time. (Unsubstantiated internet rumor about it being the secrets store also doesn't count.)
That's a very long outage, I wonder how this happened.
- Perhaps it's internal systems they've developed, and the people who created them have left. So it's not just "fix the thing", but first understand what the thing is doing and then fix it.
- Data recovery can take forever if you run into edge cases with your databases
Anyone found any articles about their architecture?
Oof (from the link). It's a debugging nightmare to have a fairly inexperienced team trying to diagnose stuff in these incredibly complex systems.
> 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes across 22 clusters, serving 420+ internal developers
> "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."
When I saw "secret store" I guessed it had to be Vault. Vault is amazing, but it lets you configure things that can blow up on you in X time. For example, issuing 50 secrets per second but having every secret expire after a week (or never). It would mean (multiple) goroutines per secret checking status on the lease.
This kind of thing, unfortunately, is easy to miss and easy to hit in Vault.
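Taking those hypothetical numbers at face value (this says nothing about how Vault actually tracks leases internally), the back-of-envelope is sobering:

```python
# 50 leases issued per second, each with a one-week TTL.
leases_per_second = 50
ttl_seconds = 7 * 24 * 3600  # one week = 604,800 seconds

steady_state_leases = leases_per_second * ttl_seconds
print(f"{steady_state_leases:,} leases alive at steady state")  # 30,240,000
```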
Secret expiration seems like a better thing to do from the application: check the token every time it's used, and if it's past a week mark it as expired. Combine this with caching, etc. Is there an advantage to having such a system in the database?
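As a purely illustrative sketch of that application-side approach (no particular secret store assumed):

```python
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # "past a week", per the comment above

def is_expired(issued_at, now=None):
    """True once a token is more than a week old."""
    if now is None:
        now = time.time()
    return now - issued_at > MAX_AGE_SECONDS

# At each use: look the token up (local cache or database), call
# is_expired(token.issued_at), and reject/delete it if True.
```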
The person mentioning the secret store, and the blog post they put up, do seem to somewhat point to Vault, but there's no name drop.
> A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load. This was not due to any peak in external traffic or any particular experience. Rather the failure was caused by the growth in the number of servers in our datacenters. The result was that most services at Roblox were unable to effectively communicate and deploy.
As long as we're speculating... One of the few things I can think of that can't reasonably be sped up is data integrity recovery. Say some data got into an inconsistent state and now they have to manually restore a whole bunch of financial transactions or something before opening the game up again, because otherwise customers would get very mad about missing stuff they've paid for, traded, etc.
If they were to resume the game before fixing these issues, they would only be exacerbated, with state moving even further from where it was originally.
Been checking the #roblox hashtag on Twitter and the two main themes are addicts going through withdrawal and devs saying how they wouldn't let their llama appreciation fan site be down this long, let alone their core business.
Saudi Aramco is estimated to be worth $2t-$10t (it's privately held, so no one knows) and 35,000 of their computers got their MBRs and MFTs (extrapolating) nuked in a political sting operation in 2012. It's handwavily rumored that the HDD supply chain shortages at the time (remember the floods in Thailand?) were partially from Saudi Aramco momentarily redirecting chunks of the world's HDD manufacturing capacity to fix it.
Colonial Pipeline was down for five days after a ransomware attack; you may remember the gas hoarding that occurred back in May as a result... and the many ridiculous videos on the internet of people doing things like trying to fill up the back of pickup trucks with gas.
Adding to the speculation here, I'm willing to bet some component of their issue is not entirely technical. Regardless of the underlying cause (PKI was mentioned), for downtime to last this long it almost definitely means some persistent data was lost or corrupted. Of course they can recover from a backup (I'm confident they have clean backups) but what does that mean for the business? "We irrecoverably lost 12 hours of data" could have severe implications, for example legal or compliance risks.
Why are you confident they have clean backups? It's been my experience that backup infrastructure is usually not given much thought, and engineers infrequently test that recovery from backups works as expected. Not saying that's what it is, but I'm not sure it can be ruled out.
I wonder how the market is going to react when it opens tomorrow. I am thinking a quick scalp with weekly $RBLX puts, then when it recovers double up on cheap long call options.
There are trading bots that react to news articles, so it's not unreasonable to see significant swings, especially in options activity, this week. They also have earnings coming up.
In the immediate short term, yes, obviously - stress, long hours, etc. In the immediate aftermath, probably too: Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".
But if the organization is functional, in the medium term, this may also mean staffing understaffed teams, hiring SREs, etc. - which can mean less stress, no more 24/7 pager duty, better pay etc.
> Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".
When something like this threatens to end the entire party (no one pays or get paid) you god damn want to figure out why and not make it happen again. That’s not burdensome, that’s business.
Who wants to bet there will be a Kubernetes versus Nomad/Consul debate coming to a Roblox meeting soon? I'd like to hear what HashiCorp has to say here.
From TFA - “We believe we have identified an underlying internal cause of the outage with no evidence of an external intrusion,” says a Roblox spokesperson.
Thanks, but can we trust any company in this situation to be transparent, given that such an attack would also mean that user data could have been leaked?
Reports of IT incidents from public companies tend to be obscure, vague, missing, or technically true but misleading, but they almost universally stop short of outright lying. (Mainly because, for public company officers, lying to shareholders, i.e. the general public, carries harsher personal risks and consequences than the company losing lots of money or even shutting down; being incompetent is completely legal, but lying to your shareholders about material aspects of company finances is a crime and can also make them personally financially liable.) So if they say that they have seen no evidence of an external intrusion, I would presume it's definitely not ransomware, where the signs of external intrusion - namely, the ransom demand - would be obvious and undeniable.
Perhaps (though IMHO not that likely) it's some other kind of attack, e.g. one intended to secretly steal customer data, which doesn't leave signs of external intrusion if the company doesn't look for them much (and it might have motivation to not look very hard). But if it was ransomware, I'm quite sure they would not say what they said.
It's possible, but that'd be a theory entirely founded in paranoia, as it both needs no evidence (we have no sign it's an attack) and accepts no counter-evidence (they have said it is not an attack).
So while it's always a possibility, it's also kinda pointless to wonder about until there's supporting evidence.
Two day outage of a product the size of Roblox is unusual too. I hope they publish a postmortem - my guess is some kind of database issue to have stayed down this long.
It feels like some kind of catastrophic data loss. I can’t imagine that app servers or network infrastructure could have been the root cause, especially because they are running on AWS and there haven’t been any reports of outages or other customers being impacted. Restoring an old backup and rebuilding data from logs seems like the only thing that could take so long. That, or an entirely dysfunctional IT org that can’t get out of its own way in a crisis.
"Roblox is very popular, especially with kids — more than 50 percent of Roblox players are under the age of 13. More than 40 million people play it daily" ....from misleading logical non sequitur to parroting Roblox marketing numbers in under 30 words, nice. Verge is such trash