
There's a lot of bad karma in this discussion. It's hard to run large services. Careful when you set a precedent of pillorying after an outage. It could be you next!

Yes, this is the second time in a month. Were folks expecting that to be enough time for them to make sweeping technical and organizational changes? I say no. This doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.

I see this event primarily as another example of the #1 class of major outages: bad rapid global configuration change. (The last Cloudflare outage was too, but I'm not just talking about Cloudflare. Google has had many, many such outages. There was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS attack configurations needs to roll out globally quickly. But they really need to make it not quite this quick. Imagine they deployed to one server for one minute, then to one region for one minute on success, then everywhere on success. Then this would have been a tiny blip rather than a huge deal.

(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)
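To make the staged idea concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the `Stage` type, the `apply` / `error_rate` / `rollback` callbacks, and the 5% threshold are my assumptions, not anything from Cloudflare's or Google's actual systems. The point is only the shape: push to a small set, soak, check a crude health signal, and widen only on success.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Stage:
    name: str
    targets: Sequence[str]  # hypothetical server identifiers
    soak_seconds: float     # how long to watch before widening


def staged_rollout(
    stages: Sequence[Stage],
    apply: Callable[[str], None],        # push the new config to one target
    error_rate: Callable[[str], float],  # observed 5xx fraction, 0.0 to 1.0
    rollback: Callable[[str], None],     # restore the previous config
    max_error_rate: float = 0.05,        # assumed threshold, tune per service
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    applied = []
    for stage in stages:
        for target in stage.targets:
            apply(target)
            applied.append(target)
        sleep(stage.soak_seconds)
        # Judge the stage by its worst target; an empty stage counts as healthy.
        worst = max((error_rate(t) for t in stage.targets), default=0.0)
        if worst > max_error_rate:
            # Undo everything pushed so far, newest first.
            for target in reversed(applied):
                rollback(target)
            return f"rolled back at {stage.name}"
    return "deployed everywhere"
```

With stages like one server for 60 seconds, one region for 60 seconds, then the rest, a config that makes every request fail gets stopped after the first server. A real system would want a smarter health signal than a fixed error-rate threshold, for the reason given above.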

As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
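For what "100% branch coverage on a small parser" might look like, here's a toy example. The rule format (`ACTION THRESHOLD`), the `Rule` type, and the action names are all invented for this sketch; I have no idea what the real config looks like. The idea is that every malformed input ends in one explicit error type that the caller can use to reject the push, and that each branch is small enough to hit with a one-line test.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised for any malformed rule so the caller can reject the whole push."""


@dataclass(frozen=True)
class Rule:
    action: str
    threshold: int


ALLOWED_ACTIONS = {"block", "challenge", "log"}  # hypothetical action names


def parse_rule(line: str) -> Rule:
    fields = line.split()
    if len(fields) != 2:
        raise ConfigError(f"expected 'ACTION THRESHOLD', got {line!r}")
    action, raw_threshold = fields
    if action not in ALLOWED_ACTIONS:
        raise ConfigError(f"unknown action {action!r}")
    try:
        threshold = int(raw_threshold)
    except ValueError:
        raise ConfigError(f"threshold is not an integer: {raw_threshold!r}") from None
    if threshold < 0:
        raise ConfigError(f"threshold must be non-negative, got {threshold}")
    return Rule(action, threshold)
```

Five branches, so five or six test inputs cover it: one valid line, plus an empty line, a wrong field count, an unknown action, a non-numeric threshold, and a negative threshold.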




