A Richter scale for outages
10.07, Thursday 12 Mar 2015 Link to this post
Increasing reliance on invisible centralised software. Recent Apple outage, recent HSBC/contactless/tube outage. How long before a big one?
And I guess what I mean is that we’re all using these same software systems. And they interact in ways that are totally emergent. So they go down in unpredictable ways. I feel like these systems are not resilient… for example, the credit card system is less resilient than a distributed payment system like cash.
It got brought to my attention because for, about a day, I couldn’t use my HSBC contactless card on London tubes or buses – who knows why. I had to top up my separate Oyster card, and it ended up costing me a couple quid more that day. Then, yesterday, several of Apple’s systems were down for about 12 hours: Main effect for me, I couldn’t listen to any music in iTunes. Nothing major.
But as software eats the world,and as the Internet of Things brings more of the physical world into that same domain, I think it would be helpful to have a language to talk about this.
So I made some notes on the bus.
A Richter scale for system outage
Like the Richter magnitude scale, each magnitude is incrementally ten times bigger. So 4.0 is 100x bigger than 2.0. But like apparent magnitude it’s subjective: The scale of the human effect is taken into account.
Here’s what I reckon the scale might look like.
Less than 2.0. Not distinguishable from normal network noise, like a call dropping, webpage not loading, or a computer crash.
2.0. Facebook down, Gmail down, Apple App Store down, HSBC contactless cards not working on London transport. Duration of shorter than a day. Underlying problem is probably a single component and lack of resilience (e.g. power outage at a single cloud hosting location). Fixable.
4.0. Minor network freeze but can be recovered with a reboot; broad human inconvenience without threat. e.g. regional ATM network down for a day, cellular network down for a day for single operator.
6.0. Collapse of minor network requiring rebuild. e.g. recent Sony hack that meant no computers, printers, or existing network infrastructure could be re-used without manual check of each item.
8.0. Major network freeze, can be recovered with time or reboot; major human impact. Examples include the 2008 credit crunch where bank lending gridlock precipitated the global financial crisis; power network outage major enough to require black start; the Icelandic volcano that grounded European flights for 6.5 days.
10.0. Major network collapse, global and unrepairable. e.g. Cascading, emergent fault that wipes Internet routers and shuts down power grids, traffic and logistics, internet and non-cash payment. Can only be fixed by re-programming and re-architecting all separate components.
So the questions this makes me asks…
Is there a more objective way to measure system outage magnitude? Can we also measure resilience, with a language that cuts across different systems? Is there an equivalent scale for non-software system outages, like would Gulf Stream switch-off be a 8.0? Are we really going to see more software-related black swans over the coming decades?
How long before the big one?