My old Richter scale for system outages, revisited (Interconnected)

17.36, Wednesday 23 Nov 2022

It’s follow-up week! I’m blogging new words about old posts.

Re: A Richter scale for outages (2015).

Following a flurry of system outages (part of the Visa network was down, then iCloud for a bit) I scribbled some notes about quantifying the disruption…

Like the Richter magnitude scale, each magnitude is incrementally ten times bigger. So 4.0 is 100x bigger than 2.0. But like apparent magnitude it’s subjective: The scale of the human effect is taken into account.

Here’s what I reckon the scale might look like.

Full details in the post, but some highlights:

2.0 (Facebook down, outage lasts less than a day)
4.0 (broad human inconvenience without threat)
8.0 (e.g. the 2008 credit crunch or the Icelandic volcano grounding European flights)
10.0 (major network collapse, global and unrepairable).
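
To make the arithmetic concrete, here’s a minimal Python sketch of that ten-times-per-step rule. The function name `times_bigger` is made up for illustration; the only assumption is that the scale is base-10 logarithmic, as described above.

```python
def times_bigger(magnitude_a: float, magnitude_b: float) -> float:
    """How many times bigger magnitude_a is than magnitude_b.

    Assumes each whole step on the scale means ten times the impact.
    """
    return 10 ** (magnitude_a - magnitude_b)


print(times_bigger(4.0, 2.0))  # 100.0
print(times_bigger(8.0, 4.0))  # 10000.0
```

So the gap between a 4.0 and an 8.0 is not "twice as bad" but a factor of ten thousand.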

It’s an idea that has kept coming back to me since 2015. It feels like it would be useful to have in the public discourse! It’s not like we’re going to have fewer system outages in the future.

But I’ve never been really satisfied with the scale itself. I’ve always been meaning to try to put my finger on why.

Short reviews of a few other scales

Beaufort wind force scale (1805) – I like how practical, human, and sometimes poetic this is. Beaufort 2 is: Wind felt on face; leaves rustle. Beaufort 12, hurricane-force: Devastation.

I think one of the reasons it works so well is that the lower numbers are very everyday, so you can extrapolate and build a visceral understanding of extreme and rare events.

Kardashev scale (1964) – the classic scale of cosmic civilisational complexity, as measured by energy use. Type I: like the Earth, a planet making use of energy ultimately derived from its sun. Type II: a civilisation which has captured the entire energy of its sun, for example by building a Dyson sphere around the star. Type III: as II but able to direct the energy of an entire galaxy.

Not sure how applicable the Kardashev scale is here but it’s fun…

Rohn Emergency Scale (2006) – this scale has three independent dimensions: scope (measured in % of max population, or % loss in GDP); topography (the estimated visual fractional change in the environment – the collapse of a house is high; the collapse of a stock exchange is low); and speed of change.

It’s more of a descriptive framework than a scale, I’d say. I like that speed of change is in there.

Viking Impact Magnitude (2012) aka “A Richter Scale for Power Outages” – this paper shows how the scale is derived in a bottom-up fashion, which is super interesting. It’s rigorous, but again there’s the focus on the human impact. The scale of 1–10 is obtained by multiplying the number of affected people by the duration of the interruption.

The scale makes different events comparable: for example a 2007 cyclone in Sweden has the same impact as an earthquake, or maybe a hacker attack. Then the scale number can be correlated with a $ cost.

ALSO, one fictional datapoint. The jackpot, coined by William Gibson in The Peripheral (2014) and summarised here. The climate crisis, mass extinctions; no more bees, antibiotics exhausted; rolling pandemics and water shortages, just… all of it and all at once. Whatever the scale is, the jackpot is the Big One.

Towards a revised Richter scale for system outages

Learning from the above, a revised scale should ideally:

start with recognisable, everyday occurrences, and be more irritating than disruptive until about 4.0. It’s worth working to maintain a 1 to 10 scale.
describe the impact, not the cause. Dimensions are probably similar to the Viking scale: number of people affected and duration.

I wonder how to include some measure of damage. Like, a 7 day WhatsApp outage would be a massive pain but you can route around it (although not if you’re an informal worker in Brazil). A 7 day water outage is a catastrophe in the making.

But maybe that’s not for this? A Richter 8.0 earthquake in the middle of a city and a Richter 8.0 in the remote wilderness are given the same number on the scale. You differentiate by giving the location.

Also under damage I’d put “effort to remedy.” Like, is a reboot required, or a product recall?

A thought experiment: the 8 years and counting Flint water crisis is a water infrastructure disaster affecting 100,000 residents. In terms of damage, it’s way up there – but the remedy has taken its time probably due to a lack of will rather than actual severity.

Compare with a WhatsApp outage that is less severe but would affect 2 billion users. Should they both be a 6.5? Or do we add context – is WhatsApp a widespread 4 and Flint a localised 8? The latter I think.
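
A rough sketch of why headcount and duration alone don’t settle this. It assumes a magnitude of log10(people × hours), which is a guess for illustration and not the Viking paper’s actual formula, and it borrows the 7 day WhatsApp outage from the earlier example. The function name and constants are hypothetical.

```python
import math

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY


def person_hours_magnitude(people: int, hours: float) -> float:
    """Assumed measure: base-10 log of person-hours affected.

    Illustrative only; not calibrated to a 1 to 10 scale.
    """
    return math.log10(people * hours)


# Inputs taken from the examples in the text.
flint = person_hours_magnitude(100_000, 8 * HOURS_PER_YEAR)
whatsapp = person_hours_magnitude(2_000_000_000, 7 * HOURS_PER_DAY)

print(f"Flint: {flint:.1f}, WhatsApp: {whatsapp:.1f}")
```

On that assumed measure the WhatsApp outage scores higher than Flint, and both land near or above 10, so the raw number needs calibration and context before it says anything about damage.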

Being careful to specify the location answers many of my concerns I think. Twitter lost its timeline for a couple of hours the other day; we could describe that as a short sharp Twitter-localised 3.5, just enough to remind us of what we might lose, and you’d know what I meant.

One thing I’m certain of: this scale is for system outages. If there’s a fault on a weather satellite, then it’s not the satellite that this scale is concerned with, it’s our weather forecast infrastructure generally.

Taking all of this into account, the scale in that old blog post stands up ok. I’d add some notes about usage and interpretation but that’s it.

So I’m going to leave the 2015 scale intact for now.

It needs a v2. But the purpose of that work should be to refine and add rigour. It should start with collected examples, and work to define its terms on both infrastructure and impact.

That’s not something I can do on my own…

However there are not one but two upcoming books about infrastructure I am excited about: Public Utility by Debbie Chachra (she briefly ran me through the core argument and I can’t wait). And, by Georgina Voss, her new book on complex systems for Verso. I don’t know the title but I got a preview of the chapter topics and I am equally psyched.

Which means my next step is to wait until those are published, inhale them both, chase down some references, and then start buying people coffee until someone who actually knows what they’re talking about wants to co-author a paper.

If you enjoyed this post, please consider sharing it by email or on social media. Here’s the link. Thanks, —Matt.
