In my first blog on this site, For What It’s Worth, I referred to Ross Button’s Scatter Architecture. Ross was looking at the situation in which organizations let more and more of their activities be performed outside of the physical and legal boundaries of the organization. In particular he focused on the use of information and communication technology in this context. One of his earliest examples was inspired by the Amazon outage back in 2011. What Ross saw was that, if even a major and reputable supplier like Amazon can have this kind of problem, and if the problem can be of sufficient scale to cause companies to lose data and business, then we need to consider alternative recovery channels. In other words, we need to add variety to our ability to respond. This wasn’t just some abstract recommendation. One of the things that emerged from the analysis of that outage was that some organizations had little or no problem with it, because they had made their own arrangements for redundant solutions – in some cases using options available from Amazon and in at least one case using their own data centres as backup/failover locations!
Ross argues that you need to scatter your key processes across multiple “locations” in order to have a reasonable guarantee of business continuity – just as a farmer scatters seed rather than planting it all in one place.
Is Scatter a direct illustration of Ashby’s law? Is it really a response to variety – from the perspective of the customer? If all you do is move your IT systems from your own data centre to Amazon’s, have you introduced additional variety? On the face of it, quite the opposite. In our own data centre we can see all the possible sources of variety that we need to match, whereas when we put everything in the cloud we only “see” one big undifferentiated source of variety. So isn’t there now less variety? Well, if we restrict variety to mean visible sources of variation, then yes. But the Amazon outage showed that adding variety to our responses makes a difference. So maybe there are other factors to consider.
If we actually go about developing a business continuity strategy, we start with risk analysis. We look at:
- the likelihood (as far as we know) of an undesirable event (like a data centre outage) taking place
- the scale of the consequences should it happen (e.g. how long can we afford to let a particular business process remain unavailable?)
- the available mitigations (i.e. what can we do to recover, if the worst happens?)
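The three factors above can be sketched as a toy scoring model. To be clear, the numbers, the 1–5 scales and the multiplicative form are my own illustrative assumptions, not anything from the original outage analyses; the point is only that removing mitigations can raise the overall risk even when the likelihood of an outage drops:

```python
# A minimal sketch of the risk analysis above: likelihood x consequences,
# reduced by whatever mitigations we control ourselves. All scales and
# values are illustrative assumptions.

def risk_score(likelihood, impact, mitigations):
    """likelihood and impact on a 1-5 scale; mitigations is a list of
    factors in (0, 1], each describing how much one of our own recovery
    options reduces the effective impact."""
    effective_impact = impact
    for m in mitigations:
        effective_impact *= m
    return likelihood * effective_impact

# In-house data centre: outages are more likely, but we control
# several mitigations (e.g. a failover site, tested restores).
in_house = risk_score(likelihood=2, impact=5, mitigations=[0.5, 0.6])

# Single cloud provider, vanilla configuration: outages are rarer,
# but we hold no mitigations of our own.
single_cloud = risk_score(likelihood=1, impact=5, mitigations=[])

print(in_house, single_cloud)  # 3.0 5.0 with these example values
```

With these made-up numbers the vanilla cloud setup scores as the higher risk, despite its lower likelihood, which is exactly the shape of the argument that follows.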
The analysis process would be the same whether we were looking at our own data centre or at Amazon’s. So what might we conclude? I dare to suggest that the likelihood of an outage is lower for Amazon than for your average in-house facility. The scale of the consequences can’t be different, because it’s the same business process. What is different is the mitigations. If we put the entire farm with Amazon (or any similar provider) in a vanilla configuration, we’re entirely dependent on Amazon to get us back online within the acceptable time limits. Let’s assume that unmatched variety was part of the problem at Amazon. We can’t put additional controls in place to match that. Only Amazon can. So, if something does go wrong, those mitigations are no longer available to us. The level of risk has therefore increased.
Whatever level of variety there is inside the cloud, it is from our perspective unmatched. So we need new mitigations, new ways of matching the variety that we can see, which is what Scatter is about.
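The Scatter idea itself can be sketched in a few lines: keep equivalent deployments of a key process at several independent “locations” and fall back to the next one when the current one fails. The location names and the runner functions below are hypothetical placeholders, not a real API:

```python
# A minimal sketch of Scatter as a mitigation: try each scattered
# location in turn and return the first successful result. Names and
# runner callables are hypothetical placeholders.

def run_scattered(locations, process):
    """locations is a list of (name, runner) pairs; runner(process)
    either returns a result or raises on failure."""
    errors = {}
    for name, runner in locations:
        try:
            return name, runner(process)
        except Exception as exc:  # a real system would be more selective
            errors[name] = exc
    raise RuntimeError(f"all locations failed: {errors}")

def flaky(process):
    # Stand-in for a provider that is mid-outage.
    raise ConnectionError("primary provider outage")

def healthy(process):
    # Stand-in for a working backup location.
    return f"{process}: done"

locations = [("primary-cloud", flaky), ("own-data-centre", healthy)]
name, result = run_scattered(locations, "order-fulfilment")
print(name, result)  # falls back to own-data-centre
```

A real implementation would distinguish retryable from fatal errors and keep the scattered copies in sync, but the essential move is the same: the second location is a mitigation we control, regardless of what is happening inside the provider’s blob of variety.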
It’s clear from the various analyses of that Amazon outage (and some subsequent problems) that variety was indeed part of the problem. Uncompensated variety led to unpredicted behaviour in the recovery process, which to all intents and purposes created a sort of DDoS attack on their own servers. This isn’t just a Cloud problem and certainly not just an Amazon problem. It’s the sheer scale of the Amazon operation that made the outage so extreme – and so public – but it would apply just as well to traditional outsourcing wherever there is shared infrastructure involved. It doesn’t even have to be multi-tenant. Unpredictable things happen during recovery. They may be explicable in a subsequent root cause analysis but not in advance. The Cloud/outsourcing provider has to compensate for that variety with adequate controls, so that it retains the possibility of reacting effectively.
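One widely used control for exactly this recovery-storm pattern – not necessarily what Amazon did, but a standard technique – is exponential backoff with jitter, so that many clients or recovery agents retrying at once don’t hammer the service in lock-step. The parameters below are illustrative assumptions:

```python
import random

# Exponential backoff with "full jitter": each retry waits a random
# time up to an exponentially growing ceiling, spreading out the
# retries that would otherwise arrive in synchronized waves.
# All parameter values here are illustrative assumptions.

def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Yield one jittered delay (in seconds) per retry attempt."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # uniform in [0, ceiling)

# Deterministic rng for demonstration: rng() == 1.0 shows the ceilings.
delays = list(backoff_delays(rng=lambda: 1.0))
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```

The jitter is the part that compensates the variety: it deliberately de-correlates the recovery behaviour of individual clients so that the aggregate load stays predictable even when the individual failures are not.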
It’s true that, as consumers of Cloud/outsourcing services, we see one big, undifferentiated blob of variety. We can’t look inside the blob to see what the variety is. But, as we’ve seen, that doesn’t mean it isn’t there. That variety may be compensated by the provider but we can’t know that. So there’s an increased element of uncertainty in our risk assessment, which means the risk level is higher and that too requires us to put our own compensating mechanisms in place.
To summarize this, it seems that both a lack of direct control and an increased level of uncertainty are equivalent to a form of variety. Or perhaps they are amplifiers of variety. Variety isn’t just about counting sources of variation. It’s not just about what we can know but also about what we can’t know. And that’s what Scatter is designed to address. It’s not some rules-driven, if-A-then-B approach. That only works when everything is knowable. Scatter is simply a way of providing options, which require intelligence and intuition to make use of on a case-by-case basis.
I hope it’s obvious that all of this doesn’t just apply to IT scenarios. It’s the same for any situation in which some third party performs some part of our business. The extent and criticality of the variety involved and what, if any, controls we choose to put in place are dependent on how simple or complex our relationship with the external service and its provider is and what part of our business is exposed to the variety. In the next blog I’ll make use of some of Tom Graves’s methods to show how we can capture that information.