Why Multi-Cloud Probably Isn’t the Answer for Resilience

Multi-cloud – it’s the hot topic that’s here, there, and everywhere. Following some high-profile outages, businesses are more anxious than ever about having all their eggs in one cloudy basket and are looking to ensure they don’t fall foul of the next one. However, this grumpy cloud architect thinks that multi-cloud might not be the answer to every problem, as some blogs make it out to be, and that we need to revisit the basics with some good ol’ fashioned resilient architecture.

Note: here I’m discussing multi-cloud not in the hybrid-cloud on-premise sense but rather in a public cloud context, i.e. multi-cloud ⊂ {gcp, aws, azure, …}, |multi-cloud| > 1

Let’s balance that by first remembering that there are some scenarios where multi-cloud is the solution you’re looking for. Best-of-breed products should always be used regardless of the particular hyper-scaler that offers them; for example, who wouldn’t want to use BigQuery for analytics, regardless of where the rest of your infrastructure lives (as demonstrated by Monzo)? Also, having flexibility in your infrastructure choices will keep your cloud account manager on their toes when it comes to pricing and negotiation, ensuring your budget holders and CFO are happy. That same flexibility should also help your Risk Officer sleep better at night.

The Resilience and Availability Question

With the pleasantries out of the way, we can begin the rant in earnest. I don’t believe that carving a solution down the middle, and putting half in GCP and half in A. N. Other, is the correct way to improve your resilience and uptime for the vast majority of cloud consumers.

The current zeitgeist of multi-cloud is primarily being driven by a desire to not go offline when your cloud provider does, and this is now on the radar of the highest echelons of organisations. Downtime is scary and bad for business, and the hyper-scalers aren’t infallible; tech fails, squishy organic matter presses the wrong button – if it can happen, it usually does.

However, fleeing to another cloud is addressing the symptom and not the problem. If you lose your infrastructure when a whole region or zone goes offline, then why not replicate the same infrastructure in a second region? There are plenty more to choose from. This will achieve exactly the same resilience as using a second cloud, and likely be an order of magnitude easier to set up and maintain. Your chosen cloud provider has fleets of tools to make this as simple as possible, from global load balancing in GCP to global SQL databases in AWS, and storage that spans continents in Azure. (Or all three in GCP.)
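To put rough numbers on that claim, here is a minimal back-of-envelope sketch. It assumes the two deployments fail independently (optimistic in practice) and uses a made-up 99.9% availability per deployment, which is not any provider’s real SLA; the function names are purely illustrative.

```python
# Back-of-envelope availability sketch. All figures are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60


def redundant_availability(availabilities):
    """Availability when any one of the deployments surviving is enough.

    Assumes failures are independent, which is optimistic in practice.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # probability that every deployment so far is down
    return 1.0 - all_down


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


one_region = 0.999  # hypothetical figure, not a real SLA
two_deployments = redundant_availability([one_region, one_region])

print(f"one region:      {downtime_minutes_per_year(one_region):.1f} min/year")
print(f"two deployments: {downtime_minutes_per_year(two_deployments):.1f} min/year")
```

Under those assumptions, expected downtime drops from roughly 526 minutes a year to under a minute. Note that the arithmetic has no idea whether the second deployment lives in a second region or a second cloud; it only cares that there are two of them and that they fail independently.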

Going offline with your cloud provider is a failure in redundant design, which is not going to be automatically solved by writing a cheque to a second provider.

“What if my cloud provider loses all of its regions in the US?” I hear you cry. “Then fail to Europe,” the abyss cries back. This brings us to the caveat of this debate – regulated industries. If you have strict compliance requirements around sovereignty, and you’re in a smaller country without much choice for regions (the UK or Switzerland, for example), then I accept multi-cloud might be the only option to increase your number of 9s.

A Hot / Hot Cloud

Option A – run two clouds at all times.

If we unpick this, problems start to fall out, with lowest common denominator infrastructure top of the list. Does Pub/Sub have the same delivery semantics as SQS? Will this operator work on both GKE and AKS? Can this authentication method work with both API Gateway and Cloud Functions? Does my Vertex AI model deploy to Cognitive Services? Cloud providers offer a range of tools to make your life as easy as possible – frictionless delivery and time to deployment are among the biggest value adds, and trying to be all things to all clouds neutralises this benefit. As you slide down the scale away from fully managed services towards cloud-agnostic solutions, your infrastructure management burden and tooling requirements only increase, in turn driving costs up.

Managing the simultaneous deployment across clouds is no mean feat, requiring extra abstraction layers in your infrastructure as code, and adding cognitive overhead for your developers. Tools like Anthos and Crossplane will certainly support you, but containerised services are likely only a subset of your total infrastructure. Don’t forget – whilst adding multi-cloud infrastructure mathematically improves your resilience, it’s harder to do well, and getting it wrong will only undermine any net gains and could even decay your resilience.

Routing between the clouds is the thorny issue. By definition, you don’t want this to live exclusively in either cloud provider, which likely leads you to an edge provider such as Cloudflare or Akamai. But then you’re at the mercy of a single edge provider’s SLA (and they aren’t infallible either), bringing us full circle back to where we began.
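The “full circle” point can be sketched numerically. This is an illustration under loud assumptions: components fail independently, and the availability figures below are invented for the example, not taken from any provider’s SLA.

```python
def in_parallel(*availabilities):
    """Redundant components: the system is up if at least one is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def in_series(*availabilities):
    """Chained components: the system is up only if every one is up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


cloud = 0.999   # hypothetical availability of each cloud deployment
edge = 0.9999   # hypothetical availability of the single edge provider

two_clouds = in_parallel(cloud, cloud)
end_to_end = in_series(edge, two_clouds)

print(f"two clouds behind nothing:   {two_clouds:.6f}")
print(f"two clouds behind one edge:  {end_to_end:.6f}")
```

However many clouds sit behind it, the end-to-end figure can never exceed the availability of the single component in front of them, so the edge provider becomes the new ceiling.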

A Hot / Cold Cloud

Option B – run one cloud, with a second on standby. This is what springs to mind when people say ‘multi-cloud’ and it’s assumed to be the simplest setup.

However, the abstraction and lowest common denominator infrastructure points of option A stand. On top of that, you have a few more disadvantages, with the added question of “How cold is cold?”. Do you have all of your secondary infrastructure offline, in which case there’s going to be a lead time to bring it online when the day dawns, or do you leave it available, but scaled down? The latter is going to incur a cost 24/7 and increase your TCO.

Wherever you sit on this spectrum, the secondary cloud will always run the risk of being pilot light infrastructure. Unless it gets TLC every day to stay in lockstep with developments in the primary setup and demonstrates it can be stood up equally regularly, it’s very likely to fall over when the time comes.

Keeping the data in sync will be an even harder task. Imagine you have a handful of queues in flight, a database or two, and maybe a session cache. Without some careful (and likely costly) data replication and syncing, when Azure UK South disappears into a black hole, even if your second cloud is ready to go, it’s going to be tough to piece together where you were. And don’t forget – you’ve also got to get the interim data back in the other direction to then return to the primary cloud.

The crux of this option for me is “You’re good, but are you better-than-Google good?”. If a zone’s compute goes offline at 2AM, what’s likely to happen quicker?

  1. You can bring online your cold cloud that was last tested 3 weeks ago, ensure the data is all coherent, and start routing to it
  2. Google mobilises a world-class SRE function to bring the compute resources back

Maybe I’m a pessimist, or I’ve been drinking the Kool-Aid for too long (likely both), but I’d rather fail to a nice clean static “We’ll be back soon” page in a bucket than try to stand up a complete second infrastructure faster than a hyper-scaler.

Closing Thoughts

Multi-cloud is a great tool that empowers the consumer, and shouldn’t be written off in the space of one therapy session thinly veiled as a blog post. However, as my colleague the excellent Will Parsons described it, it comes down to where you are on your cloud journey:

  1. Can your application be load-balanced at all? If not, work on separating out your state and application layer.
  2. Is your infrastructure all in one availability zone? Perhaps you should start to cross zone boundaries.
  3. Is your infrastructure all in one region? Next step, roll it out to a second.
  4. Lots of regions happily clunking away? Add regions as far as your compliance requirements allow – bonus points, as your far-flung customers will now see improved latencies.
  5. Fully geo-distributed? If you still want more 9s in your uptime, now start thinking about multi-cloud.
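
The ladder above can be read as a simple decision function. The sketch below is only one interpretation of it: the `Estate` fields and the wording of each step are hypothetical names introduced for illustration, and `regions_allowed` is an assumed way of modelling the compliance limit.

```python
from dataclasses import dataclass

SEPARATE_STATE = "Separate state from the application layer"
CROSS_ZONES = "Spread across availability zones"
SECOND_REGION = "Roll out to a second region"
MORE_REGIONS = "Add regions as far as compliance allows"
MULTI_CLOUD = "Start thinking about multi-cloud"


@dataclass
class Estate:
    """Hypothetical summary of where a deployment currently stands."""
    load_balanceable: bool
    zones: int
    regions: int
    regions_allowed: int  # assumed cap imposed by compliance requirements


def next_step(estate: Estate) -> str:
    """Return the first rung of the ladder that has not been climbed yet."""
    if not estate.load_balanceable:
        return SEPARATE_STATE
    if estate.zones < 2:
        return CROSS_ZONES
    if estate.regions < estate.regions_allowed:
        return SECOND_REGION if estate.regions < 2 else MORE_REGIONS
    # Every permitted region is in use, so another cloud is the only way up.
    return MULTI_CLOUD


print(next_step(Estate(load_balanceable=True, zones=3, regions=1, regions_allowed=4)))
```

With `regions_allowed=1`, the function skips straight from zones to multi-cloud, which is the sovereignty caveat from earlier expressed in code.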

And remember – this is not just an academic exercise. Before trying to move on, prove to yourself you can seamlessly and automatically fail from one zone to another through DR exercises. Then prove it again. And one more time for good luck. This is where true resilience is found, not just buying a VM in a second cloud.

Multi-cloud has a place, but there’s other foundational work to do first, which will reap benefits quicker, and with significantly less effort. When you take the multi-cloud leap, be prepared to operate at abstraction layers with vendor-neutral tooling, requiring careful thought around what is actually going to be portable, reproducible and manageable across environments. However, if you’re at step four, then short of the inevitable heat death of the universe, the odds are astronomically low that your cloud provider will lose the USA, EU and Asia simultaneously. Even during one of the worst global ‘outages’, which lasted all of 47 minutes, only a grand total of ~5% of Kubernetes clusters were impacted.

Since AWS took out Tinder and Disney+, companies are keener than ever to avoid being in the spotlight for the same reason, and multi-cloud is being pushed as the answer from a lot of corners. However, very often these same companies still have a long way to go in their single cloud journey.

William of Ockham sums it up perfectly: “Entities are not to be multiplied beyond necessity.”
