Uptime recommends how users can handle cloud outages

To improve the availability of an application deployed in a public cloud, it is common to distribute it across data centers. This can be straightforward and inexpensive, but it does not guarantee that the application will be available. To reduce the impact of more frequent and widespread cloud outages, users must architect cloud-native applications that can handle the failure of VMs, availability zones, and regions. Designs that look more resilient do not necessarily come with meaningful guarantees of availability or outage compensation.

A recent report by Uptime Intelligence quantifies the costs, levels of resiliency, and cloud outage compensation of different stateless cloud application architectures.

Zone-wise resiliency architecture for cloud outages

Availability zones offer easily configured redundancy at a small cost premium. Many cloud services are resilient across zones as standard, because architecting across availability zones provides higher availability than a single zone with almost no impact on management overhead or cost.

Protecting an application against machine and zone failures costs about 45% more than running it unprotected. However, if the business can tolerate a recovery delay of 15 minutes, the premium drops to just 15%. Because zone-level resiliency is relatively cheap, Uptime Intelligence recommends that users distribute workloads across multiple zones.

Most cloud services are not designed to be resilient across regions as standard, so regional resiliency requires more significant consideration. Protecting an application against machine, zone, and regional failures with zero recovery time may cost a premium of about 111% over the cost of an unprotected application. According to the Uptime report, this premium can come down to 52% of the baseline in a pilot light model, if the user can tolerate a delay while additional capacity comes online.
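To make the premiums quoted above concrete, the following minimal sketch applies them to a baseline monthly cost. Only the percentages come from the figures cited in this article. The baseline of 1,000, the architecture labels, and the names PREMIUMS and monthly_cost are illustrative assumptions.

```python
# Cost premiums over an unprotected application, as quoted in this article.
# The dictionary keys are informal labels chosen for this sketch.
PREMIUMS = {
    "unprotected": 0.00,
    "zone-resilient": 0.45,
    "zone-resilient, 15-minute recovery": 0.15,
    "region-resilient, zero recovery time": 1.11,
    "region-resilient, pilot light": 0.52,
}


def monthly_cost(baseline: float, architecture: str) -> float:
    """Return the baseline cost plus the premium for the given architecture."""
    return round(baseline * (1 + PREMIUMS[architecture]), 2)


if __name__ == "__main__":
    baseline = 1000.0  # hypothetical monthly cost of the unprotected application
    for name in PREMIUMS:
        print(f"{name:38s} {monthly_cost(baseline, name):10.2f}")
```

On these assumptions, the fully protected multi-region design costs a little over twice the baseline, while the pilot light model costs roughly one and a half times the baseline.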

A cheaper option is to set up failover to a backup region. A pre-configured DNS service can switch traffic to the backup region during a significant cloud outage, provided that region has been prepared in advance with the minimal resources needed to aid recovery.
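The failover decision described above can be sketched as follows. This is not any provider's DNS API: the Region type, the choose_target function, the example hostnames, and the health check are all hypothetical, and a real setup would rely on the health checks and routing policies of the managed DNS service in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Region:
    name: str
    endpoint: str  # hypothetical hostname that a DNS record would point to


def choose_target(
    primary: Region,
    backup: Region,
    is_healthy: Callable[[Region], bool],
) -> Region:
    """Return the primary region if it passes the health check, else the backup."""
    return primary if is_healthy(primary) else backup


if __name__ == "__main__":
    primary = Region("region-a", "app.region-a.example.com")
    backup = Region("region-b", "app.region-b.example.com")

    # Simulate an outage of the primary region (assumption for illustration).
    regions_down = {"region-a"}
    target = choose_target(primary, backup, lambda r: r.name not in regions_down)
    print(f"DNS should point to {target.endpoint}")
```

The sketch only decides where traffic should go. The backup region still needs its minimal resources in place beforehand, which is the precondition the paragraph above stresses.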

Challenges in designing resiliency architecture

When a failure occurs, users have to submit a claim, with service logs as proof of the cloud outage, to request compensation. If the cloud provider approves the claim, users receive the compensation in service credits, not cash. Because rates of compensation are limited, the effort of applying may not always be worth the return. SLA compensation is poor and does not sufficiently cover the business impact of downtime.
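A rough way to see why compensation may not cover the impact of downtime is to compare a service credit with the cost of the outage to the business. The sketch below does that arithmetic. Every input (the monthly bill, the credit rate, the hourly cost of downtime) is a hypothetical placeholder, not a figure from the report or from any provider's SLA.

```python
def service_credit(monthly_bill: float, credit_rate: float) -> float:
    """Credit granted by the provider, as a fraction of the monthly bill."""
    return monthly_bill * credit_rate


def uncovered_loss(
    downtime_hours: float,
    cost_per_hour: float,
    monthly_bill: float,
    credit_rate: float,
) -> float:
    """Business cost of the outage minus the service credit received."""
    outage_cost = downtime_hours * cost_per_hour
    return outage_cost - service_credit(monthly_bill, credit_rate)


if __name__ == "__main__":
    # All values below are hypothetical and chosen only for illustration.
    loss = uncovered_loss(
        downtime_hours=2.0,
        cost_per_hour=5000.0,
        monthly_bill=1000.0,
        credit_rate=0.10,
    )
    print(f"Loss not covered by service credits: {loss:.2f}")
```

Because the credit is tied to the cloud bill rather than to the cost of downtime, the gap grows with the value of the workload, and the credit itself can only be spent on further cloud services.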

Building architectures across availability zones and regions increases resiliency compared to a single VM in a single zone, but it is challenging to quantify that increase with a high degree of accuracy. The cloud providers’ SLAs and design objectives do not offer a good view of resiliency. Moreover, incomplete knowledge of the cloud providers’ infrastructure and software resiliency is a hindrance to correctly assessing reliability. Resiliency also varies across cloud providers, which makes the assessment more complex. Therefore, designing for resiliency will be more of an art, focused on what will probably work.
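The difficulty can be illustrated with the textbook estimate for redundant deployments, which assumes that each copy fails independently of the others. The sketch below implements that naive formula. The function name and the example availability of 0.99 are assumptions, and the independence assumption is precisely the part that users cannot verify from the outside, so the result is at best an optimistic upper bound.

```python
def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` deployments is up,
    assuming each is available with probability `single` and that
    failures are fully independent (an optimistic assumption)."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one deployment
    for copies in (1, 2, 3):
        print(copies, f"{combined_availability(single, copies):.6f}")
```

Shared control planes, software faults, or network dependencies that affect several zones or regions at once break the independence assumption, which is one reason the calculated figure and the observed resiliency can differ.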

You may download the detailed report here.