It’s interesting to see that Microsoft Azure are previewing availability zones in two of their data center regions.
Amazon AWS have had this capability for quite a while – but what exactly is an availability zone, and does it provide full resiliency for applications in the cloud? Is it a good disaster recovery strategy?
For both cloud vendors, the original premise was to provide better availability, and to protect against failure of the underlying platform (hypervisor, physical server, network, perhaps storage).
Back in the early days of cloud computing, I was working for VMware on their Cloud team. For many years, ever since what used to be called Virtual Infrastructure 3, VMware has provided high availability and resilience through capabilities such as DRS, HA, and vMotion. Storage vMotion came in with version 3.5.
- DRS (Distributed Resource Scheduler) provides initial VM placement (which host shall I run this VM on?), as well as moving VMs around a cluster based on usage to balance out the cluster. Typically, quiet VMs will be moved first.
- HA (High Availability) provides the capability to restart VMs on other hosts in a cluster when either a host fails, or a VM crashes for some reason (BSOD, etc.).
- vMotion provides the ability to move a running VM between hosts in a cluster with no loss of service.
- Storage vMotion (svMotion) allows the underlying virtual disks (VMDKs) to be moved between datastores to achieve more balanced throughput, or to allow migration of data.
- VMware ESXi hosts can be put into maintenance mode, which then uses vMotion to evacuate hosts prior to shutting them down. Very useful for maintenance, upgrades, etc.
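As a rough illustration of the HA behavior in the list above, here is a minimal Python sketch. It is not VMware's actual algorithm: the `restart_after_host_failure` function, the host and VM names, and the "host with the fewest VMs wins" placement rule are all assumptions made purely for the example.

```python
from __future__ import annotations


def restart_after_host_failure(
    placement: dict[str, list[str]], failed_host: str
) -> dict[str, list[str]]:
    """Return a new placement with the failed host's VMs restarted elsewhere.

    Hypothetical rule: each VM goes to the surviving host that currently
    runs the fewest VMs. A real scheduler would weigh CPU, RAM, and more.
    """
    # Copy the surviving hosts so the caller's data is left untouched.
    survivors = {h: list(vms) for h, vms in placement.items() if h != failed_host}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")

    for vm in placement.get(failed_host, []):
        # Pick the least-loaded survivor (ties go to the first host listed).
        target = min(survivors, key=lambda h: len(survivors[h]))
        survivors[target].append(vm)
    return survivors


# Hypothetical three-host cluster; esx1 fails and its VMs are restarted.
cluster = {"esx1": ["web1", "db1"], "esx2": ["web2"], "esx3": []}
print(restart_after_host_failure(cluster, "esx1"))
```

Note that the VMs are restarted, not moved while running, so there is still a short outage. That is the difference between HA and vMotion.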
Now that we're on the same page with capabilities, let's dive into comparing the availability zones of AWS and Azure.
Crashes or Downtime
For the most part, AWS and Azure only provide an HA capability. In the event of a host hypervisor crash or deliberate shutdown, VMs will be restarted on other hosts as they have shared storage. They do not have a vMotion or svMotion concept, or DRS — although they do have an initial placement calculation. Although organizations running Hyper-V on-premises have Live Migration, this has not been carried through to Azure.
Aside from crashes, there is a problem with this model around planned maintenance. When hosts get updated, there is no way to move the VMs running on them without loss of service, so VMs will be subject to the rug being pulled out from under them on occasion.
With this in mind, AWS and Azure talk about a different model when designing resilient services, that of ‘Design for Failure’. In a nutshell, design your cloud infrastructure on the premise that parts of it will fail, so provide resiliency at the application level. At the least, this requires doubling up of everything, and for many deployments this will require additional licensing as well as costs for the VMs themselves.
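To make 'Design for Failure' a little more concrete, here is a minimal Python sketch of failover at the application level. Everything in it is hypothetical: the `call_with_failover` helper and the two stand-in replicas (`failed_zone_a`, `healthy_zone_b`) are assumptions for illustration, not an AWS or Azure API.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # This copy is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def failed_zone_a() -> str:
    # Stand-in for a VM whose host was taken down for maintenance.
    raise ConnectionError("host in zone A is down")


def healthy_zone_b() -> str:
    # Stand-in for the duplicate VM running in a second zone.
    return "response from zone B"


print(call_with_failover([failed_zone_a, healthy_zone_b]))
```

The point of the sketch is the cost: the request only survives because a second, paid-for copy of the service exists in another zone.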
When you think about the traditional virtualized applications of the past five years or longer, many solution architects and administrators have been very pleased with the high availability features offered by VMware, and therefore have typically architected around single VMs for many applications. That said, people would often build out clusters for database applications such as SQL Server, but that was often to provide the ability to upgrade the database, one node at a time.
Aside from the compute elements of high availability, another area to discuss is that of persistent storage. In the past, storage was protected using RAID techniques. As we move to the public cloud, object storage has appeared as a popular way of storing data, and this uses the availability zone topology to protect data — but only if you choose it and pay for it. To protect against individual disk failure, three copies of the data are spread across the storage subsystems.
In the AWS world, both object storage (S3) and block storage (EBS) are available. For virtual machines requiring persistent storage, Elastic Block Store (EBS) is used, and this is replicated within the availability zone to protect against failure of the underlying storage platform at no extra cost. EBS volumes do not appear to be replicated to other regions.
In the Azure world, they do not have an EBS equivalent, and instead use Azure Blob (Binary Large Object) storage. This is available in a number of guises, depending on your use case:
- Block Blob
- Page Blob (used for VM storage)
- Append Blob
For availability purposes, the following types of Blob storage are provided:
- Locally Redundant Storage (LRS) – three copies of the data in a single data center
- Zone Redundant Storage (ZRS) – three copies of the data in one data center, and another three copies in another data center. Can only be used for block blobs. The second copy is not available for use until Microsoft enables it for you.
- Geo-Redundant Storage (GRS) – three copies of the data in one data center and another three copies in a secondary region hundreds of miles away. The second copy is not available for use until Microsoft enables it for you.
In the case of Azure, having data replicated to another region does not mean that the VMs are necessarily available there; only the storage is. The VMs would need to be created from the underlying replicated storage (imported into Azure VMs).
Another important consideration is that replicating storage to another availability zone or region only really protects against storage subsystem failure. It does not protect against storage corruption, accidental deletion, or recent threats such as ransomware encrypting the files within the storage. To that extent, it is definitely not a disaster recovery solution.
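A toy model shows why this is the case. The Python sketch below is an assumption-laden illustration, not how any cloud vendor implements replication: the `ReplicatedStore` class is hypothetical and simply applies every write to every copy.

```python
from typing import Dict, List


class ReplicatedStore:
    """Toy store that mirrors every write to all of its replicas."""

    def __init__(self, copies: int) -> None:
        self.replicas: List[Dict[str, bytes]] = [{} for _ in range(copies)]

    def write(self, key: str, value: bytes) -> None:
        # Replication is faithful: good and bad writes alike reach every copy.
        for replica in self.replicas:
            replica[key] = value

    def intact_copies(self, key: str, expected: bytes) -> int:
        """Count the replicas that still hold the expected content."""
        return sum(1 for replica in self.replicas if replica.get(key) == expected)


original = b"quarterly numbers"
store = ReplicatedStore(copies=6)  # e.g. three copies in each of two locations
store.write("report.docx", original)

# A storage subsystem failure wipes one replica; the other copies survive.
store.replicas[0].clear()
print(store.intact_copies("report.docx", original))

# Ransomware overwrites the file; the overwrite is replicated everywhere.
store.write("report.docx", b"ENCRYPTED")
print(store.intact_copies("report.docx", original))
```

After the simulated hardware failure, most copies are still intact. After the simulated ransomware write, none are, because the replicas did exactly what they were designed to do.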
How about iland?
Back in the iland Secure Cloud, an SLA of 100% availability is offered for a single data center solution, with service credits being offered if that is not obtained. For customers wanting to implement a disaster recovery solution to protect against the other aspects mentioned above, iland offers cloud-to-cloud DRaaS using Zerto between any of our data centers, depending on customer requirement. As with our on-premises to cloud DRaaS solution, this only involves paying for storage and inter-DC bandwidth most of the time, and then only paying for CPU/RAM when invoking DR for testing or real scenarios.
When considering a true DRaaS solution, it is important to understand what RPOs and RTOs are required. That is, how much data can be lost due to an outage or corruption, and how quickly you need to get things running again. With the iland DRaaS solution, RPOs of minutes or even seconds are easily achievable, while the RTO is how long it takes to boot up the VMs. Another benefit of the Zerto-powered DRaaS solution is that it is a continuous replication solution with a journal supporting up to 30 days. This means that you can literally wind back to just before things went wrong, which is especially useful for ransomware attacks.
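The idea of winding back through a journal can be sketched in a few lines of Python. This is an illustration only and does not use or describe Zerto's API: the `latest_checkpoint_before` helper, the 10-second checkpoint interval, and the timestamps are all assumptions chosen for the example.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import List, Optional


def latest_checkpoint_before(
    checkpoints: List[datetime], incident: datetime
) -> Optional[datetime]:
    """Return the newest checkpoint strictly before the incident, if any.

    `checkpoints` must be sorted in ascending order.
    """
    idx = bisect_left(checkpoints, incident)  # first checkpoint >= incident
    return checkpoints[idx - 1] if idx > 0 else None


# Hypothetical journal: one checkpoint every 10 seconds for 10 minutes.
start = datetime(2024, 1, 1, 12, 0, 0)
journal = [start + timedelta(seconds=10 * i) for i in range(60)]

# Suppose ransomware starts encrypting files 95 seconds in.
incident = start + timedelta(seconds=95)
restore_point = latest_checkpoint_before(journal, incident)
if restore_point is not None:
    data_loss = incident - restore_point  # the RPO actually achieved
    print("restore to:", restore_point, "data lost:", data_loss)
```

In this sketch, the data lost is the gap between the chosen restore point and the incident, which is why more frequent checkpoints translate directly into a smaller RPO.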
Another striking benefit of the iland DRaaS solution is the ability to carry out self-service testing whenever required, while the replication carries on in the background. There is no need to call support to enable the replicated storage (as in the Azure example above). Everything is available to the authenticated/authorized user through the iland Secure Cloud Console.
From a compliance and auditing perspective, the iland console gives you the ability to generate and access documentation to show what has been protected and how long the failover took. We also have an in-house compliance team that is available to answer any questions you might have.
In summary, as customers think about migrating their traditional virtualized services to the public cloud, they need to consider the topics discussed above, and whether they need to rearchitect their applications to accommodate the nuances of AWS and Azure. They also have the option to use a VMware-powered public cloud provider such as iland, and enjoy the same capabilities they've been using on-premises, with the added prospect of cloud-to-cloud DRaaS for additional protection against data loss or corruption.