
Test vs Live Failovers in Zerto

By Mike Mosley, May 31, 2016
In my role as a Cloud Services Engineer, I get a lot of questions from customers about the difference between Zerto Live and Test Failovers, and many people seem to doubt the need to do Test Failovers. Test Failovers have become the industry standard for DRaaS. Let me explain why and, in the process, outline the important differences between live and test failovers.

You can run either a Live or a Test Failover at any point with Zerto Disaster Recovery. Both perform the same core failover operations and allow for the same networking access and server functionality. You can initiate a Zerto Test or Live Failover from either the Zerto console or the iland Enterprise Cloud Services (ECS) Console. In the Zerto console, the bottom right-hand corner always displays the red Failover button, with a notch on the side to set the failover as a Test or a Live Failover. Alternatively, under the Continuity tab in the iland ECS Console, you can click the Failover Wizard button and initiate a Live or Test Failover.

Recommended approach:
Before performing a Live Failover on your full production environment, iland recommends running a Test Failover. This lets you set up and verify user access ahead of time and look for possible issues before bringing down the production environment. It may also be useful to perform a Live Failover on test or dev servers or environments to get a good handle on the process. DR testing with Zerto performs a no-impact failover that spins up and imports your VMs into your target DR environment. When the failover servers come online at the DR site, the firewall can be configured for IPSec VPN and SSL VPN remote access. You can also choose to open NAT and firewall rules for public access, for example to test web or terminal services. This can all be done without affecting the production environment or networking. Essentially, the DR site at this point is a separate copy of your live production environment and can be sandboxed to prevent any communication with the public network or your production environment.

Alternatively, you can perform a Live Failover of your VPGs to the DR site, but caution is needed. The Zerto Live Failover option is designed to be used when your production environment is taken offline, for example by a natural disaster, a complete hardware failure or a network failure. With Zerto, you are able to bring your production environment online at a DR site in a matter of minutes with just a couple of mouse clicks. However, a Live Failover can be more intrusive and impactful to the production environment and may require more effort to return to normal replication.

Live Failover Process:
The overall process for a Live and a Test Failover is very similar, but the Live Failover operation includes a few extra parameters. The Execution Parameters for Live Failovers have three extra settings that are not found in the Test Failover wizard, and I have detailed those options below.

1. The Commit Policy. The commit policy can be set to Auto-Commit, Auto-Rollback or None. Selecting Auto-Commit means that after a designated time (default is 0 minutes), Zerto will commit the failover, which promotes the failed-over VMs to the new live production servers. Once the failover is committed and the production site has been restored, the DR servers will need to be failed back to production in order to keep any changes made on the servers while failed over. To do this, Reverse Replication will need to be enabled to replicate the changes from the target site back to the production site. The Auto-Rollback option allows you to designate a time after the Live Failover (default is 10 minutes) for the failover to be rolled back to production. This works similarly to a Test Failover, as you have a window to test your servers and applications and then undo the changes. The rollback also discards any changes that were made on the servers while at the DR site and does not require reverse replication. If you set 'None' for the commit policy, you will have the option to either Rollback or Commit the failover at a later time. This may be used in a situation where your production site is down but could possibly be brought back online quickly. You can commit the failover if you do not foresee production coming back online soon. However, if the problem is fixed quickly, you can perform a Rollback instead.
 
Figure: Commit Policy Options
 
2. VM Shutdown Option. The VM Shutdown option can be set to No, Yes or Force. Setting this option to 'No' will prevent Zerto from shutting down the production servers during the Live Failover process. If you set VM Shutdown to 'Yes', the servers will be gracefully shut down using the "Shutdown Guest Operating System" option in VMware. However, this process, and with it the Failover, will fail for a VM that does not have VMware Tools installed. Finally, you can select the Force option, which will forcefully shut down a server even if VMware Tools is not installed.
 
Figure: VM Shutdown Options
 
3. The Reverse Protection Option. Reverse Protection can only be enabled if the Failover is configured with an Auto-Commit policy, or is later committed after the Failover process. To enable Reverse Protection, check the box for the VPG under the Reverse Protection column. Next, click the "REVERSE" link to configure the Reverse Protection settings. This opens a wizard similar to the VPG Creation wizard, where you will need to set a Host, Datastore and Network to be used for the reverse replication. Once Reverse Protection is configured, the production server will be powered down and unregistered from your VMware environment. Reverse Protection will also overwrite the data on the original production server. In effect, the original server is used as a seed for the Reverse Replication.
 
Figure: Enabling Reverse Protection
 
Figure: Host and Datastore options need to be set to configure Reverse Replication
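To make the interaction between these three settings easier to reason about, here is a minimal Python sketch that models them. It is not the Zerto API: the class `LiveFailoverPlan`, its fields and its methods are hypothetical names invented for illustration, and the rules it encodes are only the ones described in this post (default wait times, the VMware Tools requirement for a graceful shutdown, and the commit requirement for Reverse Protection).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class CommitPolicy(Enum):
    AUTO_COMMIT = "Auto-Commit"
    AUTO_ROLLBACK = "Auto-Rollback"
    NONE = "None"


class VmShutdown(Enum):
    NO = "No"
    YES = "Yes"
    FORCE = "Force"


# Default wait times in minutes, as quoted in this post.
DEFAULT_WAIT_MINUTES = {
    CommitPolicy.AUTO_COMMIT: 0,
    CommitPolicy.AUTO_ROLLBACK: 10,
}


@dataclass
class LiveFailoverPlan:
    """Hypothetical model of the Live Failover execution parameters."""
    commit_policy: CommitPolicy
    vm_shutdown: VmShutdown
    reverse_protection: bool = False
    committed_manually: bool = False  # a commit issued by hand after the failover

    def will_be_committed(self) -> bool:
        return (self.commit_policy is CommitPolicy.AUTO_COMMIT
                or self.committed_manually)

    def state_after(self, minutes: int) -> str:
        """Failover state `minutes` after it starts, using the default waits."""
        if self.committed_manually:
            return "committed"
        wait = DEFAULT_WAIT_MINUTES.get(self.commit_policy)
        if wait is None or minutes < wait:
            # Policy 'None', or the timer has not expired yet.
            return "waiting for commit or rollback"
        if self.commit_policy is CommitPolicy.AUTO_COMMIT:
            return "committed"
        return "rolled back"

    def problems(self, vmware_tools_installed: bool) -> List[str]:
        """List the conflicts this post warns about for a single VM."""
        issues = []
        if self.reverse_protection and not self.will_be_committed():
            issues.append(
                "Reverse Protection needs Auto-Commit or a later manual commit")
        if self.vm_shutdown is VmShutdown.YES and not vmware_tools_installed:
            issues.append(
                "graceful shutdown needs VMware Tools, so the failover fails")
        return issues


plan = LiveFailoverPlan(CommitPolicy.AUTO_ROLLBACK, VmShutdown.YES,
                        reverse_protection=True)
print(plan.state_after(5))
print(plan.problems(vmware_tools_installed=False))
```

In this sketch, an Auto-Rollback plan with Reverse Protection ticked and a graceful shutdown on a VM without VMware Tools reports both conflicts, which is the kind of combination worth catching before you click the Failover button.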
 
If a Live Failover is initiated as a test with the Auto-Commit policy enabled, or if the test is later committed manually, you run the risk of impacting your production environment. Once the commit is completed for the failover, your production servers and the failed-over servers may both be online. From this point, you have two options to return to the normal replication setup: fail back to production, or delete your VPGs and recreate them in Zerto.
 
To fail back to production, reverse replication will need to be enabled. When this happens, if the production servers were not already powered off, they will now power down and be unregistered from your production vCenter environment. The failed-over servers will now be running live at the DR site, meaning the public DNS records that customers or users rely on to reach the servers will need to be updated to the DR site public IP addresses. Also, if the full environment was not failed over, you would need to create access between the production and DR sites. This may not be possible with an IPSec VPN tunnel, as both sites could be using the same IP subnets. Any changes made during this time will also be replicated back to your production environment, possibly generating unwanted changes on production databases or records. Once the reverse replication has completed a Delta Sync and displays a Meeting SLA status, you can fail back to production. This will promote the original production environment to be the live production environment again.
 
If you do not want to enable reverse replication and bring down your live production environment, the other course of action will be to delete the failed over servers running at the DR site. Once this is done, you can then delete the VPGs in Zerto and then recreate the VPGs. This will require a new initial sync to resume protection and there will be a time window in which you would not be fully protected anymore.
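The two recovery paths above can be summarized as a short Python sketch. The function names are hypothetical and nothing here calls Zerto; it simply encodes the order of steps described in this post, along with the "Meeting SLA" check that gates the failback.

```python
from typing import List


def ready_to_fail_back(delta_sync_complete: bool, vpg_status: str) -> bool:
    """Failback is only safe once the Delta Sync is done and the VPG meets SLA."""
    return delta_sync_complete and vpg_status == "Meeting SLA"


def steps_to_restore_replication(use_reverse_replication: bool) -> List[str]:
    """Return the ordered steps for the chosen recovery path."""
    if use_reverse_replication:
        return [
            "Enable reverse replication (production VMs power down and unregister)",
            "Update public DNS records to the DR site public IP addresses",
            "Wait for the Delta Sync to complete and the status to show Meeting SLA",
            "Fail back to production",
        ]
    return [
        "Delete the failed-over servers running at the DR site",
        "Delete the VPGs in Zerto",
        "Recreate the VPGs",
        "Wait for the new initial sync (not fully protected until it completes)",
    ]


for step in steps_to_restore_replication(use_reverse_replication=False):
    print(step)
```

Either path ends with a period of extra work or reduced protection, which is the cost a Test Failover avoids.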
 
To summarize, iland recommends using the Test Failover process so that the test can easily be undone and your production environment is not impacted. While you can run a Live Failover to test your DR solution, it can have a much greater impact on production, depending on the options detailed above.
 
Mike Mosley

Mike Mosley is a cloud engineer at iland and has worked at the company for over three years. He holds a number of VMware certifications, including VCP5, as well as the Veeam VMCE certification. Mike works closely with customers to build cloud solutions that fit their requirements.