The Continuity Tools Explained: Replication

By Brandon Cottrell, April 29, 2016
This is the third post in our brief series on continuity terminology. The first covered Snapshots, and the second covered Backup. As noted, this guide is a simple introduction to each of these concepts and explains the differences in clear terms. We will discuss what each one is, what it isn’t, and how you might want to use it in your environment.

So – let’s get to the last one – Replication.

iland offers several solutions to help you in your time of need. Snapshots work as a save point that you can create before a large change and return to if needed. Our 7-day rotational backups run once a day in all of our ECS environments and allow you to restore a machine to a previous state. But the most powerful and flexible solution is by far our true Disaster Recovery (DR) offering. If you want the peace of mind that comes with true DR, replication is the way to go.

Replication – What it is:

In the simplest terms, Replication is a streaming backup of your entire production environment. Once your data has been pushed to us, and the initial seeding is complete, data is streamed continuously to your DR environment, ensuring that the replicated VMs are as close to their production doppelgangers as possible. Typically, this ensures you can bring up an environment that is minutes, sometimes even seconds, behind your live environment in the event of a true disaster.

iland partners with Zerto for replication in our Enterprise Cloud Services (ECS) environments.  A typical installation consists of two essential parts: A Virtual Replication Appliance (VRA) installed on each host, and a Zerto Virtual Manager (ZVM) server that allows centralized communication and management both locally and site-to-site.

After the initial install, your entire environment begins streaming over to your DR site through a pre-configured, secure VPN tunnel. Replication has minimal impact on the performance of your VMs. Since it is performed at the host level and allows multiple VMs to be streamed at a time, syncing your entire environment is seamless, quick, and easy.

Once the initial sync (seeding) is complete, Zerto continues syncing changes to “catch up” to your live environment. Once it has caught up, it begins to build a journal of restore points (4 hours’ worth, by default). You can choose any one of hundreds of points in that window to restore your environment to, which lets you select the most appropriate version of your environment to bring back up after a disaster. Replication continues streaming in the background from this point on, making sure that your DR site meets your required Recovery Point Objective (RPO): the number of minutes or seconds your DR environment is “behind” your production environment.
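To make the journal and RPO ideas a little more concrete, here is a minimal Python sketch. It is not Zerto's API or how the product works internally; the function names, the 5-minute checkpoint spacing, and the timestamps are all made up for illustration. It only shows the arithmetic: RPO is the gap between "now" and the newest restore point, and after an incident you would normally pick the newest restore point that is still inside the journal window and predates the problem.

```python
from datetime import datetime, timedelta

# Default journal history described above (4 hours).
JOURNAL_WINDOW = timedelta(hours=4)


def current_rpo(now, newest_checkpoint):
    """How far the DR copy is 'behind' production right now."""
    return now - newest_checkpoint


def pick_restore_point(checkpoints, now, incident_time):
    """Newest checkpoint still inside the journal that predates the incident."""
    oldest_allowed = now - JOURNAL_WINDOW
    candidates = [c for c in checkpoints if oldest_allowed <= c < incident_time]
    return max(candidates) if candidates else None


# Hypothetical data: a checkpoint every 5 minutes, the newest one 10 seconds old.
now = datetime(2016, 4, 29, 12, 0, 0)
checkpoints = [
    now - timedelta(seconds=10) - timedelta(minutes=5 * i) for i in range(60)
]

# Suppose something went wrong in production at 11:42.
incident = datetime(2016, 4, 29, 11, 42, 0)

print("RPO:", current_rpo(now, max(checkpoints)))
print("Restore to:", pick_restore_point(checkpoints, now, incident))
```

If the incident happened before the start of the journal window, `pick_restore_point` returns `None` in this sketch, which mirrors the point that the journal only reaches back as far as its configured history.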

To prepare for a disaster, the technology lets you configure Virtual Protection Groups (VPGs) that organize and plan your failover. When failing over, these VPGs are what you select to bring up in your DR environment. VPGs allow you to restore clusters of the VMs most integral to your infrastructure with a single click, instead of restoring one VM at a time. They also allow the machines to be “pre-configured” to work in the new environment, with new network settings that match your DR site. Both of these are HUGE time savers in an actual disaster and allow you to move faster and focus on getting your environment back online as quickly as possible.
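One way to picture a VPG is as a named group of VMs, each carrying the settings it should use at the DR site. The Python sketch below is purely illustrative: the class names, the `boot_order` field, and the IP addresses are assumptions invented for this example and are not taken from Zerto or the iland console.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedVM:
    name: str
    boot_order: int  # hypothetical: lower numbers start first
    dr_ip: str       # hypothetical: address the VM uses at the DR site


@dataclass
class ProtectionGroup:
    name: str
    vms: list = field(default_factory=list)

    def failover_plan(self):
        """List (VM name, DR address) pairs in the order they would start."""
        ordered = sorted(self.vms, key=lambda vm: vm.boot_order)
        return [(vm.name, vm.dr_ip) for vm in ordered]


# Hypothetical group: the database should come up before the app server.
group = ProtectionGroup(
    "web-tier",
    [
        ProtectedVM("app01", boot_order=2, dr_ip="10.20.0.11"),
        ProtectedVM("db01", boot_order=1, dr_ip="10.20.0.10"),
    ],
)

for vm_name, address in group.failover_plan():
    print(f"start {vm_name} with DR address {address}")
```

The point of the sketch is the "single click" idea: the decisions about which machines belong together and how they are networked at the DR site are made ahead of time, so failing over the group means executing a plan that already exists.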

But just having a good DR plan using replication isn’t enough. You need to be able to prove that it actually works, and you and your team need to be familiar with how things will go in case of a disaster.

Replication – How it works:

Once replication is up and running in your environment, you’ll have the option to fail over in one of two ways:

  • Live Failover – This is the failover designed for a full-on DR event. A live failover assumes that your production environment is down and that the VMs being failed over are now the primary VMs. Once you have failed over, replication from your production site stops, and it must be reconfigured before it can continue, which means re-seeding the data. This makes sense in a true DR event, since the original environment would be destroyed or inaccessible. It is not a good way to test your DR plan, since it breaks the replication. It can also wreak havoc on a functional live environment, depending on the settings you select (powering off live machines, for example).
  • Failover Test – A failover test is the only option you should ever choose outside of an actual disaster. It brings up the VMs that have been replicated to your DR environment and lets you do whatever testing you need without impacting anything in your production environment. Once you end the test, the VMs are removed, and the data begins syncing again. You can test any time you like, and the process is very straightforward.
Either way, reports are available in the console that show the results of the failover once it has completed.
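The difference between the two options can be summarized as a small state model. This Python sketch is only a way of restating the two bullets above; the class and function names are hypothetical and do not correspond to any real console or Zerto call.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DrState:
    dr_vms_accessible: bool = False      # replicated VMs cannot be logged into
    may_affect_production: bool = False  # e.g. powering off live machines
    reseed_required: bool = False        # replication must be rebuilt afterwards


def live_failover(state):
    """Full DR event: DR VMs become primary and replication is broken."""
    return replace(
        state,
        dr_vms_accessible=True,
        may_affect_production=True,
        reseed_required=True,
    )


def start_failover_test(state):
    """Test: DR VMs come up, production is left alone."""
    return replace(state, dr_vms_accessible=True)


def end_failover_test(state):
    """Ending a test removes the test VMs again."""
    return replace(state, dr_vms_accessible=False)


idle = DrState()
print("after live failover:", live_failover(idle))
print("during test:        ", start_failover_test(idle))
print("after test ends:    ", end_failover_test(start_failover_test(idle)))
```

In this model a test is a round trip back to the starting state, while a live failover leaves you with work to do before replication can resume. That is the reason testing should always use the test option.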

To sum up:

  • Replication is the preferred method for DR for a few reasons.
    – First, it constantly streams data from your live environment to your DR site. This allows the data in your DR site to be much closer in time to the data in your live environment.
    – Second, since data is replicated at the host level, it allows for syncing of multiple VMs at a time, instead of the “one job at a time” limitations found in most Backup solutions. Plus, it has a minimal impact on your production environment, and the performance of the VMs being replicated.
  • By default, replication keeps a history of restore points stretching back 4 hours. You can fail over to any of these points at any time.
  • VMs that are replicated are only accessible during a Failover Test, or after a Live Failover. While they are being replicated, they will not be accessible or configurable.
  • VMs that are involved in a current failover test will not be backed up as part of the daily backups that run in our ECS environments, since they are intended to exist temporarily.
  • Replication at iland can be controlled and managed via the Console, on the Continuity tab. You can perform Failover Tests, Live Failovers, and view any information or reports you need, all in one centralized location. This allows you to initiate a failover even if your production environment is unavailable, without having to call our support team to kick it off for you (though they are glad to do so).

Replication – What it isn’t:

  • Replication is not a good choice for file level backup. While it is possible to recover OS level files from a failed-over VM, using replication as a file system backup is like using a bazooka to fish. It might technically work, but it’s a mess to deal with, and you are kind of missing the point.
  • Replication is not something that will help you recover a single lost machine in your production environment. Not only is replication used to cover groups of machines, but it’s also designed to spin them up in another physical location, so it won’t help you if one of your local VMs goes bad during an update.
  • Replication does not create backup copies of VMs that you can log into. VMs in your DR site are only accessible once they have been failed over, either in a live failover, or in a test.
Brandon Cottrell

Brandon is the Product Support Specialist for the Development team at iland and has worked at the company for 2 years. He holds a number of VMware certifications, including VCA-DCV and VCA-Cloud. Brandon works closely with customers to incorporate their feedback and suggestions to ensure the effectiveness and usability of our Cloud Management Console.