Backup testing: What to test, when to test, how often to test
Posted by Damien Biddulph on Wed 11th Oct 2017
Data protection is an essential component in any data management strategy, and one that all system and storage administrators should fully embrace.
We take backups for various reasons: hardware can fail, software has bugs, and users make mistakes and delete or change data unintentionally.
There is also the risk of deliberate and malicious attempts to destroy or encrypt data for financial gain or to “get back” at a previous employer.
People say you only find out how good your insurance cover is when you make a claim. With backups, we don’t want to wait until we need to restore data to find out whether our backups are any good.
Data recovery can be a stressful scenario that doesn’t need the additional pressure of worrying whether backups are valid or not.
The solution, of course, is to test that backups have worked by restoring data.
Historically, this was a difficult and time-consuming task that was limited in terms of what was possible.
When there was a physical server for each application, restoring data meant having additional hardware on which to perform the restore process. It was not possible or practical to recover to the production environment in anything other than a limited way.
As a result, a full restore of an entire platform was rarely done. Another reason was the risk of the restored system conflicting with the production one, of which more later.
But with the widespread adoption of virtualisation, things have become much easier.
A virtual machine (VM) is just a set of files that contain the operating system and data of the VM, plus details on VM configuration (processor count, memory, network, and so on). This means that a VM can easily be recovered from backups and powered up to validate that the application can be recovered and made accessible.
It is worth remembering that testing the restore of an application serves two purposes. First, it validates that the restore does actually work. Second, it provides a benchmark to ensure that the recovery process can be completed within agreed service levels, mainly recovery time objectives (RTOs).
The results of regular testing can be reported back to the business to show that application recovery targets can be met, or to prompt a review of those targets if the process cannot be completed in time.
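The benchmarking purpose lends itself to a small script. What follows is a minimal Python sketch rather than anything tied to a particular backup product: `RestoreResult`, `timed_restore_test` and the `restore` callable are hypothetical names introduced for illustration, and the callable stands in for whatever command or API call actually performs the restore in a given environment.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreResult:
    application: str
    elapsed_seconds: float
    rto_seconds: float
    succeeded: bool

    @property
    def within_rto(self) -> bool:
        # A failed restore never meets the objective, however quickly it failed.
        return self.succeeded and self.elapsed_seconds <= self.rto_seconds


def timed_restore_test(
    application: str,
    restore: Callable[[], bool],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RestoreResult:
    """Run a restore callable, time it, and compare the duration with the RTO."""
    start = clock()
    try:
        succeeded = bool(restore())
    except Exception:
        # Any error raised by the restore step counts as a failed test.
        succeeded = False
    elapsed = clock() - start
    return RestoreResult(application, elapsed, rto_seconds, succeeded)
```

Keeping the `elapsed_seconds` value from each run gives a trend over time, which is the kind of evidence that can be taken back to the business when recovery targets are reviewed.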
Backups: What to test?
At this point, we should think about exactly what we want to recover as part of a test. There are multiple levels to consider:
File recovery – Can I recover individual files from the backup? This process is easy to apply to physical and virtual servers, as well as backups of file servers. The choice of data to recover really depends on what data is being stored. It could make sense to recover the same file each time, or to recover new data each time. Automation can have a benefit here, which we will cover later.
VM recovery – Can I bring back a virtual machine and power it up? This is clearly one for virtual environments, rather than physical ones. Recovering a virtual machine image is relatively easy, but consideration has to be given to where the VM will be powered up. Starting the VM in the same production environment brings up immediate issues of network IP address conflicts, and security identifier (SID) conflicts for Windows systems. There may also be conflicts with whatever application services the VM offers. The alternative is to power up the VM in an isolated environment (which can be done using a “DMZ” subnet on the hypervisor) and provide access only through that DMZ network. Be aware that powering up recovered VMs with new IDs may have an impact on application licensing. Check with your software provider on what the terms and conditions allow.
Physical recovery – Physical server recovery is more complex and depends on the configuration of the platform. Some servers may boot from SAN, whereas others may have local boot disks. The recovery process then depends on the configuration. Recovering an application to alternative hardware removes a lot of risk, but it does not fully represent the recovery process. Recovering an application to the running hardware means an outage and so the test is likely to have more risk and be carried out less frequently.
Data recovery – Depending on the backup process, data recovery can be an option in testing. For example, if data in a database is backed up at the application level (rather than the entire VM), then data can be restored to a test recovery server and accessed in an isolated environment.
Application recovery – Full application testing can be more complex because it relies on understanding the relationships between individual VMs and physical servers. Again, recovering a suite of servers as part of full application testing is best done in an isolated environment with separate networking.
It is clear that more extensive testing carries more impact and risk, but it also provides more reassuring results. Choosing a recovery test scenario depends on the backup and restore methodology in use. If the recovery process is to restore an entire VM, then that is what the test needs to do. If the recovery process means rebuilding a VM and recovering the data, then that is what the test process should reflect.
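For the simplest level in the list above, file recovery, one way to confirm that a restored file really matches what was backed up is to compare checksums. The Python sketch below assumes that a SHA-256 digest for each file was recorded at backup time and is available as a mapping of relative path to digest. That mapping, and the function names `sha256_of` and `verify_restored_files`, are assumptions made for illustration and not a feature of any specific backup tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restored_files(restore_root, expected: Dict[str, str]) -> List[str]:
    """Compare restored files with expected digests and return a list of problems.

    `expected` maps a path relative to `restore_root` to the SHA-256 hex digest
    recorded when the backup was taken (an assumed input).
    """
    problems = []
    for relative_path, wanted in expected.items():
        restored = Path(restore_root) / relative_path
        if not restored.is_file():
            problems.append(f"{relative_path}: missing")
        elif sha256_of(restored) != wanted:
            problems.append(f"{relative_path}: checksum mismatch")
    return problems
```

An empty list means every file in the sample was restored intact. This only suits data that is not expected to change between the backup and the check, so it belongs in a restore to a scratch location rather than a comparison against live files.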
Backups: How often to test
How often should testing be performed? In an ideal world, a test would be scheduled after every backup to validate that the data has been successfully secured. This is not always practical, so there is a trade-off to be made between the impact and effort of a test recovery and the degree of confidence it gives in the restore.
As a minimum, there are four options:
As part of a regular cycle (for example, monthly). Schedule a restore test for each application on a regular interval.
When an application changes significantly (for instance, through patches or upgrades). Schedule a more comprehensive restore test when significant changes have been made to an application, such as an upgrade to a new software release, a major patch package or an operating system change.
When application data changes significantly. If an application has a regular import of data from an external source, for example, performing a test restore can help validate timings for data recovery.
When a new application is built. This means testing the restore of a new VM or server when first created. This may seem excessive, but it makes sense to ensure that the server/VM has been added to the backup schedule.
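The four triggers above can be expressed as a simple rule check. The following Python sketch is illustrative only: the `AppState` fields, the 30-day interval and the 25% data-change threshold are assumptions chosen for the example, not recommendations from any standard or product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class AppState:
    name: str
    last_test: Optional[date]      # None if no restore test has ever been run
    last_change: Optional[date]    # date of the last patch, upgrade or OS change
    size_at_last_test_gb: float
    current_size_gb: float


def reasons_to_test(
    app: AppState,
    today: date,
    interval_days: int = 30,          # assumed monthly cycle
    change_threshold: float = 0.25,   # assumed: 25% change in data volume
) -> List[str]:
    """Return the reasons, if any, why a restore test is due for this application."""
    if app.last_test is None:
        return ["new application: never restore-tested"]
    reasons = []
    if today - app.last_test >= timedelta(days=interval_days):
        reasons.append("regular cycle elapsed")
    if app.last_change is not None and app.last_change > app.last_test:
        reasons.append("application changed since last test")
    if app.size_at_last_test_gb > 0:
        change = abs(app.current_size_gb - app.size_at_last_test_gb) / app.size_at_last_test_gb
        if change >= change_threshold:
            reasons.append("data volume changed significantly")
    return reasons
```

Data volume is used here as a rough proxy for "application data changes significantly". A site with a known import schedule could just as well record the date of the last bulk import instead.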
The ability to test recovery can be significantly improved by the use of automation. At the most basic level, this can mean scripting the restore of individual files. But more complex testing can be done with the use of software tools, many of which are integrated into backup software products.
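At the basic scripting level, one decision to automate is which files to restore in each test run: the same known files every time, a fresh selection, or both. The Python sketch below combines the two approaches. It assumes the backup catalogue can be exported as a list of file paths, and `pick_restore_sample` is a hypothetical helper name. The restore itself is left out because it depends entirely on the backup product in use.

```python
import random
from typing import Iterable, List


def pick_restore_sample(
    catalogue: Iterable[str],
    fixed_paths: Iterable[str],
    sample_size: int,
    seed,
) -> List[str]:
    """Choose files for a restore test: fixed canary files plus a random sample.

    Using a seed such as the test date makes the selection reproducible, so a
    failed test can be rerun against exactly the same files.
    """
    fixed = list(fixed_paths)
    rng = random.Random(seed)
    # Sort the candidates so the result does not depend on set ordering.
    candidates = sorted(set(catalogue) - set(fixed))
    count = min(sample_size, len(candidates))
    return fixed + rng.sample(candidates, count)
```

The fixed files give a like-for-like comparison from one run to the next, while the random sample gradually widens coverage across the backup set.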
Veeam and Zerto are two companies that provide the ability to automate the testing of restores without affecting the production environment.
Suppliers such as Rubrik and Cohesity offer dedicated hardware platforms that manage backup data and can be used as a temporary datastore for recovered VMs. This allows recovery to be scripted and automated relatively easily.
These solutions are mostly focused on VM recovery, so more complex scenarios (such as recovering a Microsoft Exchange platform) may need additional manual steps, especially to confirm the application is actually working. It therefore helps to define in advance what successful recovery looks like: either the ability to get back individual files or, at the most detailed level, the ability to access the application being recovered.
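A definition of success can be written down as a set of named checks that are run against the recovered system. The Python sketch below is one possible shape for that, under stated assumptions: `port_is_open` and `evaluate_recovery` are hypothetical helpers, and a TCP connection test is only a crude stand-in for "the application is accessible". A real check for something like a mail platform would need to go further, for example by logging in and opening a mailbox.

```python
import socket
from typing import Callable, Dict


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def evaluate_recovery(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check and record pass or fail; a raised error counts as a fail."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results
```

The recovery counts as successful only if every check in the agreed set passes. The host addresses and ports used in any real check are site-specific and should point at the isolated test network, not production.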
As we move into a hybrid cloud world with increasing use of containers, backup testing offers both challenges and opportunities. Having public cloud as a backup target allows applications to be recovered and tested in the cloud, reducing on-premises costs. Containers represent a new application deployment paradigm, so they will bring their own challenges around backup and restore. As we move forward, the fundamentals remain the same: check your backups regularly and ensure recovery processes are well documented.