Part 1 of 2.
In today’s world of technology, we are often lulled into believing that our systems are always available. We check our email from anywhere, access files from anywhere, and connect to friends and business colleagues from anywhere. This ubiquitous, always-on capability gives us a false sense of security and availability. So when the system crashes, we are surprised and frustrated, and we immediately ask, “How can this happen?” It is in these moments that the recovery process takes center stage.
The very word Backup hints at its real purpose: getting a system BACK UP, that is, returned to operational status. System recovery is a complex topic, so here we will focus on the core principles behind a good data recovery process.
What is Data Backup?
Simply put, it is a protected copy of system information and data as it existed at the moment the backup was taken. This can also be referred to as “Live State”: if we performed a restore, the system would recover to that specific moment in time. We use backups to restore business operations after data loss, a hardware or software failure, or, ultimately, a disaster.
When we think about backups, who are the stakeholders and what are their expectations for data recovery?
Management, customers, vendors, and employees are all affected by system failures and depend on the data recovery process. Two terms must be defined here: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
With RPO, we ask the question: “What is the last known good point in time from which I can recover the system?” The recovery point is critical because it tells us how much data, if any, we will lose when we restore. After a simple hardware failure, once the hardware is repaired, we restore the last good system copy and resume operation. However, after a data corruption issue, such as a database corruption or virus attack, RPO becomes a critical element in recovery, because the most recent copy may already contain the damage. To understand how backup works, think of RPO as “Snapshots” in time. How many times a day do I take a system snapshot? Is it a full snapshot or only the incremental data changes? How many days, weeks, or months can I go back if I have to roll back to a point in time before the corruption? It is important to review your RPO with key stakeholders to establish a recovery point objective that meets expectations.
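To make the snapshot idea concrete, here is a minimal Python sketch of how a restore point and the resulting data loss could be worked out. The function names, the four-hour snapshot schedule, and the timestamps are all hypothetical and chosen only for illustration; a real backup product will report these values in its own way.

```python
from datetime import datetime, timedelta

def latest_snapshot_before(snapshots, moment):
    """Return the most recent snapshot taken at or before `moment`, or None."""
    candidates = [s for s in snapshots if s <= moment]
    return max(candidates) if candidates else None

def data_loss_window(snapshots, failure_time, corruption_time=None):
    """Return (restore_point, data_lost) for a failure.

    For a plain hardware failure, we restore the last snapshot before the
    failure. If corruption began earlier, we must roll back to the last
    snapshot taken before the corruption started.
    """
    cutoff = corruption_time if corruption_time is not None else failure_time
    restore_point = latest_snapshot_before(snapshots, cutoff)
    if restore_point is None:
        return None, None  # no usable snapshot exists
    return restore_point, failure_time - restore_point

# Hypothetical schedule: one snapshot every 4 hours for a single day.
day = datetime(2024, 1, 15)
snapshots = [day + timedelta(hours=h) for h in range(0, 24, 4)]

# Case 1: simple hardware failure at 14:30.
failure = day + timedelta(hours=14, minutes=30)
point, lost = data_loss_window(snapshots, failure)
print(point, lost)  # 2024-01-15 12:00:00 2:30:00

# Case 2: corruption began at 09:15 but was only noticed at 14:30.
corruption = day + timedelta(hours=9, minutes=15)
clean_point, clean_lost = data_loss_window(snapshots, failure, corruption)
print(clean_point, clean_lost)  # 2024-01-15 08:00:00 6:30:00
```

In this example the same snapshot schedule loses two and a half hours of work after a hardware failure, but six and a half hours after a corruption that went unnoticed, which is why the corruption case drives the RPO discussion with stakeholders.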
RTO is often overlooked until a crash happens. It is good practice to ask, “If we went down, how long would it take for us to restore the system to operational status?” This length of time must match the expectations of all the stakeholders. It should not be left to system administrators to determine what is acceptable. Instead, I always suggest that each stakeholder be surveyed to determine their expectations of a recovery time objective. The results of that survey provide the data needed to design Backup and Restore processes that will meet expectations.
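The stakeholder survey can be sketched the same way. The short Python example below compares an estimated restore time against each group's expectation. The survey figures, the three-hour restore estimate, and the choice to treat the strictest expectation as the target are assumptions made for illustration, not a prescribed method.

```python
from datetime import timedelta

def strictest_rto(expectations):
    """Return the shortest recovery time any stakeholder expects."""
    return min(expectations.values())

def unmet_stakeholders(estimated_restore, expectations):
    """List the stakeholders whose expectation is shorter than the estimate."""
    return sorted(
        name for name, expected in expectations.items()
        if expected < estimated_restore
    )

# Hypothetical survey results: acceptable downtime per stakeholder group.
survey = {
    "management": timedelta(hours=4),
    "customers": timedelta(hours=2),
    "vendors": timedelta(hours=8),
    "employees": timedelta(hours=4),
}

# Hypothetical estimate of how long a full restore takes today.
estimated_restore = timedelta(hours=3)

target = strictest_rto(survey)
unmet = unmet_stakeholders(estimated_restore, survey)
print(target)  # 2:00:00
print(unmet)   # ['customers']
```

Here a three-hour restore satisfies management, vendors, and employees but not customers, so either the restore process must get faster or the expectation must be renegotiated.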
At this point, it is important to mention backup MEDIA. Magnetic tape has been around for many decades. It has proven to be a source of both relief and frustration for many system administrators. Most of us in the IT world have “Scar Tissue” from a tape that verified properly, but ultimately failed when we needed it the most, during a recovery. In today’s world, advancements in the use of “Disk Media” should have eliminated the idea of using tape media. With the use of hard drives as a primary backup target, the ability to take many system snapshots during a working day becomes a viable option. The ability to copy these snaps to an offsite location is simplified. And of course, the recovery time is reduced dramatically as recovery from hard disk is many times faster and more reliable than tape.
Again, Data Backup is about RECOVERY.
Each organization can ask simple questions and provide answers that will ultimately drive the processes and costs. As RPO and RTO expectations are established, the costs associated with them will become apparent. Generally speaking, shortening either the RTO or the RPO will increase the cost of data recovery.
Finally, one must ask another question: “In the event of a dislocation, such as a building fire or hurricane, where will I reconstitute my operational status, and how will it be done?”
We answer those questions in Backup and Disaster Recovery, Part 2.