Disaster Recovery Testing: How Do You Know It Works?

When I present to groups around the country, I am always asked, “What is the best recovery solution out there?” My answer is always the same: which solution you choose matters less than whether you have TESTED it. Not just sent a file or two, or restored a subset of data, but actually tested the solution under conditions as close to real-world scenarios as possible.

Thousands of dollars are spent every year on disaster recovery and business continuity, and yet businesses large and small barely test the solution to make sure it will really work. Unless you test realistically, the kinks in your solution stay hidden until the moment you really need things to work: after the disaster has occurred.

During Katrina, a large hospital in New Orleans scrambled to send armed security guards back into the calamity to retrieve backup tapes, only to discover that data had never been properly written to the tapes. No one knew how long the backup process had been broken, and the hospital had to start almost from scratch, retrieving patient information from paper files.

Another company had redundant generators that performed exactly as designed when the town flooded and the utility power dropped. The backup generators and UPS delivered uninterruptible power to the data center servers without hesitation, but then the room started getting warmer and warmer. The company realized that it had neglected to include the power connections for the air conditioning on the backup generator circuits. The data center overheated, and they were forced to shut it down to protect the equipment.

Testing has the side benefit of making better technicians, managers, and employees. Most technicians have never built the system they maintain from scratch. In a test they can try new things, review their current day-to-day processes, and in many cases apply new knowledge to daily activities. Lab environments are an amazing learning tool because you can start over again and again without losing revenue to server downtime. Think about it: you are providing a lab in which the IT team recreates the entire IT business, so learning is all but unavoidable.

Think your plan is good? Test it. Find the kinks, find out whether your processes work the way you thought they did, and make sure your team knows what to do when the kinks do appear. Imagine if your local fire department never tested the fire hoses. Would you feel good about them keeping your house from burning down?

By: Steve O'Neal