Disaster Recovery Testing: Network Outages Are Like Radio Silence On The Far Side Of The Moon

Do you have a resilient, tested solution for communicating during a network recovery? Think back to the tense moments of the Apollo 11 mission. For 48 minutes of each orbit there was complete radio silence while the spacecraft passed behind the far side of the moon. Picture the engineers, computers and logistics that came to a standstill while everyone waited to hear from the spacecraft as it emerged on the other side.

Many businesses face a similar radio silence when they try to reach their branch offices during a disaster. Without network communication, engineers, computers and logistics all slow down because no one knows the status of the other facilities. Was there any damage? Do employees need help getting desktops and servers running again? What works and what doesn't? You must test your communications, and as with any test, the more closely it reflects real-world conditions, the better.

Some companies plan to rely on satellite phones during network outages but don't realize that these phones typically need a clear view of the sky, so they often won't work inside a building or a car. Then there's the hurdle of finding a power source to recharge the phone. Problems like these can surface at the most critical point of the recovery if you don't have a tested communication solution in place.

I have also seen companies avoid rerouting voice and data circuits for fear of causing problems when swinging them back to production. That is a valid concern, but I would argue it is exactly why you test.

The more practice you get uncovering the problems involved in rerouting circuits, the more agile your organization becomes and the more experienced your network technicians become. When a disaster hits, you don't want to be guessing about what will work and what won't.

Network testing should also include running applications across other types of communication circuits, such as MPLS, Metro Ethernet, wireless point-to-point and satellite. Testing these circuits gives you multiple options if a disaster strikes. Things rarely go exactly as planned, so you need different solutions for different scenarios.

For example, you might need to communicate over satellite for the first 72 hours of a disaster because the local communications infrastructure is down. Then, over the following week, some data or voice circuits may come back up, letting you transition to less costly options.

Without testing, your network technicians will walk into every disaster response meeting like deer in the headlights, unable to say with confidence which solutions will or will not work.

What if NASA had not run the Apollo communications exercises before liftoff? What would have happened each time the engineers lost contact with the spacecraft? I suspect there would have been a flurry of button pushing on both ends as engineers frantically tried to restore communications, followed by a long, tense stretch of trying to get things back to normal.

Test your network recovery, give your technicians the experience they need to build confidence, and learn from it! You may be amazed at how many of the lessons from testing apply to your day-to-day activities.

By: Steve O'Neal