Meeting or exceeding your availability requirements is a matter of assessing, designing and implementing the correct solution, right?
Well, it starts there, but the real secret is staying diligent after you put your solution in place. Many organizations initially do a great job of eliminating single points of failure, but years later nobody remembers how it all works. Even with the best documentation, this erosion of knowledge still takes place.
I have witnessed things that will make your hair curl. In the most recent case, a well-known disk array manufacturer lost three drives in a RAID 6 set at the same time. This occurred after the transfer switch from the generator failed to close completely, which drained the uninterruptible power supply (UPS) and dropped the array cold.
So, let's start at the beginning. The generator/UPS combination was designed to be up and running within 30 seconds of a power failure, and it was tested weekly. Over a five-year period, this combination failed only once in more than 300 tests, including real outages. That one failure prompted a load bank test and many mock utility outages, and the solution performed flawlessly in all of them. But just in case it did not work, scripts were written to shut down all servers if the UPS battery fell below five minutes of runtime.
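To make the idea of those shutdown scripts concrete, here is a minimal sketch in Python. It is not the script from this story; the names `UpsStatus`, `should_shut_down` and `monitor` are hypothetical, and a real script would read the battery state from whatever interface your UPS vendor provides. Only the five-minute threshold comes from the account above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# From the story above: shut down when the battery falls below five minutes.
SHUTDOWN_THRESHOLD_MINUTES = 5.0


@dataclass
class UpsStatus:
    """One reading from the UPS (hypothetical structure for illustration)."""
    on_battery: bool        # True when neither utility nor generator power is feeding the UPS
    runtime_minutes: float  # estimated battery runtime remaining


def should_shut_down(status: UpsStatus,
                     threshold: float = SHUTDOWN_THRESHOLD_MINUTES) -> bool:
    """Return True only when running on battery AND runtime is below the threshold."""
    return status.on_battery and status.runtime_minutes < threshold


def monitor(readings: Iterable[UpsStatus],
            shut_down: Callable[[str], None],
            servers: List[str]) -> bool:
    """Walk through UPS readings; shut down every server on the first low reading.

    Returns True if a shutdown was triggered, False otherwise.
    """
    for status in readings:
        if should_shut_down(status):
            for name in servers:
                shut_down(name)
            return True
    return False
```

The useful part is that `monitor` takes its readings and its shutdown action as arguments, so you can rehearse the failure without touching real hardware: feed it a sequence where the generator never picks up the load (`on_battery` stays `True` while `runtime_minutes` drops from 12 to 8 to 4.5) and confirm that every server in the list gets the shutdown call.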
So years go by, and the computer room is never even powered off and on. Then a situation like the one above occurs. Things have been forgotten, changed and eroded over time. What is the secret? Test completely! Don’t just test what should work; test what happens when something that should work doesn’t.
Visit Adexis at www.adexisstorage.com