Tuesday, August 10, 2010

Server Room Series – Chapter 5: Quarterly Outages

I had a friend who was a millwright, and every summer his mill had a two-week “Shutdown” so they could rebuild all the machinery that had to run 24/7 the rest of the year. My friend was mad that he had to work double-time for two weeks while everyone else had the time off, so he eventually switched from millwright to truck driver, even though the pay was less and the job was not as interesting. No one wants to do the dirty work.

In the high-tech world the shutdowns are more frequent, and our company has settled on once every three months, around the second week of February, May, August and November. We have two basic types of Quarterly Outage: minor and major. A minor outage has to keep basic services up so people can keep working on projects, while a major outage requires every device to be turned off for a complete fresh start.

The order of the shutdown is critical: the last machine down has to be the first one up, or every other machine will sit there on boot-up waiting for the other guy to start the handshake. The network has to come up first, the DNS name resolvers second, and then all the secondary service providers, so that none are left hanging. Each type of server has further dependencies that must be met to get the desired clean start, and it is not uncommon to have to go back and start over.
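
For the curious, the "last one down, first one up" rule can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually use: the names "file-server" and "mail-server" are made-up stand-ins for secondary services, and the sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists what must already be up
# before that service can start cleanly. Only "network" and "dns" come
# from the post; the two secondary services are invented for illustration.
DEPENDS_ON = {
    "network": [],
    "dns": ["network"],
    "file-server": ["dns"],
    "mail-server": ["dns"],
}


def boot_order(depends_on):
    """Return a start-up order in which every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shutdown is the boot order reversed: last one down is first one up."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("Boot:    ", " -> ".join(boot_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

The network always lands first in the boot list and last in the shutdown list, with DNS right next to it. The relative order of the two secondary services is not fixed, since neither depends on the other.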

There is one brief moment in the middle of a major outage when everything is turned off and the server room is almost silent for once, except for the background hum of the AC and UPS units. The place is almost dark, with all the racks silent and missing their blinking lights; it is as spooky, and feels as bad, as having the power go out completely. It would be a peaceful moment if not for the anticipation of something going wrong on reboot, and we are always anxious to get it over with.

Even though we have test machines that we practice on before each outage, and we try to anticipate every contingency, there is always some fallout after the outage is over. Every outage is good for a few panic attacks, and there is almost always that one server or service that does not come back up, so some patch has to be removed or the whole machine rolled back to a previous version. Finally we declare the outage over and take a comp day off to recover from the 24-hour marathon, but right after that we start planning for the next Quarterly Outage.
