Storms of June 29th 2012 in the Mid-Atlantic region of the USA
On June 29th, 2012, a severe windstorm referred to as a derecho tore through the Midwest and Mid-Atlantic regions of the US. Over 1,750,000 homes and businesses were left without electricity, data centers supporting Amazon's AWS, Netflix, and other large organizations were taken offline, and several deaths were reported.
The story that follows offers some lessons relearned and possibly a few new ones.
I work for a company with a NOC and primary data center in the path of the storm, and a number of events took place. With daytime temperatures near 108°F and the windstorm coming through, the battery on the backup generator powering the data center cracked and could not start the generator. Notifications went out, but due to hazardous road conditions no one was able to get on site to perform a clean shutdown of services. Remote access went offline because the UPS batteries provided insufficient runtime; that is a known limitation, since we rely on the backup generator to keep operating. The generator is tested weekly, the test the day prior had passed, and battery maintenance had been performed the same day as that test. But a generator that does not start when needed is no generator at all.
Power was restored only a few hours later, which compounded the problem: it came back before the first admin could safely get on site. When he did, he found all systems powered on but none of them reachable. The environment is highly virtualized, with a well-designed and well-thought-out set of VM hosts and systems. The VM hosts are connected to redundant switches with redundant connections. However, when power was restored the servers came online before the switches did, so the VM hosts deactivated their NICs and blocked local communications. At first glance it looked like a NIC or vSwitch failure; ultimately, a simple shut/no shut on the switch ports resolved it. In addition, services such as DHCP, AD, and RADIUS are all VMs, none of which were available. IP subnets were not documented in an emergency manual, nor were some key passwords needed to access the switches when RADIUS is down; all of that was documented somewhere, just not in an easy-to-locate emergency manual. A few phone calls resolved each of these situations, with each one taking additional time and delaying recovery.
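As a rough illustration of that kind of fix (not necessarily how it was done on the day), here is a minimal sketch of bouncing a range of switch ports with Python and the Netmiko library; the device type, address, credentials, and interface range below are placeholders.

# Rough sketch: shut/no shut a range of switch ports after a power event.
# Assumes the Netmiko library; host, credentials, and port range are placeholders.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_ios",   # adjust for the switch platform in use
    "host": "192.0.2.10",         # placeholder management address
    "username": "localadmin",     # local fallback account, for when RADIUS is unreachable
    "password": "changeme",
}

commands = [
    "interface range GigabitEthernet1/0/1 - 24",  # placeholder port range
    "shutdown",
    "no shutdown",
]

conn = ConnectHandler(**switch)
print(conn.send_config_set(commands))   # show what the switch echoed back
conn.disconnect()

Even a one-off like this is only useful if the switch addresses and a local fallback account are documented somewhere reachable while the network is down.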
The failure boiled down to the VM hosts not bringing up the NICs properly because they were up before the switches. This was compounded by a sysadmin assuming it was a VM problem and beginning to reconfigure vSwitches (at the direction of a VMware tech support technician). Once all parties were on site, resolution was fast and a complete recovery was achieved.
So on to the old lessons learned: geographic redundancy is desirable; document everything in simple, accessible procedures; some physical servers may be desirable for services such as DHCP and AD; key services such as RADIUS must be available from multiple locations; and securely documenting addresses, passwords, and system startup procedures in a manner that is reachable offline is essential.
Some lessons that are new to me are a little more esoteric. Complacency is a huge risk to an organization. Our company is undergoing a reorganization that is creating a lot of complacent and lackadaisical attitudes, and that is hard to fight. We are losing good people fast and hiring replacements very slowly; there is no technical solution to this problem, and it puts a lot of pressure on individuals. I had not experienced a battery exploding before, though I am finding that, at least on this day, it was a common event: I have learned of three or four similar incidents that same evening. Inter-team communication is a constant struggle. We all work well together but do not have a well-orchestrated effort to create and document our procedures across team boundaries. Lastly, a clearly identified roster of who to call for which problem, and when, is a must, and it must not exist only in electronic form. Much of our roster was not available, calls were made to people who may not have been the correct on-call person for a group, and then personal relationships took over as the way to get things done. It worked, but it is not an ideal scenario.
While I am not proud to be in the company of such giants as Amazon and Netflix, I am glad that we restored service 100% in only a few hours, lost no data, and saw no harm to the business from the event. I am sure I will identify more specific and achievable lessons from this event.
Please share your stories about this event, or lessons you learned in a recent event.
--
Dan
Dan@MADJiC.net
Comments
VB
Jul 2nd 2012
Especially if you're on a 24V start system and have a bad connection somewhere. I've seen battery terminals instantly erupt into a cloud of lead vapor.
1 cubic inch of lead instantly going from metallic to vapor phase. Not a good environment to be in.
Sean
Jul 2nd 2012
Sean
Jul 2nd 2012
hacks4pancakes
Jul 2nd 2012
Brent
Jul 2nd 2012
When we finally did have a power failure, the generator started, but the power didn't cut over from battery to the generator. Why? Apparently the mechanical switch responsible for that was encrusted with bird poop, none of the admins on site had physical access to the generator, and the facilities guy who did was in stop 'n go traffic trying to get there during rush hour.
Luckily, we were able to get most critical systems cleanly shut down since the AC wasn't on the UPS and had stopped running...
Brent
Jul 2nd 2012
Al of Your Data Center
Jul 2nd 2012
Amadeus/Altea, the global airline check-in system also used by Qantas and Virgin Australia, went down, generating long queues and flight delays. My bookings disappeared from the Qantas portal for at least half a day.
Gerb
Jul 2nd 2012
Electrical circuits were not documented with startup and running load and were subsequently overloaded when everything came up, resulting in bouncing circuits.
The facility was found to have multiple 'unknown' grounds and the electrical circuits were not mapped properly.
Pass-card door locks that were designed to fail open failed closed - every single one. Only the facility manager had keys, in his pocket, at home…
Too many techs showed up without a 'go bag'. They did not have cellphone chargers, tools, multiple long console cables, extension cords, written documentation, phone books, etc. These guys said 'we have this stuff in the datacenter', but there was only enough for a few people. This made many techs look like amateurs.
Almost every bit of documentation you might need that lives on a server somewhere should also be on your laptop in an encrypted text file. Super important stuff should actually be printed and kept in a safe, especially detailed network maps - semi-accurate is better than nothing.
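For the encrypted text file, one rough sketch in Python using the cryptography package (file names and passphrase handling here are just placeholders; a tool like GPG would do the same job):

# Rough sketch: keep an offline copy of emergency docs as an encrypted file.
# Assumes the "cryptography" package; file names are placeholders.
import base64
import getpass
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def key_from_passphrase(passphrase: bytes, salt: bytes) -> bytes:
    # Derive a Fernet key from a passphrase so no separate key file is needed.
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=600_000)
    return base64.urlsafe_b64encode(kdf.derive(passphrase))


def encrypt_file(plain_path: str, enc_path: str) -> None:
    passphrase = getpass.getpass("Passphrase: ").encode()
    salt = os.urandom(16)
    token = Fernet(key_from_passphrase(passphrase, salt)).encrypt(open(plain_path, "rb").read())
    with open(enc_path, "wb") as out:
        out.write(salt + token)  # store the salt alongside the ciphertext


def decrypt_file(enc_path: str) -> bytes:
    passphrase = getpass.getpass("Passphrase: ").encode()
    blob = open(enc_path, "rb").read()
    salt, token = blob[:16], blob[16:]
    return Fernet(key_from_passphrase(passphrase, salt)).decrypt(token)


if __name__ == "__main__":
    encrypt_file("emergency-docs.txt", "emergency-docs.txt.enc")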
ttpm
Jul 3rd 2012