Last Widget Phenomenon
Posted by BPuhl on December 14, 2007
Since we dogfood Active Directory quite a bit, it probably makes sense that we’re pretty careful when it comes to introducing new changes in the environment. But even though we do our “due diligence” before and during deployments, we are after all dogfooding – so more often than not we’re tied to product release schedules and the (very) aggressive deployment requirements. So sometimes in the name of agility, we’ll keep moving forward when some more conservative shops might slow down. Then again, that’s actually half the fun of dogfooding!
In the past, we learned that it was prudent to take extra precautions when we were making an initial change. For example, the first Windows 2000 domain controller in the NT4 domain – PDC piling on, anybody? – or some of the application compatibility issues that came with the early beta Windows Server 2003 deployments.
Recently, we started to notice a trend with our Server 2008 deployments. We’ve dubbed this the “last widget phenomenon”, because we’ve found that the “last” of something in the environment needs extra attention as well. For example, during an early beta deployment of Windows Server 2008, we had upgraded 4 out of the 5 DCs in the empty root domain of our CORP forest (root + 8 child domains). The time was right, and we wanted to run the empty root domain entirely on Longhorn Server. No problem, right? If the first 4 upgrades went great, then the 5th should be a piece of cake. Unfortunately, we didn’t consider the last widget, and when the last Server 2003 DC was demoted we exposed a (previously known) performance bug in Kerberos, which was fixed in a later build. Since this was the “empty root”, all transitive authentication between domains in the forest failed while the 2008 DCs skyrocketed to 100% CPU utilization. So NOW what do you do? You want to re-promote the 2003 DC, but all of its potential replication partners are burning bits as fast as they can. Not to mention, even on a good day it’s going to take a couple of hours to do the DCPromo (even with IFM).
So what should we have done? Well – it’s much easier to power up a DC that was just turned off than to re-promote it in a hurry when the barn is burning.
Of course, if we do our jobs right, then you’ll never have to experience these bugs (we’ll hit them for you first, and the PGs will fix them). But as a general administrative mindset, it’s useful to remember that the last widget can be as important as the first. It’s pretty easy to get a little complacent when you over-prepared for #1 and nothing happened, and then nothing continued to happen for numbers 2-99. But when it’s time to do the last one, #100 – be on the lookout.
Is it just me, or has anyone else noticed that the more prepared you are to respond to a situation, the less likely you are to ever need to?