BPuhl’s Blog

A little bit of everything without actually being much of anything

Passive Safety & Situational Awareness

Posted by BPuhl on January 5, 2010

I find that I bring a lot of concepts from my hobby as a private pilot to my job in MSIT.  In this case, I am “borrowing” the title for this post from an article by Bruce Landsberg in one of the magazines I subscribe to.  He starts:

… Compared to machines, the homo sapiens’ conceit of being masters of the universe shows us to be consistently unreliable when it comes to repetitive tasks.  We do excel, however, in thinking up ways to get out of mindless chores to refocus our short attention spans on really important stuff…

This hit me, because earlier today I was talking to some of our engineers about our “team server” – the box that we use to run all of our recurring scripts from, collect data to, store utilities/tools/scripts on, and generally dump stuff.  Appropriately named Dumpster (does that make us dumpster divers?  probably).  We run a lot of scripts to collect a lot of data for our own use.  We’ve got the full-blown monitoring infrastructure in place and we own all the settings for alerting, but SCOM is owned by one team, the alerts go to our 24×7 operations center (which resolves the bulk of them), and so on.  So if the administrators are abstracted from most of the chaff, how do they maintain situational awareness?

Situational awareness is another one of those terms I picked up flying: basically, the understanding of what’s going on around you.  It’s easily demonstrated with the following question to your administrator:  “How’s AD doing today?”  By default, the answer will be, “AD’s running great” (that is their job after all…).  The follow-up question, though, “How do you know?”, is usually the zinger.  If the answer is “because nobody from Help Desk is screaming at us,” then that’s probably not a good sign.  If the answer is “because there are no trouble tickets,” that’s probably also not a good sign…

Lack of bad doesn’t necessarily equal good.

When I have a chance to talk to the MS Directory Masters classes, I usually try to work in the following story:

In 2002, I was one of a small group of AD administrators for MSIT.  We were knee-deep in dogfooding Whistler, which shipped as Windows Server 2003, when one day my GM walked by, stuck his head in the door (never a good sign), and asked, “How’s AD doing today?”  The default response at the time was something like, “Looks good, couple of DCs being upgraded, so far so good… why do you ask?”  It was at this point that he said, “Because I just got a call saying that our Extranet is offline, nobody can authenticate to any applications, our partners aren’t able to do business with us, and I was wondering what you were doing about it?”  If I remember correctly, it was about that time that he looked a little worried about his hiring decision, turned and walked away…

We quickly tried to log onto the domain controllers: all 6 of the DCs were running at 100% CPU utilization.  Perfmon, SPA traces, expensive/inefficient query logging – nada/zero/zip – we were in trouble.

Within a couple of hours, we were all in a big room – techies around the table, managers looking over our shoulders, and the AD rock stars from the product group (the developers) sitting in the room, taking apart the DCs in the debuggers.  They were all shaking their heads, when someone mumbled under their breath, “this almost looks like normal load… just a lot of it.”

That’s when we decided to pull some perf data for the past 6 months, which looked something like this:

[Image: perf (the author’s MSPAINT sketch of the perf data)]

Sure enough, load had been growing for the past year or so, and all the DCs were running at 100% CPU.  We stole 4 servers which were racked & built for some other application and DCPromo’d them, and CPU utilization dropped down to a reasonable level…

oops…

As you can see from my MSPAINT representation – WE ACTUALLY HAD THE PERF DATA!  The problem was that we had lost situational awareness of what was going on in our other environments, because we were so focused on dogfooding.

The moral of the story, then, is that it’s good to HAVE data, but it’s much better to LOOK AT the data occasionally…
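For what it’s worth, “looking at the data” doesn’t have to mean staring at graphs.  Here is a minimal sketch in Python of the kind of check that would have caught our problem early: fit a straight line to historical CPU averages and estimate how many more periods until it crosses a threshold.  This is not what we actually ran; the function names, the weekly sampling, the 80% threshold, and the numbers in the example are all made up for illustration, and a straight-line fit is a deliberately crude stand-in for real capacity planning.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate how many periods after the last sample the fitted line
    reaches `threshold`.  Returns None if the trend is flat or falling,
    and 0.0 if the fitted line is already at or above the threshold."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (len(samples) - 1))


if __name__ == "__main__":
    # Hypothetical weekly average CPU % for one DC (illustrative numbers only).
    weekly_cpu = [41.0, 44.5, 47.0, 52.5, 55.0, 61.0]
    remaining = periods_until(weekly_cpu, threshold=80.0)
    if remaining is None:
        print("No upward trend in the samples")
    else:
        print(f"About {remaining:.1f} weeks until 80% CPU at the current trend")
```

Run something like this from the team server on a schedule and have it nag somebody, and the question “How do you know?” at least has a starting answer.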

One Response to “Passive Safety & Situational Awareness”

  1. Great post! Reminds me of a story I read once about two pilots who were so distracted about a warning light (they were trying to determine if it was a real problem or not) that they forgot to, you know, fly the plane and crashed into a mountain.

    Hence flight rule 1: Someone always has to fly the plane.
