What do you do, when you don’t know what to do?
Posted by BPuhl on March 8, 2008
[Edited 3/8] Talk about a reminder! I started to write this post back in February. Like most things, it got tossed onto the back burner when everything else in life started flying sideways. I’ve developed a bad habit: if I can’t throw a blog post out in the 2 minutes that I’m thinking of something, I never seem to get back to it. Not good.
While out to dinner with several friends at DEC the other night, the conversation turned to, “What do you do when you don’t know what to do?” What was really funny is that I thought I had already blogged about it… well… without further ado, here’s a fairly legitimate blog post.
[Original from 2/11]
So what do you do when you don’t know what to do? This came up recently, when one of the vendors helping us with Longhorn deployments came running down the hallway because he thought he had accidentally seized the RID master in the REDMOND domain and then brought the original back online. Of course, it was 9am and there was hardly anybody around, but fortunately he was able to find someone who suggested that he just turn off the DC until the DC guys came in. Back down the hallway he went, and that’s what he did.
Make whatever comments you like about the impact of seizing the RID master, knowledge of what the role does, its impact, etc… That’s not the point of this post. The point of this post is that few people will ever have 100% understanding of every aspect of the system, everything that could possibly change in it, and the downstream effect that every possible change will have on all dependent systems. Without that impossible depth and breadth of knowledge, sooner or later SOMETHING is going to happen that you don’t know what to do about.
Back on the AD team, it didn’t take long to realize that few people had actually considered what they would do if, for example, they thought they had accidentally seized the RID master, or accidentally ran a script that performed some bulk change (or delete), or accidentally clicked “ok” when they should have hit cancel. The problem is that if you’ve never thought about it, then you don’t know what’s in your toolbox. And how can you fix something if you don’t have the tools?
The basic idea: Slow or stop the bleeding, until you can get help.
– If it’s a server issue and you aren’t sure what happened – Turn it off!
– If you think some change happened in AD that could replicate around the world – REPADMIN /OPTIONS * +DISABLE_OUTBOUND_REPL It’s the big red button that will stop outbound replication.
These are just a couple of things that you can do, but the point is, nobody is going to get mad at you for stopping forest-wide replication or for turning off a server if it saves them from needing to do a forest recovery. Damage control is about accepting that you’re going to have an impact, so let’s just limit it.
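To make sure the big red button is actually in the toolbox before it is needed, it can be written down ahead of time. Here is a minimal sketch in Python, assuming a Windows admin workstation with repadmin on the PATH. The function names (`replication_command`, `press_big_red_button`) are made up for illustration, and note that repadmin itself spells the option `DISABLE_OUTBOUND_REPL`. By default the sketch only prints the command rather than running it.

```python
import subprocess

# Option name as repadmin spells it; "+" sets the option, "-" clears it.
OUTBOUND_FLAG = "DISABLE_OUTBOUND_REPL"


def replication_command(disable: bool, dc_list: str = "*") -> list[str]:
    """Build the repadmin arguments to stop (or resume) outbound replication.

    dc_list is the DC selector; "*" targets every DC, as in the post.
    """
    sign = "+" if disable else "-"
    return ["repadmin", "/options", dc_list, f"{sign}{OUTBOUND_FLAG}"]


def press_big_red_button(dry_run: bool = True) -> list[str]:
    """Stop outbound replication on all DCs; with dry_run, only print the command."""
    cmd = replication_command(disable=True)
    if dry_run:
        print("Would run:", " ".join(cmd))
    else:
        # check=True raises CalledProcessError if repadmin reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    press_big_red_button()  # dry run: nothing is changed
    print("To undo later:", " ".join(replication_command(disable=False)))
```

Having the undo command printed right next to the stop command is deliberate: whoever shows up to help will want to know exactly what was done and how to reverse it.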
And one last note – When you do finally get hold of the cavalry, and they come rushing in to help – Tell them what happened! Yes, it might be embarrassing to admit that you made a mistake…or even to find out that you reacted quickly and forcefully to something that wasn’t a mistake at all…but I haven’t heard of people being fired over embarrassment. I have heard of jobs being lost when people tried to hide their actions.
[From 3/8] – Talking about this at dinner, Stuart told a story that he had read in a book about the CDC (I wish I could remember the name of the book!), and the people who work in labs with ultra-dangerous viruses. What do they do if they accidentally nick their spacesuit and their finger, exposing themselves to deadly viruses? They have 10-20 seconds to cut off their finger, otherwise they are as good as dead. Kind of makes me glad that I only work on computer systems.
My memory of this story is a little bit clouded by the wine. Any inaccuracies are mine, but it still makes for a good story!