We all have limits; we usually just don’t think about them as such. Or when we do think about them, it’s usually because somebody else is pushing them and it’s making us mad. But this is more about personal limits, especially with respect to the way that we run our servers in MSIT. (It’s important to note that MSIT is probably unlike any other environment out there, so your limits may be different, but the idea is probably the same.)
On Sunday, Rey Diaz included some of the following thoughts in a conversation about Wisdom versus “right”:
Do you try to come as close to breaking the law as you can, without breaking it?
Do you try to push your morals as far as allowed, without being immoral?
Do you try to move as close to disaster as you can, without actually feeling the consequences of disaster?
This evening, I was reading one of my flying magazines, and Rod Machado had an article where he talks about one of my favorite factoids in aviation:
On page 6-26 of the FAA’s Pilot’s Handbook of Aeronautical Knowledge, we find that, “Aircraft certification rules require accuracy in fuel gauges only when they read empty.”
In fact, if you look up the actual Federal Air Regulation, it says:
FAR 23.1337 (b.1) Each fuel quantity indicator must be calibrated to read “zero” during level flight when the quantity of fuel remaining in the tank is equal to the unusable fuel supply determined under part 23.959(a)
Is there anyone around who actually wants to be flying in an airplane when its fuel gauges are showing the only reading they are required to get right? I like flying gliders, but not like that!
Ok – but this is a geek blog – so what does this have to do with anything remotely interesting to you?
Well, in MSIT, the question often comes up about how our DC performance is doing. Sure, you can go graph hundreds of counters and things, but see my earlier post about situational awareness and then tell me how easy it is to keep yourself aware of all that fluff.
Instead, we’ve got some personal limits when it comes to our DC performance that have worked pretty well for us over the past few years. They are:
- 20-40% CPU utilization: our target for sustained load
- 40-50% CPU utilization: we start checking for unusual causes of load, but if this is just normal trend growth then we bring it back down with either hardware replacement or additional servers
- > 50% CPU utilization: evaluate the trends; this is indicative of a potential problem, and we may need to start the budgeting process for new servers
- > 60% CPU utilization: we consider ourselves “broken”, and we go into break/fix mode to either reduce load or increase capacity
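If you wanted to turn those tiers into something a script could check, a minimal sketch might look like the following. The function name and tier labels are made up for illustration, and two details are my assumptions rather than anything we’ve formally defined: a reading of exactly 40% is treated as the start of the “investigate” band, and anything under 20% gets its own “below-target” label.

```python
def classify_cpu(sustained_pct: float) -> str:
    """Map a sustained CPU utilization percentage to a personal-limit tier.

    Tier names are hypothetical labels for the limits listed above.
    """
    if sustained_pct > 60:
        return "broken"            # break/fix mode: reduce load or add capacity
    if sustained_pct > 50:
        return "evaluate-trends"   # potential problem, may need to budget for servers
    if sustained_pct >= 40:
        return "investigate"       # check for unusual causes of load (assumes 40 belongs here)
    if sustained_pct >= 20:
        return "target"            # where we want sustained utilization to sit
    return "below-target"          # not covered by the limits above (assumption)


if __name__ == "__main__":
    for pct in (15, 30, 45, 55, 65):
        print(f"{pct}% -> {classify_cpu(pct)}")
```

The point isn’t the code, it’s that the decision about what each band means gets made before the number shows up on a graph.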
Of course, nobody’s saying that AD is broken at 60% CPU; these are just our personal limits. After all, if I wanted to wait for AD to break before I did anything, I might as well spend my free time polishing up my resume. The idea, of course, is that you want to decide when it’s broken for you, and think about what you’re going to do ahead of time. This way, you’re not thrashing around, and management isn’t surprised when you have an out-of-band budget request.
A few other tidbits while I’m thinking about this.
- These numbers are purely based on our experience in our environment. We know that when we run over 75-80% CPU, we’re running very hot and some sensitive applications can be impacted by latency. We also know that our standard operating procedure is to have 3-5 DCs offline at any given time. We have to account for the fact that when servers are offline (dogfooding, debugging, etc.) we need the headroom for the load on the other boxes.
- We consider “sustained utilization” to be averages over 15 minutes, across all DCs, but we’re also applying the human element to the data. We’re looking for the trends, not the spikes…we know that spikes happen…
- In a perfect world, you’d at least know where the load came from. More often than not, there isn’t a single smoking gun, it’s just increased utilization as other systems in the environment are leveraging AD. At the moment, I can only think of one time when we could trace the increased utilization back to a single project, and that was our IPSEC deployment – of course, we couldn’t roll that back anyway, we still had to increase capacity – so it’s not like it really mattered, but it was nice to know what caused a 20%+ jump across the board.
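To put some rough arithmetic behind the headroom and “sustained utilization” points above, here is a small sketch. Everything named in it is hypothetical: the function names, the sample data layout, and especially the total DC count in the example, which is a made-up number for illustration. It also assumes that load from offline DCs spreads evenly across the remaining ones, which real client traffic won’t do exactly.

```python
from statistics import mean
from typing import Mapping, Sequence


def sustained_utilization(samples_by_dc: Mapping[str, Sequence[float]]) -> float:
    """Average CPU % across all DCs for one window (e.g. 15 minutes of samples).

    Each DC is averaged first so a DC with more samples doesn't dominate.
    """
    per_dc = [mean(samples) for samples in samples_by_dc.values() if samples]
    return mean(per_dc)


def projected_utilization(current_pct: float, total_dcs: int, offline_dcs: int) -> float:
    """Estimate CPU % on the remaining DCs if `offline_dcs` are taken offline.

    Assumes the same total load is redistributed evenly (a simplification).
    """
    if not 0 <= offline_dcs < total_dcs:
        raise ValueError("offline_dcs must be >= 0 and less than total_dcs")
    return current_pct * total_dcs / (total_dcs - offline_dcs)


if __name__ == "__main__":
    window = {"dc1": [30, 50], "dc2": [40, 40]}   # hypothetical samples
    print(sustained_utilization(window))
    # Hypothetical fleet of 20 DCs with 5 offline:
    print(projected_utilization(45.0, total_dcs=20, offline_dcs=5))
```

Under those assumptions, a fleet that looks comfortable with everything online can land much closer to the “broken” line once a handful of boxes are out for dogfooding or debugging, which is why the limits sit well below the point where things actually run hot.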
Personal limits, a good thing to bring to work with you.