BPuhl’s Blog

A little bit of everything without actually being much of anything

Strict Replication Consistency (fun with dogfood)

Posted by BPuhl on December 3, 2008

For the past few months, our AD operations team has been busy deploying Windows 7 domain controllers across our CORP forest (our main production environment). When we’re dogfooding a new operating system, we rarely do OS upgrades; instead we spend a lot of time demoting, building, and promoting servers. Pretty frequently (more than 50% of the time), our replication monitoring would just go ballistic while the new server was replicating in its GC partitions.

Background details:  CORP.MICROSOFT.COM is our production forest, with an empty-root, 8 child domains, 125K users, 300K machines, 22GB DIT etc…

In addition to the standard SCOM monitoring, we’ve got a script that runs repadmin /replsummary every couple of hours, parses the error messages out, and sends the result to the operations team in an e-mail. We were often seeing messages that looked like this:

Destination DC        largest delta           fails/total %%   error
TK5-FE-DC-01           07d.18h:11m:36s           2 / 82    2   (8361) A local object with this GUID (dead or alive) already exists.
TK5-ME-DC-01                   14m:10s           1 / 72    1   (8361) A local object with this GUID (dead or alive) already exists.
TK5-SP-DC-01                   11m:06s           5 / 123   4   (8361) A local object with this GUID (dead or alive) already exists.
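
For anyone who wants to build a similar report, here is a minimal sketch of the parsing step. It is not our actual monitoring script; the regular expression and the names ROW, parse_replsummary_errors and SAMPLE are assumptions based only on the rows shown above, and it will not handle deltas that contain spaces (such as ">60 days").

```python
import re

# One data row of the replsummary table that carries an error, e.g.
# "TK5-FE-DC-01   07d.18h:11m:36s   2 / 82   2   (8361) A local object ..."
ROW = re.compile(
    r"^\s*(?P<dc>\S+)\s+"                        # destination DC name
    r"(?P<delta>\S+)\s+"                         # largest delta, e.g. 14m:10s
    r"(?P<fails>\d+)\s*/\s*(?P<total>\d+)\s+"    # fails / total
    r"(?P<pct>\d+)\s+"                           # failure percentage
    r"\((?P<code>\d+)\)\s+(?P<message>.+?)\s*$"  # (error code) message
)


def parse_replsummary_errors(text):
    """Return one dict per table row that reports a replication error."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, blank line, or a row without an error
        row = m.groupdict()
        for key in ("fails", "total", "pct", "code"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


SAMPLE = """\
Destination DC        largest delta           fails/total %%   error
TK5-FE-DC-01           07d.18h:11m:36s           2 / 82    2   (8361) A local object with this GUID (dead or alive) already exists.
TK5-ME-DC-01                   14m:10s           1 / 72    1   (8361) A local object with this GUID (dead or alive) already exists.
"""

for row in parse_replsummary_errors(SAMPLE):
    print(row["dc"], row["code"], f'{row["fails"]}/{row["total"]}')
```

Rows without a parenthesized error code simply do not match, so healthy DCs drop out of the report.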

After some period of time (hours, days, more…), these errors would just disappear.  I fired off a mail to ask what they were doing to fix this problem.  The reply was that when this error was encountered, the operations team would disable Strict Replication Consistency, allow replication, and then re-enable it. 

AGH!  What in the world were they doing!  Well…at least, “AGH! What in the world were they doing”, was my response.  We’re usually the type who practices the “never reboot, always resolve the problem or find the bug” religion, so this was out of character.  Of course, my reaction was promptly squashed with the e-mail that I had sent 2 years ago, when they were hitting this problem and asking me for help, where I told them as an interim workaround, that they could just disable strict.  Ooops, so much for righteous indignation.

After a little bit of digging around through several years of mail in my PSTs, I found the e-mail thread where Nathan and I had run into this exact issue when we first enabled strict replication consistency.  A quick check of some objects verified the problem.

Let’s take a look at the metadata for the Domain Controllers OU object in our REDMOND domain, on both a REDMOND domain controller, and in the GC partition of a NORTHAMERICA domain controller:

On the REDMOND DC:

C:\>repadmin /showobjmeta tk5-red-dc-02 "ou=domain controllers,dc=redmond,dc=corp,dc=microsoft,dc=com"
repadmin /showobjmeta tk5-red-dc-02 ou=domain controllers,dc=redmond,dc=corp,dc=microsoft,dc=com

13 entries.
Loc.USN                          Originating DC   Org.USN  Org.Time/Date        Ver Attribute
=======                          =============== ========= =============        === =========
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectClass
  73066     886277be-f530-4749-af0e-a24da9abed8f     73066 2008-01-15 10:15:12    1 ou
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 description
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 instanceType
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 whenCreated
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 showInAdvancedViewOnly
606627423     1d9f7205-c504-47dc-a1a5-df67e4456f70 754676891 2008-09-18 11:59:33    8 nTSecurityDescriptor
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 name
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 systemFlags
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectCategory
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 isCriticalSystemObject
  73066     5ebf8fe0-cc0e-4831-8c7b-d5e6eaaf35cc 284539815 2007-08-28 10:56:37   35 gPLink
  73066     ecb902d0-6519-4086-bdff-a62f148ed78f 340813230 2002-04-17 19:30:28    3 gPOptions
0 entries.
Type    Attribute     Last Mod Time                             Originating DC  Loc.USN Org.USN Ver
======= ============  =============                           ================= ======= ======= ===
        Distinguished Name
        =============================

The REDMOND object on the NORTHAMERICA GC:

C:\>repadmin /showobjmeta tk5-na-dc-02 "ou=domain controllers,dc=redmond,dc=corp,dc=microsoft,dc=com"
repadmin /showobjmeta tk5-na-dc-02 ou=domain controllers,dc=redmond,dc=corp,dc=microsoft,dc=com

9 entries.
Loc.USN                          Originating DC   Org.USN  Org.Time/Date        Ver Attribute
=======                          =============== ========= =============        === =========
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectClass
1133084                 NA-WA-TUKDC\TK5-NA-DC-02   1133084 2008-11-05 08:32:57    1 ou
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 description
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 instanceType
1133084     cde06d9c-04bd-11d3-bff8-0008c72877a5     31928 1999-05-11 16:36:10    1 whenCreated
1133084     1d9f7205-c504-47dc-a1a5-df67e4456f70 754676891 2008-09-18 11:59:33    8 nTSecurityDescriptor
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 name
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectCategory
1133084     5ebf8fe0-cc0e-4831-8c7b-d5e6eaaf35cc 284539815 2007-08-28 10:56:37   35 gPLink
0 entries.
Type    Attribute     Last Mod Time                             Originating DC  Loc.USN Org.USN Ver
======= ============  =============                           ================= ======= ======= ===
        Distinguished Name
        =============================

So what was causing replication to fail with strict replication consistency enabled, but succeed when we turned it off?  We all know that strict replication consistency was put in place to help prevent lingering objects from replicating, but the Domain Controllers OU is certainly not going to be a lingering object (remember, lingering objects are ones which exist in the non-writeable partition on a GC, but not in the writeable partition of a DC).

Actually, strict replication was doing its job, and in fact the error message “A local object with this GUID (dead or alive) already exists” was telling us exactly why it was stopping replication.  One of the things that strict replication protects against is the rare possibility that 2 objects could be created on separate domain controllers with the same objectGUID.  The comment in the code for strict replication says:

    // Duplicate guid detection.
    // See if we are trying to apply creation time attributes to an existing object.
    // The WhenCreated timestamp acts as an unchanging internal id of the object.  Even
    // if two objects get created with the same external id, the guid, we can distinguish
    // them based on their WhenCreated timestamp. This check allows for the attribute to
    // be rewritten with the same value, but never with a different one.
    // Skip this check if deleted so replication can be easily repaired.

So let’s go back and look at the 2 whenCreated attributes.  For ease, you can just look at the metadata again:

On the REDMOND DCs:
73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 whenCreated

On the NORTHAMERICA GC:
1133084     cde06d9c-04bd-11d3-bff8-0008c72877a5     31928 1999-05-11 16:36:10    1 whenCreated

So the whenCreated timestamps are different on the 2 objects, which is why strict replication is blocking replication. 
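
If you want to spot this kind of mismatch without eyeballing two screens of metadata, a rough sketch along these lines would do it. The parser is an assumption built only on the column layout shown in the output above (it expects an originating DC name with no spaces), and the names AttrMeta, parse_showobjmeta and originating_mismatches are made up for illustration.

```python
import re
from typing import NamedTuple


class AttrMeta(NamedTuple):
    local_usn: int
    orig_dc: str      # originating DC (GUID or SITE\NAME)
    orig_usn: int
    orig_time: str    # "YYYY-MM-DD HH:MM:SS" as printed by repadmin
    version: int


# Loc.USN, Originating DC, Org.USN, Org.Time/Date, Ver, Attribute
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\d+)\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+(\S+)\s*$"
)


def parse_showobjmeta(text):
    """Map attribute name -> AttrMeta for each attribute row in the output."""
    meta = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            usn, dc, ousn, when, ver, attr = m.groups()
            meta[attr] = AttrMeta(int(usn), dc, int(ousn), when, int(ver))
    return meta


def originating_mismatches(a, b):
    """Attributes present on both DCs whose originating stamp differs.

    Loc.USN is skipped because it is always specific to each DC.
    """
    return sorted(
        attr for attr in a.keys() & b.keys()
        if a[attr][1:] != b[attr][1:]
    )


REDMOND = """
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectClass
  73066     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 whenCreated
"""
NORTHAMERICA = """
1133084     e74757d7-eeef-11d2-8f16-0008c74b8557      2495 1999-04-09 19:49:13    1 objectClass
1133084     cde06d9c-04bd-11d3-bff8-0008c72877a5     31928 1999-05-11 16:36:10    1 whenCreated
"""

print(originating_mismatches(parse_showobjmeta(REDMOND),
                             parse_showobjmeta(NORTHAMERICA)))
# ['whenCreated']
```

Note that this compares replication metadata, not the attribute values themselves; a differing stamp on whenCreated is the hint that the values need a closer look.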

For those who are wondering “How in the world did that happen?”: it goes back to the Windows 2000 beta, on which we did the original forest upgrade.  The internal build of Windows 2000 that we deployed didn’t include the whenCreated attribute in the schema.  The attribute was actually added in a later build, and the inconsistency occurred as the new build was deployed and values for the attribute were “fixed” (sort of).

The next 2 questions were fairly easy:
     1.  How many more objects do we have in this condition?
     2.  How do you fix them?

Well, one of the engineers used his newfound PowerShell skillz and found that darn near EVERY object that was created when we upgraded our NT4 domain to Beta Windows 2000 was in this condition.  Thousands of objects in every domain.  Sux0r.

The fix is straightforward, if a bit non-intuitive.  What we really want is for the whenCreated timestamps to be the same, and for that to happen we need a single value to be replicated out.  Since replication decides which value wins based on the attribute’s replication version number, we just needed to increment the version of whenCreated from 1 to 2 and disable strict (again).  The “authoritative value” would then replicate on top of all GCs in the forest, and we’d be set.  Anyone ever seen whenCreated replication version = 2?
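
A minimal sketch of why the version bump matters, assuming the usual Active Directory tie-break order for attribute conflicts (version first, then originating time, then originating DSA). The Stamp type and incoming_wins function are illustrative names, not anything from the product, and comparing the DSA GUIDs as strings is a simplification.

```python
from typing import NamedTuple


class Stamp(NamedTuple):
    # Field order matters: tuples compare left to right.
    version: int     # compared first
    orig_time: str   # then originating time ("YYYY-MM-DD HH:MM:SS" sorts as text)
    orig_dsa: str    # final tie-breaker (string compare is a simplification)


def incoming_wins(local, incoming):
    """True if the incoming attribute value should overwrite the local one."""
    return incoming > local


# whenCreated stamps taken from the two metadata listings in this post.
gc_copy = Stamp(1, "1999-05-11 16:36:10", "cde06d9c-04bd-11d3-bff8-0008c72877a5")
source = Stamp(1, "1999-04-09 19:49:13", "e74757d7-eeef-11d2-8f16-0008c74b8557")

# Same version, older originating time: the GC keeps its own (different) value.
print(incoming_wins(gc_copy, source))   # False

# Hypothetical stamp after the value is rewritten once on the source DC.
# The time is left unchanged only to show that version is compared first.
bumped = source._replace(version=2)
print(incoming_wins(gc_copy, bumped))   # True
```

In other words, with both copies at version 1 the GC's later timestamp wins the tie, so simply turning strict off never converges the values; a version 2 write does.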

Incrementing the version number is easy: just change the value of the attribute.  Instead of 19:49:13, change it to 19:49:14 and you’re done.  Of course, you can’t just go around changing whenCreated; it’s protected from changes by the system (as you would hope and expect).  The secret squirrel sauce to being able to update the value is that you have to set the SchemaUpgradeInProgress operational attribute using a tool like LDP or similar.  Not something that I’d recommend you do casually, but a straightforward operation for the average Domain Admin with Schema Admin rights (and don’t forget to turn it off when you are done).  :)
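
For the value itself, here is a small sketch of the "add one second" step, assuming the attribute is handled in the generalized-time form that LDAP tools show for whenCreated (YYYYMMDDHHMMSS.0Z). The function name and the sample input are hypothetical; this only computes the new string and does not touch the directory or the SchemaUpgradeInProgress switch.

```python
from datetime import datetime, timedelta

# Generalized-time layout assumed for whenCreated, e.g. 19990409194913.0Z
GENERALIZED_TIME = "%Y%m%d%H%M%S.0Z"


def bump_one_second(value):
    """Return the same timestamp moved forward by exactly one second."""
    t = datetime.strptime(value, GENERALIZED_TIME)
    return (t + timedelta(seconds=1)).strftime(GENERALIZED_TIME)


print(bump_one_second("19990409194913.0Z"))  # 19990409194914.0Z
```

Going through datetime rather than editing the last digit by hand keeps rollovers (59 seconds, midnight, year end) correct.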

Since this novella is already long enough, rather than explaining how to set the operational attribute, there is an excellent article HERE on how to do it.

I don’t suspect that there are too many people with production infrastructures that were created on beta versions of Windows 2000, so you’re unlikely to find anyone else who has this issue.  But after telling this story to a Directory Masters Certification class I was presenting to, I realized that it’s a good learning experience and an interesting story about replication in general so I wanted to share.  Thanks for taking the time to read through this one.

~B

5 Responses to “Strict Replication Consistency (fun with dogfood)”

  1. Mike Kline said

    Another great and informative post Brian. I don’t think I would have gotten that fix (SchemaUpgradeInProgress)

    If this is the kind of thing that is discussed in the AD Masters course then I hope to someday come to that training. I don’t even care about the certification but you don’t get this stuff at most training classes.

    When you talk about Windows 7 DCs is that the same as 2008 R2?

  2. Glad Bee said

    Great,thanks! keep the good job you doing here!

  3. Matheesha said

    Excellent post. Would love to see more gems like that are shared in the masters course :)

  4. There is a reason Brian is one of our top instructors.

    Thanks for the shout out Brian. :)

  5. This makes me chuckle.

    You had unknown problems lingering from your NT4 to beta Windows 2000 upgrade which you have now resolved. But you may again be creating a new set of issues pushing the envelope so early in your production environment.

    You and your crew are definitely bleeders. :-)
