Troubleshooting CMS Replication

Like most of my other articles, this one comes from a real-world scenario I had to solve.  While working with a client, we ran into two front end servers and an edge server that wouldn’t replicate.  The CMS was hosted in another country with plenty of firewalls in between, which definitely complicated the issue.  However, the root cause wasn’t the network or a firewall.

We started with the front end servers: two of the four front end servers in a pool were showing as out of date in the topology.  From the server hosting the CMS, I verified that I could ping each front end server by name and, to my surprise, I could also telnet to each of them on port 445.

Depending on your level of familiarity with CMS, at this point you may be wondering: why port 445?  The server hosting the CMS pushes the data to all replicas in the topology (other than edge) using SMB[4], and SMB runs over TCP port 445.  For more information on CMS, please have a look at Jens’ blog; it goes into great detail.

Since we knew the path used to connect was valid and the server was listening for the connection, the next thing on my list was a packet capture to see what was happening.  I started a packet capture using NetMon and ran the Invoke-CsManagementStoreReplication cmdlet to kick off replication.

After capturing data for 30 seconds I stopped the capture and applied a Display Filter so I could look at the SMB traffic first.

It didn’t take long to figure out what was going on from here; there was “Access Denied” all over the capture.  I tried to view the properties of the RTCReplicaRoot folder (by default this folder is at the root of the drive Lync is installed on), but didn’t have permission.  Although this looks like an error, it is actually expected behavior, and it is best not to try to modify permissions on this directory.  At this point, we discussed the build process with the client and determined that a security script meant to tighten NTFS permissions had inadvertently broken CMS replication for those two servers.  Instead of trying to fix the NTFS issue and risking other problems down the road, we removed the boxes from the topology, re-installed the OS, and added them back as Lync servers after a clean rebuild without the security script.

Now that all the front ends were replicating, it was time to find out what was going on with that edge server.  The customer had rebuilt this server as well, assuming the script had caused the same issue on it, but we soon found out that wasn’t the case.  To start troubleshooting, I ran a network capture, and this time I filtered for traffic on port 4443, since the edge server receives its updates on this port over HTTPS, not 445/SMB like the front ends.  After verifying traffic was coming in on the appropriate port, I couldn’t do much more with network captures since the traffic was encrypted.  My next step was to begin logging on the server hosting the CMS, grabbing logs for all three of the CMS-related services.  One thing to note here: when you are trying to troubleshoot CMS issues, you won’t see CMS listed in the Lync Server Logging Tool.  However, you will see XDS; that’s the one you want to log.

I started logging all three XDS options with all flags and all levels selected.

Next I ran the Invoke-CsManagementStoreReplication cmdlet with the -ReplicaFqdn parameter to limit the replication to just that one machine.  I let the log run for about 30 seconds and then stopped it.  I started by analyzing the FTA (File Transfer Agent) logs.  As a quick hint, these logs are in trace format, which is a bit harder to read than the messages format, but you do have yellow and red highlights to indicate warnings and errors.  Also, the search comes in quite handy.  I ran a search for the edge server’s name and, while viewing the results, found a yellow bar (warning).  I clicked on the yellow bar and saw my problem almost immediately:

In the log we see the errors “Failed to copy files from Replica” and “Invalid certificate presented by remote source”.  In this case, the “remote source” is actually our server hosting the CMS (seems a bit backwards).  This means our edge server doesn’t trust the certificate the CMS server is using.  In a simpler install, that may not be a huge issue, but in this case there were a number of intermediate CAs, and tracking down all the certificates one by one and reviewing everything wasn’t going to be much fun.

That’s when my good friend Paolo from Microsoft, whom I just happened to be IM’ing about the issue, let me in on a cool little trick.  We exported the certificate from the server hosting the CMS (without the private key) and copied the file to the edge server (C:\tmp\CMSCert.cer).  From a command prompt I ran the following command:

certutil -verify -urlfetch "C:\tmp\CMSCert.cer" > C:\CRL.TXT

This command runs a check on the certificate chain (including fetching the CRLs) and dumps the results to a text file; it may take a few minutes to complete.  After reading through the text file we found the information we needed:

This told us exactly what certificate was missing and we were able to get it installed and the edge server started replicating.
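The certutil report can run to a few hundred lines, so it may help to pull out only the lines that point at a chain problem.  The following is a minimal Python sketch, not part of the original troubleshooting session.  The marker strings are an assumption based on the certutil output a reader pasted in the comments on this post and are probably not exhaustive; the function name find_chain_problems is made up for illustration.

```python
# Marker substrings that tend to accompany chain problems in
# `certutil -verify -urlfetch` output. ASSUMPTION: this list is drawn
# from one sample report and is not exhaustive.
MARKERS = (
    "Missing Issuer:",
    "Cannot find certificate:",
    "Incomplete certificate chain",
    "dwErrorStatus",
    "Error retrieving URL:",
)


def find_chain_problems(report_text):
    """Return the distinct report lines that contain a marker, in order."""
    seen = set()
    problems = []
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        if line in seen:
            continue  # certutil repeats some lines per chain element
        if any(marker in line for marker in MARKERS):
            seen.add(line)
            problems.append(line)
    return problems


if __name__ == "__main__":
    # Sample text modeled on a reader's report, not on the case above.
    sample = """\
ChainContext.dwErrorStatus = CERT_TRUST_IS_PARTIAL_CHAIN (0x10000)
SimpleChain.dwErrorStatus = CERT_TRUST_IS_PARTIAL_CHAIN (0x10000)
Missing Issuer: CN=domain-PDC01-CA, DC=domain, DC=contoso, DC=com
Cert is an End Entity certificate
"""
    for problem in find_chain_problems(sample):
        print(problem)
```

To use it against the real report, read C:\CRL.TXT into a string first (for example with open(path, errors="replace").read(), since the encoding of the redirected file depends on the shell that produced it) and pass that string to the function.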

To wrap this all up, I’d recommend running through the following when CMS replication isn’t working:

  1. Ping from the server hosting the CMS to the host that isn’t replicating
  2. Telnet from the server hosting the CMS to the host that isn’t replicating (port 445 for all servers except edge, which uses 4443)
  3. Network capture to see if the traffic is making it and to look for possible SMB errors
  4. XDS logging on the server hosting the CMS, reviewed with Snooper
  5. CertUtil: the command above tests the CA chain all the way through to verify trust
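If telnet isn’t installed on the server hosting the CMS, the port test in step 2 can be approximated with a few lines of Python.  This is only a sketch under some assumptions: the function names and the host names in the usage section are hypothetical, the port numbers are the ones given in the list above, and a successful TCP connect says nothing about ICMP ping or about whether replication itself will succeed.

```python
import socket

SMB_PORT = 445                 # front ends and other non-edge replicas
EDGE_REPLICATION_PORT = 4443   # edge servers (HTTPS)


def replication_port(is_edge):
    """Pick the port the CMS server uses to reach this kind of replica."""
    return EDGE_REPLICATION_PORT if is_edge else SMB_PORT


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution failures, refusals, and timeouts.
        return False


def check_replica(host, is_edge, timeout=3.0):
    """Run the port test for one replica and return a small result dict."""
    port = replication_port(is_edge)
    return {
        "host": host,
        "port": port,
        "reachable": can_connect(host, port, timeout),
    }


if __name__ == "__main__":
    # Hypothetical host names; replace with the replicas in your topology.
    for name, is_edge in (("fe01.contoso.local", False),
                          ("edge01.contoso.local", True)):
        print(check_replica(name, is_edge))
```

Run it from the server hosting the CMS so the test follows the same network path that replication does.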

Hope this helps!

About Kevin Peters

My name is Kevin Peters.

16 Responses to Troubleshooting CMS Replication

  1. Pingback: Troubleshooting CMS Replication | The OCS Guy’s Blog « JC’s Blog-O-Gibberish

  2. rizwan says:

    Download path not set

  3. Jonatan says:

    Hi Kevin
    Good post!
    I have a similar problem where replication with the edge servers has suddenly stopped working.
    Tracing on the front end gives errors like this:
    Failed to copy files from temp directory. Exception: [System.ServiceModel.Security.MessageSecurityException: The HTTP request was forbidden with client authentication scheme 'Anonymous'. ---> System.Net.WebException: The remote server returned an error: (403) Forbidden.

    As far as I can see, everything related to the services and certificates is in order.
    Any suggestions?

  4. Jonatan says:

    Never mind, I sorted it out.
    Here is the registry fix that solves it.
    Edit the registry on the Edge server to add a DWORD value, SendTrustedIssuerList, to the
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL
    key and assign it a value of 0. This will prevent schannel.dll from truncating the Root CA list from the edge server, and allow validation tests to pass.

  5. Sam says:

    I am seeing a similar issue. We have a very small deployment with one collocated FE/Mediation server and we are looking to add an edge server. I ran Export-CsConfiguration on my FE and copied the zip over to my edge where I did step 1 in the deployment wizard and pointed it to the zip. It said it completed successfully but the green check mark does not show up, nor can I do the services step.

    I ran the cert command and I see a few errors:

    ChainContext.dwErrorStatus = CERT_TRUST_REVOCATION_STATUS_UNKNOWN (0x40)
    ChainContext.dwErrorStatus = CERT_TRUST_IS_OFFLINE_REVOCATION (0x1000000)
    ChainContext.dwErrorStatus = CERT_TRUST_IS_PARTIAL_CHAIN (0x10000)
    SimpleChain.dwErrorStatus = CERT_TRUST_REVOCATION_STATUS_UNKNOWN (0x40)
    SimpleChain.dwErrorStatus = CERT_TRUST_IS_OFFLINE_REVOCATION (0x1000000)
    SimpleChain.dwErrorStatus = CERT_TRUST_IS_PARTIAL_CHAIN (0x10000)

    Failed “AIA” Time: 0
    Error retrieving URL: The specified network resource or device is no longer available. 0x80070037 (WIN32: 55)
    ldap:///CN=domain-PDC01-CA,CN=AIA,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=domain,DC=contoso,DC=com?cACertificate?base?objectClass=certificationAuthority

    ---------------- Certificate CDP ----------------
    Failed “CDP” Time: 0
    Error retrieving URL: The specified network resource or device is no longer available. 0x80070037 (WIN32: 55)
    ldap:///CN=domain-PDC01-CA,CN=pdc01,CN=CDP,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=domain,DC=contoso,DC=com?certificateRevocationList?base?objectClass=cRLDistributionPoint

    Missing Issuer: CN=domain-PDC01-CA, DC=domain, DC=contoso, DC=com

    Incomplete certificate chain
    Cannot find certificate:
    CN=domain-PDC01-CA, DC=domain, DC=contoso, DC=com
    Cert is an End Entity certificate

    ERROR: Verifying leaf certificate revocation status returned The revocation function was unable to check revocation because the revocation server was offline. 0x80092013 (-2146885613)
    CertUtil: The revocation function was unable to check revocation because the revocation server was offline.

    What can I do to alleviate this? We were going to deploy our edge today but now it is looking like I cannot.

  6. Danie says:

    Hi there. In your case, was it the default cert from the lyncfront server? I have the root CA and issuing cert already on the edge. If I export the cert from the lyncfront I get the same result as you, though it mentions the cert details from the default cert from the lyncfront I tested against.
    The lyncedge is in a workgroup and can ping the root CA server, though it can’t reach the certsrv page. My other question: do you then export the default cert from the lyncfront (it won’t allow me to export the private key, though)? I’ve also tried the schannel reg edit, but that does not work; I still see the same error in the logger on the lyncfront. Hope you can reply soon.

    • Kevin Peters says:

      Danie,

      What needs to be added to the Trusted Root Certification Authorities certificate store on the edge server is the certificate of the root CA that issued the certificates to the FE, not the FE server cert.

      HTH
      -kp

  7. Danie says:

    I fixed this today. The reg edit solution worked for us, but we have 2 CA servers in our domain: on the lyncfront I had the issuing certs from both, and only one of the certs on the edge. When I imported the 2nd cert on the edge, replication started to work. So the solution above is almost correct, except that the missing cert is not from the edge server but from the lyncfront server, but the schannel reg fix should also be applied.

    • Kevin Peters says:

      Hi Danie,

      I think you are a bit confused about the info above. The cert you need to install is the trusted root CA certificate that issued the front end cert, not the FE server cert itself. If you install just the FE server cert, you will end up with other problems if you have other Lync servers trying to talk to the edge. The solution above is the correct solution for the scenario.

      The registry setting in the comments above was for a different issue than the actual problem(s) shown in the article.

      HTH
      -kp

  8. Great post KP, helped me out today on an Edge Server where CMS replication was failing. Found that I hadn’t imported the intermediate cert. 🙂

  9. Richard says:

    Thanks Matt, my root CA wasn’t sitting in the right store on the Edge Server. Thanks again.

  10. Chris Duva says:

    Thank you for a good article. My new Lync 2013 edge server on a Windows Server 2012 VM passed all of these tests, but still would not replicate. I had to add a DWord named: ClientAuthTrustMode (set to 2) to my SCHANNEL key in order to get it to replicate properly. I had to add the same value to my front end server in order to get the Front End service to start.

  11. Pingback: CMS replication issues between Lync 2013 FE and Edge when using internal CA for Edge » Saleh's TechNotes
