Like most of my other articles, this one comes from a real-world scenario I had to solve. While working with a client, we ran into two front end servers and an edge server that wouldn’t replicate. The CMS was hosted in another country with plenty of firewalls in between, which definitely complicated matters. However, the root cause turned out to be neither the network nor a firewall.
We started by tackling the front end servers: two of the four servers in the pool were showing as out of date in the topology. From the server hosting the CMS, I verified I could ping each front end server by name, and to my surprise I could also telnet to them on port 445.
Depending on your level of familiarity with the CMS, at this point you may be wondering: why port 445? The server hosting the CMS pushes data to all replicas in the topology (other than edge servers) using SMB. For more information on the CMS, please have a look at Jens’ blog here; it goes into great detail.
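If you don’t have a telnet client handy, a few lines of Python do the same reachability test against the SMB port. The hostname below is a placeholder, not one of the servers from this case:

```python
# Minimal stand-in for the telnet test: can we open a TCP connection
# to a replica on the CMS replication port?
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder hostname): front end servers listen on TCP 445 for SMB.
# port_open("fe01.contoso.com", 445)
```

This only proves the port is reachable and listening, exactly like the telnet test; as the rest of this article shows, that is no guarantee replication will actually succeed.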
Since we knew the path used to connect was valid and the server was listening for the connection, the next thing on my list was a packet capture to see what was happening. I started a packet capture using NetMon and ran the Invoke-CsManagementStoreReplication cmdlet to kick off replication.
After capturing data for 30 seconds I stopped the capture and applied a Display Filter so I could look at the SMB traffic first.
It didn’t take long to figure out what was going on from here: the capture was full of “Access Denied” errors. I tried to view the properties of the RTCReplicaRoot folder (by default this folder is on the root of the drive Lync is installed on), but didn’t have permission. Although this looks like an error, it is actually expected behavior, and it is best not to modify the permissions on this directory. At this point, we discussed the build process with the client and determined that a security script meant to tighten NTFS permissions had inadvertently broken CMS replication on those two servers. Rather than trying to fix the NTFS issue and risking other problems down the road, we removed the boxes from the topology, reinstalled the OS, and added them back as Lync servers after a clean rebuild without the security script.
Now that all the front ends were replicating, it was time to find out what was going on with that edge server. The customer had rebuilt this server as well, assuming the script had caused the same issue on it, but we soon found out that wasn’t the case. To start troubleshooting, I ran a network capture, this time filtering for traffic on port 4443, since the edge server receives its updates over HTTPS on this port rather than 445/SMB like the front ends. After verifying traffic was coming in on the appropriate port, I couldn’t do much more with network captures since the traffic was encrypted. My next step was to begin logging on the server hosting the CMS, grabbing logs for all three of the CMS-related services. One thing of note here: when you are trying to troubleshoot CMS issues, you won’t see “CMS” listed in the Lync Server Logging Tool. You will see XDS, and that’s the component you want to log.
I started logging all three XDS options, with all flags and all levels, as shown below:
Next I ran the Invoke-CsManagementStoreReplication cmdlet with the -ReplicaFqdn parameter to limit replication to just that one machine. I let the log run for about 30 seconds and then stopped it. I started by analyzing the FTA (File Transfer Agent) logs. As a quick hint, these logs are in trace format, which is a bit harder to read than the messages format, but you do get yellow and red highlights to indicate warnings and errors, and the search feature comes in quite handy. I ran a search for the edge server’s name, and while viewing the results I found a yellow bar (a warning). I clicked on the yellow bar and saw my problem almost immediately:
In the log we see the errors “Failed to copy files from Replica” and “Invalid certificate presented by remote source”. In this case, the “remote source” is actually our server hosting the CMS (which seems a bit backwards). This means our edge server doesn’t trust the certificate the CMS is using. In a simpler install that may not be a huge issue, but in this case there were a number of intermediate CAs, and tracking down all the certificates one by one and reviewing everything wasn’t going to be much fun.
That’s when my good friend Paolo from Microsoft, whom I happened to be IM’ing about the issue, let me in on a cool little trick. We exported the certificate from the server hosting the CMS (without the private key) and copied the file to the edge server (C:\tmp\CMSCert.cer). From a command prompt I ran the following command:
Certutil -verify -urlfetch "C:\tmp\CMSCert.cer" > C:\CRL.TXT
This command runs a check on the certificate (including fetching and checking the CRLs) and dumps the results to a text file; it may take a few minutes to complete. After reading through the text file we found the information we needed:
This told us exactly which certificate was missing; once we installed it, the edge server started replicating.
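The certutil dump can run to hundreds of lines, so a quick way to skim it is to pull out just the lines that mention a failure. The sample text below is illustrative only, not the actual output from this case:

```python
# Skim a certutil -verify -urlfetch dump for failure lines.
def failing_lines(text: str) -> list:
    """Return stripped lines that mention an error, failure, or revocation issue."""
    keywords = ("error", "failed", "could not", "revocation")
    return [ln.strip() for ln in text.splitlines()
            if any(k in ln.lower() for k in keywords)]

# Synthetic sample, loosely shaped like certutil output:
sample = """\
Issuer: CN=Contoso Intermediate CA
ERROR: Verifying leaf certificate revocation status returned a failure
A certificate chain could not be built to a trusted root authority.
"""

for line in failing_lines(sample):
    print(line)
```

This is just a triage aid; you still read the surrounding context in CRL.TXT to see which link in the chain is the problem.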
To wrap this all up, I’d recommend running through the following when CMS replication isn’t working:
Ping from the server hosting the CMS to the host that isn’t replicating
Telnet from the server hosting the CMS to the host that isn’t replicating (port 445 for all servers except edge, which uses 4443)
Network capture to see if the traffic is making it through and to look for possible SMB errors
XDS logging on the server hosting the CMS, review with Snooper
CertUtil – the command above will test the CA chain all the way through to verify trust
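The connectivity portion of that checklist is easy to batch up. Here is a small sketch you could run from the server hosting the CMS; the hostnames are placeholders for your own topology, with edge servers checked on 4443 and everything else on 445:

```python
# Batch TCP check of every replica on its CMS replication port.
import socket

# Hypothetical topology: front ends on 445 (SMB), edge on 4443 (HTTPS).
REPLICAS = {
    "fe01.contoso.com": 445,
    "fe02.contoso.com": 445,
    "edge01.contoso.com": 4443,
}

def check_replicas(replicas: dict, timeout: float = 3.0) -> dict:
    """Return {host: True/False} for a TCP connection attempt to each replica."""
    results = {}
    for host, port in replicas.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[host] = True
        except OSError:
            results[host] = False
    return results
```

A host showing False points you at network/firewall checks first; a host showing True but still out of date in the topology points you at the capture and XDS logging steps instead.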
Hope this helps!