Push Notification Fails with a 504 Server Timeout

While troubleshooting push notification failures with a client, I found an interesting problem.  The client had already configured the SRV record as required (http://blog.ucmadeeasy.com/) and disabled the URL filtering as required (http://support.microsoft.com/kb/2664650), but push notification was still failing with a 504 error code.  To take it one step further, we completely disabled all IM filtering just in case.  However, we still received a 504 (server timeout) error from the Push service.

As background, the Push Notification Clearing House (PNCH) runs in the Office 365 cloud using Lync Edge servers and dynamic federation.  For more information on the three types of federation, refer to the article I wrote here: https://ocsguy.com/2011/04/20/a-few-words-on-federation/.

We were unable to troubleshoot the issue from the Office 365 side, so I decided to reconfigure my company’s Edge server with dynamic federation (it was configured with direct federation) and see if I could find any errors related to the customer’s configuration.

I began by removing the customer’s domain information from the Federated Domains tab within the Lync Server Control Panel in my Lync environment.  Next I signed into a test account in the customer’s Lync environment (jsmith@contoso.com) and attempted to IM an account in my environment (kevinp@tailspintoys.com).  The IM failed immediately, and I began reviewing the UCCAPI log from my client and the SIP Stack logs from my Edge servers.

It didn’t take long to find the 504 error in the logs, including some useful diagnostic information:

In the “ms-diagnostics” line we see “No match for domain in DNS SRV results” followed by the domain name (contoso.com) and the A record usim.us.contoso.com.

The problem lies in the A record that the Federation SRV record points to: it doesn’t match the SIP domain.

Now you may be thinking “they are both contoso.com”, and if you are, you are not alone!  The catch, however, is that the A record (usim.US.contoso.com) contains a subdomain that does not exist in the SIP domain.  This causes the SRV record to fail to match the SIP domain.  Since they don’t match, you would have to use Direct Federation instead of Enhanced or Dynamic Federation to federate with this organization.  That would seem to be an easy fix, but since Office 365 only supports Dynamic Federation, the fix is a configuration change on the customer side.
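You can see the mismatch for yourself by querying the federation SRV record for the SIP domain (contoso.com here, matching the example above) and comparing the target host against the SIP domain:

nslookup -type=SRV _sipfederationtls._tcp.contoso.com

For dynamic federation to match, the host returned by this query needs to sit directly in the SIP domain (for example sip.contoso.com), not under an intermediate subdomain like us.contoso.com.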

To resolve the issue we created a new DNS A record to use for Federation (sip.contoso.com).  We also updated the Access Edge certificate to include this name in the SAN field. Once these steps were completed, Push Notification began working on the mobile clients.
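For reference, the working end state looks roughly like this in DNS (port 5061 is the standard federation port; the Edge address below is a placeholder), with the federation SRV record pointed at the new name:

_sipfederationtls._tcp.contoso.com.  SRV  0 0 5061  sip.contoso.com.
sip.contoso.com.                     A    <Access Edge external IP>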

Lesson learned: as a best practice, make sure your DNS A records for Federation don’t include a subdomain unless your SIP domain does as well.

Lync Mobility Server Side Bits Available

Quick update: the MCX package is now available from Microsoft.  These bits, along with CU4, allow you to turn on Lync mobile client functionality.  The client bits will be available in the app stores/marketplaces for the phones sometime before the end of the year, but I have no inside information on exactly when.

Before deploying the new bits or CU4 in production, make sure to test in your lab, read the documentation, and update your load balancer configuration (links below).  If time allows and someone else doesn’t beat me to it, I’ll publish an article on the mobility configuration shortly.

Links:

Mobility and Autodiscover Services: http://www.microsoft.com/download/en/details.aspx?id=28356

Mobility Deployment Guide: http://www.microsoft.com/download/en/details.aspx?id=28355

Hardware Load Balancer Requirements for Lync 2010 (Updated for CU4): http://blogs.technet.com/b/nexthop/archive/2011/11/03/hardware-load-balancer-requirements-for-lync-server-2010.aspx

Happy patching!

Lync Hardware Load Balancer Monitoring Port

If you are using a hardware load balancer, it performs periodic health checks against your Lync servers to make sure it only distributes load to servers that are functioning.  Because of these checks, you may end up with a large number of protocol errors in your FE logs showing a connection error from the load balancer’s VIP or one of its SNAT addresses.  Here is an example error:

Source: LS Protocol Stack

Event ID: 14502

Level: Error

A significant number of connection failures have occurred with remote server IP 10.255.106.202. There have been 120 failures in the last 180 minutes. There have been a total of 291 failures.

The specific failure types and their counts are identified below.

Instance count   – Failure Type

291                 0x80072746(WSAECONNRESET)      

This can be due to credential issues, DNS, firewalls or proxies. The specific failure types above should identify the problem.

Notice in the error that the IP of my VIP is listed (10.255.106.202).

Although these errors are expected if you haven’t specified an HLB monitoring port, they certainly cause an awful lot of unwanted noise in the logs.

To combat the issue, enable an HLB monitoring port on your FE servers (or any other pool you are using an HLB with) and configure the load balancer’s health checks to use that port instead of the port used for TLS traffic.

Start by configuring the pool in Topology Builder: right-click the pool and choose Edit Properties > General.  Place a check in “Enable Hardware Load Balancer monitoring port” and specify a port.

If you have the Mediation Server role on the pool and have specified a TCP port of 5060, you will need to use a different port.
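After publishing the topology and running through the deployment steps again on the pool members, you can quickly confirm a Front End is actually listening on the monitoring port you picked (5080 below is just an example value; substitute your own):

netstat -an | findstr ":5080"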

Once this is configured, you can log into your load balancer and point its health checks at this port instead of 5061 (your SIP traffic port).  Here is how I configured it on my Kemp VLM in my lab (please consult your product literature for the correct configuration based on your device manufacturer’s recommendations).

Once everything was configured, I went ahead and stopped the Front End services on one of the servers in the pool, and just as expected, the load balancer showed it as down and directed the traffic elsewhere.

Using the Analog Device Creation Script

Last week I launched a new script on the script center to bulk create analog devices in Lync.  This script uses a source CSV file (reference file included in the download) to create a large (or small) number of analog devices.  The zip also includes a “readme” detailing how to run the script, but in case it wasn’t clear, I wanted to cover a few of the fields in this article.

The first field I want to cover is the “LineURI” field.  This field establishes the phone number of the analog device.  Once the analog device object has a number listed here, Lync will use that number to route to it (via the analog gateway or ATA).  It is important to include the “tel:+” prefix in this field, followed by the full number, without any spaces, dashes or periods.  For example, the number 513-555-1212 would be entered as tel:+15135551212 (assuming a US +1 country code).

Next, we have the “Gateway” field.  This can be either the name or the IP address of the analog gateway that the analog device is plugged into, not your PSTN gateway.  For example, if my analog gateway were at 192.168.1.5, I would enter 192.168.1.5 in this field.

Finally, I want to cover the “OU” field.  This field defines which OU in your domain the analog device object will be created in.  For example, let’s say I have an OU named “LyncAnalogDevices” in my AD domain, “contoso.local”:

I would populate the “OU” field entry with “ou=LyncAnalogDevices,dc=contoso,dc=local”.
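For context, each row in the CSV ultimately becomes an analog device object, roughly equivalent to the New-CsAnalogDevice call below.  The display name and registrar pool shown here are illustrative values only, not a promise about the reference CSV’s column set; the included readme documents the actual columns.

# Roughly what a single CSV row translates to (illustrative values).
New-CsAnalogDevice -LineUri "tel:+15135551212" `
    -DisplayName "Lobby Analog Phone" `
    -RegistrarPool "pool01.contoso.local" `
    -AnalogFax $false `
    -Gateway 192.168.1.5 `
    -OU "ou=LyncAnalogDevices,dc=contoso,dc=local"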

Other than those items, I think the readme covers everything, but if you have questions feel free to post them here.

Script Center

Hi All,

I’ve added a script center (link in the bar above) to the site and will be adding a number of new scripts shortly.  All scripts will have a transcript feature and will be digitally signed so hopefully they run without issue for you.  If you have suggestions for new scripts or find a bug, please post in the comments section below.

 

Enjoy!

-kp

Troubleshooting CMS Replication

Like most of my other articles, this one comes from a real world scenario I had to solve.  While working with a client, we ran into two front end servers and an edge server that wouldn’t replicate.  The CMS was hosted in another country with plenty of firewalls in between, which definitely complicated the issue.  However, the root cause wasn’t a network issue or firewall.

We started by tackling the front end servers; two of the four front end servers in the pool were showing as out of date in the topology.  From the server hosting the CMS, I verified I could ping each front end server by name, and to my surprise I could also telnet to them on port 445.

Depending on your level of familiarity with CMS, at this point you may be wondering: why port 445?  The server hosting the CMS pushes the data to all replicas in the topology (other than edge) using SMB.  For more information on CMS, please have a look at Jens’ blog here; it goes into great detail.

Since we knew the path used to connect was valid and the server was listening for the connection, the next thing on my list was a packet capture to see what was happening.  I started a packet capture using NetMon and ran the Invoke-CsManagementStoreReplication cmdlet to kick off replication.
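For quick reference, these are the cmdlets involved (the edge FQDN below is just a placeholder):

# Kick off replication to all replicas, or target a single machine.
Invoke-CsManagementStoreReplication
Invoke-CsManagementStoreReplication -ReplicaFqdn "edge01.contoso.com"

# Check which replicas are current; UpToDate should read True.
Get-CsManagementStoreReplicationStatus | Format-Table ReplicaFqdn, UpToDate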

After capturing data for 30 seconds I stopped the capture and applied a Display Filter so I could look at the SMB traffic first.

It didn’t take long to figure out what was going on from here; there was “Access Denied” all over the logs.  I tried to view the properties of the RTCReplicaRoot folder (by default this folder is on the root of the drive Lync is installed on), but didn’t have permission.  Although this would seem like an error, it is actually expected behavior, and it is best not to try to modify permissions on this directory.  At this point, we discussed the build process with the client and determined that a security script meant to tighten NTFS permissions had inadvertently broken CMS replication for those two servers.  Instead of trying to fix the NTFS issue and risking other problems down the road, we removed the boxes from the topology, re-installed the OS, and added them back as Lync servers after a clean rebuild without the security script.

Now that all the front ends were replicating, it was time to find out what was going on with that edge server.  The customer had rebuilt this server as well, assuming the script had caused the same issue on it, but we soon found out that wasn’t the case.  To start troubleshooting, I ran a network capture, and this time I filtered for traffic on port 4443, since the edge server receives its updates on this port over HTTPS, not 445/SMB like the front ends.  After verifying traffic was coming in on the appropriate port, I couldn’t do much more with network captures since the traffic was encrypted.  My next step was to begin logging on the server hosting the CMS, grabbing logs for all three of the CMS-related services.  One thing of note here: when you are trying to troubleshoot CMS issues, you won’t see CMS listed in the Lync Server Logging Tool.  However, you will see XDS; that’s the component you want to log.

I started logging all three XDS options with all flags and all levels, as shown below:

Next I ran the Invoke-CsManagementStoreReplication cmdlet with the -ReplicaFqdn parameter to limit the replication to just that one machine.  I let the log run for about 30 seconds and then stopped it.  I started by analyzing the FTA (File Transfer Agent) logs.  As a quick hint, these logs are in trace format, which is a bit harder to read than the message format, but you do have yellow and red highlights to indicate warnings and errors.  Also, the search feature comes in quite handy.  I ran a search for the edge server’s name, and while viewing the results I found a yellow bar (warning).  I clicked on the yellow bar and saw my problem almost immediately:

In the log we see the errors “Failed to copy files from Replica” and “Invalid certificate presented by remote source”.  In this case, the “remote source” is actually our server hosting the CMS (which seems a bit backwards).  This means our edge server doesn’t trust the certificate the CMS is using.  In a simpler install, that may not be a huge issue, but in this case there were a number of intermediate CAs, and tracking down all the certificates one by one and reviewing everything wasn’t going to be much fun.

That’s when my good friend Paolo from Microsoft, with whom I just happened to be IM’ing about the issue, let me in on a cool little trick.  We exported the certificate from the server hosting the CMS (without the private key) and copied the file to the edge server (C:\tmp\CMSCert.cer).  From a command prompt I ran the following command:

certutil -verify -urlfetch "C:\tmp\CMSCert.cer" > C:\CRL.TXT

This command runs a check on the certificate (including accessing the CRLs) and dumps the results to a text file; it may take a few minutes to complete.  After reading through the text file, we found the information we needed:

This told us exactly which certificate was missing; we were able to get it installed, and the edge server started replicating.
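If you end up doing the same thing, the output file is long; a quick scan for the usual failure keywords can save some scrolling (the search terms here are just suggestions and assume English output):

findstr /i /c:"error" /c:"revoked" /c:"untrusted" C:\CRL.TXT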

To wrap this all up, I’d recommend running through the following when CMS replication isn’t working (a quick PowerShell sketch of the first two steps follows the list):

  1. Ping from the server hosting the CMS to the host that isn’t replicating.
  2. Telnet from the server hosting the CMS to the host that isn’t replicating (port 445 for all servers except edge, which uses 4443).
  3. Run a network capture to see if the traffic is making it through and to look for possible SMB errors.
  4. Enable XDS logging on the server hosting the CMS and review it with Snooper.
  5. Run CertUtil with the command above to test the CA chain all the way through and verify trusts.
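Here is a minimal sketch of steps 1 and 2 in PowerShell, using a hypothetical replica name; swap 445 for 4443 when testing an edge server:

# Step 1: ping the replica by name from the server hosting the CMS.
$replica = "fe01.contoso.com"
Test-Connection $replica -Count 2

# Step 2: confirm the replication port answers (445 for FE, 4443 for edge).
$tcp = New-Object System.Net.Sockets.TcpClient
$tcp.Connect($replica, 445)
$tcp.Connected   # True means the port accepted the connection
$tcp.Close()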

Hope this helps!

Southwestern Ohio UC Users Group

Hi All,

A non-technical post for today.  Adam Curry, Travis Swank and I are working to start a UC Users Group based out of Cincinnati.  We have scheduled the first meeting and have a website live at www.ucusersgroup.com.

The first night’s content will start with an Office 365 overview and move into technical content focusing on the edge roles.  Food and drinks will be provided.  Please join us for an evening of networking and learning new things about Lync.

-Kevin
