Since the blog I did on Live Meeting troubleshooting I have seen a lot of queries leading people to the site for troubleshooting OCS. I’ve also seen a ton of questions on the subject on the MS forums. All this has led me to the conclusion that OCS troubleshooting isn’t that easy to get a handle on. With that in mind I’m writing this post on troubleshooting federation. First and foremost, this article is about troubleshooting a mistake I made during a recent deployment, and if you ask any PSS engineer they will tell you 80% of the problems they face with OCS come down to the same thing: human error or configuration error. As I said in my last troubleshooting post, I’m no expert on troubleshooting OCS, but hopefully this post will help someone out there. As always, I encourage you to share your stories and methods if you think they may help someone else.
So recently, while working on an Enterprise Edition install of OCS 2007 R2, I ran into an issue with federation. The issue was that I could send an IM from the client’s side, but an attempt to reply from our OCS environment ended with an ID 504 error in MOC. I just so happened to be federating the client with my own company, so I was able to trace from both sides and find the resolution.
Since 504 errors are typically routing, firewall, or DNS related (boy, that really limits it, doesn’t it!) I started out with the standard DNS and telnet tests. I could resolve the access edge server correctly and could also telnet to it from my edge server on port 5061. Since the firewalls on both sides looked good and DNS was doing its job, I started a SIP Stack trace. I started with the edge server at my company, as we were the ones who couldn’t communicate, and most likely we would see the errors on our side.
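If you would rather script the DNS and telnet checks than run them by hand, something like the following Python would do the same job. This is only a rough sketch of the idea, not what I actually ran: the function names are my own, and the only value taken from this post is port 5061. You would pass in the FQDN of the partner's access edge server.

```python
import socket


def resolve(host):
    """Return the sorted IP addresses the host resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def can_connect(host, port=5061, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds (the 'telnet test')."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_edge(host, port=5061):
    """Run both checks and return the results as a small dictionary."""
    addresses = resolve(host)
    reachable = bool(addresses) and can_connect(host, port)
    return {"host": host, "addresses": addresses, "reachable": reachable}
```

Calling check_edge with the partner's access edge FQDN from your own edge server gives you the resolved addresses and whether the port answered. Keep in mind this only proves the TCP port is open; it says nothing about certificates or SIP routing, which is why the trace is still the next step.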
On our edge server I started my SIP Stack trace and attempted to send an IM to my test account in the client’s environment. Keep in mind there is a lot of information in a SIP trace, so you want to be quick about this so you don’t overwhelm yourself with logs.
Here’s how I configured logging:
After the test message was sent and the error was received in MOC I stopped logging and clicked the “Analyze Log Files” button.
I made sure only my SIP Stack was selected and clicked “Analyze”.
I followed the path listed in the “Output File” field and grabbed the text file that was created. Once the log file was on my machine I opened Snooper and examined the log. Here’s what I saw:
I selected the first red line that was relevant to my conversation with the test contact: a “Server Time-Out” error. From here I moved one line up so I would get the request right before the error, and looked at the information in the right-hand column. Under the “Route” section I saw not only the pool name of the customer’s EE pool, but also the FQDN of the front end server. At this point I realized where my error was.
Since the edge server was behind a NAT, it had to be able to resolve the public IPs for the public-facing edge services (Sip., AV., and Meeting.). Also, to protect the network, we had not allowed the server to resolve internal names through DNS at all. To enable the edge server to talk to the pool I had created entries in the hosts file. However, by mistake I had only created an entry for the FQDN of the pool and not for the individual servers in the pool. I added an entry to the hosts file for the FQDN of the front end server and that corrected my issue.
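A quick way to catch this kind of miss is to compare the hosts file against the list of names the edge server needs to reach. Here is a small Python sketch of that check. The server names and IPs in it are made up for illustration (they are not from my deployment), and the helper names are my own.

```python
def parse_hosts(text):
    """Map each hostname found in hosts-file text (lowercased) to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # The first entry for a name wins, so keep it if already present.
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_entries(hosts_text, required):
    """Return the required names that have no entry in the hosts-file text."""
    known = parse_hosts(hosts_text)
    return [name for name in required if name.lower() not in known]


# Hypothetical example: the pool has an entry, the front end server does not.
sample_hosts = """
# internal OCS names (illustrative values only)
10.0.0.10   pool01.example.local
"""
required_names = ["pool01.example.local", "fe01.example.local"]
print(missing_entries(sample_hosts, required_names))
```

On a real edge server you would read the text from the hosts file itself (on Windows it lives under %SystemRoot%\System32\drivers\etc\hosts) and build the required list from the pool FQDN plus every front end server in that pool.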
Although this won’t cure every 504, hopefully the methods used help shed some light on troubleshooting.
Keep in mind 504s are usually routing, firewall, or DNS related, and it’s best to troubleshoot them from the end receiving the error. If anyone is interested I can provide a copy of the log file (names and IPs changed, of course).