***Updated – 4/16/2010***
After installing CU5 for the servers and Tanjay this delay is now down to 2-3 seconds. The patches appear to have shaved right around 2 seconds off of the time between:
CCallStateManager::OnAggregateStatesChanged: IsUserInACall = 1
The patches are here:
At a recent deployment I ran into an issue with delays in response groups. This time it was a bit different than before, though. This time the delay was only 4 to 5 seconds from the time the agent picked up the phone until audio started, and the delay was only affecting Tanjays.
I started off with the standard stuff: SAN field on the edge server cert, access list on the router between subnets, CRL not valid or not accessible, checking the cert algorithm, etc…
The CRL (Certificate Revocation List) distribution point was only published with an LDAP request. I logged into the CA, setup an HTTP based distribution point and updated the certificate templates and certificates on the edge, mediation and front end servers.
Now that certificates included a valid CRL, I restarted the services to make sure the certificate change was applied and restarted the Tanjay to get a nice clean connection. Another call to the response group to test and I see the same delay. I checked and rechecked, the delay was present whether I was calling from MOC to the response group or from the PSTN.
At this point I really wanted to see what was going on with the client. The Tanjays have logging built in (Settings>Enable Logging) but I also wanted a bit more detail so I created a mirror port on the Cisco switches to duplicate all traffic from the Tanjays port to another port when I had wireshark running, enabled SIP Stack logging on the Front End server and grabbed a packet capture from the Front End server (all at once).
To enable Port SPAN on a Cisco 3750 I ran the following commands:
C3750(config)#monitor session 1 source interface Gig 0/2
C3750(config)#monitor session 1 destination interface Gig 0/3
*In the above example Gig 0/2 is the port the Tanjay is plugged into and Gig 0/3 is the port we’re sending copies of all the traffic to, more info here*
I placed the test call and saw about a 4 second delay before audio started (as expected). I stopped the traces and network captures, and then sent the logs from the Tanjay to the front end server. The logs are uploaded to %Program Files\Microsoft Office Communications Server 2007 R2\Web Components\DeviceUpdateFiles\Logs\Client\CELog. Once I had the Tanjay logs (a really long file name ending in CLG1) I used readlog to convert them into a format I could read:
I opened a case with Microsoft as well to make sure I wasn’t going crazy. After looking everything over and discussing with a few other people we realized this was actually a bug (one they knew about).
To get a good look at the bug we check the Tanjay log file, the first thing we see is a 3 second delay between here (14:45:06.000):
And here (14:45:09:000):
We’re not really sure what’s going on during this delay, but we know it is not expected.
Next, while examining the SIP Stack from the front end, we see another delay of around half a second between the Refer being sent and the 202 coming back:
The two delay’s total up to almost four seconds of time spent building the call. This same delay occurs even if the Tanjay is USB connected to the computer, but doesn’t occur if you are using MOC with USB attached devices. Which means the problem is actually in the Tanjay code.
Microsoft is aware of the issue and they are actively working to correct it. Once there is a firmware fix out there I will update this post to notify everyone.