Wednesday, August 11, 2010

Are you also facing the 21-second connection delay with Windows Sockets? Then you have landed in the right place to find the reason for it.

It took me more than 2 weeks to identify the reason behind the 21-second delay. Now I am sharing what I learned with you. This explanation should make the issue and its background clear.

Issue:
Whenever connect() is called on a non-blocking socket (of the WSAEventSelect type) and the server is down or unplugged from the network, the result of the connect call is received only after 21 seconds. The same 21-second delay occurs for every reconnection attempt made to that server, which increases the client's waiting time. If the data from the server is critical and even 5 seconds of lost data matters to the client, then the whole client software is in serious trouble.

Socket Background:
In the case of non-blocking sockets (of the WSAEventSelect type), if you try to connect to the server using the Windows Sockets connect() call, the call returns SOCKET_ERROR immediately and WSAGetLastError() reports WSAEWOULDBLOCK. This indicates that the connection is in progress; its success or failure is identified later from the event raised by the network.

If the socket connects successfully, the FD_CONNECT event is raised without any error code. If the connection fails, the FD_CONNECT event is raised with an error code that describes the failure. If the failure is a timeout, the FD_CONNECT event is raised with the error code WSAETIMEDOUT (10060) after 21 seconds. After this event, the client will either reconnect to the server or abort the connection and report the server status to the user.

Rationale behind the 21 second delay:

TCP Layer States of Windows Socket:
The start state of every socket is the CLOSED state. When a client initiates a connection, it sends a SYN packet to the server and puts the client socket in the SYN_SENT state. When the server receives the SYN packet, it sends a SYN-ACK packet, which the client responds to with an ACK packet. At this point, the client's socket is in the ESTABLISHED state. If the server never sends a SYN-ACK packet, the client times out and reverts to the CLOSED state.
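The client-side transitions described above can be modelled as a small lookup table. This is only an illustrative sketch: the event names ("connect", "syn_ack", "timeout") and the helper functions are made up for this example and are not part of any socket API.

```python
from enum import Enum, auto


class TcpState(Enum):
    CLOSED = auto()
    SYN_SENT = auto()
    ESTABLISHED = auto()


# Hypothetical event names, used only in this sketch.
TRANSITIONS = {
    (TcpState.CLOSED, "connect"): TcpState.SYN_SENT,       # client sends SYN
    (TcpState.SYN_SENT, "syn_ack"): TcpState.ESTABLISHED,  # client answers with ACK
    (TcpState.SYN_SENT, "timeout"): TcpState.CLOSED,       # no SYN-ACK ever arrived
}


def next_state(state, event):
    """Return the state after `event`; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events):
    """Start in CLOSED and apply each event in order, returning the final state."""
    state = TcpState.CLOSED
    for event in events:
        state = next_state(state, event)
    return state
```

For the unreachable-server case, run(["connect"]) stays in SYN_SENT until a "timeout" event moves the socket back to CLOSED.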

I developed a sample application and analyzed the TCP-level states of the socket with the TCPView tool for the WSAETIMEDOUT scenario. The following observation was made:

When connect() is called, the socket goes to the SYN_SENT state. A socket is in SYN_SENT while it is waiting for a response from the server. In this case, the socket on which connect() was called remains in the SYN_SENT state, which means the client didn't get a SYN-ACK packet from the server. The socket waits in this state for 21 seconds and then times out, which causes the WSAETIMEDOUT error event. The snapshot of the sample socket is shown below.


So Why 21 seconds?

Why not 15 seconds, or 30 seconds, or 25 seconds? Yes, this question is quite obvious. It made me dig deep into the TCP layer of Windows Sockets, and that is where I found the reason behind the 21-second delay. Find the reason below:


The delay is caused by the timing back-off mechanism in the connect() retry logic. The connect() function sends out a SYN and expects to receive a SYN-ACK back from the target. If the SYN-ACK is not received within 3 seconds (all numbers here are the default values in the Winsock stack), it is assumed that the first SYN was lost and another SYN is sent. connect() now doubles its waiting time to 6 seconds before assuming that this second SYN was lost too, and then sends a third and final SYN. It again doubles the waiting time, to 12 seconds, before assuming that this SYN was lost as well, that the destination can't be reached, and that the connection attempt must be aborted. 3 + 6 + 12 seconds = 21 seconds of delay before an error is reported. This retry mechanism is what leads the Windows Sockets connect call to a 21-second delay.
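The arithmetic is easy to check with a few lines of Python. This is just a sketch of the doubling scheme described above; the function names are my own, and the defaults (3 seconds initial wait, 3 SYN attempts) are the figures quoted in this post, not values read from the stack.

```python
def syn_wait_schedule(initial_wait=3.0, attempts=3):
    """Return the wait (in seconds) that follows each SYN, doubling every time."""
    waits = []
    wait = initial_wait
    for _ in range(attempts):
        waits.append(wait)
        wait *= 2  # back-off: each retry waits twice as long as the previous one
    return waits


def total_connect_delay(initial_wait=3.0, attempts=3):
    """Total time before the connect attempt is reported as timed out."""
    return sum(syn_wait_schedule(initial_wait, attempts))


print(syn_wait_schedule())    # [3.0, 6.0, 12.0]
print(total_connect_delay())  # 21.0
```

Under the same doubling rule, two attempts would give 3 + 6 = 9 seconds and four attempts 3 + 6 + 12 + 24 = 45 seconds.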

Possible Solution:

The problem we are facing occurs when we use the WSAEventSelect type of non-blocking socket call. This delay can be avoided if, for the connect step alone, the non-blocking mechanism is changed to the select() method or the WSAAsyncSelect method, which have options to reduce the delay. With select(), we can set a timeout for the connect call; if the connection doesn't succeed within the specified time, select() returns 0, which indicates that the connect attempt timed out.
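Here is a minimal sketch of the select() approach, written in Python for brevity (its select.select() call sits on top of the operating system's select(), which on Windows is the Winsock one). The function name connect_with_timeout and its parameters are my own choices for this example, and a C version using Winsock directly would follow the same steps. Note that Python's select.select() reports a timeout as three empty lists rather than a return value of 0.

```python
import errno
import os
import select
import socket


def connect_with_timeout(ip_address, port, timeout_seconds):
    """Connect a non-blocking TCP socket, giving up after timeout_seconds.

    Returns the connected socket. Raises TimeoutError if select() times out,
    or OSError if the connection fails for another reason.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # On a non-blocking socket this returns at once, normally with
        # "would block" / "in progress" (WSAEWOULDBLOCK on Windows).
        err = sock.connect_ex((ip_address, port))
        if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            raise OSError(err, os.strerror(err))
        if err != 0:
            # Writable means the connect finished; Windows reports a failed
            # connect in the exceptional set, so we watch both.
            _, writable, failed = select.select([], [sock], [sock], timeout_seconds)
            if not writable and not failed:
                raise TimeoutError("connect timed out after %s s" % timeout_seconds)
            # SO_ERROR tells us whether the finished connect succeeded.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                raise OSError(err, os.strerror(err))
        return sock
    except BaseException:
        sock.close()
        raise
```

A caller that can tolerate only a 5-second gap could call connect_with_timeout(server_ip, server_port, 5.0) and treat TimeoutError as "server not reachable", instead of waiting out the full 21 seconds. Closing the socket on timeout also abandons the pending connection attempt.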