While testing clustering in our lab, we noticed that when connectivity to one cluster member was lost by pulling the network cable that carried replication traffic, the entire cluster became unresponsive. We pulled the cable to simulate a catastrophic switch port or interface failure. The tests ran under load, using synchronous replication. We found that existing replication sockets honored our timeout (ackTimeout) configuration, but new connections opened because of pool growth or retries applied no timeout to the socket connect attempt. Without a connect timeout, requests backed up and effectively brought the cluster down. In principle, this connection establishment problem affects all users of the DataSender class.
Created attachment 20366: diff to use Socket.connect with timeout parameter

Our fix was to change DataSender.createSocket to use the ackTimeout for connection establishment. This fix only works with JDK 1.4 or higher, because it relies on the Socket.connect overload that takes a timeout.
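For readers without access to the attachment, here is a minimal sketch of the technique the patch describes: create an unconnected socket, then call Socket.connect with an explicit timeout. This is not the attached diff. The class name ConnectTimeoutSketch, the helper openWithTimeout, and the timeoutMs parameter (standing in for ackTimeout) are hypothetical, and reusing the same value as the read timeout is an assumption made for illustration. The demo in main uses try-with-resources, which needs a newer JDK than the 1.4 minimum of the fix itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ConnectTimeoutSketch {

    // Hypothetical helper, not the real DataSender.createSocket.
    // timeoutMs plays the role of ackTimeout (assumption).
    static Socket openWithTimeout(String host, int port, int timeoutMs) throws IOException {
        Socket socket = new Socket(); // unconnected, so connect() can take a timeout
        try {
            // Throws java.net.SocketTimeoutException if the connection is not
            // established within timeoutMs, instead of blocking indefinitely.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            // Assumption: apply the same limit to blocking reads on this socket.
            socket.setSoTimeout(timeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the socket when the connect fails
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Local listener on an ephemeral port so the demo is self-contained.
        try (ServerSocket server = new ServerSocket(0);
             Socket s = openWithTimeout("127.0.0.1", server.getLocalPort(), 1000)) {
            System.out.println("connected=" + s.isConnected());
        }
    }
}
```

With a bounded connect, a caller that cannot reach a cluster member gets an exception after the timeout and can fail or retry the request, rather than holding a thread while the operating system's default connect timeout runs out.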
Thanks for the report. This has been fixed in svn and will be in 5.5.25. Peter