I found a cluster regression in Tomcat 6.0.26. It can be reproduced as follows.

=====
The cluster is composed of tomcat1 and tomcat2. (The Transport className is org.apache.catalina.tribes.transport.nio.PooledParallelSender. I expect PooledMultiSender behaves the same way.)
tomcat2 is stopped during session replication. As a result, session replication fails and a ChannelException is thrown.
tomcat2 is restarted.
Session replication is attempted again. As a result, the following exception is thrown:
org.apache.catalina.tribes.ChannelException: Sender not connected.; No faulty members identified.
=====

The cause is http://svn.apache.org/viewvc?view=revision&revision=908741

With this fix, the sender is disconnected when replication fails. The disconnect method in PooledParallelSender is as follows.

===
public synchronized void disconnect() {
    this.connected = false;
    super.disconnect();
}
===

this.connected is set to false, and super.disconnect() is called. In super.disconnect(), the queue is closed.

As far as I can tell, once connected is set to false it never becomes true again, and once the queue is closed it is never opened again. Only ReplicationTransmitter#start can set connected to true, and the same applies to opening the queue. As a result, once replication fails, replication can never be done again.

I do not know why r908741 was applied. However, once a ChannelException is thrown, none of the Senders can be used any more. This is not good. Can r908741 be reverted? If that is not possible, what is the reason for r908741?

Best regards.
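The failure mode described in this report can be illustrated with a small toy model. This is only a sketch and is not Tomcat code: ToySenderPool, start(), send() and the peerUp flag are hypothetical names invented for the illustration. It assumes, as the report states, that the connected flag and the queue are opened only on a one-time start-up path and that a failed send disconnects the whole pool.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model (hypothetical, not Tomcat code) of a sender pool that only start() can open. */
class ToySenderPool {
    private boolean connected = false;
    private boolean queueOpen = false;
    private final Queue<String> delivered = new ArrayDeque<>();

    /** Stands in for the one-time start-up path; the only place that sets connected to true. */
    synchronized void start() {
        connected = true;
        queueOpen = true;
    }

    /** Same shape as the disconnect() quoted in the report: flag cleared, queue closed. */
    synchronized void disconnect() {
        connected = false;
        queueOpen = false;
    }

    /** Sends a message; when the peer is down, the whole pool is disconnected. */
    synchronized void send(String msg, boolean peerUp) {
        if (!connected || !queueOpen) {
            throw new IllegalStateException("Sender not connected.");
        }
        if (!peerUp) {
            disconnect(); // assumed behaviour of the change under discussion
            throw new IllegalStateException("Send failed: peer down");
        }
        delivered.add(msg);
    }

    synchronized int deliveredCount() { return delivered.size(); }

    synchronized boolean isConnected() { return connected; }
}
```

In this model, a single failed send leaves the pool permanently unusable: a later send with the peer back up still fails with "Sender not connected." because nothing on the send path calls start() again.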
Created attachment 25146
Bug fix

Dear Fujino,
as always you are right. The intended fix was to close sockets that were potentially left in a CLOSE_WAIT state when something went wrong. But instead of closing only the actual sender that holds the TCP sockets, I accidentally closed the entire sender system.
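The attachment itself is not reproduced here. As a rough sketch of the intent described above (close only the individual sender that holds the socket, and leave the pool usable), something along the following lines could apply. ToySender, ToyPool, closedSenderCount() and the peerUp flag are hypothetical names for illustration and do not correspond to the Tribes classes or to the actual patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for one pooled sender that owns a TCP socket. */
class ToySender {
    private boolean open = true;

    /** Releases the (imaginary) socket so it cannot linger half-closed. */
    void close() { open = false; }

    boolean isOpen() { return open; }

    void write(String msg, boolean peerUp) {
        if (!open) throw new IllegalStateException("sender closed");
        if (!peerUp) throw new IllegalStateException("peer down");
    }
}

/** Hypothetical pool that discards only the failed sender and stays connected. */
class ToyPool {
    private final Deque<ToySender> idle = new ArrayDeque<>();
    private boolean connected = true;
    private int closedSenders = 0;

    synchronized void send(String msg, boolean peerUp) {
        if (!connected) throw new IllegalStateException("Sender not connected.");
        // Reuse an idle sender if one exists, otherwise create a fresh one.
        ToySender sender = idle.isEmpty() ? new ToySender() : idle.pop();
        try {
            sender.write(msg, peerUp);
            idle.push(sender); // healthy sender goes back to the pool
        } catch (IllegalStateException e) {
            sender.close();    // close only the sender that failed
            closedSenders++;
            throw e;           // the pool itself stays connected
        }
    }

    synchronized boolean isConnected() { return connected; }

    synchronized int closedSenderCount() { return closedSenders; }
}
```

With this shape, a failure still surfaces to the caller, but the next send after the peer returns simply obtains a new sender instead of hitting a permanently disconnected pool.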
This has been fixed in 6.0.x and will be included in 6.0.27 onwards.