Bug 48934 - Cluster's regression. When replication fails once, replication can be never done again.
Summary: Cluster's regression. When replication fails once, replication can be never d...
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 6
Classification: Unclassified
Component: Cluster
Version: 6.0.26
Hardware: All All
Importance: P2 regression
Target Milestone: default
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-18 11:09 UTC by Keiichi Fujino
Modified: 2010-04-11 10:11 UTC (History)

Attachments
Bug fix (1.47 KB, text/plain)
2010-03-18 13:56 UTC, Filip Hanik

Description Keiichi Fujino 2010-03-18 11:09:23 UTC
I found a cluster regression in Tomcat 6.0.26.

The steps to reproduce are as follows.
=====
The cluster consists of two nodes, tomcat1 and tomcat2.
(The Transport className is org.apache.catalina.tribes.transport.nio.PooledParallelSender.
 I expect PooledMultiSender to behave the same way.)
Stop tomcat2 while session replication is in progress.
As a result, session replication fails and a ChannelException is thrown.
Restart tomcat2.
Trigger session replication again.
As a result, the following exception is thrown.
org.apache.catalina.tribes.ChannelException: Sender not connected.; No faulty members identified.
=====

The cause is r908741:
http://svn.apache.org/viewvc?view=revision&revision=908741
With this change, the sender is disconnected whenever replication fails.

The disconnect method in PooledParallelSender is as follows.
===
public synchronized void disconnect() {
    this.connected = false;
    super.disconnect();
    
}
===
disconnect() sets this.connected to false and then calls super.disconnect().
super.disconnect() closes the queue.

As far as I can tell:
once connected is set to false, it never becomes true again,
and
once the queue is closed, it is never opened again.
Only ReplicationTransmitter#start can set connected back to true.
The same applies to reopening the queue.
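
The behaviour described above can be modelled with a minimal sketch. This is not the Tribes source: SketchPooledSender and OneWayDisconnectDemo are hypothetical names, and the flags only imitate the connected field and the queue state mentioned in this report, on the assumption that nothing except the start path resets them.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for a pooled sender; NOT the Tomcat Tribes class.
class SketchPooledSender {
    private boolean connected = false;
    private boolean queueOpen = false;
    private final Queue<String> sent = new ArrayDeque<>();

    // Stands in for the start path (ReplicationTransmitter#start in the report),
    // assumed here to be the only place that re-enables the sender.
    synchronized void start() {
        connected = true;
        queueOpen = true;
    }

    // Imitates the quoted disconnect(): clear the flag and close the queue.
    synchronized void disconnect() {
        connected = false;
        queueOpen = false;
    }

    synchronized void sendMessage(String msg) {
        if (!connected || !queueOpen) {
            throw new IllegalStateException("Sender not connected.");
        }
        sent.add(msg);
    }
}

class OneWayDisconnectDemo {
    // Returns true if a send attempted after a failure-triggered disconnect is rejected.
    static boolean sendAfterFailureIsRejected() {
        SketchPooledSender sender = new SketchPooledSender();
        sender.start();
        sender.sendMessage("session-1");
        sender.disconnect(); // what happens on a replication failure, per this report
        try {
            // The peer may be back, but nothing has called start() again.
            sender.sendMessage("session-2");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after failure: " + sendAfterFailureIsRejected());
    }
}
```

In this model the second send is rejected even though the remote node could be reachable again, which matches the "Sender not connected." symptom.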

As a result,
once replication fails, replication can never be done again.

I do not know why r908741 was applied.
However, once a ChannelException is thrown, every Sender becomes unusable.
This is not good.

Can r908741 be reverted?
If that is not possible, what is the reason for r908741?

Best regards.
Comment 1 Filip Hanik 2010-03-18 13:56:35 UTC
Created attachment 25146
Bug fix

Dear Fujino, as always you are right. The intended fix was to close sockets that were potentially left in a CLOSE_WAIT state when something went wrong. But instead of closing the actual sender that holds the TCP sockets, I accidentally closed the entire sender system.
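
To illustrate the distinction drawn in this comment, here is a minimal sketch of closing only the failed per-connection sender while leaving the pool usable. It is not the attached patch and does not use the Tribes API: SketchSocketSender, SketchSenderPool, and the discard method are hypothetical names, and the TCP socket is modelled as a boolean.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-connection sender that owns one TCP socket (modelled as a flag).
class SketchSocketSender {
    private boolean socketOpen = false;

    void connect() { socketOpen = true; }
    void close() { socketOpen = false; }
    boolean isOpen() { return socketOpen; }
}

// Hypothetical pool; stands in for "the entire sender system" in the comment.
class SketchSenderPool {
    private boolean poolOpen = true;
    private final Deque<SketchSocketSender> idle = new ArrayDeque<>();

    synchronized SketchSocketSender borrow() {
        if (!poolOpen) {
            throw new IllegalStateException("Sender not connected.");
        }
        SketchSocketSender s = idle.poll();
        if (s == null) {
            s = new SketchSocketSender();
        }
        if (!s.isOpen()) {
            s.connect();
        }
        return s;
    }

    synchronized void giveBack(SketchSocketSender s) {
        if (poolOpen) {
            idle.push(s);
        } else {
            s.close();
        }
    }

    // On a send failure: close just this sender so its socket does not linger,
    // and do not return it to the idle list. The pool itself stays open.
    synchronized void discard(SketchSocketSender s) {
        s.close();
    }

    // Full shutdown; in this sketch it is never called from the failure path.
    synchronized void shutdown() {
        poolOpen = false;
        for (SketchSocketSender s : idle) {
            s.close();
        }
        idle.clear();
    }
}

class TargetedCloseDemo {
    // Returns true if the pool still hands out a working sender after one failure.
    static boolean poolSurvivesFailure() {
        SketchSenderPool pool = new SketchSenderPool();
        SketchSocketSender broken = pool.borrow();
        pool.discard(broken); // simulated replication failure
        SketchSocketSender next = pool.borrow(); // pool was never shut down
        boolean ok = next != broken && next.isOpen() && !broken.isOpen();
        pool.giveBack(next);
        return ok;
    }
}
```

Calling shutdown() from the failure path instead of discard() would reproduce the regression in this model, since every later borrow() would be rejected.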
Comment 2 Mark Thomas 2010-04-11 10:11:49 UTC
This has been fixed in 6.0.x and will be included in 6.0.27 onwards.