Bug 46384

Summary: Due to missing synchronization, a member may disappear permanently.
Product: Tomcat 5
Reporter: Martin Harm <mharm>
Component: Catalina:Cluster
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: major    
Priority: P2    
Version: 5.5.27   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Patch to fix this issue
Updated patch for this issue

Description Martin Harm 2008-12-12 00:14:35 UTC
Below is a "pseudo-code extract" of the McastServiceImpl receiver and sender thread flow.

Now assume the following situation:
- ServerA and ServerB are in a cluster; each has added the other to its McastMembership

On ServerA:
t0: The "Sender" thread is at position [P0], and expire() has returned mm = "ServerB".
    So at this moment "ServerB" is not in the McastMembership.map!!
    
t1: The "Receiver" thread receives a packet from "ServerB",
    adds it to the McastMembership,
    calls SimpleTcpCluster.memberAdded("ServerB")
    and blocks on [P1].
    
t2: The "Sender"-Thread continues,
    calls SimpleTcpCluster.memberDisappeared("ServerB").
    

This leads to the following situation:
- "ServerB" is in the McastMembership.map (and without timeouts, it won't disappear)
- there is no session replication to "ServerB"
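
The three steps t0 to t2 can be replayed deterministically in a single thread. This is only a minimal sketch: the class LostMemberReplay and its two sets are hypothetical stand-ins for McastMembership.map and the member list held by ReplicationTransmitter, not the Tomcat code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the two member lists; not the Tomcat classes.
class LostMemberReplay {
    // Plays the role of McastMembership.map.
    final Set<String> membership = new HashSet<>();
    // Plays the role of the member list kept by ReplicationTransmitter.
    final Set<String> transmitter = new HashSet<>();

    // Replays t0..t2 in order; returns true if the two lists end up disagreeing.
    boolean replay(String member) {
        // Starting point: the member is known in both lists.
        membership.add(member);
        transmitter.add(member);

        // t0 (sender): expire() removed the member from the membership map;
        // the sender now sits at [P0] holding the expired member.
        membership.remove(member);
        String expired = member;

        // t1 (receiver): a packet arrives, memberAlive() re-adds the member
        // and memberAdded() registers it with the transmitter again.
        boolean added = membership.add(member);
        if (added) {
            transmitter.add(member);
        }

        // t2 (sender): continues past [P0] and reports the stale expiry.
        transmitter.remove(expired);

        // Member is alive in the membership but gone from the transmitter.
        return membership.contains(member) && !transmitter.contains(member);
    }

    public static void main(String[] args) {
        LostMemberReplay r = new LostMemberReplay();
        System.out.println("inconsistent: " + r.replay("ServerB"));
    }
}
```

Because the member stays in the membership set, a later memberAlive() call in this sketch never reports it as newly added, so nothing registers it with the transmitter again.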


That's it

  

Thread: Cluster-MembershipReceiver

McastServiceImpl.receive
   added= sync McastMembership.memberAlive(mm) { 
     if (mm not in map) then map+=mm;return true;
     else (mark mm as new); return false;
   }
   if (added) {
      SimpleTcpCluster.memberAdded(mm)
        log.info("Replication member added:" + member);
        sync ReplicationTransmitter.add(mm);
   }
     
  checkExpire
   ---[P1]---
   sync on McastServiceImpl(expiredMutex) {
      mm = sync McastMembership.expire() {
         if (mm in map too old) then map-=mm;
         return mm;
      }
      SimpleTcpCluster.memberDisappeared(mm);
        log.info("Received member disappeared:" + member);
        sync ReplicationTransmitter.remove(mm);
   }                      



Thread: Cluster-MembershipSender

  McastServiceImpl.send()
  
  checkExpire
   sync on McastServiceImpl(expiredMutex) {
      mm = sync McastMembership.expire() {
         if (mm in map too old) then map-=mm;
         return mm;
      }
      ---[P0]---
      SimpleTcpCluster.memberDisappeared(mm);
        log.info("Received member disappeared:" + member);
        sync ReplicationTransmitter.remove(mm);
   }
Comment 1 Mark Thomas 2009-04-16 13:49:55 UTC
Created attachment 23501 [details]
Patch to fix this issue

The attached patch should fix this although I haven't tested it.
Comment 2 Sebb 2009-04-16 15:55:45 UTC
Might be an idea to make the field "memebrshipMutex" (sic) final, as otherwise the synchronisation is not guaranteed to work.
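
A minimal sketch of why the field should be final: if the reference could be reassigned, two threads might lock different monitor objects. The class MutexHolder, the corrected spelling membershipMutex and the counter are hypothetical, chosen only for illustration.

```java
// Hypothetical holder class; only the "final" on the lock field is the point.
class MutexHolder {
    // final: the reference can never be reassigned, so every thread
    // that synchronizes on this field locks the same monitor.
    private final Object membershipMutex = new Object();
    private int updates = 0;

    void update() {
        synchronized (membershipMutex) {
            updates++;
        }
    }

    int updates() {
        synchronized (membershipMutex) {
            return updates;
        }
    }

    // Runs the given number of threads, each calling update() perThread times,
    // and returns the final counter value.
    static int runDemo(int threads, int perThread) {
        MutexHolder holder = new MutexHolder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    holder.update();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder.updates();
    }
}
```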
Comment 3 Filip Hanik 2009-04-16 20:58:18 UTC
(In reply to comment #1)
> Created an attachment (id=23501) [details]
> Patch to fix this issue
> 
> The attached patch should fix this although I haven't tested it.

I don't think that patch will fix it. The key problem here is that if the sender thread gets locked up, it will stop broadcasting the member itself, and other nodes will deem it gone.

The only solution here is to not lock up the sender thread ever. The same goes for the receiver thread. 

The code is a bit of a sync spaghetti mess, but Tomcat 6.0 has a fix for this that prevents these two threads from locking up.

TC 6 also has secondary verification mechanisms that are unrelated to this.

You'd be better off backporting the fix from Tomcat 6 to Tomcat 5.
Comment 4 Mark Thomas 2009-07-01 05:31:22 UTC
Patch withdrawn based on Filip's comment
Comment 5 Mark Thomas 2009-09-13 11:14:42 UTC
Created attachment 24253 [details]
Updated patch for this issue

I found the time to take another look at this.

Whilst Filip's comment about threads locking up is correct - and Tomcat 6 does have a fix for that - threads locking up is not at the root of this issue. At the root of this issue is that there are two lists of cluster members: one in McastServiceImpl.membership and one in ReplicationTransmitter.map.

Whilst checkExpire() does update both lists inside the sync on expiredMutex, the receiver thread updates McastServiceImpl.membership outside of this mutex. That leads to the problem that the OP is describing here.

Whilst Tomcat 6 does contain a fix for this, the code bases have diverged sufficiently that the fix would be invasive. Therefore I am proposing a patch for Tomcat that is similar to my earlier patch but has a slightly wider sync block based on my better understanding of this issue.

I have tested the patch and whilst I can force this issue using a debugger without the patch, I cannot force it with the patch in place.
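
The idea of a wider sync block can be sketched as follows. This is an illustration under assumptions, not the attached patch: the class GuardedMembers and its methods are hypothetical, and the two sets stand in for McastServiceImpl.membership and ReplicationTransmitter.map. Both the receiver path and the expiry path update the two lists while holding the same mutex, so neither thread can observe one list changed and the other not.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: both member lists are only ever changed under one mutex.
class GuardedMembers {
    private final Object expiredMutex = new Object();
    private final Set<String> membership = new HashSet<>();
    private final Set<String> transmitter = new HashSet<>();

    // Receiver path: add to both lists as one atomic step.
    void memberAlive(String member) {
        synchronized (expiredMutex) {
            if (membership.add(member)) {
                transmitter.add(member);
            }
        }
    }

    // Expiry path: remove from both lists as one atomic step.
    void expire(String member) {
        synchronized (expiredMutex) {
            if (membership.remove(member)) {
                transmitter.remove(member);
            }
        }
    }

    // True when both lists hold exactly the same members.
    boolean consistent() {
        synchronized (expiredMutex) {
            return membership.equals(transmitter);
        }
    }

    // Lets a "receiver" and a "sender" thread race; returns the final consistency.
    static boolean stress(int rounds) {
        GuardedMembers g = new GuardedMembers();
        Thread receiver = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                g.memberAlive("ServerB");
            }
        });
        Thread sender = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                g.expire("ServerB");
            }
        });
        receiver.start();
        sender.start();
        try {
            receiver.join();
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return g.consistent();
    }
}
```

The sketch holds the mutex only for the two set updates; whether callbacks such as logging run inside or outside the lock is a separate design choice that it does not address.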
Comment 6 Mark Thomas 2009-11-30 16:33:13 UTC
Fixed in trunk. Many thanks.
Comment 7 Konstantin Kolinko 2010-03-11 13:58:50 UTC
As said in http://marc.info/?l=tomcat-dev&m=125934902622453&w=2
it is fixed in 5.5.29.
(Probably that comment disappeared in the Bugzilla data loss/rollback incident)

The commit that fixed the issue is r884960.

As mentioned in comment 5 and comment 3, Tomcat 6 and trunk are not affected by this issue.