Summary: | Due to missing synchronization, a member may disappear permanent. | ||
---|---|---|---|
Product: | Tomcat 5 | Reporter: | Martin Harm <mharm> |
Component: | Catalina:Cluster | Assignee: | Tomcat Developers Mailing List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | major | ||
Priority: | P2 | ||
Version: | 5.5.27 | ||
Target Milestone: | --- | ||
Hardware: | All | ||
OS: | All | ||
Attachments: |
Patch to fix this issue
Updated patch for this issue |
Description
Martin Harm
2008-12-12 00:14:35 UTC
Created attachment 23501 [details]
Patch to fix this issue
The attached patch should fix this although I haven't tested it.
Might be an idea to make the field "memebrshipMutex" (sic) final, as otherwise the synchronisation is not guaranteed to work. (In reply to comment #1) > Created an attachment (id=23501) [details] > Patch to fix this issue > > The attached patch should fix this although I haven't tested it. I don't think that patch will fix it. The key problem here is that if the sender thread gets locked up, it will stop broadcast the member itself, and other nodes will deem it gone. The only solution here is to not lock up the sender thread ever. The same goes for the receiver thread. The code is a bit of a sync spaghetti mess, but Tomcat 6.0 has the fix for this, that will prevent it from locking up these two threads. TC 6 also has secondary verification mechanism, that are unrelated to this. You'd be better off backporting the fix from Tomcat 6 to Tomcat 5 Patch withdrawn based on Filip's comment Created attachment 24253 [details]
Updated patch for this issue
I found the time to take another look at this.
Whilst Filip's comment about threads locking up is correct - and Tomcat 6 does have a fix for that - threads locking up is not at the root of this issue. At the root of this issue is there there are two lists of cluster members. One in McastServiceImpl.membership and one in ReplicationTransmitter.map
Whilst checkExpire() does update both lists with the sync on expiredMutex, the receiver thread updates the McastServiceImpl.membership outside of this mutex. That leads to the problem that the OP is describing here.
Whilst Tomcat 6 does contain a fix for this, the code bases have diverged sufficiently that the fix would be invasive. Therefore I am proposing a patch for Tomcat that is similar to my earlier patch but has a slightly wider sync block based on my better understanding of this issue.
I have tested the patch and whilst I can force this issue using a debugger without the patch, I can not force it with the patch in place.
Fixed in trunk. Many thanks. As said in http://marc.info/?l=tomcat-dev&m=125934902622453&w=2 it is fixed in 5.5.29. (Probably that comment disappeared in the Bugzilla data loss/rollback incident) The commit that fixed the issue is r884960. As mentioned in comment 5 and comment 3, Tomcat 6 and trunk are not affected by this issue. |