Bug 49051 - Decrease in response by TcpFailureDetector.
Summary: Decrease in response by TcpFailureDetector.
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 6
Classification: Unclassified
Component: Cluster
Version: 6.0.26
Hardware: All All
Importance: P2 normal
Target Milestone: default
Assignee: Tomcat Developers Mailing List
Reported: 2010-04-06 09:12 UTC by Keiichi Fujino
Modified: 2010-04-09 08:47 UTC



Attachments
TcpFailureDetector's patch (924 bytes, text/plain)
2010-04-06 09:13 UTC, Keiichi Fujino
Description Keiichi Fujino 2010-04-06 09:12:14 UTC
[Configuration]
Cluster configuration with TcpFailureDetector enabled.
Synchronous replication.

A ChannelException is thrown when the destination node goes down during session replication.
TcpFailureDetector catches the ChannelException
and verifies the member in TcpFailureDetector#memberDisappeared.

In the TcpFailureDetector#memberAlive method,
the member that failed during replication is checked to see whether it really is down.
Because the member is already gone, TcpFailureDetector#memberAlive times out after 1 sec (the default).
The member is then removed from the membership by membership#removeMember,
and super.memberDisappeared(member) is called.

TcpFailureDetector#memberDisappeared is as follows. 
===
public void memberDisappeared(Member member) {
...skip
    synchronized (membership) {
        //check to see if the member really is gone
        //if the payload is not a shutdown message
        if (shutdown || !memberAlive(member)) {
            //not correct, we need to maintain the map
            membership.removeMember( (MemberImpl) member);
            removeSuspects.remove(member);
            notify = true;
        } else {
            //add the member as suspect
            removeSuspects.put(member, new Long(System.currentTimeMillis()));
        }
    }
...skip
}
===
Every thread that waits to acquire the membership lock calls the memberAlive method once it holds the lock.
Each of those calls times out after 1 sec.
As a result,
under high concurrency, response time can degrade severely.

For instance,
when 100 threads are waiting for the membership lock,
the thread that acquires the lock last cannot return its response for 100 sec.
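
The arithmetic behind that figure can be sketched as a small model. This is only an illustration and not Tomcat code: the class and method names are hypothetical, and it assumes that each waiting thread runs exactly one failing probe while holding the lock, so the probes run strictly one after another.

```java
import java.util.ArrayList;
import java.util.List;

public class SerializedTimeoutModel {

    /**
     * Models threads that each run one failing probe while holding the same lock.
     * Element i is the time (ms) at which the (i+1)-th thread to get the lock finishes.
     * Assumption: the probes are fully serialized and each takes the whole timeout.
     */
    public static List<Long> completionTimesMillis(int waitingThreads, long probeTimeoutMillis) {
        List<Long> times = new ArrayList<>();
        for (int i = 1; i <= waitingThreads; i++) {
            times.add(i * probeTimeoutMillis);
        }
        return times;
    }

    public static void main(String[] args) {
        List<Long> times = completionTimesMillis(100, 1000L);
        System.out.println("first thread finishes after " + times.get(0) + " ms");
        System.out.println("last thread finishes after " + times.get(times.size() - 1) + " ms");
    }
}
```

With 100 threads and a 1000 ms timeout, the last entry is 100000 ms, which matches the 100 sec above.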

If the member no longer exists in the membership, the TcpFailureDetector#memberAlive method does not need to be called.
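
The idea can be sketched as follows. This is a minimal standalone model, not the attached patch and not the Tomcat classes: `GuardedDisappearSketch`, the `Set<String>` membership, and the always-failing probe are hypothetical stand-ins. It assumes the node is down, so every probe fails, and it counts probes instead of sleeping for the timeout.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedDisappearSketch {
    // Stand-in for the cluster membership; its monitor plays the role of the membership lock.
    private final Set<String> membership = new HashSet<>();
    // Counts how many liveness probes were issued.
    private final AtomicInteger probes = new AtomicInteger();

    public GuardedDisappearSketch(String... members) {
        for (String m : members) {
            membership.add(m);
        }
    }

    // Hypothetical probe: the node is assumed to be down, so it always fails.
    // A real probe would block for the connect timeout before failing.
    private boolean memberAlive(String member) {
        probes.incrementAndGet();
        return false;
    }

    /** Returns true if this call removed the member from the membership. */
    public boolean memberDisappeared(String member) {
        synchronized (membership) {
            if (!membership.contains(member)) {
                return false; // already removed by another thread: skip the probe
            }
            if (!memberAlive(member)) {
                membership.remove(member);
                return true;
            }
            return false;
        }
    }

    /** Reports the same dead member from several threads and returns the number of probes issued. */
    public static int simulate(int threads) {
        GuardedDisappearSketch detector = new GuardedDisappearSketch("nodeB");
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> detector.memberDisappeared("nodeB"));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return detector.probes.get();
    }

    public static void main(String[] args) {
        System.out.println("probes issued: " + simulate(100));
    }
}
```

In this model only the first thread to take the lock pays for a probe; the other 99 find the member already removed and return immediately.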

I made a patch.

Best regards.
Comment 1 Keiichi Fujino 2010-04-06 09:13:50 UTC
Created attachment 25233
TcpFailureDetector's patch

I made a patch.
Comment 2 Keiichi Fujino 2010-04-06 09:39:29 UTC
Fixed in trunk and proposed for 6.0.x.
Comment 3 Keiichi Fujino 2010-04-09 08:47:12 UTC
This fix has been applied to 6.0.x and will be included in 6.0.27 onwards.