[Configuration]
Cluster configuration. TcpFailureDetector is used. Synchronous replication.

A ChannelException is thrown when the destination node goes down during session replication. The ChannelException is caught by TcpFailureDetector, which then verifies the member in TcpFailureDetector#memberDisappeared. There, the TcpFailureDetector#memberAlive method checks whether the member that failed replication really is down. Because the member is already gone, TcpFailureDetector#memberAlive blocks until it times out (default 1 sec). The member is then removed from the membership by membership#removeMember, and super.memberDisappeared(member) is called.

TcpFailureDetector#memberDisappeared is as follows:

===
public void memberDisappeared(Member member) {
    ...skip
    synchronized (membership) {
        //check to see if the member really is gone
        //if the payload is not a shutdown message
        if (shutdown || !memberAlive(member)) {
            //not correct, we need to maintain the map
            membership.removeMember((MemberImpl) member);
            removeSuspects.remove(member);
            notify = true;
        } else {
            //add the member as suspect
            removeSuspects.put(member, new Long(System.currentTimeMillis()));
        }
    }
    ...skip
}
===

Every thread waiting to acquire the membership lock calls the memberAlive method in turn, and each call blocks until the 1 sec timeout expires. As a result, under high concurrency, response times can degrade severely: for instance, if 100 threads are waiting for the membership lock, the thread that acquires the lock last cannot return its response for 100 sec. If the member no longer exists in the membership, the TcpFailureDetector#memberAlive method need not be called, as sketched below.

I made a patch.

Best regards.
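For illustration, here is a minimal sketch of that short-circuit, written as a fragment in the style of the excerpt above. It is not the attached patch itself: the guard tests whether the member is still present before paying for the memberAlive probe, and the presence test via membership.getMember(member) is an assumption made here for illustration; the real patch may express the check differently.

===
public void memberDisappeared(Member member) {
    ...skip
    synchronized (membership) {
        // Assumed guard (illustration only): if an earlier thread
        // already removed this member, skip the memberAlive probe,
        // which otherwise blocks for up to the 1 sec timeout.
        if (membership.getMember(member) == null) {
            return;
        }
        if (shutdown || !memberAlive(member)) {
            membership.removeMember((MemberImpl) member);
            removeSuspects.remove(member);
            notify = true;
        } else {
            removeSuspects.put(member, new Long(System.currentTimeMillis()));
        }
    }
    ...skip
}
===

With such a guard, only the first thread to acquire the lock pays the timeout; the remaining threads in the 100-thread example above return immediately instead of each queueing up for another 1 sec probe.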
Created attachment 25233 [details]
TcpFailureDetector patch

I made a patch.
Fixed in trunk and proposed for 6.0.x.
This fix has been applied to 6.0 and will be in 6.0.27 onwards.