|Summary:||NPE in DeltaRequest.writeExternal(DeltaRequest.java:267) when handling remote session expiration|
|Product:||Tomcat 6||Reporter:||Dmitry Mikhaylov <mikhailov.dmitry>|
|Component:||Cluster||Assignee:||Tomcat Developers Mailing List <dev>|
Description Dmitry Mikhaylov 2011-06-01 12:21:28 UTC
Occasionally we got errors like following: ====== 2011-05-30 03:40:17,697 ERROR [pool-1-thread-2] (org.apache.catalina.ha.session.DeltaManager) Manager [localhost#]: Unable to receive message through TCP channel java.lang.NullPointerException at java.io.ObjectOutputStream$BlockDataOutputStream.getUTFLength(ObjectOutputStream.java:2106) at java.io.ObjectOutputStream$BlockDataOutputStream.writeUTF(ObjectOutputStream.java:1977) at java.io.ObjectOutputStream.writeUTF(ObjectOutputStream.java:849) at org.apache.catalina.ha.session.DeltaRequest.writeExternal(DeltaRequest.java:267) at org.apache.catalina.ha.session.DeltaRequest.serialize(DeltaRequest.java:287) at org.apache.catalina.ha.session.DeltaManager.serializeDeltaRequest(DeltaManager.java:716) at org.apache.catalina.ha.session.DeltaManager.requestCompleted(DeltaManager.java:1224) at org.apache.catalina.ha.session.DeltaSession.expire(DeltaSession.java:403) at org.apache.catalina.ha.session.DeltaManager.handleSESSION_EXPIRED(DeltaManager.java:1546) at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1452) at org.apache.catalina.ha.session.DeltaManager.messageDataReceived(DeltaManager.java:1173) at org.apache.catalina.ha.session.ClusterSessionListener.messageReceived(ClusterSessionListener.java:92) at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:901) at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:882) at org.apache.catalina.tribes.group.GroupChannel.messageReceived(GroupChannel.java:269) at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.messageReceived(TcpFailureDetector.java:110) at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:79) at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:79) at org.apache.catalina.tribes.group.ChannelCoordinator.messageReceived(ChannelCoordinator.java:241) at org.apache.catalina.tribes.transport.ReceiverBase.messageDataReceived(ReceiverBase.java:225) at org.apache.catalina.tribes.transport.nio.NioReplicationTask.drainChannel(NioReplicationTask.java:188) at org.apache.catalina.tribes.transport.nio.NioReplicationTask.run(NioReplicationTask.java:91) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) It looks like the functionality of Tomcat is not affected by this. The NPE comes from trying to write null sessionId, but why does Tomcat try to send something from handling received authoritative SESSION_EXPIRED message?
Comment 1 Mark Thomas 2011-06-02 17:37:37 UTC
http://svn.apache.org/viewvc?view=revision&revision=818062 explains why this code is here. It looks like a race condition to me at the moment but more research is required. Just for background, how many nodes are there in your cluster?
Comment 2 Dmitry Mikhaylov 2011-06-07 07:46:15 UTC
Our clusters vary in size from 2 to 11 Tomcats, these problems happen in all configurations.
Comment 3 Dmitry Mikhaylov 2011-06-07 07:46:49 UTC
I'm not sure http://svn.apache.org/viewvc?view=revision&revision=818062 explains what is happening. While what the comment says is totally true, the stack trace I pasted shows that the problem happens on *secondary nodes*: while handling received remove session expiration event, for some reasons DeltaManager tries to broadcast it to the whole cluster again.
Comment 4 Keiichi Fujino 2011-06-13 08:57:57 UTC
There seem to be two problems in this stack trace. The first problem is DeltaRequest#getSize() > 1 in non-primary node . Usually, DeltaRequest#getSize() must be 0 in non-primary node. and, r818062 works correctly. However, r818062 doesn't work correctly because DeltaRequest#getSize() is more than 1 in this case. This is a potential bug. The second problem is DeltaRequest#sessionId() is null in non-primary node . When handleSESSION_CREATED is completed, DeltaRequest#sessionId is always non-null. When setMaxInactiveInterval is called in handleSESSION_CREATED, the action is added to DeltaRequest. As a result, it becomes DeltaRequest#getSize() >1. This is probably bug. For instance, when handleSESSION_EXPIRED is processed while processing handleSESSION_CREATED, sessionId of DeltaRequest will become null and DeltaRequest#getSize() > 1. And, NPE will be thrown out. The workaround is.. Correct these bugs or Synchronize between handleSESSION_CREATED and handleSESSION_EXPIRED. I will correct not to add action to DeltaRequest when setMaxInactiveInterval is called in handleSESSION_CREATED. Additionally, I will correct the potential bug of r818062.
Comment 5 Keiichi Fujino 2011-06-13 09:52:21 UTC
Fixed in 7.0.x and will be included in 7.0.17 onwards. Proposed for 6.0.x
Comment 6 Ronald Klop 2011-06-14 10:37:39 UTC
Is there a workaround for this issue? What should I not do in my code to prevent this? The cause r818062 is from 2009, so I must go back very far to get a Tomcat version which doesn't have this bug, with the risk, that I get back old bugs.
Comment 7 Filip Hanik 2011-06-15 01:50:50 UTC
It's a bug in Tomcat, some funky cyclic implementation that I still fully don't understand. Invoking listeners on expiring sessions....maybe valid...maybe not needed.... However, does this bug really cause a problem in your app? The session is going to expire no matter what, even if this error message is printed. Besides the error message in the log, is your app actually experiencing any symptoms from this? best Filip
Comment 8 Ronald Klop 2011-06-15 08:59:35 UTC
I have customers who are 'logged out' in their words and that correlates with the times of these errors in the logs. No hard evidence though. And the weirdest thing is that it only happens since two or three weeks.
Comment 9 Keiichi Fujino 2011-06-21 11:36:50 UTC
This has been fixed in 6.0.x and will be included in 6.0.33 onwards.
Comment 10 Dmitry Mikhaylov 2011-06-28 09:16:36 UTC
Thanks a lot!