Created attachment 32539 [details]
Crash log

With the patch below applied (it forces writes onto a separate thread when using HTTP upgrade), repeatedly running the org.apache.coyote.http11.upgrade.TestUpgrade unit test eventually (after a few minutes on OSX and Linux) triggers a JVM crash.

I suspect that a socket is closed in one thread while it is still in the Poller. However, I have not yet been able to identify the code path that triggers this. I have attached the crash report.

Index: java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java
===================================================================
--- java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java	(revision 1663237)
+++ java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java	(working copy)
@@ -21,11 +21,9 @@
 import javax.servlet.ServletOutputStream;
 import javax.servlet.WriteListener;

-import org.apache.coyote.ContainerThreadMarker;
 import org.apache.juli.logging.Log;
 import org.apache.juli.logging.LogFactory;
 import org.apache.tomcat.util.ExceptionUtils;
-import org.apache.tomcat.util.net.DispatchType;
 import org.apache.tomcat.util.net.SocketWrapperBase;
 import org.apache.tomcat.util.res.StringManager;

@@ -114,11 +112,11 @@
         synchronized (registeredLock) {
             registered = true;
             // Container is responsible for first call to onDataAvailable().
-            if (ContainerThreadMarker.isContainerThread()) {
-                socketWrapper.addDispatch(DispatchType.NON_BLOCKING_WRITE);
-            } else {
+            //if (ContainerThreadMarker.isContainerThread()) {
+            //    socketWrapper.addDispatch(DispatchType.NON_BLOCKING_WRITE);
+            //} else {
                 socketWrapper.registerWriteInterest();
-            }
+            //}
         }

         this.listener = listener;
NIO2 also did double closes of the sockets when running the WebSocket tests, before the latest refactoring. Synchronizing closeSocket could be the only solution.
Created attachment 32543 [details]
Safety patch for tcnative 1.1.x branch

Please patch tcnative with this and see if it catches the exact problem (likely "invalid socket"). Note that this is not a fix: it only does sanity checking on the pollset and the sockets being manipulated.
Thanks for the safety patch. I can see what it is doing (and it looks right) but the crash still occurs.
This is the line that is failing:
http://svn.eu.apache.org/viewvc/tomcat/native/branches/1.1.x/native/src/poll.c?view=annotate#l365

    APR_RING_REMOVE(s->pe, link);

The core dump shows s->pe is 0x0, so that explains why it is crashing. I should be able to add another safety check to prevent the crash. The next step is to figure out how it gets into that state in the first place.
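For illustration only, here is a minimal sketch of the kind of extra safety check described above. The structures below are simplified, hypothetical stand-ins for the real tc-native ones (the actual tcn_socket_t differs), so this is not the committed fix:

    #include <stddef.h>
    #include <apr_ring.h>

    /* Hypothetical, simplified stand-ins for the tc-native structures. */
    typedef struct tcn_pollset_elem_t tcn_pollset_elem_t;
    struct tcn_pollset_elem_t {
        APR_RING_ENTRY(tcn_pollset_elem_t) link;  /* ring linkage used by APR_RING_REMOVE */
    };

    typedef struct {
        tcn_pollset_elem_t *pe;  /* NULL once the socket has left the pollset */
    } tcn_socket_t;

    static void remove_socket_from_pollset(tcn_socket_t *s)
    {
        if (s->pe == NULL) {
            /* Already removed, e.g. by a second notification for the same
             * descriptor: skip rather than dereference a NULL ring element. */
            return;
        }
        APR_RING_REMOVE(s->pe, link);
        s->pe = NULL;  /* make a repeated removal for this socket a no-op */
    }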
I think I have got to the bottom of this. Looking at the docs for apr_pollset_poll [1], it states: "Multiple signalled conditions for the same descriptor may be reported in one or more returned apr_pollfd_t structures, depending on the implementation."

The crash occurs when the socket is registered for both read and write. The tc-native code assumes that there is only ever one returned structure per descriptor. The unit test plus the modified code I am using just happens - if the timing is right - to register a socket for read and write which almost instantly becomes available for both. It looks like OSX returns multiple apr_pollfd_t structures for multiple events on the same descriptor, and that triggers the crash.

I've thrown together a quick patch to test this theory, and the unit test has now been running in a loop for 20+ minutes without crashing, so I am fairly confident this is the issue.

[1] http://apr.apache.org/docs/apr/1.4/group__apr__poll.html#ga6b31d7b3a7b2d356370403dd2b79ecf3
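To make the broken assumption concrete, here is a hedged sketch (not the actual tc-native patch) of coalescing duplicate apr_pollfd_t entries after apr_pollset_poll() so that each descriptor is dispatched exactly once; handle_socket() is a hypothetical callback standing in for the real dispatch logic:

    #include <apr_poll.h>

    /* Hypothetical dispatch hook; in tc-native the merged result would feed
     * the existing poller return array instead. */
    static void handle_socket(apr_socket_t *sock, apr_int16_t events);

    static apr_status_t poll_and_merge(apr_pollset_t *pollset,
                                       apr_interval_time_t timeout)
    {
        const apr_pollfd_t *fds;
        apr_int32_t num, i, j;

        apr_status_t rv = apr_pollset_poll(pollset, timeout, &num, &fds);
        if (rv != APR_SUCCESS) {
            return rv;
        }

        for (i = 0; i < num; i++) {
            int seen = 0;
            apr_int16_t events = fds[i].rtnevents;

            /* Skip entries whose descriptor was already merged and dispatched. */
            for (j = 0; j < i; j++) {
                if (fds[j].desc.s == fds[i].desc.s) {
                    seen = 1;
                    break;
                }
            }
            if (seen) {
                continue;
            }

            /* kqueue (OSX/BSD) may report APR_POLLIN and APR_POLLOUT for the
             * same descriptor in separate entries: OR them together so the
             * socket is removed from the pollset and dispatched only once. */
            for (j = i + 1; j < num; j++) {
                if (fds[j].desc.s == fds[i].desc.s) {
                    events |= fds[j].rtnevents;
                }
            }

            handle_socket(fds[i].desc.s, events);
        }
        return APR_SUCCESS;
    }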
I had a look at the origins of that APR comment:
http://svn.apache.org/viewvc?view=revision&revision=748951

It was specifically added to allow for an APR-internal optimization used in combination with the kqueue pollset implementation. And kqueue is the default pollset implementation of APR on OSX (and BSD). Quoting:

"Allow the kqueue pollset implementation to support checking both APR_POLLIN and APR_POLLOUT for the same descriptor. This requires separate kevent structures for both conditions. Importantly, if both conditions are present, they will be returned in separate apr_pollfd_t structures with this implementation (to avoid the cycles to coalesce into the minimal number of structures)."

So that fits nicely with your "The crash is occurring when the socket is registered for read and write." OTOH that means it would not in an obvious way provide a reason if this problem occurred on platforms other than OSX or BSD. Although the code comment doesn't limit itself to kqueue ...

Regards, Rainer
Not knowing a great deal about the underlying issue...

If the fd is registered for both READ and WRITE, and we get multiple "responses" from apr_pollset_poll, are those happening (somewhat) simultaneously in two separate threads? So one of them fetches the fd, then the other fetches the same fd, then one of them removes the fd, then the other tries to do that same thing?

Or do we need to examine the return from apr_pollset_poll to find out whether we got the READ versus the WRITE notification and make sure it's the one we were expecting?
(In reply to Christopher Schultz from comment #7)
> Not knowing a great deal about the underlying issue...

It is worth reading AprEndpoint.Poller.run() and the native implementation of poll().

> If the fd is registered for both READ and WRITE, and we get multiple
> "responses" from apr_pollset_poll, are those happening (somewhat)
> simultaneously in two separate threads?

No.

> So one of them fetches the fd, then the other fetches the same fd, then
> one of them removes the fd, then the other tries to do that same thing?

No. It happens sequentially in a single thread.

> Or do we need to examine the return from apr_pollset_poll to find out
> whether we got the READ versus the WRITE notification and make sure it's
> the one we were expecting?

That is handled in Poller.run().

The fundamental problem is that poll() assumed there would never be multiple notifications for the same socket. That assumption was not valid.
*** Bug 56313 has been marked as a duplicate of this bug. ***
(In reply to Rainer Jung from comment #6)
> So that fits nicely with your "The crash is occurring when the socket is
> registered for read and write." OTOH that means it would not in an obvious
> way provide a reason if this problem occurred on platforms other than OSX
> or BSD. Although the code comment doesn't limit itself to kqueue ...

Bug 56313, which appears to be the same issue, is happening on Windows. It also appears to involve a standard web application without concurrent reads and writes.

The poller event merging code added to APR probably needs a few tweaks:
- enable it for all operating systems
- add some debug logging showing which events are being merged, in case there is something else going on that we haven't figured out yet

The patch I mentioned in comment #5 has been applied to tomcat-native 1.1.x and I am planning a 1.1.33 release shortly.
This has been fixed for Tomcat 8.0.21 and tc-native 1.1.33 onwards, i.e. you need at least these versions of both components to avoid this bug.

It would be helpful if folks experiencing these issues on platforms other than OSX and BSD installed the above versions (when available), enabled debug logging for AprEndpoint, and reported any poller event merge messages that are logged.
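For reference, a hedged example of how that debug logging might be enabled in Tomcat's conf/logging.properties (JULI maps debug to FINE; the handler names and levels depend on your existing configuration, so treat this as an assumed fragment rather than a drop-in file):

    # Assumed fragment for conf/logging.properties: enable debug (FINE)
    # output for the APR endpoint so poller event merge messages are logged.
    org.apache.tomcat.util.net.AprEndpoint.level = FINE
    java.util.logging.ConsoleHandler.level = FINE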
Where can I download the new version of Tomcat and the DLL from? Do I need to compile it myself?
(In reply to Pavel from comment #12)
> Where can I download the new version of Tomcat and the DLL from? Do I
> need to compile it myself?

I'm working on the tc-native 1.1.33 release at the moment. A release candidate will hopefully be available for testing later today (otherwise tomorrow). Keep an eye on the dev list for details.

An 8.0.21 release will follow once we have a successful 1.1.33 release.
I saw that 1.1.33 was released. Just wanted to confirm: can this be used with Tomcat 7, or must I upgrade Tomcat as well? (I thought the fix was in the tc-native DLL itself.) Currently using Tomcat 7.0.54. Thanks.
Upgrading tc-native will avoid the crash, but you may still experience a bug where Tomcat doesn't recognise that a connection is ready for read/write. You'll need Tomcat 7.0.60 onwards for that part of the fix.
Got it. Do you have an ETA for Tomcat 7.0.60? Thanks again for the quick reply.