Bug 57653 - APR/native crash during HTTP upgrade
Summary: APR/native crash during HTTP upgrade
Status: RESOLVED FIXED
Alias: None
Product: Tomcat Native
Classification: Unclassified
Component: Library (show other bugs)
Version: 1.1.32
Hardware: Macintosh All
: P2 normal (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
: 56313 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-03-02 10:26 UTC by Mark Thomas
Modified: 2015-03-25 09:42 UTC (History)
1 user (show)



Attachments
Crash log (38.03 KB, text/plain)
2015-03-02 10:26 UTC, Mark Thomas
Details
Safety patch for tcnative 1.1.x branch (1.23 KB, patch)
2015-03-03 21:24 UTC, Christopher Schultz
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Thomas 2015-03-02 10:26:30 UTC
Created attachment 32539 [details]
Crash log

With the patch below applied (it forces writes onto a separate thread when using HTTP upgrade) repeated running of the org.apache.coyote.http11.upgrade.TestUpgrade unit test eventually (after a few minutes on OSX and Linux) triggers a JVM crash.

I suspect that a socket is closed in one thread while it is still in the Poller. However, I have not yet been able to identify the code path that triggers this.

I have attached the crash report.



Index: java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java
===================================================================
--- java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java	(revision 1663237)
+++ java/org/apache/coyote/http11/upgrade/UpgradeServletOutputStream.java	(working copy)
@@ -21,11 +21,9 @@
 import javax.servlet.ServletOutputStream;
 import javax.servlet.WriteListener;
 
-import org.apache.coyote.ContainerThreadMarker;
 import org.apache.juli.logging.Log;
 import org.apache.juli.logging.LogFactory;
 import org.apache.tomcat.util.ExceptionUtils;
-import org.apache.tomcat.util.net.DispatchType;
 import org.apache.tomcat.util.net.SocketWrapperBase;
 import org.apache.tomcat.util.res.StringManager;
 
@@ -114,11 +112,11 @@
         synchronized (registeredLock) {
             registered = true;
             // Container is responsible for first call to onDataAvailable().
-            if (ContainerThreadMarker.isContainerThread()) {
-                socketWrapper.addDispatch(DispatchType.NON_BLOCKING_WRITE);
-            } else {
+            //if (ContainerThreadMarker.isContainerThread()) {
+            //    socketWrapper.addDispatch(DispatchType.NON_BLOCKING_WRITE);
+            //} else {
                 socketWrapper.registerWriteInterest();
-            }
+            //}
         }
 
         this.listener = listener;
Comment 1 Remy Maucherat 2015-03-02 11:14:57 UTC
NIO2 also did double closes of the sockets running the websockets tests before the latest refactoring. Syncing closeSocket could be the only solution.
Comment 2 Christopher Schultz 2015-03-03 21:24:25 UTC
Created attachment 32543 [details]
Safety patch for tcnative 1.1.x branch

Please patch tcnative with this and see if it catches the exact problem (likely "invalid socket"). Note that this is not a fix: it's only doing sanity-checking on the pollset and the sockets being manipulated.
Comment 3 Mark Thomas 2015-03-11 10:30:07 UTC
Thanks for the safety patch. I can see what it is doing (and it looks) right but the crash still occurs.
Comment 4 Mark Thomas 2015-03-11 12:10:44 UTC
This is the line that is failing:
http://svn.eu.apache.org/viewvc/tomcat/native/branches/1.1.x/native/src/poll.c?view=annotate#l365

APR_RING_REMOVE(s->pe, link);

And the core dump shows s->pe is 0x0.

So that explains why it is crashing.

I should be able to add another safety check for that to prevent the crash.

Next step is to figure out how it gets into that state in the first place.
Comment 5 Mark Thomas 2015-03-11 14:35:12 UTC
I think I have got to the bottom of this.

Looking at the docs for apr_pollset_poll [1], it states "Multiple signalled conditions for the same descriptor may be reported in one or more returned apr_pollfd_t structures, depending on the implementation.".

The crash is occurring when the socket is registered for read and write.

The tc-native code assumes that there is only ever one returned structure per descriptor.

The unit test + modified code I am using just happens - if the timing is right - to register a socket for read and write which almost instantly becomes available for both. It looks like OSX rerurns mutiple apr_pollfd_t structures for mutliple events on the same descriptor and that triggers the crash.

I've thrown together a quick patch to test this theory and the unit test has been running in a loop now for 20+ minutes without crashing so I am fairly confident this is the issue.


[1] http://apr.apache.org/docs/apr/1.4/group__apr__poll.html#ga6b31d7b3a7b2d356370403dd2b79ecf3
Comment 6 Rainer Jung 2015-03-11 16:27:48 UTC
I had a look at the origins of that APR comment:

http://svn.apache.org/viewvc?view=revision&revision=748951

It was especialy added to allow for an APR internal optimization when used in combination with the kqueue pollset implementation. And kqueue is the default pollset implof APR on OSX (and BSD).

Quoting:

"Allow the kqueue pollset implementation to support checking both APR_POLLIN and
APR_POLLOUT for the same descriptor.  This requires separate kevent structures
for both conditions.

Importantly, if both conditions are present, they will be returned in separate
apr_pollfd_t structures with this implementation (to avoid the cycles to coalesce 
into the minimal number of structures)."

So that fits nicely with your "The crash is occurring when the socket is registered for read and write." OTOH that means it would not in an obvious way provide a reason if this problem occured on other platforms than OSX or BSD. Although the code comment doesn't limit itself to kqueue ...

Regards,

Rainer
Comment 7 Christopher Schultz 2015-03-11 18:20:13 UTC
Not knowing a great deal about the underlying issue...

If the fd is registered for both READ and WRITE, and we get multiple "responses" from apr_pollset_poll, are those happening (somewhat) simultaneously in two separate threads? So one of them fetches the fd, then the other fetches the same fd, then one of them removes the fd, then the other tries to do that same thing?

Or do we need to examine the return from apr_pollset_poll to find out if we got the READ versus WRITE notification and make sure it's the one we were expecting?
Comment 8 Mark Thomas 2015-03-11 18:24:03 UTC
(In reply to Christopher Schultz from comment #7)
> Not knowing a great deal about the underlying issue...

It is worth reading AprEndpoint.Poller.run() and the native implementation of poll()

> If the fd is registered for both READ and WRITE, and we get multiple
> "responses" from apr_pollset_poll, are those happening (somewhat)
> simultaneously in two separate threads?

No.

> So one of them fetches the fd, then
> the other fetches the same fd, then one of them removes the fd, then the
> other tries to do that same thing?

No. It happens sequentially in a single thread.

> Or do we need to examine the return from apr_pollset_poll to find out if we
> got the READ versus WRITE notification and make sure it's the one we were
> expecting?

That is handled in Poller.run()

The fundamental problem is that poll() assumed that there would never be multiple notifications for the same socket. That assumption was not valid.
Comment 9 Mark Thomas 2015-03-15 23:48:39 UTC
*** Bug 56313 has been marked as a duplicate of this bug. ***
Comment 10 Mark Thomas 2015-03-15 23:53:25 UTC
(In reply to Rainer Jung from comment #6)

> So that fits nicely with your "The crash is occurring when the socket is
> registered for read and write." OTOH that means it would not in an obvious
> way provide a reason if this problem occured on other platforms than OSX or
> BSD. Although the code comment doesn't limit itself to kqueue ...

Bug 56313 which appears to be the same issue is happening on Windows. It also appears to be a standard web applicaion without concurrent read/write.

The poller event merging code added to APR probably needs a few tweaks:
- enable it for all OS's
- add some debug logging as to what events are being merged incase there is something else going on we haven't figured out yet.

The patch I mentioned in comment #5 has been applied to tomcat-native 1.1.x and I am planning a 1.1.33 release shortly.
Comment 11 Mark Thomas 2015-03-16 13:58:31 UTC
This has been fixed for Tomcat 8.0.21 and tc-native 1.1.33 onwards. i.e. you need at least these versions of both components to avoid this bug.

It would be helpful if folks experiencing these issues on platforms other than OSX and BSD installed the above versions (when available) and enabled debug logging for AprEndpoint and reported any poller event merge messages that are logged.
Comment 12 Pavel 2015-03-16 14:57:27 UTC
where can I download from the new version of tomcat and the dll? do I need to compile it myself?
Comment 13 Mark Thomas 2015-03-16 15:01:21 UTC
(In reply to Pavel from comment #12)
> where can I download from the new version of tomcat and the dll? do I need
> to compile it myself?

I'm working on the tc-native 1.1.33 release at the moment. A release candidate will hopefully be available for testing later today (otherwise tomorrow). Keep an eye on the dev list for details.

A 8.0.21 release will follow once we have a successful 1.1.33 release.
Comment 14 Pavel 2015-03-25 09:21:17 UTC
I saw that 1.1.33 was released - just wanted to confirm, can this be used with tomcat 7? or must i upgrade the tomcat as well? (I thought the fix was in the tc-native dll itself).

currently using tomcat 7.0.54.

Thanks.
Comment 15 Mark Thomas 2015-03-25 09:29:38 UTC
Upgrading tc-native will avoid the crash but may still experience a bug where Tomcat doesn't recognise that a connection is ready for read/write. You'll need Tomcat 7.0.60 onwards for that part of the fix.
Comment 16 Pavel 2015-03-25 09:42:30 UTC
Got it.

Do you have an ETA for tomcat 7.0.60?

Thanks again for the quick reply.