Bug 58157 - RecycleRequiredException and missing AsyncListener.onTimeout call
RecycleRequiredException and missing AsyncListener.onTimeout call
Product: Tomcat 7
Classification: Unclassified
Component: Catalina
PC All
: P2 normal (vote)
: ---
Assigned To: Tomcat Developers Mailing List
Depends on:
  Show dependency tree
Reported: 2015-07-20 13:32 UTC by cg.throwaway
Modified: 2015-08-17 14:05 UTC (History)
0 users

Servlet which starts asynchronous processing and intentionally waits for a timeout before dispatching to a static HTML file. (1.38 KB, text/plain)
2015-07-20 13:32 UTC, cg.throwaway
JMeter script that continuously does a GET on the test servlet and verifies the response is 200. (8.66 KB, application/xml)
2015-08-07 13:02 UTC, cg.throwaway
Zip containing Tomcat 7.0.x logs after running JMeter script for an hour. (11.68 KB, application/x-zip-compressed)
2015-08-12 12:28 UTC, cg.throwaway
WAR file for servlet used to recreate this bug. (2.33 KB, application/zip)
2015-08-12 12:38 UTC, cg.throwaway

Note You need to log in before you can comment on or make changes to this bug.
Description cg.throwaway 2015-07-20 13:32:13 UTC
Created attachment 32916 [details]
Servlet which starts asynchronous processing and intentionally waits for a timeout before dispatching to a static HTML file.

When repeatedly polling an asynchronous servlet which (intentionally) times out, I occasionally see the following exception in the console:

Jul 20, 2015 8:58:45 AM org.apache.catalina.connector.CoyoteAdapter checkRecycled
INFO: Encountered a non-recycled request and recycled it forcedly.
	at org.apache.catalina.connector.CoyoteAdapter.checkRecycled(CoyoteAdapter.java:590)
	at org.apache.coyote.http11.AbstractHttp11Processor.recycle(AbstractHttp11Processor.java:1809)
	at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.release(Http11NioProtocol.java:199)
	at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:1100)
	at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1466)
	at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1234)
	at java.lang.Thread.run(Thread.java:745)

This coincides with a request not receiving any response.

The majority of times the servlet is called, the correct response is returned as expected once the request times out, however on occasion (sometimes within 1 min, sometimes not until 30 mins of successful requests have passed) I see this exception and AsyncListener.onTimeout() is never called to produce the response.

Steps to reproduce:
0) Configure Tomcat to use the NIO connector
1) Deploy the attached TimeoutServlet (note: add a static HTML page 'timeout.html' to the root of the web application to serve as the timeout response).
2) Using a tool of your choice (I used JMeter), repeatedly send a request to the servlet and wait for the timeout response before sending the next request.

Actual Results:
Eventually, one of the requests will receive no response and the above exception will be written to the console.

Expected Results:
Every request should receive the timeout response (timeout.html) after 1 second and no exception should appear in the console.

Additional Information:
Appears to be very similar to https://bz.apache.org/bugzilla/show_bug.cgi?id=57011.
Comment 1 cg.throwaway 2015-07-20 14:00:32 UTC
Just to add: have also seen this issue in 7.0.63.
Comment 2 Mark Thomas 2015-08-03 20:39:39 UTC
Something that may take 30 minutes to reproduce is going to be pretty much impossible to debug. It looks like a rare timing problem. We can try and find it by code inspection but anything you can do to provide an easier to reproduce test case would be appreciated.
Comment 3 Mark Thomas 2015-08-04 13:05:29 UTC
I think what you are seeing is a symptom of bug 57943 where the same socket ends up in the Poller twice. That would explain the stack trace you see.

Are you able to build Tomcat 7.0.x from svn and test it? If not, you'll need to wait for the next 7.0.x release.

I'm leaving this as NEEDINFO for now as we need to know if this issue is still reproducible with the current 7.0.x trunk.
Comment 4 cg.throwaway 2015-08-07 11:23:24 UTC
Tried the test servlet on the latest Tomcat 7.0.x trunk build and still seeing the checkRecycled exceptions in the Tomcat console.
Comment 5 cg.throwaway 2015-08-07 13:02:44 UTC
Created attachment 32978 [details]
JMeter script that continuously does a GET on the test servlet and verifies the response is 200.
Comment 6 cg.throwaway 2015-08-07 13:04:08 UTC
For info - the checkRecycled exception was, on average, being generated within 1 minute of the JMeter script being run.
Comment 7 Mark Thomas 2015-08-11 20:40:16 UTC
Hi. Could you give 7.0.x trunk another try please. I've fixed an issue that might be the cause of this bug. I don't think it is but it is worth checking and I haven't been able to reproduce this yet.
Comment 8 cg.throwaway 2015-08-12 12:26:37 UTC
Hi, built the latest 7.0.x trunk and tried it today. Still seeing the checkRecycled exception - first one 4 minutes after starting the JMeter script.
I will upload the Tomcat logs if they help at all - I ran the JMeter script against the test servlet for an hour.
Comment 9 cg.throwaway 2015-08-12 12:28:36 UTC
Created attachment 32988 [details]
Zip containing Tomcat 7.0.x logs after running JMeter script for an hour.
Comment 10 cg.throwaway 2015-08-12 12:38:04 UTC
Created attachment 32989 [details]
WAR file for servlet used to recreate this bug.

Attaching the WAR file I have been running the JMeter script against.
Comment 11 cg.throwaway 2015-08-12 12:43:40 UTC
Another thing to note is that it takes much longer to reproduce this exception using a JavaScript client versus using JMeter (where it can be seen within a few minutes).
Comment 12 Mark Thomas 2015-08-12 18:05:27 UTC
Thanks for testing. I'll be taking another look at this over the next few days.
Comment 13 Mark Thomas 2015-08-13 19:58:44 UTC
Good news. I think I have managed to track down the root cause of this. It is a race condition between the Poller timing out the socket and new data arriving. I haven't - yet - figured out how to fix it.

The test case only triggered one failure in a hour for me but this is going to be *very* sensitive to timing so that isn't a surprise. I think I can trigger the correct execution sequence in a debugger so - assuming that is the case - I'll be able to confirm any fix fairly easily.
Comment 14 Mark Thomas 2015-08-14 06:54:24 UTC
I've found and fixed a couple of concurrency issues that could have contributed to this but I'm still seeing the occassional failure.

I haven't yet managed to reproduce the same exception with a debugger.

Investigations continue.
Comment 15 cg.throwaway 2015-08-14 09:19:24 UTC
Hi Mark, just thought I'd try out your latest update - still seeing the exception (as you have stated above). Thanks.
Comment 16 Mark Thomas 2015-08-14 21:38:49 UTC
A quick update before the weekend.

I have better grip on what is going wrong. The issue is that async dispatches are being processed multiple times. Depending on the timing, this can go wrong.

The fix (for 7.0.x at least) is non-trivial. I've tried various small patches but all introduce a different problem. At this point, I think a reafctoring of the async dispatch handling will be required.

I haven't looked at 9.0.x or 8.0.x as yet. The refactoring that has already taken place in those versions may have already addressed this. Or not.

I'll continue to look at this next week.
Comment 17 Mark Thomas 2015-08-17 07:37:49 UTC
I've spent some more time looking at this and - having found the 'right' place to fix this - the patch was actually fairly small.

I have fixed the issue in 9.0.x, 8.0.x (for 8.0.25 onwards) and 7.0.x (for 7.0.64 onwards).
Comment 18 cg.throwaway 2015-08-17 13:47:16 UTC
We have built the trunk and we can confirm that we haven't seen the bug appearing during our tests with the minimal servlet, and also with our full application.

We'll continue testing for more time but it's looking really good.

Thanks a lot for your help Thomas! Really appreciated!
Comment 19 cg.throwaway 2015-08-17 13:54:09 UTC
Sorry!. I meant to say Mark... or Mr.Thomas.... as you prefer :)
Comment 20 Mark Thomas 2015-08-17 14:05:11 UTC
I answer to most names - as long as they are polite ;)

I did some local testing before committing this patch with the simple test case and didn't see any issues for over 15 hours so I'm reasonably confident that this issue is fixed. If your testing shows otherwise just re-open this issue.