Bug 43683

Summary: Accessing Servlet while Reloading context gives 404 error
Product: Tomcat 6 Reporter: Joe Kislo <joekislos>
Component: CatalinaAssignee: Tomcat Developers Mailing List <dev>
Severity: normal CC: carl.zetterberg
Priority: P2    
Version: 6.0.14   
Target Milestone: default   
Hardware: Other   
OS: other   
Attachments: The ServletRestartTest file you need to extract
The Context file for the ServletRestartTest
Testcase 2 - ServletRestartTest.war
Testcase 3 - ServletRestartTest.war
Test Harness #4

Description Joe Kislo 2007-10-23 09:43:12 UTC
If a servlet has finished it's init(ServletConfig), and is now ready to handle
requests, if you tell the tomcat manager to reload the context containing that
servlet (EG: w3m http://localhost:8080/manager/reload?path=/ServletRestartTest),
and you access that servlet via GET or POST (EG: w3m
http://localhost:8080/ServletRestartTest/ServletRestartTest) while the destroy()
method on the servlet is still executing, the GET or POST request will fail with
a 404 error after the destroy() method has been completed.  This is always
reproducible with my attached test case.  I believe this may be the root cause
of the issue described in #42518 which is described as an "Intermittent 404". 
With this test case, it is always reproducible, and not intermittent.
Comment 1 Joe Kislo 2007-10-23 09:44:03 UTC
Attached is a context file and a test servlet.  This servlet has alot of sleeps,
and prints a bunch of debug which will make it clear what is going on
when the servlet is being reloaded.  This problem is reproducible even without
any sleep statements in the servlet.

To setup the test do this:
extract the apache-tomcat 6.0.14 binary distro
cd conf
cd Catalina
mkdir localhost
cd localhost
cp ../../../webapps/host-manager/manager.xml .  (This sets up the manager
servlet for you)
Drop the ServletRestartTest.xml into this directory
cd ../..
edit the tomcat-users.xml and paste in: (this lets you access the manager
servlet with username tomcat password tomcat)
<role rolename="manager"/>
<user username="tomcat" password="tomcat" roles="manager"/>
cd ../webapps
extract the war file
cd ../bin
./catalina.sh run (this runs tomcat printing the stdio to the console)

To reproduce with this test case:
1. Visit this URL: http://localhost:8080/ServletRestartTest/ServletRestartTest
   You should get:
ServletRestartTest - Servlet is in the RUNNING state
2. Visit this URL: http://localhost:8080/manager/reload?path=/ServletRestartTest
3. Then immediately reload the original url
4. You will then receive the 404 error approximately 5 seconds later:

HTTP Status 404 - Servlet ServletRestartTest is not available

type Status report

message Servlet ServletRestartTest is not available

description The requested resource (Servlet ServletRestartTest is not available)
is not available.
Apache Tomcat/6.0.14

Optionally you may use the apachebench program to see that you will get many
404s, depending on how many requests come in:
Replace step 3 with: ab -c 10 -n 10
You will see that all 10 requests fail with non-200 responses:

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        Apache-Coyote/1.1
Server Hostname:        localhost
Server Port:            8080

Document Path:          /ServletRestartTest/ServletRestartTest
Document Length:        1081 bytes

Concurrency Level:      10
Time taken for tests:   5.49371 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Non-2xx responses:      10
Total transferred:      12850 bytes
HTML transferred:       10810 bytes
Requests per second:    1.98 [#/sec] (mean)
Time per request:       5049.371 [ms] (mean)
Time per request:       504.937 [ms] (mean, across all concurrent requests)
Transfer rate:          2.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       1
Processing:  5026 5043   6.4   5045    5048
Waiting:     5026 5038   8.0   5038    5048
Total:       5027 5043   6.1   5045    5048

Percentage of the requests served within a certain time (ms)
  50%   5045
  66%   5045
  75%   5046
  80%   5048
  90%   5048
  95%   5048
  98%   5048
  99%   5048
 100%   5048 (longest request)

Comment 2 Joe Kislo 2007-10-23 09:45:10 UTC
Created attachment 21025 [details]
The ServletRestartTest file you need to extract
Comment 3 Joe Kislo 2007-10-23 09:48:04 UTC
This is how I envision the reload function working:

1) The servlet is happily handling GET/POST requests
2) reload is requested from the manager servlet
3) All new requests should be put into a queue, and wait to be serviced
4) If there are any 'currently active requests', tomcat should wait up to
StandardContext.unloadDelay seconds waiting for these requests to finish
5) Any requests that finish should be sent to the browser and finished up
6) When unloadDelay has been exceeded, or all currently active requests are
completed, destroy() should be called on the servlet
7) Wait until destroy() has finished
8) Clean up the context, and switch the classloader to the new classloader
9) If there are any requests in the queue, start the servlet, call
init(ServletConfig) on the servlet
10) wait for init(ServletConfig) to complete
11) All new requests should nolonger be put into queue, and should be serviced
by the servlet
12) All requests which are in queue, should now be released and processed
against the new servlet

Looking at the code in StandardContext, I can see this is probably what it
supposed to happen (it tries to pause incoming requests)... it just doesn't
quite work.  What we see is that if a request comes in while the servlet is
being destroyed, it will get a 404 from the servlet when the destroy() method
completes (EG: not right away!).  I looked at the code very reasonably
thoroughly in tomcat 5.5 (more thoroughly than I am looking at it right now) and
I seem to remember discovering that it was the start() context method that was
actually destroying the waiting request.  But I can't vouch for that, it was
probably a year ago :)  

It is worth noting that if you send in a SECOND request while the servlet is
being destroyed, that request *IS* handled properly... However... you can't send
that second request in TOO quickly after the first.  If they all come in at the
same time (see the apache bench testcase), they will *all* fail.  However even
if you're doing it reasonably quickly from firefox, the second request will come
through correctly.  I should also note that we had a patch for tomcat 4.1 which
made the reload() function work the way described above (it was a rather minor
change), however the patch is useless against tomcat 5 and greater, since the
start/stop code has been rewritten.  
Comment 4 Joe Kislo 2007-10-23 09:49:45 UTC
Created attachment 21026 [details]
The Context file for the ServletRestartTest
Comment 5 Mark Thomas 2008-04-17 15:05:32 UTC
Thanks for the test case. The problem was that the wrapper originally allocated to the request is stopped as part of the reload.

I have committed a fix to trunk and proposed it for 6.0.x.
Comment 6 Mark Thomas 2008-05-01 10:27:40 UTC
This has been fixed for 6.0.x and will be in 6.0.17 onwards.
Comment 7 Joe Kislo 2008-05-22 16:12:38 UTC
Sorry for the delay testing the fix... It's been a busy couple of weeks.

So I checked out the code from svn trunk today (05-22-08) and tested it.  With the test case I provided, everything was good... However when I tested it with a more thorough test case (EG: a real web application :) ), i ran into some crashes implying I was using an the OLD classloader.  

It seems like the Thread that is calling the init() function has the *wrong* classloader attached to it (EG: the old servlet's class loader, instead of the new one).  I've narrowed down the issue (this... took some time :) ) and have a new test case for you.

Just replace the ServletRestartTest.war file with this new one (and re-extract it).  There are now two classes, the ServletRestartTest servlet itself, and another class (it happens to be blank) that I will try to load using the Classloader.

Do the same test as before... run:
ab -c 10 -n 100 http://localhost:8080/ServletRestartTest/ServletRestartTest

then query the reload by going to this url at the same time:

You'll see this:
servicing completed
servicing completed
destroy starting
destroy completed
init starting
May 22, 2008 7:02:13 PM org.apache.catalina.loader.WebappClassLoader loadClass
INFO: Illegal access: this web application instance has been stopped already.  Could not load OtherClass.  The eventual following stack trace is caused by an error thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access, and has no functional impact.
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1273)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1233)
        at ServletRestartTest.init(ServletRestartTest.java:24)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1161)
        at org.apache.catalina.core.StandardWrapper.allocate(StandardWrapper.java:806)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:129)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:183)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:595)
init completed
starting servicing

Let me know if I can be of any further assistance
Comment 8 Joe Kislo 2008-05-22 16:14:32 UTC
Created attachment 21990 [details]
Testcase 2 - ServletRestartTest.war
Comment 9 Mark Thomas 2008-05-28 13:49:54 UTC
Thanks for your efforts on the test case. It made things so much easier to investigate. I have committed a fix to trunk that addresses this and proposed it for 6.0.x

If you are able to test my fix that would be very helpful.
Comment 10 Joe Kislo 2008-06-05 19:46:01 UTC
Well thanks for the compliment on the test cases... These issues are very important to us, so I'm *extremely* grateful you're spending time looking into them.  But I'm sorry to report that my test case for this next one isn't as good :)

So first, your fix with the classloader issues seems to have fixed up that problem.  

However, if I use the 2nd testcase, and do:
ab -c 10 -n 60 http://localhost:8080/ServletRestartTest2/ServletRestartTest

then reload the context:

I am still getting some non 200 requests:

Complete requests:      60
Failed requests:        11
   (Connect: 0, Length: 11, Exceptions: 0)
Write errors:           0
Non-2xx responses:      11

I swear I didn't this before... So maybe I'm crazy... But maybe I was distracted last time by trying to track down the classloader testcase.  If I try to do the tests from my browser... they *usually* work just fine.  But sometimes, I'll get the 404 in my browser too.  I would just say it's a race condition as to when I reload the context, but can't quite put my finger on it.  It seems like once it gets one 404... every time I try it after that (until I restart tomcat) it'll get the 404 on reload.  *sigh*  Sorry I can't be more helpful.  Can you tell me if you are still seeing non-200s with the apache bench test?  It seems like maybe you fixed one path where you can get a 404, but maybe there's another.  It looks like requests that come in right after the destroy finishes, but before the init starts that are getting the 404s now.

So I did find a *second* issue, which is also intermittent, and may have been introduced by your first fix (although I don't know).  I've spent the past two days (well, a couple hours yesterday and a few hours today) trying to get a narrow testcase for you.  I've got one, and I'll attach it as ServletRestartTest3.  In your browser, you need to open Three tabs

You need to start tomcat, then reload tab 2, then 1, then 3

This is what happens (sometimes!)

the servlet will init()
then the servlet will destroy()
but while the destroy is running, it will service the request for tab 2 on the same servlet
the destroy() will finish
the servlet will init()
the servlet will service tab 3.

Unfortunately, what it's doing is serving a request (tab #2) *while* it's in the middle of destroying the context.  Bad bad bad :(  In the testcase I have rigged up it will cause a nullpointer exception in tab #2 when that happens:


You'll see this on your tomcat console:

init starting
[GC [DefNew: 960K->64K(960K), 0.0029550 secs] 4176K->3426K(5056K), 0.0029980 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
Jun 5, 2008 10:13:07 PM org.apache.catalina.core.StandardContext reload
INFO: Reloading this Context has started
init completed
destroy starting
starting servicing
destroy completed
[GC [DefNew: 960K->64K(960K), 0.0034250 secs] 4322K->3539K(5056K), 0.0034690 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
[GC [DefNew: 960K->64K(960K), 0.0019920 secs] 4435K->3623K(5056K), 0.0020690 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
init starting
init completed
starting servicing
servicing completed

If you can't get it to happen first try, reload tab #1, which will shutdown the context... Then do tab #2, #1, #3.  It may take you a few times to get it to happen.  It's intermittent :(  You'll see that Testcase #3 has a very tight loop at the init().  I tried a sleep(5000), and it was *significantly* more difficult to get the problem to happen.  With the 5 second tight loop, it's much easier (although still not consistent).

Let me know if there's anything I can do to help
Comment 11 Joe Kislo 2008-06-05 19:47:16 UTC
Created attachment 22084 [details]
Testcase 3 - ServletRestartTest.war
Comment 12 Mark Thomas 2008-06-08 04:51:26 UTC
The classloader fix has been applied to 6.0.x and will be in 6.0.17. I'll take a look at the remaining issues.
Comment 13 Mark Thomas 2008-06-11 15:08:34 UTC
I do have some progress to report.

The issues you are seeing with test case 2 are due to the context.reload() calling context.stop() followed by context.start()

context.start() unregisters and re-registers the context with JMX. Whilst the context is unregistered it does not appear in the mapper so the request is mapped to the ROOT webapp and a 404 results. I am currently looking for the least ugly hack around this issue.

At first glance test case 3 is a separate issue which I will look at once I have test case 2 working properly.
Comment 14 Mark Thomas 2008-06-13 13:44:28 UTC
I have committed another patch to trunk and proposed it for 6.0.x that fixes another source of 404s on reload. The patch is:


With this patch I can't get any 404s with test case 2. On to test case 3...
Comment 15 Mark Thomas 2008-06-13 14:03:11 UTC
I can't reproduce the problem you had with case 3.

Have you set the unloadDelay on the context? I have set mine to 20000 (20 seconds) for this bug to make sure that the servlet is given a chance to destroy() cleanly.
Comment 16 Joe Kislo 2008-06-19 14:15:02 UTC
The Reloading looks like it works well now.  Thanks!

As for case #3... Yes I can still get that to happen.  I thought maybe it was easier for me to reproduce because my laptop had dual core.  I booted nosmp and I was still able to get it to happen (infact, I think it may have been easier).

Yes, there is a setting for the 20 second timeout... but it's not hitting that.  It's pretty much immediately blazing into servicing the request even though the destroy is in progress.  The Destroy starting you see in my previous message and the starting servicing come basically at the same time.  I think the key is to get the request on window 3 in before window 2 is finished causing the servlet to start.

It doesn't happen every time... And it may be easier to recreate on a slower machine...  it certainly appears easier to recreate when I replaced the init() sleep(5000) with a CPU bound loop

Let me know if you can get it to happen... and if not I'll see what I can do for a better testcase.

Thanks again
Comment 17 Mark Thomas 2008-06-21 10:51:47 UTC
Sorry, I still can't repeat the issue with test case 3. I have tried on a couple of different machines.
Comment 18 Joe Kislo 2008-06-25 08:32:23 UTC
Well that's frustrating :)  I've done some work on it today to try to narrow down when it happens.

I've written a java program that will make the required calls to the URLs in parallel to cause the issue.  I wrapped it in a loop, and it happened the first time I ran it!  Then I ran it again, and let it run for 2 hours without it happening once.

Then I restarted tomcat, and it happened first time!

So it seems to happen a lot *more* when tomcat has first started... but still not every time.  I'm using Sun's 1.5.0_15 jvm if that somehow makes a difference.

So I am attaching a unix shell script that will start+stop tomcat between each test run, and my java program that does the URL calling.  You may need to tweak a couple variables to make this work on your machine:

  TOMCAT_BIN_DIR - Set this to where the catalina.sh script is

  TESTCASE3_URL - Put the url to testcase3 (the one that spins cpu during init)
  RESTART_TOMCAT_URL - Put the url to restart that servlet
  RESTART_TOMCAT_BASIC_AUTH - Put in the basic authentication credentials for the manager servlet in format username:password

With those changes... I think that will get it going for you.  The theory here is:
while (true) {
 start tomcat
 Request the servlet
 Restart the servlet
 Request the servlet
 stop tomcat

When the testcase fails you'll see something like this:

Using CATALINA_BASE:   /var/tmp/tomcat/apache-tomcat-6.0.16
Using CATALINA_HOME:   /var/tmp/tomcat/apache-tomcat-6.0.16
Using CATALINA_TMPDIR: /var/tmp/tomcat/apache-tomcat-6.0.16/temp
Using JRE_HOME:       /usr/local/java
Sleeping for tomcat to start
Testcase Call 1: Requesting URL
Restart call: Requesting URL
Testcase Call 2: Requesting URL
Testcase Call 1: IOException in URL request
java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8080/ServletRestartTest3/ServletRestartTest
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1174)
        at WebRequestThread.run(WebRequestThread.java:28)
Using CATALINA_BASE:   /var/tmp/tomcat/apache-tomcat-6.0.16
Using CATALINA_HOME:   /var/tmp/tomcat/apache-tomcat-6.0.16
Using CATALINA_TMPDIR: /var/tmp/tomcat/apache-tomcat-6.0.16/temp
Using JRE_HOME:       /usr/local/java
Sleeping for tomcat to stop

In the cases where it tomcat succeeds, I see this on the server side:
INFO: Waiting for 1 instance(s) to be deallocated

I do not see that when my testcase fails.

FWIW, I am using a apache-tomcat-6.0.16 binary installation, and I dropped in lib/* and bin/*.jar from a newly compiled trunk svn checkout from yesterday (06-24-08)

Good luck!
Comment 19 Joe Kislo 2008-06-25 08:32:58 UTC
Created attachment 22176 [details]
Test Harness #4
Comment 20 Mark Thomas 2008-06-26 12:57:53 UTC
The patch for test case 2 has been applied to 6.0.x and will be in 6.0.17 onwards. I'm still looking at test case 3.
Comment 21 Mark Thomas 2008-06-26 15:48:59 UTC
The good news is I can now repeat test case 3 and have found the root cause.

There is a synchronisation issue with StandardWrapper. In short, one thread can enter allocate() and before countAllocated is incremented, another thread can enter unload().

The bad news is that I suspect the fix might not be pleasant. allocate() is on the critical path so just synchronising allocate() would likely be bad from a performance point of view. I haven't really thought about this much yet, so I am hoping there is a better fix to be found. I'll have a think about a solution for this over the next few days. Hopefully I'll have something next week. In the meantime, any suggestions are welcome.
Comment 22 Mark Thomas 2008-06-27 14:45:55 UTC
I have committed a fix for test case 3 to trunk, It isn't perfect but it should be enough to fix the issue you are seeing. For the record, I could only get test case 3 to fail by using a debugger to control the execution sequence of the threads.

The patch is http://svn.apache.org/viewvc?rev=672397&view=rev

Any testing feedback would be appreciated.

The patch has been proposed for 6.0.x
Comment 23 Mark Thomas 2008-07-04 00:28:31 UTC
The patch for test case #3 has been applied to trunk.

Whilst the fix isn't perfect is should be extremely difficult (verging on impossible) to reproduce this issue under normal usage.

If you have a test case that still demonstrates this issue with the latest patches, please feel free to re-open this bug.
Comment 24 Mark Thomas 2008-11-03 06:30:13 UTC
*** Bug 42518 has been marked as a duplicate of this bug. ***