Bug 64765 - ThreadPoolExecutor#submittedCount wrong after undeploy
Summary: ThreadPoolExecutor#submittedCount wrong after undeploy
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Util (show other bugs)
Version: 9.0.38
Hardware: PC All
: P2 normal with 5 votes (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
Depends on:
Reported: 2020-09-24 15:16 UTC by mvi
Modified: 2020-10-13 09:03 UTC (History)
1 user (show)

screenshots (141.66 KB, application/x-zip-compressed)
2020-09-24 15:16 UTC, mvi

Note You need to log in before you can comment on or make changes to this bug.
Description mvi 2020-09-24 15:16:02 UTC
Created attachment 37468 [details]

Steps to reproduce:
1. download Tomcat 9.0.38 (older versions are affected as well)
2. configure user for the manager application
3. start Tomcat
4. navigate to the manager web application and install a webapp having a servlet that performs some long running tasks (e.g. 30 seconds for one request)
5. Open several browsers (because some browsers limit the number of connections to the same host) and navigate to the installed webapp (e.g. query the servlet 20 times in parallel). This puts the worker threads of the internal thread pool at load.
6. In the manager web application undeploy the webapp again while the 20 requests are still running.
7. After a few seconds the webapp is undeployed.
8. Redeploy the webapp again
9. Refresh the 20 web pages to put the worker threads at load again.

Observation after the redeploy: Some of the requests arrive at the servlet immediately (as it should) and some only arrive after a long delay (e.g. 30 seconds).

Problem Analysis:
The Thread Pool uses a core pool size of 10 and a max pool size of 200. So basically, there is space to create more worker threads.
However, the pool enqueues the request in the work queue without creating a new worker for it. And as the other workers are all busy handling the 30 seconds long requests, it stays in the queue until a worker finishes.
Expected behavior would be that a new worker thread is created and the request is processed immediately.

The reason why no worker is created can be found in code line [1]: The queue uses the submittedCount property of the pool to decide whether new workers are required. Unfortunately, at this point the value returned is wrong (too low).
As a result, the runnable is just saved without creating more workers.

The submittedCount value of the pool is running out of sync during the undeploy of the webapp. At a certain point (if the webapp is undeployed while there are long running requests working) this number becomes a negative value.
The attached screenshots illustrate the moment in which the submittedCount would become negative:
Screenshot 0 shows the two stack frames both standing at [2].
Screenshot 1 shows that at this moment the submittedCount property already has the value 0 and would become negative right afterwards.
Screenshot 2 shows the position of worker "exec-11" currently executing "afterExecute" for task with id=617
Screenshot 3 shows the position of worker "exec-13" currently executing "afterExecute" for the same task but in the catch block (see [4]).
Screenshot 4 shows the position of worker "exec-13" again but indicating that the exception is of type "StopPooledThreadException".

After all threads completed their work, submittedCount in the thread pool has a value of e.g. -8.
If the webapp is deployed again and e.g. 10 worker threads are still in the pool, the 10 first requests to the webapp are processed immediately as there are workers available.
And as these requests are quite long, these 10 workers stay in use. The next request would basically be the first that requires a new worker thread to be created.
But as the submittedCount now has the value 2 (-8 after the deploy + 10 new requests) the task queue will return at [1] putting the request in the waiting queue.
The correct behavior would be to return false (two lines further down) which would instruct the pool to create new workers.
The request will only be processed when the next 30 seconds request ends. In this situation tomcat must be restarted because responsiveness drops rapidly.

Some thoughts:
- I am not sure about the internals but is correct that the "afterExecute" is called several times in parallel for the same task as shown in the screenshots?
- In case the StopPooledThreadException is thrown at [3] it is caught from [4] which executes "afterExecute" again as shown in screenshot 3. Is this correct?

In general, somehow should be ensured that the submittedCount of "org.apache.tomcat.util.threads.ThreadPoolExecutor" cannot become negative. Otherwise the whole thread pool is broken.

[1] https://github.com/apache/tomcat/blob/9.0.38/java/org/apache/tomcat/util/threads/TaskQueue.java#L79
[2] https://github.com/apache/tomcat/blob/9.0.38/java/org/apache/tomcat/util/threads/ThreadPoolExecutor.java#L95
[3] https://github.com/apache/tomcat/blob/9.0.38/java/org/apache/tomcat/util/threads/ThreadPoolExecutor.java#L118
[4] https://github.com/openjdk/jdk/blob/jdk-11%2B7/src/java.base/share/classes/java/util/concurrent/ThreadPoolExecutor.java#L1131
Comment 1 mvi 2020-09-24 15:20:11 UTC
Possible workaround:

In the context.xml: set the renewThreadsWhenStoppingContext property to false.

See https://tomcat.apache.org/tomcat-9.0-doc/config/context.html
Comment 2 Mark Thomas 2020-09-29 20:00:25 UTC
Thanks for the report and for the analysis.

Fixed in:
- master for 10.0.0-M9 onwards
- 9.0.x for 9.0.39 onwards
- 8.5.x for 8.5.59 onwards
- 7.0.x for 7.0.107 onwards
Comment 3 mvi 2020-09-30 07:53:22 UTC
Thanks for the fast fix.
I will test it as soon as new builds are available.
Comment 4 mvi 2020-10-13 09:03:14 UTC
I confirm the issue seems to be fixed.
Tested with release 9.0.39.