Bug 65454

Summary: Race condition with idle thread removal results in request being improperly enqueued by default ThreadPoolExecutor
Product: Tomcat 9 Reporter: Aaron Ogburn <aogburn>
Component: UtilAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 9.0.x   
Target Milestone: -----   
Hardware: PC   
OS: All   
Attachments: Reproducer byteman rule and test app

Description Aaron Ogburn 2021-07-16 20:59:48 UTC
Created attachment 37959 [details]
Reproducer byteman rule and test app

I've found a race condition that can play out like below any time a non-core thread above minSpareThreads is idling out.  This can result in a request to be improperly enqueued like so:

1. The thread pool has 11 total threads, 10 threads are busy and 1 idle.  After the idle timeout (default 60 seconds), the idle thread pool member ends its ThreadPoolExecutor.getTask()/workQueue.poll() wait on the task queue.  It hasn't yet called ThreadPoolExecutor.compareAndDecrementWorkerCount()/ThreadPoolExecutor.processWorkerExit() to decrement the worker count and remove itself from the executor worker map.

2. A new connection comes in and is handed off to the executor, which calls TaskQueue.offer.  Since the idle thread hasn't removed itself from the map, parent.getPoolSize() still returns 11.  Thus, it passes this if check and so it is enqueued:

https://github.com/apache/tomcat/blob/9.0.x/java/org/apache/tomcat/util/threads/TaskQueue.java#L87

3. The idle thread then finishes exiting and removes itself from the executor.  The executor does not inherently replace that thread under this condition.  So there are now 10 busy threads in the pool and no idle thread available to process the new enqueued request so it sits in the queue until one of the other threads becomes idle.  This could then typically be a small imperceptible delay by the time another thread is created or returns idle.  But worst case, a very large unexpected delay is induced on the new request depending upon the run time of the current 10 busy tasks.


This can be hard to trigger or notice the effects of if triggered in most typical operations, but I was able to reproduce deterministically with some help from byteman.  To use byteman and reproduce:

1. Download the byteman 4.0.13 from https://downloads.jboss.org/byteman/4.0.13/byteman-download-4.0.13-bin.zip
2. Unzip byteman into install directory <mypath>.
3. Use the Byteman rules file DecrementWorkerSleep.btm from WorkerQueueRaceCondition.zip
4. Enable the Byteman instrumentation by adding the following to your JVM options:

-javaagent:<mypath>/lib/byteman.jar=script:<mypath>/DecrementWorkerSleep.btm,boot:<mypath>/lib/byteman.jar -Dorg.jboss.byteman.transform.all=true

5. Add the helloworld from WorkerQueueRaceCondition.zip to webapps for a simple slow jsp that can be easily used to trigger long requests
7. Start Tomcat
8. Make 10 slow requests (/helloworld/hi.jsp?sleep=true), resulting in usage of the 10 core threads
9. Make an 11th normal request (/helloworld/hi.jsp), resulting in the creation of the 11th non-core thread
10. Wait ~60 seconds after that to result in the idle timeout of that 11th thread.  With the extra byteman delay induced in that operation, you can watch for a message like this:

--------------------------->ThreadPoolExecutor.compareAndDecrementWorkerCount http-nio-8080-exec-11 sleeping...

11. Upon seeing that message, issue another normal quick request like in step 9 within the next 30 seconds and note it is not getting a response.  And without any other activity on the server, the request is never processed until one of the 10 core threads may become available.  A heap dump at that time and inspecting the executor/taskQueue confirms only the 10 busy threads in the pool with slow requests while the expected quick request is sitting in the queue doing nothing.
Comment 1 Aaron Ogburn 2021-07-16 21:04:58 UTC
Fortunately for an easy workaround, you can set minSpareThreads equal to maxThreads to avoid any idle thread removal that may allow for that condition.
Comment 2 Mark Thomas 2021-07-20 15:10:58 UTC
Thanks for reporting this.

I'm currently looking at options that don't involved completely re-implementing LinkedBlockingQueue and/or ThreadPoolExecutor.

I'm not hopeful of finding such an option at this point. We may need to do something similar to Commons Pool and take a copy of LinkedBlockingQueue and adapt it to our needs.
Comment 3 Aaron Ogburn 2021-07-20 16:01:24 UTC
Thanks, Mark.  I also wasn't coming up with a proper fix in the current implementation.  The best thing I saw here was to adjust https://github.com/apache/tomcat/blob/9.0.x/java/org/apache/tomcat/util/threads/TaskQueue.java#L87 to the following to ensure at least an extra idle thread if enqueuing:

        if (parent.getSubmittedCount()<(parent.getPoolSize())) {


But that'd only reduce chance and wouldn't be foolproof as we could still potentially end up with both idle threads timing out in the same problem timing.  Otherwise, it was looking like re-implementing was required to fix :(
Comment 4 Mark Thomas 2021-07-20 16:44:25 UTC
Agreed. I found a whole bunch of ways to do something similar but they all just narrowed the timing window.

Before I do too much work on LinkedBlockingQueue, there is some handling around RejectedExecutionException and the additional force methods I want to look at. I'm not currently convinced that handling is necessary. The more of that we can clean up, the simpler the changes to the copy of LinkedBlockingQueue will be.
Comment 5 Mark Thomas 2021-07-21 08:51:16 UTC
I see why the force handling is needed. It is to address the case where there are maxThreads-1 current threads, two new connections are received, both try to create a new thread, one gets created and the other gets rejected. The rejected one needs to be put on the queue.

There is a balance to strike here between reducing complexity (and opportunities for bugs) by re-using existing classes and the additional complexity created by re-use because of having to work around not having access to all of the internals we really need.

I'm still trying to figure out the best way forward.
Comment 6 Mark Thomas 2021-07-21 13:43:40 UTC
Just a quick update. I am currently looking at a solution along the following lines:
- import latest ThreadPoolExecutor from JSR 166 repo
- Fix any CheckStyle etc issues
- Merge in the changes from o.a.t.util.ThreadPoolexecutor
- Fix this issue by checking the queue size after stopping a worker thread and adding a new thread to the pool if the queue size is not zero.
Comment 7 Mark Thomas 2021-07-22 15:00:35 UTC
While importing ThreadPoolExecutor added 2k+ lines of code, the actual fix was only a few lines of code. We'll need to check for updates to ThreadPoolExecutor from time to time but I'd be surprised if there are still bugs in the parts we are using.

Fixed in:
- 10.1.x for 10.1.0-M3 onwards
- 10.0.x for 10.0.9 onwards
- 9.0.x for 9.0.51 onwards
- 8.5.x for 8.5.70 onwards