Bug 50306 - Detect stuck threads
Detect stuck threads
Status: RESOLVED FIXED
Product: Tomcat 7
Classification: Unclassified
Component: Catalina
Version: 7.0.4
Hardware: All
OS: All
Importance: P2 enhancement with 1 vote
Assigned To: Tomcat Developers Mailing List
Reported: 2010-11-19 18:09 UTC by Sylvain Laurent
Modified: 2012-07-02 14:31 UTC



Attachments
Patch for Tomcat's ThreadPoolExecutor (11.05 KB, patch)
2010-12-29 05:02 UTC, TomLu
New Valve: StuckThreadDetectionValve (12.26 KB, patch)
2011-03-14 10:30 UTC, TomLu
2012-06-19_tc6_50306_StuckThreadDetectionValve.patch (15.69 KB, patch)
2012-06-19 12:45 UTC, Konstantin Kolinko
2012-06-20_tc6_50306_StuckThreadDetectionValve.patch (15.77 KB, patch)
2012-06-20 09:06 UTC, Konstantin Kolinko

Description Sylvain Laurent 2010-11-19 18:09:10 UTC
Feature request:
Regularly scan worker threads and, if one has been processing the same request for longer than a configurable delay, log a warning with the stack trace of that thread.
This would make it possible to detect very long-running threads, usually ones that are stuck in a network call or in a deadlock.
Comment 1 Pid 2010-11-20 15:53:38 UTC
(In reply to comment #0)
> Feature request : 
> regularly scan worker threads and if one has been processing the same request
> for longer than a configurable delay, log a warning with the stack trace of
> that thread.
> This would allow to detect very long running threads, usually the ones that are
> stuck in a network call or in a deadlock.

The diagnostics package Rainer & I discussed included something similar to this. It's possible to use JMX to identify if a thread is associated with a request and report the properties of the request.

It's also possible to enumerate deadlocked threads using JMX.
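For illustration, a minimal sketch of that enumeration using the standard java.lang.management API (the DeadlockScan wrapper class is hypothetical; the ThreadMXBean calls are real JDK API):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockScan {
    // Returns the ThreadInfo of any deadlocked threads, or an empty list.
    static List<ThreadInfo> findDeadlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        List<ThreadInfo> result = new ArrayList<>();
        if (ids != null) {
            // Fetch full stack traces for just the deadlocked threads.
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                if (info != null) {
                    result.add(info);
                }
            }
        }
        return result;
    }
}
```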

I've got some sample code which does various diagnostic tasks, but it's very experimental...
Comment 2 Sylvain Laurent 2010-11-21 15:26:18 UTC
What is this diagnostics package? Is it something in preparation? Should I wait for it before working on a patch for the stuck thread detection feature?
Comment 3 Mark Thomas 2010-11-21 15:45:21 UTC
All of this info (except the stack trace) is already available via JMX and displayed via the Manager app. There are also plenty of tools to generate stack traces. I'm struggling to see what else could be usefully done here without adding bloat.
Comment 4 Rainer Jung 2010-11-22 09:13:44 UTC
I see basically two aspects that code inside Tomcat could add:

1) Extraction of relevant information

Since we know which thread works on which request, and we know a lot about the request metadata (e.g. when it started) and about the threads (e.g. CPU usage), we can very efficiently inspect the stacks of exactly those threads that have been working on the same request for longer than some configured limit. There is no need to dump stacks for all threads (many of them are idle in the pools, or connected to the web server but not working on a request).

2) Reduction of polling interval

Since we focus on the relevant info, we can check more frequently than one would with a plain "dump all stacks", e.g. every 10 seconds instead of every minute. If there is no long-running request, there is no stack to inspect.

Yes, all of that can be done by a JMX client. But the client needs to be powerful enough to add logic like retrieving the in-flight requests, extracting those that are already taking too long, then retrieving the stacks for those requests, etc.
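The client-side logic described above could be sketched as follows; the requestStart map (thread id to request start time) is hypothetical bookkeeping standing in for Tomcat's real request-processor data, while the ThreadMXBean calls are real JDK API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LongRequestScan {
    // Returns one formatted report line per over-limit thread.
    static List<String> report(Map<Long, Long> requestStart, long limitMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long now = System.currentTimeMillis();
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Long> e : requestStart.entrySet()) {
            long elapsed = now - e.getValue();
            if (elapsed <= limitMillis) {
                continue; // quick requests cost nothing: no stack is fetched
            }
            // Fetch the stack only for this one suspect thread.
            ThreadInfo info = mx.getThreadInfo(e.getKey(), Integer.MAX_VALUE);
            if (info != null) {
                lines.add("Thread \"" + info.getThreadName() + "\" busy "
                        + elapsed + " ms, " + info.getStackTrace().length
                        + " frames");
            }
        }
        return lines;
    }
}
```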

What I don't have a good idea about is how to actually use that information. Of course you could simply log the stacks and the request metadata for the long-running requests, but for an average admin this might be too technical (of no real use), though it would be helpful for the application provider when analyzing the reasons for performance problems.
Comment 5 Konstantin Kolinko 2010-11-22 09:52:30 UTC
1) The manager webapp already has a list of requests being processed on the "Server Status" page, as Mark mentioned.

One could make them clickable, with more info displayed on demand.

2) If some "scan worker threads" self-diagnostic activity is needed inside Tomcat, one can wrap it in an o.a.c.LifecycleListener and add it to server.xml. I doubt that there is a one-size-fits-all solution here, but I won't stop anyone from contributing.
Comment 6 Sylvain Laurent 2010-11-22 14:22:05 UTC
My initial idea was: if a request takes too long, log a WARNING with the stack trace of the thread processing the request.
I think it is more important to have this in a log file (push information) than to merely allow access via JMX (pull information), because it allows problems to be discovered proactively, for instance a network call that is stuck waiting for data from an unhealthy server when no socket timeout has been set for the call.

As for the actual implementation, it could indeed be a LifecycleListener, but it will probably need some changes elsewhere to provide instrumentation.

I'll try to propose a patch in a couple of days.
Comment 7 TomLu 2010-12-29 05:02:04 UTC
Created attachment 26451 [details]
Patch for Tomcat's ThreadPoolExecutor

WebSphere has hung/stuck thread detection logic that has been very useful to us, both in development and production. But since we are moving away from WebSphere, it would be nice to have this kind of feature in Tomcat as well. Some might consider this "stuck thread detection" unnecessary bloat, but we have found it useful on numerous occasions.

Applications are never perfect, and threads will often run for too long (resulting in thread leakage and pool depletion):
- External connections might hang (like URLConnection) [you can configure read/connect timeouts, but not all applications have configured them]
- Loops (can be detected by excessive CPU usage)
- Loops that do not cause excessive CPU usage (for example because each iteration uses slow external IO, or calls Thread.sleep!! (bad coding, but it can happen))
- Deadlocks
- Threads waiting on other hung threads

I have created an experimental patch for Tomcat 7.0.5 that we have been using in development for a short time now.

The hung thread monitoring does the following:
Monitors active (managed) threads, and issues a warning for each thread that has been active for more than the configured threshold (seconds).
The message (log.warn) contains the thread name, stack trace, total time active, and total number of suspected hung threads. 
(In our environment, we use Cron to scan system.out, and send email if a hung thread is detected.)

A timer thread runs at a configurable interval, checks for hung threads, and reports both newly hung threads and threads that were previously reported as hung but have now finished.

It detects hung threads in the HTTP/AJP thread pools. It does not monitor all JVM threads, but since many user threads are spawned from managed "web request" threads, the HTTP/AJP threads often wait for user threads to finish, so this patch should give an indication that something is wrong in those cases as well.

If anyone is interested, I've attached the patch. 
This patch is just an example of how such monitoring can be implemented. This is not production ready code. Performance impacts have not been considered!

The patch is for the following class:

org.apache.tomcat.util.threads.ThreadPoolExecutor

It uses 2 system properties for configuration:

org.apache.catalina.threadmonitor.interval (in seconds. Default 120)
org.apache.catalina.threadmonitor.threshold (in seconds. Default 600)

The monitoring is enabled by default, but can be disabled by setting the "interval" to a negative value or 0.

As stated, it only monitors managed threads created by Tomcat's HTTP/AJP thread pools.

Monitoring all JVM threads is also an option, but it may require more code and might lead to many false positives. The most important thing is to detect problems with stuck HTTP/AJP threads.
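The timer loop described above might look roughly like this; the ThreadMonitorSketch class and its activeSince bookkeeping are hypothetical simplifications of the attached patch, while the property names mirror the ones listed above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ThreadMonitorSketch {
    // Both values are in seconds, matching the patch's defaults.
    static final long INTERVAL =
            Long.getLong("org.apache.catalina.threadmonitor.interval", 120);
    static final long THRESHOLD =
            Long.getLong("org.apache.catalina.threadmonitor.threshold", 600);

    // Thread id -> time (millis) the thread started processing its task.
    final Map<Long, Long> activeSince = new ConcurrentHashMap<>();

    // Returns ids of threads that have been active longer than the threshold.
    List<Long> findHung(long nowMillis) {
        List<Long> hung = new ArrayList<>();
        for (Map.Entry<Long, Long> e : activeSince.entrySet()) {
            if (nowMillis - e.getValue() > THRESHOLD * 1000) {
                hung.add(e.getKey());
            }
        }
        return hung;
    }

    void start() {
        if (INTERVAL <= 0) {
            return; // monitoring disabled, as described above
        }
        ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "thread-monitor");
                    t.setDaemon(true); // never keeps the JVM alive
                    return t;
                });
        timer.scheduleAtFixedRate(
                () -> findHung(System.currentTimeMillis())
                        .forEach(id -> System.err.println(
                                "WARN: thread " + id + " appears hung")),
                INTERVAL, INTERVAL, TimeUnit.SECONDS);
    }
}
```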

TODOs:

-Should the thread monitor properties be moved to server.xml instead of being System properties?
-Add JMX notification when hung thread detected/hung thread finished


I have no problem understanding that such a feature might not be seen as useful for everybody. We will continue to patch our own version of Tomcat anyway :)
Comment 8 Pid 2010-12-29 16:47:42 UTC
If I understand the patch correctly, I disagree with the idea of an arbitrary (even if configurable) time threshold per thread.

A long running thread is not necessarily a broken one; there are existing methods in JMX for identifying deadlocked threads.

The same could be achieved with a Valve in the request processing pipeline, without modifying the Executor.
Comment 9 TomLu 2011-03-14 10:30:23 UTC
Created attachment 26769 [details]
New Valve: StuckThreadDetectionValve

Looking at the source, a Valve seems like a much better place for such code, as you wrote. However, I still think we need some kind of threshold (configurable or not). Deadlocked threads are not the only problem we can face: we often experience problems with erroneous data in our DB, which causes loops to occur, and HTTP client requests without any configured read or connect timeout can also cause a thread to appear "stuck".

The threads this Valve should monitor are HTTP/AJP processor threads only (not user threads), and these threads should _normally_ not be processing long-running requests. Such requests should preferably be allocated to a separate thread where possible, and users who do need to execute long-running requests in the HTTP/AJP threads can choose not to use this Valve. This seems to be the spirit of Tomcat 7.x: create tools to help users improve their apps (cf. the memory leak detection logic).

In many servers, the main HTTP request threads are considered a limited resource, so if such threads run for too long, the pool will eventually be depleted. This is the case for WebSphere, which has a default max pool size of 50 and a configurable threshold for its "hung thread detection" (default 600 seconds). A valve for detecting long-running threads can be quite helpful not only in development but also in many production environments. This valve should probably be disabled by default.
 
I have now moved the code I previously posted into a Valve. I have attached a patch for Tomcat 7.0.11 if anyone wants to add stuck thread detection logic to their Tomcat instance.
Comment 10 Sylvain Laurent 2011-04-07 17:33:10 UTC
Tom, thanks for your contribution. I committed the valve to trunk (http://svn.apache.org/viewvc?view=revision&revision=1090003) with the following changes:

- fixed a race condition
- removed the minimum check for the threshold
- allowed any Container
- added more information to the logged warning (request, threshold)
- exposed it via JMX, including the IDs of the stuck threads
- added documentation

It will be available in Tomcat 7.0.13.
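For reference, a minimal sketch of enabling the committed valve in server.xml, based on the Tomcat 7 valve documentation: the threshold attribute is in seconds and defaults to 600, and since any Container is allowed, the element can be placed in an Engine, Host, or Context.

```xml
<Valve className="org.apache.catalina.valves.StuckThreadDetectionValve"
       threshold="120" />
```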
Comment 11 Konstantin Kolinko 2012-06-19 12:45:58 UTC
Created attachment 28961 [details]
2012-06-19_tc6_50306_StuckThreadDetectionValve.patch

Port of StuckThreadDetectionValve to Tomcat 6

I've also implemented the following additional features
(they will be applied to Tomcat 7 today and will be in 7.0.29):
- a getStuckThreadNames() property, inspectable through JMX
- thread ids added to the log messages
Comment 12 Konstantin Kolinko 2012-06-20 09:06:47 UTC
Created attachment 28966 [details]
2012-06-20_tc6_50306_StuckThreadDetectionValve.patch

Updated documentation and MBean description,
aligning proposed patch with Tomcat 7.
Comment 13 Konstantin Kolinko 2012-07-02 14:31:27 UTC
StuckThreadDetectionValve added to Tomcat 6 in r1356246 and will be in 6.0.36.