Bug 56960 - worker and listener thread dead-lock in graceful shutdown
Summary: worker and listener thread dead-lock in graceful shutdown
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_event (show other bugs)
Version: 2.4.10
Hardware: PC Solaris
: P2 major (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk, PatchAvailable
Depends on:
Blocks:
 
Reported: 2014-09-11 12:01 UTC by Zin UDA
Modified: 2015-01-23 08:46 UTC (History)
1 user (show)



Attachments
patch for Bug 56960 (801 bytes, patch)
2014-09-30 04:56 UTC, Zin UDA
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zin UDA 2014-09-11 12:01:16 UTC
In case of worker process graceful shutdown, listener thread cannot get worker for processing connections under asynchronous state because get_worker() will not set *have_idle_worker_p. This raise the dead-lock problem, that is the listener thread cannot get idle workers to process asynchronous (e.g. CONN_STATE_WRITE_COMPLETION) because the get_worker don't pass have_idle_worker, but workers are already waiting in ap_queue_pop_something().

This is because the ap_queue_info_wait_for_idler() will return APR_EOF with allocating idle worker (s.t. decreasing worker_queue_info->idlers) after ap_queue_info_term() but the get_worker() will not set *have_idle_worker_p in APR_EOF case. So the listener_thread() multiply call get_worker() for processing waiting connections and worker_queue_info->idlers goes to pt_zero.

---
diff -r -u httpd-2.4.10.orig//server/mpm/event/event.c httpd-2.4.10//server/mpm/event/event.c
--- httpd-2.4.10.orig//server/mpm/event/event.c Thu Jun 26 07:01:31 2014
+++ httpd-2.4.10//server/mpm/event/event.c      Thu Sep 11 19:04:52 2014
@@ -1271,13 +1271,13 @@
     else
         rc = ap_queue_info_try_get_idler(worker_queue_info);
 
-    if (rc == APR_SUCCESS) {
+    if (rc == APR_SUCCESS || APR_STATUS_IS_EOF(rc)) {
         *have_idle_worker_p = 1;
     }
     else if (!blocking && rc == APR_EAGAIN) {
         *all_busy = 1;
     }
-    else if (!APR_STATUS_IS_EOF(rc)) {
+    else {
         ap_log_error(APLOG_MARK, APLOG_ERR, rc, ap_server_conf, APLOGNO(00472)
                      "ap_queue_info_wait_for_idler failed.  "
                      "Attempting to shutdown process gracefully");
Comment 1 Zin UDA 2014-09-30 04:56:36 UTC
Created attachment 32071 [details]
patch for Bug 56960
Comment 2 Zin UDA 2014-09-30 04:58:30 UTC
How to reproduce...
1) enable mod_status with extended status (to check status)
2) set small value for MaxConnectionsPerChild eg. 1000 (to easy reproduce)
3) make dummy large file (>100M) under DocumentRoot.
4) make a heavy load with the large file, with many (>1000) concurrent connection

After several tens minutes, you can find some dead-locked httpd process. Thease process has following status
- Connections total is not 0, listen 'no'.
- Busy and idle thread is 0.
- have some Async connections.

This issue is CRITICAL because this not only happen in case of administrative graceful shutdown but also happen restart each worker process (e.g. exceed MaxConnectionsPerChild)
Comment 3 Zin UDA 2014-09-30 04:59:53 UTC
add keyword 'PatchAvailable' ...
Comment 4 jkaluza 2014-09-30 08:46:48 UTC
This patch seems to work properly with httpd-2.4.x and fixes the mentioned bug for me, but for some reason it fails to fix it in httpd-trunk. Although it improves the situation there, I still see some workers dead-locks (but not so many as without the patch).

Before committing this patch, I will try to find out what's different between trunk and 2.4.x in the event MPM causing this.
Comment 5 Yann Ylavic 2014-09-30 11:46:28 UTC
Possible duplicate of Bug 49504 (active children during graceful restart may trigger it)?
Comment 6 jkaluza 2014-10-03 12:10:22 UTC
So the remaining issue I see with event MPM is caused by http://svn.apache.org/r1605328 (according to svn-bisect)
Comment 7 Yann Ylavic 2014-10-03 13:47:31 UTC
Probably workers_were_busy and have_idle_worker should be declared back as before.
They have a both a meaningful value during the whole life of the listener thread...
Comment 8 Yann Ylavic 2014-10-03 14:17:02 UTC
(In reply to Yann Ylavic from comment #7)
> Probably workers_were_busy and have_idle_worker should be declared back as
> before.
> They have a both a meaningful value during the whole life of the listener
> thread...

In fact only have_idle_worker is concerned, workers_were_busy must be reset to 0 for each main loop.
Comment 9 jkaluza 2014-10-03 14:49:33 UTC
Yes, just wanted to write here that have_idle_worker must be preserved during iterations. I'm going to commit that and patch from this bug into trunk.
Comment 10 jkaluza 2014-10-06 06:13:29 UTC
Attached patch for the original issue committed in r1629577, patch for the issue I found while fixing this bug committed in r1629576.

Proposed for 2.4.x.
Comment 11 Yann Ylavic 2015-01-23 08:46:35 UTC
Backported to 2.4.11 in r1634526.