Bug 53579

Summary:	httpd looping writing on a pipe and unresponsive (httpd <defunct>)
Product:	Apache httpd-2	Reporter:	yd <yrd_us>
Component:	mpm_prefork	Assignee:	Apache HTTPD Bugs Mailing List <bugs>
Status:	REOPENED ---
Severity:	critical	CC:	arekm, loic.etienne, mahudees, mark_a_evans, santhoshmukkadan, stixpjr, szg0000
Priority:	P1	Keywords:	MassUpdate
Version:	2.4-HEAD
Target Milestone:	---
Hardware:	PC
OS:	All
Attachments:	Check POD on poll() timeout

Description yd 2012-07-20 23:20:48 UTC

httpd shows 3 defunct children and is unable to serve any further requets. A lelnet connection to port 80 is successful though.

Version output 

httpd -V
Server version: Apache/2.2.22 (Unix)
Server built:   Jun 27 2012 23:19:33
Server's Module Magic Number: 20051115:30
Server loaded:  APR 1.4.5, APR-Util 1.4.1
Compiled using: APR 1.4.5, APR-Util 1.4.1
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/opt/tms"
 -D SUEXEC_BIN="/opt/tms/bin/suexec"
 -D DEFAULT_PIDLOG="logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="logs/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"



httpd -l
Compiled in modules:
  core.c
  mod_authn_file.c
  mod_authn_default.c
  mod_authz_host.c
  mod_authz_groupfile.c
  mod_authz_user.c
  mod_authz_default.c
  mod_auth_basic.c
  mod_filter.c
  mod_deflate.c
  mod_log_config.c
  mod_env.c
  mod_headers.c
  mod_version.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_mime.c
  mod_autoindex.c
  mod_cgi.c
  mod_dir.c
  mod_actions.c
  mod_alias.c
  mod_rewrite.c
  mod_so.c

Config file snippet

Timeout 3600
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15

StartServers         1
MinSpareServers      1
MaxSpareServers      2
MaxClients          16
MaxRequestsPerChild  0

Listen 80
Listen 443



Following is the output of netstat and status of the httpd

# netstat -nt | grep ":80"
tcp      296      0 ::ffff:10.0.1.233:80        ::ffff:10.18.3.14:62429     CLOSE_WAIT
tcp      287      0 ::ffff:10.0.1.233:80        ::ffff:10.18.0.233:51590    CLOSE_WAIT
tcp      304      0 ::ffff:10.0.1.233:80        ::ffff:10.18.0.233:51284    CLOSE_WAIT
tcp      304      0 ::ffff:10.0.1.233:80        ::ffff:10.18.0.233:51576    CLOSE_WAIT
tcp      296      0 ::ffff:10.0.1.233:80        ::ffff:10.18.3.14:58146     CLOSE_WAIT
tcp      287      0 ::ffff:10.0.1.233:80        ::ffff:10.18.0.233:51783    CLOSE_WAIT
tcp      287      0 ::ffff:10.0.1.233:80        ::ffff:10.18.0.233:51761    CLOSE_WAIT

# ps -ef | grep httpd
admin    11463 19173  0 21:53 pts/0    00:00:00 grep httpd
admin    12084 24519  0 Jun29 ?        00:00:19 /usr/sbin/httpd -D NO_DETACH -f /etc/opt/tms/output/httpd.conf
apache   21089 12084  0 Jul05 ?        00:00:00 [httpd] <defunct>
apache   31445 12084  0 Jul05 ?        00:00:00 [httpd] <defunct>
apache   31596 12084  0 Jul05 ?        00:00:00 [httpd] <defunct>

gdb -p 12084 /usr/sbin/httpd shows that the top level process is here:

(gdb) where
#0  0x00007f638884e470 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000483c87 in apr_file_write ()
#2  0x00000000004273af in ap_mpm_pod_signal ()
#3  0x000000000045f3c3 in ap_mpm_run ()
#4  0x000000000040a664 in main ()

Apparently, it is getting stuck writing something to a file. 
ap_mpm_pod_signal() effectively just called apr_file_write(), where it is
getting stuck.

Using an httpd built with -g , the stack trace from gdb looks like this:

(gdb) where
#0  0x00007f638884e470 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000483c87 in apr_file_write (thefile=0x19314b0, 
    buf=0x7fffb849709f, nbytes=0x7fffb8497090) at file_io/unix/readwrite.c:188
#2  0x00000000004273af in pod_signal_internal (pod=0x1931420)
    at mpm_common.c:630
#3  ap_mpm_pod_signal (pod=0x1931420) at mpm_common.c:726
#4  0x000000000045f3c3 in perform_idle_server_maintenance (
    _pconf=<value optimized out>, plog=<value optimized out>, 
    s=<value optimized out>) at prefork.c:867
#5  ap_mpm_run (_pconf=<value optimized out>, plog=<value optimized out>, 
    s=<value optimized out>) at prefork.c:1107
#6  0x000000000040a664 in main (argc=5, argv=0x7fffb8497448) at main.c:753

thefile is as follows:

(gdb) frame 1
#1  0x0000000000483c87 in apr_file_write (thefile=0x19314b0, 
    buf=0x7fffb849709f, nbytes=0x7fffb8497090) at file_io/unix/readwrite.c:188
188     file_io/unix/readwrite.c: No such file or directory.
        in file_io/unix/readwrite.c
(gdb) print thefile
$1 = (apr_file_t *) 0x19314b0
(gdb) print *thefile
$2 = {
  pool = 0x18fa138, 
  filedes = 8, 
  fname = 0x0, 
  flags = 0, 
  eof_hit = 0, 
  is_pipe = 1, 
  timeout = -1, 
  buffered = 0, 
  blocking = BLK_ON, 
  ungetchar = 0, 
  buffer = 0x0, 
  bufpos = 0, 
  bufsize = 0, 
  dataRead = 0, 
  direction = 0, 
  filePtr = 0, 
  thlock = 0x0
}
(gdb) 

In other words, it is writing to a pipe (thefile->ispipe == 1).

Note that line 188 of file_io/unix/readwrite.c is this:

187         do {
188             rv = write(thefile->filedes, buf, *nbytes);
189         } while (rv == (apr_size_t)-1 && errno == EINTR);

Comment 1 Jeff Trawick 2012-07-21 14:21:00 UTC

Hmmm, parent blocked writing to a full pipe with no readers...  Unclear if the parent needs to use a timeout, or it is writing too many times, or ???

Do you have an easy way to reproduce this?  Does this happen at about the same time that the error log shows other issues?

Comment 2 yd 2012-07-23 18:19:38 UTC

Not very easy to reproduce, happens after around 48 hours of normal testing. The error log has the following errors.

connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [notice] Graceful restart requested, doing restart
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443

The Network unreachable message is seen continous for few hours, above are the last messages in the error log

Following are the last messages in the access log

product.js?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (259 us)
10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (91 us)
10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup-product.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (155 us)
10.0.3.45 - - [05/Jul/2012:00:02:47 -0700] "POST /mgmt/xmldata?p=dynamicStatus HTTP/1.1" 200 138 (10943 us)

Comment 3 yd 2012-07-25 18:26:44 UTC

I suspect this has something to do with graceful restart from the logrotate. We did not see this problem before , but it is more prominent once we started getting the warning 
[warn] (101)Network is unreachable: connect to listener on [::]:443
which is repeated frequently in the log file.

Comment 4 yd 2012-08-03 02:48:21 UTC

Could reproduce it again. Basically it is writing to the pod (pipe of death) and is stuck in it forever. All the child processes are in defunct state. httpd is no more able to server any further requests.. Can someone please respond. I am increasing the severity to blocker as it stops serving any requests and is practically not usable.

Please let me know if any more info is needed as I still have the system in this state.

Comment 5 george 2012-11-01 18:37:30 UTC

(In reply to comment #4)
> Could reproduce it again. Basically it is writing to the pod (pipe of death)
> and is stuck in it forever. All the child processes are in defunct state.
> httpd is no more able to server any further requests.. Can someone please
> respond. I am increasing the severity to blocker as it stops serving any
> requests and is practically not usable.

Please let me know if any more info
> is needed as I still have the system in this state.

I am bumping into the exact same problem, but I cannot reproduce it.
Could you please let me know how did you reproduce this problem?
Thank you


Backtrace:

(gdb) #0  0xb7cd330e in write () from /lib/libpthread.so.0
(gdb) #1  0x080e1947 in apr_file_write ()
(gdb) #2  0x08080147 in ap_mpm_pod_signal ()
(gdb) #3  0x080b1aa1 in ap_mpm_run ()
(gdb) #4  0x08067ef2 in main ()

All child processes are <defunct>


Server version: Apache/2.2.3
Server built:   Oct  6 2011 18:06:51
Server's Module Magic Number: 20051115:3
Server loaded:  APR 1.2.7, APR-Util 1.2.7
Compiled using: APR 1.2.7, APR-Util 1.2.7
Architecture:   32-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/"
 -D SUEXEC_BIN="//bin/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="var/run/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="etc/apache/mime.types"
 -D SERVER_CONFIG_FILE="etc/apache/httpd.conf"

Compiled in modules:
  core.c
  mod_authn_file.c
  mod_authn_default.c
  mod_authz_host.c
  mod_authz_groupfile.c
  mod_authz_user.c
  mod_authz_default.c
  mod_auth_basic.c
  mod_include.c
  mod_filter.c
  mod_log_config.c
  mod_env.c
  mod_setenvif.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_mime.c
  mod_status.c
  mod_autoindex.c
  mod_asis.c
  mod_cgi.c
  mod_negotiation.c
  mod_dir.c
  mod_actions.c
  mod_userdir.c
  mod_alias.c
  mod_so.c

Comment 6 george 2012-11-16 20:26:41 UTC

After looking at the source code and doing some more testing, here are the results:

-This problem kicks in when the pipe-of-death becomes full (64 kbytes on a Linux system). When this happens, then the httpd parent process is stuck in a write() call, since the pod is blocking (?!)
-The pipe-of-death is written to by the httpd parent process, one byte at a time, in its main loop, as long as the number or httpd child processes exceeds the "MaxSpareServers" setting in the httpd.conf file.
-The pipe-of death is read by the httpd child processes, in their main loop, as long as there are HTTP requests sent to the server.
-If the httpd server is idle for a long time (5 days or so), then the child processes are sleeping, either pending on a semaphore or in a system epoll_wait() call with a very long timeout (100 days ?!).
-If the number of child processes exceeds the MaxSpareServers setting at the time the httpd server becomes idle, then with only the httpd parent process spinning, the pipe-of-death is constantly written to and no one is reading from it.
-When the parent httpd process writes to the pipe-of-death, it also sends a "dummy" HTTP request to the child processes, making a TCP connection to the system loopback interface on the httpd listener port.
-If for some reason the loopback interface and/or listener port are blocked by the Linux firewall for example, then this "dummy" HTTP request fails and never wakes the child processes up.
-When, after the pipe-of-death becomes full, the first external HTTP request comes in, the httpd child processes wake up, process the requests, read the pipe-of-death and go <defunct>.
-The parent httpd process cannot collect the <defunct> child processes, since it it blocked in the write() system call.

In conclusion this problem is caused by a combination of:

internal httpd design:
-pipe-of-death blocks in write().
-child httpd epoll_wait() call does not timeout.

system problems:
-httpd listener port on the loopback interface is blocked.

Comment 7 Loic Etienne 2014-06-05 15:35:52 UTC

We could also observe the problem with 2.2.14-5ubuntu8.12

Reading (with cat) from the affected file descriptor (pipe of death) actually made the parent process stop blocking on write. The affected application was a test instance, with little traffic. This experimentally confirms the the analysis above.

Comment 8 Loic Etienne 2014-06-06 16:57:07 UTC

We observed that some processes try to read from the pipe of death from the file descriptor 0 (instead of 7 on our system, from which other processes read the pipe of death).

dummy request read from file descriptor 0
dummy request fails (no ssl)
file descriptor 0 closed
read from file descriptor 0 (instead of pipe of death), error bad file descriptor
loop to the next connection

Note that the correct file descriptor 7 is open in the child process in question, but the variable supposed to store it contains 0 instead of 7.

We also observed that all our processes are gradually affected by this problem, but we have no hypothesis what event triggers this dysfunction.

Comment 9 Mark Evans 2015-04-01 18:08:14 UTC

Any updates on this issue? Still running into it on 2.4.10... Any ideas on how to force recreation? Looks like there needs to be a way to verify that the pod writes are being consumed? How would I do that?

Comment 10 Mahudeeswaran 2016-02-16 11:13:06 UTC

Able to see the issue in 2.2.31 as well, Did any one manage to find any workaround to overcome this issue or any resolution on this. Is there any concrete repro steps that would lead to this issue always.

For me most of the logs and scenarios matched exactly as given here.

Comment 11 Andrea Leofreddi 2016-06-15 20:04:51 UTC

I could observe this issue with 2.4.6 (default httpd package on CentOS 7).

Comment 12 Paul Ripke 2018-03-21 00:32:32 UTC

+1.

Server version: Apache/2.4.29 (Unix)
Server built:   Jan 29 2018 09:45:53
Server's Module Magic Number: 20120211:68
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1
Compiled using: APR 1.6.3, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     prefork
  threaded:     no
    forked:     yes (variable process count)

NetBSD 8.0_BETA

Comment 13 Yann Ylavic 2018-03-21 01:23:54 UTC

Created attachment 35795 [details]
Check POD on poll() timeout

Does this patch help?

Comment 14 William A. Rowe Jr. 2018-11-07 21:08:53 UTC

Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.

Comment 15 Christophe JAILLET 2018-11-09 21:50:11 UTC

Reported in comments on different 2.4.x version.
So re-opening.