httpd shows 3 defunct children and is unable to serve any further requests. A telnet connection to port 80 is successful though.

Version output (httpd -V):

Server version: Apache/2.2.22 (Unix)
Server built: Jun 27 2012 23:19:33
Server's Module Magic Number: 20051115:30
Server loaded: APR 1.4.5, APR-Util 1.4.1
Compiled using: APR 1.4.5, APR-Util 1.4.1
Architecture: 64-bit
Server MPM: Prefork
  threaded: no
  forked: yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/opt/tms"
 -D SUEXEC_BIN="/opt/tms/bin/suexec"
 -D DEFAULT_PIDLOG="logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="logs/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

httpd -l:

Compiled in modules:
  core.c
  mod_authn_file.c
  mod_authn_default.c
  mod_authz_host.c
  mod_authz_groupfile.c
  mod_authz_user.c
  mod_authz_default.c
  mod_auth_basic.c
  mod_filter.c
  mod_deflate.c
  mod_log_config.c
  mod_env.c
  mod_headers.c
  mod_version.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_mime.c
  mod_autoindex.c
  mod_cgi.c
  mod_dir.c
  mod_actions.c
  mod_alias.c
  mod_rewrite.c
  mod_so.c

Config file snippet:

Timeout 3600
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
StartServers 1
MinSpareServers 1
MaxSpareServers 2
MaxClients 16
MaxRequestsPerChild 0
Listen 80
Listen 443

Following is the output of netstat and the status of httpd:

# netstat -nt | grep ":80"
tcp 296 0 ::ffff:10.0.1.233:80 ::ffff:10.18.3.14:62429 CLOSE_WAIT
tcp 287 0 ::ffff:10.0.1.233:80 ::ffff:10.18.0.233:51590 CLOSE_WAIT
tcp 304 0 ::ffff:10.0.1.233:80 ::ffff:10.18.0.233:51284 CLOSE_WAIT
tcp 304 0 ::ffff:10.0.1.233:80 ::ffff:10.18.0.233:51576 CLOSE_WAIT
tcp 296 0 ::ffff:10.0.1.233:80 ::ffff:10.18.3.14:58146 CLOSE_WAIT
tcp 287 0 ::ffff:10.0.1.233:80 ::ffff:10.18.0.233:51783 CLOSE_WAIT
tcp 287 0 ::ffff:10.0.1.233:80 ::ffff:10.18.0.233:51761 CLOSE_WAIT

# ps -ef | grep httpd
admin  11463 19173 0 21:53 pts/0 00:00:00 grep httpd
admin  12084 24519 0 Jun29 ?     00:00:19 /usr/sbin/httpd -D NO_DETACH -f /etc/opt/tms/output/httpd.conf
apache 21089 12084 0 Jul05 ?     00:00:00 [httpd] <defunct>
apache 31445 12084 0 Jul05 ?     00:00:00 [httpd] <defunct>
apache 31596 12084 0 Jul05 ?     00:00:00 [httpd] <defunct>

gdb -p 12084 /usr/sbin/httpd shows that the top-level process is here:

(gdb) where
#0  0x00007f638884e470 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000483c87 in apr_file_write ()
#2  0x00000000004273af in ap_mpm_pod_signal ()
#3  0x000000000045f3c3 in ap_mpm_run ()
#4  0x000000000040a664 in main ()

Apparently, it is getting stuck writing something to a file. ap_mpm_pod_signal() effectively just calls apr_file_write(), which is where it gets stuck.
Using an httpd built with -g, the stack trace from gdb looks like this:

(gdb) where
#0  0x00007f638884e470 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000483c87 in apr_file_write (thefile=0x19314b0, buf=0x7fffb849709f, nbytes=0x7fffb8497090) at file_io/unix/readwrite.c:188
#2  0x00000000004273af in pod_signal_internal (pod=0x1931420) at mpm_common.c:630
#3  ap_mpm_pod_signal (pod=0x1931420) at mpm_common.c:726
#4  0x000000000045f3c3 in perform_idle_server_maintenance (_pconf=<value optimized out>, plog=<value optimized out>, s=<value optimized out>) at prefork.c:867
#5  ap_mpm_run (_pconf=<value optimized out>, plog=<value optimized out>, s=<value optimized out>) at prefork.c:1107
#6  0x000000000040a664 in main (argc=5, argv=0x7fffb8497448) at main.c:753

thefile is as follows:

(gdb) frame 1
#1  0x0000000000483c87 in apr_file_write (thefile=0x19314b0, buf=0x7fffb849709f, nbytes=0x7fffb8497090) at file_io/unix/readwrite.c:188
188     file_io/unix/readwrite.c: No such file or directory.
        in file_io/unix/readwrite.c
(gdb) print thefile
$1 = (apr_file_t *) 0x19314b0
(gdb) print *thefile
$2 = {
  pool = 0x18fa138,
  filedes = 8,
  fname = 0x0,
  flags = 0,
  eof_hit = 0,
  is_pipe = 1,
  timeout = -1,
  buffered = 0,
  blocking = BLK_ON,
  ungetchar = 0,
  buffer = 0x0,
  bufpos = 0,
  bufsize = 0,
  dataRead = 0,
  direction = 0,
  filePtr = 0,
  thlock = 0x0
}
(gdb)

In other words, it is writing to a pipe (thefile->is_pipe == 1).

Note that line 188 of file_io/unix/readwrite.c is this:

187     do {
188         rv = write(thefile->filedes, buf, *nbytes);
189     } while (rv == (apr_size_t)-1 && errno == EINTR);
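For what it is worth, the blocking behavior is easy to see outside of httpd. Below is a minimal standalone C sketch (not httpd or APR code, just an illustration) that writes single bytes into a pipe with no reader, mirroring the retry loop at readwrite.c:188; on a default Linux pipe it hangs inside write() after roughly 64 KiB.

/* Minimal sketch: a blocking write() to a pipe hangs forever once the
 * pipe buffer is full and nothing reads from the other end, which is
 * exactly the state the parent is stuck in above. */
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    int fds[2];
    char byte = '!';            /* the pod payload is a single char */
    long written = 0;

    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    for (;;) {
        ssize_t rv;
        do {
            /* blocks for good once the buffer (typically 64 KiB on Linux) is full */
            rv = write(fds[1], &byte, 1);
        } while (rv == -1 && errno == EINTR);

        written++;
        if (written % 8192 == 0)
            fprintf(stderr, "wrote %ld bytes so far\n", written);
    }
}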
Hmmm, parent blocked writing to a full pipe with no readers... Unclear if the parent needs to use a timeout, or it is writing too many times, or ??? Do you have an easy way to reproduce this? Does this happen at about the same time that the error log shows other issues?
Not very easy to reproduce; it happens after around 48 hours of normal testing. The error log has the following errors:

connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443
[Thu Jul 05 00:00:01 2012] [notice] Graceful restart requested, doing restart
[Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443

The "Network is unreachable" message is seen continuously for a few hours; the above are the last messages in the error log.

Following are the last messages in the access log:

product.js?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (259 us)
10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (91 us)
10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup-product.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (155 us)
10.0.3.45 - - [05/Jul/2012:00:02:47 -0700] "POST /mgmt/xmldata?p=dynamicStatus HTTP/1.1" 200 138 (10943 us)
I suspect this has something to do with the graceful restart from logrotate. We did not see this problem before, but it became more prominent once we started getting the warning

[warn] (101)Network is unreachable: connect to listener on [::]:443

which is repeated frequently in the log file.
Could reproduce it again. Basically it is writing to the pod (pipe of death) and is stuck in it forever. All the child processes are in the defunct state. httpd is no longer able to serve any further requests. Can someone please respond? I am increasing the severity to blocker, as it stops serving any requests and is practically not usable. Please let me know if any more info is needed, as I still have the system in this state.
(In reply to comment #4)
> Could reproduce it again. Basically it is writing to the pod (pipe of death)
> and is stuck in it forever. All the child processes are in the defunct state.
> httpd is no longer able to serve any further requests. Can someone please
> respond? I am increasing the severity to blocker, as it stops serving any
> requests and is practically not usable. Please let me know if any more info
> is needed, as I still have the system in this state.

I am bumping into the exact same problem, but I cannot reproduce it. Could you please let me know how you reproduced this problem? Thank you.

Backtrace:

(gdb) #0 0xb7cd330e in write () from /lib/libpthread.so.0
(gdb) #1 0x080e1947 in apr_file_write ()
(gdb) #2 0x08080147 in ap_mpm_pod_signal ()
(gdb) #3 0x080b1aa1 in ap_mpm_run ()
(gdb) #4 0x08067ef2 in main ()

All child processes are <defunct>.

Server version: Apache/2.2.3
Server built: Oct 6 2011 18:06:51
Server's Module Magic Number: 20051115:3
Server loaded: APR 1.2.7, APR-Util 1.2.7
Compiled using: APR 1.2.7, APR-Util 1.2.7
Architecture: 32-bit
Server MPM: Prefork
  threaded: no
  forked: yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/"
 -D SUEXEC_BIN="//bin/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="var/run/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="etc/apache/mime.types"
 -D SERVER_CONFIG_FILE="etc/apache/httpd.conf"

Compiled in modules:
  core.c
  mod_authn_file.c
  mod_authn_default.c
  mod_authz_host.c
  mod_authz_groupfile.c
  mod_authz_user.c
  mod_authz_default.c
  mod_auth_basic.c
  mod_include.c
  mod_filter.c
  mod_log_config.c
  mod_env.c
  mod_setenvif.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_mime.c
  mod_status.c
  mod_autoindex.c
  mod_asis.c
  mod_cgi.c
  mod_negotiation.c
  mod_dir.c
  mod_actions.c
  mod_userdir.c
  mod_alias.c
  mod_so.c
After looking at the source code and doing some more testing, here are the results:

- This problem kicks in when the pipe-of-death becomes full (64 kbytes on a Linux system). When this happens, the httpd parent process is stuck in a write() call, since the pod is blocking (?!).
- The pipe-of-death is written to by the httpd parent process, one byte at a time, in its main loop, as long as the number of httpd child processes exceeds the "MaxSpareServers" setting in the httpd.conf file.
- The pipe-of-death is read by the httpd child processes, in their main loop, as long as there are HTTP requests sent to the server.
- If the httpd server is idle for a long time (5 days or so), then the child processes are sleeping, either pending on a semaphore or in a system epoll_wait() call with a very long timeout (100 days ?!).
- If the number of child processes exceeds the MaxSpareServers setting at the time the httpd server becomes idle, then with only the httpd parent process spinning, the pipe-of-death is constantly written to and no one is reading from it.
- When the parent httpd process writes to the pipe-of-death, it also sends a "dummy" HTTP request to the child processes, making a TCP connection to the system loopback interface on the httpd listener port.
- If for some reason the loopback interface and/or listener port are blocked (by the Linux firewall, for example), then this "dummy" HTTP request fails and never wakes the child processes up.
- When, after the pipe-of-death becomes full, the first external HTTP request comes in, the httpd child processes wake up, process the requests, read the pipe-of-death and go <defunct>.
- The parent httpd process cannot collect the <defunct> child processes, since it is blocked in the write() system call.

In conclusion, this problem is caused by a combination of:

internal httpd design:
- the pipe-of-death write() blocks.
- the child httpd epoll_wait() call does not time out.

system problems:
- the httpd listener port on the loopback interface is blocked.
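To make the first point concrete, here is a rough sketch of one possible mitigation. This is not the actual httpd code and not a proposed patch, just an illustration under the assumption that the pod's write end could be put in non-blocking mode: filling the pipe then yields EAGAIN instead of a hang, and the parent could simply skip the signal when the pipe is already full of unread exit tokens.

/* Sketch only, not httpd/APR code: with a non-blocking write end, a full
 * pipe produces EAGAIN rather than blocking the writer forever. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    int pod[2];
    char token = '!';          /* the pod payload is a single char */
    long sent = 0;

    if (pipe(pod) == -1) {
        perror("pipe");
        return 1;
    }
    /* Assumption for illustration: make the write end non-blocking. */
    fcntl(pod[1], F_SETFL, fcntl(pod[1], F_GETFL) | O_NONBLOCK);

    for (;;) {
        ssize_t rv;
        do {
            rv = write(pod[1], &token, 1);
        } while (rv == -1 && errno == EINTR);

        if (rv == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Pipe is full; the children already have plenty of unread
             * exit tokens, so dropping this one would be harmless. */
            printf("pipe full after %ld bytes, would skip the signal\n", sent);
            return 0;
        }
        sent++;
    }
}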
We could also observe the problem with 2.2.14-5ubuntu8.12.

Reading (with cat) from the affected file descriptor (the pipe of death) actually made the parent process stop blocking on write. The affected application was a test instance with little traffic. This experimentally confirms the analysis above.
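For anyone who wants the same emergency unblocking without cat, here is a small C sketch. It assumes a Linux /proc filesystem; <pid> and <fd> are placeholders for the stuck parent's pid and the pipe's read-end descriptor (ls -l /proc/<pid>/fd shows which descriptors are pipe ends). Draining the queued bytes lets the parent's blocked write() complete.

/* Emergency drain sketch, equivalent in effect to the "cat" workaround
 * described above. Opens the pipe's read end via /proc and reads until
 * the pipe is empty. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    char path[64], buf[4096];
    ssize_t n;
    long drained = 0;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <read-end-fd>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd/%s", argv[1], argv[2]);

    /* O_NONBLOCK so we stop once the pipe is empty instead of blocking. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        drained += n;
    if (n == -1 && errno != EAGAIN)
        perror("read");

    printf("drained %ld byte(s) from %s\n", drained, path);
    close(fd);
    return 0;
}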
We observed that some processes try to read the pipe of death from file descriptor 0 (instead of 7 on our system, from which the other processes read the pipe of death):

- dummy request read from file descriptor 0
- dummy request fails (no ssl)
- file descriptor 0 closed
- read from file descriptor 0 (instead of the pipe of death), error: bad file descriptor
- loop to the next connection

Note that the correct file descriptor 7 is open in the child process in question, but the variable supposed to store it contains 0 instead of 7. We also observed that all our processes are gradually affected by this problem, but we have no hypothesis about what event triggers this dysfunction.
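In case it helps anyone comparing descriptor numbers, here is a small diagnostic sketch (assuming a Linux /proc filesystem; the pid argument is a placeholder) that lists a process's open descriptors. Pipe ends show up as pipe:[inode], which makes it easy to see which fd numbers in a given child actually refer to the pipe of death.

/* List a process's open file descriptors and what they point to. */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    char dirpath[64], linkpath[PATH_MAX], target[PATH_MAX];
    DIR *dir;
    struct dirent *de;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(dirpath, sizeof(dirpath), "/proc/%s/fd", argv[1]);

    if ((dir = opendir(dirpath)) == NULL) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, de->d_name);
        ssize_t n = readlink(linkpath, target, sizeof(target) - 1);
        if (n > 0) {
            target[n] = '\0';
            /* pipe descriptors appear as "pipe:[inode]" */
            printf("fd %s -> %s\n", de->d_name, target);
        }
    }
    closedir(dir);
    return 0;
}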
Any updates on this issue? Still running into it on 2.4.10... Any ideas on how to force recreation? Looks like there needs to be a way to verify that the pod writes are being consumed? How would I do that?
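One way to check whether the pod writes are being consumed is sketched below. It assumes Linux; <pid> and <fd> are placeholders for the parent's pid and the pipe's read-end descriptor. FIONREAD on a pipe reports the number of unread bytes (see pipe(7)); if that number keeps growing toward the default 64 KiB capacity, nothing is reading the pipe of death.

/* Report how many bytes are sitting unread in the pipe-of-death. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
    char path[64];
    int unread = 0;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <read-end-fd>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd/%s", argv[1], argv[2]);

    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, FIONREAD, &unread) == -1) {
        perror("ioctl(FIONREAD)");
        close(fd);
        return 1;
    }
    printf("%d unread byte(s) in the pipe-of-death\n", unread);
    close(fd);
    return 0;
}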
Able to see the issue in 2.2.31 as well. Did anyone manage to find a workaround for this issue, or is there any resolution? Are there any concrete repro steps that always lead to this issue? For me, most of the logs and scenarios matched exactly what is described here.
I could observe this issue with 2.4.6 (default httpd package on CentOS 7).
+1.

Server version: Apache/2.4.29 (Unix)
Server built: Jan 29 2018 09:45:53
Server's Module Magic Number: 20120211:68
Server loaded: APR 1.6.3, APR-UTIL 1.6.1
Compiled using: APR 1.6.3, APR-UTIL 1.6.1
Architecture: 64-bit
Server MPM: prefork
  threaded: no
  forked: yes (variable process count)

NetBSD 8.0_BETA
Created attachment 35795 [details]
Check POD on poll() timeout

Does this patch help?
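For readers without access to the attachment: the following is not the patch itself, only a self-contained sketch of the idea its title suggests, under the assumption that a child currently waits for connections with an effectively infinite timeout. With a finite poll() timeout the child can notice a pending pipe-of-death token even when the dummy wake-up connection never arrives; all descriptors below are stand-ins, not httpd's.

/* Toy demonstration: the "parent" writes one pod token and makes no
 * dummy connection at all, yet the "child" still exits because it checks
 * the pod whenever its finite poll() timeout fires. */
#include <poll.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>

static void child_loop(int listen_fd, int pod_read_fd)
{
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };
    char c;

    for (;;) {
        int rc = poll(&pfd, 1, 1000 /* ms, finite instead of infinite */);
        if (rc > 0) {
            /* a real child would accept() and serve the request here */
        }
        else if (rc == -1 && errno != EINTR) {
            break;
        }
        /* timeout (or after a request): consume one pod token if pending;
         * pod_read_fd is non-blocking, so an empty pipe is just EAGAIN. */
        if (read(pod_read_fd, &c, 1) == 1) {
            printf("child: pod token received, exiting\n");
            _exit(0);
        }
    }
}

int main(void)
{
    int pod[2], dummy_listener[2];

    /* pod[0] = read end used by the child, pod[1] = write end used by
     * the parent; a second pipe merely stands in for the listener fd. */
    if (pipe(pod) == -1 || pipe(dummy_listener) == -1) {
        perror("pipe");
        return 1;
    }
    fcntl(pod[0], F_SETFL, fcntl(pod[0], F_GETFL) | O_NONBLOCK);

    if (fork() == 0) {
        child_loop(dummy_listener[0], pod[0]);
        return 0;
    }

    sleep(2);                       /* parent: later, ask the child to exit */
    (void)write(pod[1], "!", 1);    /* note: no dummy connection is made */
    wait(NULL);                     /* child exits anyway, via the timeout check */
    return 0;
}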
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.
Reported in the comments against different 2.4.x versions, so re-opening.