Summary: | httpd looping writing on a pipe and unresponsive (httpd <defunct>) | ||
---|---|---|---|
Product: | Apache httpd-2 | Reporter: | yd <yrd_us> |
Component: | mpm_prefork | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
Status: | REOPENED --- | ||
Severity: | critical | CC: | arekm, loic.etienne, mahudees, mark_a_evans, santhoshmukkadan, stixpjr, szg0000 |
Priority: | P1 | Keywords: | MassUpdate |
Version: | 2.4-HEAD | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | Check POD on poll() timeout |
Description
yd
2012-07-20 23:20:48 UTC
Hmmm, parent blocked writing to a full pipe with no readers... Unclear if the parent needs to use a timeout, or it is writing too many times, or ??? Do you have an easy way to reproduce this? Does this happen at about the same time that the error log shows other issues? Not very easy to reproduce, happens after around 48 hours of normal testing. The error log has the following errors. connect to listener on [::]:443 [Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443 [Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443 [Thu Jul 05 00:00:01 2012] [notice] Graceful restart requested, doing restart [Thu Jul 05 00:00:01 2012] [warn] (101)Network is unreachable: connect to listener on [::]:443 The Network unreachable message is seen continous for few hours, above are the last messages in the error log Following are the last messages in the access log product.js?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (259 us) 10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (91 us) 10.0.3.45 - - [05/Jul/2012:00:02:46 -0700] "GET /rollup-product.css?v=1a95dd38c7ca8035134db9ac4aa167f6 HTTP/1.1" 304 - (155 us) 10.0.3.45 - - [05/Jul/2012:00:02:47 -0700] "POST /mgmt/xmldata?p=dynamicStatus HTTP/1.1" 200 138 (10943 us) I suspect this has something to do with graceful restart from the logrotate. We did not see this problem before , but it is more prominent once we started getting the warning [warn] (101)Network is unreachable: connect to listener on [::]:443 which is repeated frequently in the log file. Could reproduce it again. Basically it is writing to the pod (pipe of death) and is stuck in it forever. All the child processes are in defunct state. httpd is no more able to server any further requests.. Can someone please respond. I am increasing the severity to blocker as it stops serving any requests and is practically not usable. Please let me know if any more info is needed as I still have the system in this state. (In reply to comment #4) > Could reproduce it again. Basically it is writing to the pod (pipe of death) > and is stuck in it forever. All the child processes are in defunct state. > httpd is no more able to server any further requests.. Can someone please > respond. I am increasing the severity to blocker as it stops serving any > requests and is practically not usable. Please let me know if any more info > is needed as I still have the system in this state. I am bumping into the exact same problem, but I cannot reproduce it. Could you please let me know how did you reproduce this problem? Thank you Backtrace: (gdb) #0 0xb7cd330e in write () from /lib/libpthread.so.0 (gdb) #1 0x080e1947 in apr_file_write () (gdb) #2 0x08080147 in ap_mpm_pod_signal () (gdb) #3 0x080b1aa1 in ap_mpm_run () (gdb) #4 0x08067ef2 in main () All child processes are <defunct> Server version: Apache/2.2.3 Server built: Oct 6 2011 18:06:51 Server's Module Magic Number: 20051115:3 Server loaded: APR 1.2.7, APR-Util 1.2.7 Compiled using: APR 1.2.7, APR-Util 1.2.7 Architecture: 32-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/" -D SUEXEC_BIN="//bin/suexec" -D DEFAULT_PIDLOG="/var/run/apache2.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="var/run/accept.lock" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="etc/apache/mime.types" -D SERVER_CONFIG_FILE="etc/apache/httpd.conf" Compiled in modules: core.c mod_authn_file.c mod_authn_default.c mod_authz_host.c mod_authz_groupfile.c mod_authz_user.c mod_authz_default.c mod_auth_basic.c mod_include.c mod_filter.c mod_log_config.c mod_env.c mod_setenvif.c mod_ssl.c prefork.c http_core.c mod_mime.c mod_status.c mod_autoindex.c mod_asis.c mod_cgi.c mod_negotiation.c mod_dir.c mod_actions.c mod_userdir.c mod_alias.c mod_so.c After looking at the source code and doing some more testing, here are the results: -This problem kicks in when the pipe-of-death becomes full (64 kbytes on a Linux system). When this happens, then the httpd parent process is stuck in a write() call, since the pod is blocking (?!) -The pipe-of-death is written to by the httpd parent process, one byte at a time, in its main loop, as long as the number or httpd child processes exceeds the "MaxSpareServers" setting in the httpd.conf file. -The pipe-of death is read by the httpd child processes, in their main loop, as long as there are HTTP requests sent to the server. -If the httpd server is idle for a long time (5 days or so), then the child processes are sleeping, either pending on a semaphore or in a system epoll_wait() call with a very long timeout (100 days ?!). -If the number of child processes exceeds the MaxSpareServers setting at the time the httpd server becomes idle, then with only the httpd parent process spinning, the pipe-of-death is constantly written to and no one is reading from it. -When the parent httpd process writes to the pipe-of-death, it also sends a "dummy" HTTP request to the child processes, making a TCP connection to the system loopback interface on the httpd listener port. -If for some reason the loopback interface and/or listener port are blocked by the Linux firewall for example, then this "dummy" HTTP request fails and never wakes the child processes up. -When, after the pipe-of-death becomes full, the first external HTTP request comes in, the httpd child processes wake up, process the requests, read the pipe-of-death and go <defunct>. -The parent httpd process cannot collect the <defunct> child processes, since it it blocked in the write() system call. In conclusion this problem is caused by a combination of: internal httpd design: -pipe-of-death blocks in write(). -child httpd epoll_wait() call does not timeout. system problems: -httpd listener port on the loopback interface is blocked. We could also observe the problem with 2.2.14-5ubuntu8.12 Reading (with cat) from the affected file descriptor (pipe of death) actually made the parent process stop blocking on write. The affected application was a test instance, with little traffic. This experimentally confirms the the analysis above. We observed that some processes try to read from the pipe of death from the file descriptor 0 (instead of 7 on our system, from which other processes read the pipe of death). dummy request read from file descriptor 0 dummy request fails (no ssl) file descriptor 0 closed read from file descriptor 0 (instead of pipe of death), error bad file descriptor loop to the next connection Note that the correct file descriptor 7 is open in the child process in question, but the variable supposed to store it contains 0 instead of 7. We also observed that all our processes are gradually affected by this problem, but we have no hypothesis what event triggers this dysfunction. Any updates on this issue? Still running into it on 2.4.10... Any ideas on how to force recreation? Looks like there needs to be a way to verify that the pod writes are being consumed? How would I do that? Able to see the issue in 2.2.31 as well, Did any one manage to find any workaround to overcome this issue or any resolution on this. Is there any concrete repro steps that would lead to this issue always. For me most of the logs and scenarios matched exactly as given here. I could observe this issue with 2.4.6 (default httpd package on CentOS 7). +1. Server version: Apache/2.4.29 (Unix) Server built: Jan 29 2018 09:45:53 Server's Module Magic Number: 20120211:68 Server loaded: APR 1.6.3, APR-UTIL 1.6.1 Compiled using: APR 1.6.3, APR-UTIL 1.6.1 Architecture: 64-bit Server MPM: prefork threaded: no forked: yes (variable process count) NetBSD 8.0_BETA Created attachment 35795 [details]
Check POD on poll() timeout
Does this patch help?
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd. As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd. If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question. If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with. Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated. Reported in comments on different 2.4.x version. So re-opening. |