We are seeing sporadic instances of what appears to be a bug in httpd. The symptoms are that httpd hangs on polling and reading. Here's some pertinent lines from an strace log (I've separated the "hanging" sections with a space at both ends): 51988 31737 write(8, "some.ip.address - - [09/Nov/2010:17:47:10 -0600] \"GET /learn/ HTTP/1.1\" 200 8719 "..., 222) = 222 51989 31737 shutdown(11, 1 /* send */) = 0 51990 31737 poll([{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1 51991 31737 read(11, "", 512) = 0 51992 31737 close(11) = 0 51993 31737 read(5, 0xbfedeccf, 1) = -1 EAGAIN (Resource temporarily unavailable) 51994 31737 semop(1605639, 0x12fb44, 1 <unfinished ...> 51995 3551 <... select resumed> ) = 0 (Timeout) 51996 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 51997 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 51998 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 51999 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 52000 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 52001 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 52002 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 52003 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 52004 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 52005 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 52006 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 52007 3551 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) 52008 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 52009 3551 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...> 52010 4710 <... poll resumed> [{fd=4, events=POLLIN, revents=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 52011 4710 accept(4, {sa_family=AF_INET6, sin6_port=htons(63556), inet_pton(AF_INET6, "::ffff:some.ip.address", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 1 1 52012 4710 semop(1605639, 0x12fb4a, 1 <unfinished ...> 52013 14265 <... semop resumed> ) = 0 52014 4710 <... semop resumed> ) = 0 // ... 58900 31732 write(13, "\371\0\0\0\3SELECT * from trefPageModule\r\n\t\t\t\t\t\t\t\t\t\t LEFT JOIN tPage ON tPage.pageID=tr"..., 253) = 253 58901 31732 read(13, <unfinished ...> 58902 3551 <... select resumed> ) = 0 (Timeout) 58903 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 58904 3551 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...> 58905 31732 <... read resumed> "\1\0\0\1\34<\0\0\2\3def\6MySite\16trefPageModule\16trefPageModule\2ID\2ID\f?\0\v\0\0\0\3\3B\0\0\0D\0\0\3\3def\6Ut"..., 16384) = 1871 58906 31732 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 0) = 0 // ... 60626 31732 close(15) = 0 60627 31732 rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0 60628 31732 brk(0x9392000) = 0x9392000 60629 31732 brk(0x9390000) = 0x9390000 60630 31732 brk(0x938e000) = 0x938e000 60631 31732 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 60632 31732 writev(11, [{"\27\3\1\f\360\355\214<(\213r\to\10\261J\254\327-!A\242\17\6U\336I3\370\335\265\323\214\236{H\20\344\323\324\221\330~\345\314\246\237\245\341\25 0\334\335\2\201\201\274\300\360\235\277\350\n\335\373\230\236\225b\360<\224\200\272\252=p\32\360\0361"..., 3391}], 1) = 3391 60633 31732 write(8, "some.ip.address - - [09/Nov/2010:17:49:10 -0600] \"GET /mysite/managePages/ HTTP/1"..., 240) = 240 60634 31732 writev(11, [{"\25\3\1\0 \226\304\335\336\312\224\257\300\352#G|\5Xjq3.#\301\236\331\352\350&\313\330&5\26\210\324", 37}], 1) = 37 60635 31732 shutdown(11, 1 /* send */) = 0 60636 31732 poll( <unfinished ...> 60637 3551 <... select resumed> ) = 0 (Timeout) 60638 3551 waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0 60639 3551 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...> 60640 31732 <... poll resumed> [{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1 60641 31732 read(11, "\25\3\1\0 i-B\27\367\v\252\364\372\r\30\214\1\213\233t\25\227O\234ei[\334\272\371@\2127\211\241}", 512) = 37 60642 31732 poll([{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1 60643 31732 read(11, "", 512) = 0
Please try to get a stacktrace of the hanging thread (http://httpd.apache.org/dev/debugging.html). From the strace it looks like you are using some 3rd party "app" module like php in your httpd. What are you using?
(In reply to comment #1) > Please try to get a stacktrace of the hanging thread > (http://httpd.apache.org/dev/debugging.html). From the strace it looks like you > are using some 3rd party "app" module like php in your httpd. What are you > using? We are using php. I will have to wait until the next time this occurs--it appears to be sporadic, with a frequency of about once a week, roughly. Haven't been able to tie to any specific event or action conclusively.
Hi there, Although I don't know anything about Alex's bug, we experience something that looks like a bug that might be very similar so I thought to add more info here rather than opening a new one. The httpd in question is 2.2.16 running on EC2. We see that after a while (10 minutes) of a significant load, some of the worker processes get hang on poll(). This is the stacktrace from one of the processes that are in this state: #0 0x00007f76fa70c748 in poll () from /lib64/libc.so.6 #1 0x00007f76fabe2822 in apr_wait_for_io_or_timeout () from /usr/lib64/libapr-1.so.0 #2 0x00007f76fabdd2da in apr_socket_recv () from /usr/lib64/libapr-1.so.0 #3 0x00007f76fc0dfddd in ap_lingering_close () #4 0x00007f76fc0eaa7e in ?? () #5 0x00007f76fc0ead1a in ?? () #6 0x00007f76fc0eadd0 in ?? () #7 0x00007f76fc0eb908 in ap_mpm_run () #8 0x00007f76fc0c54fb in main () Please let me know if there's any more information I can provide - this problem is obviously a major concern to us. Thanks, Tal
> Although I don't know anything about Alex's bug, we experience something that > looks like a bug that might be very similar so I thought to add more info here > rather than opening a new one. If a thread stays in that backtraces for more than a few seconds, open a separate bug.
Having threads spend time in poll() and read() is expected, reopen if you can illustrate something more nefarious