Bug 50247 - httpd hangs on poll(...) and read(...)
Summary: httpd hangs on poll(...) and read(...)
Status: RESOLVED INVALID
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: All (show other bugs)
Version: 2.2.16
Hardware: PC Linux
: P2 critical (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-11-10 12:54 UTC by Alex
Modified: 2011-09-17 19:50 UTC (History)
2 users (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alex 2010-11-10 12:54:58 UTC
We are seeing sporadic instances of what appears to be a bug in httpd. The symptoms are that httpd hangs on polling and reading. Here's some pertinent lines from an strace log (I've separated the "hanging" sections with a space at both ends):

  51988 31737 write(8, "some.ip.address - - [09/Nov/2010:17:47:10 -0600] \"GET /learn/ HTTP/1.1\" 200 8719 "..., 222) = 222
  51989 31737 shutdown(11, 1 /* send */)        = 0
  51990 31737 poll([{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1  51991 31737 read(11, "", 512)                 = 0  51992 31737 close(11)                         = 0
  51993 31737 read(5, 0xbfedeccf, 1)            = -1 EAGAIN (Resource temporarily unavailable)  51994 31737 semop(1605639, 0x12fb44, 1 <unfinished ...>

  51995 3551  <... select resumed> )            = 0 (Timeout)
  51996 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0  51997 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)  51998 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  51999 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
  52000 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  52001 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
  52002 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0  52003 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)  52004 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0  52005 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)  52006 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  52007 3551  select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
  52008 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  52009 3551  select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>

  52010 4710  <... poll resumed> [{fd=4, events=POLLIN, revents=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1
  52011 4710  accept(4, {sa_family=AF_INET6, sin6_port=htons(63556), inet_pton(AF_INET6, "::ffff:some.ip.address", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 1        1
  52012 4710  semop(1605639, 0x12fb4a, 1 <unfinished ...>
  52013 14265 <... semop resumed> )             = 0
  52014 4710  <... semop resumed> )             = 0

  // ...

  58900 31732 write(13, "\371\0\0\0\3SELECT * from trefPageModule\r\n\t\t\t\t\t\t\t\t\t\t LEFT JOIN tPage ON tPage.pageID=tr"..., 253) = 253
  58901 31732 read(13,  <unfinished ...>

  58902 3551  <... select resumed> )            = 0 (Timeout)
  58903 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  58904 3551  select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>

  58905 31732 <... read resumed> "\1\0\0\1\34<\0\0\2\3def\6MySite\16trefPageModule\16trefPageModule\2ID\2ID\f?\0\v\0\0\0\3\3B\0\0\0D\0\0\3\3def\6Ut"..., 16384) = 1871
  58906 31732 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 0) = 0

  // ...

  60626 31732 close(15)                         = 0
  60627 31732 rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0  60628 31732 brk(0x9392000)                    = 0x9392000
  60629 31732 brk(0x9390000)                    = 0x9390000
  60630 31732 brk(0x938e000)                    = 0x938e000
  60631 31732 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
  60632 31732 writev(11, [{"\27\3\1\f\360\355\214<(\213r\to\10\261J\254\327-!A\242\17\6U\336I3\370\335\265\323\214\236{H\20\344\323\324\221\330~\345\314\246\237\245\341\25        0\334\335\2\201\201\274\300\360\235\277\350\n\335\373\230\236\225b\360<\224\200\272\252=p\32\360\0361"..., 3391}], 1) = 3391  60633 31732 write(8, "some.ip.address - - [09/Nov/2010:17:49:10 -0600] \"GET /mysite/managePages/ HTTP/1"..., 240) = 240
  60634 31732 writev(11, [{"\25\3\1\0 \226\304\335\336\312\224\257\300\352#G|\5Xjq3.#\301\236\331\352\350&\313\330&5\26\210\324", 37}], 1) = 37
  60635 31732 shutdown(11, 1 /* send */)        = 0
  60636 31732 poll( <unfinished ...>

  60637 3551  <... select resumed> )            = 0 (Timeout)
  60638 3551  waitpid(-1, 0xbfeded50, WNOHANG|WSTOPPED) = 0
  60639 3551  select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>

  60640 31732 <... poll resumed> [{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1
  60641 31732 read(11, "\25\3\1\0 i-B\27\367\v\252\364\372\r\30\214\1\213\233t\25\227O\234ei[\334\272\371@\2127\211\241}", 512) = 37
  60642 31732 poll([{fd=11, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 2000) = 1
  60643 31732 read(11, "", 512)                 = 0
Comment 1 Ruediger Pluem 2010-11-10 14:52:26 UTC
Please try to get a stacktrace of the hanging thread (http://httpd.apache.org/dev/debugging.html). From the strace it looks like you are using some 3rd party "app" module like php in your httpd. What are you using?
Comment 2 Alex 2010-11-11 11:12:00 UTC
(In reply to comment #1)
> Please try to get a stacktrace of the hanging thread
> (http://httpd.apache.org/dev/debugging.html). From the strace it looks like you
> are using some 3rd party "app" module like php in your httpd. What are you
> using?

We are using php. I will have to wait until the next time this occurs--it appears to be sporadic, with a frequency of about once a week, roughly. Haven't been able to tie to any specific event or action conclusively.
Comment 3 tal.yalon 2011-03-09 08:27:06 UTC
Hi there,

Although I don't know anything about Alex's bug, we experience something that looks like a bug that might be very similar so I thought to add more info here rather than opening a new one.

The httpd in question is 2.2.16 running on EC2.

We see that after a while (10 minutes) of a significant load, some of the worker processes get hang on poll().

This is the stacktrace from one of the processes that are in this state:

#0  0x00007f76fa70c748 in poll () from /lib64/libc.so.6
#1  0x00007f76fabe2822 in apr_wait_for_io_or_timeout () from /usr/lib64/libapr-1.so.0
#2  0x00007f76fabdd2da in apr_socket_recv () from /usr/lib64/libapr-1.so.0
#3  0x00007f76fc0dfddd in ap_lingering_close ()
#4  0x00007f76fc0eaa7e in ?? ()
#5  0x00007f76fc0ead1a in ?? ()
#6  0x00007f76fc0eadd0 in ?? ()
#7  0x00007f76fc0eb908 in ap_mpm_run ()
#8  0x00007f76fc0c54fb in main ()

Please let me know if there's any more information I can provide - this problem is obviously a major concern to us.

Thanks,
Tal
Comment 4 Eric Covener 2011-03-09 09:28:14 UTC
> Although I don't know anything about Alex's bug, we experience something that
> looks like a bug that might be very similar so I thought to add more info here
> rather than opening a new one.

If a thread stays in that backtraces for more than a few seconds, open a separate bug.
Comment 5 Eric Covener 2011-09-17 19:50:46 UTC
Having threads spend time in poll() and read() is expected, reopen if you can illustrate something more nefarious