Summary: | httpd-2.2.12 often hangs for hours | ||
---|---|---|---|
Product: | APR | Reporter: | HWS <strube> |
Component: | APR | Assignee: | Apache Portable Runtime bugs mailinglist <bugs> |
Status: | RESOLVED FIXED | ||
Severity: | critical | CC: | jsl |
Priority: | P2 | ||
Version: | 1.3.4 | ||
Target Milestone: | --- | ||
Hardware: | HP | ||
OS: | Windows Server 2003 | ||
Bug Depends on: | |||
Bug Blocks: | 46571 | ||
Attachments: | Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier) |
Description
HWS
2009-08-05 03:12:31 UTC
Please use pstack to see what the httpd child processes are doing and post unique backtraces here. I used a duplicate installation on another machine (same OS) for testing and a script executing wget queries to six web-pages (with subqueries to css, images etc.) at a high rate (about 100 times the average load of our web server). The bug did not occur within 16 hours! However, the machines (both SPARC) had different type and number of CPUs. Thus I will have to reinstall 2.2.12 on our web server and wait till it hangs (although I don't like this). Please grant me some time (up to a week) for my next information. Faster than expected, the bug occurred again, and I could collect some hopefully informative output while the server was hanging: *** master httpd process: * pstack PID : fefcc1d0 pollsys (ffbffb30, 0, ffbffb98, 0) fef67b68 pselect (ffbffb30, ff034630, ff034630, 0, ffbffb98, 0) + 1c8 fef67ee0 select (0, 0, 0, 0, ffbffc00, 4ab82600) + a0 ff21faf0 apr_sleep (0, f4240, f4240, 33c, 9c180, f4000) + 4c 0004398c ap_wait_or_timeout (11175, 11000, ffbffcd0, ffffffff, 3, f4000) + 7c 0006ab00 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 2f0 0002787c main (1, 9957c, 99400, 9a400, 99510, cb188) + 994 00026688 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) = 0 waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0 pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) (sleeping...) pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) = 0 waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0 pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) = 0 waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0 ... etc. * truss -t !all -u a.out: -p PID : (No other process showed any output on this!) /1@1: -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff) /1@1: <- ap_wait_or_timeout() = 0xffbffce4 /1@1: -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18) /1@1: <- ap_wait_or_timeout() = 0x11175 /1@1: -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff) /1@1: <- perform_idle_server_maintenance() = 1 /1@1: -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18) /1@1: -> ap_run_monitor(0xa6e18, 0x0, 0x0, 0x0) /1@1: <- ap_run_monitor() = 0 /1@1: <- ap_wait_or_timeout() = 0x11175 /1@1: -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff) /1@1: <- perform_idle_server_maintenance() = 1 /1@1: -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18) /1@1: <- ap_wait_or_timeout() = 0x11175 /1@1: -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff) ... etc. *** fcgi-controlling httpd process: * pstack PID : fefcc1d0 pollsys (ffbff9b0, 1, 0, 0) fef67b68 pselect (ffbff9b0, ff034630, ff034630, 40, 0, 0) + 1c8 fef67ee0 select (10, ffbffba0, 0, 0, 0, ffbffc29) + a0 feb99a9c fcgi_pm_main (180, 0, febb6b80, feba005c, feba00e8, 3) + 9a4 feb943d4 init_module (a6e18, 18f6a8, aae28, ada58, 9782c, 0) + 264 0003abd4 ap_run_post_config (a6e18, e4f10, aae28, ada58, 9, 1c6908) + 48 00027820 main (1, 9957c, 99400, 9a400, 99510, cb188) + 938 00026688 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : pollsys(0xFFBFF9B0, 1, 0x00000000, 0x00000000) (sleeping...) *** one worker httpd process: * pstack PID : fefcb904 portfs (6, 8, 1ed548, 2, 1, ffbffa80) ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c 0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378 0006a4d8 make_child (9a400, d, 9ac00, 0, 9ac00, e) + 128 0006a6f0 perform_idle_server_maintenance (2, ffbffbec, 9ad7c, 1, 2, 9c400) + 164 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408 0002787c main (1, 9957c, 99400, 9a400, 99510, cb188) + 994 00026688 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...) port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) = 0 [62] port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...) *** all other worker httpd processes (adresses varying): * pstack PID : fefcb804 fcntl (f, 7, ff232d2c) fefbb4cc fcntl (f, 1, ff332a00, ff213f24, ffffedf6, 49db8400) + 84 ff212610 proc_mutex_fcntl_acquire (33c, 0, 1f3f8, fffb7d40, ff2319d4, 18f7f0) + 44 000698cc accept_mutex_on (8, 0, 0, 0, 8, 9c000) + 8 0006a168 child_main (18, 3e8, 9a400, 1f34d0, 1, 9c400) + 2c8 0006a4d8 make_child (9a400, c, 9ac00, 0, 9ac00, d) + 128 0006a6f0 perform_idle_server_maintenance (2, ffbffbe8, 9ad7c, 0, 2, 9c400) + 164 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408 0002787c main (1, 9957c, 99400, 9a400, 99510, cb188) + 994 00026688 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : fcntl(15, F_SETLKW, 0xFF232D2C) (sleeping...) *** all cronolog processes (adresses varying): * pstack PID : ff2cc2e4 read (0, ffbdfd5c, 10000) 00011c74 main (26c00, 4a7aac78, 25dd0, 3, 25800, ffbffe8e) + 790 000113dc _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : read(0, 0xFFBDFD5C, 65536) (sleeping...) *** master php process: * pstack PID : feb4c91c waitid (7, 0, ffbfdbd0, 3) feb3c120 wait (ffbffe38, 1, ff2c2a00, 2, 570cf4, 0) + 58 0047182c main (7fa0, 580f20, 0, 1, 58103c, 1) + 9cc 00042fa8 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : waitid(P_ALL, 0, 0xFFBFDBD0, WEXITED|WTRAPPED) (sleeping...) *** child php processes: * pstack PID : feb4b97c accept (0, ffbfdc3c, ffbfdcac, 1) 00472794 main (8014, 0, 1f4, 1, 0, 1) + 1934 00042fa8 _start (0, 0, 0, 0, 0, 0) + 108 * truss -p PID : accept(0, 0xFFBFDC3C, 0xFFBFDCAC, SOV_DEFAULT) (sleeping...) >Please grant me some time (up to a week) for my next information.
No problem. This is best effort/as-time-allows support on both sides ;) I'll just mark the status as NEEDINFO, which you can change when you're able to duplicate the problem and post the backtraces.
This pid is supposed to wake up for the next client connection:
* pstack PID :
fefcb904 portfs (6, 8, 1ed548, 2, 1, ffbffa80)
ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c
0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378
From the truss I guess it is waking up every 10 seconds or so (whatever the prefork timeout is), but never finding a new connection.
I don't know exactly what is causing the problem. I had a similar situation with the event MPM once and this same implementation of apr_pollset_poll(), but I didn't have time to debug or to narrow it down to one of {Event MPM, APR, Solaris kernel}.
You will likely be able to work around the problem by selecting a different poll implementation. It should be as easy as
$ make distclean
$ export ac_cv_func_port_create=no
$ ./configure [normal configure options]
(Does that look right to others following the bug list?)
From another standpoint:
>In this state, neither any answer packets were generated (tested with network sniffer) nor any log lines written.
The TCP layer should answer a SYN packet. Did that happen?
What does netstat show at the time of the hang? If connections are able to reach the ESTABLISHED state but not responded to, then TCP is doing the right thing and httpd isn't finding the connection. If none reach the ESTABLISHED state, then it is a TCP layer issue.
What changed in 2.2.12?
The Solaris port code changed in APR, but it doesn't look dangerous:
+ {
+ int flags;
+
+ if ((flags = fcntl((*pollset)->port_fd, F_GETFD)) == -1)
+ return errno;
+
+ flags |= FD_CLOEXEC;
+ if (fcntl((*pollset)->port_fd, F_SETFD, flags) == -1)
+ return errno;
+ }
BTW, what exactly did you do to clear up the hang? Before recompiling and doing new tests (which I have to delay till after the next hang), here some remarks: > I had a similar situation >with the event MPM once and this same implementation of apr_pollset_poll(), >but I didn't have time to debug or to narrow it down to one of {Event MPM, >APR, Solaris kernel}. >You will likely be able to work around the problem by selecting a different >poll implementation. As far as I can see, 2.2.11 used exactly the same implementation, without any difficulties. The only change now is setting FD_CLOEXEC. >The TCP layer should answer a SYN packet. Did that happen? >What does netstat show at the time of the hang? If connections are able to >reach the ESTABLISHED state but not responded to, then TCP is doing the right >thing and httpd isn't finding the connection. If none reach the ESTABLISHED >state, then it is a TCP layer issue. I have to wait for the next hang, but as far as I remember, snoop did not show any 'A' packets as reply to 'S' packets. If I remember the netstat output correctly, old connections were visible, but no new ones. >BTW, what exactly did you do to clear up the hang? Often nothing, but then it took hours to recover. Graceful restart always cleared the hang. >I have to wait for the next hang, but as far as I remember, snoop did not show
>any 'A' packets as reply to 'S' packets. If I remember the netstat output
>correctly, old connections were visible, but no new ones.
No, my memory seems to be bad! Now a new hang occurred, but new connection attempts do some handshake, becoming ESTABLISHED. Also netstat shows a lot of connections in CLOSE_WAIT state, which vanished after graceful restart (done in order to revive the server).
>You will likely be able to work around the problem by selecting a different
>poll implementation.
For testing, I compiled two new variants of libapr-1.so.0.3.7. First, I removed the line "#define HAVE_PORT_CREATE 1" from srclib/apr/include/arch/unix/apr_private.h, causing poll() instead of port_create() to be used (I verified that no reference to port_create() existed any more in the library). With this library, no hangs happened in 3,5 days.
Second, I used port_create() but removed the two occurences of fcntl(fd, F_SETFD, flags) from srclib/apr/poll/unix/port.c, making it act as in httpd-2.2.11. But in spite of this, there soon were hangs again.
Thus, NOT using port_create() seems to be a workaround. But still, I am not content with this, because 2.2.11 also used port_create() without ANY difficulties! (This is why I also tested the effect of F_SETFD in poll.c, which was new since 2.2.11. But there is F_SETFD in many other places, too.) The actual reason of the hanging is, IMHO, still unknown (some subtle, machine-dependent timing problem?).
Thanks so much for the valuable research. I agree that disabling port_create() is not a satisfying work-around; there's no question that something needs to be fixed, but I don't know where (kernel, APR, httpd, or even compiler). It's more than a little astonishing that you're encountering this with the simplest MPM, with APR/httpd threading support disabled. I guess that the problem has been there all along but is triggered by some seemingly unrelated change in 2.2.12. I was able to reproduce the event port problem then run successfully with this patch, which applies cleanly to the APR in httpd 2.2.12 or 2.2.13. (It isn't expected to be needed for httpd older than 2.2.12, unless a certain prefork MPM fix has been applied on top of 2.2.11). ----------cut here----------- --- poll/unix/port.c.orig 2009-08-19 13:48:22.876217458 -0400 +++ poll/unix/port.c 2009-08-19 13:49:49.908294412 -0400 @@ -320,6 +320,17 @@ ret = port_getn(pollset->port_fd, pollset->port_set, pollset->nalloc, &nget, tvptr); + if (ret == -1 && nget > 0) { + /* On at least some levels of Solaris, -1/ETIME can be + * reported at the same time an event is returned. + * Since the kernel forgets the event condition as soon as + * it is returned, we have to ignore the -1/ETIME and + * process the event to prevent a deadlock. + */ + + ret = 0; + } + /* decrease the waiting ASAP to reduce the window for calling port_associate within apr_pollset_add() */ apr_atomic_dec32(&pollset->waiting); ------------cut here----------- If anyone tries this patch, please report back. > If anyone tries this patch, please report back.
Before I try: There is another occurrence of port_getn() in port.c on line 501 (after your patch; 490 original). Shouldn't the same patch be applied there, too?
(BTW, unfortunately I can hardly carry out tests now because I'll be absent for three weeks after tomorrow.)
Created attachment 24161 [details]
Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier)
httpd users: This works with httpd 2.2.11-2.2.13 (and possibly earlier levels).
cd to the srclib/apr directory before applying patch.
(In reply to comment #12) > > If anyone tries this patch, please report back. > > Before I try: There is another occurrence of port_getn() in port.c on line 501 > (after your patch; 490 original). Shouldn't the same patch be applied there, > too? > (BTW, unfortunately I can hardly carry out tests now because I'll be absent for > three weeks after tomorrow.) That unhandled path is for an API isn't used in httpd 2.2.x, but the patch I just attached does fix the problem for that API. A fix has been committed to APR trunk, 1.4.x, and 1.3.x branches for the next 1.3.x release. I would like to chime in here and add that this issue doesn't only affect Solaris, but Linux (specifically RH9, 2.4 kernel) as well. I upgraded to 2.2.13 this weekend and experienced the exact same issues. Upgraded firstly to apr-1.3.8, no change. Then applied the patch attached here and things are working fine again. (Point of note to readers: I also then saw MySQL connections fill with queries in a locked state when running Gallery -- you may need to kill some queries if you hit this issue). (In reply to comment #15) > I would like to chime in here and add that this issue doesn't only affect > Solaris, but Linux (specifically RH9, 2.4 kernel) as well. You must be affected by something else. Linux does not have event ports but epoll. > I upgraded to 2.2.13 this weekend and experienced the exact same issues. > > Upgraded firstly to apr-1.3.8, no change. Then applied the patch attached here > and things are working fine again. As said this is not possible. The patch attached in #13 changes code that is never executed on Linux. (In reply to comment #16) > You must be affected by something else. Linux does not have event ports but > epoll. Well, that's strange, because either this patch, or me clearing my MySQL connections (which was caused by this, I'm sure of it) fixed the issue. > As said this is not possible. The patch attached in #13 changes code that is > never executed on Linux. Very strange. I guess I'l revert my patch and rebuild and try again. My apologies... a clean rebuild and everything is working fine. Comment on attachment 24161 [details] Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier) >--- poll/unix/port.c.orig 2009-08-21 08:16:52.140947520 -0400 >+++ poll/unix/port.c 2009-08-21 08:18:58.662675195 -0400 >@@ -325,6 +325,11 @@ > apr_atomic_dec32(&pollset->waiting); > (*num) = nget; > >+ if (ret == -1 && nget && >+ (errno == ETIME || errno == EINTR)) { >+ ret = 0; >+ } >+ > if (ret == -1) { > (*num) = 0; > if (errno == EINTR) { >@@ -490,6 +495,11 @@ > ret = port_getn(pollcb->port_fd, pollcb->port_set, pollcb->nalloc, > &nget, tvptr); > >+ if (ret == -1 && nget && >+ (errno == ETIME || errno == EINTR)) { >+ ret = 0; >+ } >+ > if (ret == -1) { > if (errno == ETIME || errno == EINTR) { > rv = APR_TIMEUP; |