Bug 47645

Summary:	httpd-2.2.12 often hangs for hours
Product:	APR	Reporter:	HWS <strube>
Component:	APR	Assignee:	Apache Portable Runtime bugs mailinglist <bugs>
Status:	RESOLVED FIXED
Severity:	critical	CC:	jsl
Priority:	P2
Version:	1.3.4
Target Milestone:	---
Hardware:	HP
OS:	Windows Server 2003
Bug Depends on:
Bug Blocks:	46571
Attachments:	Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier)

Description HWS 2009-08-05 03:12:31 UTC

After upgrading from httpd 2.2.11 to 2.2.12, the server was blocked at least once per day for up to six hours. In this state, neither any answer packets were generated (tested with network sniffer) nor any log lines written. In one such state, I restarted the server gracefully, then it worked again. The log files do not show any errors. Tracing the running master httpd process with Solaris "truss" showed a loop of pollsys(...) and waitid(P_ALL,...) calls, but this is normal. Only downgrading to 2.2.11 helped.

Replacing the log-pipe syntax "|exec cronolog ..." with the new one, "||cronolog ..." had no effect. Systematic testing will be difficult since I cannot do this on our "production" web server and the bug cannot be triggered at will.

System: Solaris 10 u7 (SPARC), httpd self-compiled with Sun Studio Forte 7 compiler, configure options: CFLAGS='-xO3 -xarch=v8 -xstrconst' --disable-threads --enable-suexec --with-ssl=/usr/sfw (plus module options; mpm=prefork [default]). Additionally, mod_fastcgi was compiled separately. All this and the runtime configuration were identical to previous versions, which always worked reliably.

Process hierarchy (log output is piped through cronolog):
httpd        (master)
  cronolog
  cronolog
  cronolog
  httpd      (fastcgi controller)
    php-fcgi (master, PHP_FCGI_CHILDREN=2)
      php-fcgi
      php-fcgi
  httpd
  httpd
  ...
  httpd

Comment 1 Jeff Trawick 2009-08-05 06:24:25 UTC

Please use pstack to see what the httpd child processes are doing and post unique backtraces here.

Comment 2 HWS 2009-08-06 00:37:31 UTC

I used a duplicate installation on another machine (same OS) for testing and a script executing wget queries to six web-pages (with subqueries to css, images etc.) at a high rate (about 100 times the average load of our web server). The bug did not occur within 16 hours! However, the machines (both SPARC) had different type and number of CPUs. Thus I will have to reinstall 2.2.12 on our web server and wait till it hangs (although I don't like this).
Please grant me some time (up to a week) for my next information.

Comment 3 HWS 2009-08-06 05:07:05 UTC

Faster than expected, the bug occurred again, and I could collect some hopefully informative output while the server was hanging:

*** master httpd process:
* pstack PID :
 fefcc1d0 pollsys  (ffbffb30, 0, ffbffb98, 0)
 fef67b68 pselect  (ffbffb30, ff034630, ff034630, 0, ffbffb98, 0) + 1c8
 fef67ee0 select   (0, 0, 0, 0, ffbffc00, 4ab82600) + a0
 ff21faf0 apr_sleep (0, f4240, f4240, 33c, 9c180, f4000) + 4c
 0004398c ap_wait_or_timeout (11175, 11000, ffbffcd0, ffffffff, 3, f4000) + 7c
 0006ab00 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 2f0
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) (sleeping...)
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
... etc.

* truss -t !all -u a.out: -p PID :
  (No other process showed any output on this!)
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- ap_wait_or_timeout() = 0xffbffce4
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- perform_idle_server_maintenance() = 1
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:         -> ap_run_monitor(0xa6e18, 0x0, 0x0, 0x0)
/1@1:         <- ap_run_monitor() = 0
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- perform_idle_server_maintenance() = 1
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
... etc.

*** fcgi-controlling httpd process:
* pstack PID :
 fefcc1d0 pollsys  (ffbff9b0, 1, 0, 0)
 fef67b68 pselect  (ffbff9b0, ff034630, ff034630, 40, 0, 0) + 1c8
 fef67ee0 select   (10, ffbffba0, 0, 0, 0, ffbffc29) + a0
 feb99a9c fcgi_pm_main (180, 0, febb6b80, feba005c, feba00e8, 3) + 9a4
 feb943d4 init_module (a6e18, 18f6a8, aae28, ada58, 9782c, 0) + 264
 0003abd4 ap_run_post_config (a6e18, e4f10, aae28, ada58, 9, 1c6908) + 48
 00027820 main     (1, 9957c, 99400, 9a400, 99510, cb188) + 938
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
pollsys(0xFFBFF9B0, 1, 0x00000000, 0x00000000) (sleeping...)

*** one worker httpd process:
* pstack PID :
 fefcb904 portfs   (6, 8, 1ed548, 2, 1, ffbffa80)
 ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c
 0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378
 0006a4d8 make_child (9a400, d, 9ac00, 0, 9ac00, e) + 128
 0006a6f0 perform_idle_server_maintenance (2, ffbffbec, 9ad7c, 1, 2, 9c400) + 164
 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...)
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80)      = 0 [62]
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...)

*** all other worker httpd processes (adresses varying):
* pstack PID :
 fefcb804 fcntl    (f, 7, ff232d2c)
 fefbb4cc fcntl    (f, 1, ff332a00, ff213f24, ffffedf6, 49db8400) + 84
 ff212610 proc_mutex_fcntl_acquire (33c, 0, 1f3f8, fffb7d40, ff2319d4, 18f7f0) + 44
 000698cc accept_mutex_on (8, 0, 0, 0, 8, 9c000) + 8
 0006a168 child_main (18, 3e8, 9a400, 1f34d0, 1, 9c400) + 2c8
 0006a4d8 make_child (9a400, c, 9ac00, 0, 9ac00, d) + 128
 0006a6f0 perform_idle_server_maintenance (2, ffbffbe8, 9ad7c, 0, 2, 9c400) + 164
 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
fcntl(15, F_SETLKW, 0xFF232D2C) (sleeping...)

*** all cronolog processes (adresses varying):
* pstack PID :
 ff2cc2e4 read     (0, ffbdfd5c, 10000)
 00011c74 main     (26c00, 4a7aac78, 25dd0, 3, 25800, ffbffe8e) + 790
 000113dc _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
read(0, 0xFFBDFD5C, 65536)	(sleeping...)

*** master php process:
* pstack PID :
 feb4c91c waitid   (7, 0, ffbfdbd0, 3)
 feb3c120 wait     (ffbffe38, 1, ff2c2a00, 2, 570cf4, 0) + 58
 0047182c main     (7fa0, 580f20, 0, 1, 58103c, 1) + 9cc
 00042fa8 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
waitid(P_ALL, 0, 0xFFBFDBD0, WEXITED|WTRAPPED) (sleeping...)

*** child php processes:
* pstack PID :
 feb4b97c accept   (0, ffbfdc3c, ffbfdcac, 1)
 00472794 main     (8014, 0, 1f4, 1, 0, 1) + 1934
 00042fa8 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
accept(0, 0xFFBFDC3C, 0xFFBFDCAC, SOV_DEFAULT) (sleeping...)

Comment 4 Jeff Trawick 2009-08-06 05:17:54 UTC

>Please grant me some time (up to a week) for my next information.

No problem.  This is best effort/as-time-allows support on both sides ;)  I'll just mark the status as NEEDINFO, which you can change when you're able to duplicate the problem and post the backtraces.

Comment 5 Jeff Trawick 2009-08-06 05:39:11 UTC

This pid is supposed to wake up for the next client connection:

* pstack PID :
 fefcb904 portfs   (6, 8, 1ed548, 2, 1, ffbffa80)
 ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c
 0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378

From the truss I guess it is waking up every 10 seconds or so (whatever the prefork timeout is), but never finding a new connection.

I don't know exactly what is causing the problem.  I had a similar situation with the event MPM once and this same implementation of apr_pollset_poll(), but I didn't have time to debug or to narrow it down to one of {Event MPM, APR, Solaris kernel}.

You will likely be able to work around the problem by selecting a different poll implementation.  It should be as easy as 

$ make distclean
$ export ac_cv_func_port_create=no
$ ./configure [normal configure options]

(Does that look right to others following the bug list?)

From another standpoint:

>In this state, neither any answer packets were generated (tested with network sniffer) nor any log lines written.

The TCP layer should answer a SYN packet.  Did that happen?

What does netstat show at the time of the hang?  If connections are able to reach the ESTABLISHED state but not responded to, then TCP is doing the right thing and httpd isn't finding the connection.  If none reach the ESTABLISHED state, then it is a TCP layer issue.

What changed in 2.2.12?

The Solaris port code changed in APR, but it doesn't look dangerous:

+    {
+        int flags;
+
+        if ((flags = fcntl((*pollset)->port_fd, F_GETFD)) == -1)
+            return errno;
+
+        flags |= FD_CLOEXEC;
+        if (fcntl((*pollset)->port_fd, F_SETFD, flags) == -1)
+            return errno;
+    }

Comment 6 Jeff Trawick 2009-08-06 05:44:18 UTC

BTW, what exactly did you do to clear up the hang?

Comment 7 HWS 2009-08-06 06:43:46 UTC

Before recompiling and doing new tests (which I have to delay till after the next hang), here some remarks:

> I had a similar situation
>with the event MPM once and this same implementation of apr_pollset_poll(),
>but I didn't have time to debug or to narrow it down to one of {Event MPM,
>APR, Solaris kernel}.
>You will likely be able to work around the problem by selecting a different
>poll implementation.

As far as I can see, 2.2.11 used exactly the same implementation, without any difficulties. The only change now is setting FD_CLOEXEC.

>The TCP layer should answer a SYN packet.  Did that happen?
>What does netstat show at the time of the hang?  If connections are able to
>reach the ESTABLISHED state but not responded to, then TCP is doing the right
>thing and httpd isn't finding the connection.  If none reach the ESTABLISHED
>state, then it is a TCP layer issue.

I have to wait for the next hang, but as far as I remember, snoop did not show any 'A' packets as reply to 'S' packets. If I remember the netstat output correctly, old connections were visible, but no new ones.

>BTW, what exactly did you do to clear up the hang?

Often nothing, but then it took hours to recover. Graceful restart always cleared the hang.

Comment 8 HWS 2009-08-06 08:16:24 UTC

>I have to wait for the next hang, but as far as I remember, snoop did not show
>any 'A' packets as reply to 'S' packets. If I remember the netstat output
>correctly, old connections were visible, but no new ones.

No, my memory seems to be bad! Now a new hang occurred, but new connection attempts do some handshake, becoming ESTABLISHED. Also netstat shows a lot of connections in CLOSE_WAIT state, which vanished after graceful restart (done in order to revive the server).

Comment 9 HWS 2009-08-10 07:44:54 UTC

>You will likely be able to work around the problem by selecting a different
>poll implementation.

For testing, I compiled two new variants of libapr-1.so.0.3.7. First, I removed the line "#define HAVE_PORT_CREATE 1" from srclib/apr/include/arch/unix/apr_private.h, causing poll() instead of port_create() to be used (I verified that no reference to port_create() existed any more in the library). With this library, no hangs happened in 3,5 days.
Second, I used port_create() but removed the two occurences of fcntl(fd, F_SETFD, flags) from srclib/apr/poll/unix/port.c, making it act as in httpd-2.2.11. But in spite of this, there soon were hangs again.

Thus, NOT using port_create() seems to be a workaround. But still, I am not content with this, because 2.2.11 also used port_create() without ANY difficulties! (This is why I also tested the effect of F_SETFD in poll.c, which was new since 2.2.11. But there is F_SETFD in many other places, too.) The actual reason of the hanging is, IMHO, still unknown (some subtle, machine-dependent timing problem?).

Comment 10 Jeff Trawick 2009-08-11 12:29:11 UTC

Thanks so much for the valuable research.

I agree that disabling port_create() is not a satisfying work-around; there's no question that something needs to be fixed, but I don't know where (kernel, APR, httpd, or even compiler).  It's more than a little astonishing that you're encountering this with the simplest MPM, with APR/httpd threading support disabled.

I guess that the problem has been there all along but is triggered by some seemingly unrelated change in 2.2.12.

Comment 11 Jeff Trawick 2009-08-19 11:33:44 UTC

I was able to reproduce the event port problem then run successfully with this patch, which applies cleanly to the APR in httpd 2.2.12 or 2.2.13.  (It isn't expected to be needed for httpd older than 2.2.12, unless a certain prefork MPM fix has been applied on top of 2.2.11).

----------cut here-----------
--- poll/unix/port.c.orig	2009-08-19 13:48:22.876217458 -0400
+++ poll/unix/port.c	2009-08-19 13:49:49.908294412 -0400
@@ -320,6 +320,17 @@
     ret = port_getn(pollset->port_fd, pollset->port_set, pollset->nalloc,
                     &nget, tvptr);
 
+    if (ret == -1 && nget > 0) {
+        /* On at least some levels of Solaris, -1/ETIME can be
+         * reported at the same time an event is returned.
+         * Since the kernel forgets the event condition as soon as
+         * it is returned, we have to ignore the -1/ETIME and
+         * process the event to prevent a deadlock.
+         */
+
+        ret = 0;
+    }
+
     /* decrease the waiting ASAP to reduce the window for calling 
        port_associate within apr_pollset_add() */
     apr_atomic_dec32(&pollset->waiting);
------------cut here-----------

If anyone tries this patch, please report back.

Comment 12 HWS 2009-08-20 00:45:15 UTC

> If anyone tries this patch, please report back.

Before I try: There is another occurrence of port_getn() in port.c on line 501 (after your patch; 490 original). Shouldn't the same patch be applied there, too?
(BTW, unfortunately I can hardly carry out tests now because I'll be absent for three weeks after tomorrow.)

Comment 13 Jeff Trawick 2009-08-24 09:48:44 UTC

Created attachment 24161 [details]
Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier)

httpd users: This works with httpd 2.2.11-2.2.13 (and possibly earlier levels).
cd to the srclib/apr directory before applying patch.

Comment 14 Jeff Trawick 2009-08-24 09:50:32 UTC

(In reply to comment #12)
> > If anyone tries this patch, please report back.
> 
> Before I try: There is another occurrence of port_getn() in port.c on line 501
> (after your patch; 490 original). Shouldn't the same patch be applied there,
> too?
> (BTW, unfortunately I can hardly carry out tests now because I'll be absent for
> three weeks after tomorrow.)

That unhandled path is for an API isn't used in httpd 2.2.x, but the patch I just attached does fix the problem for that API.

A fix has been committed to APR trunk, 1.4.x, and 1.3.x branches for the next 1.3.x release.

Comment 15 Jason Lingohr 2009-09-26 19:32:47 UTC

I would like to chime in here and add that this issue doesn't only affect Solaris, but Linux (specifically RH9, 2.4 kernel) as well.

I upgraded to 2.2.13 this weekend and experienced the exact same issues.

Upgraded firstly to apr-1.3.8, no change.  Then applied the patch attached here and things are working fine again.

(Point of note to readers: I also then saw MySQL connections fill with queries in a locked state when running Gallery -- you may need to kill some queries if you hit this issue).

Comment 16 Ruediger Pluem 2009-09-27 03:25:17 UTC

(In reply to comment #15)
> I would like to chime in here and add that this issue doesn't only affect
> Solaris, but Linux (specifically RH9, 2.4 kernel) as well.

You must be affected by something else. Linux does not have event ports but epoll.

> I upgraded to 2.2.13 this weekend and experienced the exact same issues.
> 
> Upgraded firstly to apr-1.3.8, no change.  Then applied the patch attached here
> and things are working fine again.

As said this is not possible. The patch attached in #13 changes code that is never executed on Linux.

Comment 17 Jason Lingohr 2009-09-27 15:25:35 UTC

(In reply to comment #16)
> You must be affected by something else. Linux does not have event ports but
> epoll.

Well, that's strange, because either this patch, or me clearing my MySQL connections (which was caused by this, I'm sure of it) fixed the issue.

> As said this is not possible. The patch attached in #13 changes code that is
> never executed on Linux.

Very strange.  I guess I'l revert my patch and rebuild and try again.

Comment 18 Jason Lingohr 2009-09-27 16:08:49 UTC

My apologies... a clean rebuild and everything is working fine.

Comment 19 shenming 2014-03-31 06:07:36 UTC

Comment on attachment 24161 [details]
Fix for PR 47645 to use with apr 1.3.3-1.3.8 (and possibly earlier)

>--- poll/unix/port.c.orig	2009-08-21 08:16:52.140947520 -0400
>+++ poll/unix/port.c	2009-08-21 08:18:58.662675195 -0400
>@@ -325,6 +325,11 @@
>     apr_atomic_dec32(&pollset->waiting);
>     (*num) = nget;
> 
>+    if (ret == -1 && nget &&
>+        (errno == ETIME || errno == EINTR)) {
>+        ret = 0;
>+    }
>+
>     if (ret == -1) {
>         (*num) = 0;
>         if (errno == EINTR) {
>@@ -490,6 +495,11 @@
>     ret = port_getn(pollcb->port_fd, pollcb->port_set, pollcb->nalloc,
>                     &nget, tvptr);
> 
>+    if (ret == -1 && nget &&
>+        (errno == ETIME || errno == EINTR)) {
>+        ret = 0;
>+    }
>+
>     if (ret == -1) {
>         if (errno == ETIME || errno == EINTR) {
>             rv = APR_TIMEUP;