Bug 48029

Summary:	httpd-2.2.14 hangs in port_getn
Product:	APR	Reporter:	Jan Schlien <httpd.apache2>
Component:	APR	Assignee:	Apache Portable Runtime bugs mailinglist <bugs>
Status:	CLOSED FIXED
Severity:	normal	CC:	apache, bernd.leibing
Priority:	P2
Version:	1.3.9
Target Milestone:	---
Hardware:	Sun
OS:	Solaris
Attachments:	first test fix for PR 48029

Description Jan Schlien 2009-10-20 10:46:38 UTC

running httpd on Solaris 10U7, we were facing problems somewhat like the ones described in bug 47645, at least we saw processes hanging a *long* time in port_getn once in a while. Although the conversation in that bug looks like 2.2.11 would not be affected, we decided to upgrade - and things are getting better.

Better meaning the bug is rather easy to reproduce, now. We are using prefork mpm. What I can reproduce is:
- start a fresh apache instance
- query some cgi

Results:
Some output from the cgi is received slowly, then it suddenly stops. The cgi seems to have run properly: On the server, the corresponding process has terminated, and can be seen as "defunct" with ps. Its parent is an httpd process. pstack on that process shows:
 feef4ce5 portfs   (6, c, 8672428, 2, 1, 8045438)
 fecb0a77 call_port_getn (c, 8672428, 2, 804547c, 11e1a300, 0) + a5
 fecb1335 apr_pollset_poll (86723e8, 11e1a300, 0, 80454d8, 80454dc, 8672800) + 1d5
 081fb58d cgi_bucket_read (8669f70, 8045538, 804553c, 0, 8045520, fefd25eb) + 76
 080b0f86 ap_content_length_filter (866cca8, 8672798, 866bf14, 860c190, 866bf0c, 8669ec0) + bc
[...]

Using solaris' dtrace, we are seeing that the process called port_getn several times earlier. System calls in chronological order, truncated after ap_content_length_filter:

[1]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`apr_pollset_poll+0xe2
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0

[2]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 1

[3]
              libc.so.1`_write+0x15
              libapr-1.so.0.3.9`apr_file_write+0x1cc
              libapr-1.so.0.3.9`apr_file_write_full+0x38
              libapr-1.so.0.3.9`apr_file_puts+0x3b
              httpd`log_error_core+0x566
              httpd`ap_log_rerror+0x5b
              httpd`cgi_bucket_read+0x12f
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 131
corresponding log entry: (11)Resource temporarily unavailable: poll failed waiting for CGI child

[4]
              libc.so.1`_writev+0x15
              libapr-1.so.0.3.9`apr_socket_sendv+0x83
              httpd`writev_it_all+0x5d
              httpd`ap_core_output_filter+0xb1a
              httpd`ap_pass_brigade+0x92
return value: 8183
obviously transfers the bit of output that is returned to the client

[5]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 1

[6]
              libc.so.1`_read+0x15
              libapr-1.so.0.3.9`apr_file_read+0xff
              libapr-1.so.0.3.9`apr_file_gets+0x178
              httpd`log_script_err+0x26
              httpd`cgi_bucket_read+0x1a4
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0
why 0? where has the rest of my cgi output gone?

[7]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`apr_pollset_remove+0x7b
              httpd`cgi_bucket_read+0x1c4
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0

[8]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
-- hangs for quite a long time

This might also be a mod_cgi bug. However, mod_cgi.c has a timestamp of Jul 2007, and it worked with apr from httpd 2.2.11.

Comment 1 Jeff Trawick 2009-10-21 11:30:34 UTC

This log message means port_getn() is returning EAGAIN:

(11)Resource temporarily unavailable: poll failed waiting for CGI child

EAGAIN isn't a documented return code for port_getn(), and APR doesn't handle it.  (I see that libevent had to start handling it at one point.)

Since EAGAIN wasn't expected by APR, it gets passed back as-is to cgi_bucket_read() as some arbitrary system error code.

Since EAGAIN wasn't expected by cgi_bucket_read(), it gets passed back as-is up the filter chain as some arbitrary system error code.  Up the filter chain in ap_content_length_filter(), EAGAIN is checked for (though it is supposed to mean something else).  That causes the data read already to be flushed, then a blocking bucket read is performed.

The read() which returns 0, where you ask "why 0? where has the rest of my cgi output gone?" is a read on your script's stderr.  The apr_pollset_remove is supposed to be for the stderr handle, since we've already hit EOF on it.

Does the final port_getn() hang for the amount specified by your Timeout directive?  (300 is the default)

I'm wildly guessing for the moment that there's some bug with the pollset manipulation such that the final port_getn() isn't looking at the right handle, and the cgi process remains as a zombie until the pipe it shares with the httpd child is emptied.  Maybe we'd avoid the mishandling except for the bogus EAGAIN.

Can you get a truss showing the httpd child process and the cgi script?  (truss -fp pid-of-httpd-blocked-in-poll before sending the request)

I'll try to model it more closely here.

Comment 2 Jan Schlien 2009-10-21 11:52:00 UTC

First: Yes, the amount of time passing until the request "completes" roughly fits the config's Timeout value.

Unfortunately getting a truss of what is going on is not possible. I forget to mention why we used dtrace to see what is going on - and that bit might be interesting, too: When running with truss, everything works fine. Reproducibly. Might be a hint to some kind of weird timing issue.

I was expecting the return value showed by dtrace for "libc.so.1`_portfs+0x15" in [2] (which is 1), was the return value of port_getn.

When trussing the already hanging child process in [8], I see it sleeping in port_getn. dtrace states it is in _portfs. pstack by the way also states _portfs. Thus I guess port_getn is the high level name for _portfs.

If that is correct, then your explanation with 11 and EAGAIN does not add up, as dtrace shows port_getn really returned 1. However, 1 would mean EPERM, which is no documented return code for port_getn, either.

To gather more information, we could play around with dtrace somewhat further, if we knew what we are looking for.

Comment 3 Jeff Trawick 2009-10-21 12:22:20 UTC

>Thus I guess port_getn is the high level name for _portfs.
>If that is correct, then your explanation with 11 and EAGAIN does not add up,
>as dtrace shows port_getn really returned 1.

If you look for _portfs at http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/event_port.c, portfs is a generic port-related call (probably kernel entry point), and the first parameter indicates which operation (which libc call).

I agree though that the source of the EAGAIN isn't necessarily what I said in my previous update.  I think it is clear that apr_pollset_poll() returned EAGAIN though, and that means either port_associate() or port_getn() returned EAGAIN.  port_associate() even documents EAGAIN as a possible return.

The call to _portfs() directly from apr_pollset_poll() must be for port_associate(), and the call to _portfs() from call_port_getn() must be for port_getn().  If the EAGAIN came from port_associate(), we wouldn't have called port_getn().  Are calls [1] and [2] from the same call to apr_pollset_poll()?

BTW, APR < 1.3.9 didn't check the return code from port_associate().

The return code from _portfs() is some structure/union rval_t which I can't find the definition of.  It does include multiple values; I don't know which of multiple values dtrace is showing us.

Comment 4 Jeff Trawick 2009-10-21 13:06:01 UTC

Created attachment 24404 [details]
first test fix for PR 48029

Please try this patch.

The theory is that we didn't notice an event when it was first reported, so we didn't know we had to reassociate the descriptor, so we never woke up when activity occurred on that descriptor in the future.

This is what happened before 1.3.9 on some cases where -1/ETIME was returned by port_getn().  The patch handles the same condition for -1/EAGAIN -- even an event is returned, don't lose it.

This patch also includes the fix to PR48030, which is for a regression in this code in APR 1.3.9.

Comment 5 Jan Schlien 2009-10-22 03:03:07 UTC

The patch you suggest does not apply to apr 1.3.9. I cannot figure out at first sight what to change in 1.3.9 to resemble the changes you made (to the HEAD version, I guess). Do you want me to try the repository version of apr? Or can you provide a port to 1.3.9, please?

I do not see anything from the patch to bug 48030 included (which indeed does apply to 1.3.9).

Comment 6 Jeff Trawick 2009-10-22 04:52:44 UTC

Hi Jan,

I don't have any issues patching...  What sort of errors are you getting?  Do you have GNU patch (gpatch) available?

$ tar -xjf ../apr-1.3.9.tar.bz2 
$ cd apr-1.3.9/
$ patch -p0 --dry-run < ~/PR48029-1.txt 
patching file poll/unix/port.c
$ patch -p0  < ~/PR48029-1.txt 
patching file poll/unix/port.c

(I confirmed that what I can download from Bugzilla matches my original patch file.)

Comment 7 Jan Schlien 2009-10-22 05:32:43 UTC

*cough* I confused myself with the apr-1.x-PR47645.patch lying around.

Yes, the patch does apply now. Hoewever, from your previous explanations, I was expecting to see some patch against port_associate() (due to EGAIN), while this patch only changes port_getn() treatment.

Anyway, I'm going to try that now and come back then.

Comment 8 Jeff Trawick 2009-10-22 06:12:02 UTC

>I was expecting to see some patch against port_associate() (due to EGAIN), 
>while this patch only changes port_getn() treatment.

The source of the EAGAIN issue isn't perfectly clear to me, but

* APR 1.3.9 won't call port_getn() if port_associate() returns any error, including EAGAIN, within the same call to apr_pollset_poll(); that's why I don't think port_associate() is the source of the EAGAIN

* if I'm wrong about the source of EAGAIN: the documented reason for getting EAGAIN from port_associate() is reaching a limit of objects associated with the port
** we can't recover from that error once it occurs (if indeed the reason for EAGAIN is the documented reason)
** the bug reported in PR 48030, if it doesn't cause a crash, could cause objects to be reassociated multiple times, potentially leading to the documented scenario; the patch includes that fix

Needless to say, I wish I could reproduce the issue ;)

Comment 9 Jan Schlien 2009-10-22 06:39:39 UTC

Alright. Tried the patch now - but no differences.

We could write a more elaborate dtrace script, or try to investigate the process waiting for port_getn() to complete with a debugger. But what are we looking for?

Comment 10 Jeff Trawick 2009-10-22 07:33:02 UTC

Something we need to know is whether there are any descriptors associated with the event port when we block in port_getn(), but I don't know how to check for that other than observing the sequence of calls to port_getn() and port_associate() leading up to the hang.

Do you see clear evidence of where the EAGAIN is coming from in the traces you already have?

I think it is easy to be 100% sure where the EAGAIN is coming from.

Since cgi_bucket_read logs the retcode from port_associate() or port_getn(), let's change EAGAIN from either of those to something unique so that we know where the EAGAIN is coming from.

--- poll/unix/port.c.orig	2009-10-22 10:15:14.172437555 -0400
+++ poll/unix/port.c	2009-10-22 10:24:37.096601717 -0400
@@ -350,6 +350,9 @@
                              fd, get_event(ep->pfd.reqevents), ep);
         if (ret < 0) {
             rv = apr_get_netos_error();
+
+            if (rv == EAGAIN) rv = ENOEXEC;
+
             APR_RING_INSERT_TAIL(&(pollset->free_ring), ep, pfd_elem_t, link);
             break;
         }
@@ -367,6 +370,8 @@
     rv = call_port_getn(pollset->port_fd, pollset->port_set, pollset->nalloc,
                         &nget, timeout);
 
+    if (rv == EAGAIN) rv = ERANGE;
+
     /* decrease the waiting ASAP to reduce the window for calling 
        port_associate within apr_pollset_add() */
     apr_atomic_dec32(&pollset->waiting);

So with this patch, what is the error code/error string logged with "poll failed
waiting for CGI child"?

Comment 11 Jan Schlien 2009-10-23 03:23:25 UTC

Nice move, but bad luck. Retrun value remains EAGAIN (11) according to the log file. Still, I wonder why everything works fine under truss.

I'll try to grab more information with dtrace. Or I might introduce some quick and diry logging in port.c to see clearer. Whichever comes first...

Comment 12 Jeff Trawick 2009-10-23 14:43:33 UTC

See recent updates added to 48030.

I'm thinking (bad idea?) that the root cause of Bernd's problem (in 48030) is the same as this one, and Nick's problem (original report in 48029) is unique.

Comment 13 Jeff Trawick 2009-10-24 04:49:25 UTC

Summary:
port_getn() returns bogus return codes with 32-bit applications on Solaris 10/x86.  APR 1.3.8 accidently managed to work in most cases; APR 1.3.9 treats these bugs returns from port_getn() as errors since the value is negative.  The latest Solaris 10 update ("Solaris 10 10/09" a.k.a. "Solaris 10 U8") has the problem, as do some (all?) earlier releases.  Current OpenSolaris levels do not have the problem.  It does not apparently affect 64-bit applications on x86 or any applications on SPARC.  One interesting aspect of the port_getn() problem is that the retcode is good if you control the process calling port_getn() (at least with dbx or truss).

Anticipated fix:
Change APR configure to run a small test program and disable the use of Event Ports if port_getn() returns a bogus value.

Workaround:
Set the environment variable ac_cv_func_port_create=no when building APR in an affected configuration.

$ make distclean
$ ac_cv_func_port_create=no
$ export ac_cv_func_port_create
$ ./configure --regular-options

Here is a simple test program showing the issue:

$ cat useport.c 
#include <stdio.h>

#include <port.h>

int main(void)
{
    int port, rc;
    uint_t nget;

    port = port_create();

    rc = port_getn(port, NULL, 0, &nget, NULL);
    printf("rc %d nget %u\n", rc, nget);

    return 0;
}
$ cc useport.c && ./a.out 
rc -17349963 nget 0
$ cc -m64 useport.c && ./a.out 
rc 0 nget 0

Comment 14 Jan Schlien 2009-10-26 09:44:37 UTC

First off: Compiling with ac_cv_func_port_create=no makes everything run fine, it seems. That is a strong hint that the problem is located somewhere in the near the port interface, either implementation or usage.

However, I cannot reproduce your useport example on any of the machines I use to trigger the buggy behavior here. Can you prove it returns correct values when trussed? If no, that should be a completely different issue - if yes, it might be related, though.

Is the port thing restricted to solaris? Is it worth investigating more? Is it an option to globally disable ports for everyone?

Comment 15 Jeff Trawick 2009-10-26 10:39:51 UTC

>First off: Compiling with ac_cv_func_port_create=no makes everything run fine,
>it seems. That is a strong hint that the problem is located somewhere in the
>near the port interface, either implementation or usage.

>Is the port thing restricted to solaris? Is it worth investigating more? Is it
>an option to globally disable ports for everyone?

A little background:

apr_pollset_poll() is what APR provides to apps like httpd, and that works cross-platform.  The "Event Port" stuff applies only to Solaris.  If Event Ports were disabled, it wouldn't affect other platforms and it wouldn't affect all applications on Solaris since plain poll() would be used.

Manwhile, I found a better solution besides disabling Event Ports which I've committed.  This is it:

http://svn.apache.org/viewvc/apr/apr/branches/1.3.x/poll/unix/port.c?r1=807269&r2=829803

>However, I cannot reproduce your useport example on any of the machines I use
>to trigger the buggy behavior here. Can you prove it returns correct values
>when trussed? If no, that should be a completely different issue - if yes, it
>might be related, though.

You get "rc 0 nget 0" displayed on a machine with the problem?  Ouch.

Yes, truss or dbx makes that simple testcase run clean for me on Solaris 10/x86-32.

$ ./useport-32
rc -17349963 nget 0
$ truss ./useport-32 2>&1 | tail -5
brk(0x08062908)                                 = 0
fstat64(1, 0x08046FA0)                          = 0
rc 0 nget 0
write(1, " r c   0   n g e t   0\n", 12)        = 12
_exit(0)
$

similarities with your observations

a. got worse with APR 1.3.9

The huge negative retcode from port_getn() wouldn't cause a problem with APR 1.3.8, since 1.3.8 checked specifically for "rc == -1" instead of "rc < 0".

(The "got worse" idea assumes you had a different problem with 1.3.8, possibly the one that 1.3.9 corrected, which was very intermittent.)

b. problem doesn't occur under observation via truss

c. unexpected EAGAIN failure

apr_pollset_poll() would grab whatever was in errno when it thought port_getn() failed on these cases where port_getn() didn't really fail and didn't set errno.  Since we're doing I/O with the CGI and recently did I/O with the client , EAGAIN is a likely errno value to pick up incorrectly.

d. a case where port_getn() returns bogus negative number matches cgi_bucket_read()'s usage

a case with the bad retcode is when port_getn() is called with 0 timeout to find out immediately if an event is ready (if an event is actually ready it won't return a bad retcode)

that's the kind of call made when cgi_bucket_read() is called the first time by content-length filter; only when cgi_bucket_read() returns EAGAIN does content-length filter tell it to wait until data is available

--/--

I hope you're able to try the tiny patch pointed to above in in this comment in place of earlier attempts.

Comment 16 Jan Schlien 2009-10-27 11:52:14 UTC

> A little background:

Thanks for that part!

> Manwhile, I found a better solution besides disabling Event Ports which I've
> committed.  This is it:
> 
> http://svn.apache.org/viewvc/apr/apr/branches/1.3.x/poll/unix/port.c?r1=807269&r2=829803

Works! Seems we're on track now. However...

> >However, I cannot reproduce your useport example on any of the machines I use
> >to trigger the buggy behavior here. Can you prove it returns correct values
> >when trussed? If no, that should be a completely different issue - if yes, it
> >might be related, though.
> 
> You get "rc 0 nget 0" displayed on a machine with the problem?  Ouch.

... yes. That is still the case. But as the fix you committed works, there must be some strange condition sometimes making port_getn return those negative values aside from -1.

Had no time to play around with dtrace, yet. But it seems quite obvious it would show port_getn returning a strange value as does your example (though not reproducible here).

Comment 17 Jeff Trawick 2009-11-13 11:11:24 UTC

fix has been backported to APR 1.3.x branch for future 1.3.10