Apache 2.0.59 on Solaris 10 Update 4 (Generic_120011-14) Corrupt HTTP/1.1 chunking output seen. Corresponded exactly to the duplication of the contents of a file descriptor (an SSI file inclusion) passed as an argument to sendfilev() (without a preceding chunk header the browsers barfed) The problem was *only* seen when the machine was under I/O stress and sendfilev() returned EINTR according to truss. There were never problems without EINTR. Disabling sendfile support in a .htaccess made the problem disappear. The sun manual page : http://docs.sun.com/app/docs/doc/816-5172/sendfilev-3ext?a=view says EINTR will not have sent data. However, if you compare the Apache sendfilev() handling to the Samba sendfilev() handling you see different handling of EINTR. Apache: srclib/apr/network_io/unix/sendrecv.c do { /* Clear out the repeat */ repeat = 0; /* socket, vecs, number of vecs, bytes written */ rv = sendfilev(sock->socketdes, sfv, vecs, &nbytes); if (rv == -1 && errno == EAGAIN) { if (nbytes) { rv = 0; } else if (!arv && apr_is_option_set(sock->netmask, APR_SO_TIMEOUT) == 1) { apr_status_t t = apr_wait_for_io_or_timeout(NULL, sock, 0); if (t != APR_SUCCESS) { *len = 0; return t; } arv = 1; repeat = 1; } } } while ((rv == -1 && errno == EINTR) || repeat); Samba: /* * Although not listed in the API error returns, this is almost certainly * a slow system call and will be interrupted by a signal with EINTR. JRA. */ xferred = 0; #if defined(HAVE_EXPLICIT_LARGEFILE_SUPPORT) && defined(HAVE_OFF64_T) && defined(HAVE_SENDFILEV64) nwritten = sendfilev64(tofd, vec, sfvcnt, &xferred); #else nwritten = sendfilev(tofd, vec, sfvcnt, &xferred); #endif if (nwritten == -1 && errno == EINTR) { if (xferred == 0) continue; /* Nothing written yet. */ else nwritten = xferred; } I have also raised a bug with Sun over the documentation of their error behaviour
Created attachment 21919 [details] proposed patch for apr 0.9 (Apache 2.0.x)
Created attachment 21920 [details] proposed patch for apr trunk
testing requested; I doubt I'll be able to find the time
I'll see what I can do.
According to http://sunsolve.sun.com/search/document.do?assetkey=1-1-6408517-1, this bug is fixed in Solaris Express, not sure if it was integrated into Solaris 10 5/08. Bug ID: 6408517 Synopsis: sendfile should only return EINTR if no bytes have been transferred. Product: solaris Category: library Subcategory: libsendfile Release: solaris_nevada Fixed by patch: Introduced in Build: s81_37 Commit to fix in Build: snv_42 Fixed in Build: snv_42 Integrated in Build: snv_42
I am not sure about reset rv to 0 when bytes transferred is non 0. There are other errno could have set the byte transferred to non-zero value.
The patch made the machines go splattydeath apparently. Loops and memory leaks they said, but I wasn't around when it got unexpectedly deployed/reverted so I haven't got any hard data on what happened. I will try and set up a pet to persecute to find out how/why it failed, but in the meantime, it's failed.
Created attachment 22214 [details] truss -vall -u '*:*' output of apache going loopy using previous patch Got a trace of an above-patched apache going batbleep loopy. It was attempting to send a flatfile jpg. Caught with truss -vall -u '*:*'
I hope I've got some interesting additional information: We started analyzing a similar problem: Acrobat Reader retrieving a PDF document via Range requests and getting E/A errors. Server was Apache httpd 2.0.63 on Solaris 10. No Problem on some Solaris 8 and 9 systems. Truss revealed the ocurrence of sendfilev() calls returning with EINVAL. First I thought the problem are the params to sendfilev, because vecs was 0 and by going up the stack I saw, that there were no headers and trailers and also file size 0. Strange. Then I noticed, that the actual problem is in the sendfilev call shortly before the one, that returns with EINVAL. The sendfilev directly before returns with a number of written bytes bigger, than all headers plus trailers plus file. By some code instrumentation I saw: httpd hands sendfilev over 20 buffers. If I call sendfile in the same way as httpd would, and use only the first 1, 2, ... 16 buffers it returns the correct number of bytes. If I call sendfilev with the first 17 buffers, it returns twice the number of bytes for the first 16 buffers plus the size of the 17th. So when crossing from 16 to 17 buffers, it seems to write the first 16 buffers double. That's when I found this issue here, which is also about Solaris, sendfile, EINVAl and dupliocate data written. I'll try to search sunsolve, check against older 2.0.x versions and also to make an easy test case against apr 0.9 tomorrow. It does not seem to happen with httpd 2.2, although it could be, that the problem depends on subtle specifics of the requests (byte sizes etc.). I didn't try the patch Jeff attached here, because I think the EINVAL is not the root cause. That only happens, because the number of bytes written is too big and breaks later calculations in server/core.c., such that sendfilev later4 gets called again with zero buffers.
If I'm not totally wrong in what I'm doing, I am now down to the following finding: In my test case I throw away all sendfilev buffers directly before the call to sendfilew coming from httpd and create 17 new buffers instead, each of size 1 and containing the single character 'X'. The web server indeed returns the 17 characters "X" (naturally not in an HTTP compliant way), but sendfilev says it has written 33 bytes, i.e. again double the size of the first 16 buffers plus the size of buffer 17. If I go down to 16 buffers, this does not happen. If I'm using sendfilev directly and write to a file on the file system it does not happen. If I'm using sendfilev directly and write to a network socket directly created with socket/bind/listen/address it does happen. So it seems this is reproducible without apr/httpd. Though I didn't find any obviously relevant patch in Sunsolve, I'll check again.
Our problem was finally fixed by applying the most recent Solaris 10 Kernel Jumbo Patch 137111-04. It is only a few weeks old and contains a critical security fix for Bug "6675943: dblk_t structures may be overwitten by sendfilev() mishandling certain input vectors". Apart from fixing a sendfilev issue, the part of the bug description publicly available doesn't sound as related to this issue here, but applying the patch to the system fixed it. D White: can you confirm, that this patch for Solaris 10 also fixes your problem?
Won't have an opportunity to test for a while (no more deployments/Apache2.2 on the horizon). Will squeak if I do.