I have compiled 2.0.53 under Solaris 10 using the following configuration: CFLAGS="-m64 -mcpu=v9" LDFLAGS="-m64 -mcpu=v9" ./configure --enable-layout=Lifesci --disable-ipv6 --enable-so --enable-mods-shared=most --with-mpm=worker --with-ssl=/opt/apps64/openssl --enable-nonportable-atomics This compiled a 64-bit binary without displaying errors. However, after startup, the CGI gateway seems broken. I get 500 errors about 80% of the time when I try to run anything via CGI. The error log contains the following for each failed request: [Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] (32)Broken pipe: write to cgi daemon process [Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] Premature end of script headers: printenv [Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] File does not exist: /home/www-ssl/html/favicon.ico [Thu Mar 31 17:36:16 2005] [notice] child pid 10380 exit signal Segmentation fault (11) [Thu Mar 31 17:36:16 2005] [error] cgid daemon process died, restarting Debugging provided the following observations: 1) If I unload mod_ssl, then the CGI gateway works properly. 2) Compiling without --enable-nonportable-atomics does not solve the problem 3) Compiling with mod_ssl statically built (not a DSO) does not solve the problem. 4) The following configuration does work (prefork MPM): > CFLAGS="-m64 -mcpu=v9" LDFLAGS="-m64 -mcpu=v9" ./configure > --enable-layout=Lifesci --disable-ipv6 --enable-so --enable-ssl > --enable-rewrite --enable-info --enable-headers > --enable-mods-shared="auth_dbm dav dav_fs vhost_alias imap > speling expires usertrack unique-id asis vhost-alias include" > --with-ssl=/opt/apps64/openssl Here is some addition information regarding the build environment: > # gcc -v > Reading specs from > /opt/apps/gcc-3.4.3/lib/gcc/sparc-sun-solaris2.10/3.4.3/specsConfigured with: > ../gcc-3.4.3/configure --prefix=/opt/apps/gcc-3.4.3 --mandir=/opt/man > --infodir=/opt/info --with-as=/usr/sfw/bin/gas --with-gnu-as > --with-ld=/usr/ccs/bin/ld --disable-nls --disable-static --with-cpu=sparcv9 > --disable-libgcjThread model: posix > gcc version 3.4.3 > # uname -a > SunOS hw-develop.lscf.ucsb.edu. 5.10 Generic sun4u sparc SUNW,Ultra-60 > # /opt/apps64/openssl/bin/openssl version > OpenSSL 0.9.7f 22 Mar 2005
FWIW, I am able to hit the mod_cgid mess on the last Solaris 10 beta as well as Solaris 10 GA. Prior beta builds of Solaris 10 did not have the problem. I haven't tested mod_ssl any. This is 32-bit Solaris 10/x86 on Pentium 4. The failure is intermittent. Under CGI load, I have caused my box to reboot before, as mod_cgid daemon gets bogus data over the socket about how many bytes to allocate and tries to malloc too much. I can't recreate the failure with detailed tracing of one side of the socket (mod_cgid daemon or mod_cgid handler). The root cause of the problem is some corruption of information about CGI environment variables which is sent from the cgid handler over to the cgid daemon prior to creation of the CGI process. I have tweaked the mod_cgid code on both sides of the Unix socket to try to see what is required to trigger the problem, but without success. Meanwhile, we don't have reports of this type of problem occur on any other platform and it didn't occur on most of the publically-released beta versions of Solaris 10. That doesn't mean it can't possibly be an Apache bug, but it seems fairly likely that it is NOT an Apache bug. I've tried multiple levels of gcc (the two levels available from blastwave) with no change in behavior seen. So: I'm suggesting that an Apache change to fix this may not be forthcoming, and that you may see cgid problems without mod_ssl. I'd recommend that you stick with your prefork configuration (which uses mod_cgi and does not use a Unix socket) for now.
Is this actually mod_ssl-specific, it is only triggered by SSL requests?
Oh, Jeff answered that. Changing component to mod_cgid for the moment, given that data.
The problem also exists in other platforms. I'm using 2.0.54 on Win32 (XP, development machine). I have a similar problem with *some* client certificates during client authentication in SSL. Using OpenSSL/0.9.7f with 'SSLOptions +StdEnvVars +ExportCertData' [Thu May 05 13:30:20 2005] [error] [client 127.0.0.1] (22)Invalid argument: couldn't create child process: 22: printenv.pl [Thu May 05 13:30:20 2005] [error] [client 127.0.0.1] (22)Invalid argument: couldn't spawn child process: C:/apache/httpd/cgi-bin/printenv.pl The certificate data is (from the firefox client cert window): Issued to: OID.2.5.4.41=#1311456D707265736120646520507275656261,CN=Empresa de Prueba,OU=Sucursal de Prueba,O=Empresa de Prueba,serialNumber=" / AAAA010101HDFRXX00",OID.2.5.4.45=#131C414141303130313031414141202F2041414141303130313031414141 Serial Number: 30:30:30:30:31:30:30:30:30:30:30:30:30:30:30:30:30:31:31:34 Valid from 8/2/2004 14:47:13 PM to 8/2/2006 14:47:13 PM Purposes: Client,Server,Sign,Encrypt Issued by: O=Servicio de Administración Tributaria,OU=Administración de Seguridad de la Información,CN=AC de Pruebas SAT,C=MX,ST="Mexico, D.F.",L=Ciudad de Mexico It seems that it can't parse this DN, probably due to the unknown OID. Then the cgi invocation fails too because of corrupted data??? Problem is: I depend on 'SSLOptions +StdEnvVars' working because all the info is sent via mod_jk 1.2.8 to Tomcat 5.5, and is used to AAA the users on a j2ee webapp. It works ok with just "+ExportCertData", but other variables are required by settings such as: JkExtractSSL on JkHTTPSIndicator HTTPS JkCERTSIndicator SSL_CLIENT_CERT #JkCIPHERIndicator n/a ? JkSESSIONIndicator SSL_SESSION_ID JkKEYSIZEIndicator SSL_CIPHER_ALGKEYSIZE JkEnvVar SSL_CLIENT_M_SERIAL SSL_CLIENT_M_SERIAL
Gerardo, that sounds like bug 13029, the bug covered here is mod_cgid specific.
Tnx Joe, will seek for resolution on that thread.
A little more data on mod_cgid vs. Solaris 10: Below is a trace of the mod_cgid daemon process. It is supposed to be reading length (binary) followed by value of an environment variable. The first two read() calls show this clearly. Look at the next two read calls though. It reads the length properly on the first call. However, on the second call the environment variable is prefixed with the same binary length, as if the sender wrote the length twice. It goes downhill from there, as the mod_cgid daemon is now out of sync with the mod_cgid handler. Next time it thinks it is reading the length, it reads the last four bytes of the environment variable, and tries to allocate a huge amount of memory. If you start the web server and run cgid daemon and other child processes under truss (-u not needed), it works fine. If you truss just the daemon, you can still see the bogosity. /1: read(10, "\0\0\011", 4) = 4 /1@1: <- libc:_read() = 4 /1@1: <- libc:read() = 4 /1@1: <- mod_cgid:sock_read() = 0 /1@1: -> libapr-0:apr_palloc(0x14e410, 0x12, 0x4, 0xfffffff8) /1@1: <- libapr-0:apr_palloc() = 0x14e9b8 /1@1: -> libc_psr:memset(0x14e9b8, 0x0, 0x12, 0xfffffff8) /1@1: <- libc_psr:memset() = 0x14e9b8 /1@1: -> mod_cgid:sock_read(0xa, 0x14e9b8, 0x11, 0xfffffff8) /1@1: -> libc:read(0xa, 0x14e9b8, 0x11, 0x0) /1@1: -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0) /1@1: <- libc:_save_nv_regs() = 0xff092238 /1@1: -> libc:_read(0xa, 0x14e9b8, 0x11, 0x0) /1: read(10, " S E R V E R _ N A M E =".., 17) = 17 /1@1: <- libc:_read() = 17 /1@1: <- libc:read() = 17 /1@1: <- mod_cgid:sock_read() = 0 /1@1: -> mod_cgid:sock_read(0xa, 0xffbff870, 0x4, 0xfffffff8) /1@1: -> libc:read(0xa, 0xffbff870, 0x4, 0x0) /1@1: -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0) /1@1: <- libc:_save_nv_regs() = 0xff092238 /1@1: -> libc:_read(0xa, 0xffbff870, 0x4, 0x0) /1: read(10, "\0\0\017", 4) = 4 /1@1: <- libc:_read() = 4 /1@1: <- libc:read() = 4 /1@1: <- mod_cgid:sock_read() = 0 /1@1: -> libapr-0:apr_palloc(0x14e410, 0x18, 0x4, 0xfffffff8) /1@1: <- libapr-0:apr_palloc() = 0x14e9d0 /1@1: -> libc_psr:memset(0x14e9d0, 0x0, 0x18, 0xfffffff8) /1@1: <- libc_psr:memset() = 0x14e9d0 /1@1: -> mod_cgid:sock_read(0xa, 0x14e9d0, 0x17, 0xfffffff8) /1@1: -> libc:read(0xa, 0x14e9d0, 0x17, 0x0) /1@1: -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0) /1@1: <- libc:_save_nv_regs() = 0xff092238 /1@1: -> libc:_read(0xa, 0x14e9d0, 0x17, 0x0) /1: read(10, "\0\0\017 S E R V E R _ A".., 23) = 23 /1@1: <- libc:_read() = 23 /1@1: <- libc:read() = 23 /1@1: <- mod_cgid:sock_read() = 0 /1@1: -> mod_cgid:sock_read(0xa, 0xffbff870, 0x4, 0xfffffff8) /1@1: -> libc:read(0xa, 0xffbff870, 0x4, 0x0) /1@1: -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0) /1@1: <- libc:_save_nv_regs() = 0xff092238 /1@1: -> libc:_read(0xa, 0xffbff870, 0x4, 0x0) /1: read(10, " . 2 3 2", 4) = 4 /1@1: <- libc:_read() = 4 /1@1: <- libc:read() = 4 /1@1: <- mod_cgid:sock_read() = 0 /1@1: -> libapr-0:apr_palloc(0x14e410, 0x2e323333, 0x4, 0xfffffff8) The sender does have the capability of repeating the call to write(), but only if it gets -1/EINTR: static apr_status_t sock_write(int fd, const void *buf, size_t buf_size) { int rc; do { rc = write(fd, buf, buf_size); } while (rc < 0 && errno == EINTR); if (rc < 0) { return errno; } return APR_SUCCESS; } But adding a write to error log for the repeat case doesn't show a repeated call to write at the time that the server receives the binary envvar length twice.
As a datapoint, I can reproduce massive problems with mod_cgid on httpd.xones.apache.org box. I ended up switching to prefork/mod_cgi, because I needed CGI to work, but mod_cgid is seriously broken on Solaris 10. If any @apache.org people need a Solaris 10 machine for testing, we can setup accounts on httpd.zones.apache.org
FWIW, I can confirm that it has nothing to do with gcc. I'm using Sun Forte 9 and I still see the same problem. I'm going to update the summary accordingly. Plus, I'm going to spend some time today dorking with this one.
Here's a 'dtrace' script: http://people.apache.org/~jerenkrantz/httpd.d ./httpd.d 14581 15274 (First pid is worker process; second is the cgid daemon.) Here's 'bad' output from an SMP x86 Solaris 10 box: 0 40471 get_req:entry (15274) Entering get_req! 0 10 read:entry read (15274) fd: 38 - 64 bytes 0 11 read:return read (15274) fd: 38 - 64 bytes 0 11 read:return 0 10 read:entry read (15274) fd: 38 - 42 bytes 0 11 read:return read (15274) fd: 38 - 42 bytes 0 11 read:return /home/jerenk/public_html/weblog/weblog.cgi 0 10 read:entry read (15274) fd: 38 - 10 bytes 0 11 read:return read (15274) fd: 38 - 10 bytes 0 11 read:return weblog.cgii/software/ 0 10 read:entry read (15274) fd: 38 - 20 bytes 0 11 read:return read (15274) fd: 38 - 20 bytes 0 11 read:return /weblog.cgi/software 0 10 read:entry read (15274) fd: 38 - 4 bytes 0 11 read:return read (15274) fd: 38 - 4 bytes 0 11 read:return 1430084180 0 11 read:return TZ=U 1 12 write:entry write (14581) fd: 38 - 64 bytes 1 12 write:entry 1 13 write:return wrote (14581): 64 1 12 write:entry write (14581) fd: 38 - 42 bytes 1 12 write:entry /home/jerenk/public_html/weblog/weblog.cgi 1 13 write:return wrote (14581): 42 1 12 write:entry write (14581) fd: 38 - 10 bytes 1 12 write:entry weblog.cgi 1 13 write:return wrote (14581): 10 1 12 write:entry write (14581) fd: 38 - 20 bytes 1 12 write:entry /weblog.cgi/software 1 13 write:return wrote (14581): 20 1 12 write:entry write (14581) fd: 38 - 4 bytes 1 12 write:entry 1 13 write:return wrote (14581): 4 1 12 write:entry write (14581) fd: 38 - 13 bytes 1 12 write:entry TZ=US/Pacific 1 13 write:return wrote (14581): 13 1 12 write:entry write (14581) fd: 38 - 4 bytes 1 12 write:entry $ 1 13 write:return wrote (14581): 4 1 12 write:entry write (14581) fd: 38 - 36 bytes 1 12 write:entry HTTP_HOST=weblog.erenkrantz.com:8080 1 13 write:return wrote (14581): 36 1 12 write:entry write (14581) fd: 38 - 4 bytes 1 12 write:entry l 1 13 write:return wrote (14581): 4 1 12 write:entry write (14581) fd: 38 - 108 bytes 1 12 write:entry HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b4) Gecko/20050819 Firefox/1.0+ 1 13 write:return wrote (14581): 108 ...snip...pid 15274 is done allocating 1430084180 bytes.... 1 10 read:entry read (15274) fd: 38 - 1430084180 bytes 1 11 read:return read (15274): Short read: 1430084180, 165 1 11 read:return S/Pacific 1 10 read:entry read (15274) fd: 38 - 1430084015 bytes 1 11 read:return read (15274): Short read: 1430084015, 1849 1 11 read:return Data too large 1 10 read:entry read (15274) fd: 38 - 1430082166 bytes 1 11 read:return read (15274): Short read: 1430082166, 0 1 11 read:return 1 40472 get_req:return (15274) Leaving get_req!
Responses on the OpenSolaris development lists (opensolaris-code@opensolaris.org) to my query: http://www.opensolaris.org/jive/thread.jspa?threadID=1957&tstart=0 indicate the following: - mod_cgid uses AF_UNIX sockets - AF_UNIX was 'massively overhauled for Solaris 10, right near GA' - There are a number of AF_UNIX bugs that have fixes in the Sol10 patch pipeline [Current versions of Solaris Express have the fixes already.] - One temporary workaround that was offered was to comment out the shutdown() call around line mod_cgid.c:1431. I'll follow up with testing the shutdown() fix as well as trying out sol-nv-b19 next week. But, this definitely isn't an httpd bug. However, there is a possibility we can work around it though.
In my limited testing, commenting out the shutdown call doesn't seem to fix the problems for me. I'll try the Solaris Express stuff next week.
Based upon my examination, using snv_19 (current Solaris Express release) doesn't have this problem. (I subjected worker MPM'd httpd to flood'd CGI requests without problems.) I have committed a fix to trunk in r264866 to block usage of mod_cgid on Solaris 10. After the AF_UNIX patch from Sun is released (120664-01), we can detect for that patch and only allow mod_cgid usage in that scenario.
120664-01 has been released (9/13). See http://sunsolve.sun.com/pub-cgi/show.pl?target=patches/xos-10&nav=pub-patches Has anybody tried it yet? (In a fit of disgust at several issues I overlaid my Sol10 installation with FC4.)
Yes, I believe 120664/120665 fixes it. If not, please reopen this bug. Fixes were committed in r326018 to detect a 'good' patch at configure time.
Footnote; 118833-36 cluster claims to superceed 120664-01 However, I've heard at least one rumor that it doesn't solve this bug. So someone with time on their hands aught to validate this so we can add this newer cluster to the list of acceptable patches in ./configure
This looks like PR#39332. That bug has a proposed patch attached, which I've just reviewed and am applying to trunk. Testing welcome! *** This bug has been marked as a duplicate of bug 39332 ***
Undo spam change *** This bug has been marked as a duplicate of bug 39332 ***