34264 – Broken mod_cgid under Solaris 10 with Worker MPM

Bug 34264 - Broken mod_cgid under Solaris 10 with Worker MPM

Summary: Broken mod_cgid under Solaris 10 with Worker MPM

Status:	RESOLVED DUPLICATE of bug 39332

Alias:	None

Product:	Apache httpd-2
Classification:	Unclassified
Component:	mod_cgid (show other bugs)
Version:	2.0.53
Hardware:	Sun Solaris

Importance:	P2 blocker (vote)
Target Milestone:	---
Assignee:	Apache HTTPD Bugs Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-04-01 19:13 UTC by Shea Lovan
Modified:	2018-02-25 19:49 UTC (History)
CC List:	2 users (show)

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Shea Lovan 2005-04-01 19:13:45 UTC

I have compiled 2.0.53 under Solaris 10 using the following configuration:

CFLAGS="-m64 -mcpu=v9" LDFLAGS="-m64 -mcpu=v9" ./configure
--enable-layout=Lifesci --disable-ipv6 --enable-so --enable-mods-shared=most
--with-mpm=worker --with-ssl=/opt/apps64/openssl --enable-nonportable-atomics

This compiled a 64-bit binary without displaying errors.  However, after
startup, the CGI gateway seems broken.  I get 500 errors about 80% of the time
when I try to run anything via CGI.  The error log contains the following for
each failed request:

[Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] (32)Broken pipe:
write to cgi daemon process
[Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] Premature end of
script headers: printenv
[Thu Mar 31 17:36:16 2005] [error] [client 128.111.226.63] File does not exist:
/home/www-ssl/html/favicon.ico
[Thu Mar 31 17:36:16 2005] [notice] child pid 10380 exit signal Segmentation
fault (11)
[Thu Mar 31 17:36:16 2005] [error] cgid daemon process died, restarting

Debugging provided the following observations:
  1)  If I unload mod_ssl, then the CGI gateway works properly.
  2)  Compiling without --enable-nonportable-atomics does not solve the problem
  3)  Compiling with mod_ssl statically built (not a DSO) does not solve the
problem.
  4)  The following configuration does work (prefork MPM):
      > CFLAGS="-m64 -mcpu=v9" LDFLAGS="-m64 -mcpu=v9" ./configure
      > --enable-layout=Lifesci --disable-ipv6 --enable-so --enable-ssl
      > --enable-rewrite --enable-info --enable-headers
      > --enable-mods-shared="auth_dbm dav dav_fs vhost_alias imap 
      > speling expires usertrack unique-id asis vhost-alias include"
      > --with-ssl=/opt/apps64/openssl

Here is some addition information regarding the build environment:
> # gcc -v
> Reading specs from
> /opt/apps/gcc-3.4.3/lib/gcc/sparc-sun-solaris2.10/3.4.3/specsConfigured with: 
> ../gcc-3.4.3/configure --prefix=/opt/apps/gcc-3.4.3 --mandir=/opt/man 
> --infodir=/opt/info --with-as=/usr/sfw/bin/gas --with-gnu-as 
> --with-ld=/usr/ccs/bin/ld --disable-nls --disable-static --with-cpu=sparcv9 
> --disable-libgcjThread model: posix
> gcc version 3.4.3

> # uname -a
> SunOS hw-develop.lscf.ucsb.edu. 5.10 Generic sun4u sparc SUNW,Ultra-60

> # /opt/apps64/openssl/bin/openssl version
> OpenSSL 0.9.7f 22 Mar 2005

Comment 1 Jeff Trawick 2005-04-01 21:41:53 UTC

FWIW, I am able to hit the mod_cgid mess on the last Solaris 10 beta as well as
Solaris 10 GA.  Prior beta builds of Solaris 10 did not have the problem.  I
haven't tested mod_ssl any.  This is 32-bit Solaris 10/x86 on Pentium 4.

The failure is intermittent.  Under CGI load, I have caused my box to reboot
before, as mod_cgid daemon gets bogus data over the socket about how many bytes
to allocate and tries to malloc too much.  I can't recreate the failure with
detailed tracing of one side of the socket (mod_cgid daemon or mod_cgid handler).

The root cause of the problem is some corruption of information about CGI
environment variables which is sent from the cgid handler over to the cgid
daemon prior to creation of the CGI process.  I have tweaked the mod_cgid code
on both sides of the Unix socket to try to see what is required to trigger the
problem, but without success.

Meanwhile, we don't have reports of this type of problem occur on any other
platform and it didn't occur on most of the publically-released beta versions of
Solaris 10.  That doesn't mean it can't possibly be an Apache bug, but it seems
fairly likely that it is NOT an Apache bug.  I've tried multiple levels of gcc
(the two levels available from blastwave) with no change in behavior seen.

So: 
I'm suggesting that an Apache change to fix this may not be forthcoming, and
that you may see cgid problems without mod_ssl.  I'd recommend that you stick
with your prefork configuration (which uses mod_cgi and does not use a Unix
socket) for now.

Comment 2 Joe Orton 2005-04-04 10:12:41 UTC

Is this actually mod_ssl-specific, it is only triggered by SSL requests?

Comment 3 Joe Orton 2005-04-04 17:18:04 UTC

Oh, Jeff answered that.  Changing component to mod_cgid for the moment, given
that data.

Comment 4 Gerardo Nevarez 2005-05-05 21:08:59 UTC

The problem also exists in other platforms. I'm using 2.0.54 on Win32 (XP,
development machine). 

I have a similar problem with *some* client certificates during client
authentication in SSL. Using OpenSSL/0.9.7f  with 'SSLOptions +StdEnvVars
+ExportCertData'

[Thu May 05 13:30:20 2005] [error] [client 127.0.0.1] (22)Invalid argument:
couldn't create child process: 22: printenv.pl
[Thu May 05 13:30:20 2005] [error] [client 127.0.0.1] (22)Invalid argument:
couldn't spawn child process: C:/apache/httpd/cgi-bin/printenv.pl

The certificate data is (from the firefox client cert window):

Issued to: OID.2.5.4.41=#1311456D707265736120646520507275656261,CN=Empresa de
Prueba,OU=Sucursal de Prueba,O=Empresa de Prueba,serialNumber=" /
AAAA010101HDFRXX00",OID.2.5.4.45=#131C414141303130313031414141202F2041414141303130313031414141
  Serial Number: 30:30:30:30:31:30:30:30:30:30:30:30:30:30:30:30:30:31:31:34
  Valid from 8/2/2004 14:47:13 PM to 8/2/2006 14:47:13 PM
  Purposes: Client,Server,Sign,Encrypt
Issued by: O=Servicio de Administración Tributaria,OU=Administración de
Seguridad de la Información,CN=AC de Pruebas SAT,C=MX,ST="Mexico, D.F.",L=Ciudad
de Mexico

It seems that it can't parse this DN, probably due to the unknown OID. Then the
cgi invocation fails too because of corrupted data???

Problem is: I depend on 'SSLOptions +StdEnvVars' working because all the info is
sent via mod_jk  1.2.8 to Tomcat 5.5, and is used to AAA the users on a j2ee
webapp. It works ok with just "+ExportCertData", but other variables are
required by settings such as:

JkExtractSSL on
JkHTTPSIndicator HTTPS
JkCERTSIndicator SSL_CLIENT_CERT
#JkCIPHERIndicator n/a ?
JkSESSIONIndicator SSL_SESSION_ID
JkKEYSIZEIndicator SSL_CIPHER_ALGKEYSIZE
JkEnvVar SSL_CLIENT_M_SERIAL SSL_CLIENT_M_SERIAL

Comment 5 Joe Orton 2005-05-05 22:25:54 UTC

Gerardo, that sounds like bug 13029, the bug covered here is mod_cgid specific.

Comment 6 Gerardo Nevarez 2005-05-09 17:36:03 UTC

Tnx Joe, will seek for resolution on that thread.

Comment 7 Jeff Trawick 2005-07-23 13:07:32 UTC

A little more data on mod_cgid vs. Solaris 10:

Below is a trace of the mod_cgid daemon process.  It is supposed to be reading
length (binary) followed by value of an environment variable.  The first two
read() calls show this clearly.  Look at the next two read calls though.  It
reads the length properly on the first call.  However, on the second call the
environment variable is prefixed with the same binary length, as if the sender
wrote the length twice.  It goes downhill from there, as the mod_cgid daemon is
now out of sync with the mod_cgid handler.  Next time it thinks it is reading
the length, it reads the last four bytes of the environment variable, and tries
to allocate a huge amount of memory.

If you start the web server and run cgid daemon and other child processes under
truss (-u not needed), it works fine.  If you truss just the daemon, you can
still see the bogosity.

/1:     read(10, "\0\0\011", 4)                         = 4
/1@1:                   <- libc:_read() = 4
/1@1:                 <- libc:read() = 4
/1@1:               <- mod_cgid:sock_read() = 0
/1@1:               -> libapr-0:apr_palloc(0x14e410, 0x12, 0x4, 0xfffffff8)
/1@1:               <- libapr-0:apr_palloc() = 0x14e9b8
/1@1:               -> libc_psr:memset(0x14e9b8, 0x0, 0x12, 0xfffffff8)
/1@1:               <- libc_psr:memset() = 0x14e9b8
/1@1:               -> mod_cgid:sock_read(0xa, 0x14e9b8, 0x11, 0xfffffff8)
/1@1:                 -> libc:read(0xa, 0x14e9b8, 0x11, 0x0)
/1@1:                   -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0)
/1@1:                   <- libc:_save_nv_regs() = 0xff092238
/1@1:                   -> libc:_read(0xa, 0x14e9b8, 0x11, 0x0)
/1:     read(10, " S E R V E R _ N A M E =".., 17)      = 17
/1@1:                   <- libc:_read() = 17
/1@1:                 <- libc:read() = 17
/1@1:               <- mod_cgid:sock_read() = 0
/1@1:               -> mod_cgid:sock_read(0xa, 0xffbff870, 0x4, 0xfffffff8)
/1@1:                 -> libc:read(0xa, 0xffbff870, 0x4, 0x0)
/1@1:                   -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0)
/1@1:                   <- libc:_save_nv_regs() = 0xff092238
/1@1:                   -> libc:_read(0xa, 0xffbff870, 0x4, 0x0)
/1:     read(10, "\0\0\017", 4)                         = 4
/1@1:                   <- libc:_read() = 4
/1@1:                 <- libc:read() = 4
/1@1:               <- mod_cgid:sock_read() = 0
/1@1:               -> libapr-0:apr_palloc(0x14e410, 0x18, 0x4, 0xfffffff8)
/1@1:               <- libapr-0:apr_palloc() = 0x14e9d0
/1@1:               -> libc_psr:memset(0x14e9d0, 0x0, 0x18, 0xfffffff8)
/1@1:               <- libc_psr:memset() = 0x14e9d0
/1@1:               -> mod_cgid:sock_read(0xa, 0x14e9d0, 0x17, 0xfffffff8)
/1@1:                 -> libc:read(0xa, 0x14e9d0, 0x17, 0x0)
/1@1:                   -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0)
/1@1:                   <- libc:_save_nv_regs() = 0xff092238
/1@1:                   -> libc:_read(0xa, 0x14e9d0, 0x17, 0x0)
/1:     read(10, "\0\0\017 S E R V E R _ A".., 23)      = 23
/1@1:                   <- libc:_read() = 23
/1@1:                 <- libc:read() = 23
/1@1:               <- mod_cgid:sock_read() = 0
/1@1:               -> mod_cgid:sock_read(0xa, 0xffbff870, 0x4, 0xfffffff8)
/1@1:                 -> libc:read(0xa, 0xffbff870, 0x4, 0x0)
/1@1:                   -> libc:_save_nv_regs(0xff092238, 0x0, 0x0, 0x0)
/1@1:                   <- libc:_save_nv_regs() = 0xff092238
/1@1:                   -> libc:_read(0xa, 0xffbff870, 0x4, 0x0)
/1:     read(10, " . 2 3 2", 4)                         = 4
/1@1:                   <- libc:_read() = 4
/1@1:                 <- libc:read() = 4
/1@1:               <- mod_cgid:sock_read() = 0
/1@1:               -> libapr-0:apr_palloc(0x14e410, 0x2e323333, 0x4, 0xfffffff8)

The sender does have the capability of repeating the call to write(), but only
if it gets -1/EINTR:

static apr_status_t sock_write(int fd, const void *buf, size_t buf_size)
{
    int rc;

    do {
        rc = write(fd, buf, buf_size);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return errno;
    }

    return APR_SUCCESS;
}

But adding a write to error log for the repeat case doesn't show a repeated call
to write at the time that the server receives the binary envvar length twice.

Comment 8 Paul Querna 2005-07-24 21:11:06 UTC

As a datapoint, I can reproduce massive problems with mod_cgid on
httpd.xones.apache.org box.  I ended up switching to prefork/mod_cgi, because I
needed CGI to work, but mod_cgid is seriously broken on Solaris 10.

If any @apache.org people need a Solaris 10 machine for testing, we can setup
accounts on httpd.zones.apache.org

Comment 9 Justin Erenkrantz 2005-08-26 19:51:33 UTC

FWIW, I can confirm that it has nothing to do with gcc.  I'm using Sun Forte 9 and I still see the same 
problem.

I'm going to update the summary accordingly.  Plus, I'm going to spend some time today dorking with this 
one.

Comment 10 Justin Erenkrantz 2005-08-27 00:40:37 UTC

Here's a 'dtrace' script:

http://people.apache.org/~jerenkrantz/httpd.d

./httpd.d 14581 15274

(First pid is worker process; second is the cgid daemon.)

Here's 'bad' output from an SMP x86 Solaris 10 box:

  0  40471                    get_req:entry (15274) Entering get_req!
  0     10                       read:entry read (15274) fd: 38 - 64 bytes
  0     11                      read:return read (15274) fd: 38 - 64 bytes
  0     11                      read:return
  0     10                       read:entry read (15274) fd: 38 - 42 bytes
  0     11                      read:return read (15274) fd: 38 - 42 bytes
  0     11                      read:return
/home/jerenk/public_html/weblog/weblog.cgi
  0     10                       read:entry read (15274) fd: 38 - 10 bytes
  0     11                      read:return read (15274) fd: 38 - 10 bytes
  0     11                      read:return weblog.cgii/software/
  0     10                       read:entry read (15274) fd: 38 - 20 bytes
  0     11                      read:return read (15274) fd: 38 - 20 bytes
  0     11                      read:return /weblog.cgi/software
  0     10                       read:entry read (15274) fd: 38 - 4 bytes
  0     11                      read:return read (15274) fd: 38 - 4 bytes
  0     11                      read:return 1430084180
  0     11                      read:return TZ=U
  1     12                      write:entry write (14581) fd: 38 - 64 bytes
  1     12                      write:entry
  1     13                     write:return wrote (14581): 64
  1     12                      write:entry write (14581) fd: 38 - 42 bytes
  1     12                      write:entry
/home/jerenk/public_html/weblog/weblog.cgi
  1     13                     write:return wrote (14581): 42
  1     12                      write:entry write (14581) fd: 38 - 10 bytes
  1     12                      write:entry weblog.cgi
  1     13                     write:return wrote (14581): 10
  1     12                      write:entry write (14581) fd: 38 - 20 bytes
  1     12                      write:entry /weblog.cgi/software
  1     13                     write:return wrote (14581): 20
  1     12                      write:entry write (14581) fd: 38 - 4 bytes
  1     12                      write:entry
  1     13                     write:return wrote (14581): 4
  1     12                      write:entry write (14581) fd: 38 - 13 bytes
  1     12                      write:entry TZ=US/Pacific
  1     13                     write:return wrote (14581): 13
  1     12                      write:entry write (14581) fd: 38 - 4 bytes
  1     12                      write:entry $
  1     13                     write:return wrote (14581): 4
  1     12                      write:entry write (14581) fd: 38 - 36 bytes
  1     12                      write:entry HTTP_HOST=weblog.erenkrantz.com:8080
  1     13                     write:return wrote (14581): 36
  1     12                      write:entry write (14581) fd: 38 - 4 bytes
  1     12                      write:entry l
  1     13                     write:return wrote (14581): 4
  1     12                      write:entry write (14581) fd: 38 - 108 bytes
  1     12                      write:entry HTTP_USER_AGENT=Mozilla/5.0
(Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b4) Gecko/20050819 Firefox/1.0+
  1     13                     write:return wrote (14581): 108
...snip...pid 15274 is done allocating 1430084180 bytes....
  1     10                       read:entry read (15274) fd: 38 - 1430084180 bytes
  1     11                      read:return read (15274): Short read:
1430084180, 165
  1     11                      read:return S/Pacific
  1     10                       read:entry read (15274) fd: 38 - 1430084015 bytes
  1     11                      read:return read (15274): Short read:
1430084015, 1849
  1     11                      read:return Data too large
  1     10                       read:entry read (15274) fd: 38 - 1430082166 bytes
  1     11                      read:return read (15274): Short read: 1430082166, 0
  1     11                      read:return
  1  40472                   get_req:return (15274) Leaving get_req!

Comment 11 Justin Erenkrantz 2005-08-27 01:39:21 UTC

Responses on the OpenSolaris development lists (opensolaris-code@opensolaris.org) to my query:

http://www.opensolaris.org/jive/thread.jspa?threadID=1957&tstart=0

indicate the following:

- mod_cgid uses AF_UNIX sockets
- AF_UNIX was 'massively overhauled for Solaris 10, right near GA'
- There are a number of AF_UNIX bugs that have fixes in the Sol10 patch pipeline
   [Current versions of Solaris Express have the fixes already.]
- One temporary workaround that was offered was to comment out the shutdown() call
   around line mod_cgid.c:1431.

I'll follow up with testing the shutdown() fix as well as trying out sol-nv-b19 next week.

But, this definitely isn't an httpd bug.  However, there is a possibility we can work around it though.

Comment 12 Justin Erenkrantz 2005-08-27 02:00:27 UTC

In my limited testing, commenting out the shutdown call doesn't seem to fix the problems for me.

I'll try the Solaris Express stuff next week.

Comment 13 Justin Erenkrantz 2005-08-30 23:24:03 UTC

Based upon my examination, using snv_19 (current Solaris Express release) doesn't have this problem.  (I 
subjected worker MPM'd httpd to flood'd CGI requests without problems.)

I have committed a fix to trunk in r264866 to block usage of mod_cgid on Solaris 10.

After the AF_UNIX patch from Sun is released (120664-01), we can detect for that patch and only allow 
mod_cgid usage in that scenario.

Comment 14 Jeff Trawick 2005-09-18 17:06:53 UTC

120664-01 has been released (9/13).  See
http://sunsolve.sun.com/pub-cgi/show.pl?target=patches/xos-10&nav=pub-patches

Has anybody tried it yet?

(In a fit of disgust at several issues I overlaid my Sol10 installation with FC4.)

Comment 15 Justin Erenkrantz 2005-11-05 10:42:36 UTC

Yes, I believe 120664/120665 fixes it.  If not, please reopen this bug.

Fixes were committed in r326018 to detect a 'good' patch at configure time.

Comment 16 William A. Rowe Jr. 2007-08-20 10:23:14 UTC

Footnote; 118833-36 cluster claims to superceed 120664-01

However, I've heard at least one rumor that it doesn't solve this bug.

So someone with time on their hands aught to validate this so we can add
this newer cluster to the list of acceptable patches in ./configure

Comment 17 Nick Kew 2008-12-26 18:08:40 UTC

This looks like PR#39332.  That bug has a proposed patch attached, which I've just reviewed and am applying to trunk.  Testing welcome!

*** This bug has been marked as a duplicate of bug 39332 ***

Comment 18 Rainer Jung 2018-02-25 19:49:42 UTC

Undo spam change

*** This bug has been marked as a duplicate of bug 39332 ***