Bug 46467

Summary: Apache-childs segfault when number of childs reaches 130
Product: Apache httpd-2
Reporter: Alex Pircher <Alexander_Pircher>
Component: mpm_prefork
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED
Severity: critical
CC: erno.kovacs, jwezel, klausman, werner
Priority: P1
Keywords: FixedInTrunk
Version: 2.2.11   
Target Milestone: ---   
Hardware: Other   
OS: Linux   
Attachments:
  GNU-Debugger-Output.txt
  httpd.conf
  add more error logging for apr_pollset_create and apr_pollset_add
  worker beos (not tested)

Description Alex Pircher 2009-01-02 16:52:43 UTC
Created attachment 23071
GNU-Debugger-Output.txt

DESCRIPTION:
The Apache children begin to segfault once Apache reaches 130 children. This can
be reproduced every time by setting MinSpareServers to a value higher than 130.
It has been tested with 2.2.11, 2.2.10 and 2.2.9, which all show the same
behaviour. 2.0.63 works fine.

The error_log contains:
...
[Sat Jan 03 01:32:34 2009] [notice] child pid 672 exit signal Segmentation fault (11)
[Sat Jan 03 01:32:34 2009] [notice] child pid 673 exit signal Segmentation fault (11)
[Sat Jan 03 01:32:34 2009] [notice] child pid 674 exit signal Segmentation fault (11)
...

Attached is the gdb output for one core dump.

CONFIGURATION:
./configure --prefix=/usr/local/bin/httpd --enable-suexec --with-suexec --enable-mods-shared=all --disable-imagemap

The following lines have been added to httpd.conf:
# ---------------------------------------------
<IfModule prefork.c>
StartServers      32
MinSpareServers  200
MaxSpareServers  400
ServerLimit      1600
MaxClients       1600
MaxRequestsPerChild  4000
</IfModule>

CoreDumpDirectory /tmp
# ---------------------------------------------
Comment 1 Ruediger Pluem 2009-01-03 01:31:12 UTC
Please provide the following information:

- Kernel version
- glibc version
- Your httpd.conf
- Your ulimits (ulimit -a)

Please execute the following steps with gdb when you have loaded your core dump:

frame 1
p num_listensocks
Comment 2 Alex Pircher 2009-01-03 13:26:32 UTC
Created attachment 23075
httpd.conf
Comment 3 Alex Pircher 2009-01-03 13:27:57 UTC
- Kernel version
2.6.27.9-159 (x86_64)

- glibc version
2.9

- Your httpd.conf
Default httpd.conf installed during installation with the lines above added.
Attached is my httpd.conf

- Your ulimits (ulimit -a)
I have already tried setting the limits as high as possible. The current settings are:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

- gdb steps
(gdb) frame 1
#1  0x00000000004494b6 in child_main (child_num_arg=<value optimized out>) at prefork.c:532
532             (void) apr_pollset_add(pollset, &pfd);
(gdb) p num_listensocks
$1 = 1
Comment 4 Alex Pircher 2009-01-03 14:02:23 UTC
An additional note which may be important: I have just tested it with the first 2.2 release, Apache 2.2.0, and got the same behaviour.
Comment 5 Jeff Trawick 2009-01-06 14:20:16 UTC
Ruediger, is apr_pollset_create() failing due to some kernel limit?

Alex, you could confirm that by running with this tiny patch

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c	(revision 731724)
+++ server/mpm/prefork/prefork.c	(working copy)
@@ -485,7 +485,12 @@
 
     /* Set up the pollfd array */
     /* ### check the status */
-    (void) apr_pollset_create(&pollset, num_listensocks, pchild, 0);
+    status = apr_pollset_create(&pollset, num_listensocks, pchild, 0);
+    if (status != APR_SUCCESS) {
+        ap_log_error(APLOG_MARK, APLOG_EMERG, status, ap_server_conf,
+                     "Couldn't initialize pollset in child");
+        clean_child_exit(APEXIT_CHILDFATAL);
+    }
 

and seeing if you get the new message.  If you do, you might be able to work around whatever is causing the apr_pollset_create() failure by setting "apr_cv_epoll=no" when you configure.  (none of this tested ;) )
Comment 6 Ruediger Pluem 2009-01-06 14:29:27 UTC
(In reply to comment #5)
> Ruediger, is apr_pollset_create() failing due to some kernel limit?

I don't know. I guess your patch will be very helpful in detecting why the creation of the pollset fails. IMHO the main difference between 2.0.63 and 2.2.x is that epoll is used and I had the idea that we might be out of fd's, but this does not seem to be the case.

Comment 7 Joe Orton 2009-01-07 02:09:33 UTC
The epoll_create failure looks like some kind of weird kernel bug; strace output looks like the below:

read(9, "\2\0\0\0\1\0\0\0\0\0\0\0"..., 12) = 12
read(9, ""..., 0) = 0
close(9) = 0
setgroups(1, [48]) = 0
geteuid() = 0
setuid(48) = 0
epoll_create(1) = -1 EMFILE (Too many open files)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

The fact that an fd was closed right before epoll_create() seems like sufficient evidence for this: the process cannot simply have run out of descriptors. It's also not some global fd limit being hit, since that would produce ENFILE rather than EMFILE.
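
The errno distinction above can be illustrated outside of httpd and APR. The following is a minimal sketch in plain C against the Linux epoll API, not code from any patch in this bug; `epoll_errno_hint` and `checked_epoll_create` are hypothetical helper names. It checks the result of epoll_create() instead of discarding it and maps EMFILE and ENFILE to the kind of limit each one points at.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper: map the errno left by a failed epoll_create()
 * to a short hint about which kind of limit was probably hit. */
static const char *epoll_errno_hint(int err)
{
    switch (err) {
    case EMFILE:
        return "per-process fd limit or per-user epoll instance limit";
    case ENFILE:
        return "system-wide open file limit";
    case ENOMEM:
        return "kernel out of memory";
    default:
        return "unexpected error";
    }
}

/* Create an epoll instance and log a failure instead of ignoring it.
 * Returns the epoll fd, or -1 with errno preserved. */
static int checked_epoll_create(int size_hint)
{
    int fd = epoll_create(size_hint);
    if (fd < 0) {
        int err = errno;
        fprintf(stderr, "epoll_create failed: %s (%s)\n",
                strerror(err), epoll_errno_hint(err));
        errno = err; /* fprintf may have clobbered errno */
    }
    return fd;
}
```

A caller would treat a negative return as fatal for that child, which is the same shape as the status check in the patch from comment 5.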
Comment 8 Joe Orton 2009-01-07 02:17:10 UTC
Bleh, this is annoying :(  It's a new tunable setting.  If you do

echo 1024 >  /proc/sys/fs/epoll/max_user_instances

where 1024 stands in for some value larger than MaxClients (so above 1600 for the configuration in the description), then it works.

http://lkml.indiana.edu/hypermail/linux/kernel/0812.0/01183.html
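
To check the tunable before starting the server, something like the following could be used. This is only a sketch: `read_proc_limit` and `limit_is_sufficient` are hypothetical helper names, and treating a missing file as "no per-user cap" is an assumption, not documented kernel behaviour.

```c
#include <stdio.h>

/* Read a single integer from a /proc tunable.
 * Returns the value, or -1 if the file is missing or unparsable. */
static long read_proc_limit(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Returns 1 if `limit` is larger than `needed`, following the advice
 * above to pick a value larger than MaxClients.
 * Assumption: a negative limit (tunable absent) means no per-user cap. */
static int limit_is_sufficient(long limit, long needed)
{
    return limit < 0 || limit > needed;
}
```

With the configuration from the description, the check would be `limit_is_sufficient(read_proc_limit("/proc/sys/fs/epoll/max_user_instances"), 1600)`.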
Comment 9 Ruediger Pluem 2009-01-07 08:48:03 UTC
So I guess this transforms into a documentation bug and we should document it somewhere. Maybe add a platform specific section for Linux, like the one for Windows (http://httpd.apache.org/docs/2.2/en/platform/windows.html)?
Comment 10 Ruediger Pluem 2009-01-07 12:11:01 UTC
At least the segfault is now fixed in trunk (r732414) by Jeff's patch. Instead an error message is logged and the child exits.
Comment 11 Ruediger Pluem 2009-01-07 12:18:07 UTC
Patch proposed for backport to 2.2.x as r732465.
Comment 12 Alex Pircher 2009-01-07 15:04:11 UTC
I can confirm that the patch is working in 2.2.11 and that raising max_user_instances solves the problem. Thanks for your efforts!

This may affect the worker MPM as well if you have more than 128 workers, which
is probably very unlikely.

A specific documentation-section for Linux would be a good idea.
Comment 13 Joe Orton 2009-01-08 03:15:41 UTC
I'm going to see if we can either get the default setting raised or whether this can be turned into an rlimit which we can bump manually.
Comment 14 Eric Covener 2009-01-09 06:27:51 UTC
*** Bug 46501 has been marked as a duplicate of this bug. ***
Comment 15 Stefan Fritsch 2009-01-11 08:48:40 UTC
Created attachment 23105
add more error logging for apr_pollset_create and apr_pollset_add

CONNECT in mod_proxy and mod_cgi also use apr_pollset_create. Therefore it is possible that this problem also occurs in worker or event mpm.

There is now also the new max_user_watches limit that may affect apr_pollset_add (though the default seems high enough).

Here is a patch that does some more error checking for these calls.
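
For illustration only (this is not the attached patch), the same kind of check can be sketched against the raw epoll API, on the assumption that an epoll-backed apr_pollset_add ends up in epoll_ctl(EPOLL_CTL_ADD). `checked_epoll_add` is a hypothetical helper name. ENOSPC is the errno that epoll_ctl(2) documents for hitting max_user_watches.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

/* Register `fd` for readability on `epfd` and log a failure instead of
 * ignoring it. Returns 0 on success, -1 on failure with errno preserved. */
static int checked_epoll_add(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0) {
        int err = errno;
        fprintf(stderr, "epoll_ctl(ADD, fd=%d) failed: %s%s\n",
                fd, strerror(err),
                err == ENOSPC ? " (max_user_watches reached?)" : "");
        errno = err; /* fprintf may have clobbered errno */
        return -1;
    }
    return 0;
}
```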
Comment 16 Nick Kew 2009-01-12 06:46:58 UTC
Fixed in r733698.
Comment 17 Stefan Fritsch 2009-01-12 08:32:00 UTC
Please look at the patch I submitted. At least mpm_worker and mod_cgi need to be changed, too.
Comment 18 Joe Orton 2009-02-09 04:37:20 UTC
FWIW, there's a thread on the kernel list about this and so far as I can tell the decision was to remove the default limits again:

http://lkml.indiana.edu/hypermail/linux/kernel/0901.3/01806.html
Comment 19 Werner Detter 2009-02-09 06:36:25 UTC
Hi Everybody, 

seems like they've removed this setting in Kernel 2.6.28.4 which I've installed today: 

server:# ls /proc/sys/fs/epoll/
max_user_watches

So, max_user_instances is gone and so is the problem :-) 

cheers,
Werner Detter

Comment 20 Jeff Trawick 2009-03-15 08:58:16 UTC
*** Bug 46856 has been marked as a duplicate of this bug. ***
Comment 21 Ruediger Pluem 2009-07-13 09:03:24 UTC
*** Bug 47519 has been marked as a duplicate of this bug. ***
Comment 22 Arkadiusz Miskiewicz 2009-08-16 12:40:52 UTC
(In reply to comment #16)
> Fixed in r733698.

Nick, why weren't worker and beos fixed, too?

Also, apr_pollset_create() can fail if APR is built on a system that supports epoll_create1() (with a recent glibc/Linux kernel, for example) but run on a system that doesn't support it (an older Linux kernel).
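
One way a caller of the raw API could cope with that build/run mismatch is to fall back at run time when the kernel reports ENOSYS. The sketch below is an assumption about how such a fallback might look, not what APR does; `epoll_create_compat` is a hypothetical name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Try epoll_create1() first; if the running kernel lacks the syscall
 * (ENOSYS), fall back to epoll_create() and set close-on-exec by hand.
 * Returns the epoll fd, or -1 with errno set. */
static int epoll_create_compat(int size_hint)
{
    int fd = epoll_create1(EPOLL_CLOEXEC);

    if (fd >= 0 || errno != ENOSYS)
        return fd;

    /* Older kernel: epoll_create1() is not available. */
    fd = epoll_create(size_hint);
    if (fd >= 0) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
            int err = errno;
            close(fd);
            errno = err;
            return -1;
        }
    }
    return fd;
}
```

Either way the result still has to be checked, because the fallback path can fail for the same limit-related reasons as in the original report.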
Comment 23 Arkadiusz Miskiewicz 2009-08-16 12:46:10 UTC
Created attachment 24140
worker beos (not tested)
Comment 24 Ruediger Pluem 2009-08-16 13:30:59 UTC
Committed Stefan's patch as r804764 to trunk. It contains even more checks. BeOS support is no longer present on trunk, hence no changes there.
Comment 25 Arkadiusz Miskiewicz 2009-08-19 10:52:27 UTC
Hope to see it backported to 2.2.x.
Comment 26 Jochen Wezel 2009-11-12 11:03:35 UTC
(In reply to comment #25)
> Hope to see it backported to 2.2.x.

Yes, me too!