Bug 54502 - Apache deadlock on epoll_ctl error (1000 process limit)
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_prefork
Version: 2.2.15
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
 
Reported: 2013-01-29 10:47 UTC by Etienne CHAMPETIER
Modified: 2013-01-30 22:37 UTC



Description Etienne CHAMPETIER 2013-01-29 10:47:03 UTC
Hi

With kernels from 3.2.9 (inclusive) to 3.2.17 (exclusive), there was an arbitrary limit of 1000 on epoll paths, which causes Apache to deadlock when it runs 1001 or more processes. The first patch, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=28d82dc1c4edbc352129f97f4ca22624d1fe61de, introduced the limit of 1000. The second patch, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=93dc6107a76daed81c07f50215fa6ae77691634f, no longer applies the limit to non-nested paths (so Apache works again).


This limitation exposes a bug in Apache that leads to a deadlock: if an httpd process gets an error from epoll_ctl, it continues to run. If it then acquires the accept_mutex, epoll_wait returns 0 because the listeners were never registered, so the process sits in epoll_wait while holding the mutex and Apache is blocked.
Here is a short strace excerpt from the 1001st process:
-epoll_create1(O_CLOEXEC)    = 39
-epoll_ctl(39, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=1010443880, u64=140193037952616}}) = -1 EINVAL (Invalid argument)
-epoll_ctl(39, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=1010443880, u64=140193037952616}}) = -1 EINVAL (Invalid argument)
-semop(14385470, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
<... semop resumed> )       = 0
-epoll_wait(39,  <unfinished ...>
<... epoll_wait resumed> {}, 2, 10000) = 0
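
The behaviour in the trace can be reproduced outside httpd. The following is a minimal sketch, not Apache code: the helper name ready_count is hypothetical, and a pipe holding one unread byte stands in for a listening socket with a pending connection (an assumption made only to keep the example free of network setup). When the EPOLL_CTL_ADD step is skipped, as if it had failed and the error had been ignored, epoll_wait can only time out and return 0 even though data is waiting.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Creates a pipe with one unread byte (stand-in for a listener with a
 * pending connection) and reports what epoll_wait() returns.
 * register_fd == 0 skips EPOLL_CTL_ADD, imitating the ignored failure.
 * Returns the epoll_wait() result, or -1 on setup failure. */
int ready_count(int register_fd, int timeout_ms)
{
    struct epoll_event ev, out[2];
    int fds[2], epfd, n = -1;

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        goto close_pipe;

    /* Make the read end readable. */
    if (write(fds[1], "x", 1) != 1)
        goto close_all;

    if (register_fd) {
        memset(&ev, 0, sizeof(ev));
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
            goto close_all;
    }

    /* With nothing registered, this waits for the timeout and returns 0. */
    n = epoll_wait(epfd, out, 2, timeout_ms);

close_all:
    close(epfd);
close_pipe:
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling ready_count(0, 100) corresponds to the failing child in the trace (nothing registered, epoll_wait times out), while ready_count(1, 100) corresponds to a healthy child that sees the readable descriptor.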


To reproduce:
-get a kernel with the limitation (3.2.9 to 3.2.16 for the 3.2 branch)
-configure httpd to listen on at least 2 ports (80 and 81) so that it uses accept_mutex
-set "StartServers 1001" in the httpd configuration
-start it with strace -f -o ~/debug.log /etc/init.d/httpd start (strace writes its trace to stderr, so a plain "> ~/debug.log" would not capture it)
-make a lot of requests until it stops responding


An httpd process whose epoll_ctl call fails should either kill itself or retry epoll_ctl.
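
A minimal sketch of that suggestion, written against the raw epoll API rather than APR: the helper name add_with_retry and its parameters (max_tries, delay_ms) are hypothetical and are not part of httpd. It retries EPOLL_CTL_ADD a bounded number of times and returns the last errno, so the caller can decide to exit instead of continuing with an empty epoll set.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper, for illustration only.
 * Tries to register fd for EPOLLIN on epfd up to max_tries times,
 * pausing delay_ms milliseconds after each failed attempt.
 * Returns 0 on success, otherwise the errno of the last failed attempt
 * (EINVAL if max_tries <= 0 and no attempt was made). */
int add_with_retry(int epfd, int fd, int max_tries, int delay_ms)
{
    struct epoll_event ev;
    int i, err = EINVAL;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    for (i = 0; i < max_tries; i++) {
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
            return 0;
        err = errno;
        /* poll() with no descriptors is used here as a simple sleep. */
        poll(NULL, 0, delay_ms);
    }
    return err;
}
```

A child process could call add_with_retry(epfd, listener_fd, 3, 10) for each listener and exit when the result is non-zero, so that it never competes for the accept mutex without a usable epoll set. With the kernel limit described above, retrying would presumably only help once other processes have exited, so exiting is likely the more useful of the two options.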


This bug was uncovered on CentOS 6.3 with httpd 2.2.15 and a 3.2.13 kernel, but I have read other threads about the 1000 httpd process limit on Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1028470 (so it is certainly still present in 2.2.22)

I've set the severity to normal because updating the kernel makes Apache work again.
Comment 1 Mike Rumph 2013-01-30 22:37:17 UTC
In the latest Apache 2.2.x code, the child_main() function in prefork.c does not check the status code after calling apr_pollset_add().

Here is an excerpt:

    for (lr = ap_listeners, i = num_listensocks; i--; lr = lr->next) {
        apr_pollfd_t pfd = { 0 };

        pfd.desc_type = APR_POLL_SOCKET;
        pfd.desc.s = lr->sd;
        pfd.reqevents = APR_POLLIN;
        pfd.client_data = lr;

        /* ### check the status */
        (void) apr_pollset_add(pollset, &pfd);
    }

This code has been improved in Apache 2.4.x.
svn blame shows the following revisions:

101799     gstein     for (lr = ap_listeners, i = num_listensocks; i--; lr = lr->next) {
101799     gstein         apr_pollfd_t pfd = { 0 };
101799     gstein 
101799     gstein         pfd.desc_type = APR_POLL_SOCKET;
101799     gstein         pfd.desc.s = lr->sd;
101799     gstein         pfd.reqevents = APR_POLLIN;
101799     gstein         pfd.client_data = lr;
101799     gstein 
804764     rpluem         status = apr_pollset_add(pollset, &pfd);
804764     rpluem         if (status != APR_SUCCESS) {
1393382     jorton             /* If the child processed a SIGWINCH before setting up the
1393382     jorton              * pollset, this error path is expected and harmless,
1393382     jorton              * since the listener fd was already closed; so don't
1393382     jorton              * pollute the logs in that case. */
1393382     jorton             if (!die_now) {
1393382     jorton                 ap_log_error(APLOG_MARK, APLOG_EMERG, status, ap_server_conf, APLOGNO(00157)
1393382     jorton                              "Couldn't add listener to pollset; check system or user limits");
1393382     jorton                 clean_child_exit(APEXIT_CHILDSICK);
1393382     jorton             }
1393382     jorton             clean_child_exit(0);
804764     rpluem         }
757853    trawick 
757853    trawick         lr->accept_func = ap_unixd_accept;
 96102        rbb     }