Bug 42829 - graceful restart with multiple listeners using prefork MPM can result in hung processes
graceful restart with multiple listeners using prefork MPM can result in hung...
Status: RESOLVED FIXED
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_prefork
2.2.4
PC FreeBSD
: P2 critical with 14 votes (vote)
: ---
Assigned To: Apache HTTPD Bugs Mailing List
: PatchAvailable
Depends on:
Blocks:
  Show dependency tree
 
Reported: 2007-07-06 14:45 UTC by Nathan Mower
Modified: 2014-02-17 13:55 UTC (History)
15 users (show)



Attachments
Proposed changes to the SIGUSR1 handler and prefork code (390 bytes, patch)
2007-07-06 14:48 UTC, Nathan Mower
Details | Diff
new patch for graceful-stop / graceful reload problem (1.00 KB, patch)
2007-10-18 05:43 UTC, Stefan Fritsch
Details | Diff
patch for graceful-stop / graceful reload problem without race condition (2.80 KB, patch)
2007-11-16 02:06 UTC, Stefan Fritsch
Details | Diff
close listeners using dup2 to avoid undefined behaviour (1.42 KB, patch)
2008-01-18 07:57 UTC, Joe Orton
Details | Diff
combination of the two patches above (21137 and 21406) (3.42 KB, patch)
2008-11-03 12:05 UTC, Stefan Fritsch
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Nathan Mower 2007-07-06 14:45:16 UTC
Intermittently after a graceful restart (apachectl graceful) extra Apache
processes are left running on the server.  A check to server-status shows a
number of them to be in state 'G' (gracefully finishing).  They remain in that
state indefinitely, and will not go away until a stop-and-start or a restart is
issued.

The 'top' command shows one of these 'G' processes is stuck in kqread, while the
rest are in lockf.  (I believe the processes in lockf are blocking on the accept
mutex, which is held by the process in kqread and therefore is never cleared.)

A stack trace of the process in kqread state shows the process calling kevent()
from the apr_pollset_poll() function in srclib/apr/poll/unix/kqueue.c, which is
called from the main loop in child_main() of server/mpm/prefork/profork.c.  When
the process is allowed to  continue, it never returns from kevent().

Frequently the error log contains the following message when the problem has
occurred:
[error] (9)Bad file descriptor: apr_socket_accept: (client socket)

The behavior can be replicated in the Apache 2.2.4 release version on FreeBSD
6.2 stable.  The build configuration is as follows:

./configure \
  --with-apr=/usr/local/apr-httpd/ \
  --with-apr-util=/usr/local/apr-util-httpd/ \
  --prefix=/usr/local/httpd-2.2.4 \
  --with-mpm=prefork \
  --enable-mods=info

The problem only occurs when at least one extra listener is configured in
httpd.conf:
  Listen 80
  Listen 8080

When replicating the problem it is useful to enable server-status by
uncommenting the Include line for httpd-info.conf and adding an 'Allow from'
line in the file conf/extra/httpd-info.conf.

After Apache has been restarted a series of 'apachectl graceful' commands will
usually trigger the problem.  It speeds things up to run several of them on one
command line (separated by semi-colons), but care is required not to run into
another Apache bug with similar steps to duplicate (ID 39311).  Usually the hung
'G' process will show up within 2-3 minutes by continually repeating in this way.

Here is my take on what may happening, along with a patch that works for me:

There appears to be a race condition between Apache prefork and kevent() that
depends upon the timing of the receipt of SIGUSR1 in the child process.  When
SIGUSR1 is received, the current signal handler for prefork closes all listeners
immediately.  I believe if the close occurs before the process gets into
kevent() to poll the sockets, it can cause kevent() to hang.

My solution was to wait until after the !die_now loop to close listeners.  The
signal handler merely sets the die_now flag, nothing more.  Here is my proposed
patch:

--- prefork.c.orig      Fri Jul  6 11:12:53 2007
+++ prefork.c   Fri Jul  6 11:18:24 2007
@@ -330,8 +330,6 @@
 
 static void stop_listening(int sig)
 {
-    ap_close_listeners();
-
     /* For a graceful stop, we want the child to exit when done */
     die_now = 1;
 }
@@ -657,6 +655,7 @@
             die_now = 1;
         }
     }
+    ap_close_listeners();
     clean_child_exit(0);
 }
Comment 1 Nathan Mower 2007-07-06 14:48:08 UTC
Created attachment 20470 [details]
Proposed changes to the SIGUSR1 handler and prefork code
Comment 2 Henry Karpatskij 2007-08-22 01:33:46 UTC
I have the exactly same problem on jailed Apache 2.2.4 running on FreeBSD 6.2.
Comment 3 Mihai Costache 2007-08-23 09:50:37 UTC
i have same issue on my test/preproduction machine with apache 2.2.4 (with
mod_ssl, mod_php) on top of gentoo 2007.0 and slackware 12.0 linux distribution
...  tested with the package from distribution and with my own package builded
from vanilla apache (the arhives from httpd.apache.org)
Comment 4 Ilya V. Glazkov 2007-08-28 03:52:25 UTC
Same problem on Apache 2.2.4 (Debian/Lenny, amd x86_64, kernel 2.6.18)
Comment 5 Delian Krustev 2007-09-05 07:50:25 UTC
I'm also seeing this on Debian testing on x86 . 

Looking at the error logs the problem has appeared  before 2.2.4. I see the 
message:

[Fri Nov 24 09:25:09 2006] [notice] Graceful restart requested, doing restart
[Fri Nov 24 09:25:09 2006] [error] (9)Bad file descriptor: apr_socket_accept: 
(client socket)
[Fri Nov 24 09:25:23 2006] [notice] Apache configured -- resuming normal 
operations


At this time the apache version on testing was 2.2.3 (so was on this box)

Recently, after the upgrade to 2.2.4 and the fix to CVE-2007-3304 I'm also 
seeing this:

[Thu Aug 23 13:00:14 2007] [notice] Graceful restart requested, doing restart
[Thu Aug 23 13:00:14 2007] [notice] (10)No child processes: cannot send signal 
10 to pid 19020 (non-child or already dead)
[Thu Aug 23 13:00:14 2007] [error] (9)Bad file descriptor: apr_socket_accept: 
(client socket)
[Thu Aug 23 13:00:17 2007] [notice] Digest: generating secret for digest 
authentication ...
[Thu Aug 23 13:00:21 2007] [notice] Digest: done
[Thu Aug 23 13:00:21 2007] [notice] Apache/2.2.4 (Debian) mod_ssl/2.2.4 
OpenSSL/0.9.8e DAV/2 configured -- resuming normal operations


I've noticed that after such unsuccessfull gracefull restart the number of the 
remaining alive children stands almost constant(about 32-33, I've got 
MinSpareServers set to 30). I've started 200 parallel requests for a big file, 
but the number of the active children remained about 30, although I've got 
MaxClients set to 256.
Comment 6 Oden Eriksson 2007-09-12 05:52:40 UTC
Confirmed on Mandriva Linux Cooker as well with 2.2.6. The proposed patch seems
to fix this.
Comment 7 richlv 2007-09-25 03:36:59 UTC
the same problem on sles 10 sp1 (using 2.2.3).

what are the implications of this, could any problems arise if this is not 
patched ?
Comment 8 Nathan Mower 2007-09-26 09:53:23 UTC
(In reply to comment #7)

> what are the implications of this, could any problems arise if this is not 
> patched ?

It caused us problems in cases where a cron entry was saving off log files and
then issuing a graceful restart every night.  More and more processes ended up
hung until we reached our MaxClients limit, resulting in denial of service.
Comment 9 Aaron Axelsen 2007-09-27 07:21:01 UTC
We are running sles 10 and also getting the following:

[Wed Sep 26 03:00:02 2007] [notice] Graceful restart requested, doing restart
[Wed Sep 26 03:00:02 2007] [error] (9)Bad file descriptor: apr_socket_accept:
(client socket)
[Wed Sep 26 03:00:02 2007] [error] (9)Bad file descriptor: apr_socket_accept:
(client socket)
Syntax error on line 101 of /etc/apache2/server-tuning.conf:
Invalid command 'BrowserMatch', perhaps misspelled or defined by a module not
included in the server configuration
[Wed Sep 26 03:10:02 2007] [warn] pid file /var/run/httpd2.pid overwritten --
Unclean shutdown of previous Apache run?
[Wed Sep 26 03:10:02 2007] [notice] Apache/2.2.3 (Linux/SUSE) configured --
resuming normal operations

It doesn't happen every night, very sporadically.

For the mean time, I have a script that checks to make sure apache is running,
and if not, restart it.
Comment 10 Stefan Fritsch 2007-10-18 05:41:33 UTC
The patch from the bug submitter does not close the listening sockets until the
current request has been finished, therefore breaking graceful-stop behaviour.

Depending on OS and used apr_pollset_poll variant, it seems that both
apr_pollset_poll() and lr->accept_func() can block if the sockets have been
closed in the meantime. This can be mitigated by checking for die_now before
calling these two functions (though a short race condition still remains).

Maybe it would also be possible to change apr to not block if all listening
sockets in the pollset are closed.

Comment 11 Stefan Fritsch 2007-10-18 05:43:14 UTC
Created attachment 21002 [details]
new patch for graceful-stop / graceful reload problem
Comment 12 Paul Querna 2007-11-14 01:30:07 UTC
it seems this bug may be related to similar issues for the event mpm?
PR 43359:
http://issues.apache.org/bugzilla/show_bug.cgi?id=43359

Has a more complicated patch...
Comment 13 Stefan Fritsch 2007-11-15 10:58:42 UTC
(In reply to comment #12)
> it seems this bug may be related to similar issues for the event mpm?
> PR 43359:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=43359
> 
> Has a more complicated patch...

I think that's a different bug. Here, calling apr_pollset_poll and/or accept 
when the listening sockets have already been closed, causing a block. In 
PR43359 not continuing to process the request when the listening sockets have 
been closed.

I don't see how the patch from PR43359 would help in prefork, as we have to 
close the sockets in the signal handler (at least when we are in 
ap_process_connection). Otherwise the sockets will only be closed after the 
current request is finished, which would break graceful stop.

Or am I missing something?
Comment 14 Stefan Fritsch 2007-11-16 02:06:41 UTC
Created attachment 21137 [details]
patch for graceful-stop / graceful reload problem without race condition

This patch fixes the remaining race condition in the previous patch
Comment 15 Paul Henrichsen 2008-01-04 05:25:53 UTC
> More and more processes ended up hung until we reached
> our MaxClients limit, resulting in denial of service.

Denial of service is also resulting from the memory 
occupied by the hung processes. This has been severe 
enough to cause to cause errors such as 
"/libexec/ld-elf.so.1: /lib/libc.so.6: mmap of entire address space failed:
Cannot allocate memory."
when attempting to connect via ssh. 
Comment 16 Joe Orton 2008-01-18 06:45:39 UTC
So, am I correct in the assumption that the root cause of this bug is a sequence
as follows:

1. child in child_main enters the while (!die_now) loop
2. child is SIGUSR1'ed
3. listening socket, fd N is close()d
4. child calls accept or epoll_wait/whatever *on closed fd N*

and calling accept/epoll_wait/whatever on a closed fd results in undefined
behaviour and hence the process may hang.
Comment 17 Joe Orton 2008-01-18 07:00:21 UTC
Presuming the assumption above is correct; Stefan's patch doesn't strictly close
the race since the signal could still occur e.g. between checking die_now and
the invocation of the accept function.

To strictly close this race, I think ap_close_listeners() is going to have to
dup  something on to the listening fds which will guarantee that the accept/etc
function call has defined behaviour.   In the single-listener case this could be
a simple hack like dup2(open("/dev/null", blah), listener).  For the
multi-listener case, it means invalidating the pollset fd, which is
method-specific and will require a new APR interface.
Comment 18 Stefan Fritsch 2008-01-18 07:17:00 UTC
Joe's description of the problem in comment #16 is accurate.

I think I closed the race completely, because if the signal occurs between the 
check for die_now and accept(), the signal handler will only set die_now and 
not call ap_close_listeners() (because active_connection == 0). In this case 
the ap_close_listeners() is done in the while loop just before 
ap_process_connection().

But fixing it in apr, so that apr_pollset_poll, etc. have a defined behaviour 
would probably be better.

Maybe 
Comment 19 Joe Orton 2008-01-18 07:54:28 UTC
Ah, sorry, I missed that, thanks Stefan.  The dup2 is actually simple in
implementation, I'll attach a patch.  I don't think it is necessary to worry
about invalidating the pollsets -- the pollsets will contain references *to the
listening fds*.  So I think that simply dup2ing onto those fds should be sufficient.

I'm not sure which approach is better/safer.
Comment 20 Joe Orton 2008-01-18 07:57:24 UTC
Created attachment 21406 [details]
close listeners using dup2 to avoid undefined behaviour

This closes the listeners by dup2'ing a dummy socket onto them.  This should
restore the guarantee of defined behaviour by a call to
accept/apr_pollset_poll/ which references this fd, as opposed to closing it
outright.
Comment 21 Joe Orton 2008-01-18 09:14:47 UTC
Using dup2 like this does have the advantage that it makes stop_listening()
(more) async-signal-safe; calling apr_socket_close(), as the current code does,
is particularly async-signal-unsafe since it modifies pool state.
Comment 22 Ruediger Pluem 2008-01-18 14:09:18 UTC
(In reply to comment #20)
> Created an attachment (id=21406) [edit]
> close listeners using dup2 to avoid undefined behaviour
> 
> This closes the listeners by dup2'ing a dummy socket onto them.  This should
> restore the guarantee of defined behaviour by a call to
> accept/apr_pollset_poll/ which references this fd, as opposed to closing it
> outright.

Hm, and apr_pollset_poll called on the dummy_listener with APR_POLLIN as event
for which we are looking for returns immediately?
Comment 23 Joe Orton 2008-01-18 15:10:58 UTC
Taking this up dev@httpd.

http://marc.info/?l=apache-httpd-dev&m=120069757111682&w=2
Comment 24 Guille (bisho) 2008-04-21 05:36:13 UTC
I don't know if this is related or no. I have a similar problem with apache 2.2.4-3ubuntu0.1 of latest ubuntu gutsy.

Apache prefork uses a "internal dummy connection" to tell childs they should die. When apache emits that request to a child (too many idle workers, MaxRequestsPerChild reached...), the children correctly dies (is not longer a proc in the system), but the apache main process still thinks it's "Gracefully finishing".

Server status page shows processes on this state in a number increasing until Maxclients is reached and apache stops answering connections, so it's a very buig problem for me.

It can be forced on my system easily issuing a /etc/init.d/apache2 reload. With that I get a lot of connections in the 'G' state (finished but the main apache still thinks they are finishing).

It's on a xen virtual machine, maybe it's a problem related to this.
Comment 25 Stefan Fritsch 2008-04-21 13:44:58 UTC
(In reply to comment #24)
> I don't know if this is related or no. I have a similar problem with apache
> 2.2.4-3ubuntu0.1 of latest ubuntu gutsy.

Yes this is the same problem. See

https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/174805

and possibly add a comment there to get the fix backported to gutsy.
Comment 26 Raphael Wegmann 2008-08-27 06:12:58 UTC
I can confirm this bug for Apache 2.2.6 prefork on Solaris 10. 
Here is a trace of an unsuccessful SIGUSR1 call:

    Received signal #16, SIGUSR1, in fcntl() [caught]
      siginfo: SIGUSR1 pid=2185 uid=2002
fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
close(7)                                        = 0
close(5)                                        = 0
close(3)                                        = 0
setcontext(0xFFFFFFFF7FFFE210)
fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...)
fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         = 0
port_associate(22, 4, 0xFFFFFFFFFFFFFFFF, 0x00000001, 0x10039FE68) Err#81 EBADFD
port_getn(22, 0x10039FD40, 3, 1, 0x00000000) (sleeping...)

    Received signal #16, SIGUSR1, in port_getn() [caught]
      siginfo: SIGUSR1 pid=2185 uid=2002
port_getn(22, 0x10039FD40, 3, 1, 0x00000000)    Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
close(-1)                                       Err#9 EBADF
close(-1)                                       Err#9 EBADF
close(-1)                                       Err#9 EBADF
setcontext(0xFFFFFFFF7FFFE360)
write(13, " [ F r i   A u g   2 2  ".., 104)    = 104
close(22)                                       = 0
close(11)                                       = 0
close(10)                                       = 0
_exit(1)

error-log:
[Fri Aug 22 15:08:04 2008] [error] (70007)The timeout specified has expired: apr_pollset_poll: (listen)

Even after sending a 2nd USR1-signal the child doesn't exit:

 3.6337 read(0, 0xFFFFFFFF7FFFEB78, 512)                = 0
 3.6338 close(0)                                        = 0
 3.6339 read(10, 0xFFFFFFFF7FFFED7F, 1)                 Err#11 EAGAIN
 4.4083     Received signal #16, SIGUSR1, in fcntl() [caught]
 4.4086 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
 4.4087 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
 4.4088 close(7)                                        = 0
 4.4089 close(5)                                        = 0
 4.4089 close(3)                                        = 0
 4.4090 setcontext(0xFFFFFFFF7FFFE2F0)
fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...)
38.5953     Received signal #16, SIGUSR1, in fcntl() [caught]
38.5955 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
38.5956 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
38.5957 close(-1)                                       Err#9 EBADF
38.5957 close(-1)                                       Err#9 EBADF
38.5958 close(-1)                                       Err#9 EBADF
38.5969 setcontext(0xFFFFFFFF7FFFE2F0)
fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...)
56.4129     Received signal #16, SIGUSR1, in fcntl() [caught]
56.4130 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
56.4131 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
56.4132 close(-1)                                       Err#9 EBADF
56.4133 close(-1)                                       Err#9 EBADF
56.4133 close(-1)                                       Err#9 EBADF
56.4134 setcontext(0xFFFFFFFF7FFFE2F0)
57.3615     Received signal #16, SIGUSR1, in fcntl() [caught]
57.3617 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
57.3618 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
57.3618 close(-1)                                       Err#9 EBADF
57.3619 close(-1)                                       Err#9 EBADF
57.3619 close(-1)                                       Err#9 EBADF
57.3620 setcontext(0xFFFFFFFF7FFFE2F0)
57.8297     Received signal #16, SIGUSR1, in fcntl() [caught]
57.8298 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510)         Err#4 EINTR
57.8299 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
57.8301 close(-1)                                       Err#9 EBADF
57.8301 close(-1)                                       Err#9 EBADF
57.8302 close(-1)                                       Err#9 EBADF
57.8302 setcontext(0xFFFFFFFF7FFFE2F0)

This is what a successful USR1 handling looks like:

    Received signal #16, SIGUSR1, in pollsys() [caught]
pollsys(0xFFFFFFFF7FFFE928, 1, 0xFFFFFFFF7FFFE860, 0x00000000) Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
close(7)                                        = 0
close(5)                                        = 0
close(3)                                        = 0
setcontext(0xFFFFFFFF7FFFDF50)
pollsys(0xFFFFFFFF7FFFE928, 1, 0xFFFFFFFF7FFFE860, 0x00000000) = 1
read(0, 0xFFFFFFFF7FFFEAB8, 512)                = 0
close(0)                                        = 0
read(10, 0xFFFFFFFF7FFFECBF, 1)                 Err#11 EAGAIN
close(22)                                       = 0
close(11)                                       = 0
close(10)                                       = 0
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFFFFFFFF76F00000
munmap(0xFFFFFFFF76F00000, 8192)                = 0
_exit(0)

Comment 27 Matus "fantomas" Uhlar 2008-08-27 08:57:58 UTC
seems this bug still applies in 2.2.9 for graceful stop, can someone confirm?

Comment 28 Matus "fantomas" Uhlar 2008-08-28 09:53:48 UTC
Seems that this problem only appears if I have more than one socked opened by listen directive...
Comment 29 Cedric Roijakkers 2008-09-02 00:37:25 UTC
(In reply to comment #27)
> seems this bug still applies in 2.2.9 for graceful stop, can someone confirm?
> 

Yes, I can confirm. We currently have the same problem on 2.2.9 on Solaris 10 in production. I've also tested Joe's patch and that did not solve the issue for me. I'm not going to give Stephan's patch a try.
Comment 30 Drew Withers 2008-09-02 09:11:43 UTC
Joe's didn't work for me either, but Stefans' seemed to work. If nothing else it made the problem much less apparent.
Comment 31 Cedric Roijakkers 2008-09-03 23:57:24 UTC
(In reply to comment #30)
> Joe's didn't work for me either, but Stefans' seemed to work. If nothing else
> it made the problem much less apparent.

Yes, I can confirm this. We have deployed Stefan Fritsch's patch in our test environment, and the issue seems to be gone now. I will deploy to production as soon as possible, but it would be a good idea to incorporate this patch into the official source.
Comment 32 Cedric Roijakkers 2008-09-30 02:56:02 UTC
Stefan Fritsch's latest patch has now been running in our production environment for quite some time, and all seems to be well. The issue is solved, and we can do graceful restarts again.

I vote for this patch to be checked in the official source.
Comment 33 Cedric Roijakkers 2008-11-03 03:06:43 UTC
I've just upgraded our test-environment to Apache 2.2.10, and included Stefan's patch that used to work fine for me in version 2.2.9. However, even with this patch the issue is now still present in 2.2.10.

Can anyone have a look, or will I be forced to start using a different threading model?
Comment 34 Stefan Fritsch 2008-11-03 12:02:20 UTC
(In reply to comment #33)
> I've just upgraded our test-environment to Apache 2.2.10, and included Stefan's
> patch that used to work fine for me in version 2.2.9. However, even with this
> patch the issue is now still present in 2.2.10.

Nothing has changed in the relevant code in apache 2.2.10. But there was a change in apr that might have something to do with it. Can you get a stacktrace of a hanging process (should be possible with pstack)?
Comment 35 Stefan Fritsch 2008-11-03 12:05:55 UTC
Created attachment 22822 [details]
combination of the two patches above (21137 and  21406)

You could also try if applying both patches helps.
Comment 36 Stefan Fritsch 2008-11-03 12:36:18 UTC
For the record: I don't see any problems under Linux using 2.2.10 with the 2007-11-16 patch.
Comment 37 Joe Orton 2008-11-05 03:29:14 UTC
Comment on attachment 21406 [details]
close listeners using dup2 to avoid undefined behaviour

My patch won't help solve the problem, so there is no use anybody trying it; as explained in the thread on dev@httpd, the listening fds are removed from the epoll set on closure by the kernel, dup2'ing onto the fd is equivalent to closure, so the behaviour is equivalent.
Comment 38 Stefan Fritsch 2008-11-05 11:54:42 UTC
(In reply to comment #37)
> (From update of attachment 21406 [details])
> My patch won't help solve the problem, so there is no use anybody trying it; as
> explained in the thread on dev@httpd, the listening fds are removed from the
> epoll set on closure by the kernel, dup2'ing onto the fd is equivalent to
> closure, so the behaviour is equivalent.

This is true for epoll, but IIRC solaris 10 uses port_create and Cedric uses Solaris 10. I haven't checked if the semantics are the same as for epoll.
Comment 39 Stefan Fritsch 2008-11-07 10:21:45 UTC
On a Solaris 10 machine with 2.2.10 with the 2007-11-16 patch, I didn't see any problems.
Comment 40 Jeff Trawick 2009-01-07 12:03:12 UTC
A proposed fix was committed to trunk (http://svn.apache.org/viewvc?rev=732451&view=rev).  Is anyone able to try it out?  I've tested it successfully with graceful stop/restart on OpenSolaris 2008.11.
Comment 41 Stefan Fritsch 2009-01-11 08:50:41 UTC
(In reply to comment #40)
> A proposed fix was committed to trunk
> (http://svn.apache.org/viewvc?rev=732451&view=rev).  Is anyone able to try it
> out?  I've tested it successfully with graceful stop/restart on OpenSolaris
> 2008.11.
> 

Seems to work under Linux, too.
Comment 42 Philp M. Gollucci 2009-01-18 16:19:14 UTC
CC myself on FreeBSD related bugs
Comment 43 Matus "fantomas" Uhlar 2009-02-13 05:51:45 UTC
now what should I apply to 2.2.11? it rejectsthe one from trunk
Comment 44 Jeff Trawick 2009-02-13 05:56:55 UTC
This is the patch which was applied to the 2.2.x branch:

http://svn.apache.org/viewvc/httpd/httpd/branches/2.2.x/server/mpm/prefork/prefork.c?r1=739608&r2=739607&pathrev=739608
Comment 45 Dan Muey 2009-02-24 06:46:36 UTC
WIll this be in 2.2.12 ?

Is there an ETA on when 2.2.12 will arrive?

thanks!
Comment 46 Jeff Trawick 2009-02-24 07:36:08 UTC
>WIll this be in 2.2.12 ?
yes
>Is there an ETA on when 2.2.12 will arrive?
no ETA available at present
Comment 47 neofutur 2009-09-28 19:47:26 UTC
 for information, same problem here ( www-servers/apache-2.2.11-r2 gentoo hardened )
Comment 48 Jon Osborn 2010-03-19 18:03:37 UTC
FYI, also an issue on SLES 11 64 bit. graceful-stop leaves threads hanging. the sockets appear to have been closed.

/usr/sbin/httpd2 -V

Server version: Apache/2.2.10 (Linux/SUSE)
Server built:   Feb 23 2009 21:08:14
Server's Module Magic Number: 20051115:18
Server loaded:  APR 1.3.3, APR-Util 1.3.4
Compiled using: APR 1.3.3, APR-Util 1.3.4
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/srv/www"
 -D SUEXEC_BIN="/usr/sbin/suexec2"
 -D DEFAULT_PIDLOG="/var/run/httpd2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/accept.lock"
 -D DEFAULT_ERRORLOG="/var/log/apache2/err
Comment 49 Jeff Trawick 2010-03-21 22:47:33 UTC
Jon, I expect that graceful-stop issues aren't related to this bug.  Plz open a new bug and post backtraces of some hanging threads.
Comment 50 Yavuz Darendelioglu 2010-11-28 03:44:01 UTC
Hi all,

I think this is not related with normal use of Apache and formally coded modules.

I have 3 servers running 3 different Slackware servers.  Each is forced by cron, for a graceful restart at certain times of the day.

This morning, on one of my servers I noticed 
[Sun Nov 28 06:10:01 2010] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) problem.

As I searched, I found this page. I read all the comments and info provided here. Then, I checked my other two servers. But the other two are not having the same problem.

I just wanted to drop a note for you, about the difference between erroneous server and the others. So that it might be helpful in finding your problem.

The one which started producing the mentioned error is my development server. Where I am developing a module on handling requests. 

In the module I am intensively playig with pools and tables. I think the error started on my server after I coded, late in last night, some lines with apr_table_set, apr_table_unset, apr_table_merge etc. I will concentrate on this point for locating the problem.

If you're having this error, try to disable your modules one by one and give a graceful restart. Probably you have an offending module which causes this.

Following are the details of my servers. First one is my development server where I am having the error. Please notice that the first and the second are exactly the same, but they are running on different boxes (even on different countries)


Server version: Apache/2.2.15 (Unix)
Server built:   Mar  8 2010 13:36:25
Server's Module Magic Number: 20051115:24
Server loaded:  APR 1.3.9, APR-Util 1.3.9
Compiled using: APR 1.3.9, APR-Util 1.3.9
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr"
 -D SUEXEC_BIN="/usr/bin/suexec"
 -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types"
 -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf"

Following servers having no error:
Server version: Apache/2.2.15 (Unix)
Server built:   Mar  8 2010 13:36:25
Server's Module Magic Number: 20051115:24
Server loaded:  APR 1.3.9, APR-Util 1.3.9
Compiled using: APR 1.3.9, APR-Util 1.3.9
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr"
 -D SUEXEC_BIN="/usr/bin/suexec"
 -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types"
 -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf"

Server version: Apache/2.2.13 (Unix)
Server built:   Aug 10 2009 14:15:43
Server's Module Magic Number: 20051115:23
Server loaded:  APR 1.3.8, APR-Util 1.3.9
Compiled using: APR 1.3.8, APR-Util 1.3.9
Architecture:   32-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr"
 -D SUEXEC_BIN="/usr/bin/suexec"
 -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types"
 -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf"


Good luck,
Yavuz Darendelioglu
(aka bioyavuz)
Comment 51 Yuri Egorov 2010-12-25 08:19:28 UTC
# /usr/sbin/httpd -v
Server version: Apache/2.2.17 (Unix)
Server built:   Oct 19 2010 16:27:47

Still having exactly same problem (triggering graceful restart from command line one by one causes segmentation fault in approximately 3 minutes). I am using Plesk together with Zend Server, and it seems that they use this scenario (multiple graceful restarts) for log rotation or something and my web server is crashing 1-2 times a week.

Looks like no patch saved in 2.2.17 or it is not working now.
Comment 52 Yuri Egorov 2010-12-25 08:20:37 UTC
My platform is CentOS 5.5.
Comment 53 Stefan Fritsch 2010-12-27 17:00:10 UTC
(In reply to comment #51)
> Still having exactly same problem (triggering graceful restart from command
> line one by one causes segmentation fault in approximately 3 minutes).

This is not the same problem. This bug is about hanging processes, not crashes. Please open a new bug report for your segfaults and provide backtraces. See the section "Debugging intermittent crashes" at http://httpd.apache.org/dev/debugging.html
Comment 54 Milos Ivanovic 2011-06-16 17:28:13 UTC
I am still encountering this problem on Gentoo using Apache 2.2.17 with the itk mpm. Was it really resolved?
Comment 55 Milos Ivanovic 2011-06-16 17:34:03 UTC
Sorry I should have reinforced the problem more precisely - this is what is being written to /var/log/apache2/error_log when I run /etc/init.d/apache2 reload:


[Fri Jun 17 04:59:23 2011] [notice] Graceful restart requested, doing restart
[Fri Jun 17 04:59:23 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket)
[Fri Jun 17 04:59:25 2011] [notice] Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/1.0.0d configured -- resuming normal operations
[Fri Jun 17 05:01:20 2011] [notice] Graceful restart requested, doing restart
[Fri Jun 17 05:01:20 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket)
[Fri Jun 17 05:01:22 2011] [notice] Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/1.0.0d configured -- resuming normal operations

The error does not appear when I do a full Apache restart. I believe the error began as soon as I started using the MPM.
Comment 56 Jeff Trawick 2011-06-16 17:44:11 UTC
itk-mpm user:
This bug is tracking something specific to the prefork MPM.  Report elsewhere if you're using the itk-mpm.
Comment 57 Stefan Fritsch 2011-06-16 18:40:52 UTC
(In reply to comment #54)
> I am still encountering this problem on Gentoo using Apache 2.2.17 with the itk
> mpm. Was it really resolved?

If you use mpm-itk 2.2.17-01 or newer, then it should be resolved there, too.

In any case, I am really confident that this bug is fixed. If anyone encounters a similar issue with 2.2.19, he should open a new PR.
Comment 58 chris 2011-08-01 19:26:54 UTC
I just thought I would add something I noticed.

I have Sparc Solaris 10 and I had noticed the graceful restart resulting in G's in server-status never going away with apache 2.2.13 years ago, compiled with SUN cc 12.1 and preinstalled ssl from /usr/sfw.  I installed 2.2.14 and the problem went away.  If you do not implement ssl there is no problem.

This week I set up a clean, patched Solaris 10, with httpd 2.2.18, compiled using gcc and openssl 0.9.8R in /usr/local/ssl.  I saw that the problem returned, G in server-status.  I installed 2.2.19 the same way and the problem remained.  

I returned to 2.2.14, configured and compiled the same way as 18 and 19, and the problem is NOT there.  

I"m just an admin and using prefork but I thought this might help someone figure out why.

Thanks
Comment 59 Oliver Boettcher 2011-08-05 13:04:36 UTC
(In reply to comment #58)
> I just thought I would add something I noticed.
> 
> I have Sparc Solaris 10 and I had noticed the graceful restart resulting in G's
> in server-status never going away with apache 2.2.13 years ago, compiled with
> SUN cc 12.1 and preinstalled ssl from /usr/sfw.  I installed 2.2.14 and the
> problem went away.  If you do not implement ssl there is no problem.
> 
In my opinion SSL is not the Problem, this also happens with multiple listeners (80, 81 e.g. loadbalancer terminated SSL), also stated somewhere above. But I still confirm this, possibly only on Solaris 10 (x86 or sparc does not matter). I will have a closer look on 2.2.14 to mayby stick with that version. Besides, FreeBSD and all tested Linux flavors are doing fine.

> 
> I"m just an admin and using prefork but I thought this might help someone
> figure out why.
> 
Me too, so I'm not much of a help.

Thanks
Oliver
Comment 60 Stefan Fritsch 2011-08-06 11:43:39 UTC
(In reply to comment #59)
> I will have a closer look on 2.2.14 to mayby stick with that version. Besides,
> FreeBSD and all tested Linux flavors are doing fine.

PR 42829 is fixed. You are experiencing a different issue which is tracked in PR 50261, which also has a work-around. Please only add comments to that new PR.
Comment 61 chris 2011-08-19 14:27:31 UTC
Is the patch for fixing the apache-status showing G the attachment "patch for graceful-stop / graceful reload problem without race condition"?  If so, why wasn't it implemented in the subsequent releases, except 14?  I experienced the problem in 18 and 19.

Does it work for Solaris please, because all that is mentioned is linux?

Thank you for any information.
Comment 62 chris 2011-08-19 15:16:44 UTC
(In reply to comment #61)
Comment 63 paul 2011-12-06 18:11:55 UTC
Everytime I do a graceful-stop, I see the following in the error log:

[Fri Dec 02 19:02:22 2011] [notice] caught SIGWINCH, shutting down gracefully
[Fri Dec 02 19:02:22 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket)

I'm using apache 2.2.21; prefork MPM on linux.  After reading through this thread (especially Joe's comment - https://issues.apache.org/bugzilla/show_bug.cgi?id=42829#c16).  I figured out that it is a race condition where SIGUSR1 is called after the child has done "else if (die_now) {" and before the child has done "status = lr->accept_func(&csd, lr, ptrans);".

The following patch fixed the problem for me:

@@ -638,6 +638,14 @@
                  * ### than POLLIN
                  */
                 lr = pdesc[last_poll_idx++].client_data;
+
+               if (die_now) {
+                       /* In graceful stop/restart; drop the mutex
+                        * and terminate the child. */
+                       SAFE_ACCEPT(accept_mutex_off());
+                       clean_child_exit(0);
+               }
+
                 goto got_fd;
             }
         }
Comment 64 chris 2011-12-27 21:41:28 UTC
Hi,

Can paul@silvitch.com specify which source file he included the patch into please?  Thank you very very much.  

I have found this doesn't occur on my linux servers running Centos and apache 2.2.15 but solaris has the problem on that same version.  It doesn't have the problem on 2.2.14 on Solaris.  Also, it only happens when SSL is compiled in on solaris.

Thanks for any help.