If httpd dies and leaves behind a pidfile containing a pid which later gets reused, httpd refuses to come back up. I've already seen this in production, where a host crashed and, on coming back up, the web server failed to start because something else had grabbed the pid.

To reproduce the problem, do this:

[root@laptop httpd]# pkill -9 httpd
[root@laptop httpd]# pgrep httpd
[root@laptop httpd]# echo 1 > /var/run/httpd/httpd.pid
[root@laptop httpd]# /usr/sbin/httpd -k start
httpd: Could not reliably determine the server's fully qualified domain name, using fe80::201:4aff:fe5e:5331 for ServerName
httpd (pid 1) already running

This is the version I'm using:

[quick@laptop ~]$ httpd -v
Server version: Apache/2.2.22 (Unix)
Server built: Apr 30 2012 09:55:05
[quick@laptop ~]$ cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)

I tested this on RHEL6, which ships with httpd 2.2.15, and noted that it doesn't suffer from the same problem. However, I can't see anything in the changelog between versions 2.2.15 and 2.2.22 that would have caused this problem.
I reproduced the problem on Fedora 18 with httpd 2.4.3 as well:

[root@laptop httpd]# ps -ef | grep [h]ttp
root 2326 1 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2327 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2328 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2329 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2330 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2331 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 2332 2326 0 20:57 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
[root@laptop httpd]# kill -9 2326
[root@laptop httpd]# ps -ef | grep [h]ttp
[root@laptop httpd]# echo 1 > /var/run/httpd/httpd.pid
[root@laptop httpd]# /usr/sbin/httpd -k start
httpd (pid 1) already running
[root@laptop httpd]# ps -ef | grep [h]ttp
[root@laptop httpd]# httpd -v
Server version: Apache/2.4.3 (Fedora)
Server built: Jan 8 2013 13:46:23
[root@laptop httpd]# uname -a
Linux laptop 3.7.7-201.fc18.i686 #1 SMP Tue Feb 12 22:59:10 UTC 2013 i686 i686 i386 GNU/Linux
Just for comparison, I carried out the same test on nginx, and that was fine:

[root@laptop run]# ps -ef | grep [n]ginx
root 3055 1 0 21:50 ? 00:00:00 nginx: master process /usr/sbin/nginx
nginx 3056 3055 0 21:50 ? 00:00:00 nginx: worker process
[root@laptop run]# cat /run/nginx.pid
3055
[root@laptop run]# kill -9 3055
[root@laptop run]# ps -ef | grep [n]ginx
[root@laptop run]# echo 1 > /run/nginx.pid
[root@laptop run]# /usr/sbin/nginx
[root@laptop run]# ps -ef | grep [n]ginx
root 3144 1 0 21:53 ? 00:00:00 nginx: master process /usr/sbin/nginx
nginx 3145 3144 0 21:53 ? 00:00:00 nginx: worker process
[root@laptop run]# cat /run/nginx.pid
3144
[root@laptop run]# nginx -v
nginx version: nginx/1.2.6
This issue continues to be present in 2.4.18 as shipped by RHEL 7 as package:

httpd24-httpd-2.4.18-11.el7.x86_64

Until I found this bug report, I was puzzled that an nfs process was being identified as httpd. So if Edward Quick would like me to send him a beer, I will be delighted to do so.
Thanks a lot for the tests. bz.apache.org/bugzilla/show_bug.cgi?id=60261 was a recent similar use case in which the same PID is re-used in Docker containers (so since it is the same PID it is safe to proceed). In the upcoming release (2.4.24) the code looks more or less like this:

    /* Read the pid file and store the result in 'otherpid' */
    rv = ap_read_pid(pconf, ap_pid_fname, &otherpid);
    if (otherpid != getpid() && kill(otherpid, 0) == 0) {
        /* httpd already running */
    }

In the case reported here, the new PID is different from the one used by the old httpd process (so otherpid != getpid() is true), but the old PID is now used by a completely different running process (so kill(otherpid, 0) == 0 is also true). That is indistinguishable from the regular case in which httpd is already started, where it is correct to end up in the "httpd already running" error case.

Waiting for other feedback, since I am not sure how to solve this issue by simply looking at PIDs (something more might be needed).
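As a rough illustration of what "something more" could look like, here is a minimal sketch that checks not only that the PID is alive but also that the process behind it has the expected command name. It assumes a Linux host with /proc mounted, so it is not portable. The helpers read_proc_comm() and pid_is_running_as() are hypothetical names made up for this example; they are not part of httpd or APR, and this is not a proposed patch.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the short command name of `pid` from
 * /proc/<pid>/comm (Linux only; the kernel truncates it to 15 chars).
 * Returns 0 on success, -1 if the file cannot be read. */
int read_proc_comm(pid_t pid, char *buf, size_t len)
{
    char path[64];
    FILE *fp;

    assert(buf != NULL && len > 0);

    snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL) {
        return -1;
    }
    if (fgets(buf, (int)len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Strip the trailing newline the kernel appends. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Hypothetical helper: returns 1 only if `pid` exists AND its command
 * name equals `expected` (e.g. "httpd"); returns 0 otherwise. */
int pid_is_running_as(pid_t pid, const char *expected)
{
    char comm[64];

    /* Never pass 0 or a negative pid to kill(): those address
     * process groups, not a single process. */
    if (pid <= 0) {
        return 0;
    }
    /* Signal 0 only tests for existence. EPERM means the process
     * exists but we are not allowed to signal it. */
    if (kill(pid, 0) != 0 && errno != EPERM) {
        return 0;
    }
    if (read_proc_comm(pid, comm, sizeof(comm)) != 0) {
        return 0;
    }
    return strcmp(comm, expected) == 0;
}
```

With a check along these lines, a stale pidfile containing 1 would presumably not match on a typical system, because PID 1 exists but is not named "httpd" (pid_is_running_as(1, "httpd") would return 0). A name comparison is still only a heuristic: an unrelated process with the same name would be mistaken for a running server.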
Same problem seen with a build from trunk rev 1833619:

tls13# /usr/local/bin/apachectl start
httpd (pid 2548) already running
tls13# ps -ef | grep "2548"
root 2548 2489 0 10:26:09 pts/10 0:00 -sh
root 4423 2548 0 10:54:29 pts/10 0:00 grep 2548
tls13#

Deleting the left-behind sock file does nothing:

tls13# ls -lap /usr/local/www/var/run
total 9
drwxr-xr-x 2 webservd webservd 3 Aug 13 10:35 ./
drwxr-xr-x 4 root root 4 Jun 15 19:28 ../
srwx------ 1 webservd webservd 0 Aug 13 10:35 cgid.sock.2548
tls13#
tls13# /usr/local/bin/httpd -V
Server version: Apache/2.5.1-dev (Unix)
Server built: Jun 15 2018 19:01:31
Server's Module Magic Number: 20180422:1
Server loaded: APR 1.6.3, APR-UTIL 1.5.3, PCRE 8.40 2017-01-11
Compiled using: APR 1.6.3, APR-UTIL 1.5.3, PCRE 8.40 2017-01-11
Architecture: 64-bit
Server MPM: event
  threaded: yes (fixed thread count)
  forked: yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_PROC_PTHREAD_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/usr/local"
 -D SUEXEC_BIN="/usr/local/bin/suexec"
 -D DEFAULT_PIDLOG="httpd.pid"
 -D DEFAULT_SCOREBOARD="apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="www/conf/mime.types"
 -D SERVER_CONFIG_FILE="www/conf/httpd.conf"
tls13#

The temporary brute-force method I used was simply to reboot the server and get a new set of pids in use:

tls13 # uptime
10:57am up 1 min(s), 1 user, load average: 0.12, 0.08, 0.04
tls13 # /usr/local/bin/apachectl start
tls13 #

Not pretty, but it works for the moment.