Bug 62838 - semaphores appear to be leaking
Summary: semaphores appear to be leaking
Status: RESOLVED LATER
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: Core (show other bugs)
Version: 2.4.34
Hardware: PC All
: P2 major (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-19 14:14 UTC by mark
Modified: 2018-12-20 09:36 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description mark 2018-10-19 14:14:18 UTC
As we have progressively  increased the number of sites using Apache 2.4 instead of Apache 2.2, we are seeing more and more cases of semaphore leakage. The symptom is simply  that after a period of time of the order of a few days, we hit the kernel configured semaphore limit. 


(chroot) user@host:~ $ ipcs -ls

------ Semaphore Limits --------
max number of arrays = 8192
max semaphores per array = 250
max semaphores system wide = 262144
max ops per semop call = 32
semaphore max value = 32767

(chroot) user@host:~ $ ipcs -us

------ Semaphore Status --------
used arrays = 8192
allocated semaphores = 8204

(chroot) user@host:~ $

All Apache instances are completely stopped once a day and restarted, so we would not expect a net increase of any kind over time.

I believe the likely culprits are either the healthcheck module or the  proxy balancer module, both of which we make extensive use of. 

What options do I have for tracking this down?

Here's the loaded module dump.

Loaded Modules:
 core_module (static)
 authn_file_module (static)
 authz_host_module (static)
 authz_groupfile_module (static)
 authz_user_module (static)
 authnz_ldap_module (static)
 auth_basic_module (static)
 auth_digest_module (static)
 so_module (static)
 http_module (static)
 mime_module (static)
 ldap_module (static)
 log_config_module (static)
 env_module (static)
 ssl_module (static)
 http2_module (static)
 actions_module (static)
 alias_module (static)
 rewrite_module (static)
 mpm_event_module (shared)
 negotiation_module (shared)
 status_module (shared)
 info_module (shared)
 include_module (shared)
 autoindex_module (shared)
 dir_module (shared)
 cgi_module (shared)
 expires_module (shared)
 headers_module (shared)
 unique_id_module (shared)
 setenvif_module (shared)
 logio_module (shared)
 unixd_module (shared)
 fcgid_module (shared)
 proxy_module (shared)
 proxy_http_module (shared)
 proxy_http2_module (shared)
 reqtimeout_module (shared)
 proxy_ajp_module (shared)
 slotmem_shm_module (shared)
 proxy_balancer_module (shared)
 proxy_wstunnel_module (shared)
 proxy_hcheck_module (shared)
 watchdog_module (shared)
 lbmethod_byrequests_module (shared)
 lbmethod_bytraffic_module (shared)
 lbmethod_bybusyness_module (shared)
 lbmethod_heartbeat_module (shared)
 auth_kerb_module (shared)
 suexec_module (shared)
 gateform_module (shared)
 qos_module (shared)
 jk_module (shared)
 weblogic_module (shared)
 socache_shmcb_module (shared)
 ct_auth_module (shared)
 authz_dbm_module (shared)
 security2_module (shared)
 deflate_module (shared)
 filter_module (shared)
 substitute_module (shared)
 authn_core_module (shared)
 authz_core_module (shared)
 remoteip_module (shared)
Comment 1 mark 2018-10-19 16:37:06 UTC
I can see that https://httpd.apache.org/docs/2.4/mod/core.html#mutex  clearly indicates that child process crashes can lead to this outcome, although I can see no evidence of crashes in the logs.  Is it possible to log all of the mutex creation/destruction operations? I didn't spot anything in the code, but I might  not have looked carefully enough.
Comment 2 Jim Jagielski 2018-10-19 16:42:38 UTC
How are you restarting your Apache instances? Graceful restarts? Complete shutdown and starts? Are you using apachectl?
Comment 3 mark 2018-10-19 17:48:13 UTC
The rough pattern is SIGHUPs over the course of the day with a SIGTERM (to the parent) just after 01:00 for log rotation, like this.

[Thu Oct 18 03:19:34.217251 2018] [mpm_event:notice] [pid 3058813:tid 140192048617216] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 07:04:40.513395 2018] [mpm_event:notice] [pid 475524:tid 140499276367616] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 07:15:05.277515 2018] [mpm_event:notice] [pid 717288:tid 140303380227840] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 07:24:43.178583 2018] [mpm_event:notice] [pid 780677:tid 140023181338368] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 07:42:35.453410 2018] [mpm_event:notice] [pid 854650:tid 139715977262848] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 07:56:23.270647 2018] [mpm_event:notice] [pid 953435:tid 140331639346944] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 08:30:58.220769 2018] [mpm_event:notice] [pid 1048890:tid 139641276536576] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 08:50:43.442987 2018] [mpm_event:notice] [pid 1227222:tid 139671248234240] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:01:02.131030 2018] [mpm_event:notice] [pid 1345267:tid 140254622779136] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:14:18.201380 2018] [mpm_event:notice] [pid 1417997:tid 139692156507904] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:24:43.148597 2018] [mpm_event:notice] [pid 1512153:tid 140096894269184] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:38:48.019183 2018] [mpm_event:notice] [pid 1587143:tid 139858449991424] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:48:22.410284 2018] [mpm_event:notice] [pid 1680348:tid 140373525317376] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 09:56:23.741254 2018] [mpm_event:notice] [pid 1748967:tid 140016555403008] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 10:10:38.933538 2018] [mpm_event:notice] [pid 1813953:tid 140230740670208] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 10:26:59.620699 2018] [mpm_event:notice] [pid 1907230:tid 140531967248128] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 10:38:08.636185 2018] [mpm_event:notice] [pid 2001438:tid 140078777874176] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 10:55:26.661686 2018] [mpm_event:notice] [pid 2087715:tid 140430556067584] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 11:05:20.450685 2018] [mpm_event:notice] [pid 2195232:tid 140517951530752] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 11:15:45.798789 2018] [mpm_event:notice] [pid 2266456:tid 140668003170048] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 11:25:25.056891 2018] [mpm_event:notice] [pid 2345427:tid 139974508877568] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 11:40:05.265569 2018] [mpm_event:notice] [pid 2432669:tid 139917407106816] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 12:06:19.042160 2018] [mpm_event:notice] [pid 2514479:tid 139791245469440] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 12:14:27.821021 2018] [mpm_event:notice] [pid 2664078:tid 139669830293248] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 12:35:49.668875 2018] [mpm_event:notice] [pid 2729559:tid 139811383105280] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 12:44:56.870458 2018] [mpm_event:notice] [pid 2864314:tid 140215513036544] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 12:57:36.917337 2018] [mpm_event:notice] [pid 2932108:tid 139853256886016] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 13:16:04.227476 2018] [mpm_event:notice] [pid 3016631:tid 139778178967296] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 14:14:04.107826 2018] [mpm_event:notice] [pid 3129021:tid 139774102066944] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 14:52:22.126417 2018] [mpm_event:notice] [pid 3447655:tid 140403792983808] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 15:01:16.477650 2018] [mpm_event:notice] [pid 3681299:tid 140150433875712] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 15:24:42.809245 2018] [mpm_event:notice] [pid 3742982:tid 140258378319616] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 15:34:30.657432 2018] [mpm_event:notice] [pid 3879714:tid 140478359144192] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 15:43:57.991859 2018] [mpm_event:notice] [pid 3952798:tid 140655863158528] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 16:26:35.816201 2018] [mpm_event:notice] [pid 4052316:tid 140065128847104] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 16:31:53.645372 2018] [mpm_event:notice] [pid 65736:tid 140362851575552] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 16:40:28.808164 2018] [mpm_event:notice] [pid 109456:tid 140604273059584] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 16:54:06.353772 2018] [mpm_event:notice] [pid 176211:tid 139934321317632] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 17:08:31.741640 2018] [mpm_event:notice] [pid 272353:tid 140085452125952] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 17:58:01.055361 2018] [mpm_event:notice] [pid 364997:tid 140507948091136] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 18:11:08.913313 2018] [mpm_event:notice] [pid 643816:tid 140492747138816] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 18:27:16.758561 2018] [mpm_event:notice] [pid 719057:tid 139999932757760] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 18:34:56.351274 2018] [mpm_event:notice] [pid 824978:tid 140478992578304] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 18:39:51.844205 2018] [mpm_event:notice] [pid 887082:tid 139851501393664] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 18:57:42.577825 2018] [mpm_event:notice] [pid 923681:tid 139664966297344] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 19:02:15.307488 2018] [mpm_event:notice] [pid 1036942:tid 139974597519104] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 19:11:02.630329 2018] [mpm_event:notice] [pid 1072907:tid 140637559154432] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 19:19:03.167280 2018] [mpm_event:notice] [pid 1143917:tid 140489866045184] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 19:27:48.270997 2018] [mpm_event:notice] [pid 1205882:tid 140486785251072] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 21:27:57.698779 2018] [mpm_event:notice] [pid 1263176:tid 139997587175168] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 21:34:08.127439 2018] [mpm_event:notice] [pid 1848340:tid 140261775292160] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 21:40:25.082499 2018] [mpm_event:notice] [pid 1944847:tid 140245121877760] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 21:52:28.003812 2018] [mpm_event:notice] [pid 1978561:tid 139664764069632] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 23:49:38.886666 2018] [mpm_event:notice] [pid 2050766:tid 140626123683584] AH00494: SIGHUP received.  Attempting to restart
[Fri Oct 19 00:43:49.371914 2018] [mpm_event:notice] [pid 2684324:tid 140592126330624] AH00494: SIGHUP received.  Attempting to restart
[Fri Oct 19 00:48:57.350019 2018] [mpm_event:notice] [pid 2957785:tid 140626969446144] AH00494: SIGHUP received.  Attempting to restart
[Fri Oct 19 01:01:29.873277 2018] [mpm_event:notice] [pid 3003584:tid 140633418872576] AH00494: SIGHUP received.  Attempting to restart
Comment 4 mark 2018-10-19 17:51:01 UTC
Actually, I filtered out the child signals, but the children are often unresponsive so this pattern is common (resorting to a SIGKILL).

[Thu Oct 18 21:52:28.003812 2018] [mpm_event:notice] [pid 1978561:tid 139664764069632] AH00494: SIGHUP received.  Attempting to restart
[Thu Oct 18 23:49:31.866773 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:31.866815 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:33.868846 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:33.868913 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:35.870945 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:35.870993 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:37.873019 2018] [core:error] [pid 2050766:tid 140626123683584] AH00046: child process 2064824 still did not exit, sending a SIGKILL
[Thu Oct 18 23:49:37.873288 2018] [core:error] [pid 2050766:tid 140626123683584] AH00046: child process 2065149 still did not exit, sending a SIGKILL
Comment 5 mark 2018-10-19 17:56:43 UTC
Actually this is probably more useful (following a particular parent PID reference)

[Thu Oct 18 23:49:31.866773 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:31.866815 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:33.868846 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:33.868913 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:35.870945 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2064824 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:35.870993 2018] [core:warn] [pid 2050766:tid 140626123683584] AH00045: child process 2065149 still did not exit, sending a SIGTERM
[Thu Oct 18 23:49:37.873019 2018] [core:error] [pid 2050766:tid 140626123683584] AH00046: child process 2064824 still did not exit, sending a SIGKILL
[Thu Oct 18 23:49:37.873288 2018] [core:error] [pid 2050766:tid 140626123683584] AH00046: child process 2065149 still did not exit, sending a SIGKILL
[Thu Oct 18 23:49:38.886666 2018] [mpm_event:notice] [pid 2050766:tid 140626123683584] AH00494: SIGHUP received.  Attempting to restart
Comment 6 William A. Rowe Jr. 2018-10-19 20:01:32 UTC
Hi Mark,

You failed to mention your operating system but let's presume this is linux.
Semaphores are not HTTPD's default signaling mechanism, but many third party
modules default to semaphores.

You could narrow down which modules in a non-production environment by
simply checking `ipcs -s` and then avoiding loading one specific module
or another and rechecking. Again, if this is linux, these aren't likely 
from httpd, but from one or more of these non-core third party modules;

 auth_kerb_module (shared)
 suexec_module (shared)
 gateform_module (shared)
 qos_module (shared)
 jk_module (shared)
 weblogic_module (shared)
 ct_auth_module (shared)
 security2_module (shared)

Very often these are httpd-2.2 module legacies which weren't entirely
ported to the capabilities of httpd-2.4. That doesn't mean they won't
work, but modules relying on semaphore locking are subject to the
instability you've described.

From your logs;

[Thu Oct 18 23:49:37.873019 2018] [core:error] [pid 2050766:tid
140626123683584] AH00046: child process 2064824 still did not exit, sending a
SIGKILL

Semaphores orphaned

[Thu Oct 18 23:49:37.873288 2018] [core:error] [pid 2050766:tid
140626123683584] AH00046: child process 2065149 still did not exit, sending a
SIGKILL

Semaphores orphaned

Due to the nature of semaphores, there is nothing we can do here. Using
a clever combination of ipcs -s and -p flags it is possible to identify the
semaphores not-in-use, and purge them periodically. An internet search will
quickly lead you to such scripts.

If any of the modules mentioned above are using semaphores, your best next
step is to inquire of those third party module authors and find out whether
there is an updated module available or whether the authors would modify
their module to use the default, preferred apr locking mechanism.
Comment 7 mark 2018-10-19 20:16:44 UTC
It is Linux, yes. Here's the -V output and I was under the impression semaphores were the default mutex, but perhaps I've misunderstood. 

Server version: Apache/2.4.34 (Unix)
Server built:   Sep 14 2018 13:35:35
Server's Module Magic Number: 20120211:79
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1
Compiled using: APR 1.6.3, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/versions64/apache-2.4.34"
 -D SUEXEC_BIN="/versions64/apache-2.4.34/bin/suexec"
 -D DEFAULT_PIDLOG="logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

I can believe the SIGKILL cases might represent orphans, although a central registry of semaphores, say in the filesystem,  would allow this case to be handled, but that's a lot of management overhead for what should be a rare case.

It would be difficult, but not impossible to reproduce the production behaviours outside of production, which is why I was hoping some higher level of tracing might give me some clues. I suppose, in the extreme, I could hook up a permanent strace and hope there's enough surrounding detail to pinpoint who exactly is creating the semaphores.
Comment 8 Yann Ylavic 2018-10-19 20:42:26 UTC
KILLing the children processes shouldn't leak semaphores since there are created (and destroyed) by the parent process.
Leaks usually happen when the parent process itself is killed, or crashes.
Does your system report such parent process crashes?

Anyway, I'd recommend setting:

    Mutex pthread default

at the global scope to avoid such issues.
Comment 9 Yann Ylavic 2018-10-19 20:48:01 UTC
If you compile APR by yourself, you might want to ./configure it with "apr_lock_method=USE_PROC_PTHREAD_SERIALIZE" too.
Comment 10 Yann Ylavic 2018-10-19 20:56:13 UTC
Also, IPC SysV SHMs which have the same possible leak issue can be replaced by posix SHMs with --enable-posix-shm, pthreads mutexes et posix SHMs are the best options on Linux (IMHO).
Comment 11 mark 2018-10-19 21:02:37 UTC
Interesting ideas, thanks. Yes, we compile everything ourselves aside from libc.
Comment 12 mark 2018-10-20 12:04:34 UTC
(In reply to Yann Ylavic from comment #8)
> KILLing the children processes shouldn't leak semaphores since there are
> created (and destroyed) by the parent process.
> Leaks usually happen when the parent process itself is killed, or crashes.
> Does your system report such parent process crashes?

No, there's no indication of the parent crashing. Should the parent PID change after a SIGHUP? I will peer at the docs though.
Comment 13 mark 2018-12-20 09:36:18 UTC
Never found the source of the semaphore leaks, but using pthread mutexes, has made that problem go away. Might as well close this as the Apache team seem pretty convinced the issue is in a 3rd party module and I don't have the spare time to try and reproduce this case in a non-production environment.