Bug 62044 - shared memory segments are not found in global list, but appear to exist in kernel.
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_balancer
Version: 2.4.29
Hardware: PC Linux
Importance: P2 critical
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2018-01-25 12:35 UTC by mark
Modified: 2018-02-14 07:29 UTC



Attachments
Also remove SHM file if any (541 bytes, patch)
2018-01-25 23:46 UTC, Yann Ylavic
Details | Diff
slotmem SHMs reuse (2.4.x) (27.88 KB, patch)
2018-01-28 23:29 UTC, Yann Ylavic
Details | Diff
Unique balancer id per vhost (1.55 KB, patch)
2018-01-31 10:01 UTC, Yann Ylavic
Details | Diff
Reuse SHMs names on restart or stop/start (2.4.x) (38.09 KB, patch)
2018-02-08 16:32 UTC, Yann Ylavic
Details | Diff

Description mark 2018-01-25 12:35:03 UTC
With a large number of vhosts (> 1000) and proxy balancer configurations (> 1000), we are very frequently seeing Apache exit at startup with a configuration error like:

[Wed Jan 10 16:28:45.853599 2018] [slotmem_shm:error] [pid 29764:tid 140038537377536] (17)File exists: AH02611: create: apr_shm_create(/apache24/logs/slotmem-shm-p71143bd8_balancer1.shm) failed

[Wed Jan 10 16:28:45.853641 2018] [:emerg] [pid 29764:tid 140038537377536] AH00020: Configuration Failed, exiting 

Turning on trace5-level logging, we see things like the following for a single balancer worker (I filtered on the balancer SHM name):

[Thu Jan 25 03:48:08.397926 2018] [slotmem_shm:debug] [pid 13310:tid 140455729428224] mod_slotmem_shm.c(364): AH02602: create didn't find /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm in global list
[Thu Jan 25 03:48:08.397932 2018] [slotmem_shm:debug] [pid 13310:tid 140455729428224] mod_slotmem_shm.c(374): AH02300: create /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:48:08.398076 2018] [slotmem_shm:debug] [pid 13310:tid 140455729428224] mod_slotmem_shm.c(417): AH02611: create: apr_shm_create(/apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm) succeeded
[Thu Jan 25 03:48:58.529349 2018] [slotmem_shm:debug] [pid 45813:tid 139795075143424] mod_slotmem_shm.c(364): AH02602: create didn't find /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm in global list
[Thu Jan 25 03:48:58.529357 2018] [slotmem_shm:debug] [pid 45813:tid 139795075143424] mod_slotmem_shm.c(374): AH02300: create /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:49:01.835207 2018] [slotmem_shm:debug] [pid 46229:tid 139795075143424] mod_slotmem_shm.c(496): AH02301: attach looking for /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm
[Thu Jan 25 03:49:01.835222 2018] [slotmem_shm:debug] [pid 46625:tid 139795075143424] mod_slotmem_shm.c(496): AH02301: attach looking for /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm
[Thu Jan 25 03:49:01.835230 2018] [slotmem_shm:debug] [pid 46229:tid 139795075143424] mod_slotmem_shm.c(509): AH02302: attach found /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:49:01.835254 2018] [slotmem_shm:debug] [pid 46625:tid 139795075143424] mod_slotmem_shm.c(509): AH02302: attach found /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:49:01.886171 2018] [slotmem_shm:debug] [pid 47011:tid 139795075143424] mod_slotmem_shm.c(496): AH02301: attach looking for /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm
[Thu Jan 25 03:49:01.886284 2018] [slotmem_shm:debug] [pid 47011:tid 139795075143424] mod_slotmem_shm.c(509): AH02302: attach found /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:49:01.899288 2018] [slotmem_shm:debug] [pid 47281:tid 139795075143424] mod_slotmem_shm.c(496): AH02301: attach looking for /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm
[Thu Jan 25 03:49:01.899321 2018] [slotmem_shm:debug] [pid 47281:tid 139795075143424] mod_slotmem_shm.c(509): AH02302: attach found /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:53:03.516455 2018] [slotmem_shm:debug] [pid 45813:tid 139795075143424] mod_slotmem_shm.c(364): AH02602: create didn't find /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm in global list
[Thu Jan 25 03:53:03.516462 2018] [slotmem_shm:debug] [pid 45813:tid 139795075143424] mod_slotmem_shm.c(374): AH02300: create /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm: 1176/2
[Thu Jan 25 03:53:03.516499 2018] [slotmem_shm:error] [pid 45813:tid 139795075143424] (17)File exists: AH02611: create: apr_shm_create(/apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm) failed



In other words, in the space of five minutes, the balancer was not found in the global list (03:48:08), was successfully created, was then found several times, went missing again at 03:53:03, and then failed to be created, which triggered an Apache exit (not shown here).

Rather confusingly, the choice of DefaultRuntimeDirectory has an impact on the frequency.
Comment 1 mark 2018-01-25 16:39:56 UTC
I believe the error arises here

https://github.com/apache/httpd/blob/2.4.29/modules/slotmem/mod_slotmem_shm.c#L408

I assume the 'file exists' error refers to the SHM key rather than the placeholder file in the filesystem.

However, there is a defensive removal of the key *before* the create, which makes this error very mysterious; it should be nearly impossible to fail here, I think.

apr_shm_remove(fname, gpool);
rv = apr_shm_create(&shm, size, fname, gpool);

Is there any possibility there is some latency between the removal being effective and the create starting? Or could the remove fail silently?
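The collision can be reproduced in miniature with POSIX shared memory from the Python standard library. This is only an analogue (APR uses SysV segments here, and the segment name below is invented for illustration), but it shows the same failure: an exclusive create of a name that already exists fails with errno 17 (EEXIST), exactly the "(17)File exists" in the logs, and only removing the stale segment first clears the way.

```python
# Minimal sketch (not httpd code): reproduce the EEXIST collision that
# apr_shm_create() hits when a stale segment with the same name survives.
# stdlib POSIX shared memory stands in for APR's SysV segments.
from multiprocessing import shared_memory

NAME = "bz62044_demo"  # invented name, analogous to a slotmem-shm-*.shm name

stale = shared_memory.SharedMemory(name=NAME, create=True, size=64)

# A second exclusive create of the same name fails with EEXIST (errno 17),
# the same error reported by AH02611.
second_errno = None
try:
    shared_memory.SharedMemory(name=NAME, create=True, size=64)
except FileExistsError as e:
    second_errno = e.errno
print("second create failed with errno", second_errno)

# Removing the stale segment first, the "defensive removal" in the code,
# lets the create go through.
stale.close()
stale.unlink()
retry = shared_memory.SharedMemory(name=NAME, create=True, size=64)
retry_worked = True
retry.close()
retry.unlink()
```

With SysV segments (the failing case in this report), the removal is weaker: shmctl(IPC_RMID) only marks the segment for destruction, so the key can linger in the kernel until the last attached process detaches.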
Comment 2 mark 2018-01-25 16:57:34 UTC
Looking at the code for apr_shm_remove at

https://github.com/apache/apr/blob/1.6.1/shmem/unix/shm.c#L436

I am reminded that

    /* Indicate that the segment is to be destroyed as soon
     * as all processes have detached. This also disallows any
     * new attachments to the segment. */
    if (shmctl(shmid, IPC_RMID, NULL) == -1) {
        goto shm_remove_failed;
    }

So while the remove can succeed (though I note the return status isn't tested here), the key will hang around until the last process detaches, so the defensive measure isn't effective.

So, back to the original question: why does Apache think this slot isn't already in the global list?
Comment 3 Yann Ylavic 2018-01-25 23:46:10 UTC
Created attachment 35698 [details]
Also remove SHM file if any

Does this help?
Comment 4 mark 2018-01-26 08:07:58 UTC
Thanks for looking. apr_shm_remove does an apr_file_remove as its final step, so I would be surprised if another one helps.
Comment 5 Ruediger Pluem 2018-01-26 09:08:35 UTC
(In reply to mark from comment #0)
> With a large number of vhosts ( > 1000 ) and proxy balancer configurations (
> > 1000), we are seeing Apache exit at start up time with a configuration
> error (very frequently) with an error like. 
> 
> [Wed Jan 10 16:28:45.853599 2018] [slotmem_shm:error] [pid 29764:tid
> 140038537377536] (17)File exists: AH02611: create:
> apr_shm_create(/apache24/logs/slotmem-shm-p71143bd8_balancer1.shm) failed
> 
> [Wed Jan 10 16:28:45.853641 2018] [:emerg] [pid 29764:tid 140038537377536]
> AH00020: Configuration Failed, exiting 
> 
> turning on trace5 level logs we see things like the following for a single
> balancer worker (I filtered on the balance SHM name)
> 
> [Thu Jan 25 03:48:08.397926 2018] [slotmem_shm:debug] [pid 13310:tid
> 140455729428224] mod_slotmem_shm.c(364): AH02602: create didn't find
> /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm in global list


> [Thu Jan 25 03:48:58.529349 2018] [slotmem_shm:debug] [pid 45813:tid
> 139795075143424] mod_slotmem_shm.c(364): AH02602: create didn't find
> /apache24/logs/slotmem-shm-pe1b232bb_balancer1.shm in global list

Hm. The above two lines are weird. mod_proxy_balancer only creates the shm segments in the post_config phase, where there is still only one httpd process. But I see two different PIDs in the above log messages. Did you do a graceful restart between 03:48:08 and 03:48:58?
Comment 6 mark 2018-01-26 10:37:32 UTC
Probably; we do a lot of restarts, both proactive ones to bring in managed changes to the configuration (hourly) and reactive ones when Apache stops responding. I will examine and get back to you.

My feeling after reading the code is that an old process still hasn't detached from the SHM segment, so the SHM key hangs around, but the placeholder file does get deleted; so when the next Apache process comes along, presumably without a filled-in global list, it attempts to reinstate an SHM key that still hasn't quite been released by the last process.
Comment 7 mark 2018-01-26 11:14:56 UTC
Give that man a teddy bear: pid 13310 was born at 03:40:12, received a SIGHUP at 03:47:58, and permanently exited at 03:48:11.

[Thu Jan 25 03:40:22.300797 2018] [mpm_event:notice] [pid 13310:tid 140455729428224] AH00489: Apache/2.4.29 (Unix) OpenSSL/1.0.2n mod_fcgid/2.3.9 mod_auth_kerb/5.4 mod_qos/11.43 mod_jk/1.2.42 configured -- resuming normal operations
[Thu Jan 25 03:40:22.300851 2018] [core:notice] [pid 13310:tid 140455729428224] AH00094: Command line: '/apache24/bin/httpd -f /apache24/conf/dynamic/apache24/httpd.conf -D XXXXX'
[Thu Jan 25 03:47:58.097848 2018] [mpm_event:notice] [pid 13310:tid 140455729428224] AH00494: SIGHUP received.  Attempting to restart
[Thu Jan 25 03:48:11.467544 2018] [core:notice] [pid 13310:tid 140455729428224] AH00060: seg fault or similar nasty error detected in the parent process

So the diagnosis probably remains roughly the same: some SHM keys are not getting removed, or not removed quickly enough, and are still in place the next time the same configuration starts up.

I can't yet find any trace of the suggested seg fault, though. We do see the line "AH00060: seg fault or similar nasty error detected in the parent process" a lot, but I cannot tell what it's referring to.
Comment 8 Eric Covener 2018-01-26 12:35:00 UTC
(In reply to mark from comment #7)
> give that man a teddy bear. pid 13310 was born at 03:40:12 with a SIGHUP at
> 03:47:58 and then permanently exiting at 03:48:11.
> 
> [Thu Jan 25 03:40:22.300797 2018] [mpm_event:notice] [pid 13310:tid
> 140455729428224] AH00489: Apache/2.4.29 (Unix) OpenSSL/1.0.2n
> mod_fcgid/2.3.9 mod_auth_kerb/5.4 mod_qos/11.43 mod_jk/1.2.42 configured --
> resuming normal operations
> [Thu Jan 25 03:40:22.300851 2018] [core:notice] [pid 13310:tid
> 140455729428224] AH00094: Command line: '/apache24/bin/httpd -f
> /apache24/conf/dynamic/apache24/httpd.conf -D XXXXX'
> [Thu Jan 25 03:47:58.097848 2018] [mpm_event:notice] [pid 13310:tid
> 140455729428224] AH00494: SIGHUP received.  Attempting to restart
> [Thu Jan 25 03:48:11.467544 2018] [core:notice] [pid 13310:tid
> 140455729428224] AH00060: seg fault or similar nasty error detected in the
> parent process
> 
> so, the diagnosis probably remains roughly the same, some SHM keys are not
> getting removed or not removed quickly enough and are still in place the
> next time the same configuration starts up.

If this is the case maybe we could bake the generation name into the filename.
Comment 9 Jim Jagielski 2018-01-26 14:29:04 UTC
... or possibly re-used?? I'll need to look. It's been a while since I've reviewed that chunk of code.
Comment 10 mark 2018-01-26 17:02:57 UTC
baked here:

https://github.com/apache/httpd/blob/2.4.29/modules/proxy/mod_proxy_balancer.c#L787

        id = apr_psprintf(pconf, "%s.%s.%d.%s.%s.%u.%s",
                          (s->server_scheme ? s->server_scheme : "????"),
                          (s->server_hostname ? s->server_hostname : "???"),
                          (int)s->port,
                          (s->server_admin ? s->server_admin : "??"),
                          (s->defn_name ? s->defn_name : "?"),
                          s->defn_line_number,
                          (s->error_fname ? s->error_fname : DEFAULT_ERRORLOG));

        conf->id = apr_psprintf(pconf, "p%x",
ap_proxy_hashfunc(id, PROXY_HASHFUNC_DEFAULT));
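To illustrate why the names collide across restarts rather than between vhosts: the id string above is built entirely from static configuration fields, so the hashed SHM name is deterministic from one startup to the next, and a stale kernel segment from the previous generation carries exactly the name the new generation will try to create. A hedged sketch, with zlib.crc32 standing in for ap_proxy_hashfunc (the real hash differs; only the determinism matters here) and all hostnames/paths invented:

```python
# Sketch of the id derivation above. conf_id() mirrors the shape of the
# httpd code; zlib.crc32 is only a stand-in for ap_proxy_hashfunc.
import zlib

def conf_id(scheme, hostname, port, admin, defn_name, line, error_fname):
    s = "%s.%s.%d.%s.%s.%d.%s" % (scheme, hostname, port, admin,
                                  defn_name, line, error_fname)
    return "p%x" % (zlib.crc32(s.encode()) & 0xFFFFFFFF)

# Two startups of the same vhost definition produce the same id, hence the
# same slotmem SHM name, and a collision if the old kernel segment lingers.
before = conf_id("http", "vhost1.example", 80, "admin@example",
                 "httpd.conf", 42, "logs/error_log")
after = conf_id("http", "vhost1.example", 80, "admin@example",
                "httpd.conf", 42, "logs/error_log")
print(before, after, before == after)

# A vhost defined on a different config line gets a different id, which is
# why identical balancer names in different vhosts normally don't collide.
other = conf_id("http", "vhost1.example", 80, "admin@example",
                "httpd.conf", 43, "logs/error_log")
print(other, other != before)
```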
Comment 11 Jim Jagielski 2018-01-26 18:30:45 UTC
Yeah, it looks like adding the generation into conf->id will create a unique name. But I need to see how it affects persistence.
Comment 12 Jim Jagielski 2018-01-26 18:51:01 UTC
Upon review, it appears that in slotmem_filenames() there is code that will automagically add generational data to the SHM filename... this is done by default for Win and OS/2.

Are you able to test any fixes?
Comment 13 mark 2018-01-26 20:12:02 UTC
Yes, I can test fixes.
Comment 14 Yann Ylavic 2018-01-28 23:29:31 UTC
Created attachment 35702 [details]
slotmem SHMs reuse (2.4.x)

This patch does:
1/ use a constant file name for all systems (no generation suffix),
2/ maintain the list of the created SHMs *across restarts*,
3/ not unlink the files on (graceful) restart anymore (not needed),
4/ not attach in slotmem_create() anymore (not needed),
5/ add a type/sizes consistency check for persisted slots on restoration,
6/ unlink the files only on stop/exit or before creating them (crash leftovers).

Mark, could you please try it?

I think we could avoid 6/ if we remove the file just after the SHM is created.
This would work for systems with "unlink semantics" (i.e. unlink is allowed while some descriptors are still open, even though the removal really happens when the last one is closed; we don't need to re-open them now), but not for others, so I kept the code generic to start with...
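The "unlink semantics" in question can be sketched with stdlib POSIX shared memory (an analogue only; the segment name below is invented): unlinking a mapped segment leaves the existing mapping usable, and frees the name immediately for the next generation, so a restart cannot hit EEXIST. SysV segments behave differently, the key lingering until the last detach, which is the failure in this report.

```python
# Sketch of "unlink semantics" using stdlib POSIX shared memory (names
# invented). Unlinking while mapped keeps the mapping valid and frees the
# name at once; SysV SHM keys instead linger until the last detach.
from multiprocessing import shared_memory

NAME = "bz62044_unlink_demo"

old = shared_memory.SharedMemory(name=NAME, create=True, size=64)
old.buf[0] = 42   # the "old generation" writes some state
old.unlink()      # the name is gone immediately; the mapping survives

still_readable = old.buf[0]   # old mapping unaffected by the unlink
new = shared_memory.SharedMemory(name=NAME, create=True, size=64)  # no EEXIST
fresh_value = new.buf[0]      # brand-new, zero-filled segment

print(still_readable, fresh_value)

old.close()
new.close()
new.unlink()
```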
Comment 15 mark 2018-01-29 09:26:13 UTC
We were a bit keen for a fix for this morning (29 Jan), so we went with Jim's patch in trunk as it looked very conservative (extending behaviour already tested on Windows to Unix). I didn't see your patch at that point.

http://svn.apache.org/viewvc/httpd/httpd/trunk/modules/slotmem/mod_slotmem_shm.c?r1=1822341&r2=1822340&pathrev=1822341&view=patch

and we're now rolling that out across the pre-production environments today, 29 Jan.

I can't really comment on the relative merits of either approach, so can you give me a recommendation: is this later patch more robust or more comprehensive than Jim's? If you're making a strong recommendation, we will see about pushing that version out to the pre-production environments as an exceptional change, in advance of the next scheduled roll-out.
Comment 16 Yann Ylavic 2018-01-29 10:01:52 UTC
(In reply to mark from comment #15)
> Is this later patch either more robust or more
> comprehensive than Jim's?  If you're making a strong recommendation, we will
> see about pushing that version out to the pre-production environments as an
> exceptional change, in advance of the next scheduled roll-out.
I can't make a recommendation given your time constraints. What I can tell you is that while the Windows approach may indeed avoid the (re)start failures, it does not preserve the state of the balancers across restarts (including graceful ones).
So things like load distribution, error states, etc. are reset/lost, as if it were the first startup.

This is not the right fix for httpd, but it may be enough for your use case...
Comment 17 Yann Ylavic 2018-01-29 10:07:40 UTC
In any case, if you go with the "Windows" approach for your production, we are still interested in your testing of attachment 35702 [details] for the future ;)
Comment 18 mark 2018-01-29 10:48:32 UTC
Thanks for the perspective. We were seeing Apache instances fail and not restart due to the orphaned segments, requiring manual intervention to resolve, hence our urgency.

However, I see your point now: this "Windows" fix loses too much state to be the right long-term fix, and we make extensive use of the proxy balancer feature, so I will see about an exceptional change to test this more comprehensive patch in our pre-production environments.
Comment 19 mark 2018-01-30 11:31:03 UTC
We were able to rebuild and deploy Yann's patch for the pre-production environments and we're not yet seeing slotmem_shm "File Exists" errors. However, we are seeing a lot of orphaned shared segments (i.e. zero attached processes) as though cleanup is not happening appropriately or is getting bypassed.
Comment 20 mark 2018-01-30 11:36:11 UTC
Sorry, I am wrong, we are still seeing the "file exists" error in our logs.

[Tue Jan 30 09:07:05.575349 2018] [slotmem_shm:debug] [pid 3716:tid 139969799624448] mod_slotmem_shm.c(380): AH02602: create didn't find /var/run/http/apache24/tmp/slotmem-shm-p7a67b429_balancer1.shm in global list
[Tue Jan 30 09:07:05.575357 2018] [slotmem_shm:debug] [pid 3716:tid 139969799624448] mod_slotmem_shm.c(390): AH02300: create /var/run/http/apache24/tmp/slotmem-shm-p7a67b429_balancer1.shm: 1176/2
[Tue Jan 30 09:07:05.575398 2018] [slotmem_shm:error] [pid 3716:tid 139969799624448] (17)File exists: AH02611: create: apr_shm_create(/var/run/http/apache24/tmp/slotmem-shm-p7a67b429_balancer1.shm) failed
[Tue Jan 30 09:07:05.575442 2018] [:emerg] [pid 3716:tid 139969799624448] AH00020: Configuration Failed, exiting
Comment 21 mark 2018-01-30 13:48:31 UTC
In the patched file, line 396 was updated to use "gpool", I believe; should line 395 have been updated as well?

    393     {
    394         if (fbased) {
    395             apr_shm_remove(fname, pool);
    396             rv = apr_shm_create(&shm, size, fname, gpool);
    397         }
Comment 22 mark 2018-01-30 14:28:21 UTC
Anyway, in the absence of other ideas, we're going to revert to the more conservative patch, even at the cost of cross-generation persistence, at

http://svn.apache.org/viewvc/httpd/httpd/trunk/modules/slotmem/mod_slotmem_shm.c?r1=1822341&r2=1822340&pathrev=1822341&view=patch

for now.
Comment 23 mark 2018-01-31 09:22:34 UTC
That more conservative patch doesn't seem to have helped either.

[Wed Jan 31 08:44:12.677361 2018] [proxy:debug] [pid 58615:tid 140446564935424] proxy_util.c(1225): AH02337: copying shm[2] (0x7fbc398b07d8) for balancer://balancer3
[Wed Jan 31 08:44:12.677429 2018] [slotmem_shm:debug] [pid 58615:tid 140446564935424] mod_slotmem_shm.c(331): AH02602: create didn't find /var/run/http/apache24/tmp/slotmem-shm-p5dfa5b80_balancer3_0.shm in global list
[Wed Jan 31 08:44:12.677469 2018] [slotmem_shm:debug] [pid 58615:tid 140446564935424] mod_slotmem_shm.c(341): AH02300: create /var/run/http/apache24/tmp/slotmem-shm-p5dfa5b80_balancer3_0.shm: 1176/2
[Wed Jan 31 08:44:12.677585 2018] [slotmem_shm:error] [pid 58615:tid 140446564935424] (17)File exists: AH02611: create: apr_shm_create(/var/run/http/apache24/tmp/slotmem-shm-p5dfa5b80_balancer3_0.shm) failed
[Wed Jan 31 08:44:12.677677 2018] [:emerg] [pid 58615:tid 140446564935424] AH00020: Configuration Failed, exiting

We keep bumping into previously created keys. I wonder if our balancer naming isn't distinctive enough: literally every vhost gets balancer1, balancer2, balancer3, so those names appear hundreds or thousands of times per configuration, though always inside a VirtualHost container.

Any ideas?
Comment 24 Yann Ylavic 2018-01-31 10:01:17 UTC
Created attachment 35710 [details]
Unique balancer id per vhost

It seems indeed that if the balancer:// names are not unique, the slotmem is reused across vhosts.

Does this patch help?
Comment 25 mark 2018-01-31 12:19:20 UTC
Perhaps, but we're seeing the error in the balancer SHMs as well as the worker SHMs, and the balancer SHM already uses conf->id as a distinguisher.

https://github.com/apache/httpd/blob/2.4.29/modules/proxy/mod_proxy_balancer.c#L814

For this balancer (not worker), even with Jim's change, we saw the following. Summarizing first:

12:26:41  - attach found and attached to slotmem-shm-p701d8bbe_0
12:33:51  - SIGHUP
12:34:54  - create (not attach) fails to find slotmem-shm-p701d8bbe_0
12:34:54  - create fails to create because the SHM key/segment is still in the kernel
12:38:54  - create (under a new PID) fails to find slotmem-shm-p701d8bbe_0 but successfully creates it, presumably because all attached processes had finally detached.

Why didn't the generation change? It was zero before and after the HUP.

[Wed Jan 31 12:26:41.463136 2018] [slotmem_shm:debug] [pid 1322:tid 139715805775616] mod_slotmem_shm.c(463): AH02301: attach looking for /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm
[Wed Jan 31 12:26:41.463169 2018] [slotmem_shm:debug] [pid 1322:tid 139715805775616] mod_slotmem_shm.c(476): AH02302: attach found /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
[Wed Jan 31 12:33:51.761487 2018] [mpm_event:notice] [pid 65265:tid 139715805775616] AH00494: SIGHUP received.  Attempting to restart
[Wed Jan 31 12:34:54.471933 2018] [slotmem_shm:debug] [pid 20672:tid 139965041129216] mod_slotmem_shm.c(331): AH02602: create didn't find /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm in global list
[Wed Jan 31 12:34:54.471939 2018] [slotmem_shm:debug] [pid 20672:tid 139965041129216] mod_slotmem_shm.c(341): AH02300: create /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
[Wed Jan 31 12:34:54.471970 2018] [slotmem_shm:error] [pid 20672:tid 139965041129216] (17)File exists: AH02611: create: apr_shm_create(/var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm) failed
[Wed Jan 31 12:38:46.746713 2018] [slotmem_shm:debug] [pid 31117:tid 140506605512448] mod_slotmem_shm.c(331): AH02602: create didn't find /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm in global list
[Wed Jan 31 12:38:46.746719 2018] [slotmem_shm:debug] [pid 31117:tid 140506605512448] mod_slotmem_shm.c(341): AH02300: create /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
[Wed Jan 31 12:38:46.746893 2018] [slotmem_shm:debug] [pid 31117:tid 140506605512448] mod_slotmem_shm.c(384): AH02611: create: apr_shm_create(/var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm) succeeded
[Wed Jan 31 12:38:49.922030 2018] [mpm_event:notice] [pid 31117:tid 140506605512448] AH00489: Apache/2.4.29 (Unix) OpenSSL/1.0.2n mod_fcgid/2.3.9 mod_auth_kerb/5.4 mod_qos/11.43 mod_jk/1.2.42 configured -- resuming
Comment 26 Yann Ylavic 2018-01-31 13:20:00 UTC
(In reply to mark from comment #25)
> 
> [Wed Jan 31 12:26:41.463136 2018] [slotmem_shm:debug] [pid 1322:tid
> 139715805775616] mod_slotmem_shm.c(463): AH02301: attach looking for
> /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm
> [Wed Jan 31 12:26:41.463169 2018] [slotmem_shm:debug] [pid 1322:tid
> 139715805775616] mod_slotmem_shm.c(476): AH02302: attach found
^ This is a child process attaching the SHMs created by the parent process.

> /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
> [Wed Jan 31 12:33:51.761487 2018] [mpm_event:notice] [pid 65265:tid
> 139715805775616] AH00494: SIGHUP received.  Attempting to restart
^ This is the parent process asked to restart (non graceful).

> [Wed Jan 31 12:34:54.471933 2018] [slotmem_shm:debug] [pid 20672:tid
> 139965041129216] mod_slotmem_shm.c(331): AH02602: create didn't find
> /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm in global list
> [Wed Jan 31 12:34:54.471939 2018] [slotmem_shm:debug] [pid 20672:tid
> 139965041129216] mod_slotmem_shm.c(341): AH02300: create
> /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
> [Wed Jan 31 12:34:54.471970 2018] [slotmem_shm:error] [pid 20672:tid
> 139965041129216] (17)File exists: AH02611: create:
^ This is *another* parent process (not the same pid); ditto for the following messages (stripped here).

How so? One minute for a non-graceful restart looks huge too.
Do you have multiple instances of httpd running (and using the same log file)?
Could you monitor the processes here?
Comment 27 Yann Ylavic 2018-01-31 13:34:46 UTC
(In reply to mark from comment #7)
> 
> " AH00060: seg fault or similar nasty error detected in the parent process"
> but I cannot tell what it's referring to.

The parent process crashed leaving children orphaned (hence attached to SHMs).

You possibly need this patch too: https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-PR61558.patch
It was merged for upcoming 2.4.30 already (r1820794).
See Bug 61558.
Comment 28 Yann Ylavic 2018-01-31 14:02:19 UTC
(In reply to Yann Ylavic from comment #14)
> Created attachment 35702 [details]
> slotmem SHMs reuse (2.4.x)

Committed to trunk in r1822509.

(In reply to Yann Ylavic from comment #24)
> Created attachment 35710 [details]
> Unique balancer id per vhost

Committed to trunk in r1822800.
Comment 29 Yann Ylavic 2018-01-31 14:30:10 UTC
> (In reply to Yann Ylavic from comment #24)
> > Created attachment 35710 [details]
> > Unique balancer id per vhost
> 
> Committed to trunk in r1822800.
Reverted, all was there already (sname vs name).
Comment 30 mark 2018-01-31 16:55:55 UTC
(In reply to Yann Ylavic from comment #26)
> (In reply to mark from comment #25)
> > 
> > [Wed Jan 31 12:26:41.463136 2018] [slotmem_shm:debug] [pid 1322:tid
> > 139715805775616] mod_slotmem_shm.c(463): AH02301: attach looking for
> > /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm
> > [Wed Jan 31 12:26:41.463169 2018] [slotmem_shm:debug] [pid 1322:tid
> > 139715805775616] mod_slotmem_shm.c(476): AH02302: attach found
> ^ This is a child process attaching the SHMs created by the parent process.
> 
> > /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
> > [Wed Jan 31 12:33:51.761487 2018] [mpm_event:notice] [pid 65265:tid
> > 139715805775616] AH00494: SIGHUP received.  Attempting to restart
> ^ This is the parent process asked to restart (non graceful).
> 
> > [Wed Jan 31 12:34:54.471933 2018] [slotmem_shm:debug] [pid 20672:tid
> > 139965041129216] mod_slotmem_shm.c(331): AH02602: create didn't find
> > /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm in global list
> > [Wed Jan 31 12:34:54.471939 2018] [slotmem_shm:debug] [pid 20672:tid
> > 139965041129216] mod_slotmem_shm.c(341): AH02300: create
> > /var/run/http/apache24/tmp/slotmem-shm-p701d8bbe_0.shm: 992/6
> > [Wed Jan 31 12:34:54.471970 2018] [slotmem_shm:error] [pid 20672:tid
> > 139965041129216] (17)File exists: AH02611: create:
> ^ This is *another* parent process(not the same pid), ditto for the
> following messages (stripped here).
> 
> How so? One minute for a non-graceful restart looks huge too.
> Do you have multiple instances of httpd running (and using the same log
> file)?
> Could you monitor the processes here?

We have multiple configurations running, but each with its own log files. We have both Apache 2.2 and Apache 2.4 configurations running side by side, but completely isolated in terms of configuration, log and run directories. Each of our configuration files tends to have around 200k lines including comments and blank lines, and we use a lot of third-party modules, so they're big configurations.
Comment 31 mark 2018-01-31 17:23:10 UTC
(In reply to Yann Ylavic from comment #27)
> (In reply to mark from comment #7)
> > 
> > " AH00060: seg fault or similar nasty error detected in the parent process"
> > but I cannot tell what it's referring to.
> 
> The parent process crashed leaving children orphaned (hence attached to
> SHMs).
> 
> You possibly need this patch too:
> https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-
> PR61558.patch
> It was merged for upcoming 2.4.30 already (r1820794).
> See Bug 61558.

I can't see evidence of a crash beyond that message. Could it be referring to the exit triggered by the "file exists" problem?

i.e. a HUP is received, the SHMs are marked as deleted but processes are still attached, so they are still present for the HUP restart; that triggers the "crash" exit, and thus other SHMs fail to get deleted?
Comment 32 mark 2018-01-31 21:15:09 UTC
So sig_coredump is being triggered by an unknown signal, multiple times a day. It's not a segfault; there's nothing in /var/log/messages. That results in a bunch of undeleted shared memory segments, and probably some that will no longer be in the global list but are still present in the kernel.
Comment 33 Yann Ylavic 2018-01-31 22:44:51 UTC
Mark, followed up on dev@ since debugging is not really suitable in Bugzilla. Thanks.
Comment 34 Yann Ylavic 2018-02-08 16:32:13 UTC
Created attachment 35723 [details]
Reuse SHMs names on restart or stop/start (2.4.x)

This is the full patch proposed to be backported to 2.4.next.

It should reuse the SHMs names as much as possible on restart or stop/start, which should address the increasing number of IPCs on the system if/when the parent process crashes.

Please note that it won't reuse SHMs if, by some means, child processes from an old httpd instance (whose parent process crashed) are still alive; that situation is not desirable anyway.

Could you test it with your large configuration?
Comment 35 Graham Leggett 2018-02-13 22:39:57 UTC
Backported to v2.4.30.
Comment 36 mark 2018-02-14 07:29:54 UTC
(In reply to Yann Ylavic from comment #34)
> Created attachment 35723 [details]
> Reuse SHMs names on restart or stop/start (2.4.x)
> 
> This is the full patch proposed to be backported to 2.4.next.
> 
> It should reuse the SHMs names as much as possible on restart or stop/start,
> which should address the increasing number of IPCs on the system if/when the
> parent process crashes.
> 
> Please note that it won't reuse SHMs if by some means children process from
> an old httpd instance (whose parent process crashed) are still alive, this
> is not something desirable.
> 
> Could you test it with your large configuration?

Thanks, we will aim to test it in our next scheduled update, early March.