Bug 62277

Summary: mod_slotmem_shm is causing error in apache2.4.33 loading
Product: Apache httpd-2 Reporter: Tauseef Anjum <tanjum>
Component: mod_slotmem_plain / mod_slotmem_shm Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: critical CC: cbarbara
Priority: P2    
Version: 2.4.33   
Target Milestone: ---   
Hardware: Sun   
OS: Solaris   
Attachments: http.conf
vhosts
balancer
Use double hash for APR's IPC SysV ftok()
ftok() collisions tool
New tool with full shmget()/shmat()

Description Tauseef Anjum 2018-04-10 11:30:04 UTC
I am in the process of upgrading from Apache 2.2 to 2.4.33, and everything works fine with and without SSL. But when I try to use the proxy modules along with the slotmem_shm module, which to my understanding is now mandatory when using the load balancer modules, I get the following errors in error_log:

[Fri Mar 16 17:52:12.073931 2018] [lbmethod_heartbeat:notice] [pid 15571:tid 1] AH02282: No slotmem from mod_heartmonitor
[Fri Mar 16 17:52:12.275257 2018] [slotmem_shm:error] [pid 15571:tid 1] (17)File exists: AH02611: create: apr_shm_create(/usr/local/apache_QA_New/logs/slotmem-shm-p9cbf72c_check.shm) failed
[Fri Mar 16 17:52:12.275543 2018] [proxy_balancer:emerg] [pid 15571:tid 1] (17)File exists: AH01185: worker slotmem_create failed
[Fri Mar 16 17:52:12.275721 2018] [:emerg] [pid 15571:tid 1] AH00020: Configuration Failed, exiting

I have been brainstorming for the last 5 weeks. I have tried different solutions, like increasing the kernel semaphore memory from about 128K to 16384K, but this still does not seem to work, and it does not seem to be the proper solution either.

There are almost 30 balancer entries in my balancer file and more than 70 virtual host entries in httpd-vhosts.conf. At each restart Apache creates some of the shm files and then goes down, and it keeps doing so until all the files have been created, so I have to restart it again and again.



Moreover, this is on QA, and I am not proceeding to production due to this issue.
Comment 1 Eric Covener 2018-04-10 11:32:18 UTC
Do you have any pairs of identical virtual hosts?
Comment 2 Tauseef Anjum 2018-04-10 11:40:49 UTC
(In reply to Eric Covener from comment #1)
> Do you have any pairs of identical virtual hosts?

Well, most of them point to the same application, but their DNS names are always different. Plus, if I reduce the virtual hosts in httpd-vhosts, it works fine for some of them, so I don't think identical entries are the issue.
Comment 3 Yann Ylavic 2018-04-10 13:39:56 UTC
(In reply to Tauseef Anjum from comment #0)
> [Fri Mar 16 17:52:12.275257 2018] [slotmem_shm:error] [pid 15571:tid 1]
> (17)File exists: AH02611: create:
> apr_shm_create(/usr/local/apache_QA_New/logs/slotmem-shm-p9cbf72c_check.shm)
> failed

I wonder where this "_check" suffix comes from, is it vanilla httpd?
Comment 4 Tauseef Anjum 2018-04-10 13:47:33 UTC
(In reply to Yann Ylavic from comment #3)
> (In reply to Tauseef Anjum from comment #0)
> > [Fri Mar 16 17:52:12.275257 2018] [slotmem_shm:error] [pid 15571:tid 1]
> > (17)File exists: AH02611: create:
> > apr_shm_create(/usr/local/apache_QA_New/logs/slotmem-shm-p9cbf72c_check.shm)
> > failed
> 
> I wonder where this "_check" suffix comes from, is it vanilla httpd?

Yann: I think it is taking this suffix from the vhost entry of the application, as you can see in the latest logs below for a different application:

[Wed Mar 28 11:25:28.464483 2018] [ssl:warn] [pid 9617:tid 1] AH01909: example.com:443:0 server certificate does NOT include an ID which matches the server name
[Wed Mar 28 11:25:28.784271 2018] [ssl:warn] [pid 9621:tid 1] AH01909: www.example.com:443:0 server certificate does NOT include an ID which matches the server name
[Wed Mar 28 11:25:28.795671 2018] [ssl:warn] [pid 9621:tid 1] AH01909: example.com:443:0 server certificate does NOT include an ID which matches the server name
[Wed Mar 28 11:25:29.277953 2018] [slotmem_shm:error] [pid 9621:tid 1] (17)File exists: AH02611: create: apr_shm_create(/usr/local/apache_QA/logs/slotmem-shm-pdad182b4_t_mobile.shm) failed
[Wed Mar 28 11:25:29.278383 2018] [proxy_balancer:emerg] [pid 9621:tid 1] (17)File exists: AH01185: worker slotmem_create failed
[Wed Mar 28 11:25:29.278513 2018] [:emerg] [pid 9621:tid 1] AH00020: Configuration Failed, exiting




Moreover, I don't know what you mean by vanilla; it is version 2.4.33, downloaded from the Apache site.
Comment 5 Yann Ylavic 2018-04-10 13:56:29 UTC
(In reply to Tauseef Anjum from comment #4)
> Yann: I think it is taking this suffix from the vhost entry of the
> application, as you can see in the latest logs below for a different application:

I don't see where the httpd code uses the name of the vhost directly; it's supposed to be a hash/digest which composes the slotmem file name.

> Moreover, I don't know what you mean by vanilla; it is version 2.4.33,
> downloaded from the Apache site.

I meant unpatched httpd, sorry for language abuse.
Comment 6 Tauseef Anjum 2018-04-10 13:58:24 UTC
(In reply to Yann Ylavic from comment #5)
> (In reply to Tauseef Anjum from comment #4)
> > Yann: I think it is taking this suffix from the vhost entry of the
> > application, as you can see in the latest logs below for a different application:
> 
> I don't see where the httpd code uses the name of the vhost directly; it's
> supposed to be a hash/digest which composes the slotmem file name.
> 
> > Moreover, I don't know what you mean by vanilla; it is version 2.4.33,
> > downloaded from the Apache site.
> 
> I meant unpatched httpd, sorry for language abuse.

It should be unpatched, as it was downloaded from their site.
Plus, in case you need my configure options, let me know.
Comment 7 Yann Ylavic 2018-04-10 14:11:36 UTC
Yes, please attach your httpd.conf (anonymized if needed) or a simpler one that reproduces the issue.
Comment 8 Tauseef Anjum 2018-04-10 14:25:50 UTC
Created attachment 35851 [details]
http.conf

Find the attached httpd.conf
Comment 9 Yann Ylavic 2018-04-10 14:28:36 UTC
Please also provide "conf/extra/httpd-vhosts.conf".
Comment 10 Tauseef Anjum 2018-04-10 14:32:18 UTC
Created attachment 35852 [details]
vhosts

httpd-vhosts files
Comment 11 Yann Ylavic 2018-04-10 14:44:03 UTC
And "conf/balancer-member-entries.conf" please, actually I thought the balancers were declared with the vhosts.
Comment 12 Tauseef Anjum 2018-04-10 15:34:04 UTC
Created attachment 35853 [details]
balancer
Comment 13 Yann Ylavic 2018-04-11 10:10:06 UTC
Thanks Tauseef for the configuration files.

Unfortunately I can't reproduce on Linux with the same configuration (minus certificate files...), I must be missing something.

Could there, by any chance, be another httpd server running at the same time?

Also, could you please provide the output of "httpd -V"?
Solaris probably uses a different SHM mechanism than Linux by default (at least than the one I configured); this could be a lead too, if short balancer names like the ones used in your configuration start to collide at the system level.
Comment 14 Tauseef Anjum 2018-04-11 10:50:24 UTC
(In reply to Yann Ylavic from comment #13)
> Thanks Tauseef for the configuration files.
> 
> Unfortunately I can't reproduce on Linux with the same configuration (minus
> certificate files...), I must be missing something.
> 
> Could there, by any chance, be another httpd server running at the same time?
> 
> Also, could you please provide the output of "httpd -V"?
> Solaris probably uses a different SHM mechanism than Linux by default (at
> least than the one I configured); this could be a lead too, if short balancer
> names like the ones used in your configuration start to collide at the
> system level.

Yann, there is another Apache configured on the server, but I stop the earlier version before starting this one. Plus, this is the output of apache -V:
Server version: Apache/2.4.33 (Unix)
Server built:   Mar 26 2018 16:49:08
Server's Module Magic Number: 20120211:76
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1
Compiled using: APR 1.6.3, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_PROC_PTHREAD_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/usr/local/apache_QA"
 -D SUEXEC_BIN="/usr/local/apache_QA/bin/suexec"
 -D DEFAULT_PIDLOG="logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

And about the short balancer names, I don't think they should be a problem, as I have it working with fewer balancer entries after increasing the memory at kernel level!
Comment 15 Yann Ylavic 2018-04-11 17:18:05 UTC
Thanks, it looks like the default SHM mechanism for Solaris is used (nothing forced); I will figure out which one from the APR code. Was the system memory you increased for IPCs (SysV)?

Also, could you provide the error_log (at LogLevel debug) for a failing startup, please? It's hard to diagnose without reproducing, so debug/trace logs are welcome.
Comment 16 Tauseef Anjum 2018-04-12 17:40:59 UTC
(In reply to Yann Ylavic from comment #15)
> Thanks, it looks like the default SHM mechanism for Solaris is used (nothing
> forced); I will figure out which one from the APR code. Was the system memory
> you increased for IPCs (SysV)?
> 
> Also, could you provide the error_log (at LogLevel debug) for a failing
> startup, please? It's hard to diagnose without reproducing, so debug/trace
> logs are welcome.

Yes, the memory I increased was for IPCs.
Plus, here are the logs in debug mode for some of the vhosts:

[Thu Apr 12 22:37:06.761982 2018] [proxy_balancer:debug] [pid 4718:tid 1] mod_proxy_balancer.c(989): AH01184: Doing workers create: balancer://cs (p597b5960_cs), 984, 1 [12]
[Thu Apr 12 22:37:06.762181 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(447): AH02602: create didn't find /usr/local/apache_QA/logs/slotmem-shm-p597b5960_cs.shm in global list
[Thu Apr 12 22:37:06.762289 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(457): AH02300: create /usr/local/apache_QA/logs/slotmem-shm-p597b5960_cs.shm: 984/1
[Thu Apr 12 22:37:06.763418 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(480): AH02611: create: apr_shm_create(/usr/local/apache_QA/logs/slotmem-shm-p597b5960_cs.shm) succeeded
[Thu Apr 12 22:37:06.763618 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1763): AH02338: copying shm[0] (0xffffffff66000018) for worker: http://192.168.150.217:8444
[Thu Apr 12 22:37:06.763951 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1225): AH02337: copying shm[13] (0xffffffff67a01bb8) for balancer://cp
[Thu Apr 12 22:37:06.764220 2018] [proxy_balancer:debug] [pid 4718:tid 1] mod_proxy_balancer.c(989): AH01184: Doing workers create: balancer://cp (p597b5960_cp), 984, 1 [13]
[Thu Apr 12 22:37:06.764370 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(447): AH02602: create didn't find /usr/local/apache_QA/logs/slotmem-shm-p597b5960_cp.shm in global list
[Thu Apr 12 22:37:06.764480 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(457): AH02300: create /usr/local/apache_QA/logs/slotmem-shm-p597b5960_cp.shm: 984/1
[Thu Apr 12 22:37:06.765382 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(480): AH02611: create: apr_shm_create(/usr/local/apache_QA/logs/slotmem-shm-p597b5960_cp.shm) succeeded
[Thu Apr 12 22:37:06.765579 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1763): AH02338: copying shm[0] (0xffffffff65e00018) for worker: http://192.168.150.217:8445
[Thu Apr 12 22:37:06.765906 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1225): AH02337: copying shm[14] (0xffffffff67a01dd8) for balancer://ic
[Thu Apr 12 22:37:06.766179 2018] [proxy_balancer:debug] [pid 4718:tid 1] mod_proxy_balancer.c(989): AH01184: Doing workers create: balancer://ic (p597b5960_ic), 984, 1 [14]
[Thu Apr 12 22:37:06.766333 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(447): AH02602: create didn't find /usr/local/apache_QA/logs/slotmem-shm-p597b5960_ic.shm in global list
[Thu Apr 12 22:37:06.766445 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(457): AH02300: create /usr/local/apache_QA/logs/slotmem-shm-p597b5960_ic.shm: 984/1
[Thu Apr 12 22:37:06.767346 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(480): AH02611: create: apr_shm_create(/usr/local/apache_QA/logs/slotmem-shm-p597b5960_ic.shm) succeeded
[Thu Apr 12 22:37:06.767558 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1763): AH02338: copying shm[0] (0xffffffff65c00018) for worker: http://192.168.150.217:8459
[Thu Apr 12 22:37:06.767893 2018] [proxy:debug] [pid 4718:tid 1] proxy_util.c(1225): AH02337: copying shm[15] (0xffffffff67a01ff8) for balancer://mob
[Thu Apr 12 22:37:06.768168 2018] [proxy_balancer:debug] [pid 4718:tid 1] mod_proxy_balancer.c(989): AH01184: Doing workers create: balancer://mob (p597b5960_mob), 984, 2 [15]
[Thu Apr 12 22:37:06.768321 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(447): AH02602: create didn't find /usr/local/apache_QA/logs/slotmem-shm-p597b5960_mob.shm in global list
[Thu Apr 12 22:37:06.768428 2018] [slotmem_shm:debug] [pid 4718:tid 1] mod_slotmem_shm.c(457): AH02300: create /usr/local/apache_QA/logs/slotmem-shm-p597b5960_mob.shm: 984/2
[Thu Apr 12 22:37:06.769121 2018] [slotmem_shm:error] [pid 4718:tid 1] (17)File exists: AH02611: create: apr_shm_create(/usr/local/apache_QA/logs/slotmem-shm-p597b5960_mob.shm) failed
[Thu Apr 12 22:37:06.769413 2018] [proxy_balancer:emerg] [pid 4718:tid 1] (17)File exists: AH01185: worker slotmem_create failed
[Thu Apr 12 22:37:06.769530 2018] [:emerg] [pid 4718:tid 1] AH00020: Configuration Failed, exiting
Comment 17 Yann Ylavic 2018-04-12 21:10:07 UTC
(In reply to Tauseef Anjum from comment #16)
> Yes, the memory I increased was for IPCs.

I really suspect a collision in the filenames => tokens mapping of IPC SysV SHMs.


> [Thu Apr 12 22:37:06.768428 2018] [slotmem_shm:debug] [pid 4718:tid 1]
> mod_slotmem_shm.c(457): AH02300: create
> /usr/local/apache_QA/logs/slotmem-shm-p597b5960_mob.shm: 984/2

Do you see this same AH02300 message for the same path ("/usr/local/apache_QA/logs/slotmem-shm-p597b5960_mob.shm") anywhere earlier in the log file (starting from a clean log file with no messages from a previous startup)?

If not, I can only think of a collision; you may want to try the patch provided in the next comment (against the APR lib).
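To illustrate the suspected mechanism: the "(17)File exists" in AH02611 is the text for errno EEXIST (17 on both Linux and Solaris). With IPC SysV SHMs, a segment is identified by the key that ftok() derives from the file, not by the file name. An exclusive create can therefore fail with EEXIST for a brand-new .shm file if another segment already holds the same key. The sketch below is a simplified model, not APR's apr_shm_create(). The helper name create_keyed_shm() and the caller-chosen proj_id are assumptions for illustration.

```c
/* Sketch: exclusive SysV shared memory create keyed by a file path.
 * ASSUMPTION: simplified model for illustration, not APR's apr_shm_create();
 * create_keyed_shm() is a hypothetical helper. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns a SysV shm id, or -1 with errno set.
 * errno == EEXIST means a segment already exists for the *key*; a different
 * path whose ftok() key collides with an existing one fails the same way. */
int create_keyed_shm(const char *path, int proj_id, size_t size)
{
    assert(path != NULL);
    key_t key = ftok(path, proj_id);  /* path must name an existing file */
    if (key == (key_t)-1)
        return -1;                    /* errno set by ftok() */
    /* IPC_EXCL: fail instead of silently reusing an existing segment. */
    return shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
}
```

If you call it twice with the same path and proj_id, the second call fails with EEXIST. Two different paths whose ftok() keys happen to coincide fail in the same way, which matches the symptom in this report. Segments created this way persist until they are removed with shmctl(id, IPC_RMID, NULL).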
Comment 18 Yann Ylavic 2018-04-12 21:19:44 UTC
Created attachment 35868 [details]
Use double hash for APR's IPC SysV ftok()

Does this patch help?
Comment 19 Tauseef Anjum 2018-04-13 10:10:31 UTC
(In reply to Yann Ylavic from comment #18)
> Created attachment 35868 [details]
> Use double hash for APR's IPC SysV ftok()
> 
> Does this patch help?

No, it didn't help.
Comment 20 Yann Ylavic 2018-04-13 10:23:34 UTC
And what about AH02300 in comment 17? If this log message (same file path/name) doesn't appear twice at startup, there could be something wrong with Solaris' ftok() (proj_id parameter ignored?).

Maybe you could try to extend the "balancer://mob" name a bit (like "balancer://mob123456789") and see if it still fails there?
Comment 21 Tauseef Anjum 2018-04-13 10:28:56 UTC
(In reply to Yann Ylavic from comment #20)
> And what about AH02300 in comment 17? If this log message (same file path/name)
> doesn't appear twice at startup, there could be something wrong with Solaris'
> ftok() (proj_id parameter ignored?).
> 
> Maybe you could try to extend the "balancer://mob" name a bit (like
> "balancer://mob123456789") and see if it still fails there?

No, there was no file with this name before, and the name is not the issue, as it fails on a different balancer at each restart.
FYI, if I remove my httpd-vhosts entry, it works fine.
Furthermore, the same issue is occurring on Linux too for me!
Comment 22 Yann Ylavic 2018-04-13 11:25:58 UTC
Just retested with your httpd-vhosts.conf and balancer-member-entries.conf and it works for me on my Linux (Debian 4.15.4-1), with IPC SysV SHMs and sysctl's kernel.shmmni=16384, and with or without attachment 35868 [details].

So there must be something, but I can't reproduce. Maybe the full startup log would help.
Comment 23 Tauseef Anjum 2018-04-13 11:50:24 UTC
(In reply to Yann Ylavic from comment #22)
> Just retested with your httpd-vhosts.conf and balancer-member-entries.conf
> and it works for me on my Linux (Debian 4.15.4-1), with IPC SysV SHMs and
> sysctl's kernel.shmmni=16384, and with or without attachment 35868 [details].
> 
> So there must be something, but I can't reproduce. Maybe the full startup
> log would help.

Well, I haven't set kernel.shmmni=16384 on Linux; maybe that is causing the issue, because when I comment out the vhosts file it works fine.
Comment 24 Yann Ylavic 2018-04-13 12:05:34 UTC
If I don't increase kernel.shmmni, the default value is not enough for me because your configuration requires more than 4K SHMs. But the error in this case is rather "No space left on device", not the "File exists" you are reporting.

Note that with your configuration, the more vhosts the more SHMs (exponentially) since balancers are declared globally though they are not shared between vhosts.
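A rough count helps with the shmmni discussion. This estimate assumes that each balancer declared in the main server is replicated into the main server and into every vhost, with one segment per balancer per server. That is an assumption: the debug log in comment 16 also shows a per-server balancer table, so the real count can be higher. Under it, the number of segments is at least (vhosts + 1) * balancers. Both function names are hypothetical.

```c
/* Rough lower bound on the number of SysV segments for a configuration
 * where every balancer is replicated into the main server and each vhost.
 * ASSUMPTION: one segment per balancer per server; the real count can be
 * higher. Both helpers are hypothetical. */
#include <assert.h>
#include <limits.h>

unsigned long estimate_shm_segments(unsigned long vhosts, unsigned long balancers)
{
    /* guard against unsigned overflow of (vhosts + 1) * balancers */
    assert(balancers == 0 || vhosts < ULONG_MAX / balancers);
    return (vhosts + 1UL) * balancers;    /* +1 for the main server */
}

/* Nonzero if the estimate exceeds a system-wide id limit such as
 * Linux kernel.shmmni or Solaris project.max-shm-ids. */
int exceeds_id_limit(unsigned long segments, unsigned long limit)
{
    return segments > limit;
}
```

With the figures from the description (about 30 balancers and 70 vhosts), this gives at least 2130 segments. That is well above the Solaris default of 256 for project.max-shm-ids mentioned in comment 33, but below kernel.shmmni=16384.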

So please distinguish between those two errors in your testing, the first one is about the need to increase kernel.shmmni, the second one is a mystery...
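The two failure modes can be told apart by errno alone. The mapping below is an illustrative helper, not part of httpd or APR. shmget() reports ENOSPC when, for example, the system-wide limit on shared memory identifiers has been reached. It reports EEXIST, with IPC_CREAT | IPC_EXCL, when a segment already exists for the key.

```c
/* Illustrative helper (not part of httpd or APR): map the shmget()
 * errno values discussed in comment 24 to a next step. */
#include <assert.h>
#include <errno.h>

const char *diagnose_shm_create_errno(int err)
{
    assert(err != 0);  /* only meaningful after a failed call */
    switch (err) {
    case ENOSPC:  /* "No space left on device": e.g. the id limit was reached */
        return "raise the shared memory id limit (kernel.shmmni, project.max-shm-ids)";
    case EEXIST:  /* "File exists": a segment already holds this key */
        return "suspect an IPC key collision; consider POSIX SHM";
    default:
        return "other error; check the full startup log";
    }
}
```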
Comment 25 Tauseef Anjum 2018-04-13 12:10:08 UTC
(In reply to Yann Ylavic from comment #24)
> If I don't increase kernel.shmmni, the default value is not enough for me
> because your configuration requires more than 4K SHMs. But the error in this
> case is rather "No space left on device", not the "File exists" you are
> reporting.
> 
> Note that with your configuration, the more vhosts the more SHMs
> (exponentially) since balancers are declared globally though they are not
> shared between vhosts.
> 
> So please distinguish between those two errors in your testing, the first
> one is about the need to increase kernel.shmmni, the second one is a
> mystery...

Yes, you are right: Linux gives the "No space left on device" error, which is understandable since my machine has 4 GB of RAM, which I think is not enough. But we have to resolve the second one.
Comment 26 Yann Ylavic 2018-04-13 12:19:29 UTC
No Solaris at hand to confirm or refute my suspicion about ftok().

Maybe you could try configuring APR (or httpd, if built --with-included-apr) with --enable-posix-shm. This will switch from IPC SysV SHMs to POSIX ones, which may well not have the same issue.
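For comparison, POSIX shared memory objects are looked up by name through shm_open() rather than by an ftok() key. An exclusive create therefore fails with EEXIST only when an object with that exact name already exists. This is a minimal sketch under that assumption. It does not reproduce how an APR build with --enable-posix-shm derives its names, and create_posix_shm() is a hypothetical helper. On older glibc versions, shm_open() also needs linking with -lrt.

```c
/* Sketch: exclusive create of a named POSIX shared memory object.
 * ASSUMPTION: illustrative only; create_posix_shm() is hypothetical and
 * APR's posix-shm code may name objects differently. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the mapping, or MAP_FAILED with errno set (EEXIST if the name
 * is already in use). The caller later calls munmap() and shm_unlink(). */
void *create_posix_shm(const char *name, size_t size)
{
    assert(name != NULL && name[0] == '/');  /* portable names start with '/' */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) == -1) {  /* give the object its size */
        int saved = errno;
        close(fd);
        shm_unlink(name);
        errno = saved;
        return MAP_FAILED;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        int saved = errno;
        close(fd);
        shm_unlink(name);
        errno = saved;
        return MAP_FAILED;
    }
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return p;
}
```

The design point is that the full name is the identity, so no bits are discarded the way ftok() discards them.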
Comment 27 Yann Ylavic 2018-04-13 13:46:07 UTC
Created attachment 35873 [details]
ftok() collisions tool

Small tool to detect ftok() collisions on filenames, with the proj_id hashed as in the APR lib.

The archive contains "ftok_collisions.c" and a bunch of empty files named off your balancers.

$ mkdir tmp && cd tmp
$ tar xzf ftok_collisions.tar.gz
$ gcc ftok_collisions.c -o ftok_collisions
$ ./ftok_collisions *
0 collision(s) found
$ ./ftok_collisions ftok_collisions.c *
Collision between 'ftok_collisions.c' and 'ftok_collisions.c'
1 collision(s) found

What's the output for you when compiled and run on Solaris?
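The idea behind the attached tool can be sketched independently: compute one key per file and count repeated keys. This is not Yann's ftok_collisions.c. It uses a single fixed proj_id, whereas his tool hashes the proj_id from the file name, and the function names are hypothetical. Like the tool, it reports one collision when the same file is passed twice.

```c
/* Independent sketch of an ftok() collision counter (not the attached tool).
 * ASSUMPTION: one fixed proj_id for all paths; helper names are hypothetical. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>

static int cmp_key(const void *a, const void *b)
{
    key_t x = *(const key_t *)a, y = *(const key_t *)b;
    return (x > y) - (x < y);
}

/* Number of keys equal to an earlier key (n minus distinct values).
 * Sorts the array in place. */
size_t count_key_collisions(key_t *keys, size_t n)
{
    if (n < 2)
        return 0;               /* nothing to compare; avoids qsort(NULL, 0) */
    assert(keys != NULL);
    qsort(keys, n, sizeof *keys, cmp_key);
    size_t dups = 0;
    for (size_t i = 1; i < n; i++)
        if (keys[i] == keys[i - 1])
            dups++;             /* each repeat of an earlier key counts once */
    return dups;
}

/* Computes ftok(path, proj_id) for every path and counts collisions.
 * Returns (size_t)-1 on allocation or ftok() failure. */
size_t count_ftok_collisions(const char *const *paths, size_t n, int proj_id)
{
    key_t *keys = malloc(n * sizeof *keys);
    if (keys == NULL && n > 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        keys[i] = ftok(paths[i], proj_id);
        if (keys[i] == (key_t)-1) {
            free(keys);
            return (size_t)-1;
        }
    }
    size_t dups = count_key_collisions(keys, n);
    free(keys);
    return dups;
}
```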
Comment 28 Rainer Jung 2018-04-14 09:50:22 UTC
Careful, I'm not the OP: extracted, compiled and ran on Solaris 8 SPARC (gcc 4.1.2) and Solaris 10 SPARC (gcc 7.3.0), with the result:

0 collision(s) found

So it will be very interesting to see what the original poster gets.
Comment 29 Tauseef Anjum 2018-04-16 09:34:33 UTC
Comment on attachment 35873 [details]
ftok() collisions tool

 ./ftok_collisions *
0 collision(s) found
-bash-4.4$ ./ftok_collisions ftok_collisions.c *
Collision between 'ftok_collisions.c' and 'ftok_collisions.c'
1 collision(s) found
Here is my output for the scenario you mentioned, although I didn't understand what it is doing.
Comment 30 Tauseef Anjum 2018-04-17 17:41:42 UTC
(In reply to Yann Ylavic from comment #26)
> No Solaris at hand to confirm or refute my suspicion about ftok().
> 
> Maybe you could try configuring APR (or httpd, if built --with-included-apr)
> with --enable-posix-shm. This will switch from IPC SysV SHMs to POSIX ones,
> which may well not have the same issue.

Yann, the issue was resolved using this parameter in my configure.
Apache is working for now. Thanks!
Comment 31 Rainer Jung 2018-05-02 11:34:38 UTC
Tested again today from a fresh download and build. Still getting no collisions on Solaris 10 Sparc:

apache% ./ftok_collisions *
0 collision(s) found
apache% ./ftok_collisions ftok_collisions.c *
Collision between 'ftok_collisions.c' and 'ftok_collisions.c'
1 collision(s) found
apache% ./ftok_collisions * data/*
0 collision(s) found
Comment 32 Yann Ylavic 2018-05-02 12:11:57 UTC
Created attachment 35904 [details]
New tool with full shmget()/shmat()

I changed the tool a bit to do the full IPC SysV creation process (maybe the collision happens there), and to be able to work on the real path from this report with all the balancer names collected from the logs (hard-coded in the tool itself).

Rainer, could you please give it a try?

$ gzip -d ftok_collisions_full.c.gz
$ gcc ftok_collisions_full.c -o ftok_collisions_full
$ mkdir -p /usr/local/apache_QA_New/logs # to be cleaned, eventually
$ (ulimit -n 4096; ./ftok_collisions_full --path /usr/local/apache_QA_New/logs)
0 collision(s) for 3990/3990 files

The tool should clean up after itself w.r.t. files/SHMs...
Comment 33 Rainer Jung 2018-05-02 13:37:58 UTC
Hi Yann,

in addition to the file descriptor ulimit I also had to increase the number of shared memory identifiers "project.max-shm-ids" from the default of 256 (I chose 10000).

Running the tool with a fresh work directory named "work" and "--path work", I got

0 collision(s) for 3990/3990 files


BUT: using no "--path", and thus /tmp, I got:

shmget: File exists
0 collision(s) for 1425/3990 files

and the number 1425 varies by test run.

Under /tmp I always get no collisions but "shmget: File exists", with default settings as well as with a custom subdirectory. On other local or NFS-mounted directories I do not get it. Even if I check for collisions directly after the shmget error, I do not find any. But I had to adjust the check_collisions argument from "i" to "i-1" (otherwise it segfaulted when shmget returned an error).

Concerning "shmget: File exists" from the Solaris man page:

     EEXIST          A shared memory identifier  exists  for  key
                     but      both     (shmflg&IPC_CREAT)     and
                     (shmflg&IPC_EXCL) are true.

and looking at the ftok man page:

NOTES
     Since the ftok() function returns a value based  on  the  id
     given  and  the file serial number of the file named by path
     in a type that is no longer large enough to  hold  all  file
     serial  numbers, it may return the same key for paths naming
     different files on large filesystems.

... and ...

USAGE
     ...
     Another way to compose keys is to include the project ID  in
     the  most  significant byte and to use the remaining portion
     as a sequence number. There are  many  other  ways  to  form
     keys,  but  it  is necessary for each system to define stan-
     dards for forming them. If some standard is not adhered  to,
     it  will be possible for unrelated processes to unintention-
     ally interfere with each other's operation. It is still pos-
     sible  to interfere intentionally. Therefore, it is strongly
     suggested that the most significant byte of a  key  in  some
     sense refer to a project so that keys do not conflict across
     a given system.


I also compiled as a 64 bit binary but got the same results.

Here's an example for two identical keys generated by ftok:

ftok 81/3990 for /tmp/slotmem-shm-p98b37289_ir.shm and -544118222 is 838870549
ftok 168/3990 for /tmp/slotmem-shm-p5750b6d1_jra.shm and 1374409266 is 838870549

In sys/types.h there's a part

/*
 * POSIX and XOPEN Declarations
 */
typedef int     key_t;                  /* IPC key type         */


Regards,

Rainer
Comment 34 Rainer Jung 2018-05-02 13:39:06 UTC
PS: /tmp is mounted as tmpfs in swap.
Comment 35 Rainer Jung 2018-05-02 14:19:20 UTC
An old fork of the OpenSolaris code, e.g. at

https://searchcode.com/codesearch/view/5482758/

indicates that the 32-bit ftok key consists of the lower 8 bits of the id, 12 bits from the device id and another 12 bits from the inode number. So if the files are on the same file system, the 12-bit device id will not differentiate them, and we are left with the lower 8 bits of the id and the lower 12 bits of the inode number.
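The layout described here can be written as a small model. The bit positions (id in bits 31..24, device in 23..12, inode in 11..0) are an assumption based on this description. They are consistent with the key 838870549 (0x32002615) from comment 33, because both ids there, -544118222 and 1374409266, have the same low byte, 0x32. The device and inode values used below are hypothetical, chosen to reproduce that key. solaris_ftok_model() is a model, not the libc source.

```c
/* Model of the Solaris ftok() key layout described in comment 35:
 *   bits 31..24: low 8 bits of the id (proj_id)
 *   bits 23..12: low 12 bits of the device number
 *   bits 11..0 : low 12 bits of the inode number
 * ASSUMPTION: bit positions inferred from that description and the keys
 * in comment 33; this is a model, not the libc source. */
#include <assert.h>
#include <stdint.h>

uint32_t solaris_ftok_model(int id, uint32_t dev, uint32_t ino)
{
    /* converting a negative int to uint32_t is well defined (mod 2^32) */
    return ((uint32_t)id & 0xffu) << 24
         | (dev & 0xfffu) << 12
         | (ino & 0xfffu);
}
```

solaris_ftok_model(-544118222, 2, 0x615) and solaris_ftok_model(1374409266, 2, 0x1615) both evaluate to 838870549. The two inodes differ only above bit 11 and the two ids only above bit 7, so none of the bits that distinguish the files survive.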

Unfortunately, when checking the generated ftok numbers, I get "shmget: File exists" much more frequently than I see a real ftok collision! Your collision check does not trigger, but casting the key to an int sometimes shows duplicates, though less frequently than the shmget EEXIST error.

I couldn't find anything about shmget() peculiarities, but about the ftok() implementation on Solaris one can find:


... ftok() returing a non-unique key is happening very frequently in solaris 10 
update 10.  We have our code working on solaris and other UX OS. We never had this issue even before on solaris. but as soon as we updated to solaris 10 update10 we 
are seeing this issue. And it happens quite frequently. update 10 seems to make the symptoms worse. ...

Regards,

Rainer
Comment 36 Yann Ylavic 2018-05-02 15:21:57 UTC
Thanks Rainer for confirming.

I suspected it could be related to the filesystem, hence this new test with shmget() calls not "released" until the end of the test.

Regarding the project_id and how it seems to be (poorly) used on Solaris 10, we may not be doing the best thing in APR either. Not sure we can do much given that only 8 bits seem to be considered, but possibly a sequential number (i.e. the fd) is better than the hash currently used (which depends on the filename only).

What if you change:
        ctx->h1 = (nhash >= 1) ? hash1(ctx->name) : 0;
        ctx->h2 = (nhash >= 2) ? hash2(ctx->name) : 0;

by something like:
        ctx->h1 = (unsigned int)('A' ^ 'P' ^ 'R') << 24;
        ctx->h2 = (unsigned int)(ctx->fd & 0x00ffffff);

in the code?
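Suppose that, as described in comment 35, only the 8 bits of the id and the 12 bits of the inode vary within one filesystem. The effective key space is then 2^20 = 1048576 values. A birthday-problem estimate assumes uniformly random, independent keys, which is an idealization because inodes and proj_id hashes are not uniform. Even so, it suggests why roughly 4000 segments are very unlikely to all get distinct keys. prob_no_collision() is a hypothetical helper.

```c
/* Birthday-problem estimate for ftok() keys.
 * ASSUMPTION: keys are uniform and independent over `space` values, which
 * real inode numbers and proj_id hashes are not; treat as a rough guide. */
#include <assert.h>

/* Probability that n keys drawn from `space` values are all distinct. */
double prob_no_collision(unsigned long n, double space)
{
    assert(space > 0.0);
    if ((double)n > space)
        return 0.0;                      /* pigeonhole: a repeat is certain */
    double p = 1.0;
    for (unsigned long i = 1; i < n; i++)
        p *= 1.0 - (double)i / space;    /* i-th key avoids the i earlier ones */
    return p;
}
```

Under this model, prob_no_collision(3990, 1048576.0) is below 0.001, so a collision is almost certain for the 3990 files of the tool. prob_no_collision(500, 1048576.0) is about 0.89. The model does not by itself explain why EEXIST showed up more often than visible key duplicates.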
Comment 37 Rainer Jung 2018-05-02 15:46:15 UTC
Again EEXIST from shmget on /tmp, happening neither much earlier nor much later than with the previous code (around iteration 50-500 of 3990).
Comment 38 Yann Ylavic 2018-05-02 16:03:01 UTC
OK, thanks. I guess we should recommend POSIX SHMs, at least on Solaris 10...
Comment 39 Tauseef Anjum 2019-06-27 17:00:10 UTC
Hi Yann,
I am facing the same issue again, and this time it is on a Linux server with httpd 2.4.39. I have set kernel.shmmni=16384, but it still gives the same error.
There are around 12 Apache servers on the same machine; all the others are working fine, but this one is generating the error. It has a lot of vhost errors.
The same Apache was working fine with posix-shm on Solaris.