Bug 58024 - (Graceful) Restart does not work after adding a BalancerMember
Summary: (Graceful) Restart does not work after adding a BalancerMember
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy
Version: 2.4.12
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2015-06-11 13:49 UTC by dnie
Modified: 2015-09-26 23:40 UTC
CC List: 4 users

Attachments
httpd.conf (3.08 KB, text/plain)
2015-06-11 13:49 UTC, dnie
Before creating, try to reuse stored balancers/workers with enough space (5.06 KB, patch)
2015-06-19 11:49 UTC, Yann Ylavic
Fix slotmem destroy on Windows (833 bytes, patch)
2015-08-28 15:31 UTC, Yann Ylavic
error.log after patching mod_slotmem_shm (14.60 KB, text/plain)
2015-08-31 10:36 UTC, dnie
error.log with more details (18.15 KB, text/plain)
2015-08-31 11:57 UTC, dnie
The error.log with start and restart (19.02 KB, text/plain)
2015-09-11 14:33 UTC, dnie
mod_slotmem_shm and mpm_winnt (children) with generation number (2.4.x) (11.12 KB, patch)
2015-09-14 08:03 UTC, Yann Ylavic
error.log with r1702501 (38.16 KB, text/plain)
2015-09-14 08:20 UTC, dnie
mod_slotmem_shm and mpm_winnt (children) with generation number (2.4.x) (15.87 KB, patch)
2015-09-15 14:50 UTC, Yann Ylavic

Description dnie 2015-06-11 13:49:40 UTC
Created attachment 32808
httpd.conf

I am using Apache 2.4.12 on Windows (details below).

I am using mod_proxy and mod_proxy_balancer, and sometimes I have to add or remove a BalancerMember at runtime (without downtime).
However, a (graceful) restart fails after I have modified the Proxy/BalancerMember configuration: Apache shuts down with error AH02599 (see error.log below).
Stopping and then starting Apache works properly.

This scenario works properly with Apache 2.2.27.
 
Some Details:
=============
 OS: Windows 7 64Bit
 Apache downloaded from http://www.apachelounge.com/download/VC11/ ==> httpd-2.4.12-win64-VC11.zip    
 Restart-Command: httpd -k restart -n Apache
 
error.log:
==========
 The 'Apache' service is restarting.
 Failed to restart the 'Apache' service.
 ice] [pid 13636:tid 552] AH00424: Parent: Received restart signal -- Restarting the server.
 [Thu Jun 11 14:41:18.545804 2015] [slotmem_shm:error] [pid 13636:tid 552] AH02599: existing shared memory for C:/test/apache/logs/slotmem-shm-p7bd87614_lb2.shm could not be used (failed size check)
 [Thu Jun 11 14:41:18.545804 2015] [proxy_balancer:emerg] [pid 13636:tid 552] (22)Invalid argument: AH01185: worker slotmem_create failed
 [Thu Jun 11 14:41:18.545804 2015] [:emerg] [pid 13636:tid 552] AH00020: Configuration Failed, exiting
 [Thu Jun 11 14:41:20.536003 2015] [mpm_winnt:notice] [pid 12032:tid 488] AH00364: Child: All worker threads have exited.
Comment 1 dnie 2015-06-18 09:35:29 UTC
Here are simpler steps to reproduce:

1. Download Apache from http://www.apachelounge.com/download/VC11/ httpd-2.4.12-win64-VC11.zip  
2. Unpack it to C:\Apache24
3. Make sure the Visual C++ Redistributable for Visual Studio 2012 (==> msvcr110.dll) is installed. If it is not, download it from http://www.microsoft.com/en-us/download/details.aspx?id=30679
4. Open Terminal as Admin and install the service and start Apache using:

   cd C:\Apache24\bin
   httpd.exe -k install -d C:\Apache24
   httpd.exe -k start

5. Restart is working now:

   httpd.exe -k restart   
   
6. Add the following lines to the end of httpd.conf:

	LoadModule proxy_module modules/mod_proxy.so
	LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
	LoadModule proxy_http_module modules/mod_proxy_http.so
	LoadModule lbmethod_byrequests_module modules/mod_lbmethod_byrequests.so
	LoadModule slotmem_shm_module modules/mod_slotmem_shm.so

	<Proxy balancer://lb1>
		BalancerMember http://127.0.0.1:8080
		#BalancerMember http://127.0.0.2:8080
	</Proxy>

	<Location /myWebapp>
		ProxyPass balancer://lb1/myWebapp
		ProxyPassReverse balancer://lb1/myWebapp
	</Location>

7. Restart is still working:

   httpd.exe -k restart

8. Uncomment the second BalancerMember http://127.0.0.2:8080

9. Restart is "crashing" now:

   httpd.exe -k restart

error.log:
The 'Apache24' service is restarting.
Failed to restart the 'Apache24' service.
nt:notice] [pid 2564:tid 492] AH00424: Parent: Received restart signal -- Restarting the server.
[Thu Jun 18 11:19:15.350509 2015] [slotmem_shm:error] [pid 2564:tid 492] AH02599: existing shared memory for C:/Apache24/logs/slotmem-shm-pac00a502_lb1.shm could not be used (failed size check)
[Thu Jun 18 11:19:15.350509 2015] [proxy_balancer:emerg] [pid 2564:tid 492] (22)Invalid argument: AH01185: worker slotmem_create failed
[Thu Jun 18 11:19:15.350509 2015] [:emerg] [pid 2564:tid 492] AH00020: Configuration Failed, exiting
[Thu Jun 18 11:19:17.341708 2015] [mpm_winnt:notice] [pid 1976:tid 476] AH00364: Child: All worker threads have exited.
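
The AH02599 line suggests that the restarted parent found the shared-memory file from the previous run and compared its recorded layout with what the new configuration needs. A minimal Python sketch of that kind of strict check follows. The `SlotHeader` type, its fields, and the 1024-byte slot size are illustrative assumptions, not the actual mod_slotmem_shm data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotHeader:
    """Hypothetical layout record stored with a shared-memory segment."""
    item_size: int   # bytes per slot (assumed value below)
    item_count: int  # number of slots the segment was created with


def can_reuse(existing: SlotHeader, wanted: SlotHeader) -> bool:
    """Strict check: reuse only if the layout is exactly the same."""
    return (existing.item_size == wanted.item_size
            and existing.item_count == wanted.item_count)


# One BalancerMember at first start, two after editing httpd.conf.
before = SlotHeader(item_size=1024, item_count=1)
after = SlotHeader(item_size=1024, item_count=2)

print(can_reuse(before, before))  # unchanged configuration
print(can_reuse(before, after))   # member added: the check fails
```

Under these assumptions, any change in the number of members makes an old segment unusable, which matches the restart failing only after the second BalancerMember is uncommented.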
Comment 2 dnie 2015-06-19 06:15:13 UTC
This morning, I tested Apache 2.4.10 on a clean Debian 8.1, and this scenario worked with a "graceful" restart without any issue. So this may be a Windows-only problem.

I also tested the "BalancerPersist On|Off" switch on Debian and Windows:
- On Windows, it had no effect at all.
- On Debian there was a difference: both "On" and "Off" seemed to work, but "On" causes the message "AH02551: bad md5 match" in the error.log.
Comment 3 Yann Ylavic 2015-06-19 11:49:52 UTC
Created attachment 32836
Before creating, try to reuse stored balancers/workers with enough space

Can you please try this patch (against 2.4.x) and see if it solves your issue?

The goal is to be less strict about slot reuse when the growth margin leaves enough room to restart with more balancers/workers.
Comment 4 dnie 2015-06-23 13:29:29 UTC
I was not able to get those changes running.

So far, I have compiled mod_proxy_balancer.so with Visual Studio 2012 on Windows 7 (only this file, not the whole project).

Apache does not start and reports this error:

   "httpd.exe: Syntax error on line 532 of C:/Apache24/conf/httpd.conf: Cannot load modules/mod_proxy_balancer.so into server: The Specified Procedure Could not be Found."    
(message translated from German to English)


I have no idea what the problem could be (I am not a C developer).
Are there any other options to get the code compiled for Windows?

Is there a command-line option for httpd to get more information about the "Procedure" that could not be found?
Comment 5 dnie 2015-06-24 13:37:39 UTC
Now I am able to compile the mod_proxy_balancer.so with Visual Studio and run the compiled version within the existing Apache installation by replacing this one file.

I applied the SVN patch and tested the behaviour:

Adding and removing BalancerMembers works in most cases, but in some seemingly random cases I get the following error:

[Wed Jun 24 14:27:13.324773 2015] [proxy_balancer:emerg] [pid 14172:tid 600] (22)Invalid argument: AH01186: worker slotmem_grab failed
[Wed Jun 24 14:27:13.324773 2015] [:emerg] [pid 14172:tid 600] AH00020: Configuration Failed, exiting
[Wed Jun 24 14:27:15.298971 2015] [mpm_winnt:notice] [pid 10256:tid 532] AH00364: Child: All worker threads have exited.
Comment 6 Christophe JAILLET 2015-08-11 09:57:06 UTC
Hi,

Apparently, a similar issue has been reported in http://marc.info/?l=apache-httpd-dev&m=133430638626492

Could you please try the proposed patch and report whether it helps?
The code has changed slightly in this area since 2.4.2, but by looking for APLOGNO(01186) in the code, you should find where to apply the patch.
Comment 7 Mario 2015-08-11 10:44:09 UTC
We once had a similar bug:
https://bz.apache.org/bugzilla/show_bug.cgi?id=52402

However, can you try to start Apache not as a service but from a cmd.exe that you started as Administrator? For me, the shared memory works only if I start the process as Administrator.
Comment 8 Mario 2015-08-11 10:46:49 UTC
I forgot to mention that you can restart Apache by hitting Ctrl + Break. In your case, I guess it is Strg + Pause.
Comment 9 dnie 2015-08-12 07:10:12 UTC
The patch from Christophe does not change anything in my case. It crashes every time when not using the patch from Yann (Comment 3).
I think the issue from Comment 6 is not related to this one.

But I have now figured out my "random cases" from Comment 5:
All BalancerMembers that are active at initial startup can be removed and re-added at runtime. But adding a completely new BalancerMember causes AH01186.

Here is one example (I used the patch from Yann for this):

Initial config:
<Proxy balancer://lb2>
  BalancerMember http://servera:8080
  BalancerMember http://serverb:8080
  #BalancerMember http://serverc:8080
</Proxy>

Now start Apache (Initial start, not restart)

Now remove "serverb":
<Proxy balancer://lb2>
  BalancerMember http://servera:8080
  #BalancerMember http://serverb:8080
  #BalancerMember http://serverc:8080
</Proxy>

Now restart Apache ==> OK

Now remove "servera" and re-add "serverb":
<Proxy balancer://lb2>
  #BalancerMember http://servera:8080
  BalancerMember http://serverb:8080
  #BalancerMember http://serverc:8080
</Proxy>

Now restart Apache ==> OK

Now add "serverc" (which was never active since initial start)
<Proxy balancer://lb2>
  #BalancerMember http://servera:8080
  BalancerMember http://serverb:8080
  BalancerMember http://serverc:8080
</Proxy>

Restart Apache ==> ERROR AH01186
Comment 10 Yann Ylavic 2015-08-19 16:29:44 UTC
(In reply to dnie from comment #9)
> 
> Now add "serverc" (which was never active since initial start)
> <Proxy balancer://lb2>
>   #BalancerMember http://servera:8080
>   BalancerMember http://serverb:8080
>   BalancerMember http://serverc:8080
> </Proxy>
> 
> Restart Apache ==> ERROR AH01186

Can you try to add the following directive in the <Proxy balancer://lb2> block:
  ProxySet growth=5
?

By default the growth margin is zero for the balancer members, so in your scenario there may be no room for adding a new one.
Comment 11 dnie 2015-08-20 08:16:06 UTC
This works!

The ProxySet growth=5 fixes the AH01186.
Now everything works with my compiled version of mod_proxy_balancer.so (with the patch from Comment 3 against 2.4.x).

I will run a stress test in my test environment. This will take a while. I will post the results in a couple of days.

Thanks!
Comment 12 Yann Ylavic 2015-08-21 12:54:41 UTC
Thanks for testing, committed in r1696960 and backport proposed to 2.4.x.
Comment 13 dnie 2015-08-28 14:33:50 UTC
My stress test succeeded. So it is fixed (with my patched version and growth=x).

But there is one condition:
The number of unique BalancerMembers must not exceed the number of BalancerMembers at initial startup plus the value of "growth".


What are the consequences of using a higher growth value? In my case, I cannot know the maximum number of possible BalancerMembers.
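
The condition above can be written down as a small check. This is only a sketch of the behaviour observed with the patch from Comment 3; the function name and the way members are counted are assumptions for illustration, not code from mod_proxy_balancer.

```python
def fits_in_slots(initial_members, growth, members_seen):
    """Return True if every member ever configured fits in the slots.

    initial_members: member URLs configured at the first start
    growth:          value of the balancer's growth parameter
    members_seen:    member URLs configured at any later restart
    """
    capacity = len(set(initial_members)) + growth
    unique = set(initial_members) | set(members_seen)
    return len(unique) <= capacity


initial = ["http://servera:8080", "http://serverb:8080"]
later = ["http://serverb:8080", "http://serverc:8080"]

print(fits_in_slots(initial, 0, later))  # serverc is new, no spare slot
print(fits_in_slots(initial, 5, later))  # growth=5 leaves room
```

Removing and re-adding servera or serverb never raises the count of unique members, which is consistent with only the never-seen serverc triggering AH01186.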
Comment 14 Yann Ylavic 2015-08-28 15:31:55 UTC
Created attachment 33052
Fix slotmem destroy on Windows

After some discussion on the dev@ mailing list, attachment 32836 turns out not to be the correct fix.

The shared slots used to store the balancers/members are destroyed and recreated at each (re)start, based on the size needed by the current configuration.
So there shouldn't be an error when some balancers/members are added, even when no growth margin is used (growth margins are useful only for dynamic configuration via the balancer manager interface).
Hence the slots are not persistent by default on restart: the balancers start from the initial state each time, unless "BalancerPersist on" is configured and the configuration (number of balancers/members, parameters...) is the same.

This new patch fixes the order used to destroy a slot since, on Windows at least, it must be detached *before* the underlying file is removed; otherwise it is reused on restart and the size check fails whenever a balancer/member is added.
Could you please try it?
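
The ordering described above (detach first, remove the file afterwards) can be illustrated outside httpd with a file-backed mapping. This is a hedged Python sketch, not the patch itself: `destroy_segment` is a hypothetical helper, and the temporary file merely stands in for a slotmem file.

```python
import mmap
import os
import tempfile


def destroy_segment(path, mapping, handle):
    """Release a file-backed mapping, then delete its file.

    The order mirrors the one described in the comment: detach first,
    remove the underlying file last.
    """
    mapping.close()   # detach the mapping first
    handle.close()    # then drop the file handle
    os.remove(path)   # only now remove the underlying file


# Create a small file-backed mapping to stand in for a slotmem file.
fd, path = tempfile.mkstemp(suffix=".shm")
os.write(fd, b"\0" * 4096)
handle = os.fdopen(fd, "r+b")
mapping = mmap.mmap(handle.fileno(), 4096)

destroy_segment(path, mapping, handle)
print(os.path.exists(path))
```

With this order the removal does not depend on whether the platform allows deleting a file that is still mapped.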
Comment 15 dnie 2015-08-31 10:36:57 UTC
Created attachment 33055
error.log after patching mod_slotmem_shm

I reverted my mod_proxy_balancer to the original version and applied the new patch to mod_slotmem_shm.

Now I get AH02599, which results in AH00020.

I attached my error.log (LogLevel debug).
In that log you will also find my own custom log line at line 249 with the text "dnie". It is logged just before the (just moved) apr_shm_destroy.

Here is my configuration at the time the error.log was created:

<Proxy balancer://lb2>
  ProxySet growth=5
# BalancerMember http://servera:8080
  BalancerMember http://serverb:8080
  BalancerMember http://serverc:8080
</Proxy>

<Location /myWebapp2>
  ProxyPass balancer://lb2/myWebapp
  ProxyPassReverse balancer://lb2/myWebapp
</Location>

I started Apache with this config and restarted with the additional "servera".
Comment 16 Yann Ylavic 2015-08-31 10:56:59 UTC
Thanks for testing (unfortunately I have no Windows machine to test this, and it is not reproducible on Linux).

Could your "dnie log" include the return values from apr_shm_destroy(), apr_shm_remove() and apr_file_remove() called in cleanup_slotmem()?
The (underlying) SHM file seems not to be deleted here, since it is attached after restart...
Comment 17 dnie 2015-08-31 11:57:12 UTC
Created attachment 33056
error.log with more details

Same test as before, but I have added some more log lines in cleanup_slotmem.
Comment 18 Yann Ylavic 2015-09-11 13:32:19 UTC
A new fix was committed on trunk in r1702450 (applies to 2.4.x too).

On Windows, it uses a different file name for each SHM created on restart, since the old file may still be in use by the old children (gracefully) shutting down, which prevents their removal.

This version works with a clean SHM on restart (fitting the new configuration, should any balancer/member be added), and hence avoids the need for the balancer's growth parameter for configuration changes (it is meant for dynamic changes via the balancer manager only).

Could you test this new fix (the other patches proposed so far are not needed)?
It would be particularly interesting to look at the SHM files (created in the [ServerRoot]/logs directory by default) to see if they are cleanly removed on restart, i.e. as soon as the old children stop (a generation number is used in the file names, and there shouldn't be any SHM file from the previous generation once the corresponding children have stopped).

Thanks.
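
The idea of one file per generation, with older generations cleaned up once nothing uses them, might be sketched as follows in Python. The file name format, the `lb2` slot name, and both helper functions are made up for illustration; the real naming scheme is whatever r1702450 implements, and this sketch assumes the old files are no longer in use when it deletes them.

```python
import os
import tempfile


def shm_file_name(directory, slot_name, generation):
    """Build a per-generation file name (the format is made up)."""
    return os.path.join(directory, f"slotmem-shm-{slot_name}_{generation}.shm")


def remove_older_generations(directory, slot_name, current):
    """Delete files of this slot that belong to earlier generations."""
    prefix = f"slotmem-shm-{slot_name}_"
    removed = []
    for entry in sorted(os.listdir(directory)):
        if not (entry.startswith(prefix) and entry.endswith(".shm")):
            continue
        generation = entry[len(prefix):-len(".shm")]
        if generation.isdigit() and int(generation) < current:
            os.remove(os.path.join(directory, entry))
            removed.append(entry)
    return removed


logs = tempfile.mkdtemp()
for gen in (0, 1, 2):
    open(shm_file_name(logs, "lb2", gen), "wb").close()

removed = remove_older_generations(logs, "lb2", current=2)
print(removed)
print(os.listdir(logs))
```

Because each restart creates a new name, the new parent never has to reuse or resize a file that old children may still have open.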
Comment 19 dnie 2015-09-11 14:33:10 UTC
Created attachment 33097
The error.log with start and restart

I took the whole file mod_slotmem_shm.c from r1702450 and replaced my version with it. I also updated my working copy to the latest revision 1702312 from 2.4.x.
I have a clean working copy except for this one file.

For some reason I also needed to add the following line to get this compiled:
#include "http_core.h"

But this does not work. I ran the same test as before and I get AH02599.

Maybe I did something wrong or this fix doesn't work. Can you see what went wrong? I attached the error log again.
Comment 20 Yann Ylavic 2015-09-11 15:31:25 UTC
Thanks for testing.

Does new commit r1702501 help?
Comment 21 Yann Ylavic 2015-09-14 08:03:05 UTC
Created attachment 33106
mod_slotmem_shm and mpm_winnt (children) with generation number (2.4.x)

(In reply to Yann Ylavic from comment #20)
> 
> Does new commit r1702501 help?

Probably not, AP_MPMQ_GENERATION isn't relevant either in the child process, at least until child_init() (i.e. after all the pre/post_config() stage, which mpm_winnt runs for each child too...).

Hence I think we need to make the generation number available earlier in mpm_winnt's children, so that it can be used at config stage (by any module).

The attached patch (against 2.4.x) includes all the mod_slotmem_shm changes so far, plus the mpm_winnt changes to make AP_MPMQ_GENERATION work in children.
Comment 22 dnie 2015-09-14 08:20:56 UTC
Created attachment 33107
error.log with r1702501

I used the commit r1702501 for mod_slotmem_shm.c. The shm file names are a little bit different, but I get AH02599 as before.
I attached the new log again. I also provided some additional information within the log.
I ran two tests:
1. The first is adding a BalancerMember (same test as the last ones).
2. The second is removing a BalancerMember. This also causes AH02599.
Comment 23 dnie 2015-09-14 10:02:40 UTC
The patch from Comment 21 seems to work.

But I had to add the line

    DWORD BytesRead = 0;

in winnt_rewrite_args in mpm_winnt.c(1048) to get this compiled.

Now I will run some (automated) stress tests on my compiled version and post the results later on...

In my manual test I get many shm files in my log folder: one per restart, balancer and proxy. They are not deleted. Is there a way to delete them automatically? They seem to be unused. I am able to delete the previous ones manually (with Windows Explorer).
Comment 24 Yann Ylavic 2015-09-14 13:15:09 UTC
(In reply to dnie from comment #23)
> 
> In my manual test I get many shm files in my log folder. Once per restart
> and balancer and proxy. They will not deleted.

This is unfortunate, DeleteFile() is documented in MSDN with:

"The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed."

I read this "does not occur until the last handle is closed" as "occurs only when the last handle is closed", but I must be naive, since:

"The DeleteFile function fails if an application attempts to delete a file that has other handles open for normal I/O or as a memory-mapped file (FILE_SHARE_DELETE must have been specified when other handles were opened)."

which seems to indicate that it fails for memory-mapped files *even if* FILE_SHARE_DELETE was used (i.e. FILE_SHARE_DELETE helps only for "other handles opened" but not for memory-mapped files)...

If that's the case, it quite complicates things since we have to address the filesystem leak too.

Let me talk about this on the dev@ mailing list for Windows gurus to help me out (FILE_FLAG_DELETE_ON_CLOSE? that would require our own slotmem_shm_create() for Windows specifics).

Anyway, the debug traces (with your personal logs in cleanup_slotmem() about the values returned by destroy/remove) could help here to confirm what happens at deletion time.
Also, these traces with and without attachment 33052 applied could possibly help.
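
Whether a file can be deleted while it is still memory-mapped can be probed with a few lines of Python. This is only an illustrative sketch: `try_remove_while_mapped` is a hypothetical helper, the outcome depends on the operating system, and Python's file handles do not necessarily use the same sharing flags as APR, so it says nothing definitive about what httpd itself does.

```python
import mmap
import os
import tempfile


def try_remove_while_mapped(size=4096):
    """Try to delete a file while it is still memory-mapped.

    Returns (removed_while_mapped, path). The outcome depends on the
    operating system, so no particular result is assumed here.
    """
    fd, path = tempfile.mkstemp(suffix=".shm")
    os.write(fd, b"\0" * size)
    with os.fdopen(fd, "r+b") as handle:
        mapping = mmap.mmap(handle.fileno(), size)
        try:
            os.remove(path)
            removed = True
        except OSError:
            removed = False
        finally:
            mapping.close()
    if not removed:
        os.remove(path)  # retry once nothing refers to the file
    return removed, path


removed_early, path = try_remove_while_mapped()
print("removed while mapped:", removed_early)
print("still on disk:", os.path.exists(path))
```

If the first attempt fails, the file only disappears after the mapping and handle are closed, which is the kind of leftover file reported in comment #23.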
Comment 25 dnie 2015-09-14 14:33:50 UTC
The stress test looks good.
So far, it has made about 270 restarts with many random configuration changes. Apache is still up and running and fully functional.
Comment 26 Yann Ylavic 2015-09-14 15:11:45 UTC
Do you still observe the filesystem leaks?
If yes, did you also apply attachment 33052?
Comment 27 dnie 2015-09-15 07:39:54 UTC
(In reply to Yann Ylavic from comment #26)
> Do you still observe the filesystem leaks?
> If yes, did you also apply attachment 33052?

With the patch from attachment 33052, all shm files are deleted correctly.

So, I think this issue is fixed.

Thanks!
Comment 28 Yann Ylavic 2015-09-15 14:50:04 UTC
Created attachment 33109
mod_slotmem_shm and mpm_winnt (children) with generation number (2.4.x)

Patch updated to match proposed backport to 2.4.x (r1703205).
Comment 29 Graham Leggett 2015-09-26 23:40:11 UTC
Patch backported to v2.4.17.