Bug 65313

Summary:	fix graceful restarts so it happens when new instance is ready to go.
Product:	Apache httpd-2	Reporter:	apache-bugs.a2
Component:	Core	Assignee:	Apache HTTPD Bugs Mailing List <bugs>
Status:	NEW ---
Severity:	enhancement
Priority:	P2
Version:	2.4.47
Target Milestone:	---
Hardware:	PC
OS:	Linux
Attachments:	Enf of reload notice before stopping children Enf of reload notice before stopping children (v2)

Description apache-bugs.a2 2021-05-18 16:43:34 UTC

Apache starts slowly when it has a lot of config (> 10000 vhosts etc). On good hardware it takes about 20s to start handling http requests. That also means that graceful restart takes same time! Even when graceful restart apache httpd terminates previous instance even though new one is not ready (and won't be for 20s+) which means that end user will see page load for 20 seconds. Nginx reloads itself without this delay even when startups/reload takes a LONG time. Please fix graceful restarts so it won't start before new instance is ready!

Comment 1 Christophe JAILLET 2021-08-25 18:56:52 UTC

WIP in r1892587

Comment 2 apache-bugs.a2 2021-08-27 16:17:43 UTC

(In reply to Christophe JAILLET from comment #1)
> WIP in r1892587

I've tested this patch (r1892587) and did not found any improvement in graceful restarts disruption of handling http requests. Unfortunate http request, that were made during graceful restart, are still put on "hold" for 20+s

Comment 3 apache-bugs.a2 2021-08-27 16:18:50 UTC

Also 've tested patch in both event and prefork mode.

Comment 4 Yann Ylavic 2021-08-28 13:22:05 UTC

Created attachment 37997 [details]
Enf of reload notice before stopping children

Can you please apply this patch on top of r1892587 and show the main error_log with MPM event?

I can't reproduce what you observe in my testing, which consists in continuously running ab (apache bench) and operating periodic graceful restarts (every 20 seconds).
I also add an artificial sleep delay in httpd to simulate a loading that takes 10 seconds, like this:
Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1892671)
+++ server/mpm/event/event.c	(working copy)
@@ -3406,6 +3406,11 @@ static int event_run(apr_pool_t * _pconf, apr_pool
         apr_pool_create(&retained->gen_pool, ap_pglobal);
     }
     else {
+        ap_log_error(APLOG_MARK, APLOG_NOTICE, 0, ap_server_conf,
+                     "Building new generation %i for 10 more seconds",
+                     retained->mpm->my_generation + 1);
+        apr_sleep(apr_time_from_sec(10));
+
         if (retained->mpm->was_graceful) {
             /* wake up the children...time to die.  But we'll have more soon */
             for (i = 0; i < num_buckets; i++) {
--

My observation is that during the reload (+ delay) the server keeps responding as usual.

So I'm wondering if it isn't one of your modules that does time consuming work in each child initialization phase (child_init hooks), in which case there is little httpd can do to make it process connections faster.

Let's see what the logs say about the delays between:
1. AH00493: SIGUSR1 received.  Doing graceful restart
2. AH99999: Gracefully stopping old generation <n-1>
3. AH00489: Apache/... -- resuming normal operations (generation <n>)
4. AH00000: Child <pid> initialization done (generation <n>)
(+ other AH00000s for all the children created for generation <n>)

If the "hold for 20+s" is not between 1. and 2. then the children are too long to set up (and r1892587 won't help).

Comment 5 Yann Ylavic 2021-08-28 13:37:23 UTC

Created attachment 37998 [details]
Enf of reload notice before stopping children (v2)

Actually I should have added a log message before child_init for better reading, like in this v2.

Please apply this one instead, the sequence should then be:
1. AH00493: SIGUSR1 received.  Doing graceful restart
2. AH99999: Gracefully stopping old generation <n-1>
3. AH00489: Apache/... -- resuming normal operations (generation <n>)
4. AH00000: Child <pid> initialization start (generation <n>)
5. AH00001: Child <pid> initialization done (generation <n>)
(+ other AH00000s for all the children created for generation <n>)

Comment 6 apache-bugs.a2 2021-08-29 13:57:12 UTC

With both unpatched http 2.4.48 and patched by your code I get http request made between 1 and 5 seconds after initiating graceful restart causes browser to load page...
With unpatched httpd 2.4.48 page is load after 5s (and on large server even 25s) of "loading" page in browser.
With patched 2.4.48 I get "503 Service Unavailable" after exacly 60-65s every time
If I were to hit ctrl+r in browser while page is loading page would get loaded immediately (new connections are handled fine).

So unfortunetly patch made things worse somehow.

[Sun Aug 29 15:12:34 2021] [notice] [sv1.] event.c(3298): [client AH00493: SIGUSR1 received.  Doing graceful restart
[Sun Aug 29 15:12:40 2021] [notice] [sv1.] event.c(3046): [client AH99999: Gracefully stopping old generation 5
[Sun Aug 29 15:12:40 2021] [notice] [sv1.] event.c(3195): [client AH00489: Apache/2.4.48 (Unix) OpenSSL/1.1.1b SVN/1.9.3 mod_fcgid/2.3.9 Phusion_Passenger/5.3.4 configured -- resuming normal operations (generation 6)
[Sun Aug 29 15:12:40 2021] [warn] [sv1.] event.c(2995): [client AH00488: long lost child came home! (pid 19567)
[Sun Aug 29 15:12:41 2021] [notice] [sv1.] event.c(2497): [client AH00000: Child 23171 initialization start (generation 6)
[Sun Aug 29 15:12:41 2021] [notice] [sv1.] event.c(2503): [client AH00001: Child 23171 initialization done (generation 6)
[Sun Aug 29 15:12:42 2021] [notice] [sv1.] event.c(2497): [client AH00000: Child 23354 initialization start (generation 6)
[Sun Aug 29 15:12:42 2021] [notice] [sv1.] event.c(2503): [client AH00001: Child 23354 initialization done (generation 6)

Comment 7 apache-bugs.a2 2021-08-29 14:24:00 UTC

Problem descrived above occurs only when request were made to page which is generated by PHP at my server. There is no problem with static htm pages.
PHP at my server is handler by mod_fcgid (https://httpd.apache.org/mod_fcgid/) witch pagch from https://bz.apache.org/bugzilla/show_bug.cgi?id=48769 fixing graceful restarts https://bz.apache.org/bugzilla/attachment.cgi?id=28262

So yeah... this is some module nuance

And now when your patch becomes mainline in httpd I will have even worse problem than initialy. Damn :)

Comment 8 apache-bugs.a2 2021-08-29 17:09:45 UTC

Unpached, stock mod_fcgid does not suffer from "503 Service Unavailable" with r1892587 patch, but there is short time window (about 1-2s) for immediate "ERR_CONNECTION_RESET" in browser which does not happen without r1892587 patch.
Also there is no improvement - http request to stock mod_fcgid resource made during graceful restart causes page "loading" in visitor browser for 5-20s (or ERR_CONNECTION_RESET in first 1-2 seconds)

Comment 9 apache-bugs.a2 2021-10-26 21:27:18 UTC

Jut to reiterate:

Patch r1892587 works but not for resources handled by mod_fcgid (php pages). In case of these resources there is NO CHANGE, graceful restart that takes 30s still causes no new mod_fcgid handlers to be spawned for 30s. Is that httpd problem or mod_fcgid problem?