Summary: | fix graceful restarts so it happens when new instance is ready to go. | ||
---|---|---|---|
Product: | Apache httpd-2 | Reporter: | apache-bugs.a2 |
Component: | Core | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
Status: | NEW --- | ||
Severity: | enhancement | ||
Priority: | P2 | ||
Version: | 2.4.47 | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | Linux | ||
Attachments: |
Enf of reload notice before stopping children
Enf of reload notice before stopping children (v2) |
Description
apache-bugs.a2
2021-05-18 16:43:34 UTC
(In reply to Christophe JAILLET from comment #1) > WIP in r1892587 I've tested this patch (r1892587) and did not found any improvement in graceful restarts disruption of handling http requests. Unfortunate http request, that were made during graceful restart, are still put on "hold" for 20+s Also 've tested patch in both event and prefork mode. Created attachment 37997 [details] Enf of reload notice before stopping children Can you please apply this patch on top of r1892587 and show the main error_log with MPM event? I can't reproduce what you observe in my testing, which consists in continuously running ab (apache bench) and operating periodic graceful restarts (every 20 seconds). I also add an artificial sleep delay in httpd to simulate a loading that takes 10 seconds, like this: Index: server/mpm/event/event.c =================================================================== --- server/mpm/event/event.c (revision 1892671) +++ server/mpm/event/event.c (working copy) @@ -3406,6 +3406,11 @@ static int event_run(apr_pool_t * _pconf, apr_pool apr_pool_create(&retained->gen_pool, ap_pglobal); } else { + ap_log_error(APLOG_MARK, APLOG_NOTICE, 0, ap_server_conf, + "Building new generation %i for 10 more seconds", + retained->mpm->my_generation + 1); + apr_sleep(apr_time_from_sec(10)); + if (retained->mpm->was_graceful) { /* wake up the children...time to die. But we'll have more soon */ for (i = 0; i < num_buckets; i++) { -- My observation is that during the reload (+ delay) the server keeps responding as usual. So I'm wondering if it isn't one of your modules that does time consuming work in each child initialization phase (child_init hooks), in which case there is little httpd can do to make it process connections faster. Let's see what the logs say about the delays between: 1. AH00493: SIGUSR1 received. Doing graceful restart 2. AH99999: Gracefully stopping old generation <n-1> 3. AH00489: Apache/... -- resuming normal operations (generation <n>) 4. AH00000: Child <pid> initialization done (generation <n>) (+ other AH00000s for all the children created for generation <n>) If the "hold for 20+s" is not between 1. and 2. then the children are too long to set up (and r1892587 won't help). Created attachment 37998 [details]
Enf of reload notice before stopping children (v2)
Actually I should have added a log message before child_init for better reading, like in this v2.
Please apply this one instead, the sequence should then be:
1. AH00493: SIGUSR1 received. Doing graceful restart
2. AH99999: Gracefully stopping old generation <n-1>
3. AH00489: Apache/... -- resuming normal operations (generation <n>)
4. AH00000: Child <pid> initialization start (generation <n>)
5. AH00001: Child <pid> initialization done (generation <n>)
(+ other AH00000s for all the children created for generation <n>)
With both unpatched http 2.4.48 and patched by your code I get http request made between 1 and 5 seconds after initiating graceful restart causes browser to load page... With unpatched httpd 2.4.48 page is load after 5s (and on large server even 25s) of "loading" page in browser. With patched 2.4.48 I get "503 Service Unavailable" after exacly 60-65s every time If I were to hit ctrl+r in browser while page is loading page would get loaded immediately (new connections are handled fine). So unfortunetly patch made things worse somehow. [Sun Aug 29 15:12:34 2021] [notice] [sv1.] event.c(3298): [client AH00493: SIGUSR1 received. Doing graceful restart [Sun Aug 29 15:12:40 2021] [notice] [sv1.] event.c(3046): [client AH99999: Gracefully stopping old generation 5 [Sun Aug 29 15:12:40 2021] [notice] [sv1.] event.c(3195): [client AH00489: Apache/2.4.48 (Unix) OpenSSL/1.1.1b SVN/1.9.3 mod_fcgid/2.3.9 Phusion_Passenger/5.3.4 configured -- resuming normal operations (generation 6) [Sun Aug 29 15:12:40 2021] [warn] [sv1.] event.c(2995): [client AH00488: long lost child came home! (pid 19567) [Sun Aug 29 15:12:41 2021] [notice] [sv1.] event.c(2497): [client AH00000: Child 23171 initialization start (generation 6) [Sun Aug 29 15:12:41 2021] [notice] [sv1.] event.c(2503): [client AH00001: Child 23171 initialization done (generation 6) [Sun Aug 29 15:12:42 2021] [notice] [sv1.] event.c(2497): [client AH00000: Child 23354 initialization start (generation 6) [Sun Aug 29 15:12:42 2021] [notice] [sv1.] event.c(2503): [client AH00001: Child 23354 initialization done (generation 6) Problem descrived above occurs only when request were made to page which is generated by PHP at my server. There is no problem with static htm pages. PHP at my server is handler by mod_fcgid (https://httpd.apache.org/mod_fcgid/) witch pagch from https://bz.apache.org/bugzilla/show_bug.cgi?id=48769 fixing graceful restarts https://bz.apache.org/bugzilla/attachment.cgi?id=28262 So yeah... this is some module nuance And now when your patch becomes mainline in httpd I will have even worse problem than initialy. Damn :) Unpached, stock mod_fcgid does not suffer from "503 Service Unavailable" with r1892587 patch, but there is short time window (about 1-2s) for immediate "ERR_CONNECTION_RESET" in browser which does not happen without r1892587 patch. Also there is no improvement - http request to stock mod_fcgid resource made during graceful restart causes page "loading" in visitor browser for 5-20s (or ERR_CONNECTION_RESET in first 1-2 seconds) Jut to reiterate: Patch r1892587 works but not for resources handled by mod_fcgid (php pages). In case of these resources there is NO CHANGE, graceful restart that takes 30s still causes no new mod_fcgid handlers to be spawned for 30s. Is that httpd problem or mod_fcgid problem? |