Summary: mod_fcgid performance suffers under increase in request concurrency

| | | | |
|---|---|---|---|
| Product: | Apache httpd-2 | Reporter: | Mike <xyntrix> |
| Component: | mod_fcgid | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
| Status: | REOPENED | | |
| Severity: | normal | CC: | apache, bmccart, krichy, merijnvdk |
| Priority: | P2 | Keywords: | PatchAvailable |
| Version: | 2.2-HEAD | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | All | | |
| Attachments: | Spawn faster under low process count; Rewrite handle_request in fcgid_bridge.c to fix 1sec delay | | |
Description
Mike, 2012-08-10 23:47:40 UTC

This might be related to enhancement submission: https://issues.apache.org/bugzilla/show_bug.cgi?id=52174

I tried changing this in modules/fcgid/fcgid_bridge.c:

```c
/* Avoid sleeping the very first time through if there are no busy
   processes; the problem is just that we haven't spawned anything yet,
   so waiting is pointless */
if (i > 0 || j > 0 || count_busy_processes(r, &fcgi_request)) {
    apr_sleep(apr_time_from_sec(1));
```

to:

```c
if (i > 0 || j > 0 || count_busy_processes(r, &fcgi_request)) {
    apr_sleep(apr_time_from_sec(0));
```

and the serialization block seems to have stopped. Is this meant to be just an artificial anti-thrashing mechanism? If so, is there a better way I can prevent too many processes from trying to spin up concurrently than adding this 1s delay? The 1s delay totally kills concurrent requests.

Created attachment 29233 [details]
Spawn faster under low process count
The principle of not rushing into trying to spawn new processes is good: indeed in the case that we're running at max processes per class, it is the only thing that gives time to try to handle the actual requests (rather than spinning on the server). However, 1s is a very long time, and we're doing it if *any* process is currently busy. As a minimum, FcgidMinProcessesPerClass should be considered - under that level, we should be perfectly happy to launch a new process to handle a request.
Also, if launching another process is a plausible thing to do (i.e. we're below the limit), we shouldn't wait 1s before re-checking: requests should be handled far more quickly than that. So instead, wait 250ms by default.
This patch is based in principle around high/low water marks for process count; at this point, these are just min and max processes per class. Picking figures between these would be better, but these should probably be configurable.
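The water-mark idea above can be sketched as a standalone decision function. This is a minimal illustration only, under assumed names: `decide_spawn` and its enum are hypothetical, and the real patch operates on mod_fcgid's internal process tables rather than plain counts.

```c
#include <assert.h>

/* Possible outcomes when a request finds no free process. */
typedef enum {
    SPAWN_NOW,        /* below FcgidMinProcessesPerClass: spawn immediately */
    WAIT_SHORT_250MS, /* spawning plausible: re-check after a short wait    */
    WAIT_LONG_1S      /* at FcgidMaxProcessesPerClass: only waiting helps   */
} spawn_action;

/* Decide what to do given the current process count for this class
 * and the low/high water marks (here: min/max processes per class). */
static spawn_action decide_spawn(int procs, int low_mark, int high_mark)
{
    if (procs < low_mark)
        return SPAWN_NOW;
    if (procs < high_mark)
        return WAIT_SHORT_250MS;
    return WAIT_LONG_1S;
}
```

With, say, a low mark of 3 and a high mark of 8, the first requests of a burst each get a new process without any sleep, and only a fully loaded class falls back to the long wait.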
Adjusted title slightly to reflect what the OP correctly noted: the problem is that requests that can't immediately be handled by an existing process are delayed by 1s (except when there are no busy processes at all). See also the comments in my patch: ultimately, additional tunables are called for, but adding options isn't something to be done lightly (given the doc impact), and certainly warrants some thought. Possibly explicit high/low marks, or a target process count, and possibly a configurable delay could be used; or some/all of these values could be generated at startup based on the bounds specified by existing options.

A fix is committed to trunk in r1377398: a new protocol to avoid too much sleep and to improve performance under stress testing.
1. procmgr_send_spawn_cmd() now returns a status code from the PM, so the process handler now knows whether the spawn request was denied.
2. If a new process is created, no sleep is needed.
3. If no process is created, sleep a while.

Ryan Pan's fix for this bug (and a couple of follow-on fixes) was reverted with http://svn.apache.org/r1529061 due to issues encountered while testing the proposal for mod_fcgid 2.3.8, which could not be released. Related discussion is in this mailing list thread: http://mail-archives.apache.org/mod_mbox/httpd-dev/201309.mbox/browser Note that at the same time a separate Windows-specific bug could result in more processes than necessary. That didn't affect other platforms and didn't explain all the bad symptoms encountered in Steffen's Windows setup.

Any news on this bug? Why is lowering the sleep from 1s to e.g. 50ms not a solution?

Created attachment 35611 [details]
Rewrite handle_request in fcgid_bridge.c to fix 1sec delay
I rewrote handle_request in fcgid_bridge.c to fix the 1 sec delay issue.
The original code would wait 1 second before trying to acquire a process and then spawn one.
This has the drawback of creating a 'sluggish' feel on low-traffic sites which make use of ajax calls (parallel requests). If there is only one process available, the parallel ajax request will be delayed by a second. After that second the one process will probably be free, so the request will be handled and no new process will be spawned. As a result the next request will behave exactly the same, with the same 1 sec delay, because it takes more requests to actually spawn a new process.
This rewrite throws the one second delay out of the window. It checks more often whether a process is available and tries less often to spawn a new process.
The original code would take 64 seconds of trying before it gave up and an HTTP_SERVICE_UNAVAILABLE was returned. My new code takes 60.8 seconds for this to happen, but what happens during this time is much different.
Original:
64000ms (64x spawn attempts, 128 process apply attempts)
New:
60800ms (8x spawn attempts, 148 process apply attempts)
But where the old code was linear (it just checked every second), the new code is not.
This table shows the spawn attempts and the process apply attempts. There are 8 spawn attempts, and for each spawn attempt a number of process apply attempts is done. The time between these attempts also differs: small waits at the beginning (and end) and long waits in the middle.
0) 2 x 50ms = 100ms
1) 8 x 200ms = 1600ms
2) 14 x 350ms = 4900ms
3) 20 x 500ms = 10000ms
4) 26 x 650ms = 16900ms
5) 26 x 500ms = 13000ms
6) 26 x 350ms = 9100ms
7) 26 x 200ms = 5200ms
Shortening the waits at the end will prevent long-waiting requests from starving, and will hopefully result in fewer HTTP_SERVICE_UNAVAILABLE responses when there is a short peak/overload on the server.
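The schedule above can be checked numerically. This small sketch (array values transcribed from the table; the function name is hypothetical, not part of the patch) reproduces the quoted totals of 148 apply attempts over 60800ms:

```c
#include <assert.h>

/* Per-phase apply attempts and per-attempt waits, from the table above. */
static const int tries[8]   = {  2,   8,  14,  20,  26,  26,  26,  26 };
static const int wait_ms[8] = { 50, 200, 350, 500, 650, 500, 350, 200 };

/* Sum total apply attempts and total wall-clock wait across all phases. */
static void schedule_totals(int *total_tries, int *total_ms)
{
    int i;
    *total_tries = 0;
    *total_ms    = 0;
    for (i = 0; i < 8; i++) {
        *total_tries += tries[i];
        *total_ms    += tries[i] * wait_ms[i];
    }
}
```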
We are using this patch in production for two months now after a three month test period. Both on servers with single high load sites and with low load small sites.
*** Bug 56308 has been marked as a duplicate of this bug. ***

*** Bug 56719 has been marked as a duplicate of this bug. ***

Any progress on this issue? Using mod_fcgid-2.3.9 in a production environment and running into the same performance issues.

Another solution would be to reduce the polling interval, as your suggestion would cause HTTP 503 errors to appear almost immediately if no slot is available, instead of after 60 seconds. To achieve this, the first step is to change the retry count, e.g. to reduce the interval to 0.1 seconds. From:

```c
FCGID_APPLY_TRY_COUNT 2
```

to:

```c
FCGID_APPLY_TRY_COUNT 11
```

Besides this, the interval has to be reduced. From:

```c
if (i > 0 || j > 0 || count_busy_processes(r, &fcgi_request)) {
    apr_sleep(apr_time_from_sec(1));
```

to:

```c
if (i > 0 || j > 0 || count_busy_processes(r, &fcgi_request)) {
    apr_sleep(apr_time_from_msec(1000 / (FCGID_APPLY_TRY_COUNT - 1)));
```

To improve performance, it can make sense to hardcode the requested 100ms delay in the example as `apr_time_from_msec(100)`, or as a preprocessor directive, since it would otherwise be calculated several hundred times per request in high-traffic environments.
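The arithmetic behind this suggestion can be sketched as follows (a minimal illustration; `apply_sleep_ms` is a hypothetical helper, not mod_fcgid code). Dividing a 1-second budget across the retries means the original count of 2 keeps the 1000ms sleep, while a count of 11 yields the proposed 100ms:

```c
#include <assert.h>

/* Per-try sleep in milliseconds, dividing a 1-second budget across
 * (try_count - 1) waits, as in the suggested apr_sleep() expression. */
static int apply_sleep_ms(int try_count)
{
    return 1000 / (try_count - 1);
}
```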