| Summary: | Scoreboard full error with event/ssl | | |
|---|---|---|---|
| Product: | Apache httpd-2 | Reporter: | Alexander Strange <astrange> |
| Component: | mpm_event | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
| Status: | RESOLVED FIXED | | |
| Severity: | major | CC: | andru, bucky, chris, daniel.lemsing, dgallowa, friesoft, gjorgjioski, gregames, info, jim, leho, mike.williams, nikke, payam_hekmat, sander, sf, stephane, tez, thomas.jarosch, toscano.luca |
| Priority: | P1 | Keywords: | FixedInTrunk |
| Version: | 2.4.7 | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Linux | | |
Attachments:
- close keepalive connections if process is shutting down
- exit some threads early during graceful shutdown of a process
- Allow to use more scoreboard slots
- same as above, but for trunk
- Use all scoreboard entries up to ServerLimit, for trunk
- Use all scoreboard entries up to ServerLimit, for 2.4
Description
Alexander Strange
2012-07-17 02:21:21 UTC
This may be obvious, but the server-status page is a huge help in analyzing scoreboard full issues. Do you remember what it looked like? What state codes were most prevalent? The scoreboard can fill up quickly if a back end server stalls.

We've seen AH00485: scoreboard is full, not at MaxRequestWorkers on 2.4.4 with the event MPM, no SSL involved. Haven't figured out the exact conditions yet, but involved are:

* High/varying load, causing worker processes to be spawned and killed, filling up the scoreboard with G:s.
* Server reloads due to config changes.

I suspect the root cause is that server processes are flagged for killing, but later they're needed again, and instead of reviving the existing process a new one is created. If you have a lot of slow connections (this is a file archive serving DVD-images etc.) processes can add up. The scoreboard can look like this after a while:

----------8<----------------
PID    Connections       Threads     Async connections
       total  accepting  busy  idle  writing  keep-alive  closing
14465  94     no         0     0     72       0           21
28881  132    yes        0     0     79       0           6
23632  582    no         0     0     523      0           51
32314  43     no         0     0     28       0           15
13766  577    no         0     0     564      1           2
337    42     no         0     0     28       0           13
19580  39     no         0     0     27       0           12
30603  478    no         0     0     424      0           52
32163  177    no         0     0     136      0           24
16159  429    no         0     0     374      0           54
15376  93     no         0     0     45       0           47
32478  124    no         0     0     86       0           38
30604  395    yes        2     48    390      3           0
30667  61     no         0     0     38       0           17
31569  58     no         0     0     27       0           20
19614  161    no         0     0     117      0           44
32286  253    yes        0     50    252      0           0
17643  454    yes        2     48    445      0           3
23353  49     no         0     0     27       2           20
31581  145    no         0     0     106      0           34
Sum    4386              4     146   3788     6           473

LGLGGGLLGLGLGLLLLLGLGLGLLLLLGLLLLLLLLLLLLLLGGGLGLLGGGGLGGLGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGLGLGGGGGLGLLGGGLGLLLLLLGGGLLLLLGGLGLGLLLGGGLGLLLGLGLLGL
LGGLLLLGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGL
GGLLGGGLLGGLGLGGGGGLLGGGGLGLLLLLLGGGGGLGGGGGGLLLLLGLLLGLLLLLLLGL
LLLGLLLGLGLGGGLGLGGGGLLGLGGLLLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GLGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGG
GGLLLLLGLLLLGLLLLGLGLLLLGGLGLLLLLGLLGLLLLLLLLLLGGLLLGLGGGGGGGGGG
GGGGLGGGGLGGGLGGGGGGGLGGGGGGGGGGGGGGGGGGGGLGGGGGLLGGGGLLGGGLGLLG
GGGLGGLLGGGGLGGLGGLGLGGL____________________WW__________________
__________GGGGGGGGGGGGGGGGGGGGGGGGGGLGGLGLGGGGGGGGGGGGGGGGGGLLGG
LGLLGLGLGGGLLGLGGLLLLGLGGGLGLLGGLGLLGLGLLGLGGGLGGGGGGGGGGGGGLGGG
GLGGLGGGGGLGGGGGGGGGGLGLGGLLGLGG________________________________
____________________W___W_______________________________________
____GLGLLLLLLLGGGLLGGLLLGGLLLLLLGGLGLLGGLLGGGGLGLLLGGGLLGGLGLGGG
LLGLGGLLLLGLGLLGGGGGGLLGGGGGGLLGGGGLGLGL
----------8<----------------

(In reply to Niklas Edmundsson from comment #2)
> We've seen AH00485: scoreboard is full, not at MaxRequestWorkers on 2.4.4
> with the event MPM, no SSL involved.
> PID Connections Threads Async connections
>     total accepting busy idle writing keep-alive closing
> 14465 94 no 0 0 72 0 21
> 28881 132 yes 0 0 79 0 6
> 23632 582 no 0 0 523 0 51
> 32314 43 no 0 0 28 0 15
> 13766 577 no 0 0 564 1 2
> LGLGGGLLGLGLGLLLLLGLGLGLLLLLGLLLLLLLLLLLLLLGGGLGLLGGGGLGGLGGGGGG
> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGG

OK, there are many worker processes that hang while trying to shut down, probably due to traffic fluctuations. The only two states we see in the scoreboard are G and L. The G should be transient and can probably be ignored. The Ls look like the cause of the hangs.

L means the threads are hung while trying to write to the log. Normally you never see this with logs on a reasonably fast local hard drive. Are the log files NFS mounted or something like that?

Greg

> OK, there are many worker processes that hang while trying to shut down,
> probably due to traffic fluctuations. The only two states we see in the
> scoreboard are G and L. The G should be transient and can probably be
> ignored. The Ls look like the cause of the hangs.

Transient for the G:s can mean days in this case, think slooow ADSL connection downloading a DVD image...
> L means the threads are hung while trying to write to the log. Normally you
> never see this with logs on a reasonably fast local hard drive. Are the log
> files NFS mounted or something like that?

No, local filesystem. But I'll have to double check that we're not doing anything overly clever on the log front...

Greg, I didn't check the code, but to me it seems that a "G" letter does not mean there's no more work going on. The server-status on our own www.(eu|us).apache.org shows the same G plus L mixture for about a minute (varying) whenever a process dies due to MaxConnectionsPerChild. When I checked such processes, they had open client connections and were still sending data to the client. So it was correct that they were still around, but the status letters "G" or "L" for those gracefully exiting children are not showing those details.

I looked at apache.org and the code. The Ls are normal when a gracefully exiting process had an active thread. Sorry for jumping to conclusions. close_listeners sets all the G states during graceful shutdown. (Unfortunately this means we can no longer see which threads are active vs. idle - not sure having the G state is worth it.) Any active threads which finish their requests will log and set the L state before exiting. The Gs that remain could represent exited threads or active requests - we can't tell from server-status. The processes that didn't exit have active connections. If those are due to slow downloads, maybe the thing to do is to tune for fewer or no graceful process terminations when the traffic drops, by raising MaxSpareThreads.

Recently hit this error in a high traffic production web server (Apache 2.4.6), leading to an outage. Has anyone had success in overcoming this issue by amending the Apache configuration? If so, what did you change? Also, can anyone offer any suggestions on what triggers this issue? Being a production server, rolling back to 2.2.22 is not preferable.
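When triaging one of these scoreboard dumps, a quick way to see which states dominate is simply to count the state letters. A minimal sketch (the letter meanings follow the legend printed on the server-status page; the sample string below is made up, not from this report):

```python
from collections import Counter

# Meanings of the mod_status scoreboard letters, per the legend on
# the server-status page (only the states relevant to this report).
STATE_NAMES = {
    "G": "gracefully finishing",
    "L": "logging",
    "W": "sending reply",
    "K": "keepalive",
    "_": "waiting for connection",
    ".": "open slot (no process)",
}

def summarize_scoreboard(board: str) -> dict:
    """Count occurrences of each scoreboard state letter."""
    counts = Counter(board.replace("\n", "").replace(" ", ""))
    return {STATE_NAMES.get(s, s): n for s, n in counts.items()}

# Hypothetical sample resembling the dumps in this report:
sample = "GGGGGLLLLWW____...."
print(summarize_scoreboard(sample))
```

Feeding it the dumps above would show the G and L states crowding out everything else, which is the symptom everyone in this thread is describing.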
One of the gotchas with this is that the scoreboard seems to be sized to cater for MaxRequestWorkers, with no margin for server reloads etc. In our case, when it can take days for processes to exit if people are downloading large files over slow connections, we can easily have the situation where multiple server reloads (due to config changes etc.) cause the scoreboard to fill up with old server processes in graceful-shutdown mode, leaving no space for new processes to do some actual work.

I can see a few ways to work around this:

1) Simply make the scoreboard bigger. I'd like a default size multiplier of 2 for the event MPM, but configurable so we can set it to 4 or something for our setup. An alternative is to set a ridiculously large MaxRequestWorkers to get a big enough scoreboard, but one DOS and we're out of scoreboard anyway.

2) Kill off the oldest gracefully-exiting processes when we can't spawn a new process to do useful work.

The ideal solution is probably a mix of these two. Also, I'm wondering if this is somehow related to the "server dies for a while when doing reload" issue. We're still at httpd 2.4.6 though, so I can't say for certain that some of these issues aren't already fixed.

In case it matters any, this problem appears to be specific to the event MPM. I had it happening on a server, and when I switched it to the worker MPM, it stopped. However, what I did notice is that the same server periodically had all of its workers taken up with requests, so that may be relevant to the problem as well.

I have a similar behavior as described here (with no ssl involved) with httpd 2.4.9. I got a lot of AH00485: "scoreboard is full, not at MaxRequestWorkers"; httpd is still serving requests, however one worker is in graceful finishing state and is taking 100% CPU. The worker was in this state for about 24h, until I kill(1)ed it.
Threads stats:
__________________W_____________________________________________
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Unfortunately I don't have any other info from the status page. strace of the worker shows an epoll_wait infinite loop:

[...]
epoll_wait(10, {}, 128, 100) = 0
epoll_wait(10, {}, 128, 100) = 0
epoll_wait(10, {}, 128, 100) = 0
[...]

mpm event config:

StartServers             1
ServerLimit              4
MinSpareThreads          4
MaxRequestWorkers        128
ThreadsPerChild          64
ThreadLimit              64
AsyncRequestWorkerFactor 4

Apache 2.4.10 on Slackware Linux 14.1 x86_64 platform. I am seeing this about once a minute in the logs:

AH00485: scoreboard is full, not at MaxRequestWorkers

I was able to recover only by a forced restart (stop then start).

After migrating from worker MPM to event MPM with Apache 2.4.7 we are seeing this same problem.

Server version: Apache/2.4.7 (Ubuntu)
Ubuntu Trusty 14.04.2 LTS
Linux 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

We explicitly moved to event MPM for this workload, which is a proxy of thousands of mostly-idle HTTP Keep-Alive connections - since event MPM doesn't require a thread per Keep-Alive connection. Although our number of clients is fairly consistent, and we have MaxConnectionsPerChild=0, we observe Apache processes going into GGGGGG state until eventually Apache no longer accepts connections. If we set MinSpareThreads and MaxSpareThreads equal to MaxRequestWorkers (so Apache doesn't attempt to scale down processes), the issue goes away (as expected, but validates (maybe?) this has to do with Apache scale-down).

Since client connections can be connected for hours or days, Apache processes stay in this state for a very long time, eventually rejecting client connections and becoming wedged. Our clients are not browsers - Apache is being used for a mid-tier load balancer/proxy with client connections that are very long lived (long Keep-Alive times).
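As a side note on configs like the one quoted above: if I read the mod_mpm_event documentation right, the upper bound on concurrent connections the MPM will accept is roughly (AsyncRequestWorkerFactor + 1) * MaxRequestWorkers. A sketch of that arithmetic (treat this as the docs' approximation, not a hard guarantee):

```python
def event_mpm_capacity(max_request_workers: int,
                       async_request_worker_factor: int) -> int:
    """Rough upper bound on simultaneous connections for the event MPM,
    per the approximation in the mod_mpm_event docs:
    (AsyncRequestWorkerFactor + 1) * MaxRequestWorkers."""
    return (async_request_worker_factor + 1) * max_request_workers

# Values from the 2.4.9 config quoted above:
# MaxRequestWorkers 128, AsyncRequestWorkerFactor 4
print(event_mpm_capacity(128, 4))  # -> 640
```

Once gracefully-finishing processes stop accepting, the remaining processes have to absorb those connections within this budget, which is why several commenters below raise AsyncRequestWorkerFactor as a mitigation.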
248 requests/sec - 0.7 MB/second - 3114 B/request
2 requests currently being processed, 38 idle workers

PID    Connections       Threads     Async connections
       total  accepting  busy  idle  writing  keep-alive  closing
28483  1642   no         0     0     0        1642        0
29672  553    yes        1     19    0        552         0
29696  9      no         0     0     0        9           0
29588  173    no         0     0     0        173         0
29618  1      no         0     0     0        1           0
29644  6      no         0     0     0        6           0
29719  30     no         0     0     0        30          0
29743  237    yes        1     19    0        236         0
Sum    2651              2     38    0        2649        0

GGGGGGGGGGGGGGGGGGGG________W___________GGGGGGGGGGGGGGGGGGGGGGWG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGG________W___________................................
........

We are seeing the same symptoms here:

Server Version: Apache/2.4.7 (Ubuntu) SVN/1.8.8 mod_jk/1.2.37 OpenSSL/1.0.1f
Ubuntu 14.04.2 LTS
Linux 3.13.0-52-generic #86-Ubuntu SMP Mon May 4 04:32:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Many logs:

[mpm_event:error] [pid 6332:tid 140558940702592] AH00485: scoreboard is full, not at MaxRequestWorkers

From the server status, right after start:

__RR___________R________________________W__________________W____
___________.....................................................
......................

After one hour:

___________________W_____GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGG_______W___W_____________............................
......................

Two hours later:

GGGGGGGGGGGGGGGGGGGGGGGGGW_W_____W________W____W__GGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGG

Is there anything we can provide to help in the diagnosis of the issue? Do you know of any workaround through configuration?

In case others find it useful, the approach we used to mitigate this was several things:

1. Increased MinSpareThreads and MaxSpareThreads, as well as the range between them. By making Apache less aggressive about scaling the number of servers down, it's less likely to run into this issue.
Our new values are:

MinSpareThreads = MaxRequestWorkers / 4
MaxSpareThreads = MinSpareThreads * 3

2. Lowered MaxKeepAliveRequests. By looking at a histogram of request counts per connection on an equivalent Apache running with worker MPM (first value in the Acc column), I found a very long tail of few connections out to our old value, but a clear cluster at the lower end. Our new MaxKeepAliveRequests is a bit beyond the critical-mass cluster, but significantly lower than the old value. This will allow servers to recycle quicker when they scale down, but not cause any significant impact to client connections, since the relative number of connections we'll close early is small.

3. Increased AsyncRequestWorkerFactor. When Apache servers are scaling down (in Gracefully Finishing state), this allows other servers to pick up the slack by handling a larger number of total client connections (in HTTP Keep-Alive, this does not increase the number of workers), where before these processes had reached their limit of connections and were rejecting new ones. The event MPM does a reasonably good job of spreading load between processes, and with our larger spare-threads range we now tend to have more alive processes as well.

We also considered lowering KeepAliveTimeout, using a similar histogram to the one I made for MaxKeepAliveRequests from a worker MPM configuration (the SS column is a reasonable analog). That histogram showed a nice distribution for us, so lowering this would have affected clients and not helped for this workload.

These are the values that worked for us, with our workload, to mitigate this issue. Of course your workload and values will be different, but this may be a reasonable strategy to try as well.

2.4.16 and the following configuration hits scoreboard full with 3-4 reloads:

StartServers           2
MinSpareThreads        50
MaxSpareThreads        150
ThreadsPerChild        25
MaxRequestWorkers      200
MaxConnectionsPerChild 10000

Any advice?

This is certainly a bug and not a configuration issue.
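The histogram approach ScottE describes above (bucketing the first value of the Acc column from extended server-status to pick a MaxKeepAliveRequests just past the cluster) can be sketched like this; the sample values are invented, and in practice you would scrape the Acc column yourself:

```python
from collections import Counter

def keepalive_histogram(acc_counts, bucket=10):
    """Bucket per-connection request counts (first value of the Acc
    column in extended server-status) to see where the bulk of
    connections cluster, as a guide for choosing MaxKeepAliveRequests."""
    return Counter((c // bucket) * bucket for c in acc_counts)

# Hypothetical Acc values scraped from a worker-MPM server-status page:
sample = [1, 2, 3, 5, 8, 9, 11, 12, 14, 95, 250]
for lo, n in sorted(keepalive_histogram(sample).items()):
    print(f"{lo:>4}-{lo + 9}: {n}")
```

With a distribution like the sample, most connections fall in the lowest buckets and a long thin tail runs out to the old limit, which is exactly the shape that justified lowering MaxKeepAliveRequests in that comment.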
I have had this error happen with the default (Debian) configuration, and other people online report the same. I have had this happen with mpm_event and mpm_worker. It's very reproducible. It happens with almost any thread-related settings I have tried. It stops new requests from being served and is a serious problem. There is some bug with the way Apache handles its servers/threads. This is not something that can be fixed by tweaking the configuration. At best it might be mitigated by setting:

StartServers        1
ServerLimit         X
ThreadsPerChild     XXX
ThreadLimit         <ThreadsPerChild>
MaxRequestWorkers   <ServerLimit * ThreadLimit>
MinSpareThreads     <MaxRequestWorkers>
MaxSpareThreads     <MaxRequestWorkers>
MaxRequestsPerChild 0

In other words, make it so a thread stays alive forever, so that the buggy part of the code responsible for killing and reusing threads is never hit. Of course this requires always using the maximum amount of RAM, since threads never die even when there is no traffic.

(In reply to gobbledance from comment #16)
> This is certainly a bug and not a configuration issue. I have had this error
> happen with the default (Debian) configuration and other people online
> report the same. I have had this happen with mpm_event and mpm_worker.
>
> It's very reproducible. It happens with almost any thread related settings I
> have tried. It stops new requests from being served and is a serious problem.

I have found no way around it with a variety of worker configuration parameters. Looks like the best bet would be to have fail2ban or similar monitor the error_log and restart the server when the scoreboard hits the DoS condition.

I'm also affected by this bug running Apache/2.4.7 (Ubuntu) on 14.04. I set up a logfile watch daemon that force-restarts apache2 if the line shows up in the error.log, as a hotfix.

Has anyone tested this with the current stable release 2.4.16?
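A minimal sketch of the log-watch hotfix mentioned in the comments above, assuming a standard Debian/Ubuntu ErrorLog path and systemd; the path, restart command, and back-off interval are all assumptions to adapt, and this is a blunt workaround rather than a fix:

```python
import subprocess
import time

ERROR_LOG = "/var/log/apache2/error.log"  # assumed ErrorLog path
MARKER = "AH00485"                        # "scoreboard is full" error code

def is_scoreboard_full(line: str) -> bool:
    """True if an error-log line reports the scoreboard-full condition."""
    return MARKER in line

def watch_and_restart():
    """Tail the error log and force-restart httpd when AH00485 appears."""
    with open(ERROR_LOG) as f:
        f.seek(0, 2)  # start at end of file; only watch new lines
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if is_scoreboard_full(line):
                # Assumed restart command; adapt to your init system.
                subprocess.run(["systemctl", "restart", "apache2"],
                               check=False)
                time.sleep(60)  # back off to avoid restart storms

# To run as a daemon: watch_and_restart()  # blocks forever
```

fail2ban with a filter matching AH00485 and a restart action would achieve the same thing without custom code.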
(In reply to ScottE from comment #12)
> Our clients are not browsers - Apache is being used for a mid-tier load
> balancer/proxy with client connections that are very long lived (long
> Keep-Alive times).

This seems to be a problem that should not be too difficult to fix. When a process is shutting down, it should close its keepalive connections. Can you please check if the attached patch helps?

The case where long-running transfers are keeping a process from shutting down is much more difficult to fix.

Created attachment 33154 [details]
close keepalive connections if process is shutting down
Created attachment 33158 [details]
exit some threads early during graceful shutdown of a process
The attached diff against the 2.4.x branch makes unneeded threads exit earlier during graceful shutdown of a process. This then allows new processes to use the freed scoreboard slots.
I am interested in real-life experiences with this patch. It has two known problems, though:
- If httpd is shut down (ungracefully) while there are some old processes around serving long lasting requests, those processes won't die peacefully but will be SIGKILLed by the parent after 10 seconds.
- server-status shows incomplete information (that is, even more incomplete than in 2.4 ;) )
I have applied the patch on our own production server, which experiences this problem sometimes twice a day, and sometimes not for a week or so. So now we wait. I will report immediately if the problem recurs, and I will also report in a week if the problem does not recur.

PS: If "Graceful, but sigkill after 10 seconds" were an actual option, I would probably use it all the time.

(In reply to Stefan Fritsch from comment #21)
> - If httpd is shut down (ungracefully) while there are some old processes
> around serving long lasting requests, those processes won't die peacefully
> but will be SIGKILLed by the parent after 10 seconds.

Wasn't that already the case for ungraceful stop/restart?

> - server-status shows incomplete information (that is, even more incomplete
> than in 2.4 ;) )

How about not setting SERVER_GRACEFUL in close_listeners() and worker_thread()? The old generation's state could be relevant, since the new generation does not "steal" the scoreboard now (until the old worker exits).

(In reply to bucky from comment #22)
> I have applied the patch on our own production server, which experiences
> this problem sometimes twice a day, and sometimes not for a week or so.

Thanks for that already.

(In reply to Yann Ylavic from comment #23)
> Wasn't that already the case for ungraceful stop/restart?

Normally, those child processes should react to the SIGTERM that is sent first. But that is currently broken by my patch.

> How about not setting SERVER_GRACEFUL in close_listeners() and
> worker_thread()?
> The old generation's state could be relevant, since the new generation does
> not "steal" the scoreboard now (until the old worker exits).

Yes, that would probably be better; I'll have to test that. But it would not fix the incompleteness I was referring to: the old and the new process have only one process slot in the scoreboard, which makes the async overview table show sometimes the info from the old and sometimes from the new process, depending on who updated it last.

(In reply to Stefan Fritsch from comment #24)
> Yes, that would probably be better; I'll have to test that. But it would not
> fix the incompleteness I was referring to: the old and the new process have
> only one process slot in the scoreboard, which makes the async overview
> table show sometimes the info from the old and sometimes from the new
> process, depending on who updated it last.

It seems to me that the new generation's worker threads are not started now unless their scoreboard slot is marked SERVER_DEAD (was also SERVER_GRACEFUL before attachment 33158 [details]). So AIUI, there shouldn't be two workers using the same slot.

(In reply to Yann Ylavic from comment #25)
This technical discussion has been moved to the dev mailing list.

It's been a week. The scoreboard errors haven't stopped altogether. Every so often I still get one a second for a short time, but now they last for about 1 or 2 minutes, and that's it. I haven't gotten any lockups since I applied the patch.

mod_h2 did some significant cleanups for resource handling in the 0.9.x branch. "Scoreboard full" errors seem to have been completely eliminated for me. Uptime of several weeks goes with no issues now.
So it looks like external modules' individual cleanup abilities are directly related to this issue.

I'm confused. To my knowledge, mod_h2 is a 3rd party module. Is it somehow an integral part of the latest httpd (2.4.16)?

(In reply to bucky from comment #29)
> I'm confused. To my knowledge, mod_h2 is a 3rd party module. Is it somehow
> an integral part of the latest httpd (2.4.16)?

Yes, it is already part of trunk and backported to 2.4.x.

This may be related:

(In reply to Leho Kraav @lkraav from comment #28)
> mod_h2 did some significant cleanups for resource handling in the 0.9.x
> branch. "Scoreboard full" errors seem to have been completely eliminated for
> me.

mod_http2 (being released in 2.4.17) has its own connection handling (somewhat apart from the MPM, for now), and shouldn't be seen as a workaround to this issue. The more testing on Stefan's proposed patch (regarding MPM event), without mod_http2, the quicker it will be backported in a release.

The fixes I did in mod_http2, mentioned by Leho, were just related to the fact that early 0.9.x versions of that module did not properly mark connections for reclaiming, so cleanup work was not run all the time, leading to memory loss and scoreboard handle waste. That has been fixed in mod_http2 alone and does not affect other connections. Since the bug happens without the module as well, its presence is not a mitigation. If the patch by Stefan does not fix it, we should review again whether there are races that prevent cleanup from happening in the HTTP/1.1 cases.

We got into a situation where the users of our product were stuck with G. We've got severe performance issues in those cases. We've tried the patch https://bz.apache.org/bugzilla/attachment.cgi?id=33158&action=diff on a couple of installs and it made things much, much better. On one install it would get stuck with 2000 clients coming in at roughly the same time. Now it can handle 10K gracefully. Hope that helps.

I'm hitting this on a production server with 2.4.18 now.
Can't apply custom patches here.

ServerLimit           30
MaxRequestWorkers     30
MaxConnectionsPerChild 600
KeepAlive             On
KeepAliveTimeout      1
MaxKeepAliveRequests  20
Timeout               50

mod_h2 isn't enabled here. From the above discussion, I can't get a clear indication of whether any core developers have confirmed this to be a bug or a configuration issue.

After applying the patch I ran into "No space left on device: AH00023: Couldn't create the proxy mutex". I haven't seen that issue without the patch. Log says:

[Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid 140551243081696] (28)No space left on device: AH00023: Couldn't create the proxy mutex
[Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid 140551243081696] (28)No space left on device: AH02478: failed to create proxy mutex
AH00016: Configuration Failed

# ipcs -s
------ Semaphore Arrays --------
key        semid     owner   perms  nsems
0x00000000 0         root    600    1
0x00000000 65537     root    600    1
0x00000000 131074    apache  600    1
0x7a00179d 59899907  zabbix  600    13
0x00000000 3866628   apache  600    1
0x00000000 3899397   apache  600    1
0x00000000 3932166   apache  600    1
[... many more single-semaphore arrays owned by apache, elided ...]
0x00000000 65470589  apache  600    1
0x00000000 65503358  apache  600    1

(In reply to Sander Hoentjen from comment #35)
> After applying the patch I ran into "No space left on device: AH00023:
> Couldn't create the proxy mutex" I haven't seen that issue without the patch.

Hi Sander, I don't believe this is related to the patch - I've seen this happen (on vanilla 2.4.7) with a bad configuration and something like daemontools constantly restarting Apache. This is likely a valid bug, where Apache can leak mutexes under some conditions, but I don't think it's caused by the patch.

(In reply to ScottE from comment #36)
> (In reply to Sander Hoentjen from comment #35)
> > After applying the patch I ran into "No space left on device: AH00023:
> > Couldn't create the proxy mutex" I haven't seen that issue without the patch.
>
> Hi Sander, I don't believe this is related to the patch - I've seen this
> happen (on vanilla 2.4.7) with a bad configuration and something like
> daemontools constantly restarting Apache. This is likely a valid bug, where
> Apache can leak mutexes under some conditions, but I don't think it's caused
> by the patch.

Well, we have apache 2.4 with the event MPM on tens of servers, and besides the bug in this ticket they are doing fine. On one of them we applied the patch (no other changes) and got AH00023, so while I believe there are other ways to trigger it, it seems that the patch can also play a role in it.

(In reply to Thierry Bastian from comment #33)
> We got into a situation where the users of our product were stuck with G.
> We've got severe performance issues in those cases. We've tried patch
> https://bz.apache.org/bugzilla/attachment.cgi?id=33158&action=diff on a
> couple of installs and it made things much much better. On one install it
> would get stuck with 2000 clients coming in at roughly the same time. Now it
> can handle 10K gracefully.
> Hope that helps.

I've been trying that today after an update from 2.2.something to 2.4.18. Still get the "scoreboard is full, ..." error though. One server looks like this when emitting the "scoreboard is full, ..." error, a few moments before becoming entirely unresponsive.
179 requests currently being processed, 461 idle workers

PID    Connections       Threads     Async connections
       total  accepting  busy  idle  writing  keep-alive  closing
25580  205    no         15    49    0        147         44
21331  293    no         0     0     0        0           292
19389  1      yes        0     0     0        0           0
25924  164    no         12    52    0        151         0
23217  432    no         15    49    0        146         270
23361  457    no         18    46    0        140         298
24175  458    no         13    51    0        149         297
20428  246    yes        0     0     0        0           244
21641  439    no         17    47    0        145         283
21739  435    no         16    48    0        143         277
23506  448    no         18    46    0        139         293
26180  30     yes        41    23    0        3           0
20174  2      no         0     0     0        0           1
20527  209    no         0     0     0        0           208
22470  448    no         14    50    0        149         287
20551  209    no         0     0     0        0           209
Sum    4476              179   461   0        1312        3003

R_R_R______R_R___R_________W_______R__R_R_WR________R__R___R____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
______W_R_R____R__R__R________________R__RR________R_____R_R____
___R_RR________WRR____R__R_R______R__________WR______R____R___R_
R________R______RR__R__RR___R______RR___RRR______R__R___RR_R____
RR______________W__R_______R_________R_____RRW_____R____RWR_____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
_____R___R_____R_R_R_RR_R___R_W_R__R__R___R______R____R_R_______
R______RR__R_R__RR_________R____R___R___RRR________R_______R___R
_________R__RR_______RR__R___R___R_____RRR____R_R_RR___R____R_W_
R___R_RRW___RRRR_RRRRRRRR_WRR_RR_RRRRRRRRRRRRRR__RRRRR__________
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
_____W______R__R____R_________R_____R_RWR_RR_R_______R____R_____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Shortly afterwards all the Gs are cleared and it gets back to doing useful work for a while. Sometimes "a while" can be 15 minutes, other times less than 1 second.

As a summary, the problem is that old processes that are shutting down but are still processing some long-lasting connections take up all open scoreboard slots.
It may be triggered in two ways:

a) when doing a graceful restart (apachectl graceful)

b) when the server load goes down in a way that causes httpd to stop some processes. This is particularly problematic because when the load increases again, httpd will try to start more processes. If the pattern repeats, the number of processes can grow quite a bit.

I think two things should be done:

1) Allow some extra scoreboard slots to be used by processes that are gracefully shutting down. This is necessary to fix a) and will help a bit with b). To avoid these extra processes taking too many resources, they should try to return resources to the OS as soon as possible.

2) When a process is doing idle shutdown in situation b) and httpd wants more active processes due to rising load, it should not start new processes but rather tell the finishing processes to abort shutdown and resume full operation. This helps with b) but not with a). It is also a lot more invasive to implement than 1).

My previous patch https://bz.apache.org/bugzilla/attachment.cgi?id=33158 did 1) to some extent by allowing re-use of some scoreboard slots. I will post a new patch in a minute.

As configuration, I recommend (this is true even if not using any patch):

MaxSpareThreads - MinSpareThreads >= 2 * ThreadsPerChild

Higher values of the difference may work better. This reduces the likelihood of situation b) appearing.

Created attachment 33749 [details]
Allow to use more scoreboard slots
The new patch goes a step further and allows in total 10 times as many
processes as configured by MaxRequestWorkers / ThreadsPerChild, though
ServerLimit is still honored. The number 10 is currently hard-coded but would
probably be made configurable in the end.
If using the patch, you should also set

ServerLimit >= 10 * MaxRequestWorkers / ThreadsPerChild

though a smaller value may make sense if you are short of RAM.
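Putting the two recommendations from these comments together, here is a sketch with purely illustrative numbers (16 active children of 25 threads each; none of these values are tuned advice for any particular server):

```apacheconf
# Example event MPM sizing (illustrative numbers only)
<IfModule mpm_event_module>
    ThreadsPerChild      25
    MaxRequestWorkers   400    # 16 active children x 25 threads
    # Keep a gap of at least 2 * ThreadsPerChild between the spare
    # limits to reduce process churn (situation b) above):
    MinSpareThreads      50
    MaxSpareThreads     150
    # With the patch applied, leave room for gracefully-stopping
    # children: ServerLimit >= 10 * MaxRequestWorkers / ThreadsPerChild
    ServerLimit         160
</IfModule>
```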
(In reply to Sander Hoentjen from comment #35) > After applying the patch I ran into "No space left on device: AH00023: > Couldn't create the proxy mutex" I haven't seen that issue without the patch. > > Log says: > [Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid > 140551243081696] (28)No space left on device: AH00023: Couldn't create the > proxy mutex > [Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid > 140551243081696] (28)No space left on device: AH02478: failed to create > proxy mutex > AH00016: Configuration Failed > You could try using different Mutex types. On Linux, pthread may work best. Or you may try to increase the allowed ressources, possibly shared memory. How that is done depends on your OS. Created attachment 33750 [details] same as above, but for trunk Attaching the same patch, but for trunk. (In reply to Stefan Fritsch from comment #40) > Created attachment 33749 [details] > Allow to use more scoreboad slots That patch is for 2.4 and also includes these commits from trunk: https://svn.apache.org/r1703241 https://svn.apache.org/r1705922 https://svn.apache.org/r1706523 https://svn.apache.org/r1738464 https://svn.apache.org/r1738466 https://svn.apache.org/r1738486 https://svn.apache.org/r1738631 https://svn.apache.org/r1738632 https://svn.apache.org/r1738633 https://svn.apache.org/r1738635 (In reply to Stefan Fritsch from comment #41) > (In reply to Sander Hoentjen from comment #35) > > After applying the patch I ran into "No space left on device: AH00023: > > Couldn't create the proxy mutex" I haven't seen that issue without the patch. 
> > > > Log says: > > [Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid > > 140551243081696] (28)No space left on device: AH00023: Couldn't create the > > proxy mutex > > [Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid > > 140551243081696] (28)No space left on device: AH02478: failed to create > > proxy mutex > > AH00016: Configuration Failed > > > > You could try using different Mutex types. On Linux, pthread may work best. > Or you may try to increase the allowed ressources, possibly shared memory. > How that is done depends on your OS. But is there anything in the patch that changes this? Because without your patch we never ran into that issue. Would the new patch behave differently in this regard? (In reply to Sander Hoentjen from comment #43) > Would the new patch behave differently in this regard? Your issue is probably not related to the patch. It is usually caused by an unclean shutdown of httpd (eg. kill -9), or a crash of the parent process (you should see this in the system logs), possibly if you upgraded the binaries while httpd was still running. The number of IPC SysV semaphores is limited on the system, if the previous ones were not cleanly deleted on shutdown, the new startup won't complete. As suggested by Stefan, you could use another Mutex mechanism (pthread) which does not leak on unclean shutdown (even if httpd is killed). I was able to manage this issue by reducing GracefulShutdownTimeout value and increasing MaxClients / MaxRequestWorkers value to make more room for Apache scoreboard . Also I reduce no of MaxKeepAliveRequests Apache global level. For more info :- https://www.tectut.com/2016/04/workaround-for-scoreboard-is-full-not-at-maxrequestworkers Hitting me as well and making lot of troubles. When is this going to be fixed? What it the recommendation for production server? Is it better if upgrade to 2.4.18? 2.4.10 backport? or going back to which one is the best for 14.04.5 LTS ? 
(In reply to Valentin Gjorgjioski from comment #46) > Hitting me as well and making lot of troubles. > Is it better if upgrade to 2.4.18? 2.4.10 backport? Upgrading to 2.4.18 hasn't helped everyone, but it did help me. The "centos-sclo-rh" repository was a solution in my situation. (In reply to Valentin Gjorgjioski from comment #46) > Hitting me as well and making lot of troubles. Hi Valentin, can you give us a bit more details about your use case? Does the max scoreboard issue happens regularly after certain events or randomly? What is your configuration (if you can share it) and httpd version? It would help a lot :) Luca Hi, This started happening after recent upgrade of Ubuntu. Apache was the same, and now it is the same. Ubuntu is 14.04.5 LTS, Apache is 2.4.7. This is high load, production server. Working for 1.5 year without any problems so far. Here is some log of that update, when the problem started: [UPGRADE] apache2:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 [UPGRADE] apache2-bin:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 [UPGRADE] apache2-data:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 [UPGRADE] apache2-mpm-worker:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 [UPGRADE] apache2-utils:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 [INSTALL] php5-mysqlnd:amd64 [UPGRADE] php5-cli:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-common:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-curl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-fpm:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-gd:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-intl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-pgsql:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-pspell:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-readline:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-recode:amd64 
5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-sqlite:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-tidy:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-xmlrpc:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5-xsl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 [UPGRADE] php5:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 Here is what I nailed it down to: 1. After this upgrade I needed to DISABLE the opcache in PHP, because problems started with fatal errors and segmentation faults with wordpress. 2. Because of the 1. the server got even higher load. 3. Higher load caused full scoreboard, and maxRequestWorkersk. What I found were two problems: 1. When high load occurs and MaxReqeustWorkers is hit, the apache stops responding (dies). It should slow down, should not accept new requests until free slot, but it shouldn't stop responding. I think I saw this reported somewhere else, e.g.: https://www.digitalocean.com/community/questions/apache2-crash-on-ubuntu-14-04-maxrequestworkers-issue 2. When I found a way to solve the problem with high load (enable wp cache plugins), now the second problem started, mainly on apache reload (log rotation) or even on regular basis WHEN MaxConnectionsPerChild is different from 0, and/or when pm.max_requests is different from 0. Why this is a problem - because children are dying after certain numbers of requests, and then they get stuck into "G" state, and never completing. This is filling your scoreboard and you are ending with that error. Once you set these to 0, problem more or less disappears. Workaround is setting these to 0, and hoping all scripts are good, no memory leaks, lowering memory usage in php.ini, and restaring the server each day (on logrotate restart and not reload). Very important trick that I learned in during this is also this one: ALWAYS restart php-fpm and apache together. Failing to do so leads to some instabilities. 
For me that workaround work, but I would like to hear why this happens, and how we can prevent it (especially the problem when Apache dies when MaxRequestWorkers is readched). Thanks a lot for the details Valentin, will try to add my thoughts inline: (In reply to Valentin Gjorgjioski from comment #49) > This started happening after recent upgrade of Ubuntu. Apache was the same, > and now it is the same. Ubuntu is 14.04.5 LTS, Apache is 2.4.7. This is a very old version of httpd, so if you could if would be really great to upgrade Trusty to something more recent to see the differences. > This is high load, production server. Working for 1.5 year without any > problems so far. > > Here is some log of that update, when the problem started: > > [UPGRADE] apache2:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 > [UPGRADE] apache2-bin:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 > [UPGRADE] apache2-data:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 > [UPGRADE] apache2-mpm-worker:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 > [UPGRADE] apache2-utils:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13 > [INSTALL] php5-mysqlnd:amd64 > [UPGRADE] php5-cli:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-common:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-curl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-fpm:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-gd:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-intl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-pgsql:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-pspell:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-readline:amd64 5.5.9+dfsg-1ubuntu4.14 -> > 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-recode:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-sqlite:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] 
php5-tidy:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-xmlrpc:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5-xsl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > [UPGRADE] php5:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19 > > > Here is what I nailed it down to: > 1. After this upgrade I needed to DISABLE the opcache in PHP, because > problems started with fatal errors and segmentation faults with wordpress. > 2. Because of the 1. the server got even higher load. > 3. Higher load caused full scoreboard, and maxRequestWorkersk. Stating the obvious but the httpd issue seems to be a consequence of all the php upgrades happened at the same time. Have you tried to rollback the last upgrade to see if the issue persists? > > What I found were two problems: > > 1. When high load occurs and MaxReqeustWorkers is hit, the apache stops > responding (dies). It should slow down, should not accept new requests until > free slot, but it shouldn't stop responding. I think I saw this reported > somewhere else, e.g.: > https://www.digitalocean.com/community/questions/apache2-crash-on-ubuntu-14- > 04-maxrequestworkers-issue Would you mind to include the logs and/or more details about this? Again it would be really great to know if the problem is the same with a more recent version of httpd. > > 2. When I found a way to solve the problem with high load (enable wp cache > plugins), now the second problem started, mainly on apache reload (log > rotation) or even on regular basis WHEN MaxConnectionsPerChild is different > from 0, and/or when pm.max_requests is different from 0. Why this is a > problem - because children are dying after certain numbers of requests, and > then they get stuck into "G" state, and never completing. This is filling > your scoreboard and you are ending with that error. Once you set these to 0, > problem more or less disappears. Do you have long timeouts (proxy, etc..) 
in your httpd configuration? This would be a useful information for us, it happened in the past that long proxy timeouts where exacerbating the issue that you described. > > Workaround is setting these to 0, and hoping all scripts are good, no memory > leaks, lowering memory usage in php.ini, and restaring the server each day > (on logrotate restart and not reload). > > Very important trick that I learned in during this is also this one: ALWAYS > restart php-fpm and apache together. Failing to do so leads to some > instabilities. > > For me that workaround work, but I would like to hear why this happens, and > how we can prevent it (especially the problem when Apache dies when > MaxRequestWorkers is readched). As written above it would be great to know more about the "Apache dies" part. Any detail that you could share with us would be really appreciated. Thanks! Luca Hi Luca, at the moment upgrading to trusty is not really an option, scared mostly from PHP7, and compatibility issues that might arise. Maybe next year. Haven't tried to rollback, was not even sure how to do that, and if that is easy. the link to digitalocean is another user, but I'm experiencing exactly. Unfortunately nothing in the log. Except the message stated there. I'm not sure what long timeout is, but probably default of (300seconds?!) for php-fpm using sockets is long. And yes, I guess this exacerbating the issue. No proxies defined. To me it seems like when some processes hang on php side, they are not getting killed on the apache side and connection is not released. Not even after those 5minutes. It gets stuck there and that's it. Apache dies means - apache processes are there, using no cpu, accepting no connections, and only restart helps. Nothing in the logs. I just went to prefork. I think it will be stable for now. I had tons of problems these 5 days, I don't know why I didn't switch to prefork earlier. It seems like e good workaround for me right now. 
Hi, now I believe I have a clear picture of what is going on:

1. I'm using FastCGI, apparently a dead project and no longer supported?!

2. I'm not sure whether there is a directive such as a connect timeout (fcgid has one). It seems either there is no timeout or it is quite large.

3. When Apache is hit hard, php-fpm is hit hard as well. In my case PHP-FPM started having problems doing its job when I disabled the opcache mentioned earlier, so it got stuck with a longer and longer queue. Apache kept sending requests to php-fpm even after php-fpm reached its limit (pm.max_children). In that scenario php-fpm stops opening new processes, but somehow the old processes get stuck, and Apache keeps going until the scoreboard is full. At that point CPU usage is very low; it looks like an I/O block, with many Apache processes (1500?!) waiting to open the socket while the socket is not available. However, at this point it is not very clear to me why Apache builds up the queue and the queue never empties - there is no high processor usage, and php-fpm/Apache seem stuck with nothing to be done. Could it be that Apache is not handling the sockets properly?

4. Even with prefork this happens, so it's not an mpm_event problem in this case.

Workaround for the next month or so: optimize the PHP work and lower the load so PHP-FPM can keep up. The Ubuntu upgrade, including a more stable PHP opcache, will also help here.

Long-term solution: there must be a general fix for this problem. Either it is time to move to nginx, or time to move to a better FastCGI module. By the way, what would you suggest at this point - what is the easiest migration path from FastCGI to another Apache module?

> However, at this point it is not very clear to me why Apache builds up the
> queue and the queue is not getting emptied - there is no high processor
> usage, it seems that php-fpm/apache got stuck and nothing can be done. Could
> be this apache not handling sockets properly?
I'd suggest starting a thread on users@httpd.apache.org. If you can get this error, you should be able to find some processes trying to exit but hanging on the way out waiting for requests to complete. Showing their backtrace with gdb (or pstack) will tell us exactly what they're doing. Your MPM configuration will also tell us if you have unnecessary process churn.

Created attachment 34201 [details]
Use all scoreboard entries up to ServerLimit, for trunk
New patch: This time use the whole scoreboard up to the configured ServerLimit. Also fixed some issues with the previous patch.
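The mechanism these patches address can be illustrated with a toy model (pure illustration, not httpd's actual code; all names, slot counts, and PIDs are invented): the scoreboard is a fixed array of per-process slots, and a child told to stop gracefully keeps its slot until its last long-running connection closes, so a graceful restart can leave no slot free for the new generation.

```python
# Toy model of scoreboard exhaustion under graceful restart.
# Not httpd code - a minimal sketch of the slot-accounting problem.

class Scoreboard:
    def __init__(self, server_limit):
        self.slots = [None] * server_limit  # None = free slot

    def start_process(self, pid):
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = {"pid": pid, "stopping": False}
                return i
        raise RuntimeError("AH00485: scoreboard is full")

    def stop_gracefully(self, pid):
        # The slot stays occupied while old connections drain.
        for slot in self.slots:
            if slot and slot["pid"] == pid:
                slot["stopping"] = True

    def reap_finished(self, pid):
        # Only when the child fully exits does its slot free up.
        for i, slot in enumerate(self.slots):
            if slot and slot["pid"] == pid:
                self.slots[i] = None

sb = Scoreboard(server_limit=4)
for pid in (100, 101, 102, 103):
    sb.start_process(pid)

# A graceful restart marks every child as stopping, but slow
# connections keep all four slots occupied...
for pid in (100, 101, 102, 103):
    sb.stop_gracefully(pid)

# ...so the new generation cannot get a slot until a child is reaped.
try:
    sb.start_process(200)
except RuntimeError as err:
    print(err)                 # scoreboard is full

sb.reap_finished(100)          # one old child finally exits
print(sb.start_process(200))   # slot 0 is free again
```

Option 1) from the earlier comment corresponds to making the array larger than the active-children limit; option 2) corresponds to flipping `stopping` back to `False` instead of starting a new process.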
Created attachment 34202 [details]
Use all scoreboard entries up to ServerLimit, for 2.4

Same as above, but for 2.4. This contains the trunk patch plus these commits from trunk: r1705922 r1706523 r1738464 r1738466 r1738486 r1738628 r1738631 r1738632 r1738633 r1738635 r1756848 r1757009 r1757011 r1757029 r1757030 r1757031 r1757056 r1757061

It would be really nice if someone could give this a try in a real-life setup.

From what I understand, it seems that Apache can't do anything about this; it seems to be correct behavior. It waits on the socket for its output. Timeouts are high (30 seconds), so on a busy server, if all php-fpm processes working on that socket are occupied (not returning results), the queue gets bigger and bigger. And indeed, every time this crash happened I found timeouts in the error logs (just for certain web sites), which I had missed previously.

It seems like the problem is in php-fpm, and that it started with my recent upgrade. The problems with the opcache started there as well, and I replaced mysql with mysqlnd in that update. So many changes, something was broken, but I think there is nothing wrong with Apache. The problem should be either in php-fpm or php-mysqlnd, or maybe in the web sites themselves.

Finally, it would be great if Apache provided the ability to limit the number of processes per virtual host (as php-fpm allows). That way it would also be much easier to isolate and solve the problem.

Hi Stefan, thanks for trying to solve the "scoreboard full" issue :) I've been hit by it badly today; the affected machine is a forward proxy and stalls the traffic almost completely. Some background info:

- event mpm on httpd 2.4.23
- forward proxy setup via mod_proxy
- 280 real users + other machines, ~370 clients
- server load is around 0.2, plenty of free RAM
- file descriptor limit is 1024
- logrotate sends a graceful restart every hour

If the problem occurs, httpd doesn't even respond to the /server-status page reliably.
A small script logs the /server-status page every 30s to disk. Specific case: logrotate sends a "graceful restart" at 13h. /server-status output at 13:04:24h: ------------------- Total accesses: 8801 - Total Traffic: 74.6 MB 75 requests currently being processed, 125 idle workers +---------------------------------------------------------------------------+ | | Connections | Threads | Async connections | | PID |-------------------+-------------+---------------------------------| | | total | accepting | busy | idle | writing | keep-alive | closing || |-------+-------+-----------+------+------+---------+------------+---------|| | 14906 | 7 | yes | 6 | 44 | 0 | 1 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 14959 | 9 | yes | 9 | 41 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15014 | 3 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15015 | 49 | yes | 50 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15329 | 3 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15893 | 15 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 17762 | 11 | yes | 10 | 40 | 0 | 1 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | Sum | 97 | | 75 | 125 | 0 | 2 | 0 || +---------------------------------------------------------------------------+ _________R_____R__________________R___R___R__R________R______R_R R_____R__R_________________R__R____RGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRR RRRRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGR__________R__R_____ _______R_RR_________R_RR_R____ ------------------- /server-status output at 13:15:25h: 
------------------- Total accesses: 12929 - Total Traffic: 90.9 MB 87 requests currently being processed, 63 idle workers +---------------------------------------------------------------------------+ | | Connections | Threads | Async connections | | PID |-------------------+-------------+---------------------------------| | | total | accepting | busy | idle | writing | keep-alive | closing || |-------+-------+-----------+------+------+---------+------------+---------|| | 14906 | 18 | yes | 16 | 34 | 0 | 2 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 14959 | 27 | yes | 26 | 24 | 0 | 2 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15014 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15015 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15329 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 18564 | 45 | yes | 45 | 5 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 17762 | 39 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 18078 | 44 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | Sum | 179 | | 87 | 63 | 0 | 4 | 0 || +---------------------------------------------------------------------------+ _____R__R___R_RR_RR_R_RR__R_____R_R___R_R_____R___W_RR__RR_RR__R RR__R_RR____RRRRR_R_RR___R_RR_RR____GGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGRRRRRR RRRRRRRRR_RRRRRRRRR_RRRR_RRRRRRRRRRR_R_RRRRRGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGG ------------------- /server-status at 
13:25:20h: (httpd hardly responding anymore): ------------------- Total accesses: 14630 - Total Traffic: 97.4 MB 50 requests currently being processed, 0 idle workers +---------------------------------------------------------------------------+ | | Connections | Threads | Async connections | | PID |-------------------+-------------+---------------------------------| | | total | accepting | busy | idle | writing | keep-alive | closing || |-------+-------+-----------+------+------+---------+------------+---------|| | 14906 | 36 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 14959 | 2 | yes | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15014 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15015 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 15329 | 2 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 18564 | 50 | yes | 50 | 0 | 0 | 1 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 17762 | 3 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | 18078 | 1 | no | 0 | 0 | 0 | 0 | 0 || |-------+-------+-----------+------+------+---------+------------+---------|| | Sum | 98 | | 50 | 0 | 0 | 1 | 0 || +---------------------------------------------------------------------------+ GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGWRRRRR RRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGG 
-------------------

I can provide more /server-status output if needed. After around 30 mins, the external "mon" watchdog kills httpd and restarts it. Traffic continues to flow.

httpd config:

-------------------
Timeout 300
KeepAliveTimeout 300

<IfModule mpm_event_module>
    # Number of concurrent connections is: ServerLimit * ThreadsPerChild
    # Result: 16 * 50 -> 800
    #
    StartServers            1
    ServerLimit            16
    ThreadLimit            50
    ThreadsPerChild        50
    MaxConnectionsPerChild 1000
</IfModule>

No other performance related settings.
-------------------

I've now increased ServerLimit to 32 and disabled logrotate as a quick fix. It holds so far. Occasionally I still see the "scoreboard full" message, even though there are just ~160 active connections and some processes are (still?) in the graceful shutdown state.

I'll put the patch from #55 on the productive machine tomorrow :o) It already runs on my own proxy and the one from my department. Anything else to watch out for? I can provide gdb backtraces if you tell me to look for something specific, too. Triggering a graceful restart during peak traffic might be a good test...

Cheers, Thomas

Another info about my setup: There are two other httpd instances running on different ports. One is using the event MPM, the other one prefork MPM. I didn't configure an explicit ScoreBoardFile, so the scoreboard is in anonymous shared memory. Could there be cross-talk of those three httpds?

Hi Stefan, the patch from #55 seems to make things scale a lot better. Also the status output is very helpful. ServerLimit was changed back to 16 before the tests. I did a graceful restart at 13:09:35h.
/server-status at 14:19:36h (*before* the next graceful restart): ----------------------- Total accesses: 23693 - Total Traffic: 200.0 MB 100 requests currently being processed, 150 idle workers +--------------------------------------------------------------------------------------------+ | | | | Connections | Threads | Async connections | | Slot | PID | Stopping |-------------------+-------------+--------------------------------| | | | | total | accepting | busy | idle | writing | keep-alive | closing | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |0 |19952 |yes (old |3 |no |0 |0 |0 |0 |0 | | | |gen) | | | | | | | | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |1 |20006 |yes (old |3 |no |0 |0 |0 |0 |0 | | | |gen) | | | | | | | | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |2 |20060 |yes (old |5 |no |0 |0 |0 |0 |0 | | | |gen) | | | | | | | | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |3 |20160 |yes (old |2 |no |0 |0 |0 |0 |0 | | | |gen) | | | | | | | | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |4 |20224 |yes (old |2 |no |0 |0 |0 |0 |0 | | | |gen) | | | | | | | | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |5 |20725 |no |2 |yes |2 |48 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |6 |27470 |no |50 |yes |50 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |7 |24389 |yes |3 |no |0 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |8 |27104 |no |18 |yes |18 |32 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |9 |27346 |no |3 |yes |3 
|47 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |10 |22579 |yes |2 |no |0 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |11 |27674 |no |29 |yes |27 |23 |0 |3 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |13 |25055 |yes |8 |no |0 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |14 |25350 |yes |2 |no |0 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |15 |25475 |yes |5 |no |0 |0 |0 |0 |0 | |------+-------+----------+-------+-----------+------+------+---------+------------+---------| |Sum |15 |10 |137 | |100 |150 |0 |3 |0 | +--------------------------------------------------------------------------------------------+ .G.G...............G............................................ ..............G.....G.....G.........G..............G............ .........G.....G...G..................GG........................ ...........................G........G.....................______ ___________R_______________R________________RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR.....................G............ .....G...G......___R____R_RR_R__R______RRRRR__R__R______RR__R_R_ _R________________R__________________R_____________RGG__RRRRRRR_ _RRRR___R____RR__RR____R__R_W__RRRRR_RRRGGGGGGGGGGGGGGG ----------------------- As you can see, there are still processes from "old gen" after one hour. This is due to long running HTTP CONNECT requests to google / dropbox / etc. Probably GracefulShutdownTimeout will help here, may be having a default value of one hour might make sense for httpd in general? Next graceful restart at 14:19:51h. 
Errors start to appear in the log two seconds later:

[Wed Oct 26 14:19:53.926229 2016] [mpm_event:error] [pid 19951:tid 3071850240] AH: scoreboard is full, not at MaxRequestWorkers. Increase ServerLimit.

/server-status at 14:20:06h:

-----------------------
Total accesses: 23744 - Total Traffic: 200.9 MB
8 requests currently being processed, 42 idle workers

+------+-------+---------------+-------+-----------+------+------+---------+------------+---------+
|      |       |               | Connections       | Threads     | Async connections              |
| Slot | PID   | Stopping      |-------------------+-------------+--------------------------------|
|      |       |               | total | accepting | busy | idle | writing | keep-alive | closing |
|------+-------+---------------+-------+-----------+------+------+---------+------------+---------|
|0     |19952  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|1     |20006  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|2     |20060  |yes (old gen)  |5      |no         |0     |0     |0        |0           |0        |
|3     |20160  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|4     |20224  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|5     |20725  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|6     |27470  |yes (old gen)  |42     |no         |0     |0     |0        |0           |0        |
|7     |24389  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|8     |27104  |yes (old gen)  |18     |no         |0     |0     |0        |0           |0        |
|9     |27346  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|10    |22579  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|11    |27674  |yes (old gen)  |24     |no         |0     |0     |0        |0           |0        |
|12    |28054  |no             |9      |yes        |8     |42    |0        |2           |0        |
|13    |25055  |yes (old gen)  |8      |no         |0     |0     |0        |0           |0        |
|14    |25350  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|15    |25475  |yes (old gen)  |5      |no         |0     |0     |0        |0           |0        |
|------+-------+---------------+-------+-----------+------+------+---------+------------+---------|
|Sum   |16     |15             |133    |           |8     |42    |0        |2           |0        |
+------+-------+---------------+-------+-----------+------+------+---------+------------+---------+

.G.G...............G............................................
..............G.....G.....G.........G..............G............
.........G.....G...G..................GG........................
...........................G........G...........................
...........G...............G................G.GGGGG.G.G..GGGGGG.
GGGGGGGGGGGGGG.GGGGGGGGGGG.GGG.....................G............
.....G...G......GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG_
______RRRR____RRRW_______________________________GGGGGGGGGGGGGGG
-----------------------

The forward proxy became unresponsive again. /server-status at 14:29:16h:

-----------------------
Total accesses: 24453 - Total Traffic: 226.8 MB
50 requests currently being processed, 0 idle workers

+------+-------+---------------+-------+-----------+------+------+---------+------------+---------+
|      |       |               | Connections       | Threads     | Async connections              |
| Slot | PID   | Stopping      |-------------------+-------------+--------------------------------|
|      |       |               | total | accepting | busy | idle | writing | keep-alive | closing |
|------+-------+---------------+-------+-----------+------+------+---------+------------+---------|
|0     |19952  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|1     |20006  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|2     |20060  |yes (old gen)  |5      |no         |0     |0     |0        |0           |0        |
|3     |20160  |yes (old gen)  |1      |no         |0     |0     |0        |0           |0        |
|4     |20224  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|5     |20725  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|6     |27470  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|7     |24389  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|8     |27104  |yes (old gen)  |1      |no         |0     |0     |0        |0           |0        |
|9     |27346  |yes (old gen)  |1      |no         |0     |0     |0        |0           |0        |
|10    |22579  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|11    |27674  |yes (old gen)  |3      |no         |0     |0     |0        |0           |0        |
|12    |28054  |no             |51     |yes        |50    |0     |0        |0           |1        |
|13    |25055  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|14    |25350  |yes (old gen)  |2      |no         |0     |0     |0        |0           |0        |
|15    |25475  |yes (old gen)  |4      |no         |0     |0     |0        |0           |0        |
|------+-------+---------------+-------+-----------+------+------+---------+------------+---------|
|Sum   |16     |15             |86     |           |50    |0     |0        |0           |1        |
+------+-------+---------------+-------+-----------+------+------+---------+------------+---------+

.G.G...............G............................................
..............G.....G.....G.........G..............G............
.........G.....G...G...................G........................
...........................G........G...........................
...........G...............G....................................
...........G.............G.........................G............
.....G..........GGGGGGGRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRWRGGGGGGGG
-----------------------

As you can see, there was plenty of room in the scoreboard now, but the process-list slots were used up by old processes serving just a handful of connections.

One option would be to increase ServerLimit to, say, 128, but that also raises the resource limits during normal operation. If I raise ServerLimit too much, I have to lower the thread count again. Sounds a bit like the prefork MPM...

Another option would be to add a config setting that excludes processes in graceful shutdown mode from the ServerLimit calculation. They probably don't consume a lot of resources, and we could have a GracefulShutdownTimeout of one hour to expire them, too.

Third option (my preferred one): have a separate GracefulShutdownLimit that is independent of ServerLimit. If there are too many processes, start killing off the oldest process on the graceful shutdown list. Processes in graceful shutdown mode would not count towards ServerLimit.

I've raised ServerLimit to 32 on the box again. The users can't be annoyed too much ;)

Cheers,
Thomas

PS: Forget the idea about cross-talk between anonymous shared memory segments from comment #58. It's not the case.

(In reply to Thomas Jarosch from comment #59)
> the patch from #55 seems to make things scale a lot better.
> Also the status output is very helpful.

Glad to hear that, and thanks for testing it.

> As you can see, there are still processes from "old gen" after one hour.
> This is due to long running HTTP CONNECT requests to google / dropbox / etc.

There is no way to determine whether such connections can be "safely" interrupted or whether they are in the middle of a long download.

> Probably GracefulShutdownTimeout will help here, may be
> having a default value of one hour might make sense
> for httpd in general?

Currently the children won't honor GracefulShutdownTimeout, but that should be added.
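For readers unfamiliar with the directive being discussed, a minimal config sketch (the one-hour value is merely the suggestion from the comments above, not an established default):

```
# Upper bound, in seconds, that a gracefully stopping httpd waits for
# remaining connections before exiting anyway; the default of 0 means
# wait indefinitely, which is what lets long CONNECT tunnels pin slots.
GracefulShutdownTimeout 3600
```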
> As you can see, there was plenty of room in the scoreboard now,
> but the process list slots were used up by old processes
> serving just a handful of connections.
>
> One option would be to increase ServerLimit to let's say 128,
> but that also raises the resource limits during normal operation.
> If I raise ServerLimit too much, I have to lower the thread count again.
> Sounds a bit like the prefork mpm...

During normal operation, the number of threads is limited by MaxRequestWorkers. The idea of my patch is that you can increase ServerLimit quite a bit without using too many resources. The processes serving old connections should terminate most of their threads and free most of their memory, so the resource usage should not be too high. But of course it depends on how many old connections are still open.

> Another option would be to add a config setting to ignore
> processes for the ServerLimit calculation if they are
> in graceful shutdown mode. They probably don't consume
> a lot of resources and we can have a GracefulShutdownTimeout
> of one hour to expire them, too.

You are confusing ServerLimit with MaxRequestWorkers here. While the latter is a number of threads and not processes, it does what you think ServerLimit should do.

> Third option (preferred one): Have an own GracefulShutdownLimit
> that's separate from ServerLimit. If we have too many processes,
> start killing of oldest process from the graceful shutdown list.
> Process in graceful shutdown mode don't count for ServerLimit.

Yes, we could do that, too. But first I need something like GracefulShutdownTimeout to work for the old child processes.

If you have any more experiences with the patch I am certainly interested, even if it has simply run for some time without (new) bugs exposed.

Cheers,
Stefan

A quick note about the patch (unfortunately I could not carry out my testing since a colleague reused the machine, resetting my local patches/work altogether...).
Anyway, there is possibly an issue with retained->total_daemons, which is incremented (unconditionally) whenever a child is created (make_child), but not always decremented when one finishes (server_main_loop, depending on whether or not it died smoothly and whether it still uses a scoreboard slot).

IOW, I think this hunk:

     ps->quiescing = 0;
+    retained->total_daemons--;

should probably be moved up here:

     ap_wait_or_timeout(&exitwhy, &status, &pid, pconf, ap_server_conf);
     if (pid.pid != -1) {
+        retained->total_daemons--;

Will restart my tests ASAP...

(In reply to Yann Ylavic from comment #61)
> Anyway, there is possibly an issue with retained->total_daemons which is
> incremented (unconditionally) whenever a child is created (make_child), but
> not always decremented when one finishes (server_main_loop, depending on
> whether or not it died smoothly and it still uses a scoreboard slot).
>
> IOW, I think this hunk:
>      ps->quiescing = 0;
> +    retained->total_daemons--;
>
> should probably be moved up here:
>      ap_wait_or_timeout(&exitwhy, &status, &pid, pconf, ap_server_conf);
>      if (pid.pid != -1) {
> +        retained->total_daemons--;

No, I think the code in the patch is correct: there is only one case where the code returns from the function before reaching the "if (child_slot >= 0) {" block, which contains the "retained->total_daemons--;" line. And in that case the whole server exits, so correct counting is no longer an issue.

On the other hand, total_daemons must not be decremented if child_slot < 0, because in that case the dead process was not a worker process (but e.g. a cgid process).

But this should be made clearer, either by rearranging the code or by adding some comments.

We have successfully used the patch from #55 for 50 days now on a mid-sized production server with 1-2 million hits per day. No issues encountered, and the previous issues disappeared (we think the original bug had been abused in a DoS attack, but we might be wrong on this).

Comment on attachment 34202 [details]
Use all scoreboard entries up to ServerLimit, for 2.4
This looks good. Should be proposed for backport!
Fixed in 2.4.25

Hi Stefan,

(In reply to Stefan Fritsch from comment #60)
> > the patch from #55 seems to make things scale a lot better.
> > Also the status output is very helpful.
>
> Glad to hear that and thanks for testing it.

Sorry, I didn't see your reply, as bugzilla didn't add me to CC: automatically. Which is rather odd, since that's the default setting. Back to the topic:

> > Probably GracefulShutdownTimeout will help here, may be
> > having a default value of one hour might make sense
> > for httpd in general?
>
> Currently the children won't honor GracefulShutdownTimeout. But that should
> be added.

Very nice.

> > Third option (preferred one): Have an own GracefulShutdownLimit
> > that's separate from ServerLimit. If we have too many processes,
> > start killing of oldest process from the graceful shutdown list.
> > Process in graceful shutdown mode don't count for ServerLimit.
>
> Yes, we could do that, too. But first I need something like
> GracefulShutdownTimeout to work for the old child processes.

OK. In the meantime I've decreased the ThreadLimit to 5 and increased the ServerLimit to 160 and more. The results with these settings are very good: no more user complaints (see below). Otherwise those long-running HTTP CONNECT sessions were still maxing out the total number of allowed processes.

> If you have any more experiences with the patch I am certainly interested.
> Even if it has simply run for some time without (new) bugs exposed.

The patch has been deployed to roughly 3,000 servers since November 2016, with workloads ranging from 10 to 400+ users. After applying your patch plus the ThreadLimit change, there were no more complaints :)

I've also diffed httpd 2.4.23 plus the patch against the version of the code that landed in 2.4.25, and it's exactly the same. I'm soon going to roll out 2.4.25 to those boxes.

Thanks again!
Thomas

*** Bug 56101 has been marked as a duplicate of this bug. ***
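The slot arithmetic behind the symptom can be sketched with illustrative numbers modeled on the 14:29:16h dump above (16 process slots, ~50 threads per child, 15 old-generation children still draining connections). This is only a back-of-the-envelope model, not httpd code:

```python
# Illustrative numbers modeled on the server-status dumps in this report.
server_limit = 16        # scoreboard process slots (ServerLimit)
threads_per_child = 50   # ThreadsPerChild
stopping_children = 15   # "yes (old gen)" rows still pinning a slot

# Each gracefully stopping child keeps its scoreboard slot until its last
# connection closes, so only the remaining slots can host full
# new-generation children that actually accept traffic.
active_slots = server_limit - stopping_children
usable_workers = active_slots * threads_per_child

print(active_slots, usable_workers)  # prints: 1 50
```

With a single accepting child left, the 50 busy threads in slot 12 are enough to make the proxy unresponsive even though MaxRequestWorkers would normally allow far more.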