Bug 53555

Summary: Scoreboard full error with event/ssl
Product: Apache httpd-2
Reporter: Alexander Strange <astrange>
Component: mpm_event
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED
Severity: major
CC: andru, bucky, chris, daniel.lemsing, dgallowa, friesoft, gjorgjioski, gregames, info, jim, leho, mike.williams, nikke, payam_hekmat, sander, sf, stephane, tez, thomas.jarosch, toscano.luca
Priority: P1
Keywords: FixedInTrunk
Version: 2.4.7
Target Milestone: ---
Hardware: Other
OS: Linux
Attachments: close keepalive connections if process is shutting down
exit some threads early during graceful shutdown of a process
Allow to use more scoreboard slots
same as above, but for trunk
Use all scoreboard entries up to ServerLimit, for trunk
Use all scoreboard entries up to ServerLimit, for 2.4

Description Alexander Strange 2012-07-17 02:21:21 UTC
A high-traffic web server using event MPM and mostly receiving HTTPS requests frequently got the error "scoreboard is full, not at MaxRequestWorkers" and showed very bad performance.

We fixed the issue by reverting from 2.4.2 to 2.2.22, still using event MPM.

Related httpd.conf settings:

 StartServers 16
 MinSpareThreads 4
 MaxSpareThreads 4
 ListenBacklog 4096
 Timeout 5

Unfortunately we don't have a capture of the server-status page, and increasing the log level didn't seem to show much.
Comment 1 Greg Ames 2013-01-31 18:54:25 UTC
This may be obvious, but the server-status page is a huge help in analyzing scoreboard full issues.  Do you remember what it looked like?  What state codes were most prevalent?  The scoreboard can fill up quickly if a back-end server stalls.
Comment 2 Niklas Edmundsson 2013-05-05 21:33:14 UTC
We've seen AH00485: scoreboard is full, not at MaxRequestWorkers on 2.4.4 with the event MPM, no SSL involved.

Haven't figured out the exact conditions yet, but involved are:
* High/varying load, causing worker processes to be spawned and killed,
  filling up the scoreboard with G:s.
* Server reloads due to config changes.

I suspect the root cause is that server processes are flagged for killing, but when they're needed again later, a new process is created instead of reviving the existing one. If you have a lot of slow connections (this is a file archive serving DVD images etc.), processes can add up.

The scoreboard can look like this after a while:

----------8<----------------
PID	Connections 	Threads	Async connections
total	accepting	busy	idle	writing	keep-alive	closing
14465	94	no	0	0	72	0	21
28881	132	yes	0	0	79	0	6
23632	582	no	0	0	523	0	51
32314	43	no	0	0	28	0	15
13766	577	no	0	0	564	1	2
337	42	no	0	0	28	0	13
19580	39	no	0	0	27	0	12
30603	478	no	0	0	424	0	52
32163	177	no	0	0	136	0	24
16159	429	no	0	0	374	0	54
15376	93	no	0	0	45	0	47
32478	124	no	0	0	86	0	38
30604	395	yes	2	48	390	3	0
30667	61	no	0	0	38	0	17
31569	58	no	0	0	27	0	20
19614	161	no	0	0	117	0	44
32286	253	yes	0	50	252	0	0
17643	454	yes	2	48	445	0	3
23353	49	no	0	0	27	2	20
31581	145	no	0	0	106	0	34
Sum	4386	 	4	146	3788	6	473

LGLGGGLLGLGLGLLLLLGLGLGLLLLLGLLLLLLLLLLLLLLGGGLGLLGGGGLGGLGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGLGLGGGGGLGLLGGGLGLLLLLLGGGLLLLLGGLGLGLLLGGGLGLLLGLGLLGL
LGGLLLLGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGL
GGLLGGGLLGGLGLGGGGGLLGGGGLGLLLLLLGGGGGLGGGGGGLLLLLGLLLGLLLLLLLGL
LLLGLLLGLGLGGGLGLGGGGLLGLGGLLLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GLGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGG
GGLLLLLGLLLLGLLLLGLGLLLLGGLGLLLLLGLLGLLLLLLLLLLGGLLLGLGGGGGGGGGG
GGGGLGGGGLGGGLGGGGGGGLGGGGGGGGGGGGGGGGGGGGLGGGGGLLGGGGLLGGGLGLLG
GGGLGGLLGGGGLGGLGGLGLGGL____________________WW__________________
__________GGGGGGGGGGGGGGGGGGGGGGGGGGLGGLGLGGGGGGGGGGGGGGGGGGLLGG
LGLLGLGLGGGLLGLGGLLLLGLGGGLGLLGGLGLLGLGLLGLGGGLGGGGGGGGGGGGGLGGG
GLGGLGGGGGLGGGGGGGGGGLGLGGLLGLGG________________________________
____________________W___W_______________________________________
____GLGLLLLLLLGGGLLGGLLLGGLLLLLLGGLGLLGGLLGGGGLGLLLGGGLLGGLGLGGG
LLGLGGLLLLGLGLLGGGGGGLLGGGGGGLLGGGGLGLGL
----------8<----------------
Comment 3 Greg Ames 2013-06-18 15:42:46 UTC
(In reply to Niklas Edmundsson from comment #2)
> We've seen AH00485: scoreboard is full, not at MaxRequestWorkers on 2.4.4
> with the event MPM, no SSL involved.

> PID	Connections 	Threads	Async connections
> total	accepting	busy	idle	writing	keep-alive	closing
> 14465	94	no	0	0	72	0	21
> 28881	132	yes	0	0	79	0	6
> 23632	582	no	0	0	523	0	51
> 32314	43	no	0	0	28	0	15
> 13766	577	no	0	0	564	1	2

> LGLGGGLLGLGLGLLLLLGLGLGLLLLLGLLLLLLLLLLLLLLGGGLGLLGGGGLGGLGGGGGG
> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGLGGGGGGGGGGGGGGGGGGGGGGGGG

OK, there are many worker processes that hang while trying to shut down, probably due to traffic fluctuations. The only two states we see in the scoreboard are G and L.  The G should be transient and can probably be ignored.  The Ls look like the cause of the hangs.

L means the threads are hung while trying to write to the log.  Normally you never see this with logs on a reasonably fast local hard drive.  Are the log files NFS mounted or something like that?

Greg
Comment 4 Niklas Edmundsson 2013-06-18 18:21:55 UTC
> OK, there are many worker processes that hang while trying to shut down,
> probably due to traffic fluctuations. The only two states we see in the
> scoreboard are G and L.  The G should be transient and can probably be
> ignored.  The Ls look like the cause of the hangs.

Transient for the G:s can mean days in this case, think slooow ADSL connection downloading a DVD image...

> L means the threads are hung while trying to write to the log.  Normally you
> never see this with logs on a reasonably fast local hard drive.  Are the log
> files NFS mounted or something like that?

No, local filesystem. But I'll have to double check that we're not doing anything overly clever on the log front...
Comment 5 Rainer Jung 2013-06-18 19:22:44 UTC
Greg,

I didn't check the code, but to me it seems that a "G" letter does not mean there's no more work going on. The server-status on our own www.(eu|us).apache.org shows the same G plus L mixture for about a minute (varying) whenever a process dies due to MaxConnectionsPerChild. When I checked such processes, they had open client connections and were still sending data to the client. So it was correct that they were still around, but the status letters "G" or "L" for those gracefully exiting children don't show those details.
Comment 6 Greg Ames 2013-06-19 13:46:35 UTC
I looked at apache.org and the code.  The Ls are normal when a gracefully exiting process had an active thread.  Sorry for jumping to conclusions.

close_listeners sets all the G states during graceful shutdown.  (Unfortunately this means we can no longer see which threads are active vs. idle - not sure having the G state is worth it.) Any active threads which finish their requests will log and set the L state before exiting.  The Gs that remain could represent exited threads or active requests - we can't tell from server-status.

The processes that didn't exit have active connections. If those are due to slow downloads, maybe the thing to do is to tune for fewer or no graceful process terminations when the traffic drops, by raising MaxSpareThreads.
Comment 7 Daniel Lemsing 2014-01-24 01:01:42 UTC
Recently hit this error in a high traffic production web server (Apache 2.4.6) leading to an outage.

Has anyone had success in overcoming this issue by amending the Apache configuration?

If so, what did you change?

Also, can anyone offer any suggestions on what triggers this issue?

Being a production server, rolling back to 2.2.22 is not preferable.
Comment 8 Niklas Edmundsson 2014-01-24 08:19:07 UTC
One of the gotchas with this is that the scoreboard seems to be sized to cater for MaxRequestWorkers, with no margins for server reloads etc.

In our case, where it can take days for processes to exit if people are downloading large files over slow connections, we can easily have a situation where multiple server reloads (due to config changes etc.) cause the scoreboard to fill up with old server processes in graceful-shutdown mode, leaving no space for new processes to do any actual work.

I can see a few ways to work around this:

1) Simply make the scoreboard bigger. I'd like a default size multiplier of 2 for the event MPM, but configurable so we can set it to 4 or something for our setup. An alternative is to set a ridiculously large MaxRequestWorkers to get a big enough scoreboard, but one DoS and we're out of scoreboard anyway.

2) Kill off the oldest gracefully-exiting processes when we can't spawn a new process to do useful work.

The ideal solution is probably a mix of these two.

I'm also wondering whether this is somehow related to the "server dies for a while when doing reload" issue. We're still at httpd 2.4.6 though, so I can't say for certain that some of these issues aren't already fixed.
Comment 9 Ryan Egesdahl 2014-04-05 03:00:13 UTC
In case it matters any, this problem appears to be specific to the Event MPM. I had it happening on a server, and when I switched it to the Worker MPM, it stopped. However, what I did notice is that the same server periodically had all of its workers taken up with requests, so that may be relevant to the problem as well.
Comment 10 anonymous 2014-06-20 11:26:48 UTC
I see similar behavior to what is described here (with no SSL involved) with httpd 2.4.9.

I get a lot of AH00485: "scoreboard is full, not at MaxRequestWorkers"; httpd is still serving requests, but one worker is in gracefully finishing state and is taking 100% CPU.

The worker was in this state for about 24h, until I kill(1)ed it.

Threads stats:

__________________W_____________________________________________
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Unfortunately I don't have any other info from the status page.

strace of the worker shows an epoll_wait infinite loop:

    [...]
    epoll_wait(10, {}, 128, 100)            = 0
    epoll_wait(10, {}, 128, 100)            = 0
    epoll_wait(10, {}, 128, 100)            = 0
    [...]

mpm event config:

    StartServers         1
    ServerLimit          4
    MinSpareThreads      4
    MaxRequestWorkers    128
    ThreadsPerChild      64
    ThreadLimit          64
    AsyncRequestWorkerFactor 4
Comment 11 Andrei Boros 2014-12-08 12:17:56 UTC
Apache 2.4.10 on Slackware Linux 14.1 x86_64 platform.

I am seeing this about once a minute in the logs:
AH00485: scoreboard is full, not at MaxRequestWorkers

I was able to recover only by a forced restart (stop then start).
Comment 12 ScottE 2015-06-03 00:42:35 UTC
After migrating from worker MPM to event MPM with Apache 2.4.7 we are seeing this same problem.

Server version: Apache/2.4.7 (Ubuntu)
Ubuntu Trusty 14.04.2 LTS
Linux 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

We explicitly moved to event MPM for this workload, which is a proxy of thousands of mostly-idle HTTP Keep-Alive connections - since event MPM doesn't require a thread per Keep-Alive connection. Although our number of clients is fairly consistent, and we have MaxConnectionsPerChild=0, we observe Apache processes going into GGGGGG state until eventually Apache no longer accepts connections.

If we set MinSpareThreads and MaxSpareThreads equal to MaxRequestWorkers (so Apache doesn't attempt to scale down processes), the issue goes away (as expected, but validates (maybe?) this has to do with Apache scale-down).

Since client connections can be connected for hours or days, Apache processes stay in this state for a very long time, eventually rejecting client connections and becoming wedged.

Our clients are not browsers - Apache is being used for a mid-tier load balancer/proxy with client connections that are very long lived (long Keep-Alive times).

248 requests/sec - 0.7 MB/second - 3114 B/request
2 requests currently being processed, 38 idle workers
PID	Connections	Threads	Async connections
total	accepting	busy	idle	writing	keep-alive	closing
28483	1642	no	0	0	0	1642	0
29672	553	yes	1	19	0	552	0
29696	9	no	0	0	0	9	0
29588	173	no	0	0	0	173	0
29618	1	no	0	0	0	1	0
29644	6	no	0	0	0	6	0
29719	30	no	0	0	0	30	0
29743	237	yes	1	19	0	236	0
Sum	2651	 	2	38	0	2649	0
GGGGGGGGGGGGGGGGGGGG________W___________GGGGGGGGGGGGGGGGGGGGGGWG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGG________W___________................................
........
Comment 13 Olivier Jaquemet 2015-06-05 11:36:29 UTC
We are seeing the same symptoms here:

Server Version: Apache/2.4.7 (Ubuntu) SVN/1.8.8 mod_jk/1.2.37 OpenSSL/1.0.1f
Ubuntu 14.04.2 LTS
Linux 3.13.0-52-generic #86-Ubuntu SMP Mon May 4 04:32:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Many log entries like:
[mpm_event:error] [pid 6332:tid 140558940702592] AH00485: scoreboard is full, not at MaxRequestWorkers

From the server status page:

Right after start:
__RR___________R________________________W__________________W____
___________.....................................................
......................

After one hour:

___________________W_____GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGG_______W___W_____________............................
......................

Two hours later:

GGGGGGGGGGGGGGGGGGGGGGGGGW_W_____W________W____W__GGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGG

Is there anything we can provide to help diagnose the issue?

Do you know of any workaround through configuration?
Comment 14 ScottE 2015-06-05 16:38:39 UTC
In case others find it useful, the approach we used to mitigate this involved several things:

1. Increased MinSpareThreads and MaxSpareThreads, as well as the range between them. By making Apache less aggressive about scaling the number of servers down, it's less likely to run into this issue. Our new values are:

   MinSpareThreads = MaxRequestWorkers / 4
   MaxSpareThreads = MinSpareThreads * 3

2. Lowered MaxKeepAliveRequests. By looking at a histogram of request counts per connection on an equivalent Apache running with worker MPM (first value in Acc column), I found a very long tail of few connections out to our old value, but a clear cluster at the lower end. Our new MaxKeepAliveRequests is a bit beyond the critical-mass cluster, but significantly lower than the old value. This will allow servers to recycle quicker when they scale down, but not cause any significant impact to client connections, since the relative number of connections we'll close early is small.

3. Increased AsyncRequestWorkerFactor. When Apache servers are scaling down (in Gracefully Finishing state), this allows other servers to pick up the slack by handling a larger number of total client connections (in HTTP Keep-Alive, this does not increase the number of workers), where before these processes had reached their limit of connections and were rejecting new ones. Event MPM does a reasonably good job of spreading load between processes, and with our larger spare threads range we now tend to have more alive processes as well.

We also considered lowering KeepAliveTimeout, using a similar histogram to the one I built for MaxKeepAliveRequests from a worker MPM configuration (using the SS column as a reasonable analog). That histogram showed a nice distribution for us, so lowering this would have affected clients without helping for this workload.

These are the values that worked for us, with our workload, to mitigate this issue. Of course your workload and values will be different, but this may be a reasonable strategy to try as well.
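
To make that strategy concrete, a purely illustrative event MPM snippet following the same ratios might look like the one below; the numbers are hypothetical and need to be sized for your own workload.

    # Hypothetical values only - size these for your own workload.
    ThreadsPerChild          25
    MaxRequestWorkers        400
    # MinSpareThreads = MaxRequestWorkers / 4
    MinSpareThreads          100
    # MaxSpareThreads = MinSpareThreads * 3
    MaxSpareThreads          300
    # lowered from a much larger previous value
    MaxKeepAliveRequests     500
    # raised from the default of 2
    AsyncRequestWorkerFactor 4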
Comment 15 Leho Kraav @lkraav 2015-08-08 13:29:43 UTC
2.4.16 and the following configuration hits scoreboard full with 3-4 reloads

    StartServers        2
    MinSpareThreads     50
    MaxSpareThreads     150
    ThreadsPerChild     25
    MaxRequestWorkers   200
    MaxConnectionsPerChild  10000

Any advice?
Comment 16 gobbledance 2015-08-25 15:09:19 UTC
This is certainly a bug and not a configuration issue. I have had this error happen with the default (Debian) configuration and other people online report the same. I have had this happen with mpm_event and mpm_worker.

It's very reproducible. It happens with almost any thread related settings I have tried. It stops new requests from being served and is a serious problem.

There is some bug with the way Apache handles its servers/threads. This is not something that can be fixed by tweaking the configuration. At best it might be mitigated by setting:

StartServers           1
ServerLimit            X
ThreadsPerChild        XXX
ThreadLimit            <ThreadsPerChild>
MaxRequestWorkers      <ServerLimit * ThreadLimit>
MinSpareThreads        <MaxRequestWorkers>
MaxSpareThreads        <MaxRequestWorkers>
MaxRequestsPerChild    0

In other words, make it so a thread stays alive forever and therefore the buggy part of the code that is responsible for killing and reusing threads is never hit. Of course this requires always using the maximum amount of RAM since threads never die even when there is no traffic.
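
For example, with hypothetical numbers (examples only, not a recommendation), the placeholders could be filled in like this:

    StartServers           1
    ServerLimit            4
    ThreadsPerChild        64
    # ThreadLimit = ThreadsPerChild
    ThreadLimit            64
    # MaxRequestWorkers = ServerLimit * ThreadLimit
    MaxRequestWorkers      256
    # MinSpareThreads = MaxSpareThreads = MaxRequestWorkers
    MinSpareThreads        256
    MaxSpareThreads        256
    MaxRequestsPerChild    0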
Comment 17 Leho Kraav @lkraav 2015-08-25 15:50:48 UTC
(In reply to gobbledance from comment #16)
> This is certainly a bug and not a configuration issue. I have had this error
> happen with the default (Debian) configuration and other people online
> report the same. I have had this happen with mpm_event and mpm_worker.
> 
> It's very reproducible. It happens with almost any thread related settings I
> have tried. It stops new requests from being served and is a serious problem.

I have found no way around it with a variety of worker configuration parameters. Looks like the best bet would be to have fail2ban or similar monitor the error_log and restart the server when scoreboard hits the DoS condition.
Comment 18 Peter 2015-09-09 10:06:25 UTC
I'm also affected by this bug running Apache/2.4.7 (Ubuntu) on 14.04. As a hotfix, I set up a logfile watch daemon that force-restarts apache2 if the line shows up in the error.log.

Has anyone tested this with the current stable release 2.4.16?
Comment 19 Stefan Fritsch 2015-09-29 22:17:45 UTC
(In reply to ScottE from comment #12)
> Our clients are not browsers - Apache is being used for a mid-tier load
> balancer/proxy with client connections that are very long lived (long
> Keep-Alive times).

This seems to be a problem that should not be too difficult to fix. When a process is shutting down, it should close its keepalive connections. Can you please check if the attached patch helps?


The case where long-running transfers are keeping a process from shutting down is much more difficult to fix.
Comment 20 Stefan Fritsch 2015-09-29 22:18:15 UTC
Created attachment 33154 [details]
close keepalive connections if process is shutting down
Comment 21 Stefan Fritsch 2015-10-03 15:31:46 UTC
Created attachment 33158 [details]
exit some threads early during graceful shutdown of a process

The attached diff against the 2.4.x branch makes unneeded threads exit earlier during graceful shutdown of a process. This then allows new processes to use the freed scoreboard slots.

I am interested in real-life experiences with this patch. It has two known problems, though:

- If httpd is shut down (ungracefully) while there are some old processes around serving long lasting requests, those processes won't die peacefully but will be SIGKILLed by the parent after 10 seconds.

- server-status shows incomplete information (that is, even more incomplete than in 2.4 ;) )
Comment 22 bucky 2015-10-03 23:28:29 UTC
I have applied the patch on our own production server, which experiences this problem sometimes twice a day, and sometimes not for a week or so.

So now we wait. I will report immediately if the problem recurs, and I will also report in a week if the problem does not recur.

PS: If "Graceful, but sigkill after 10 seconds" were an actual option, I would probably use it all the time.
Comment 23 Yann Ylavic 2015-10-05 08:37:17 UTC
(In reply to Stefan Fritsch from comment #21)
> 
> - If httpd is shut down (ungracefully) while there are some old processes
> around serving long lasting requests, those processes won't die peacefully
> but will be SIGKILLed by the parent after 10 seconds.

Wasn't that already the case for ungraceful stop/restart?

> 
> - server-status shows incomplete information (that is, even more incomplete
> than in 2.4 ;) )

How about not setting SERVER_GRACEFUL in close_listeners() and worker_thread()?
The old generation's state could be relevant, since the new generation does not "steal" the scoreboard now (until the old worker exits).
Comment 24 Stefan Fritsch 2015-10-05 22:25:35 UTC
(In reply to bucky from comment #22)
> I have applied the patch on our own production server, which experiences
> this problem sometimes twice a day, and sometimes not for a week or so.

Thanks for that already.


(In reply to Yann Ylavic from comment #23)
> (In reply to Stefan Fritsch from comment #21)
> > - If httpd is shut down (ungracefully) while there are some old processes
> > around serving long lasting requests, those processes won't die peacefully
> > but will be SIGKILLed by the parent after 10 seconds.
> 
> Wasn't that already the case for ungraceful stop/restart?

Normally, those child processes should react to the SIGTERM that is sent first. But that is currently broken by my patch.


> > - server-status shows incomplete information (that is, even more incomplete
> > than in 2.4 ;) )
> 
> How about not setting SERVER_GRACEFUL in close_listeners() and
> worker_thread()?
> The old generation's state could be relevent, since the new generation does
> not "steal" the scoreboard now (until the old worker exits).

Yes, that would probably be better, I'll have to test that. But it would not fix the incompleteness I was referring to: the old and the new process have only one process slot in the scoreboard, which makes the async overview table show sometimes the info from the old and sometimes from the new process, depending on who updated it last.
Comment 25 Yann Ylavic 2015-10-05 22:40:15 UTC
(In reply to Stefan Fritsch from comment #24)
> 
> (In reply to Yann Ylavic from comment #23)
> > How about not setting SERVER_GRACEFUL in close_listeners() and
> > worker_thread()?
> > The old generation's state could be relevent, since the new generation does
> > not "steal" the scoreboard now (until the old worker exits).
> 
> Yes, that would proabaly be better, I'll have to test that. But it would not
> fix the incompleteness I was referring to: The old and the new process have
> only one process slot in the scoreboard, which makes the async overview
> table show sometimes the info from the old and sometimes from the new
> process, depending on who updated it last.

It seems to me that the new generation's worker threads are not started now unless their scoreboard slot is marked SERVER_DEAD (was also SERVER_GRACEFUL before attachment 33158 [details]).
So AIUI, there shouldn't be two workers using the same slot.
Comment 26 Stefan Fritsch 2015-10-05 23:00:30 UTC
(In reply to Yann Ylavic from comment #25)

This technical discussion has been moved to the dev mailing list.
Comment 27 bucky 2015-10-10 22:59:30 UTC
It's been a week.

The scoreboard errors haven't stopped altogether. Every so often I still get one a second for a short time, but now they last for about 1 or 2 minutes, and that's it.

I haven't gotten any lockups since I applied the patch.
Comment 28 Leho Kraav @lkraav 2015-10-11 07:33:53 UTC
mod_h2 did some significant cleanups for resource handling in the 0.9.x branch. "Scoreboard full" errors seem to have been completely eliminated for me. I now get uptimes of several weeks with no issues. So it looks like external modules' individual cleanup abilities are directly related to this issue.
Comment 29 bucky 2015-10-11 22:40:15 UTC
I'm confused. To my knowledge, mod_h2 is a 3rd party module. Is it somehow an integral part of the latest httpd (2.4.16)?
Comment 30 Leho Kraav @lkraav 2015-10-12 05:44:27 UTC
(In reply to bucky from comment #29)
> I'm confused. To my knowledge, mod_h2 is a 3rd party module. Is it somehow
> an integral part of the latest httpd (2.4.16)?

Yes, it is already part of trunk and backported to 2.4.x.
Comment 31 Yann Ylavic 2015-10-12 07:18:05 UTC
(In reply to Leho Kraav @lkraav from comment #28)
> mod_h2 did some significant cleanups for resource handling in the 0.9.x
> branch. "Scoreboard full" errors seem to have been completely eliminated for
> me.

This may be related, but mod_http2 (being released in 2.4.17) has its own connection handling (somewhat apart from the MPM, for now) and shouldn't be seen as a workaround for this issue.
The more testing Stefan's proposed patch (for MPM event) gets without mod_http2, the quicker it will be backported into a release.
Comment 32 Stefan Eissing 2015-10-12 07:36:41 UTC
The fixes I did in mod_http2, mentioned by Leho, were just related to the fact that early 0.9.x versions of that module did not properly mark connections for reclaiming, so cleanup work was not run all the time, leading to memory leaks and wasted scoreboard handles.

That has been fixed in mod_http2 alone and does not affect other connections. Since the bug happens without the module as well, its presence is not a mitigation.

If the patch by Stefan does not fix it, we should review again if there are races that prevent cleanup from happening in the HTTP/1.1 cases.
Comment 33 Thierry Bastian 2015-11-02 18:28:43 UTC
We got into a situation where the users of our product were stuck with processes in the G state. We get severe performance issues in those cases. We tried the patch https://bz.apache.org/bugzilla/attachment.cgi?id=33158&action=diff on a couple of installs and it made things much, much better. On one install it would get stuck with 2000 clients coming in at roughly the same time. Now it can handle 10K gracefully.
Hope that helps.
Comment 34 Leho Kraav @lkraav 2016-03-25 17:20:42 UTC
I'm hitting this on a production server with 2.4.18 now. Can't apply custom patches here.

ServerLimit 30
MaxRequestWorkers 30
MaxConnectionsPerChild 600
KeepAlive On
KeepAliveTimeout 1
MaxKeepAliveRequests 20
Timeout 50

mod_h2 isn't enabled here.

From the above discussion, I can't get a clear indication of whether any core developers have confirmed this to be a bug or a configuration issue.
Comment 35 Sander Hoentjen 2016-03-29 07:35:20 UTC
After applying the patch I ran into "No space left on device: AH00023: Couldn't create the proxy mutex". I haven't seen that issue without the patch.

Log says:
[Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid 140551243081696] (28)No space left on device: AH00023: Couldn't create the proxy mutex
[Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid 140551243081696] (28)No space left on device: AH02478: failed to create proxy mutex
AH00016: Configuration Failed

# ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 0          root       600        1
0x00000000 65537      root       600        1
0x00000000 131074     apache     600        1
0x7a00179d 59899907   zabbix     600        13
0x00000000 3866628    apache     600        1
0x00000000 3899397    apache     600        1
0x00000000 3932166    apache     600        1
0x00000000 21397511   apache     600        1
0x00000000 21495816   apache     600        1
0x00000000 21528585   apache     600        1
0x00000000 21561354   apache     600        1
0x00000000 21594123   apache     600        1
0x00000000 21626892   apache     600        1
0x00000000 21659661   apache     600        1
0x00000000 29294606   apache     600        1
0x00000000 29327375   apache     600        1
0x00000000 29360144   apache     600        1
0x00000000 29392913   apache     600        1
0x00000000 29425682   apache     600        1
0x00000000 29458451   apache     600        1
0x00000000 29884436   apache     600        1
0x00000000 29917205   apache     600        1
0x00000000 29949974   apache     600        1
0x00000000 29982743   apache     600        1
0x00000000 30015512   apache     600        1
0x00000000 30048281   apache     600        1
0x00000000 30310426   apache     600        1
0x00000000 30343195   apache     600        1
0x00000000 30375964   apache     600        1
0x00000000 30408733   apache     600        1
0x00000000 30441502   apache     600        1
0x00000000 30474271   apache     600        1
0x00000000 30736416   apache     600        1
0x00000000 30769185   apache     600        1
0x00000000 30801954   apache     600        1
0x00000000 30834723   apache     600        1
0x00000000 30867492   apache     600        1
0x00000000 30900261   apache     600        1
0x00000000 30998566   apache     600        1
0x00000000 31031335   apache     600        1
0x00000000 31064104   apache     600        1
0x00000000 31096873   apache     600        1
0x00000000 31129642   apache     600        1
0x00000000 31162411   apache     600        1
0x00000000 31260716   apache     600        1
0x00000000 31293485   apache     600        1
0x00000000 31326254   apache     600        1
0x00000000 31359023   apache     600        1
0x00000000 31391792   apache     600        1
0x00000000 31424561   apache     600        1
0x00000000 37257266   apache     600        1
0x00000000 37290035   apache     600        1
0x00000000 37322804   apache     600        1
0x00000000 37355573   apache     600        1
0x00000000 37388342   apache     600        1
0x00000000 37421111   apache     600        1
0x00000000 37519416   apache     600        1
0x00000000 37552185   apache     600        1
0x00000000 37584954   apache     600        1
0x00000000 37617723   apache     600        1
0x00000000 37650492   apache     600        1
0x00000000 37683261   apache     600        1
0x00000000 37781566   apache     600        1
0x00000000 37814335   apache     600        1
0x00000000 37847104   apache     600        1
0x00000000 37879873   apache     600        1
0x00000000 37912642   apache     600        1
0x00000000 37945411   apache     600        1
0x00000000 38043716   apache     600        1
0x00000000 38076485   apache     600        1
0x00000000 38109254   apache     600        1
0x00000000 38142023   apache     600        1
0x00000000 38174792   apache     600        1
0x00000000 38207561   apache     600        1
0x00000000 41091146   apache     600        1
0x00000000 41123915   apache     600        1
0x00000000 41156684   apache     600        1
0x00000000 41189453   apache     600        1
0x00000000 41222222   apache     600        1
0x00000000 41254991   apache     600        1
0x00000000 44466256   apache     600        1
0x00000000 44499025   apache     600        1
0x00000000 44531794   apache     600        1
0x00000000 44564563   apache     600        1
0x00000000 44597332   apache     600        1
0x00000000 44630101   apache     600        1
0x00000000 49315926   apache     600        1
0x00000000 49348695   apache     600        1
0x00000000 49381464   apache     600        1
0x00000000 49414233   apache     600        1
0x00000000 49447002   apache     600        1
0x00000000 49479771   apache     600        1
0x00000000 49578076   apache     600        1
0x00000000 49610845   apache     600        1
0x00000000 49643614   apache     600        1
0x00000000 49676383   apache     600        1
0x00000000 49709152   apache     600        1
0x00000000 49741921   apache     600        1
0x00000000 55574626   apache     600        1
0x00000000 55607395   apache     600        1
0x00000000 55640164   apache     600        1
0x00000000 55672933   apache     600        1
0x00000000 55705702   apache     600        1
0x00000000 55738471   apache     600        1
0x00000000 58785896   apache     600        1
0x00000000 58818665   apache     600        1
0x00000000 58851434   apache     600        1
0x00000000 58884203   apache     600        1
0x00000000 58916972   apache     600        1
0x00000000 58949741   apache     600        1
0x00000000 61571182   apache     600        1
0x00000000 61603951   apache     600        1
0x00000000 61636720   apache     600        1
0x00000000 61669489   apache     600        1
0x00000000 61702258   apache     600        1
0x00000000 61735027   apache     600        1
0x00000000 63635572   apache     600        1
0x00000000 63668341   apache     600        1
0x00000000 63701110   apache     600        1
0x00000000 63733879   apache     600        1
0x00000000 63766648   apache     600        1
0x00000000 63799417   apache     600        1
0x00000000 65372282   apache     600        1
0x00000000 65405051   apache     600        1
0x00000000 65437820   apache     600        1
0x00000000 65470589   apache     600        1
0x00000000 65503358   apache     600        1
Comment 36 ScottE 2016-03-30 16:53:49 UTC
(In reply to Sander Hoentjen from comment #35)
> After applying the patch I ran into "No space left on device: AH00023:
> Couldn't create the proxy mutex" I haven't seen that issue without the patch.

Hi Sander, I don't believe this is related to the patch - I've seen this happen (on vanilla 2.4.7) with a bad configuration and something like daemontools constantly restarting Apache. This is likely a valid bug, where Apache can leak mutexes under some conditions, but I don't think it's caused by the patch.
Comment 37 Sander Hoentjen 2016-03-31 08:11:19 UTC
(In reply to ScottE from comment #36)
> (In reply to Sander Hoentjen from comment #35)
> > After applying the patch I ran into "No space left on device: AH00023:
> > Couldn't create the proxy mutex" I haven't seen that issue without the patch.
> 
> Hi Sander, I don't believe this is related to the patch - I've seen this
> happen (on vanilla 2.4.7) with a bad configuration and something like
> daemontools constantly restarting Apache. This is likely a valid bug, where
> Apache can leak mutexes under some conditions, but I don't think it's caused
> by the patch.

Well, we run Apache 2.4 with the event MPM on tens of servers, and aside from the bug in this ticket they are doing fine. On one of them we applied the patch (no other changes) and got AH00023, so while I believe there are other ways to trigger it, it seems the patch can also play a role.
Comment 38 Mike Williams 2016-04-07 14:36:03 UTC
(In reply to Thierry Bastian from comment #33)
> We got into a situation where the users of our product were stuck with G.
> We've got severe performance issues in those cases. We've tried patch
> https://bz.apache.org/bugzilla/attachment.cgi?id=33158&action=diff on a
> couple of installs and it made things much much better. On one install it
> would get stuck with 2000 clients coming in at roughly the same time. Now it
> can handle 10K gracefully.
> Hope that helps.


I've been trying that today after an update from 2.2.something to 2.4.18.
Still get the "scoreboard is full, ..." error though.


One server looks like this when emitting the "scoreboard is full, ..." error, a few moments before becoming entirely unresponsive.


179 requests currently being processed, 461 idle workers
PID	Connections	Threads	Async connections
total	accepting	busy	idle	writing	keep-alive	closing
25580	205	no	15	49	0	147	44
21331	293	no	0	0	0	0	292
19389	1	yes	0	0	0	0	0
25924	164	no	12	52	0	151	0
23217	432	no	15	49	0	146	270
23361	457	no	18	46	0	140	298
24175	458	no	13	51	0	149	297
20428	246	yes	0	0	0	0	244
21641	439	no	17	47	0	145	283
21739	435	no	16	48	0	143	277
23506	448	no	18	46	0	139	293
26180	30	yes	41	23	0	3	0
20174	2	no	0	0	0	0	1
20527	209	no	0	0	0	0	208
22470	448	no	14	50	0	149	287
20551	209	no	0	0	0	0	209
Sum	4476	 	179	461	0	1312	3003

R_R_R______R_R___R_________W_______R__R_R_WR________R__R___R____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
______W_R_R____R__R__R________________R__RR________R_____R_R____
___R_RR________WRR____R__R_R______R__________WR______R____R___R_
R________R______RR__R__RR___R______RR___RRR______R__R___RR_R____
RR______________W__R_______R_________R_____RRW_____R____RWR_____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
_____R___R_____R_R_R_RR_R___R_W_R__R__R___R______R____R_R_______
R______RR__R_R__RR_________R____R___R___RRR________R_______R___R
_________R__RR_______RR__R___R___R_____RRR____R_R_RR___R____R_W_
R___R_RRW___RRRR_RRRRRRRR_WRR_RR_RRRRRRRRRRRRRR__RRRRR__________
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
_____W______R__R____R_________R_____R_RWR_RR_R_______R____R_____
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG


Shortly afterwards all the Gs are cleared and it gets back to doing useful work for a while.
Sometimes "a while" can be 15 minutes, other times less than 1 second.
Comment 39 Stefan Fritsch 2016-04-11 20:56:58 UTC
As a summary, the problem is that old processes that are shutting down but are still processing some long-lasting connections take up all open scoreboard slots. It may be triggered in two ways:

a) when doing a graceful restart (apachectl graceful)

b) when the server load goes down in a way that causes httpd to stop some processes. This is particularly problematic because when the load increases again, httpd will try to start more processes. If the pattern repeats, the number of processes can rise quite a bit.

I think two things should be done:

1) Allow the use of some extra scoreboard slots for processes that are gracefully shutting down. This is necessary to fix a) and will help a bit with b). To avoid these extra processes taking up too many resources, they should try to return resources to the OS as soon as possible.

2) When some process is doing an idle shutdown in situation b) and httpd wants more active processes due to rising load, it should not start new processes but rather tell the finishing processes to abort shutdown and resume full operation. This helps with b) but not with a). It is also a lot more invasive to implement than 1).


My previous patch https://bz.apache.org/bugzilla/attachment.cgi?id=33158 did 1) to some extent by allowing re-use of some scoreboard slots. I will post a new patch in a minute.


As for configuration, I recommend the following (this holds even if you are not using any patch):

MaxSpareThreads - MinSpareThreads >= 2 * ThreadsPerChild

Higher values of the difference may work better. This reduces the likelihood of situation b) appearing.
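
For illustration, with made-up numbers, a configuration satisfying that inequality could look like this:

    ThreadsPerChild   25
    MinSpareThreads   25
    # difference = 75 - 25 = 50 >= 2 * ThreadsPerChild
    MaxSpareThreads   75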
Comment 40 Stefan Fritsch 2016-04-11 20:59:05 UTC
Created attachment 33749 [details]
Allow to use more scoreboard slots

The new patch goes a step further and allows, in total, 10 times as many processes as configured by MaxRequestWorkers / ThreadsPerChild, though ServerLimit is still honored. The number 10 is currently hard-coded but would probably become configurable in the end.


If using the patch, you should also set

ServerLimit >= 10 * MaxRequestWorkers / ThreadsPerChild

A smaller value may make sense, though, if you are short on RAM.
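
As a hypothetical example of that sizing rule, with MaxRequestWorkers 400 and ThreadsPerChild 25:

    MaxRequestWorkers  400
    ThreadsPerChild    25
    # ServerLimit = 10 * MaxRequestWorkers / ThreadsPerChild = 160
    ServerLimit        160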
Comment 41 Stefan Fritsch 2016-04-11 21:05:09 UTC
(In reply to Sander Hoentjen from comment #35)
> After applying the patch I ran into "No space left on device: AH00023:
> Couldn't create the proxy mutex" I haven't seen that issue without the patch.
> 
> Log says:
> [Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid
> 140551243081696] (28)No space left on device: AH00023: Couldn't create the
> proxy mutex
> [Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid
> 140551243081696] (28)No space left on device: AH02478: failed to create
> proxy mutex
> AH00016: Configuration Failed
> 

You could try using different Mutex types. On Linux, pthread may work best. Or you may try to increase the allowed resources, possibly shared memory. How that is done depends on your OS.
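
For example (the Mutex directive is available since httpd 2.3.4; pick the mutex names that apply to your setup), something like:

    # use pthread mutexes for everything
    Mutex pthread default
    # or only for the proxy mutex
    Mutex pthread proxy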
Comment 42 Stefan Fritsch 2016-04-11 21:18:39 UTC
Created attachment 33750 [details]
same as above, but for trunk

Attaching the same patch, but for trunk.


(In reply to Stefan Fritsch from comment #40)
> Created attachment 33749 [details]
> Allow to use more scoreboard slots

That patch is for 2.4 and also includes these commits from trunk:

https://svn.apache.org/r1703241
https://svn.apache.org/r1705922
https://svn.apache.org/r1706523
https://svn.apache.org/r1738464
https://svn.apache.org/r1738466
https://svn.apache.org/r1738486
https://svn.apache.org/r1738631
https://svn.apache.org/r1738632
https://svn.apache.org/r1738633
https://svn.apache.org/r1738635
Comment 43 Sander Hoentjen 2016-04-12 07:16:04 UTC
(In reply to Stefan Fritsch from comment #41)
> (In reply to Sander Hoentjen from comment #35)
> > After applying the patch I ran into "No space left on device: AH00023:
> > Couldn't create the proxy mutex" I haven't seen that issue without the patch.
> > 
> > Log says:
> > [Sat Mar 26 07:00:34.857694 2016] [core:emerg] [pid 787770:tid
> > 140551243081696] (28)No space left on device: AH00023: Couldn't create the
> > proxy mutex
> > [Sat Mar 26 07:00:34.857764 2016] [proxy:crit] [pid 787770:tid
> > 140551243081696] (28)No space left on device: AH02478: failed to create
> > proxy mutex
> > AH00016: Configuration Failed
> > 
> 
> You could try using different Mutex types. On Linux, pthread may work best.
> Or  you may try to increase the allowed ressources, possibly shared memory.
> How that is done depends on your OS.

But is there anything in the patch that changes this? Because without your patch we never ran into that issue.
Would the new patch behave differently in this regard?
Comment 44 Yann Ylavic 2016-04-12 07:49:37 UTC
(In reply to Sander Hoentjen from comment #43)
> Would the new patch behave differently in this regard?

Your issue is probably not related to the patch.
It is usually caused by an unclean shutdown of httpd (e.g. kill -9), or a crash of the parent process (you should see this in the system logs), possibly if you upgraded the binaries while httpd was still running.
The number of IPC SysV semaphores on the system is limited; if the previous ones were not cleanly deleted on shutdown, the new startup won't complete.
As suggested by Stefan, you could use another Mutex mechanism (pthread) which does not leak on unclean shutdown (even if httpd is killed).
Comment 45 mbs 2016-04-30 05:36:48 UTC
I was able to manage this issue by reducing the GracefulShutdownTimeout value and increasing the MaxClients / MaxRequestWorkers value to make more room in the Apache scoreboard.

I also reduced MaxKeepAliveRequests at the Apache global level.

For more info: https://www.tectut.com/2016/04/workaround-for-scoreboard-is-full-not-at-maxrequestworkers
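
For reference, the directives mentioned above could be combined along these lines (the values are hypothetical, for illustration only):

    # limit how long a graceful shutdown may take (seconds)
    GracefulShutdownTimeout  30
    # raised to leave more room in the scoreboard
    MaxRequestWorkers        512
    # reduced at the global level
    MaxKeepAliveRequests     100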
Comment 46 Valentin Gjorgjioski 2016-09-02 11:59:02 UTC
This is hitting me as well and causing a lot of trouble.

When is this going to be fixed?

What is the recommendation for a production server?

Is it better to upgrade to 2.4.18? Or a 2.4.10 backport?

Or, if going back, which version is best for 14.04.5 LTS?
Comment 47 bucky 2016-09-02 15:24:08 UTC
(In reply to Valentin Gjorgjioski from comment #46)
> This is hitting me as well and causing a lot of trouble.
> Is it better to upgrade to 2.4.18? Or a 2.4.10 backport?

Upgrading to 2.4.18 hasn't helped everyone, but it did help me. The "centos-sclo-rh" repository was a solution in my situation.
Comment 48 Luca Toscano 2016-09-02 15:48:45 UTC
(In reply to Valentin Gjorgjioski from comment #46)
> This is hitting me as well and causing a lot of trouble.

Hi Valentin,

can you give us a bit more detail about your use case? Does the scoreboard full issue happen regularly after certain events, or randomly? What is your configuration (if you can share it) and httpd version? It would help a lot :)

Luca
Comment 49 Valentin Gjorgjioski 2016-09-02 16:38:42 UTC
Hi,


This started happening after a recent upgrade of Ubuntu. Apache was the same version before and after the upgrade. Ubuntu is 14.04.5 LTS, Apache is 2.4.7.

This is a high-load production server. It had been working for 1.5 years without any problems.

Here is some log of that update, when the problem started: 

[UPGRADE] apache2:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
[UPGRADE] apache2-bin:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
[UPGRADE] apache2-data:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
[UPGRADE] apache2-mpm-worker:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
[UPGRADE] apache2-utils:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
[INSTALL] php5-mysqlnd:amd64
[UPGRADE] php5-cli:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-common:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-curl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-fpm:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-gd:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-intl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-pgsql:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-pspell:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-readline:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-recode:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-sqlite:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-tidy:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-xmlrpc:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5-xsl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
[UPGRADE] php5:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19


Here is what I nailed it down to: 
1. After this upgrade I needed to DISABLE the opcache in PHP, because problems started with fatal errors and segmentation faults with WordPress.
2. Because of 1., the server got an even higher load.
3. The higher load caused a full scoreboard and hit MaxRequestWorkers.

What I found were two problems: 

1. When high load occurs and MaxRequestWorkers is hit, Apache stops responding (dies). It should slow down and not accept new requests until a slot is free, but it shouldn't stop responding. I think I saw this reported somewhere else, e.g.:
https://www.digitalocean.com/community/questions/apache2-crash-on-ubuntu-14-04-maxrequestworkers-issue

2. When I found a way to solve the high-load problem (enabling WP cache plugins), a second problem started, mainly on Apache reload (log rotation) or even on a regular basis WHEN MaxConnectionsPerChild is different from 0 and/or pm.max_requests is different from 0. Why is this a problem? Because children die after a certain number of requests, get stuck in the "G" state, and never complete. This fills up the scoreboard and you end up with that error. Once you set these to 0, the problem more or less disappears.

The workaround is setting these to 0, hoping all scripts are well behaved with no memory leaks, lowering memory usage in php.ini, and restarting the server each day (a restart on logrotate, not a reload).


A very important trick that I learned during this: ALWAYS restart php-fpm and Apache together. Failing to do so leads to some instability.

That workaround works for me, but I would like to hear why this happens, and how we can prevent it (especially the problem where Apache dies when MaxRequestWorkers is reached).
Comment 50 Luca Toscano 2016-09-05 13:18:21 UTC
Thanks a lot for the details Valentin, will try to add my thoughts inline:

(In reply to Valentin Gjorgjioski from comment #49)

> This started happening after recent upgrade of Ubuntu. Apache was the same,
> and now it is the same.  Ubuntu is 14.04.5 LTS, Apache is 2.4.7. 

This is a very old version of httpd, so if you could, it would be really great to upgrade Trusty to something more recent to see the differences.

> This is high load, production server. Working for 1.5 year without any
> problems so far. 
> 
> Here is some log of that update, when the problem started: 
> 
> [UPGRADE] apache2:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
> [UPGRADE] apache2-bin:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
> [UPGRADE] apache2-data:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
> [UPGRADE] apache2-mpm-worker:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
> [UPGRADE] apache2-utils:amd64 2.4.7-1ubuntu4.9 -> 2.4.7-1ubuntu4.13
> [INSTALL] php5-mysqlnd:amd64
> [UPGRADE] php5-cli:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-common:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-curl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-fpm:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-gd:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-intl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-pgsql:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-pspell:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-readline:amd64 5.5.9+dfsg-1ubuntu4.14 ->
> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-recode:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-sqlite:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-tidy:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-xmlrpc:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5-xsl:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> [UPGRADE] php5:amd64 5.5.9+dfsg-1ubuntu4.14 -> 5.5.9+dfsg-1ubuntu4.19
> 
> 
> Here is what I nailed it down to: 
> 1. After this upgrade I needed to DISABLE the opcache in PHP, because
> problems started with fatal errors and segmentation faults with wordpress. 
> 2. Because of 1., the server got an even higher load.
> 3. The higher load caused a full scoreboard and hit MaxRequestWorkers.

Stating the obvious, but the httpd issue seems to be a consequence of all the PHP upgrades that happened at the same time. Have you tried rolling back the last upgrade to see if the issue persists?


>
> What I found were two problems: 
> 
> 1. When high load occurs and MaxRequestWorkers is hit, Apache stops
> responding (dies). It should slow down, should not accept new requests until
> free slot, but it shouldn't stop responding.  I think I saw this reported
> somewhere else, e.g.: 
> https://www.digitalocean.com/community/questions/apache2-crash-on-ubuntu-14-
> 04-maxrequestworkers-issue

Would you mind including the logs and/or more details about this? Again, it would be really great to know whether the problem is the same with a more recent version of httpd.

> 
> 2. When I found a way to solve the problem with high load (enable wp cache
> plugins), now the second problem started, mainly on apache reload (log
> rotation) or even on regular basis WHEN MaxConnectionsPerChild is different
> from 0,  and/or when pm.max_requests is different from 0. Why this is a
> problem - because children are dying after certain numbers of requests, and
> then they get stuck into "G" state, and never completing. This is filling
> your scoreboard and you are ending with that error. Once you set these to 0,
> problem more or less disappears. 

Do you have long timeouts (proxy, etc.) in your httpd configuration? This would be useful information for us; in the past, long proxy timeouts have exacerbated the issue that you described.

> 
> Workaround is setting these to 0, and hoping all scripts are good, no memory
> leaks, lowering memory usage in php.ini, and restarting the server each day
> (on logrotate restart and not reload). 

> 
> Very important trick that I learned in during this is also this one: ALWAYS
> restart php-fpm and apache together. Failing to do so leads to some
> instabilities. 
> 
> For me that workaround works, but I would like to hear why this happens, and
> how we can prevent it (especially the problem when Apache dies when
> MaxRequestWorkers is reached).


As written above it would be great to know more about the "Apache dies" part. Any detail that you could share with us would be really appreciated.

Thanks!

Luca
Comment 51 Valentin Gjorgjioski 2016-09-05 15:57:47 UTC
Hi Luca,

at the moment upgrading Trusty is not really an option; I'm mostly scared of PHP 7 and the compatibility issues that might arise. Maybe next year.

I haven't tried to roll back; I'm not even sure how to do that, or whether it's easy.

The link to DigitalOcean is from another user, but I'm experiencing exactly the same thing. Unfortunately there is nothing in the log except the message stated there.

I'm not sure what counts as a long timeout, but the default (300 seconds?!) for php-fpm over sockets is probably long. And yes, I guess this is exacerbating the issue. No proxies are defined. To me it seems like when some processes hang on the PHP side, they are not getting killed on the Apache side and the connection is not released, not even after those 5 minutes. It gets stuck there and that's it.

"Apache dies" means: the Apache processes are there, using no CPU, accepting no connections, and only a restart helps. Nothing in the logs.

I just went to prefork. I think it will be stable for now. I had tons of problems these 5 days; I don't know why I didn't switch to prefork earlier. It seems like a good workaround for me right now.
Comment 52 Valentin Gjorgjioski 2016-09-05 19:20:46 UTC
Hi,

now I believe I have a clear picture of what is going on:

1. I'm using FastCGI, apparently a dead and unsupported project?!

2. I'm not sure whether there is a directive such as a connect timeout (fcgid has this). It seems there is either no timeout or it is quite large.

3. When Apache gets hit hard, php-fpm gets hit hard as well. In my case PHP-FPM started having problems doing its job when I disabled the opcache mentioned earlier, so it got stuck with a longer and longer queue. Then Apache continued sending requests to php-fpm even after php-fpm had reached its limit (pm.max_children). In such a scenario php-fpm stops opening new processes, but somehow old processes get stuck. Apache keeps doing this until the scoreboard is full. At that point CPU usage is very low; it looks like some I/O block, with many Apache processes (1500?!) waiting to open the socket, but the socket is not available.

However, at this point it is not very clear to me why Apache builds up the queue and the queue is not getting emptied - there is no high processor usage; it seems that php-fpm/Apache got stuck and nothing can be done. Could this be Apache not handling sockets properly?

4. Even with prefork this happens, so it's not an mpm_event problem in this case.


Workaround for the next month or so: optimize the PHP work and lower the load so PHP-FPM can handle it in a timely manner. Also, the Ubuntu upgrade, including a more stable PHP opcache, will help with this.

Long-term solution: there must be a solution for this problem in general. Either it is time to move to nginx, or it is time to move to a better module for FastCGI. By the way, what would you suggest at this point? What is the easiest migration path from FastCGI to another Apache module?
Comment 53 Eric Covener 2016-09-05 20:48:21 UTC
> However, at this point it is not very clear to me why Apache builds up the
> queue and the queue is not getting emptied  - there is no high processor
> usage, it seems that php-fpm/apache got stuck and nothing can be done. Could
> be this apache not handling sockets properly? 


I'd suggest starting a thread on users@httpd.apache.org.

If you can get this error, you should be able to find some processes trying to exit but hanging on the way out waiting for requests to complete.   Showing their backtrace with gdb (or pstack) will tell us exactly what they're doing.

Your MPM configuration will also tell us if you have unnecessary process churn.
Comment 54 Stefan Fritsch 2016-09-05 21:45:56 UTC
Created attachment 34201 [details]
Use all scoreboard entries up to ServerLimit, for trunk

New patch: This time use the whole scoreboard up to the configured ServerLimit. Also fixed some issues with the previous patch.
Comment 55 Stefan Fritsch 2016-09-05 21:50:31 UTC
Created attachment 34202 [details]
Use all scoreboard entries up to ServerLimit, for 2.4

Same as above, but for 2.4.

This contains the trunk patch plus these commits from trunk:

r1705922
r1706523
r1738464
r1738466
r1738486
r1738628
r1738631
r1738632
r1738633
r1738635
r1756848
r1757009
r1757011
r1757029
r1757030
r1757031
r1757056
r1757061

It would be really nice if someone could give this a try in a real-life setup.
Comment 56 Valentin Gjorgjioski 2016-09-05 21:58:13 UTC
From what I understand, it seems that Apache can't do anything about this; it seems to be correct behavior. It waits on the socket for output. Timeouts are high (30 seconds), so on a busy server, if all php-fpm processes working on that socket are occupied (not returning results), the queue gets bigger and bigger.

And indeed, every time this crash happened I found timeouts in the error logs (just for certain web sites), which I had missed previously.

It seems like the problem is in php-fpm and that it started with my recent upgrade. The problems with the opcache started there as well, and I replaced mysql with mysqlnd in that update. So many changes; something got broken, but I think there is nothing wrong with Apache. The problem should be either in php-fpm or php-mysqlnd, or maybe in the web sites themselves.


In the end, it would be great if Apache provided the ability to limit the number of processes per virtual host (as php-fpm allows). That way it would also be much easier to isolate and solve the problem.
Comment 57 Thomas Jarosch 2016-10-25 21:46:14 UTC
Hi Stefan,

thanks for trying to solve the "scoreboard full" issue :)

I've been hit by it badly today; the affected machine
is a forward proxy and it stalls the traffic almost completely.

Some background info:
- event mpm on httpd 2.4.23
- forward proxy setup via mod_proxy
- 280 real users + other machines. ~370 clients
- server load is around 0.2, plenty of free RAM
- file descriptor limit is 1024
- logrotate sends a graceful restart every hour

If the problem occurs, httpd doesn't even respond
to the /server-status page reliably.

A small script logs the /server-status page every 30s to disk.
Specific case: logrotate sends a "graceful restart" at 13h.

/server-status output at 13:04:24h:
-------------------
Total accesses: 8801 - Total Traffic: 74.6 MB
75 requests currently being processed, 125 idle workers
+---------------------------------------------------------------------------+
|       |    Connections    |   Threads   |        Async connections        |
|  PID  |-------------------+-------------+---------------------------------|
|       | total | accepting | busy | idle | writing | keep-alive | closing ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14906 | 7     | yes       | 6    | 44   | 0       | 1          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14959 | 9     | yes       | 9    | 41   | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15014 | 3     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15015 | 49    | yes       | 50   | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15329 | 3     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15893 | 15    | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 17762 | 11    | yes       | 10   | 40   | 0       | 1          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| Sum   | 97    |           | 75   | 125  | 0       | 2          | 0       ||
+---------------------------------------------------------------------------+

_________R_____R__________________R___R___R__R________R______R_R
R_____R__R_________________R__R____RGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGRRRRRRRRRRRRRRRRRRRRRRRRRRRWRRRRRRRRRRRRRR
RRRRRRRRGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGR__________R__R_____
_______R_RR_________R_RR_R____
-------------------


/server-status output at 13:15:25h:
-------------------
Total accesses: 12929 - Total Traffic: 90.9 MB
87 requests currently being processed, 63 idle workers
+---------------------------------------------------------------------------+
|       |    Connections    |   Threads   |        Async connections        |
|  PID  |-------------------+-------------+---------------------------------|
|       | total | accepting | busy | idle | writing | keep-alive | closing ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14906 | 18    | yes       | 16   | 34   | 0       | 2          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14959 | 27    | yes       | 26   | 24   | 0       | 2          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15014 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15015 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15329 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 18564 | 45    | yes       | 45   | 5    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 17762 | 39    | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 18078 | 44    | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| Sum   | 179   |           | 87   | 63   | 0       | 4          | 0       ||
+---------------------------------------------------------------------------+

_____R__R___R_RR_RR_R_RR__R_____R_R___R_R_____R___W_RR__RR_RR__R
RR__R_RR____RRRRR_R_RR___R_RR_RR____GGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGRRRRRR
RRRRRRRRR_RRRRRRRRR_RRRR_RRRRRRRRRRR_R_RRRRRGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGG
-------------------


/server-status at 13:25:20h:
(httpd hardly responding anymore):
-------------------
Total accesses: 14630 - Total Traffic: 97.4 MB
50 requests currently being processed, 0 idle workers
+---------------------------------------------------------------------------+
|       |    Connections    |   Threads   |        Async connections        |
|  PID  |-------------------+-------------+---------------------------------|
|       | total | accepting | busy | idle | writing | keep-alive | closing ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14906 | 36    | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 14959 | 2     | yes       | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15014 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15015 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 15329 | 2     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 18564 | 50    | yes       | 50   | 0    | 0       | 1          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 17762 | 3     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| 18078 | 1     | no        | 0    | 0    | 0       | 0          | 0       ||
|-------+-------+-----------+------+------+---------+------------+---------||
| Sum   | 98    |           | 50   | 0    | 0       | 1          | 0       ||
+---------------------------------------------------------------------------+

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGWRRRRR
RRRRWRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGG
-------------------

I can provide more /server-status output if needed.

After around 30 mins, the external "mon" watchdog
kills httpd and restarts it. Traffic continues to flow.


httpd config:
-------------------
Timeout 300
KeepAliveTimeout 300

<IfModule mpm_event_module>
  # Number of concurrent connections is: ServerLimit * ThreadsPerChild
  # Result: 16 * 50 -> 800
  #
  StartServers 1
  ServerLimit 16
  ThreadLimit 50
  ThreadsPerChild 50
  MaxConnectionsPerChild  1000
</IfModule>

No other performance related settings.

-------------------

I've now increased ServerLimit to 32 and disabled
logrotate as a quick fix. It holds so far.
Occasionally I still see the "scoreboard full" message,
even though there are just ~160 active connections and some processes
are (still?) in the graceful shutdown state.


I'll put the patch from #55 on the production machine tomorrow :o)
It already runs on my own proxy and the one from my department.

Anything else to watch out for?

I can also provide gdb backtraces if you tell me
what specifically to look for.

Triggering a graceful restart during peak traffic might be a good test...

Cheers,
Thomas
Comment 58 Thomas Jarosch 2016-10-26 06:39:03 UTC
One more piece of info about my setup:

There are two other httpd instances running on different ports.
One is using the event MPM, the other one prefork MPM.

I didn't configure an explicit ScoreBoardFile, so the scoreboard is in anonymous shared memory. Could there be cross-talk between those three httpds?
Comment 59 Thomas Jarosch 2016-10-26 13:44:54 UTC
Hi Stefan,

the patch from #55 seems to make things scale a lot better.
Also the status output is very helpful.

ServerLimit was changed back to 16 before the tests.
I did a graceful restart at 13:09:35h.

/server-status at 14:19:36h (*before* the next graceful restart):
-----------------------
Total accesses: 23693 - Total Traffic: 200.0 MB
100 requests currently being processed, 150 idle workers
+--------------------------------------------------------------------------------------------+
|      |       |          |    Connections    |   Threads   |       Async connections        |
| Slot |  PID  | Stopping |-------------------+-------------+--------------------------------|
|      |       |          | total | accepting | busy | idle | writing | keep-alive | closing |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|0     |19952  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|1     |20006  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|2     |20060  |yes (old  |5      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|3     |20160  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|4     |20224  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|5     |20725  |no        |2      |yes        |2     |48    |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|6     |27470  |no        |50     |yes        |50    |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|7     |24389  |yes       |3      |no         |0     |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|8     |27104  |no        |18     |yes        |18    |32    |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|9     |27346  |no        |3      |yes        |3     |47    |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|10    |22579  |yes       |2      |no         |0     |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|11    |27674  |no        |29     |yes        |27    |23    |0        |3           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|13    |25055  |yes       |8      |no         |0     |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|14    |25350  |yes       |2      |no         |0     |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|15    |25475  |yes       |5      |no         |0     |0     |0        |0           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|Sum   |15     |10        |137    |           |100   |150   |0        |3           |0        |
+--------------------------------------------------------------------------------------------+

.G.G...............G............................................
..............G.....G.....G.........G..............G............
.........G.....G...G..................GG........................
...........................G........G.....................______
___________R_______________R________________RRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR.....................G............
.....G...G......___R____R_RR_R__R______RRRRR__R__R______RR__R_R_
_R________________R__________________R_____________RGG__RRRRRRR_
_RRRR___R____RR__RR____R__R_W__RRRRR_RRRGGGGGGGGGGGGGGG

-----------------------

As you can see, there are still processes from "old gen" after one hour.
This is due to long running HTTP CONNECT requests to google / dropbox / etc.

Probably GracefulShutdownTimeout will help here; maybe having a default
value of one hour would make sense for httpd in general?
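
For reference, what I mean is something along these lines (sketch only; whether the old child processes actually honor it is a separate question):

# sketch: stop waiting for gracefully-stopping processes after one hour
GracefulShutdownTimeout 3600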


Next graceful restart at 14:19:51h.

Errors start to appear in the log two seconds later:

[Wed Oct 26 14:19:53.926229 2016] [mpm_event:error] [pid 19951:tid 3071850240] AH: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.


/server-status at 14:20:06h:
-----------------------
Total accesses: 23744 - Total Traffic: 200.9 MB
8 requests currently being processed, 42 idle workers
+--------------------------------------------------------------------------------------------+
|      |       |          |    Connections    |   Threads   |       Async connections        |
| Slot |  PID  | Stopping |-------------------+-------------+--------------------------------|
|      |       |          | total | accepting | busy | idle | writing | keep-alive | closing |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|0     |19952  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|1     |20006  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|2     |20060  |yes (old  |5      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|3     |20160  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|4     |20224  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|5     |20725  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|6     |27470  |yes (old  |42     |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|7     |24389  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|8     |27104  |yes (old  |18     |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|9     |27346  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|10    |22579  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|11    |27674  |yes (old  |24     |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|12    |28054  |no        |9      |yes        |8     |42    |0        |2           |0        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|13    |25055  |yes (old  |8      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|14    |25350  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|15    |25475  |yes (old  |5      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|Sum   |16     |15        |133    |           |8     |42    |0        |2           |0        |
+--------------------------------------------------------------------------------------------+

.G.G...............G............................................
..............G.....G.....G.........G..............G............
.........G.....G...G..................GG........................
...........................G........G...........................
...........G...............G................G.GGGGG.G.G..GGGGGG.
GGGGGGGGGGGGGG.GGGGGGGGGGG.GGG.....................G............
.....G...G......GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG_
______RRRR____RRRW_______________________________GGGGGGGGGGGGGGG
-----------------------



The forward proxy became unresponsive again.
/server-status at 14:29:16h:
-----------------------
Total accesses: 24453 - Total Traffic: 226.8 MB
50 requests currently being processed, 0 idle workers
+--------------------------------------------------------------------------------------------+
|      |       |          |    Connections    |   Threads   |       Async connections        |
| Slot |  PID  | Stopping |-------------------+-------------+--------------------------------|
|      |       |          | total | accepting | busy | idle | writing | keep-alive | closing |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|0     |19952  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|1     |20006  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|2     |20060  |yes (old  |5      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|3     |20160  |yes (old  |1      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|4     |20224  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|5     |20725  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|6     |27470  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|7     |24389  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|8     |27104  |yes (old  |1      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|9     |27346  |yes (old  |1      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|10    |22579  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|11    |27674  |yes (old  |3      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|12    |28054  |no        |51     |yes        |50    |0     |0        |0           |1        |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|13    |25055  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|14    |25350  |yes (old  |2      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|15    |25475  |yes (old  |4      |no         |0     |0     |0        |0           |0        |
|      |       |gen)      |       |           |      |      |         |            |         |
|------+-------+----------+-------+-----------+------+------+---------+------------+---------|
|Sum   |16     |15        |86     |           |50    |0     |0        |0           |1        |
+--------------------------------------------------------------------------------------------+

.G.G...............G............................................
..............G.....G.....G.........G..............G............
.........G.....G...G...................G........................
...........................G........G...........................
...........G...............G....................................
...........G.............G.........................G............
.....G..........GGGGGGGRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRWRGGGGGGGG
-----------------------

As you can see, there was plenty of room in the scoreboard now,
but the process list slots were used up by old processes
serving just a handful of connections.


One option would be to increase ServerLimit to let's say 128,
but that also raises the resource limits during normal operation.
If I raise ServerLimit too much, I have to lower the thread count again.
Sounds a bit like the prefork mpm...

Another option would be to add a config setting to ignore
processes for the ServerLimit calculation if they are
in graceful shutdown mode. They probably don't consume
a lot of resources and we can have a GracefulShutdownTimeout
of one hour to expire them, too.

Third option (my preferred one): have a separate GracefulShutdownLimit
that is independent of ServerLimit. If we have too many processes,
start killing off the oldest processes from the graceful-shutdown list.
Processes in graceful shutdown mode don't count towards ServerLimit.
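
Roughly what I imagine, as a config sketch (GracefulShutdownLimit is hypothetical and does not exist today):

<IfModule mpm_event_module>
  ServerLimit           16   # would only count active (non-stopping) processes
  ThreadsPerChild       50
  # hypothetical directive, does not exist today: extra slots for processes in
  # graceful shutdown; the oldest one gets killed once this limit is exceeded
  GracefulShutdownLimit 32
</IfModule>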


I've raised ServerLimit to 32 on the box again.
We can't annoy the users too much ;)

Cheers,
Thomas

PS: Forget the idea from #58 about cross-talk between anonymous shared memory segments. That's not what is happening.
Comment 60 Stefan Fritsch 2016-11-04 13:57:23 UTC
(In reply to Thomas Jarosch from comment #59)
> the patch from #55 seems to make things scale a lot better.
> Also the status output is very helpful.

Glad to hear that and thanks for testing it.

> As you can see, there are still processes from "old gen" after one hour.
> This is due to long running HTTP CONNECT requests to google / dropbox / etc.

There is no way to determine if such connections can be "safely" interrupted or if they are in the middle of a long download.

> 
> Probably GracefulShutdownTimeout will help here; maybe having a default
> value of one hour would make sense for httpd in general?

Currently the children won't honor GracefulShutdownTimeout. But that should be added.

> As you can see, there was plenty of room in the scoreboard now,
> but the process list slots were used up by old processes
> serving just a handful of connections.
> 
> 
> One option would be to increase ServerLimit to let's say 128,
> but that also raises the resource limits during normal operation.
> If I raise ServerLimit too much, I have to lower the thread count again.
> Sounds a bit like the prefork mpm...

During normal operation, the number of threads is limited by MaxRequestWorkers. The idea of my patch is that you can increase ServerLimit quite a bit without using too many resources. The processes serving old connections should terminate most of their threads and free most of their memory, so the resource usage should not be too high. But of course it depends on how many old connections are still open.
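
To illustrate with example numbers (only an illustration of the idea, not a recommendation): MaxRequestWorkers still caps the active threads, while the extra ServerLimit slots are only used by processes that are finishing old connections.

<IfModule mpm_event_module>
  ThreadsPerChild     50
  ThreadLimit         50
  MaxRequestWorkers  800   # caps active threads: at most 800 / 50 = 16 busy children
  ServerLimit         64   # extra headroom for children still draining old connections
</IfModule>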

> Another option would be to add a config setting to ignore
> processes for the ServerLimit calculation if they are
> in graceful shutdown mode. They probably don't consume
> a lot of resources and we can have a GracefulShutdownTimeout
> of one hour to expire them, too.

You are confusing ServerLimit with MaxRequestWorkers here. While the latter is a number of threads and not processes, it does what you think ServerLimit should do.

> Third option (my preferred one): have a separate GracefulShutdownLimit
> that is independent of ServerLimit. If we have too many processes,
> start killing off the oldest processes from the graceful-shutdown list.
> Processes in graceful shutdown mode don't count towards ServerLimit.

Yes, we could do that, too. But first I need something like GracefulShutdownTimeout to work for the old child processes.


If you have any more experience with the patch I am certainly interested, even if it has simply run for some time without exposing any (new) bugs.

Cheers,
Stefan
Comment 61 Yann Ylavic 2016-11-04 23:04:30 UTC
A quick note about the patch (unfortunately I could not carry out my testing, since a colleague reused the machine and wiped out my local patches/work altogether...).

Anyway, there is possibly an issue with retained->total_daemons, which is incremented (unconditionally) whenever a child is created (make_child), but not always decremented when one finishes (in server_main_loop, depending on whether or not it died smoothly and whether it still uses a scoreboard slot).

IOW, I think this hunk:
                 ps->quiescing = 0;
+                retained->total_daemons--;

should probably be moved up here:
         ap_wait_or_timeout(&exitwhy, &status, &pid, pconf, ap_server_conf);
         if (pid.pid != -1) {
+            retained->total_daemons--;

Will restart my tests ASAP...
Comment 62 Stefan Fritsch 2016-11-06 20:36:40 UTC
(In reply to Yann Ylavic from comment #61)
> Anyway, there is possibly an issue with retained->total_daemons which is
> incremented (unconditionally) whenever a child is created (make_child), but
> not always decremented when one finishes (server_main_loop, depending on
> whether or not it died smoothly and it still uses a scoreboard slot).
> 
> IOW, I think this hunk:
>                  ps->quiescing = 0;
> +                retained->total_daemons--;
> 
> should probably be moved up here:
>          ap_wait_or_timeout(&exitwhy, &status, &pid, pconf, ap_server_conf);
>          if (pid.pid != -1) {
> +            retained->total_daemons--;

No, I think the code in the patch is correct: There is only one case where the code will return from the function before reaching the "if (child_slot >= 0) {" block which contains the "retained->total_daemons--;" line. And in this case the whole server will exit, so correct counting is not an issue any more.

On the other hand, total_daemons must not be decremented if child_slot < 0, because in this case the dead process was not a worker process (but e.g. a cgid-process).

But this should be made clearer, either by rearranging the code or by adding some comments.
Comment 63 Christian Folini 2016-11-07 10:24:40 UTC
We have successfully used the patch in #55 for 50 days now on a mid-sized production server with 1-2 million hits per day. No issues encountered, and the previous issues disappeared (we think the original bug had been abused in a DoS attack, but we might be wrong about this).
Comment 64 Jim Jagielski 2016-11-21 20:21:36 UTC
Comment on attachment 34202 [details]
Use all scoreboard entries up to ServerLimit, for 2.4

This looks good. It should be proposed for backport!!
Comment 65 Stefan Fritsch 2016-11-21 20:48:11 UTC
Rest of the trunk patch committed as

r1770750
r1770752
Comment 66 Eric Covener 2016-12-31 00:18:31 UTC
Fixed in 2.4.25
Comment 67 Thomas Jarosch 2017-01-25 12:00:44 UTC
Hi Stefan,

(In reply to Stefan Fritsch from comment #60)
> > the patch from #55 seems to make things scale a lot better.
> > Also the status output is very helpful.
> 
> Glad to hear that and thanks for testing it.

Sorry, I didn't see your reply, as Bugzilla
didn't add me to the CC: list automatically, which is rather
odd since that's the default setting.

Back to the topic:

> > Probably GracefulShutdownTimeout will help here; maybe having a default
> > value of one hour would make sense for httpd in general?
> 
> Currently the children won't honor GracefulShutdownTimeout. But that should
> be added.

very nice.

> > Third option (my preferred one): have a separate GracefulShutdownLimit
> > that is independent of ServerLimit. If we have too many processes,
> > start killing off the oldest processes from the graceful-shutdown list.
> > Processes in graceful shutdown mode don't count towards ServerLimit.
> 
> Yes, we could do that, too. But first I need something like
> GracefulShutdownTimeout to work for the old child processes.

ok. 

In the meantime I've decreased ThreadLimit to 5 and increased ServerLimit to 160 and beyond. The results with these settings are very good: no more user complaints (see below).

Otherwise those long-running HTTP CONNECT sessions were still maxing out the total number of allowed processes.
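
In config terms, roughly this; the MaxRequestWorkers value is my assumption to keep the same total capacity of 800 workers as before:

<IfModule mpm_event_module>
  ThreadLimit          5
  ThreadsPerChild      5   # must be <= ThreadLimit, so also 5 (assumed)
  ServerLimit        160
  MaxRequestWorkers  800   # assumed: 160 * 5, same capacity as the old 16 * 50
</IfModule>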

> If you have any more experience with the patch I am certainly interested,
> even if it has simply run for some time without exposing any (new) bugs.

The patch has been deployed to roughly 3,000 servers since November 2016, with workloads ranging from 10 users to 400+ users. After applying your patch plus the ThreadLimit change, there were no more complaints :)

I've also diffed httpd 2.4.23 + the patch with the version of the code that landed in 2.4.25 and it's exactly the same. I'm soon going to roll out 2.4.25 to those boxes.

Thanks again!
Thomas
Comment 68 Luca Toscano 2017-01-31 10:06:49 UTC
*** Bug 56101 has been marked as a duplicate of this bug. ***