Bug 61551 - Event MPM loses track of connections not reported as processed by modules
Summary: Event MPM loses track of connections not reported as processed by modules
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_event
Version: 2.4.27
Hardware: PC Linux
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2017-09-20 16:30 UTC by Nic Jansma
Modified: 2018-01-18 14:02 UTC
CC List: 3 users



Attachments
Cleanup connection on pollset failure (5.05 KB, patch)
2017-09-22 08:28 UTC, Yann Ylavic
Details | Diff
Work around process_connection hooks not updating the state (2.00 KB, patch)
2017-11-09 11:32 UTC, Yann Ylavic
Details | Diff
Patches for 2.4.27 (22.43 KB, application/gzip)
2018-01-04 01:54 UTC, Yann Ylavic
Details

Description Nic Jansma 2017-09-20 16:30:49 UTC
I've been having an issue with Apache's event MPM: scoreboard slots are being taken up by processes (PIDs) that never seem to leave the "yes (old gen)" ("Gracefully finishing") state.

I first thought this was related to https://bz.apache.org/bugzilla/show_bug.cgi?id=53555, but that was fixed in 2.4.24, and I'm running 2.4.27.

The problem started 3 days ago when I switched from prefork to event MPM.  Since then, I've had 3 occurrences of Apache stopping accepting any new connections.  After a few minutes, my monitoring scripts detect this and restart httpd.

I'm still debugging exactly what the state of Apache is at that time, but I see this in the event log:

AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

After reading bug # 53555, I've tweaked my server settings a little bit (specifically, increased ServerLimit to be more than MaxRequestWorkers/ThreadsPerChild so there is Scoreboard room available after a reload when some slots are taken by shutting-down processes).

I'm not sure if that will fix the issue of server-not-responding long-term, but I've noticed that Apache is often getting into a state where there are processes stopping, but stuck in "yes (old gen)" state indefinitely.

These processes don't appear to ever exit, until I hard restart Apache.

These "yes (old gen)" processes are Accepting Connections=no, and 0 Threads busy/idle, and Async connections writing/keep-alive/closing are all 0 too.

An example scoreboard is below:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	19022	no	29	yes	5	59	0	18	8
1	5502	yes (old gen)	44	no	0	0	0	0	0
2	8233	no	5	yes	0	64	0	4	1
3	5557	yes (old gen)	19	no	0	0	0	0	0
4	5597	yes (old gen)	19	no	0	0	0	0	0
Sum	5	3	116	 	5	123	0	22	9

You can see there are 3 processes in "yes (old gen)" with no busy threads and no async connections.

The scoreboard shows many things in the "G" state:

_______W_____W___________R_____________________________WW_______
G...G.GGGGGGG.GGGGGGG..GGGG.G.G.G..GGG.GGGGGGGGG..GGGG.G.GG...GG
________________________________________________________________
G.....G...GG.......GG..G...G.GG......G.GGG...G...GG...G.......G.
.....G.GGG...GGG.....GG....G..G.GGG......GG.......G........G...G
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................

I'm not sure if there's a configuration setting I'm missing to ensure those "yes (old gen)" processes actually exit earlier, or if this is a bug.
Comment 1 Eric Covener 2017-09-20 17:57:01 UTC
Can you tell w/ pstack/gdb what remaining threads in that process are doing?
Comment 2 Nic Jansma 2017-09-20 23:26:25 UTC
Full bt of all threads here:

https://pastebin.com/vK0rBSpP

Summary:

* 95 threads look like this:

Thread 47 (Thread 0x7fce21f66700 (LWP 5999)):
#0  0x00007fce62a24945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce5bd13598 in get_next (slot=0x7fce652494b8) at h2_workers.c:206
#2  slot_run (thread=0x7fce65492c58, wctx=0x7fce652494b8) at h2_workers.c:225
#3  0x00007fce62a20e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fce6254a34d in clone () from /lib64/libc.so.6

* 43 threads look like this:

#0  0x00007fce62a24945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce563d7115 in ap_queue_pop_something (queue=0x7fce651ee5c0, sd=0x7fcdfc7bfd30, ecs=0x7fcdfc7bfd38, p=0x7fcdfc7bfd40, te_out=0x7fcdfc7bfd48) at fdqueue.c:451
#2  0x00007fce563d35be in worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1807
#3  0x00007fce62a20e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fce6254a34d in clone () from /lib64/libc.so.6

* Then these two threads:

Thread 2 (Thread 0x7fcdfbfbf700 (LWP 6275)):
#0  0x00007fce6254a923 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fce62c58453 in impl_pollset_poll (pollset=0x7fce1c0009b0, timeout=<optimized out>, num=0x7fcdfbfbed2c, descriptors=0x7fcdfbfbed48) at poll/unix/epoll.c:266
#2  0x00007fce563d546c in listener_thread (thd=0x7fce1c04ca38, dummy=<optimized out>) at event.c:1460
#3  0x00007fce62a20e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fce6254a34d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fce63f188c0 (LWP 5502)):
#0  0x00007fce62a21f57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fce62c5cecb in apr_thread_join (retval=retval@entry=0x7ffe4f92f704, thd=thd@entry=0x7fce1c04ca38) at threadproc/unix/thread.c:217
#2  0x00007fce563d083c in join_workers (listener=0x7fce1c04ca38, threads=threads@entry=0x7fce654c9b00) at event.c:2084
#3  0x00007fce563d0c11 in child_main (child_num_arg=child_num_arg@entry=1, child_bucket=child_bucket@entry=0) at event.c:2282
#4  0x00007fce563d40d7 in make_child (s=0x7fce6515b400, slot=slot@entry=1, bucket=0) at event.c:2353
#5  0x00007fce563d4174 in startup_children (number_to_start=4, number_to_start@entry=5) at event.c:2377
#6  0x00007fce563d4e80 in event_run (_pconf=<optimized out>, plog=0x7fce65161388, s=0x7fce6515b400) at event.c:2725
#7  0x00007fce63f8556e in ap_run_mpm (pconf=pconf@entry=0x7fce65134168, plog=0x7fce65161388, s=0x7fce6515b400) at mpm_common.c:94
#8  0x00007fce63f7df03 in main (argc=2, argv=0x7ffe4f92fba8) at main.c:792
Comment 3 Luca Toscano 2017-09-21 07:15:40 UTC
Interesting. Would it be possible for you to either upgrade mod_http2 to the latest version (https://github.com/icing/mod_h2) or just disable HTTP/2, and see if the issue recurs?
Comment 4 Nic Jansma 2017-09-21 17:08:10 UTC
Hi Luca,

Good idea.  I disabled mod_http2 (commented out #LoadModule) and verified it's not being loaded anymore via httpd -M.

Unfortunately I'm still seeing the same behavior.  After a "reload" of httpd, I see multiple PIDs stuck in Stopping "yes (old gen)" with 0 connections.  These processes haven't exited for several hours.

I've done several "reload"s today.  Sometimes the old processes will transition to Stopping and then exit.  Sometimes those processes will hang around forever.  I currently have 4 processes in "yes (old gen)":

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	31094	yes (old gen)	64	no	0	0	0	0	0
1	26989	yes (old gen)	27	no	0	0	0	0	0
2	21910	yes (old gen)	83	no	0	0	0	0	0
3	31289	yes (old gen)	45	no	0	0	0	0	0
5	3085	no	6	yes	2	62	0	6	0
8	3086	no	5	yes	0	64	0	5	0
9	3089	no	41	yes	4	60	0	30	8
Sum	7	4	271	 	6	186	0	41	8
Comment 5 Luca Toscano 2017-09-21 18:05:45 UTC
Thanks for the test! Would you mind also getting a 'thread apply all bt' from the stuck processes, as you did before?
Comment 6 Nic Jansma 2017-09-21 23:44:25 UTC
Sure -- here's one of the 4 processes stuck in "yes (old gen)":

https://pastebin.com/i6uHDG3b

The main difference is that Thread 1 has this stack:

Thread 1 (Thread 0x7f524768a8c0 (LWP 31094)):
#0  0x00007f5246193f57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f52463ceecb in apr_thread_join (retval=retval@entry=0x7ffcfc5c87b4, thd=thd@entry=0x7f523005e338) at threadproc/unix/thread.c:217
#2  0x00007f523b37c83c in join_workers (listener=0x7f523005e338, threads=threads@entry=0x7f5248a37c80) at event.c:2084
#3  0x00007f523b37cc11 in child_main (child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0) at event.c:2282
#4  0x00007f523b3800d7 in make_child (s=0x7f52487a5400, slot=slot@entry=0, bucket=bucket@entry=0) at event.c:2353
#5  0x00007f523b38099a in perform_idle_server_maintenance (num_buckets=<optimized out>, child_bucket=<optimized out>) at event.c:2536
#6  server_main_loop (num_buckets=1, remaining_children_to_start=0) at event.c:2664
#7  event_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at event.c:2744
#8  0x00007f52476f756e in ap_run_mpm (pconf=pconf@entry=0x7f524877e168, plog=0x7f52487ab388, s=0x7f52487a5400) at mpm_common.c:94
#9  0x00007f52476eff03 in main (argc=2, argv=0x7ffcfc5c8c28) at main.c:792
Comment 7 Eric Covener 2017-09-22 00:07:45 UTC
the other threads in the pastebin are still mod_h2
Comment 8 Nic Jansma 2017-09-22 00:11:11 UTC
Sorry, please start at line 1004.  Lines previous to that were from the first dump, it appears.

Lines 1004-1486 (66 Threads) are from "now".
Comment 9 Eric Covener 2017-09-22 00:18:26 UTC
From the pair of dumps, two different sets of code seem to be unaware the server is trying to exit (while the listener thread is aware)
Comment 10 Eric Covener 2017-09-22 00:22:14 UTC
process, rather
Comment 11 Yann Ylavic 2017-09-22 08:28:18 UTC
Created attachment 35360 [details]
Cleanup connection on pollset failure

Looks really weird that there is no busy worker but active connections (the listener is waiting for connection_count == 0).
The only thing I can think of is some pollset error missing the cleanup.

Nic, does that patch (on 2.4.27) help?
There would be logged errors but no hang...
Comment 12 Nic Jansma 2017-09-24 02:12:47 UTC
Comment on attachment 35360 [details]
Cleanup connection on pollset failure

Built and installed!  Testing and will let you know.

BTW, since yesterday the server has gotten itself "stuck" twice.  Here was the scoreboard before I installed this patch:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	31094	yes (old gen)	64	no	0	0	0	0	0
1	26989	yes (old gen)	27	no	0	0	0	0	0
2	21910	yes (old gen)	83	no	0	0	0	0	0
3	31289	yes (old gen)	45	no	0	0	0	0	0
4	20918	yes	31	no	0	0	0	0	0
5	19797	yes	29	no	0	0	0	0	0
6	23859	yes	25	no	0	0	0	0	0
7	23932	yes	29	no	0	0	0	0	0
8	24010	yes	24	no	0	0	0	0	0
9	24079	yes	44	no	0	0	0	0	0
10	24147	yes	46	no	0	0	0	0	0
11	24148	yes	48	no	0	0	0	0	0
12	24283	yes	27	no	0	0	0	0	0
13	24352	yes	31	no	0	0	0	0	0
14	24422	yes	45	no	0	0	0	0	0
15	24423	yes	46	no	0	0	0	0	0
16	24562	yes	17	no	0	0	0	0	0
17	24631	yes	30	no	0	0	0	0	0
18	24632	yes	35	no	0	0	0	0	0
19	24795	yes	55	no	0	0	0	0	0
20	23025	yes	45	no	0	0	0	0	0
21	25484	yes	26	no	0	0	0	0	0
22	30994	yes	55	no	0	0	0	0	0
23	1568	no	5	yes	3	61	0	4	0
24	1784	no	38	yes	7	57	0	26	6
Sum	25	23	950	 	10	118	0	30	6
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
..G.G.GG.G...GG.....G..G..G....G....G..GGGGGGGG.GG.......GG.G.GG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
G.GGGGGG.GGGG.GGGG.GGGG..GGGG..GG.GGGG..G.GG.G.G.GGG...G.GGGGGGG
G..GGGG..GGG.GG.G.GG.G...GGGG.GGG..G.....GG.GG....G....G.G.G...G
G.G........GG...G..G.G...G.GG.G.GG...GG.GG....GGG....G.GG.GGGGGG
G......G..GGG..GG.GG...G....GG...GG.....G.GG.......GG.G.GGG.G.G.
..G.G....GGGG.G.GGG...G.G.......G..G...G.G..GGGG.G.GGGGG.G.G.G..
.GGG..GGG..GGG.......G..GG.GG....GG..........G..G..G..G.G..GG.G.
GGGG..GGGGGG..GGGGGGG.G.GGG.GGGGG.G.G.G.G..G....GGGGGG.GGG.G.GGG
G.GGG.G.G...G.G.GG..GGGGG..GGG.GGGGGGGGGGGGGGGGG..GGGG..GGGGG.GG
.GGGGGG.GGGGG..GGG.GGG.GG.GGG.GGGG..GGGG.GG.GG..GGGGG.GGGGGGGG.G
....GG....G...G.GG..GGG..G.GGG.....G...G.GGG...GG..G....GGGGG.G.
..GGG..G...GG.........GGGGG.GG.......GG.GGGG.GGG..G.G..GGGGG..GG
..GGG....G..GGGGGGGGG.GGGGGGG.GGGGGGG.GGGGGGGG.GG.G.GGGG.G.G...G
.G..G.GGGGGGGG.G.GG..GGGGGGGG.GGGGG.GGGG.GG.G.GGG.G..GGG.G.GGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGR___________________R____________________________________
___W_______________W______________R_____________WW____________W_
__W___W
Comment 13 Nic Jansma 2017-09-24 16:33:37 UTC
So far, so good.  The patch has been running for ~12 hours and I don't see any processes stuck in "yes (old gen)" with 0 active connections.

Prior to the patch, I would almost always see processes in that stuck state after Apache reloaded when logs were rotated.

I've issued a couple "systemctl reload httpd" commands to try to force the issue, and so far with this patch, I'm not seeing any processes get stuck.

Will continue running, but so far it's looking promising.
Comment 14 Nic Jansma 2017-09-26 13:05:31 UTC
It was looking good for about 36 hours!  No processes stuck at all.

Unfortunately this morning, at about 48 hours, I see 3 processes stuck:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	27258	yes (old gen)	47	no	0	0	0	0	0
1	25666	yes (old gen)	18	no	0	0	0	0	0
2	2346	yes (old gen)	17	no	0	0	0	0	0
3	17013	no	35	yes	4	60	0	23	7
4	2929	no	2	yes	0	64	0	2	0
Sum	5	3	119	 	4	124	0	25	7
G.GGG.GGG.GGGGGG.GG.GG.GGGG.GG.GG..GG.GGG.GGGG..G..GG.GGGGGGGGGG
GG.G..............G.GGGGG......GG.....GGG.G.............G.G.G...
..........G....G.G......GGGG.GG.......G.......GG...G.GG...G....G
_______R________________________W____________________R________W_
________________________________________________________________
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
Comment 15 Yann Ylavic 2017-09-26 15:42:15 UTC
Thanks Nic for testing.

Any [error] in error_log?
Could you please provide the full backtrace (thread apply all bt) of a stuck process?
Comment 16 Nic Jansma 2017-09-27 13:00:01 UTC
Unfortunately no [errors] in the error_log.

Now, I do have mod_security2 installed, and the problem is possibly correlated with being hit by Slowloris-style attacks.

Some of the times I've seen processes get stuck have been right after an attack such as:

[Tue Sep 26 18:25:46.367095 2017] [:warn] [pid 17763:tid 140504981088000] [client 204.12.234.170:63613] ModSecurity: Access denied with code 400. Too many threads [51] of 50 allowed in WRITE state from x.x.x.x - Possible DoS Consumption Attack [Rejected]

Here's the latest thread dump of a process (pid 27258) stuck in "yes (old gen)" since yesterday:

https://pastebin.com/9Pda9wG9

Scoreboard at the time:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	27258	yes (old gen)	47	no	0	0	0	0	0
1	25666	yes (old gen)	18	no	0	0	0	0	0
2	2346	yes (old gen)	17	no	0	0	0	0	0
3	28733	no	59	yes	7	57	0	39	12
4	17763	yes (old gen)	4	no	0	0	0	0	0
5	20520	yes (old gen)	6	no	0	0	0	0	0
6	25471	no	3	yes	0	64	0	3	0
Sum	7	5	154	 	7	121	0	42	12
G.GGG.GGG.GGGGGG.GG.GG.GGGG.GG.GG..GG.GGG.GGGG..G..GG.GGGGGGGGGG
GG.G..............G.GGGGG......GG.....GGG.G.............G.G.G...
..........G....G.G......GGGG.GG.......G.......GG...G.GG...G....G
_____W_______R_________R_______R______R__R____________R_________
..............................................G........G.GG.....
.....GG..............G.........GG...........................G...
________________________________________________________________
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................

Not sure if ModSecurity could play a role here.  I could temporarily disable it if you think that would help debugging things.
Comment 17 Nic Jansma 2017-10-01 12:33:58 UTC
It's been 4 days of running with ModSecurity off, and I haven't had a repro of the issue at all. I'm fairly confident at this point, given the correlation between the timing of the Slowloris attacks and when these stuck threads appear, that something in ModSecurity's connection rejection may be triggering this bad behavior.

e.g. this code executes during the attack and those connections get "dropped"

https://github.com/nicjansma/ModSecurity/blob/master/apache2/mod_security2.c#L1459

Should I open an issue in their Github?

This behavior can be triggered by slowhttptest -H or -X: https://github.com/shekyan/slowhttptest

SecConnEngine On
SecConnReadStateLimit 10
SecConnWriteStateLimit 10
Comment 18 Eric Covener 2017-10-01 12:50:08 UTC
Maybe a bit premature to ask mod_security to make a change API-wise.

Looks like a process_connection() hook could complete without changing the state, and there is no catch-all for that in process_socket(). Yann, probably up your alley how we could ditch that connection. Hard for mod_security to know we expect it to do anything (set it to linger?). How about allowing errors returned by process_connection?
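
For illustration, a minimal sketch of the failure mode described above: a hypothetical process_connection hook (this is not mod_security's actual code; connection_over_limit() is a made-up stand-in for its SecConn*StateLimit policy) that drops the client itself, returns OK, and never tells the event MPM what state the connection is in:

#include "httpd.h"
#include "http_config.h"
#include "http_connection.h"
#include "apr_network_io.h"

/* Hypothetical stand-in for a policy check such as mod_security's
 * SecConnReadStateLimit/SecConnWriteStateLimit logic. */
static int connection_over_limit(conn_rec *c)
{
    (void)c;
    return 1;   /* pretend the limit is always exceeded, for illustration */
}

static int drop_example_process_connection(conn_rec *c)
{
    if (connection_over_limit(c)) {
        apr_socket_t *csd = ap_get_conn_socket(c);
        if (csd) {
            apr_socket_close(csd);   /* the client is dropped... */
        }
        /* ...but the conn_state_t tracked by mpm_event is left untouched,
         * and OK only says "handled" without saying how.  The MPM keeps
         * counting this connection, so the child never finishes its
         * graceful stop. */
        return OK;
    }
    return DECLINED;    /* let the default connection processing run */
}

static void drop_example_register_hooks(apr_pool_t *p)
{
    ap_hook_process_connection(drop_example_process_connection,
                               NULL, NULL, APR_HOOK_FIRST);
}

module AP_MODULE_DECLARE_DATA drop_example_module = {
    STANDARD20_MODULE_STUFF,
    NULL, NULL, NULL, NULL, NULL,
    drop_example_register_hooks
};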
Comment 19 Nic Jansma 2017-10-25 02:18:31 UTC
The good news is I haven't had a repro of the issue in the last 24 days since turning off ModSecurity2.

Is there anything I can do to help debug or move this issue forward?
Comment 20 Silvan 2017-11-08 13:11:01 UTC
Hi,

We are facing the same issue since we upgraded to 2.4.27 and 2.4.29 later on.
After a few days we see the same behaviour, scoreboard filling up with stuck processes.

Scoreboard looks like this with 2.4.29-1 running:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	59291	yes (old gen)	5	no	0	0	0	0	0
1	5519	yes (old gen)	11	no	0	0	0	0	0
2	15481	no	25	yes	5	59	0	18	0
3	2000	yes (old gen)	2	no	0	0	0	0	0
4	5527	yes (old gen)	6	no	0	0	0	0	0
5	65458	no	15	yes	6	58	0	12	0
6	15873	no	38	yes	11	53	0	28	0
7	15891	no	33	yes	13	51	0	25	0
Sum	8	4	135	 	35	221	0	83	0
....G..........GG.................G.....G.......................
..G..GG.G....G...G................G.......GG......G............G
_________________________________W___WWW_________________W______
.......GG.......................................................
......................G............G........GG.........G.G......
_____________W___W____W_________________WW________W_____________
__W_____W__W___W__________W____W________W___W_________W___W___W_
_WW___WW_WW________WW_W____W____________W__W________W___________
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
Comment 21 Yann Ylavic 2017-11-08 17:58:23 UTC
(In reply to Silvan from comment #20)
> 
> We are facing the same issue since we upgraded to 2.4.27 and 2.4.29 later on.
> After a few days we see the same behaviour, scoreboard filling up with stuck
> processes.

Does it involve mod_security too?
Comment 22 Silvan 2017-11-09 08:37:36 UTC
I am trying to figure out what processes are involved, how could we see it?
We are using mpm_event with mod_jk at least.
Comment 23 Silvan 2017-11-09 08:38:13 UTC
we are not using mod_security
Comment 24 Yann Ylavic 2017-11-09 10:34:07 UTC
(In reply to Silvan from comment #22)
> I am trying to figure out what processes are involved, how could we see it?
> We are using mpm_event with mod_jk at least.

You could run gdb on one of the PIDs in the G state and enter "thread apply all bt" in gdb.
Comment 25 Yann Ylavic 2017-11-09 10:41:45 UTC
(In reply to Silvan from comment #23)
> we are not using mod_security

This may not be the same issue, then; it seems to me that mod_jk has no process_connection hook. Also, Nic is reporting an issue with 2.4.27, while yours started with 2.4.29.

Can you please open another report with the gdb output?
Comment 26 Yann Ylavic 2017-11-09 11:32:07 UTC
Created attachment 35510 [details]
Work around process_connection hooks not updating the state

This patch should address the mod_security case (or any module returning a connection to MPM event with no instruction on what to do with it).

Nic could you please test it? (without attachment 35360 [details] which is not needed a priori).
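
For illustration only (this is not the attached diff): the kind of catch-all described above can be thought of as a helper that mpm_event's process_socket() applies to whatever state the process_connection hooks left behind, falling back to lingering close whenever that state is not one the MPM knows how to continue:

#include "httpd.h"   /* conn_state_e and the CONN_STATE_* values */

/* Sketch of the catch-all: keep the post-hook state only if the event MPM
 * can act on it; anything else (e.g. a hook that returned OK without
 * touching the state) is turned into a lingering close so the connection
 * is not leaked from the child's connection count. */
static conn_state_e sanitize_post_hook_state(int hook_rc, conn_state_e state)
{
    if (hook_rc == OK
        && (state == CONN_STATE_WRITE_COMPLETION
            || state == CONN_STATE_CHECK_REQUEST_LINE_READABLE
            || state == CONN_STATE_SUSPENDED
            || state == CONN_STATE_LINGER)) {
        return state;           /* a state mpm_event knows how to handle */
    }
    return CONN_STATE_LINGER;   /* otherwise: close the connection gracefully */
}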
Comment 27 Markus Wennrich 2017-11-18 19:07:01 UTC
JFYI: We have the same problem if we set MaxConnectionsPerChild to something other than 0.
Dead threads then pile up, even with simple empty TCP connects (like sh -c 'while :; do echo "" | nc -N localhost 443; sleep 1; done').

With MaxConnectionsPerChild=0 the number of threads is stable.
Comment 28 Yann Ylavic 2017-11-23 11:11:48 UTC
(In reply to Markus Wennrich from comment #27)
> JFYI: We have the same problem,

This PR has been renamed.

Is your issue with version 2.4.27 and with a module that acts like mod_security's slowloris prevention (i.e. not reporting dropped connections as handled)?

Otherwise (2.4.29 with "vanilla" httpd), please give more information in Bug 61786 (opened by Silvan), as asked there.
Comment 29 dferradal 2017-11-23 11:27:30 UTC
We have had the exact issue with another third-party module. All kinds of different versions of weblogic module mod_wl_24.

I resorted to setting MaxConnectionsPerChild to 0 and/or never doing graceful restarts, and setting MaxSpareThreads to the same value as MaxRequestWorkers to avoid any kind of process/thread recycling, and managed to work around this issue successfully.

I hope I can get as good info as Nic has provided soon, but I can't reproduce it that easily.

Although you guys seem to have caught the issue already, perhaps a timeout parameter for gracefully finishing threads could be considered in the future for buggy third-party modules? Something like (in seconds):

GracefulTimeOut 300
Comment 30 Yann Ylavic 2017-11-23 11:57:57 UTC
(In reply to dferradal from comment #29)
> We have had the exact issue with another third-party module. All kinds of
> different versions of weblogic module mod_wl_24.

Daniel, is that with 2.4.27?
I keep asking this because there is a significant change in mpm_event starting with 2.4.29 and related to how workers and graceful restart events have to wake up the listener thread when it's polling (possibly indefinitely) for connections.

This (current) PR is not related to the above change (because it happens with 2.4.27), but probably due to some third party module's processing hook which does not report an expected connection state to mpm_event, while that used to work with mpm_worker for instance.

So for your case to be related to this PR, mod_wl (which I know nothing about) should have a process_connection hook acting like this, and attachment 35510 [details] should fix it.

> I resorted to setting MaxConnectionsPerChild to 0 and/or never doing
> graceful restarts, and setting MaxSpareThreads to the same value as
> MaxRequestWorkers to avoid any kind of process/thread recycling, and
> managed to work around this issue successfully.

I'm not sure this (current) PR has anything to do with child recycling; I think it could happen without it.

> 
> I hope I can get as good info as Nic has provided soon, but I can't
> reproduce it that easily.

Possibly more Bug 61786's material, then...

> Although you guys seem to have caught the issue already, perhaps a timeout
> parameter for gracefully finishing threads could be considered in the future
> for buggy third-party modules? Something like (in seconds):
> 
> GracefulTimeOut 300

Let's discuss this on dev@ if you wish; that's certainly not a fix for processes stuck with no connections to handle...
Comment 31 Nic Jansma 2017-11-26 00:54:49 UTC
(In reply to Yann Ylavic from comment #26)
> Created attachment 35510 [details]
> Work around process_connection hooks not updating the state
> 
> This patch should address the mod_security case (or any module returning a
> connection to MPM event with no instruction on what to do with it).
> 
> Nic could you please test it? (without attachment 35360 [details] which is
> not needed a priori).

Yann,

I really appreciate your investigation into this issue!

I haven't yet built Apache from source for this production server, and I'm a bit hesitant to now that things are running smoothly (with mod_security2 off).  Is this patch waiting on me to test before it gets checked in, or might it make it in soon anyway?
Comment 32 dferradal 2017-11-29 08:40:48 UTC
(In reply to Yann Ylavic from comment #30)

> Daniel, is that with 2.4.27?
> I keep asking this because there is a significant change in mpm_event
> starting with 2.4.29 and related to how workers and graceful restart events
> have to wake up the listener thread when it's polling (possibly
> indefinitely) for connections.
> 
> This (current) PR is not related to the above change (because it happens
> with 2.4.27), but probably due to some third party module's processing hook
> which does not report an expected connection state to mpm_event, while that
> used to work with mpm_worker for instance.
> 
> So for your case to be related to this PR, mod_wl (which I know nothing
> about) should have a process_connection hook acting like this, and
> attachment 35510 [details] should fix it.

Indeed, that's exactly what I'm using, version 2.4.27.

Thank you so much.
Comment 33 Yann Ylavic 2017-12-20 14:53:44 UTC
Committed in r1818804.

Nic, I don't think this patch will wait on you to be backported to 2.4.x (if the team reviews it positively); on the other hand, I can't guarantee this is the right one for your issue either...
Comment 34 Nic Jansma 2017-12-25 22:21:36 UTC
Thanks Yann!  Is there anything I can do to help encourage it being backported?
Comment 35 Luca Toscano 2017-12-26 19:12:28 UTC
(In reply to Nic Jansma from comment #34)
> Thanks Yann!  Is there anything I can do to help encourage it being
> backported?

Hi Nic,

if you could test the patch and give us some feedback it would surely help!

Luca
Comment 36 Nic Jansma 2018-01-03 18:12:27 UTC
Hi Luca!

I attempted applying r1818804, r1818951, r1818960 and r1818960 against server/mpm/event.c in 2.4.27, and got it to compile.  However, there appear to be a lot of other differences in event.c in 2.5.x vs 2.4.27, so I'm not sure how complete my backport was.

After the patch, when I start the server and run a Slowloris attack:

> slowhttptest -H -u http://127.0.0.1/a -c 1000 -l 30

I pretty quickly see slots go into 'G', and a subsequent 'systemctl reload httpd' puts them into Stopping "yes (old gen)" permanently:

Slot	PID	Stopping	Conns total	Accepting	Threads busy	Threads idle	Async writing	Async keep-alive	Async closing
0	24231	yes (old gen)	191	no	0	0	0	0	0
1	24396	yes (old gen)	146	no	0	0	0	0	0
2	23398	yes (old gen)	184	no	0	0	0	0	0
3	23442	yes (old gen)	119	no	0	0	0	0	0
4	23507	yes (old gen)	121	no	0	0	0	0	0
5	24397	yes (old gen)	191	no	0	0	0	0	0
6	24736	yes (old gen)	2	no	0	0	0	0	0
7	25056	yes (old gen)	166	no	0	0	0	0	0
8	25223	yes (old gen)	191	no	0	0	0	0	0
9	25224	yes (old gen)	193	no	0	0	0	0	0
10	25560	yes (old gen)	191	no	0	0	0	0	0
11	25726	yes (old gen)	91	no	0	0	0	0	0
12	25727	yes (old gen)	119	no	0	0	0	0	0
13	26086	no	0	yes	1	63	0	0	0
14	26087	no	0	yes	0	64	0	0	0
15	26090	no	0	yes	0	64	0	0	0
Sum	16	13	1905	 	1	191	0	0	0
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GG..............................................................
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
_W______________________________________________________________
________________________________________________________________
________________________________________________________________

I'm able to tie up all of the available connections in two cycles of slowhttptest/reload on my test server.

The good news is this is easily reproducible via the following steps:

Apache 2.4.27
mod_security2 with config of https://pastebin.com/6tDXBBvY

run:
> slowhttptest -H -u http://127.0.0.1/a -c 1000 -l 30    # connections should go to 'G' quickly
> systemctl reload httpd
> slowhttptest -H -u http://127.0.0.1/a -c 1000 -l 30    # connections should go to 'G' quickly
> systemctl reload httpd                                 # my test setup is now unresponsive

So I guess I would say that my attempt at backporting just the few changes mentioned above to 2.4.27 doesn't seem to fix the issue, but I'm not sure what other changes might be needed as prerequisites for those revisions.
Comment 37 Yann Ylavic 2018-01-04 01:54:17 UTC
Created attachment 35649 [details]
Patches for 2.4.27

Thanks Nic for testing.

There were several changes to mpm_event (and related code) since 2.4.27, and I think the attached set of 3 patches should be applied for your test to be relevant (all are already, or will be, part of the next 2.4.x).

I quickly tested the slowhttptest scenario on 2.4.27 and it seems to work/fail as expected with/without the patches.

Please note that, according to your mod_security settings (i.e. SecConnReadStateLimit 50), there may still be at most 50 active connections/threads handled by "old gen" processes with the other/idle threads in the G state; this is the expected behaviour (for mod_security's SecConnReadStateLimit detection and action).
There may also be "old gen" processes with no active connection (all threads in the G state), depending on your MPM settings (see below); those can/will be reused at any time according to the load.

Also, if you expect httpd reloads with active/long-standing connections kept alive, you probably want to "oversize" ServerLimit w.r.t. MaxRequestWorkers and ThreadsPerChild, such that there is room in the scoreboard for gracefully restarting processes (without interfering with the new generation ones).
This relaxes the usual/strict "MaxRequestWorkers = ServerLimit * ThreadsPerChild" formula to take reloads into account. For example, if MaxRequestWorkers should be 1000 (adapt to your needs), something like ThreadLimit = ThreadsPerChild = 50 and ServerLimit = 50 (instead of 20) leaves room for 2.5 simultaneous reloads, or one reload and some MaxConnectionsPerChild kicking in at the same time. YMMV...
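
For illustration, such an "oversized" configuration could look like the following (directive values mirror the example above and are only a sketch; adapt them to your own sizing):

# Illustrative values only: 1000/50 = 20 processes serve the active
# generation, while ServerLimit 50 leaves 30 spare scoreboard slots for
# gracefully finishing old-generation processes after reloads.
ThreadLimit        50
ThreadsPerChild    50
ServerLimit        50
MaxRequestWorkers  1000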
Comment 38 Nic Jansma 2018-01-06 16:12:59 UTC
Awesome! With the supplied patches against 2.4.27, I've verified the fix works via slowhttptest, and (so far) it's looking good on our production server.

Many thanks!  Looking forward to the next release.
Comment 39 Yann Ylavic 2018-01-18 14:02:12 UTC
Backported to 2.4.x in r1820796; this will be in the upcoming 2.4.30.