Bug 61786 - Solaris pollset wakeup pipe not rearmed
Summary: Solaris pollset wakeup pipe not rearmed
Status: REOPENED
Alias: None
Product: APR
Classification: Unclassified
Component: APR
Version: 1.6.3
Hardware: Sun Solaris
Importance: P2 normal
Target Milestone: ---
Assignee: Apache Portable Runtime bugs mailinglist
URL:
Keywords:
Duplicates: 62634 62644
Depends on:
Blocks:
 
Reported: 2017-11-20 12:38 UTC by Silvan
Modified: 2019-06-22 01:11 UTC
CC List: 4 users



Attachments
pstack on all the processes after the server stopped responding (8.23 KB, application/zip)
2017-11-28 05:56 UTC, Konstantin J. Chernov
gdb output of process stuck in graceful (116.22 KB, text/plain)
2017-12-01 08:27 UTC, Silvan
Rearm polled Solaris Port wakeup pipe/events (2.68 KB, patch)
2018-01-02 14:32 UTC, Yann Ylavic

Description Silvan 2017-11-20 12:38:12 UTC
We are using mpm_event.

We see the behaviour that when the web server has been busy and the number of threads drops again, so it needs to clean up processes, some PIDs stay in a 'stopping' state for a very long time.

This causes the scoreboard to fill up after a while, halting the web server; we are then required to force a restart to free up the processes again.
Comment 1 Silvan 2017-11-23 07:18:04 UTC
Addition:

Scoreboard of one of our servers. The processes seem to be stuck stopping for ages.


(Conns/Accepting = connections total/accepting, Busy/Idle = threads, Writing/Keep-alive/Closing = async connections)

Slot  PID    Stopping  Conns  Accepting  Busy  Idle  Writing  Keep-alive  Closing
0     54957  yes       5      no         0     0     0        0           0
1     24721  yes       12     no         0     0     0        0           0
2     57230  yes       7      no         0     0     0        0           0
3     57892  yes       10     no         0     0     0        0           0
4     58393  yes       3      no         0     0     0        0           0
5     53025  no        35     yes        29    35    0        6           0
6     58529  yes       2      no         0     0     0        0           0
7     59050  yes       1      no         0     0     0        0           0
9     34843  no        8      yes        3     61    0        4           0
10    59762  yes       1      no         0     0     0        0           0
Sum   10     8         84                32    96    0        10          0
G...........G.........................G.........G...G...........
............G...................................................
..G.............G...G.G....G...........G...G....................
..................G...G......G.GG...........GG..G.G........G....
.....................G......................G...G...............
_W_WW_____W___W_WWW____WWW_W_____WWW__W_WWWWWW_WW____W___WWW_W__
...........G..................G.................................
................G...............................................
................................................................
__________________________W________________________W_____W______
.......................G........................................
................................................................
................................................................
................................................................
Comment 2 Yann Ylavic 2017-11-23 08:16:51 UTC
Could you please attach a debugger to one of the processes that has been in G state for quite some time and show us the state of the threads?

For example with gdb that would be:

$ gdb /path/to/httpd <pid of process in G-state>
(gdb) thread apply all bt full

and paste (or attach) the output here.

Thanks.
Comment 3 Konstantin J. Chernov 2017-11-28 05:56:03 UTC
Created attachment 35560 [details]
pstack on all the processes after the server stopped responding
Comment 4 Konstantin J. Chernov 2017-11-28 06:03:42 UTC
Hi.

We encountered the same situation the author describes.

It started on 2.4.26 (after updating from 2.4.23) and got much worse on 2.4.29.

While we were running 2.4.26 it used to happen once or twice a week; on 2.4.29 it happens 2-4 hours after server start. I also noticed that it usually happens on low-traffic servers (<1 rps), while the ones serving 80+ rps stay alive with the same configuration options.

We are using Solaris 11 (SPARC), Apache is compiled using "gcc -m64", apr 1.6.3, apr-util 1.6.1.

On both 2.4.26 and 2.4.23 we were using apr 1.5.2 and apr-util 1.5.4.

I attached pstack output for all the processes of the stuck Apache (approx. 30 minutes after it stopped responding).

There were no errors on 2.4.26; on 2.4.29 it started logging "scoreboard is full" before getting stuck.
Comment 5 Silvan 2017-11-29 08:40:48 UTC
We have not yet been able to reproduce it but will post the data as soon as the issue re-occurs.
Comment 6 Konstantin J. Chernov 2017-11-30 02:10:00 UTC
I've recompiled apache-2.4.29 with the following lines in event.c commented out:

//        rv = apr_pollset_create_ex(&event_pollset, pollset_size, pchild, flags,
//                                   good_methods[i]);
//        if (rv == APR_SUCCESS) {
//            listener_is_wakeable = 1;
//            break;
//        }

Running stable for now (24hrs passed).
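
For context, the commented-out call comes from the pollset setup in server/mpm/event/event.c. A rough reconstruction of the surrounding loop (the identifiers are taken from the snippet above; the exact flags value is an assumption, not a quote from the source):

/* Reconstruction, not verbatim event.c: try each preferred pollset method
 * with a wakeable pollset; the flags below are assumed. */
apr_uint32_t flags = APR_POLLSET_THREADSAFE | APR_POLLSET_WAKEABLE;
apr_status_t rv = APR_ENOTIMPL;
int i;

for (i = 0; i < sizeof(good_methods) / sizeof(good_methods[0]); i++) {
    rv = apr_pollset_create_ex(&event_pollset, pollset_size, pchild, flags,
                               good_methods[i]);
    if (rv == APR_SUCCESS) {
        /* The pollset supports apr_pollset_wakeup(), so the listener can
         * block without a fixed timeout and be woken up explicitly. */
        listener_is_wakeable = 1;
        break;
    }
}
/* With the call commented out, listener_is_wakeable stays 0 and httpd
 * presumably falls back to its non-wakeable pollset path with bounded
 * poll timeouts -- consistent with the hang no longer appearing. */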
Comment 7 Silvan 2017-12-01 08:26:28 UTC
We had the issue today. Will add the gdb dump.
Comment 8 Silvan 2017-12-01 08:27:27 UTC
Created attachment 35572 [details]
gdb output of process stuck in graceful
Comment 9 Luca Toscano 2017-12-02 09:06:35 UTC
I checked both Silvan's and Konstantin's debug traces (thanks to both!). As far as I can tell, this issue happens when the volume of HTTP requests to httpd drops and mpm_event cleans up some processes.

Some interesting notes from the stack traces:

1) in Silvan's gdb output, the listener seems to be waiting for threads to complete jk_tcp_socket_recvfull (mod_jk right?). So from my naive perspective it seems like the threads are still reading something from slow clients or something similar? Silvan, were you able to verify if slow clients were still sending data?

2) in Konstantin's pstack output, it seems that all the threads in a process are already done but the listener is waiting for polling data. Before 2.4.28 the listener thread waited in poll() for new data on the sockets for at most 100ms, then woke up to check the status of the timeout queues (keep-alives, lingering closes, etc.). With 2.4.28 the 100ms timeout is not used anymore, and we rely on a feature of apr_pollset that makes it "wakeable" (so the listener thread can be woken up from polling by another thread). It is really interesting that the issue happens on Solaris, so it might be that the 'wakeable' feature in APR for Solaris has some issues.
Yann committed r1809273 very recently (only in trunk for the moment, not backported to 2.4.x); that might help Konstantin's use case.

Luca
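
To illustrate the "wakeable" mechanism described above, here is a minimal standalone sketch (not taken from httpd; error handling omitted, and the waker() name is just for this example) of one thread blocked in apr_pollset_poll() being interrupted by apr_pollset_wakeup() from another thread. On an affected Solaris/APR_POLLSET_PORT build, it is the second and later wakeups that never arrive:

#include <stdio.h>
#include <apr_general.h>
#include <apr_poll.h>
#include <apr_time.h>
#include <apr_thread_proc.h>

/* Secondary thread: pretend some work finished, then wake the poller up. */
static void * APR_THREAD_FUNC waker(apr_thread_t *thd, void *data)
{
    apr_pollset_t *ps = data;
    apr_sleep(apr_time_from_sec(1));
    apr_pollset_wakeup(ps);            /* interrupts apr_pollset_poll() */
    return NULL;
}

int main(void)
{
    apr_pool_t *pool;
    apr_pollset_t *ps;
    apr_thread_t *thr;
    apr_status_t rv, tret;
    apr_int32_t n;
    const apr_pollfd_t *out;

    apr_initialize();
    apr_pool_create(&pool, NULL);

    /* APR_POLLSET_WAKEABLE makes APR add an internal wakeup pipe to the set. */
    apr_pollset_create(&ps, 16, pool, APR_POLLSET_WAKEABLE);
    apr_thread_create(&thr, NULL, waker, ps, pool);

    /* Block with no timeout; a working wakeup makes this return APR_EINTR. */
    rv = apr_pollset_poll(ps, -1, &n, &out);
    printf("apr_pollset_poll returned %d (APR_EINTR is %d)\n", rv, APR_EINTR);

    apr_thread_join(&tret, thr);
    apr_terminate();
    return 0;
}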
Comment 10 Silvan 2017-12-02 14:00:48 UTC
Hi Luca,

Yes, you are correct: we see this behaviour when we have had a busy period on our web servers and, as things settle down, mpm_event starts cleaning up and closing processes.

Also, yes, we confirm we are using mod_jk to connect to Tomcat.

It is very unlikely that the PIDs in stopping state with threads in 'G' state are still serving or receiving from slow clients. We don't really see slow clients; requests normally finish in a short time.

Silvan
Comment 11 Yann Ylavic 2017-12-02 17:32:17 UTC
Silvan,

there are clearly busy workers in attachment 35572 [details], all of which are waiting in mod_jk for a response from the backend (Tomcat). The process won't stop until these connections end; that's the expected behaviour.

Do you have different traces for processes in G state or are they all similar?
In the latter case, you probably need to find out why Tomcat isn't responding or closing the connection in a timely manner (and look at the mod_jk timeouts, which should trigger if responses really take too long).
Comment 12 Yann Ylavic 2018-01-02 14:25:07 UTC
(In reply to Konstantin J. Chernov from comment #6)
> I've recompiled apache-2.4.29 with following lines in event.c commented:
[snip]
> Running stable for now (24hrs passed).

Thanks Konstantin for the traces (and diagnostic).

After searching without success for some race condition(s) in mpm_event, it looks like the issue comes from the implementation of the "wakeup pipe" for Solaris' poll() mechanism in APR (i.e. APR_POLLSET_PORT).

Could you please test the following patch (against APR)?
Comment 13 Luca Toscano 2018-01-02 14:29:04 UTC
Yann did you attach the patch? For some reason I don't see it :(
Comment 14 Yann Ylavic 2018-01-02 14:32:01 UTC
Created attachment 35636 [details]
Rearm polled Solaris Port wakeup pipe/events

On Solaris, each polled event must be re-associated with the port afterwards for it to be re-armed, and that is not done for the wakeup pipe.

Sorry, no Solaris at hand, so this is not even compile-tested...
The patch is based on APR trunk, but it should apply to APR 1.6 (and probably 1.5) too.
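
For anyone unfamiliar with Solaris event ports, a small standalone illustration (Solaris-only, and not the patch itself) of the one-shot semantics the patch deals with: a descriptor whose event has been retrieved with port_get() is dissociated from the port and must be re-armed with port_associate() before it can fire again, which is what was missing for the internal wakeup pipe:

#include <port.h>
#include <poll.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int pipefd[2];
    int port = port_create();
    port_event_t ev;

    pipe(pipefd);
    /* Arm the read end once. */
    port_associate(port, PORT_SOURCE_FD, (uintptr_t)pipefd[0], POLLIN, NULL);

    write(pipefd[1], "x", 1);
    port_get(port, &ev, NULL);          /* fires once... */
    printf("event on fd %d\n", (int)ev.portev_object);

    /* ...and the association is now gone: a second write would never be
     * reported unless the fd is re-armed first. */
    port_associate(port, PORT_SOURCE_FD, (uintptr_t)pipefd[0], POLLIN, NULL);

    write(pipefd[1], "x", 1);
    port_get(port, &ev, NULL);          /* fires again only because of the
                                           re-association above */
    printf("event on fd %d again\n", (int)ev.portev_object);

    close(pipefd[0]);
    close(pipefd[1]);
    close(port);
    return 0;
}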
Comment 15 Yann Ylavic 2018-01-03 10:11:35 UTC
Fix committed to APR trunk, 1.7.x, and 1.6.x (respectively in r1819858, r1819937 and r1819938).
Comment 16 Konstantin J. Chernov 2018-04-03 09:32:28 UTC
Sorry for the late response.
Got the same issue with 2.4.33.

Applied the patch to APR; will see if anything changes.
Comment 18 Konstantin J. Chernov 2018-04-04 04:38:32 UTC
Looks like the patch is working; I've been running stable for more than 12 hrs now. With unpatched APR, httpd became completely stuck every ~6 hrs or less.
Comment 19 Yann Ylavic 2018-04-04 07:43:28 UTC
Bug reassigned to APR and resolved, fix now in APR 1.6.4 (and 1.7.x).

Thanks Konstantin for testing and confirming.
Comment 20 Eric Covener 2018-08-21 10:35:49 UTC
*** Bug 62644 has been marked as a duplicate of this bug. ***
Comment 21 Yann Ylavic 2018-08-24 23:43:00 UTC
Reopening until APR 1.6.4 is released.
Comment 22 Yann Ylavic 2018-08-27 09:44:52 UTC
*** Bug 62634 has been marked as a duplicate of this bug. ***