Bug 66160

Summary: Apache HTTPD stops serving and freezes HTTP/2 connections
Product: Apache httpd-2
Component: mod_http2
Version: 2.4.53
Hardware: PC
OS: Solaris
Status: NEW
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Steffen Moser <public>
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Attachments: Analysis data from Apache HTTPD being in a state where a subset of the incoming HTTP/2 connections freeze

Description Steffen Moser 2022-07-05 00:22:28 UTC
Created attachment 38334 [details]
Analysis data from Apache HTTPD being in a state where a subset of the incoming HTTP/2 connections freeze

We are running Apache 2.4.53 on Oracle Solaris 11.4 SRU 44 on x64 hardware. Apache has been compiled by Oracle. The server mainly runs an e-learning platform consisting of Moodle, Nextcloud, Mahara, Guacamole and several other tools, serving about 2000 part-time, mainly distance-learning students.

HTTP/2 is active.

After several weeks of running perfectly, Apache HTTPD seems to stop serving some of the newly incoming HTTP requests made over HTTP/2. The affected TCP connections stay frozen in the ESTABLISHED state on both client and server.

Apache HTTPD stays in this "erratic state" (where some connections are served fine and some are frozen) until the web server process is restarted by issuing "svcadm restart apache24". After the restart, the system runs fine for two or three weeks until the problem reoccurs out of nowhere.

When we force clients to use HTTP/1.1, the problem cannot be triggered even while Apache HTTPD is in the above-mentioned erratic state. It seems to be strongly related to HTTP/2.

Our servers with a similar configuration but with HTTP/2 deactivated do not seem to be affected - although they are generally under lighter load.

You'll find an attachment consisting of four parts:

1) "truss" output of all running Apache HTTPD processes when an incoming HTTP/2 connection is served successfully.

2) "truss" output of all running Apache HTTPD processes when an incoming HTTP/2 connection freezes.

3) "pstack" analysis of the affected process and its threads.

4) An output of Apache HTTPD's "server-status" when it was in the erratic state.


Looking at part 3), the problem seems to narrow down to the calls issued by thread #32 of process #8544:

------------  lwp# 32 / thread# 32  ---------------
 00007fb3cfc73147 lwp_park (0, 7fb3c4e16a20, 0)
 00007fb3cfc6b8d7 cond_wait_queue () + 67
 00007fb3cfc6bd5a cond_wait_common () + 1ea
 00007fb3cfc6bfdd __cond_timedwait () + 6d
 00007fb3cfc6c08a cond_timedwait () + 2a
 00007fb3cfc6c0c9 pthread_cond_timedwait () + 9
 00007fb3cf6428bf apr_thread_cond_timedwait () + 6f
 00007fb3c8c327db h2_mplx_m_out_trywait () + bb
 00007fb3c8c40a49 h2_session_process () + 1269
 00007fb3c8c28716 h2_conn_run () + 96
 00007fb3c8c2fea2 h2_h2_process_conn () + 9b2
 00000000004ac08b ap_run_process_connection () + 3b
 00007fb3ca00adee process_socket () + 3fe
 00007fb3ca00e365 worker_thread () + 355
 0000000000470d89 thread_start () + 19
 00007fb3cf65659e dummy_worker () + e
 00007fb3cfc72de3 _thrp_setup () + b3
 00007fb3cfc73100 _lwp_start ()

It seems to me that the "h2_mplx_m_out_trywait()" function, which is implemented in "httpd-2.4/modules/http2/h2_mplx.c", runs into a deadlock when it tries to enter the mutex-protected section, possibly via the "H2_MPLX_ENTER(m)" macro.

The problem is: I had to restart Apache HTTPD on the affected machine because it is a production server. I hope I was able to gather the relevant data while it was in this strange state. If not, we unfortunately have to wait until the problem re-occurs.
Comment 1 Stefan Eissing 2022-07-11 07:27:27 UTC
Hi Steffen,

the stack trace you listed does not point to a deadlock: the conditional, timed wait releases the mutex, so this thread cannot be blocking others. If someone is holding the mplx lock, it must be another thread.

Seeing h2 in this part of the code is expected during connection processing. The question is whether your Apache stops serving new connections or new requests on existing connections. The former would point to exhaustion of the MPM workers, the latter to exhaustion of the h2 worker pool.

Exhaustion can happen due to insufficient capacity, or indeed due to a bug in the server that blocks processing and holds resources for too long.

The trickiest things in an HTTP server are long-running requests. If your setup/backend can encounter those, they can clog up your system. HTTP/2 is more vulnerable to that if your h2 worker pool is "too small".
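For reference, the size of that pool is controlled by mod_http2 directives. A hedged illustration only (the numbers below are placeholders to tune against your own workload, not recommendations):

```apache
# Illustrative values only - adjust to your traffic and backend latency.
H2MaxWorkers        150   # upper limit on h2 worker threads per child
H2MinWorkers        16    # workers kept alive per child process
H2MaxSessionStreams 100   # concurrently open streams per HTTP/2 connection
```

If long-running requests occupy most of these workers, remaining streams queue up and their connections can appear frozen while HTTP/1.1 traffic is still served by the MPM workers.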

Hope this helps,

Stefan