Bug 65180

Summary: mod_proxy_wstunnel stuck processes at graceful stopping
Product: Apache httpd-2
Reporter: Yoch <yoch.melka>
Component: mod_proxy_wstunnel
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW
Severity: normal
CC: daniel.scharon, shieldwed, yoch.melka
Priority: P2
Version: 2.4.46
Target Milestone: ---
Hardware: PC
OS: Linux
Attachments: Apache status summary; Apache Server Status

Description Yoch 2021-03-12 10:46:43 UTC
Created attachment 37768 [details]
Apache status summary

Hi,

I noticed that our production server has several processes in the "stop: yes (old gen)" state, which leads to excessive memory consumption.

Here is the scoreboard under normal load (screenshot attached):

G..GGG.G.G.G.GG..G.G..G.G........G.G.GGGGG..G.GG.GG.GG...GG.....
..........................G.............................WWW_WWW_
_WWWW__W__W______WWW__W___________WWW_W___W_W_W____W___W_R_____R
___W____________W____________W__________WR_____W..GG.G..GGG.....
.G............G...G....GG.G.G...................................
............G.........G....................GG...G..G.........G..
G.G.......G.......GG.G....GG....................................
.........................G......................................
..............................................................G.
................................................................
................................................................
................................................................
................................................................
................................................................
....
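
For reference, the scoreboard and the per-process "stop: yes (old gen)" flags come from mod_status; a sketch of pulling the machine-readable summary, assuming server-status is enabled and reachable locally:

    # prints the counters plus the Scoreboard: line shown above
    curl -s 'http://localhost/server-status?auto'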

I've investigated the issue with gdb, and it appears that some websocket connections handled by proxy_wstunnel are not properly released (trace attached below).
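
For reference, the backtrace attached below was captured roughly like this (a sketch, not necessarily the exact invocation used; <PID> is one of the stuck child processes):

    # attach to a stuck child and dump backtraces of all its threads
    gdb -p <PID> -batch -ex 'thread apply all bt'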

Details about our configuration:

Server version: Apache/2.4.25 (Debian)
Server built:   2019-10-13T15:43:54
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)

- event MPM config

<IfModule mpm_event_module>
        ServerLimit         16
        ThreadLimit         64
        ThreadsPerChild     60
        StartServers         2
        MaxRequestWorkers  900
        MinSpareThreads     30
        MaxSpareThreads     90
        AsyncRequestWorkerFactor       3
        MaxConnectionsPerChild         10000
</IfModule>

- mod_proxy_wstunnel config

<VirtualHost _default_:443>
    # ...
    ProxyPass /socket.io/ ws://127.0.0.1:5555/socket.io/
    ProxyPassReverse /socket.io/ ws://127.0.0.1:5555/socket.io/
</VirtualHost>
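
(This assumes mod_proxy and mod_proxy_wstunnel are loaded; on a Debian-style install that would be something like:)

    a2enmod proxy proxy_wstunnel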

- GDB `thread apply all bt` of one PID

Thread 4 (Thread 0x7f44887c8700 (LWP 32758)):
#0  0x00007f44b1642303 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f44b1b38201 in ?? ()
   from target:/usr/lib/x86_64-linux-gnu/libapr-1.so.0
#2  0x00007f44ae860fdc in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#3  0x00007f44b18ff4a4 in start_thread (arg=0x7f44887c8700)
    at pthread_create.c:456
#4  0x00007f44b1641d0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 3 (Thread 0x7f44907d8700 (LWP 32742)):
#0  0x00007f44b1642303 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f44b1b38201 in ?? ()
   from target:/usr/lib/x86_64-linux-gnu/libapr-1.so.0
#2  0x00007f44ae024d83 in ?? ()
   from target:/usr/lib/apache2/modules/mod_proxy_wstunnel.so
#3  0x00007f44ae43c2bc in proxy_run_scheme_handler ()
   from target:/usr/lib/apache2/modules/mod_proxy.so
#4  0x00007f44ae43d251 in ?? ()
   from target:/usr/lib/apache2/modules/mod_proxy.so
#5  0x000055f06836dd60 in ap_run_handler ()
#6  0x000055f06836e2f6 in ap_invoke_handler ()
#7  0x000055f068385db3 in ap_process_async_request ()
#8  0x000055f068382209 in ?? ()
#9  0x000055f068377bd0 in ap_run_process_connection ()
#10 0x00007f44ae85e6e8 in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#11 0x00007f44b18ff4a4 in start_thread (arg=0x7f44907d8700)
    at pthread_create.c:456
#12 0x00007f44b1641d0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 2 (Thread 0x7f44a9253700 (LWP 32701)):
#0  0x00007f44b1642303 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f44b1b38201 in ?? ()
   from target:/usr/lib/x86_64-linux-gnu/libapr-1.so.0
#2  0x00007f44ae024d83 in ?? ()
   from target:/usr/lib/apache2/modules/mod_proxy_wstunnel.so
#3  0x00007f44ae43c2bc in proxy_run_scheme_handler ()
   from target:/usr/lib/apache2/modules/mod_proxy.so
#4  0x00007f44ae43d251 in ?? ()
   from target:/usr/lib/apache2/modules/mod_proxy.so
#5  0x000055f06836dd60 in ap_run_handler ()
#6  0x000055f06836e2f6 in ap_invoke_handler ()
#7  0x000055f068385db3 in ap_process_async_request ()
#8  0x000055f068382209 in ?? ()
#9  0x000055f068377bd0 in ap_run_process_connection ()
#10 0x00007f44ae85e6e8 in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#11 0x00007f44b18ff4a4 in start_thread (arg=0x7f44a9253700)
    at pthread_create.c:456
#12 0x00007f44b1641d0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 1 (Thread 0x7f44b23f6040 (LWP 32508)):
#0  0x00007f44b19006dd in pthread_join (threadid=139932324366080, 
    thread_return=0x7ffe0416bb00) at pthread_join.c:90
#1  0x00007f44b1b3cb5b in apr_thread_join ()
   from target:/usr/lib/x86_64-linux-gnu/libapr-1.so.0
#2  0x00007f44ae85c247 in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#3  0x00007f44ae85c67e in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#4  0x00007f44ae85fc99 in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#5  0x00007f44ae85fd3c in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#6  0x00007f44ae860a6b in ?? ()
   from target:/usr/lib/apache2/modules/mod_mpm_event.so
#7  0x000055f0683510fe in ap_run_mpm ()
#8  0x000055f068349cfd in main ()
Comment 1 Yoch 2021-03-14 09:46:39 UTC
Created attachment 37769 [details]
Apache Server Status
Comment 2 Yoch 2021-03-29 20:42:14 UTC
(In reply to Yoch from comment #0)
> Created attachment 37768 [details]
> Apache status summary
> 
> Hi,
> 
> I noticed that our production server has several processes in the "stop: yes
> (old gen)" state, which leads to excessive memory consumption.

After two weeks of observation, it is pretty clear that the gracefully finishing processes cannot complete because of the open websocket connections. (This is quite easy to reproduce: simply do a graceful restart several times; see the sketch below.)
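
A minimal reproduction sketch, assuming at least one long-lived websocket client stays connected throughout (the apachectl variant and service name depend on the distribution):

    # with an active websocket connection open, reload a few times:
    for i in 1 2 3; do apachectl -k graceful; sleep 5; done
    # stuck "G" slots should then accumulate in the scoreboard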

If this is correct, it means that mod_proxy_wstunnel is seriously flawed, and may even interfere with Apache operations under certain conditions by preventing stopping processes from finishing.

For now, my workaround is to restart our Apache server every night to clean up all the stopping processes.
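
For the record, such a workaround amounts to something like this crontab entry (the exact service name depends on the distribution):

    # /etc/cron.d/apache-nightly-restart: full restart to reap stuck "G" processes
    0 4 * * * root systemctl restart apache2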

NB: I updated to the latest stable version of Apache (2.4.46) and the problem still occurs.
Comment 3 Christopher Pereira 2021-06-30 19:22:47 UTC
I have been seeing exactly the same behaviour with mod_proxy for some years/versions now, including the latest CentOS version (httpd-2.4.6-97.el7.centos.x86_64).

We are proxying CouchDB with this config:

        AllowEncodedSlashes On
        ProxyPass /couchdb http://localhost:5984/ nocanon
        ProxyPassReverse /couchdb http://localhost:5984/

The problem occurs with CouchDB longpoll or continuous feed connections (a.k.a. "live replication" or "continuous replication" in PouchDB jargon), i.e. those HTTP connections that stay alive.

The proxied connections are listed as "W" (sending reply) by Apache status, and all at once (!) they change to "G" (gracefully finishing), no matter what their SS time is (seconds since the beginning of the most recent request). I'm not sure what triggers it, but it happens many times a day, so I can provide more information if you need it.

strace shows that the processes in "G" state continue reading and writing data:

	read(73, "1\r\n\n\r\n", 8000)           = 6
	read(73, 0x7fd2a9cd05b8, 8000)          = -1 EAGAIN (Resource temporarily unavailable)
	writev(72, [{"\27\3\3\0\34\265\221\7\333\31s\324\204\20\7\323{\30\35h\311\236\253}\306H\274B\36>\274\375"..., 33}, {"\27\3\3\0\32\265\221\7\333\31s\324\205y\32\332o\220YgkIo_\270\336\0z\300\355n", 31}], 2) = 64
	read(73, 0x7fd2a9cd05b8, 8000)          = -1 EAGAIN (Resource temporarily unavailable)
	poll([{fd=73, events=POLLIN}], 1, 60000) = 1 ([{fd=73, revents=POLLIN}])


Here fd = 73 is the connection between Apache and CouchDB (LAN) and fd = 72 is the connection between Apache and the PouchDB client (WAN).
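
Output like the above can be captured with something along these lines (a sketch, not necessarily the exact invocation used; <PID> is the child in "G" state):

    # follow all threads of the child and trace the syscalls shown above
    strace -f -p <PID> -e trace=read,writev,poll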

After restarting CouchDB, all threads in "G" state become an open slot again (the same PID continues running).

The problem is that at some point all slots will be used and the server becomes unresponsive. Increasing MaxClients may also crash the server, because each process consumes around 60 MB.
Comment 4 Christopher Pereira 2021-06-30 19:31:08 UTC
Wait. I just restarted CouchDB and the processes in "G" state were killed. Please ignore the part of my comment where I said all threads in "G" state become an open slot again.
Comment 5 Eric Covener 2021-06-30 19:31:27 UTC
> I'm
> not sure what triggers it, but it happens many times in a day, so I can
> provide more information if you need.

It means the process is trying to exit. This will happen during graceful restart, MaxRequestsPerChild/MaxConnectionsPerChild being hit, or due to a wane in load and MaxSpareThreads being exceeded.

Looks like your issue is with mod_proxy_http on a long-poll kind of request as opposed to a websockets tunnel.  I think this would rely on mod_proxy timeouts being able to fire and eventually fail to let the process exit, unless the backend continually pings data to keep the long-poll connection alive. Do you know what the data looks like when idle?

I'd suggest a separate bug if it's not websockets.
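
For context, the mod_proxy timeouts referred to here can be set per worker on the ProxyPass line; a sketch, assuming the CouchDB setup from comment 3:

    # idle timeout (seconds) after which the proxied connection is aborted;
    # a backend that keeps pinging data will keep resetting it
    ProxyPass /couchdb http://localhost:5984/ nocanon timeout=60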
Comment 6 Christopher Pereira 2021-06-30 20:00:49 UTC
> It means the process is trying to exit. This will happen during graceful
> restart, MaxRequestsPerChild/MaxConnectionsPerChild being hit, or due to a
> wane in load and MaxSpareThreads being exceeded.
> 
> Looks like your issue is with mod_proxy_http on a long-poll kind of request
> as opposed to a websockets tunnel.  I think this would rely on mod_proxy
> timeouts being able to fire and eventually fail to let the process exit,
> unless the backend continually pings data to keep the long-poll connection
> alive. Do you know what the data looks like when idle?
> 
> I'd suggest a separate bug if it's not websockets.

Yes. CouchDB will be pinging data all the time to keep the long-poll connection alive.
So my understanding is that some condition is met and Apache tries to gracefully finish the thread, but it never finishes because CouchDB keeps pinging data.
Is it normal behaviour that long-poll connections are marked as "G" after some time?
I will make some tests to confirm that the PouchDB client is still connected and receiving the data, and that the slot is reopened once the client disconnects.
In this scenario, I don't see a problem at all.
Thanks for clarifying.
Comment 7 shieldwed 2021-09-10 13:11:20 UTC
I can confirm this issue with the following details:

Server Version: Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f
Server MPM: event
Server Built: 2021-07-05T07:16:56

...............GGG........................G.......G.............
......G............G.....G...G..............G.G.G............G..
..........G..................G.......G...........G..............
...........G...............G..................G..............G..
.....G...............G........G.G..................GG...........
GGGGGGG.GG.G.GG............G...................G..........G.....
G.................G..............G..G.......GG......G......GGG..
........GG.G....W_WWWWWWWWWWWWWW_WWW_WRWWW_WWWWW..GG...G..GGG.G.
............G............G.G....................________________
................................................................
................................................................
................................................................
................................................................

Eventually, all processes have at least one gracefully terminating connection, and then no more new connections are accepted.

As stated above, I could provoke this situation by reloading Apache.

Configuration includes mod_proxy and mod_proxy_wstunnel and a reverse proxy configuration like this:

RewriteEngine on
RewriteCond %{HTTP:Upgrade} =websocket [NC]
RewriteRule /(.*) ws://127.0.0.1:8888/$1 [P,L]
RewriteRule /(.*) http://127.0.0.1:8888/$1 [P,L]

Having attached the debugger to a process containing one such gracefully stopping connection, I got the following stack trace for the specific thread:

Thread 26 (Thread 0x7f7e02bcd700 (LWP 320536)):
#0  0x00007f7e182895ce in epoll_wait (epfd=28, events=0x7f7e157a55a8, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f7e183a178e in ?? () from target:/lib/x86_64-linux-gnu/libapr-1.so.0
#2  0x00007f7e17ff0e2e in ?? () from target:/usr/lib/apache2/modules/mod_proxy_wstunnel.so
#3  0x00007f7e17c144eb in proxy_run_scheme_handler () from target:/usr/lib/apache2/modules/mod_proxy.so
#4  0x00007f7e17c152ae in ?? () from target:/usr/lib/apache2/modules/mod_proxy.so
#5  0x000055d8625686a8 in ap_run_handler ()
#6  0x000055d862568c56 in ap_invoke_handler ()
#7  0x000055d862581403 in ap_process_async_request ()
#8  0x000055d86257d7c3 in ?? ()
#9  0x000055d8625723e8 in ap_run_process_connection ()
#10 0x00007f7e17c8910e in ?? () from target:/usr/lib/apache2/modules/mod_mpm_event.so
#11 0x00007f7e17c89b17 in ?? () from target:/usr/lib/apache2/modules/mod_mpm_event.so
#12 0x00007f7e18362609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f7e18289293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 8 Marc Stern 2023-02-10 11:23:47 UTC
I have version 2.4.37 (Red Hat default) used as a reverse proxy, and I can systematically reproduce the problem with websockets, although the effects are slightly different (and I have additional info).

After a fresh start, I can perform a few reloads without any critical problems.
After several reloads (depending on the sizing), a further reload leads to an httpd crash.

In /var/log/messages:
systemd[1]: httpd.service: Reload operation timed out. Killing reload process.
systemd[1]: httpd.service: Killing process 28934 (httpd) with signal SIGKILL.
systemd[1]: httpd.service: Failed with result 'exit-code'.

In httpd error log:
[slotmem_shm:error] [pid 8523:tid 140032774932800] (28)No space left on device: AH02611: create: apr_shm_create(/etc/httpd/run/slotmem-shm-p3ecd5270_mediahubvia52_signal_1.shm) failed
[proxy_balancer:emerg] [pid 8523:tid 140032774932800] (28)No space left on device: AH01185: worker slotmem_create failed
[:emerg] [pid 8523:tid 140032774932800] AH00020: Configuration Failed, exiting

The "No space left on device" message means it cannot create the file, it's not a disk space problem; I guess it tries to use a name that is used by a process that should have been closed.
Comment 9 Sebastien R 2023-09-02 20:33:40 UTC
I reproduced the same issue with Apache 2.4.57 (Debian bookworm) and the Nextcloud notify service using websockets. Each graceful restart leaves workers in state G (gracefully finishing) forever.