Bug 63975

Summary: Some processes never terminate after graceful restart
Product: Apache httpd-2 Reporter: jaroslav
Component: mpm_workerAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: normal    
Priority: P2    
Version: 2.4.38   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: GDB output

Description jaroslav 2019-11-28 14:27:38 UTC

we have encountered a problem which manifests itself mostly on busy webservers. The system accumulates inactive (sleeping) Apache processes over time and eventually becomes unable to serve requests as no more workers can be spawned by the main Apache process. We are not seeing this behaviour on older servers, so I would guess this was introduced between Apache versions 2.4.25 (as found in Debian Stretch) and 2.4.38 (Debian Buster)

The specific error appearing in the log file when the server fails is "2019-11-27_20:06:09 (11)Resource temporarily unavailable: AH03142: apr_thread_create: unable to create worker thread". There is no mention about which specific resource was exhausted, my guess would be some process/thread-related limit, because at that point the server usually runs hundreds of Apache processes and thousands or tens of thousands threads (despite ServerLimit, which is set to 64.)

Similar behaviour can be observed with event MPM - it seems to honour ServerLimit, but most servers are in "Stopping: yes (old gen)" state and eventually the error log get flooded with "AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit." and no requests are served.

The only other suspicious or unusual error we found in the log is "2019-11-19_18:16:37 AH00291: long lost child came home!", I am not sure if it is relevant though

I tried to analyze what the inactive process is doing. According to ls -l on its /proc/pid/fd directory is has one socket opened:

lrwx------ 1 root root 64 Nov 28 11:33 24 -> 'socket:[350469138]'

There is no network connection associated with that however: lsof -n | grep 350469138

apache2   46425                                      www-data   24u     sock                0,9       0t0  350469138 protocol: TCP
apache2   46425 46448 apache2                        www-data   24u     sock                0,9       0t0  350469138 protocol: TCP
apache2   46425 46453 apache2                        www-data   24u     sock                0,9       0t0  350469138 protocol: TCP

(lsof would show source and destination IP/port for opened connection)

According to their /proc/pid/syscall, the process and threads are mostly waiting in futex(), sometimes in read() or in epoll_wait(). According to strace there is no activity in the process nor its threads.

Processes stuck in this state are not responding to any attempts to terminate them via TERM or INT signal, they need to be KILLed. However, they are not completely stuck, they terminate when the Apache main process is terminated (which solves the issue but introduces disruption of service, albeit a short one.)

Would it be possible - if the bug proves impossible to isolate - to hard-cap the time a process takes during its graceful termination? This process I analyzed was over 12 hours old and I suspect that no client was waiting for any data from it for most of that time.

Thanks for looking into this
Comment 1 Eric Covener 2019-11-28 14:32:52 UTC
> There is no mention about which specific resource was exhausted, my guess would be some process/thread-related limit

yes, the limits are discussed in the pthread_create() manpage.

You should get backtraces of these processes using gdb. 
Comment 2 jaroslav 2019-11-28 15:38:27 UTC
Created attachment 36904 [details]
GDB output

The most interesting (to me) seems to be thread 77, which is waiting in the read() call. /proc/pid/syscall/ output for that thread is:

0 0xf 0x7fc5a27cc75f 0x1 0x0 0x0 0x0 0x7fc5a27cc6e0 0x7fc5cfa48544

If I understand that correctly, it is a read from FD 15, which is (from /proc/pid/fd/15) pipe:[70519261]

According to lsof, this pipe is exclusive to the parent process 8904 and its threads and is not used by any other process in the system
Comment 3 jaroslav 2019-11-28 15:41:18 UTC
Changing status back to NEW after providing the GDB output - if more information is needed, let me know
Comment 4 Ruediger Pluem 2019-11-29 08:53:20 UTC
Looks like there is a request handled by mod_fcgid that never finishes due to a hanging FCGI process.
Comment 5 jaroslav 2019-11-29 14:59:04 UTC

could you please elaborate what hanging FCGI process means in this context?

The server uses mod_fcgid combined with suexec to launch php-cgi processes. It is quite possible for PHP process to hang - or more precisely to take too much time to process the request - but in that case mod_fcgid terminates it forcibly (SIGTERM, then KILL.) Also all FCGI processes are terminated on graceful restart and we even have a protective cron that hunts down orphaned PHP processes left in the system when Apache crashes.

In short - there are no FCGI processes older than an hour in the system so the Apache process (now over 36 hours old and inactive for most of that time) has nothing to wait for - all connections related to communications between Apache and FCGI should read EOF

Although, investigating this I noticed one more thing. The thread stuck reading from FD 15 is reading from "pipe:[70519261]" - ls -l /proc/pid/fd/ shows

lr-x------ 1 root root 64 Nov 29 13:49 15 -> 'pipe:[70519261]'

However, there is also this:

l-wx------ 1 root root 64 Nov 29 13:49 16 -> 'pipe:[70519261]'

Apparently, the inactive Apache process is reading from the pipe, but also has it opened for writing so it will never return an EOF. So reason for why the Apache process is stuck is because at some point some file descriptor (here FD 16) leaked and was not closed when it should have been?