Bug 63010 - mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory
Summary: mod_proxy_hcheck when health checks configured, esp. vs down service, consume...
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_hcheck (show other bugs)
Version: 2.4.37
Hardware: PC All
: P2 major with 5 votes (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
Depends on:
Blocks: 61499
  Show dependency tree
Reported: 2018-12-13 21:18 UTC by CFindley
Modified: 2020-12-31 11:32 UTC (History)
1 user (show)

error.log output covering 2-tenths of a second/5k lines (847.00 KB, text/plain)
2018-12-13 21:22 UTC, CFindley
Avoid concurrent checks (2.87 KB, patch)
2020-12-16 14:06 UTC, Yann Ylavic
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description CFindley 2018-12-13 21:18:29 UTC
Defect related to mod_proxy_hcheck

Found/repro's in Apache httpd version 2.4.33, 2.4.35, and 2.4.37 (mod_proxy_hcheck module) on Windows.

We added the hcheck module and then noticed that httpd started consuming large amounts of memory. In previous versions before configuring the module, httpd would be stable at about 35MB of ram. With this module, the memory would grow to 3GB and stabilize. One repro actually had it grow to 17GB and the machine ran out of pagefile space and crashed. We noticed that in the default configuration we were checking a service that was not running. After configuring this service to run, the memory used dropped significantly. Instead of 3GB, it grew to 150MB and stabilized.

Even with the service running, there is a period of time shortly after startup during which the healthcheck runs hundreds of times for each service each second. During this time the memory consumption steadily grows. It does this for about 5 seconds, generating thousands of lines of (trace) level logging and running the health check over a thousand times for each service. Then for some reason it stabilizes, and checks are seen at a more normal 30 second interval. But the memory is not released and the process still has an elevated amount of memory. This might be due to the healthcheck module just requiring more memory than running without it. But it is concerning as it seems like increased memory consumption is related to checking a service which is down. If a service was down, you have one problem, but if the healthcheck itself starts consuming tons of resources on top of that, you compound the problem.

278 times in one second:
Condition ok1234 for 6af324730 (http://localhost:8843): passed

This sequence repeated (this is logging from one single thread checking one endpoint) (please see attached file for a full one-second interval of logging, 15k+ lines):
[Thu Dec 13 17:58:40.618694 2018] [proxy_hcheck:debug] [pid 21288:tid 748] mod_proxy_hcheck.c(829): AH03256: Threaded Health checking http://localhost:8843
[Thu Dec 13 17:58:40.618694 2018] [proxy:debug] [pid 21288:tid 748] proxy_util.c(2313): AH00942: HCOH: has acquired connection for (localhost)
[Thu Dec 13 17:58:40.619694 2018] [proxy:trace2] [pid 21288:tid 748] proxy_util.c(3010): HCOH: fam 2 socket created to connect to localhost
[Thu Dec 13 17:58:40.619694 2018] [proxy:debug] [pid 21288:tid 748] proxy_util.c(3042): AH02824: HCOH: connection established with (localhost)
[Thu Dec 13 17:58:40.620696 2018] [proxy:debug] [pid 21288:tid 748] proxy_util.c(3212): AH00962: HCOH: connection complete to (localhost)
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:debug] [pid 21288:tid 748] mod_proxy_hcheck.c(644): AH03254: HTTP/1.1 200 OK
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:trace2] [pid 21288:tid 748] mod_proxy_hcheck.c(797): Condition ok1234 for 6af324730 (http://localhost:8843): passed
[Thu Dec 13 17:58:40.649654 2018] [proxy:debug] [pid 21288:tid 748] proxy_util.c(2328): AH00943: HCOH: has released connection for (localhost)
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:debug] [pid 21288:tid 748] mod_proxy_hcheck.c(573): AH03251: Health check GET Status (0) for 6af324730.

excerpts from httpd.conf

# This defines a template for use by proxy_hcheck_module.
# Some other specific parameters are set directly on the BalancerMember lines.
ProxyHCExpr ok1234 {%{REQUEST_STATUS} =~ /^[1234]/}
ProxyHCTemplate hcsimple hcmethod=GET hcexpr=ok1234 hcuri=/favicon.ico
<Proxy balancer://samlservice-cluster>
BalancerMember http://localhost:8843 route=1 hcinterval=30 hcfails=2 hcpasses=1 hctemplate=hcsimple
LogLevel debug
LogLevel proxy_hcheck:TRACE2
LogLevel proxy:TRACE2
LogLevel watchdog:TRACE2

Have fix for this issue in place

This issue has similarities - note comments about consuming 20GB heap space:

Also noticed this issue on Reddit
Comment 1 CFindley 2018-12-13 21:22:34 UTC
Created attachment 36324 [details]
error.log output covering 2-tenths of a second/5k lines
Comment 2 CFindley 2018-12-14 01:05:16 UTC
Note: we have a very lengthy and aged httpd.conf file which has cruft from years of fixing defects and adding new options, so it is entirely possible this is some artifact of that. I did try setting up a very simple configuration which just health checks a single non-existent service, and it seems to be well behaved. I can't share the config file as it contains sensitive (and embarrassing) information unfortunately. :)
Comment 3 pawel.c 2019-04-03 22:49:29 UTC
Replicable also on 39, Windows Server 2012 

It is about health check memory leaks depends on frequency.

To reproduce...
(1) You have to have more than one high freq health check like...
ProxyHCExpr wsOK {%{REQUEST_STATUS} =~ /405|200/}
ProxyHCTemplate X1 hcexpr=wsOK hcmethod=GET hcuri=WHATEVER_WORKS_FOR_YOU_1 hcinterval=5 hcpasses=2 hcfails=1
<Proxy balancer://clusterFxAgregation>
BalancerMember route= retry=0 hctemplate=X1
BalancerMember status=+H route= retry=0 hctemplate=X1
BalancerMember status=+H route= retry=0 hctemplate=X1

ProxyHCTemplate X2 hcexpr=wsOK hcmethod=GET hcuri=WHATEVER_WORKS_FOR_YOU_2 hcinterval=5 hcpasses=2 hcfails=1
<Proxy balancer://clusterFxAgregation>
BalancerMember route= retry=0 hctemplate=X2
BalancerMember status=+H route= retry=0 hctemplate=X2
BalancerMember status=+H route= retry=0 hctemplate=X2

here those are defined as two virtual hosts in the same conf

(2) Some nodes must be down. For instance in example above

(3) Observe memory. Leak spead depends on number of defined checks and their freq. With configuration like above and around 20 virtual hosts it should be like 10MB/min
Comment 4 WJCarpenter 2020-10-15 23:06:07 UTC
I've spent quite a bit of time over the last few days trying to understand what goes on here. I have reproduced the memory growth on both Windows and Linux with Apache httpd 2.4.46. It doesn't depend on health check failures, though they can affect the timing to make it more likely. (In my testing, I dummied out the actual check in hc_check() and just put in a fixed delay to avoid vagaries of network stuff.)

My conclusion is that it's strongly related to bug #62318 (https://bz.apache.org/bugzilla/show_bug.cgi?id=62318). The fix for that didn't account for an important edge case. When should we mark a worker health check as "updated", where "updated" means the last time it was checked? Originally, the worker was marked as "updated" when the check completed, though the timestamp used was when the check started. Any watchdog firings while the check was still running would trigger additional checks to run for that worker since the trigger condition was checking the "updated" timestamp that continued to have an old value until that check completed.

The previous fix used the same timestamp for "updated", but it changed when the timestamp was applied. It applied the timestamp at the beginning of the check. That (mostly) prevents additional checks from being triggered while the first one is running.

The edge case that isn't accounted for is when all of the threads in the pool used for health checks are busy. In that case, the "updated" timestamp is not applied until the health check task is popped from the queue and given to a thread to run. Let's look at an example (which I actually used experimentally). Suppose it takes 5 seconds to do a health check. That's a long time, but not preposterous. The watchdog callback runs every 100ms, and the default thread pool is 16 threads. With just a single worker to check, 49 additional checks will be triggered before the first check finishes. Each will carry a timestamp just 100ms later than the one just before it. The first 16 checks will be given to threads for immediate execution, with the final one updating "updated" 1.5 seconds later than the first one. The other 34 or so will be queued until threads free up to run them. Each of those queued health checks carries along with it a timestamp to be applied to "updated", and the timestamp reflects the time that the health check task was enqueued. There's just one "updated" field for the worker, regardless of how many health check tasks there are for that worker. The "updated" timestamp will always reflect the watchdog trigger time of the health check task that started most recently. In our example, the "updated" timestamps eventually be marked with a time about 5 seconds later than when the first check started.

With the default health check interval ("hcinterval") of 60 seconds, things settle down until the 65 second point or so. Then the cycle repeats. There is wasted effort of redundant health checks (which might foil expectations for "hcfails" and "hcpasses"), and the "hcinterval" is effectively changed from 60 seconds to about 65 seconds. Those things are undesirable but not a tragedy.

Having just a single worker is unlikely. For every worker of some kind getting health checks, the above scenario plays out. It's easy to get hundreds of health check tasks queued up waiting for threads to process them. Still, the net effect would be merely stretching the "hcinterval" time for each worker by the time it takes to do a check. It's 5 seconds in our example, but it would be way less for a healthy system. If the system is not healthy, the time it takes to do health check can be faster or slower. If we try to connect to a port with no listener process, then we will very quickly get a "connection refused". If an entire server is down, we won't get any response at all and will have to wait for a self-imposed socket timeout (from the "connectiontimeout" parameter, default 60s). Or, the process on the server might be in a weird state so that the health check takes a long time to complete without causing a socket timeout (from the "timeout" parameter, default 60s). If the time to complete the health check exceeds the "hcinterval" for the worker, we will never catch up. A growing number of health check tasks will be queued for that worker. Each queued health check task is holding a chunk of heap memory, and that is what accounts for the memory growth. (As one anecdotal data point, on Windows my httpd normally runs at 15-16 MB process size. With my pathological test case, it's easy for me to see it grow to a couple of GB over a couple hours or so with many tens of thousands of trapped health check tasks in the queue.)

So, what to do to workaround or fix this edge case?

* Workaround 1: Have at least as many threads in the thread pool (ProxyHCTPSize) as you have distinct workers getting health checks. That way, health check tasks will not usually get queued (though it still could happen occasionally, so maybe allocate some extra threads). The threads will be idle most of the time.

* Workaround 2: Make your health check intervals ("hcinterval") long enough to avoid the pathlogical case. That is, health checks will always complete, pass or fail, before the next check is due. Of course, you lose the benefit of more frequent health checks.

* Workaround 3: Don't use the thread pool for health checks, which you can do by setting ProxyHCTPSize=0. That will make all health checks synchronous, which is generally pretty undesirable.

* Fix 1: Apply the timestamp to "updated" before sending the health check task to the pool. In mod_proxy_hcheck.c, that's in the function hc_watchdog_callback() before the call to apr_thread_pool_push(). That will keep most additional health check tasks from being created for the same worker, though that can still happen if a health check takes longer than the health check interval.

* Fix 2: Put in an explcit check that a worker has a health check in flight, and don't create any new checks as long as that's true. This needs only a single bit flag in the worker data blob that already exists. Set that flag on the worker before you call apr_thread_pool_push() and clear the flag at the completion of hc_check(). I don't see the value in having simultaneious in-flight health checks for the same worker, but maybe someone else does.

If I were deciding, I would do both Fix 1 and Fix 2, though you could argue that Fix 2 makes Fix 1 unnecessary (depending on how you feel about the meaning of "hcinterval", as mentioned in the discussion of bug #62318). I have prototyped both fixes to verify that it resolves the problem (or certainly makes it much harder to trigger).
Comment 5 Jim Jagielski 2020-10-16 13:49:53 UTC
Thanks for the incredible bug report and the excellent investigation. I agree that Fixes 1+2 make the most sense. I'll look into adding them asap.
Comment 6 Tudor Tudorache 2020-12-15 10:58:21 UTC
We are running into the same issue with our current configuration. Apache develops a serious memory leak hitting over 1.2 GB if one of the JVM's are started or stopped. We were on Apache 2.4.26 previously with the exact same configuration, and never ran into this. 
We are still waiting for a proper fix, any estimate on that?
Thank you,

I can send out httpd configuration with what balancer HC's we run.
Comment 7 Yann Ylavic 2020-12-16 14:06:38 UTC
Created attachment 37621 [details]
Avoid concurrent checks

This patch resets the timestamp during the check (and prevents re-entering the check if the timestamp is zero), and then updates the timestamp after the check to schedule the next one.

That's fix 1 & 2 from comment 4.

How does it work for you?
Comment 8 Cristian Duca 2020-12-17 12:23:15 UTC
Hello Yann,

I work with Tudor.
We already implemented that fix and it works as expected -> apache is stable and we are not encountering any memory leaks issues.
We have configured our hcinterval at 10 seconds (due to a LoadBalancer Ping Test), hcpasses=2 and increased the ProxyHCTPSize to 256 threads.
We wanted to know if/when this issue will be fixed coding-wise and not mitigated through configuration. 

Thank you,
Cristian Duca.
Comment 9 Yann Ylavic 2020-12-17 15:12:05 UTC
Hi Cristian,

(In reply to Cristian Duca from comment #8)
> I work with Tudor.
> We already implemented that fix and it works as expected -> apache is stable
> and we are not encountering any memory leaks issues.

I'm not sure to understand, did you test the fix/patch from comment 7 already?

> We have configured our hcinterval at 10 seconds (due to a LoadBalancer Ping
> Test), hcpasses=2 and increased the ProxyHCTPSize to 256 threads.
> We wanted to know if/when this issue will be fixed coding-wise and not
> mitigated through configuration. 

If the fix you tested is this mitigation (by configuration), it would help / speed up the backport to 2.4 if you could test the patch (in the code) proposed in comment 7.
If it works with no particular mitigation in the configuration, then we can consider that it addresses the issue and do the backport to 2.4 (next release).
Comment 10 Cristian Duca 2020-12-29 08:02:37 UTC
Hello, Yann,

I have tested the workaround by adding more ProxyHCTPSize.

Now I am trying to compile Apahce on Windows and to test the coding changes from comment 7 without any mitigations added into the Apache configuration -> at the moment I am not able to compile Apache on Windows -> Do you a wiki link on how to compile the latest apache-trunk source files by using VS2017?

Thank you,
Cristian Duca.
Comment 11 Cristian Duca 2020-12-31 11:32:26 UTC
Hello, Yann,

I have compiled successfully apache on windows.
I have implemented the changes from Comment7 for the mod_proxy_hcheck module.

I can confirm that the fix from Comment7 works without any workarounds.

Can we expect to have this fix implemented in the next apache release build ?

Thank you,
Cristian Duca.