Bug 62318 - healthcheck executed more often than configured in hcinterval
Summary: healthcheck executed more often than configured in hcinterval
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_hcheck
Version: 2.4.33
Hardware: PC Linux
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-20 12:18 UTC by andreas.herold
Modified: 2018-09-18 21:46 UTC
1 user



Description andreas.herold 2018-04-20 12:18:56 UTC
I'm using the following configuration, expecting that one health check will be executed every 10 seconds on each balancer member.

#---------------

# make sure not to act as a forward proxy
ProxyRequests Off

# do not rewrite the Host header when forwarding requests
ProxyPreserveHost on

# define proxy with balancer-members and settings
<Proxy balancer://omscluster>
    BalancerMember http://10.0.29.90:8180 max=5 timeout=70 retry=40 acquire=30000 route=jdevoms36.rnd.j.intershop.de_frontend hcmethod=GET hcinterval=10 hcuri=/monitoring/services/health/status
    BalancerMember http://10.0.29.96:8180 max=5 timeout=70 retry=40 acquire=30000 route=jdevoms37.rnd.j.intershop.de_frontend hcmethod=GET hcinterval=10 hcuri=/monitoring/services/health/status
    ProxySet lbmethod=bybusyness
</Proxy>

# enable the balancer-manager; exclude it from proxying
<Location /balancer-manager>
  SetHandler balancer-manager
  ProxyPass !
  AuthType Basic
  AuthUserFile "/opt/httpd/conf/htpasswd"
  AuthName "admin"
  require valid-user
</Location>

# proxy everything else
ProxyPass / balancer://omscluster/ stickysession=JSESSIONID
ProxyPassReverse / balancer://omscluster/ stickysession=JSESSIONID

#------------------------

I have enabled tracing; the resulting log looks like this.

[root@jdevoms29 httpd]# tail -f logs/error_log.2018-04-20_11_CET |grep 'mod_watchdog'
[Fri Apr 20 13:44:10.001381 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(454): AH010033: Watchdog: Running with WatchdogInterval 1000ms
[Fri Apr 20 13:44:10.001393 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(462): AH02974: Watchdog: found parent providers.
[Fri Apr 20 13:44:10.001407 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(508): AH02977: Watchdog: found child providers.
[Fri Apr 20 13:44:10.001415 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(516): AH02978: Watchdog: Looking for child (_singleton_).
[Fri Apr 20 13:44:10.001425 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(516): AH02978: Watchdog: Looking for child (_default_).
[Fri Apr 20 13:44:10.001435 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(516): AH02978: Watchdog: Looking for child (_proxy_hcheck_).
[Fri Apr 20 13:44:10.001473 2018] [watchdog:debug] [pid 11386:tid 139841708402496] mod_watchdog.c(541): AH02979: Watchdog: Created child worker thread (_proxy_hcheck_).
[Fri Apr 20 13:44:10.016816 2018] [watchdog:debug] [pid 11389:tid 139841598105344] mod_watchdog.c(158): AH02972: Singleton Watchdog (_proxy_hcheck_) running
[Fri Apr 20 13:44:10.017861 2018] [watchdog:debug] [pid 11389:tid 139841708402496] mod_watchdog.c(588): AH02981: Watchdog: Created worker thread (_proxy_hcheck_).
[Fri Apr 20 13:44:10.018537 2018] [watchdog:debug] [pid 11390:tid 139841708402496] mod_watchdog.c(588): AH02981: Watchdog: Created worker thread (_proxy_hcheck_).
[Fri Apr 20 13:44:10.019052 2018] [watchdog:debug] [pid 11391:tid 139841708402496] mod_watchdog.c(588): AH02981: Watchdog: Created worker thread (_proxy_hcheck_).
[Fri Apr 20 13:44:15.014163 2018] [watchdog:debug] [pid 11493:tid 139841708402496] mod_watchdog.c(588): AH02981: Watchdog: Created worker thread (_proxy_hcheck_).
^C
[root@jdevoms29 httpd]# tail -f logs/error_log.2018-04-20_11_CET |grep 'AH03256: Threaded Health checking'
[Fri Apr 20 13:45:30.208536 2018] [proxy_hcheck:debug] [pid 11389:tid 139841484338944] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:45:30.208813 2018] [proxy_hcheck:debug] [pid 11389:tid 139841475946240] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:45:30.308718 2018] [proxy_hcheck:debug] [pid 11389:tid 139841492731648] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:45:40.237689 2018] [proxy_hcheck:debug] [pid 11389:tid 139841450768128] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:45:40.237948 2018] [proxy_hcheck:debug] [pid 11389:tid 139841392019200] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:45:50.269505 2018] [proxy_hcheck:debug] [pid 11389:tid 139841517909760] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:45:50.269814 2018] [proxy_hcheck:debug] [pid 11389:tid 139841459160832] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:45:50.369765 2018] [proxy_hcheck:debug] [pid 11389:tid 139841425590016] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:00.330515 2018] [proxy_hcheck:debug] [pid 11389:tid 139841408804608] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:00.330745 2018] [proxy_hcheck:debug] [pid 11389:tid 139841442375424] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:00.430916 2018] [proxy_hcheck:debug] [pid 11389:tid 139841417197312] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:00.531037 2018] [proxy_hcheck:debug] [pid 11389:tid 139841467553536] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:10.368412 2018] [proxy_hcheck:debug] [pid 11389:tid 139841433982720] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:10.368601 2018] [proxy_hcheck:debug] [pid 11389:tid 139841501124352] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:20.416752 2018] [proxy_hcheck:debug] [pid 11389:tid 139841509517056] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:20.416934 2018] [proxy_hcheck:debug] [pid 11389:tid 139841400411904] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:20.516910 2018] [proxy_hcheck:debug] [pid 11389:tid 139841484338944] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:20.617105 2018] [proxy_hcheck:debug] [pid 11389:tid 139841492731648] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:20.717236 2018] [proxy_hcheck:debug] [pid 11389:tid 139841475946240] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:30.462732 2018] [proxy_hcheck:debug] [pid 11389:tid 139841450768128] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:30.463110 2018] [proxy_hcheck:debug] [pid 11389:tid 139841392019200] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:30.562857 2018] [proxy_hcheck:debug] [pid 11389:tid 139841459160832] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:40.499302 2018] [proxy_hcheck:debug] [pid 11389:tid 139841425590016] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:40.499494 2018] [proxy_hcheck:debug] [pid 11389:tid 139841517909760] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:50.530944 2018] [proxy_hcheck:debug] [pid 11389:tid 139841408804608] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:50.531195 2018] [proxy_hcheck:debug] [pid 11389:tid 139841417197312] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.96:8180
[Fri Apr 20 13:46:50.631150 2018] [proxy_hcheck:debug] [pid 11389:tid 139841467553536] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:50.731319 2018] [proxy_hcheck:debug] [pid 11389:tid 139841442375424] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:50.831512 2018] [proxy_hcheck:debug] [pid 11389:tid 139841433982720] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:50.931690 2018] [proxy_hcheck:debug] [pid 11389:tid 139841501124352] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180
[Fri Apr 20 13:46:51.031958 2018] [proxy_hcheck:debug] [pid 11389:tid 139841400411904] mod_proxy_hcheck.c(822): AH03256: Threaded Health checking http://10.0.29.90:8180

The tracing above shows only a few more health checks than configured. But the identical configuration produced far more health checks earlier; unfortunately, for that incident I only have the access logs of the balancer member.

10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 48 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 967 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1229 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 792 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 41 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1104 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 81 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 833 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1062 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 43 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 54 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 53 application/json
10.0.29.32 - - [16/Mar/2018:17:02:47 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1076 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 639 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 386 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1638 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 191 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 83 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1121 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 597 application/json
10.0.29.32 - - [16/Mar/2018:17:02:48 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 373 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 386 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 2119 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 2144 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1781 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 2085 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 639 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1125 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1526 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 801 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 94 application/json
10.0.29.32 - - [16/Mar/2018:17:02:49 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 700 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 681 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 64 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 953 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 988 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 962 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1181 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1282 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1172 application/json
10.0.29.32 - - [16/Mar/2018:17:02:50 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 503 635 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1204 application/json
10.0.29.32 - - [16/Mar/2018:17:02:51 +0100] 'GET /monitoring/services/health/status HTTP/1.0' 200 518 '-' '-' '10.0.29.90:8180' '-' 10.0.29.90 8180 HTTP/1.0 1120 application/json
Comment 1 WJCarpenter 2018-08-23 19:38:46 UTC
I recently hit this problem, and I think I understand why it happens. I did a simple proof-of-concept change to validate my theory. Although I was also working with 2.4.33, I don't see any recent changes to suggest it's a new bug. BTW, I saw this on a Windows 10 Enterprise machine.

The health checks are done via callbacks from the watchdog timer module. The callback is scheduled to run every AP_WD_TM_SLICE interval. On my server, that's 100 milliseconds (I didn't investigate whether it's 100ms for everyone). When triggered, the callback iterates over all of the configured health checks to see which need attention. And, for any that are "due", the health check is performed.

In mod_proxy_hcheck.c, in function hc_watchdog_callback(), near line 928 (in 2.4.33), there is this "if" condition:

if (!PROXY_WORKER_IS(worker, PROXY_WORKER_STOPPED) &&
   (worker->s->method != NONE) &&
   (now > worker->s->updated + worker->s->interval)) {

The last part is intended to mean "if the HC interval has passed since the last time we checked, then do a health check".

If it is "due", according to that "if" condition, the actual health check is kicked off in function hc_check(), near line 811 (in 2.4.33). It makes the TCP or HTTP health check call, evaluates the results, and updates the "updated" timestamp.

====================

That's what the code does. Now, here is a description of the problem as I understand it:

1. It's kind of icky to be polling every 100 ms compared to what would be a normal health check interval [default is 60 s]. But I don't know if there is a wiser way to do it in httpd.

2. The "if" condition and surrounding logic don't keep track of whether there is already a health check in progress for a given target. I suppose there could be a reason to allow multiple requests (hung connections?), but in any case it at least contributes to this problem.

3. The "updated" timestamp is updated after the health check is completed, but it is updated with a timestamp from when the health check began. In other words, if a health check call takes 1000 ms, the "updated" timestamp will eventually be set to a time 1000 ms in the past.

The points just mentioned conspire to keep the health check "due" on every 100 ms watchdog callback if the individual health checks take longer than 100 ms to complete. Suppose, for example, each health check for a given target takes 1000 ms. Then there will be 9 or 10 in-progress health checks for that target. As each finishes, it will mark the "updated" timestamp to 1000 ms in the past. The situation will never resolve until the elapsed time for the health checks drops below 100 ms. That's made less likely by the additional health checks contributing to the load.

====================

To test the theory, I locally modified the hc_check() function to modify the "updated" timestamp with the time the health check finishes instead of the time it started. That produced the desired behavior. I have not examined that solution for unexpected consequences.
Comment 2 Jim Jagielski 2018-08-24 11:52:13 UTC
The issue appears to be mainly one of interpretation of what an interval is. As seen in the code and the documentation, the value of the interval determines when a health check *is performed*. It does not define the time *between* health checks, which is a separate timing entirely.

It also assumes that the sysadmin matches the interval to the time taken to perform said health check and avoids such nonsense as kicking off a check every 100ms when the health check itself takes 250ms.

Personally, I don't consider this a bug but rather expected behavior, working as designed. Admittedly, the docs could make this clearer and should be updated.
Comment 3 Eric Covener 2018-08-24 16:09:31 UTC
(In reply to Jim Jagielski from comment #2)
> The issue appears to be one of, mainly, interpretation of what an interval
> is. As seen in the code and the documentation, the value of the interval
> determines when a health check *is performed*. It does not define the time
> *between* health checks, which is a separate timing entirely.
> 
> It is also assumes that the sysadmin matches the interval time with the time
> taken to perform said health check and avoids such nonsense as kicking off a
> check every 100ms if the health check itself takes 250ms.

I don't think increasing the interval helps with a slow health check, since we check every AP_WD_TM_SLICE to see if a health check has completed in the desired interval.

If my health check takes 1 second and the interval is 10 seconds, we will still kick off 10 health checks in rapid succession just after 10 seconds has elapsed.
Comment 4 WJCarpenter 2018-08-24 16:42:44 UTC
Jim Jagielski, I don't understand what you are trying to say. No amount of changing the configured interval for the health checks will alter the fact that the watchdog timer does the callback every 100ms. In fact, the default for hcinterval is 60s, which suggests what the module author had in mind for the order of magnitude of that parameter.
Comment 5 WJCarpenter 2018-08-24 16:45:21 UTC
(sorry) I meant "the watchdog timer will consider doing the callback every 100ms"
Comment 6 Jim Jagielski 2018-08-29 12:06:11 UTC
According to my testing, the below patch seems to fix this. Can you test as well?

    http://svn.apache.org/viewvc?rev=1838937&view=rev
Comment 7 WJCarpenter 2018-08-29 16:29:38 UTC
The change is one of the things I tried, and it does solve the problem of having multiple outstanding health checks for the same target. 

One of the things I was not sure about (and am still not sure about) is whether it's better to set "updated" to the time of the beginning of the health check or the time of the end of the health check. It's a design decision that could go either way, depending on what you think "interval" should mean, as you commented earlier.

I previously gave this kind of change a quick sanity test. I should be able to do more thorough testing next week. I'll report back with the results either way.
Comment 8 Jim Jagielski 2018-09-11 15:12:52 UTC
Thx. Awaiting your check. I can confirm that the patch passes my tests, which recreated the original problem.
Comment 9 WJCarpenter 2018-09-12 16:40:13 UTC
I confirm that the patch fixed the problem in my test environment. NOTE: My environment is 2.4.33 to which I manually made the change from the patch.

Thanks for the fix.
Comment 10 Graham Leggett 2018-09-18 21:46:31 UTC
Backported to v2.4.36.