Bug 66302 - Passing health check does not recover worker from its error state
Summary: Passing health check does not recover worker from its error state
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_hcheck
Version: 2.4-HEAD
Hardware: PC Linux
Importance: P2 regression
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
Keywords: PatchAvailable
Depends on:
Reported: 2022-10-11 09:22 UTC by Alessandro Cavaliere
Modified: 2023-01-27 13:40 UTC

mod_proxy_hcheck: recover worker from error state (1.05 KB, patch)
2022-10-11 09:22 UTC, Alessandro Cavaliere

Description Alessandro Cavaliere 2022-10-11 09:22:33 UTC
Created attachment 38407 [details]
mod_proxy_hcheck: recover worker from error state

While enabling mod_proxy_hcheck on some of our apache2 nodes we encountered unusual behavior: sometimes, after rebooting a backend, its worker status remains marked as "Init Err" in the balancer manager until another request is made to the backend, no matter how many health checks complete successfully.

The following list shows the sequence of events leading to the problem:

1. Watchdog triggers health check, request is successful; worker status is "Init Ok"
2. HTTP request to apache2 with unreachable backend (rebooting); status becomes "Init Err"
3. Watchdog triggers another health check, request is again successful because the backend recovered; worker status remains "Init Err"
4. same as 3
5. same as 4

The only way for the worker status to recover is to wait for "hcfails" unsuccessful health checks followed by "hcpasses" successful ones, or to wait for legitimate traffic to retry the failed worker, which may not happen for a long time for rarely used applications.

This was surprising to us since we were expecting the worker status to be recovered after "hcpasses" successful health checks; however, this doesn't seem to happen when the error status is triggered by ordinary traffic to the backend (i.e. not by health checks).

We believe this behavior was accidentally introduced in r1725523. The patch we are proposing seems to fix the problem in our environment.
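For context, the health checking in question is enabled via worker parameters on the balancer members. A minimal sketch of the kind of configuration involved follows; the balancer name, backend address, and parameter values here are hypothetical, not taken from our actual nodes:

```apache
# Health-check expression and worker parameters (mod_proxy_hcheck).
# hcinterval: seconds between checks; hcpasses: successful checks
# needed to mark a worker usable; hcfails: failed checks needed to
# mark it in error.
<Proxy "balancer://app">
    BalancerMember "http://10.0.0.1:8080" hcmethod=GET hcinterval=5 hcpasses=2 hcfails=3
</Proxy>
ProxyPass "/app" "balancer://app"
```

With a setup like this, the bug manifests when the worker enters the error state through ordinary proxied traffic rather than through hcfails failed checks: subsequent passing health checks never clear the error flag.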
Comment 1 Jim Jagielski 2022-10-11 12:32:45 UTC
Thanks for the report.

In general, health check errors are considered different from "normal" errors, and I can see why the behavior described above is both confusing and could be considered "wrong".

The patch looks like a reasonable approach.
Comment 2 Christophe JAILLET 2023-01-27 13:40:15 UTC
Fixed in trunk in r1904518.
Backported to 2.4.x in r1906496.

This is part of 2.4.55