Bug 63273

Summary:	Proxy error does not re-enable the worker even when the retry property is set.
Product:	Apache httpd-2	Reporter:	Bhushan Jade <bhushan.jade>
Component:	mod_proxy	Assignee:	Apache HTTPD Bugs Mailing List <bugs>
Status:	NEW ---
Severity:	major	CC:	rageratwork
Priority:	P2
Version:	2.4.18
Target Milestone:	---
Hardware:	Other
OS:	Linux

Description Bhushan Jade 2019-03-20 14:58:06 UTC

We have set up a reverse proxy on our frontend servers, which connect to the backend servers through a load balancer.

Apache reverse proxy configuration:
ProxyPass http://<load-balancer-DNS>/context/ retry=0 timeout=30

When we run a load test, the worker for the backend connection fails and never reconnects to that backend, even though `retry=0` is specified.

Apache error log:

[Wed Mar 20 12:23:26.292232 2019] [proxy:trace2] [pid 36176:tid 140617447184128] proxy_util.c(2765): HTTP: fam 2 socket created to connect to <LOAD-BALANACER-DNS>
[Wed Mar 20 12:23:56.322423 2019] [proxy:error] [pid 36176:tid 140617447184128] (70007)The timeout specified has expired: AH00957: HTTP: attempt to connect to 10.19.134.64:80 (<LOAD-BALANACER-DNS>) failed
[Wed Mar 20 12:23:56.322496 2019] [proxy:error] [pid 36176:tid 140617447184128] AH00959: ap_proxy_connect_backend disabling worker for (<LOAD-BALANACER-DNS>) for 0s
[Wed Mar 20 12:23:56.322509 2019] [proxy:debug] [pid 36176:tid 140617447184128] proxy_util.c(2175): AH00943: HTTP: has released connection for (<LOAD-BALANACER-DNS>)
[Wed Mar 20 12:25:57.590705 2019] [proxy:trace2] [pid 36178:tid 140617304508160] proxy_util.c(1966): [client 14.142.125.100:42564] http: found worker http://<LOAD-BALANACER-DNS>/<API-CONTEXT>/ for http://<LOAD-BALANACER-DNS>/<API-ENDPOINT>, referer: https://<APPLICATION-ENDPOINT>/
[Wed Mar 20 12:25:57.590740 2019] [proxy:debug] [pid 36178:tid 140617304508160] mod_proxy.c(1160): [client 14.142.125.100:42564] AH01143: Running scheme http handler (attempt 0), referer: https://<APPLICATION-ENDPOINT>/
[Wed Mar 20 12:25:57.590748 2019] [proxy:debug] [pid 36178:tid 140617304508160] proxy_util.c(1904): AH00932: HTTP: worker for (<LOAD-BALANACER-DNS>) has been marked for retry
[Wed Mar 20 12:25:57.590767 2019] [proxy:debug] [pid 36178:tid 140617304508160] proxy_util.c(2160): AH00942: HTTP: has acquired connection for (<LOAD-BALANACER-DNS>)
[Wed Mar 20 12:25:57.590772 2019] [proxy:debug] [pid 36178:tid 140617304508160] proxy_util.c(2213): [client 14.142.125.100:42564] AH00944: connecting http://<LOAD-BALANACER-DNS>/<API-ENDPOINT> to <LOAD-BALANACER-DNS>:80, referer: https://<APPLICATION-ENDPOINT>/
[Wed Mar 20 12:25:57.590779 2019] [proxy:debug] [pid 36178:tid 140617304508160] proxy_util.c(2422): [client 14.142.125.100:42564] AH00947: connected /<API-ENDPOINT> to <LOAD-BALANACER-DNS>:80, referer: https://<APPLICATION-ENDPOINT>/
[Wed Mar 20 12:25:57.590798 2019] [proxy:trace2] [pid 36178:tid 140617304508160] proxy_util.c(2765): HTTP: fam 2 socket created to connect to <LOAD-BALANACER-DNS>
[Wed Mar 20 12:26:00.588182 2019] [proxy:debug] [pid 36178:tid 140617304508160] proxy_util.c(2790): (113)No route to host: AH00957: HTTP: attempt to connect to 10.19.136.229:80 (<LOAD-BALANACER-DNS>) failed

--------------------------------------------------------------------------
After restarting the Apache server, the workers start reconnecting to the backend.

Comment 1 Eric Covener 2019-03-20 15:00:29 UTC

Does DNS change after the restart?

Comment 2 Bhushan Jade 2019-03-20 17:42:34 UTC

(In reply to Eric Covener from comment #1)
> Does DNS change after the restart?
No. We only restart the Apache service. The backend load balancer is an AWS ELB, which responds when we hit it directly using the LOAD-BALANACER-DNS URL.
Our setup looks like this:
[Front end Server(Apache)]<---elb--->[Backend Server]

Comment 3 Bhushan Jade 2019-03-20 17:45:25 UTC

Once the worker connection is lost, it is not re-established even though retry=0 is set. It only recovers when the Apache service is restarted.

Comment 4 Dave Rager 2019-09-18 17:56:08 UTC

Hello, it looks like I may have encountered this recently in our Production environment. We have two front end servers that connect to backend servers through an AWS load balancer similar to what is described here. Both began to fail about the same time with this error.

I would like to try to recreate this in our Test environment but I'm unsure what triggered it. Does anyone have any pointers on how to reproduce it?

Comment 5 Dave Rager 2019-12-10 13:54:35 UTC

I believe the issue is related to how AWS ELBs scale and how Apache workers are configured by default.

From Apache docs:
"When connection reuse is enabled, each backend domain is resolved only once per child process, and cached for all further connections until the child is recycled."

When resolving an AWS ELB hostname "by default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request."

Under load, AWS ELBs will scale to handle the traffic. (This is not the same as scaling the application servers behind the ELB.)

"The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS."

I believe what is happening is this: under load, previously unused workers are started, resolve the ELB hostname, and receive a "new" IP for the ELB. Once the load subsides, the ELB scales down and one or more of the IP addresses now cached by workers are no longer valid. Because the worker has cached that IP, it tries to use it the next time it receives a request, which then fails with the described error.

Regardless of the value of 'retry', that cached IP address will never be refreshed and the worker will always fail until it is recycled.

Using the parameter 'disablereuse=on' (or 'enablereuse=off') will force the worker to resolve the hostname to get a new IP.
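
As a rough sketch, the ProxyPass line from the original description with this parameter added might look like the following. The <Location> wrapper and its "/context/" path are assumptions made only so the fragment is complete; the report shows just the single ProxyPass line, and the backend host is left as the reporter's placeholder.

```apache
# Sketch only: the <Location> path is an assumption, not taken from the report.
<Location "/context/">
    # disablereuse=on closes the backend connection after each request, so the
    # hostname should be resolved again rather than served from the per-child cache.
    ProxyPass http://<load-balancer-DNS>/context/ retry=0 timeout=30 disablereuse=on
</Location>
```

The trade-off is that every proxied request then opens a fresh connection to the ELB, which gives up the benefit of persistent backend connections.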

Also note that it is better to leave 'retry' at its default value of 60, in case a worker resolves an IP address that has not yet been removed from the DNS record when scaling down:

"The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds."
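
Put together, a configuration following both suggestions could look roughly like this. It is an illustration, not a tested fix: the <Location> path is again an assumption, and the only changes from the reported ProxyPass line are the removed 'retry=0' and the added 'disablereuse=on'.

```apache
# Sketch only: the <Location> path is an assumption, not taken from the report.
<Location "/context/">
    # No retry= parameter is given, so the default of 60 seconds applies,
    # which lines up with the 60-second TTL quoted above.
    ProxyPass http://<load-balancer-DNS>/context/ timeout=30 disablereuse=on
</Location>
```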