Bug 56703

Summary: "Current number of backend connections" counter ever increasing when timeouts are defined
Product: Tomcat Connectors Reporter: Martin Knoblauch <knobi>
Component: mod_jkAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 1.2.40   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Patch to fix connection statistics
Rebased patch to SVN r1617727

Description Martin Knoblauch 2014-07-08 10:04:35 UTC
Hi,

  our environment is (I know, some are old ...):

Apache 2.2.21
mod_jk-1.2.40
Tomcat 7.0.42

We have found that the numbers for "Con", the current number of backend connections" is growing steadily, well over any sensible number.

The following configuration seems to trigger the behavior:

worker.template.type=ajp13
worker.template.lbfactor=1
worker.template.prepost_timeout=15000
worker.template.connect_timeout=15000
worker.template.ping_timeout=15000
worker.template.ping_mode=A
worker.template.connection_ping_interval=180
worker.template.retries=3
worker.template.connection_pool_timeout=30

Without setting the timeout values, the displayed number of backend connections looks to be stable.

The behavior seems to have started between mod_jk-1.2.32 and mod_jk-1.2.37. Looking at the code, I can see some heavy redo of the handling of the "connected" counter. I suspect a "connected--" is missing, but hard to say where.

Cheers
Martin
Comment 1 Martin Knoblauch 2014-08-13 13:16:42 UTC
Created attachment 31913 [details]
Patch to fix connection statistics
Comment 2 Martin Knoblauch 2014-08-13 13:18:08 UTC
After some guessing and playing around, I think I have isolated my "failure case".

Necessary condition is than "connection_pool_timeout" is enabled in mod_jk:

worker.template.connect_timeout=15000

and that the backend Tomcats(s) are using a "connectionTimeout" on the AJP connector:

    <Connector port="9397" protocol="AJP/1.3"
		connectionTimeout="30000"
		maxThreads="300"
		minSpareThreads="10"
		redirectPort="8443" />


Given the "right" timing, one can find the following in "mod_jk.log":

###########
[Wed Aug 13 14:21:05.827 2014] [22249:139749745399552] [info] ajp_send_request::jk_ajp_common.c (1608): (lpsdm75-2) failed sending request, socket 27 is not connected any more (errno=0, connected=2)
[Wed Aug 13 14:21:05.827 2014] [22249:139749745399552] [info] ajp_send_request::jk_ajp_common.c (1681): (lpsdm75-2) all endpoints are disconnected, detected by connect check (1), cping (0), send (0)
###########

I assume, that in this case the backend has closed the connection due to the connctionTimeout and mod_jk still has to find out about it.

The code around line 1608 does this:

##########
        if (jk_is_socket_connected(ae->sd, l) == JK_FALSE) {
            ae->last_errno = errno;
            jk_log(l, JK_LOG_INFO,
                   "(%s) failed sending request, "
                   "socket %d is not connected any more (errno=%d, connected=%d)",
                   ae->worker->name, ae->sd, ae->last_errno, ae->worker->s->connected);
            ajp_abort_endpoint(ae, JK_FALSE, l);
            err = JK_TRUE;
            err_conn++;
        }
##########

If "jk_is_socket_connected" decides that the socket is no longer connected, it closes the socket, but can of course not decrement the "connected" counter. As a result, the statistics are never decaying for the workers.

Adding "ae->worker->s->connected--" below "ajp_abort_endpoint" fixes the issue for me.

A patch is included. In addition to the "fix", it also enables the logging in line 1608 and some other logging in the maintenance loop.
Comment 3 Martin Knoblauch 2014-08-13 13:25:38 UTC
Created attachment 31914 [details]
Rebased patch to SVN r1617727

Sorry, but previous version was against wrong base :-(
Comment 4 Rainer Jung 2014-08-13 18:00:50 UTC
Thanks for the analysis. I will double check latest next week and then likely apply if noone else does it before. If you don't get any more responses here, please ping again.
Comment 5 Rainer Jung 2014-08-14 17:50:27 UTC
Hi Martin,

your analysis helped me finding the likely real root cause and fix a few more places by adusting the connection counter decrement in ajp_abort_endpoint().

I applied the resulting fix in r1618000.

Do you have any chance to test this? Would you need me to roll a dev (snapshot) tarball, or do you know how to build mod_jk from a svn checkout?

The problem always is not to decrement to often such that the counter gets skewed to the other side. I think the fix I did now should be safe.

Thanks again for finding the right root cause and for (hopefully) testing this fix.

If testing goes well, this will e part of 1.2.41.
Comment 6 Martin Knoblauch 2014-08-15 09:54:09 UTC
Hi Rainer,

 good you double checked. I applied my previous fix to a productive system and found that while it reduced the growth by 80-90 %, there was still something missing.

 So, I found that [in my case] the remaining cases were happening in "ajp_connection_tcp_get_message" when the calls to "jk_tcp_socket_recvfull" were returning JK_SOCKET_EOF. So adding the "connected--" in those places cured the problem for good.

 I then looked at the remaining case when abort_endpoint was called with "shutdown == JK_FALSE". That is in "ajp_connection_tcp_send_message" when "jk_tcp_socket_sendfull" returns an error. So in my opinion "connection--" is warranted here also.

 But that would mean that commit r1334399 was the wrong fix to the going negative problem. Hmm...

 Now looking at r1618000, you have decoupled the checks for "valid socket" and "perform shutdown" in abort_endpoint and decrement the connection counter regardless of the shutdown setting. That will definitely fix the growth problem.

 But will it also fix the going negative thing? Looking at pre r1334399, a superfluous decrement could happen when abort_endpoint was called with an "invalidated" . Do you this could happen?

 Anyway, I will try to test on my high volume setup.

Martin
Comment 7 Rainer Jung 2014-08-15 17:00:49 UTC
Hi Martin,

yes I think r1334399 was a bit to much. There are cases where ajp_abort_endpoint() is called with shutdown==JK_FALSE because the shutdown already happened in code lower down. That code doesn't adjust the connected counter, so we have to do it. That part was correct before r1334399, wrong between r1334399 and r1618000 and should be OK after my commit r1618000.

The other case was when the socket "sd" was already INVALID. In that case my code inspection told me that we should not adjust the connected counter, so here r1334399 was right and r1618000 still is right in not adjusting connected when sd is INVALID.

I do hope that your production proves us right.

I see no risk of running the patch in production, because apart from the connected counter nothing has changed. So only your monitoring/status page display will change. The full 1.2.41 dev has a few more changes/fixes, but they should be low risk and I'm not aware of new problems.

Thanks for the cooperation. Looking forward for your production results.

Regards,

Rainer
Comment 8 Martin Knoblauch 2014-08-18 15:29:22 UTC
Hi Rainer,

 in order to minimize risk, I applied your changes on top of 1.2.40 as used by the high-volume/productive setup. So far (long weekend plus the working hours of today) the results are good, but not perfect:

- the "connected" counter does no longer grow without bounds and decays back to 0 if a tomcat is idle for a sufficiently long time
- unfortunately I have observer one worker, where the counter turned negative "-1". It only happened once in my setup, and I do not care to much about that effect. Others might of course care more.


 Either we:

- ignore it, or
- debug it, or
- do the "connected--" only if "connected > 0"

 I am willing to help, but there are limits to the extend of experimenting I am allowed to do on the productive setup

Cheers
Martin
Comment 9 Rainer Jung 2015-01-04 10:00:58 UTC
I can reproduce another case of inflated connection count: if idle Apache children are stopped, their connetion count isn't decremented.

Need to check, whether/how we can hook into the destruction of children (maybe pool cleanup or similar).
Comment 10 Rainer Jung 2015-01-05 11:49:28 UTC
I have improved the implementation of the connected counter in two ways now:

- using atomics so that we don't miss any increments or decrements under high load (r1649506)

- adding a shutdown handler for Apache child processes, that closes and decrements any connection held by a child process (r1649503). Such process shut downs can happen e.g. by a normal process termination because of e.g. MaxSpareThreads or MaxConnectionsPerChild in the Apache config.

Closing this now, can be reopened it the next release proves to still not count correctly.