Bug 63859 - AJP cping/cpong mode failing on Tomcat 9.x
Summary: AJP cping/cpong mode failing on Tomcat 9.x
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Connectors (show other bugs)
Version: 9.0.26
Hardware: PC All
: P2 regression (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-17 11:59 UTC by Aurelien Pernoud
Modified: 2019-12-10 10:35 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Aurelien Pernoud 2019-10-17 11:59:49 UTC
Hello,

We have a common infratructure for all our Tomcat instances (7,8&9) based on httpd + mod_jk / ajp.

We implemented a while ago cping/pong om httpd with mode A, and it works perfectly with Tomcat 7&8.

[info] ajp_handle_cping_cpong::jk_ajp_common.c (976): (tomcat9) timeout in reply cpong after 10000 ms. Socket = 31 (event=110)
[info] ajp_maintain::jk_ajp_common.c (3464): (tomcat9) failed sending request, socket -1 keepalive cping/cpong failure (errno=110)
Leading progressively to
[info] ajp_service::jk_ajp_common.c (2775): (tomcat9) sending request to tomcat failed (recoverable), because of error during request sending (attempt=2)
[error] ajp_service::jk_ajp_common.c (2796): (tomcat9) connecting to tomcat failed (rc=-3, errors=1, client_errors=0).
[info] service::jk_lb_worker.c (1601): service failed, worker tomcat9 is in error state

We use the latest mod_jk (mod_jk/1.2.46)

I tried using Tomcat native protocol (which we never used so far), same error.

I tried to look into tomcat 9 logs, nothing shows, and Tomcat is still up and running (Inspecction via JMX shows everything ok, Probe console and manager console shows no limit...).
I even tried to run a massload using Jmeter on AJP port while the error was showing in httpd logs : no error, all the GET ends in success. But httpd still sees a cping/cpong error.

In the end I may have fixed the issue by changing the mod_jk parameter and disabling the "cping cpong" mode :
worker.tomcat9.ping_mode=C
instead of previously worker.tomcat9.ping_mode=A
(which does the cping/cpong on a regular basis)

=> Since I changed this I don't have any errors in httpd.

So I suspect a regression in Tomcat 9 somewhere with AJP protocol, even though I couldn't find any error in Tomcat logs ...

Not sure if I can help more, just wanted to highlight this issue but if you need more detail or debug log don't hesitate.

The settings of my connectors (7 8 & 9 ) are the same :
<Connector port="xxxx" protocol="AJP/1.3" maxThreads="400" connectionTimeout="600000"/>
Comment 1 Aurelien Pernoud 2019-10-17 12:03:16 UTC
I missed a sentence...

We implemented a while ago cping/pong om httpd with mode A, and it works perfectly with Tomcat 7&8.
+"But with Tomcat 9 we inconsistently have those errors of cping/cpong in logs"

And then the logs...
Comment 2 Rainer Jung 2019-10-17 14:54:17 UTC
Any info on usage characteristics during times this happens? High load (how many requests per second), with or without load balancing, workers.properties config etc.?

How frequently does it happen (always, sporadicaly for NNN % of requests, in spikes, ...)?

Can you easily reproduce?

Do you have root privileges, so could you sniff network traffic to the AJP port? Are Apache and Tomcat on diferent machines? Any active components (Firewalls, routers) in between?

Thanks and regards,

Rainer
Comment 3 Aurelien Pernoud 2019-10-17 15:27:37 UTC
Hi Rainer,

Any info on usage characteristics during times this happens? High load (how many requests per second), with or without load balancing, workers.properties config etc.?

=> There is no "rule", it even happen with only a few users on my test instance.
Workers.properties is the same than for my tomcat 7&8 instances, e.g. I use a servertemplate which I apply to node and a lbtemplate too.

Here is an extract :
#Create one common template for all workers nodes
worker.servertemplate.type=ajp13
# factor of each worker is the same
worker.servertemplate.lbfactor=1
# ping_mode A is the most complete
worker.servertemplate.ping_mode=C
# socket_timeout in seconds
worker.servertemplate.socket_timeout=30
# connection_pool_timeout in seconds
worker.servertemplate.connection_pool_timeout=600
# reply_timeout in milliseconds : 10 min
worker.servertemplate.reply_timeout=600000
# recovery_options : 3 : don't retry on error after request was sent
worker.servertemplate.recovery_options=3

# Create one common template for all LB nodes
worker.balancertemplate.type=lb
# If we have 10 replies timing out in the same minute => worker goes in error state
worker.balancertemplate.max_reply_timeouts=10

Then based on this I apply those settings to nodes / clusters.

How frequently does it happen (always, sporadicaly for NNN % of requests, in spikes, ...)?

=> I couldn't find any "rule" sorry.... but it happened at least 10 times / day and since I changed the cping cpong to C it doesn't show anymore in the logs even though I have activity

Can you easily reproduce?

=> Yes, I have a test environment which is in use and if I push back cping/cpong mode to A I'm sure the error will show up again.

Do you have root privileges, so could you sniff network traffic to the AJP port? Are Apache and Tomcat on diferent machines? Any active components (Firewalls, routers) in between?

=> I'm not root unfortunately, but my config is that I run 2 linux servers, which both hosts httpd and tomcat instances (both servers runs same versions of httpd, tomcat), and it works fine with the ajp port of tomcat 7&8.
.
There is no firewall between them (hard or soft), and the error occurs even when the httpd is connecting to "himself" (even though I don't use "localhost" but the hostname)

I've ran this setup for more than a year on Tomcat 7&8 (even with upgrades), and only met the issue with 9. It failes with 9.0.20 so I tried to upgrade last week to 9.0.26 but still here. 
Since yesterday and switching the ping_mode to "C" it stopped failing, so for now I will go like that in production but might be good to investigate.

Let me know if I can be of any help.

FYI I tried :
- putting mod_jk in debug => way to verbose (I run 9 instances of tomcat in the end on the same server), but at the beginning the behaviour looks correct (I saw the ping pong working with my tomcat 9 instances every minute)
- adding debug in tomcat "AJP" : couldn't find exactly what I could put in debug... if I put the full org.apache logger in DEBUG it's nightmare :)

Let me know if I can help, and thanks for replying so quickly
Comment 4 Aurelien Pernoud 2019-10-17 15:29:19 UTC
Also I even checked while the error was showing the "total connections" from my httpd to tomcat 9 ajp port (netstat) : it was less than 10.
And when I stress test the ajp port, it worked. I'm lost :D
Comment 5 Christopher Schultz 2019-10-18 16:30:52 UTC
(In reply to Aurelien Pernoud from comment #3)
> Tomcat 7&8

Is this Tomcat 8.0.x or Tomcat 8.5.x?
Comment 6 Aurelien Pernoud 2019-10-18 18:39:25 UTC
Tomcat 8.0.x
Comment 7 Mark Thomas 2019-10-28 22:12:43 UTC
I know it is verbose but what we really need is the mod_jk debug log from when this error occurs. Based on the information provided, that looks like the best option for further investigation.
Comment 8 Aurelien Pernoud 2019-11-04 10:14:10 UTC
Hi Mark,

I have the debug but can't attach it due to limit of 1MB :-(

I managed to reproduce the issue, I have 2 httpd with the exact same configuration, pointing to same tomcat.

In the logs of "http node 1" I see for ex :

[Mon Nov 04 09:15:03 2019] [13694:140683904751360] [info] ajp_handle_cping_cpong::jk_ajp_common.c (976): (tom_tst01_srv09) timeout in reply cpong after 10000 ms. Socket = 25 (event=110)
[Mon Nov 04 09:15:13 2019] [13694:140683904751360] [info] ajp_handle_cping_cpong::jk_ajp_common.c (976): (tom_tst01_srv09) timeout in reply cpong after 10000 ms. Socket = 14 (event=110)
[Mon Nov 04 09:15:24 2019] [13694:140683904751360] [info] ajp_handle_cping_cpong::jk_ajp_common.c (976): (tom_tst01_srv09) timeout in reply cpong after 10000 ms. Socket = 14 (event=110)

leading progressively to :

[Mon Nov 04 09:15:24 2019] [13694:140683904751360] [error] ajp_send_request::jk_ajp_common.c (1725): (tom_tst01_srv09) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)

(Here I only searched for the error)

In the logs of "http node 2", at the same timing, the same server replies correctly to ping (at least that's how I interpret the below lines) :

[Mon Nov 04 09:14:52 2019] [25023:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3441): (tom_tst01_srv09) reached pool min size 13 from 25 cache slots
[Mon Nov 04 09:14:52 2019] [25023:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3493): (tom_tst01_srv09) pinged 1 and recycled 0 sockets in 0 seconds from 25 pool slots
[Mon Nov 04 09:15:32 2019] [25022:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3441): (tom_tst01_srv09) reached pool min size 13 from 25 cache slots
[Mon Nov 04 09:15:32 2019] [25022:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3493): (tom_tst01_srv09) pinged 1 and recycled 0 sockets in 0 seconds from 25 pool slots
[Mon Nov 04 09:15:32 2019] [25024:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3441): (tom_tst01_srv09) reached pool min size 13 from 25 cache slots
[Mon Nov 04 09:15:32 2019] [25024:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3493): (tom_tst01_srv09) pinged 2 and recycled 0 sockets in 0 seconds from 25 pool slots
[Mon Nov 04 09:15:50 2019] [26242:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3441): (tom_tst01_srv09) reached pool min size 13 from 25 cache slots
[Mon Nov 04 09:15:50 2019] [26242:139691817195264] [debug] ajp_maintain::jk_ajp_common.c (3493): (tom_tst01_srv09) pinged 2 and recycled 0 sockets in 0 seconds from 25 pool slots

Both server times are synecd of course.

I can send the logs, once zipped both nodes we are talking about a file of 30MB, let me know how ?
Comment 9 Aurelien Pernoud 2019-11-04 13:25:34 UTC
I also collected tomcat logs even though I can't find anything in it, except that "luckily" I also see for ex that our monitoring tool "pinged" (using HTTP protocol) the server at  [04/Nov/2019:09:15:02 +0000] using "Probe PSI quickcheck", and this call worked like a charm...
There was very limited activity too according to access_log.
I can also provide this if needed.

I don't know if I can provide any more details, let me know, for now I'm switching back to mode C to avoid this error, I just double checked again my production environment which has much more activity and I don't have this "timeout in reply cpong after 10000 ms" showing at all.
Comment 10 Mark Thomas 2019-11-04 13:39:05 UTC
I'll send you a Google Drive link privately where you can upload the files.

Please upload the mod_jk debug logs, the Tomcat logs, the access log, the mod_jk configuration files and the server.xml from the Tomcat instances (with any passwords replaced by ***).

Thanks.
Comment 11 Aurelien Pernoud 2019-11-04 15:17:06 UTC
Thanks Mark, just uploaded everything.

in "logs" : mod_jk debug log from both httpd instances limited to the period where the error occured
tom_tst01_srv09 is the full tomcat log

in "conf" : server.xml anonymized, I also provide the setenv as I use environment variables as you will see
workers.properties also provided, except of course that here I switched it back to "C" mode, the error showed only with "A" mode.
Comment 12 Mark Thomas 2019-11-05 09:11:37 UTC
Thanks. I can't see anything obviously wrong in those files. The CPING is sent and the CPONG is never seen.

If any Tomcat committer wants access to the logs, let me know and I'll send you a link.

Debug logging on the Tomcat side might be useful. Adding the following to logging.properties:
org.apache.coyote.ajp.level = FINE
org.apache.tomcat.util.net.level = FINE

If that generates too much data, try starting with just the first of the two lines.

Meanwhile, I am going to try and reproduce this.
Comment 13 Aurelien Pernoud 2019-11-05 09:55:34 UTC
Thanks Mark,

appreciate all the time spent, I also agree this looks a complex one.

Reading mod_jk logs it seems the cping / cpong is not coming back but as you also see the second httpd pointing to the same tomcat gets replies...

So to me the issue is definitely in the cping cpong (mode A), but why does it fail sometimes, no idea ... Also as you can see in the logs the only server failing is the one running tomcat 9, all the others works fine, so to me the bug is on tomcat too.
Comment 14 Mark Thomas 2019-11-08 17:02:37 UTC
I am unable to reproduce this.

CPing/CPong mode "A" is equivalent to mode "CPI". Clearly, mode "C" is not an issue but it would be useful to know if it is enabling P or I that triggers the issue. You can use mode "CP" and "CI" to test this.

I've been looking through the mod_jk logs in more detail and it is a P test that fails and takes Tomcat off-line but I see "I" test failures as well.

It is interesting that - in the logs we have - it is only srv09 that is seeing these errors. Is there something about those two instances that is unique?

I have also been looking at the Tomcat code. I do see something that isn't quite right (after a CPing/CPong Tomcat always does a blocking read) but I don't see how that could be triggering the error you see. I'll fix this anyway for 9.0.x and 8.5.x.

The debug logging (as previously mentioned for Tomcat) may help. If not, we'll need to ad more logging on the Tomcat side to try and get to the bottom of this.
Comment 15 Aurelien Pernoud 2019-11-11 08:54:31 UTC
Hi Mark,

thanks for all the testing.

"It is interesting that - in the logs we have - it is only srv09 that is seeing these errors. Is there something about those two instances that is unique?"

=> Yes, it's the only instance runnning Tomcat 9.0.x. Others (srv01->srv08) are running Tomcat 7.0.x and 8.0.x.
I have no 8.5.x instances, so if the "bug" you found is due to a change in Tomca > 8.5.x it might be it :)

Anyway, I'll make tests with : CI / CP only (see which one triggers the rror) and also activate debug on Tomcat 9 as required and will try to provide everything back when possible.

Appreciate your help !
Comment 16 Aurelien Pernoud 2019-11-15 09:40:15 UTC
Small update : I've been running for a week with CP, no error in the logs.
Will wait until next week and switch to CI see if it triggers the error or not.
Comment 17 Aurelien Pernoud 2019-11-19 15:25:45 UTC
Hi,

I've reproduced the bug with mode CI after one day... do definitely related to CI I think :)

I uploaded a new zip in your dropboy with :

mod_jk.log.20191119 from both nodes
You0'll see both of them have some "(tom_tst01_srv09) timeout in reply cpong after 10000 ms.", not at the same timing.

(I limited this log to the timing of the errors between 9 and 11 UTC) to avoid sending a huge file)


catalina logs in FINE mode 
(org.apache.coyote.ajp.level = FINE
org.apache.tomcat.util.net.level = FINE)

Since startup and for the day.

Let me know if I can help in any way
Comment 18 Aurelien Pernoud 2019-11-19 15:27:10 UTC
Of course I didn't mention that but the catalina logs is the one from tom_tst01_srv09 !
Comment 19 Mark Thomas 2019-11-19 22:23:53 UTC
Thanks for those logs. The Tomcat debug logs were useful. I'm now reasonably sure Tomcat is closing the connection because of an IOException. However, the logs don't show the exception. I have extended the debug logging to include this exception in 8.5.x and 9.0.x.

To save you building from source (I wasn't sure whether you were set up for that) I've uploaded the latest build here:
http://people.apache.org/~markt/dev/apache-tomcat-9.0.30-dev.tar.gz

Usual caveats apply. This isn't an official release. Use it at your own risk and if your server catches fire it isn't our fault ;)

I really do appreciate your willingness to help track this down. Thanks again.
Comment 20 Aurelien Pernoud 2019-11-21 13:58:59 UTC
Hi Mark,

I deployed your version with FINE debug as before, so far the server didn't catch fire which is pretty cool, as soon as I have the error (kept only CI mode) I will provide you the logs.

Definitely appreciate your help on this one.
Comment 21 Aurelien Pernoud 2019-11-27 15:17:00 UTC
Hi Mark,

the error is not showing for 4 days. Is it possible that you or someone actually fixed something in this patch too from version 9.0.26 ?? Or Should I just wait a bit more...
Comment 22 Mark Thomas 2019-11-27 17:17:14 UTC
The only thing that has changed as far as I know is Tomcat no longer always does a blocking read after the CPing/CPong.

I still don't see how that could have caused the error though...

Could we give it until after the weekend to see? I'll be tagging the next set of releases then. Meanwhile, I'll see if I can see how the blocking read could trigger this.
Comment 23 Aurelien Pernoud 2019-11-28 12:12:20 UTC
There is absolutely no hurry !
I've been working for a week almost now and no errors...
I'll leave it like that in our environment for the full month, I will even try to reactivate the full mode "A" in our test environment, and if it doesn't happen anymore then I'll consider it fix ;)
Thanks
Comment 24 Aurelien Pernoud 2019-12-05 13:06:58 UTC
2 weeks now, no error anymore.
I reactivated also mode "A" for 4 days now, still no error.
I will wait more as I'm not in hurry but honestly I think we can consider you fixed it !
Comment 25 Aurelien Pernoud 2019-12-09 15:19:49 UTC
FYI I disabled the mod_jk debug to avoid having to clean it on a daily basis, I will continue to monitor with mode "A" if the issue eventually comes back but to me I would consider this one fix now ;)

Many thanks Mark
Comment 26 Mark Thomas 2019-12-10 10:35:41 UTC
OK. Thanks for all you help on this. I'm marking this as resolved but do feel free to re-open if the issue re-surfaces.