We have some servers running Apache 2.2.3-11 on CentOS 5 with mod_jk 1.2.25 connecting to Tomcat instances running 5.5.20. Over the course of a couple of days, the busy count on the various servers (as reported by the jkstatus page) creeps upward, causing the balancing to get out of whack. A graceful restart of Apache gets things back to normal, as does a restart of the Tomcat, but then the climb starts again. I'm not entirely sure what information would be helpful in troubleshooting this, so please advise. Earlier load testing did not show anything, so I'm guessing that specific circumstances may cause the count to become wrong. Thanks, Andrew
Are we sure that the growing busy numbers are wrong? Asked differently: can you take thread dumps of your backends to confirm that they are *not* handling that many busy requests? In theory some requests might have gotten stuck in the backend. If you don't have a reply_timeout, mod_jk will wait indefinitely for a response.

By using "kill -QUIT PID" (PID is the process id of your backend Tomcat process) you will get a stack for each thread in Tomcat in your catalina.out. This procedure does not influence your running Tomcat negatively; it will pause for a few milliseconds and then resume its usual work. In the dump, the threads responsible for requests coming in via AJP13 (from mod_jk) can be identified as TP-ProcessorXXX. Some of them will only have a very short stack, which indicates that they are idle in the thread pool. Some have a slightly longer stack (about 10 lines) with the top-level method being a socket read, which tells us that mod_jk is connected to them but they are not actually processing a request. A few will have a very long stack, usually containing webapp methods, a service method etc. Those are the ones that are processing a request, and their number should be equivalent to "busy" (at least if Tomcat only gets requests from one httpd; otherwise you need to add up the busy values for this Tomcat across the various httpds).

Although in principle there's a chance that busy could go wrong, I never detected it in practice, and the feature has been out in the wild for quite some time now without problem reports. That's why I first want you to check that the numbers are really wrong.

Furthermore: are there errors in the JK error log?

Last: if you want to track how and when busy increases, you can add "%{JK_LB_LAST_BUSY}n" to your LogFormat in httpd. Then your access log will contain the busy number after each request in the request log line. Regards, Rainer
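For reference, a minimal httpd.conf sketch of the logging suggestion above (the format name "jk_busy" and the log path are only illustrative; LogFormat, CustomLog and the %{...}n note syntax are standard httpd):

    LogFormat "%h %l %u %t \"%r\" %>s %b busy=%{JK_LB_LAST_BUSY}n" jk_busy
    CustomLog logs/access_log jk_busy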
Created attachment 21564 [details]: Results from Thread Dump - Idle Server. The JK threads from an idle Tomcat instance during a thread dump.
I took one of our systems out of rotation and let it sit for a while. Its busy count was hanging around 16-20 while in rotation. Once out, it almost immediately went to 12 and remained at 12 the entire time. I did the thread dump and attached the TP-ProcessorXXX entries. It looks like the vast majority of connections show up in a socket read, with a few idle in the pool. There were no active threads (which is good, since it was out of rotation).

The mod_jk.log shows no errors, though there are many warnings for things like:

[Tue Feb 19 15:56:18.849 2008] [13119:2611919760] [warn] map_uri_to_worker::jk_uri_worker_map.c (550): Uri echo|20YYY;echo| is invalid. Uri must start with /

Good ole' internet. Thanks for the log variable, I can see that being handy, though I can't add anything in right now. Let me know if there's anything more.
Thanks for the check, so we know for sure that the busy count doesn't correspond to the backend status, although there is still a theoretical possibility that mod_jk *thinks* the backend is that busy.

I checked whether we miss some error conditions when decreasing the busy count. There is no condition in the code under which the busy increment would not be followed by a decrement. The counter is held inside a shared memory segment and it is marked as volatile. In case you have a *very* busy web site, it's possible that we run into a non-atomic increment/decrement problem. Although that's possible, I'm still thinking about other possibilities.

The status worker shows a busy counter for the lb and for each member of an lb. Do both types of busy counters suffer from the problem? To get an idea: how fast does the problem happen (how much does busy increase per day or so, and roughly how many requests are handled by the corresponding worker during that time)?

You mentioned that this negatively influences the balancing behaviour. I deduce from that that you are using the Busyness method. Do you have a reason not to use the Requests method? It would not suffer from the "wrong busy" problem.

Finally: although we know of no change in the code that would suggest the problem will go away in 1.2.26, is it possible to switch to that version? Be careful: in 1.2.26 your JkMounts are not automatically inherited between httpd VirtualHosts. You either need to include them in each VirtualHost or use JkMountCopy (see docs). Apart from that, the version update should be reliable.

I'm still wondering whether httpd threads might hang in mod_jk between the start of a request and the end of the response. You could use gstack to take thread dumps on the httpd side. Since the busy number is summed up over all httpd child processes, and we don't know whether the problem is located in only one of those or visible in all of them, you would need to run gstack on all httpd child processes. The threads busy in mod_jk will be easy to detect, since mod_jk uses "jk" as part of most function names.
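A rough sketch of how such an httpd-side dump could be collected (the pgrep pattern and the output file are just examples; adjust them to how your httpd processes are named):

    # dump the stack of every httpd process into one file, tagged by PID
    for pid in $(pgrep httpd); do
        echo "==== PID $pid ===="
        gstack "$pid"
    done > /tmp/httpd-stacks.txt

    # rough indicator: lines of threads currently inside mod_jk functions
    grep -c jk_ /tmp/httpd-stacks.txt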
Well, we're a busy site, but not stunningly so... everyone's idea of what makes a busy site may differ. The httpd systems are never that terribly loaded CPU/load-wise (less than 40%). The global counter and the individual counters add up to each other, so I believe that indicates they both have the issue. It's hard to tell at what level the inflation occurs; without the servers calming down at around the same time each day, it's hard to tell which are real requests and which are inflation. We have a few servers in rotation that are bulkier and run a 2x higher lbfactor, and they currently seem to have about 2x the false busies. I'm guessing about 5 a day for the lighter servers and 10 a day for the bigger servers. Because the load balancing is unequal it varies, but since our graceful restart 2.5 days ago, the ACC on the servers is anywhere from 2-4 million, depending on the server.

As for the balancing method, the busyness method is definitely more even when it works, since our requests can be of very different weights.

As for running 1.2.26, I think we could compile/use it no problem (we're running 2.2.3; the precompiled binary indicates 2.2.6 as a prereq), but I know management would be a little grumbly about making a change without a better idea as to what's going on. I don't think the config changes are a problem for us, as we use preference files and JkMounts only at VHost level... no globals.

I'm gathering the gstack data now and will post in a few minutes. In case it comes up, a typical worker is configured like:

worker.<WORKER>.port=8009
worker.<WORKER>.host=<IP>
worker.<WORKER>.type=ajp13
worker.<WORKER>.socket_timeout=30
worker.<WORKER>.connection_pool_timeout=60
worker.<WORKER>.lbfactor=20

The LB involved is:

worker.<LB>.lock=P
worker.<LB>.method=B

and we have a global setting of:

worker.maintain=15

Thanks, Andrew
Created attachment 21566 [details] Thread Dump from Apache Children
How many threads per process do you use in httpd (MPM configuration)? 60?

Anyway, I adapted my JVM stack analysis script to gstack. The results (sums over all 19 processes = 1178 threads) follow. In total I can see 68 threads that would count as busy in the lb (and 2 threads that seem to do direct ajp work without the lb). So how do these numbers compare to your observed busyness at the time of the dumps?

Concerning your configuration: pessimistic locking for me is more a relic from the early days of the code. I don't know about any cases where it is actually needed. In case you had opened this case with optimistic locking (the default) when running out of ideas, I would have asked you to try pessimistic locking, but your case proves once more that it's not really useful. I would stick to the default (optimistic) in order to run the code path that's used most often.

You don't have a reply_timeout. The stacks seem to indicate that hanging replies might well be your problem. Unfortunately in 1.2.25 max_reply_timeouts doesn't work, so any reply_timeout that fires will put your worker into error state until recovery, and other users will lose sessions when being switched over to failover workers. In 1.2.26 we fixed max_reply_timeouts, so that you can configure a reply_timeout (don't wait longer for an answer than X seconds) and the number of reply timeouts you tolerate before you think the backend itself is in serious trouble and should be put out of service. So maybe here's the argument for a switch to 1.2.26, although I hadn't expected that situation last night when I wrote about 1.2.26. In 1.2.27 (March) we will be able to configure a reply_timeout not only per worker but also per mapping, so that you can give individual JkMounts individual reply_timeouts. If some use cases are known to take a long time, you can then have an appropriately short general timeout, with longer timeouts for the URLs known to take notoriously long. It already works in 1.2.27-dev, but there's no release out there.

Have a look at the following data yourself. I think the most reasonable approach is to use a reply_timeout with max_reply_timeouts, which in turn will force you to update to 1.2.26.
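A hedged workers.properties sketch of that combination (the values are only illustrative and would need tuning against your slowest legitimate requests; max_reply_timeouts requires 1.2.26, with the tolerance semantics as described above):

    # give up on a reply after 60 seconds (value is in milliseconds)
    worker.<WORKER>.reply_timeout=60000
    # tolerate a number of reply timeouts before the worker is put into error state
    worker.<WORKER>.max_reply_timeouts=10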
The details of the dumps:

548 Threads idle waiting (I would say in Keepalive, so hanging on a connection from the client, but waiting for the next request):
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_recv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in apr_bucket_socket_create () from /usr/lib/libaprutil-1.so.0
#5 0xHEXADDR in apr_brigade_split_line () from /usr/lib/libaprutil-1.so.0
#6 0xHEXADDR in ap_core_input_filter () from /proc/PID/exe
#7 0xHEXADDR in ap_get_brigade () from /proc/PID/exe
#8 0xHEXADDR in ap_rgetline_core () from /proc/PID/exe
#9 0xHEXADDR in ap_read_request () from /proc/PID/exe
#10 0xHEXADDR in ap_register_input_filter () from /proc/PID/exe
#11 0xHEXADDR in ap_run_process_connection () from /proc/PID/exe
#12 0xHEXADDR in ap_process_connection () from /proc/PID/exe
#13 0xHEXADDR in ap_graceful_stop_signalled () from /proc/PID/exe
#14 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#15 0xHEXADDR in start_thread () from /lib/libpthread.so.0
#16 0xHEXADDR in clone () from /lib/libc.so.6

272 Threads idle
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0xHEXADDR in apr_thread_cond_wait () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in ap_queue_pop () from /proc/PID/exe
#4 0xHEXADDR in ap_graceful_stop_signalled () from /proc/PID/exe
#5 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#6 0xHEXADDR in start_thread () from /lib/libpthread.so.0
#7 0xHEXADDR in clone () from /lib/libc.so.6

235 Threads lingering close (client connection shutdown)
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_recv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in ap_lingering_close () from /proc/PID/exe
#5 0xHEXADDR in ap_graceful_stop_signalled () from /proc/PID/exe
#6 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#7 0xHEXADDR in start_thread () from /lib/libpthread.so.0
#8 0xHEXADDR in clone () from /lib/libc.so.6

59 Threads in mod_jk, waiting for reply from the backend (with lb?)
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in read () from /lib/libpthread.so.0
#2 0xHEXADDR in jk_tcp_socket_recvfull () from /etc/httpd/modules/mod_jk.so
#3 0xHEXADDR in ajp_connection_tcp_get_message ()
#4 0xHEXADDR in ajp_get_reply () from /etc/httpd/modules/mod_jk.so
#5 0xHEXADDR in ajp_service () from /etc/httpd/modules/mod_jk.so
#6 0xHEXADDR in service () from /etc/httpd/modules/mod_jk.so
#7 0xHEXADDR in jk_handler () from /etc/httpd/modules/mod_jk.so
#8 0xHEXADDR in ap_run_handler () from /proc/PID/exe
#9 0xHEXADDR in ap_invoke_handler () from /proc/PID/exe
#10 0xHEXADDR in ap_process_request () from /proc/PID/exe
#11 0xHEXADDR in ap_register_input_filter () from /proc/PID/exe
#12 0xHEXADDR in ap_run_process_connection () from /proc/PID/exe
#13 0xHEXADDR in ap_process_connection () from /proc/PID/exe
#14 0xHEXADDR in ap_graceful_stop_signalled () from /proc/PID/exe
#15 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#16 0xHEXADDR in start_thread () from /lib/libpthread.so.0
#17 0xHEXADDR in clone () from /lib/libc.so.6

36 Threads internal tasks (non JK, non-worker threads)

11 Threads sending non-JK content back to browser
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_sendv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in ap_bucket_eoc_create () from /proc/PID/exe
#5 0xHEXADDR in ap_core_output_filter () from /proc/PID/exe
#6 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#7 0xHEXADDR in ap_http_chunk_filter () from /proc/PID/exe
#8 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#9 0xHEXADDR in ap_http_outerror_filter () from /proc/PID/exe
#10 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#11 0xHEXADDR in ap_content_length_filter () from /proc/PID/exe
#12 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#13 0xHEXADDR in ?? () from /etc/httpd/modules/mod_deflate.so
#14 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#15 0xHEXADDR in ap_old_write_filter () from /proc/PID/exe
#16 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#17 0xHEXADDR in ap_note_auth_failure () from /proc/PID/exe
#18 0xHEXADDR in ap_process_request () from /proc/PID/exe
#19 0xHEXADDR in ap_register_input_filter () from /proc/PID/exe

6 Threads sending back jk-replies to browser
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_sendv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in ap_bucket_eoc_create () from /proc/PID/exe
#5 0xHEXADDR in ap_core_output_filter () from /proc/PID/exe
#6 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#7 0xHEXADDR in ap_http_chunk_filter () from /proc/PID/exe
#8 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#9 0xHEXADDR in ap_http_outerror_filter () from /proc/PID/exe
#10 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#11 0xHEXADDR in ap_content_length_filter () from /proc/PID/exe
#12 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#13 0xHEXADDR in ap_filter_flush () from /proc/PID/exe
#14 0xHEXADDR in apr_brigade_write () from /usr/lib/libaprutil-1.so.0
#15 0xHEXADDR in ap_old_write_filter () from /proc/PID/exe
#16 0xHEXADDR in ap_rwrite () from /proc/PID/exe
#17 0xHEXADDR in ws_write () from /etc/httpd/modules/mod_jk.so
#18 0xHEXADDR in ajp_get_reply () from /etc/httpd/modules/mod_jk.so
#19 0xHEXADDR in ajp_service () from /etc/httpd/modules/mod_jk.so

2 Threads waiting in mod_jk for response from backend (without lb?)
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in read () from /lib/libpthread.so.0
#2 0xHEXADDR in jk_tcp_socket_recvfull () from /etc/httpd/modules/mod_jk.so
#3 0xHEXADDR in ajp_connection_tcp_get_message ()
#4 0xHEXADDR in ajp_get_reply () from /etc/httpd/modules/mod_jk.so
#5 0xHEXADDR in ajp_service () from /etc/httpd/modules/mod_jk.so
#6 0xHEXADDR in jk_handler () from /etc/httpd/modules/mod_jk.so
#7 0xHEXADDR in ap_run_handler () from /proc/PID/exe
#8 0xHEXADDR in ap_invoke_handler () from /proc/PID/exe
#9 0xHEXADDR in ap_process_request () from /proc/PID/exe
#10 0xHEXADDR in ap_register_input_filter () from /proc/PID/exe
#11 0xHEXADDR in ap_run_process_connection () from /proc/PID/exe
#12 0xHEXADDR in ap_process_connection () from /proc/PID/exe
#13 0xHEXADDR in ap_graceful_stop_signalled () from /proc/PID/exe
#14 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#15 0xHEXADDR in start_thread () from /lib/libpthread.so.0
#16 0xHEXADDR in clone () from /lib/libc.so.6

2 Threads writing back jk-replies to client (with chunking)
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_sendv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in ap_bucket_eoc_create () from /proc/PID/exe
#5 0xHEXADDR in ap_core_output_filter () from /proc/PID/exe
#6 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#7 0xHEXADDR in ap_http_outerror_filter () from /proc/PID/exe
#8 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#9 0xHEXADDR in ap_content_length_filter () from /proc/PID/exe
#10 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#11 0xHEXADDR in ap_filter_flush () from /proc/PID/exe
#12 0xHEXADDR in apr_brigade_write () from /usr/lib/libaprutil-1.so.0
#13 0xHEXADDR in ap_old_write_filter () from /proc/PID/exe
#14 0xHEXADDR in ap_rwrite () from /proc/PID/exe
#15 0xHEXADDR in ws_write () from /etc/httpd/modules/mod_jk.so
#16 0xHEXADDR in ajp_get_reply () from /etc/httpd/modules/mod_jk.so
#17 0xHEXADDR in ajp_service () from /etc/httpd/modules/mod_jk.so
#18 0xHEXADDR in service () from /etc/httpd/modules/mod_jk.so
#19 0xHEXADDR in jk_handler () from /etc/httpd/modules/mod_jk.so

1 Thread flushing to client in jk reply
#0 0xHEXADDR in __kernel_vsyscall ()
#1 0xHEXADDR in poll () from /lib/libc.so.6
#2 0xHEXADDR in apr_wait_for_io_or_timeout () from /usr/lib/libapr-1.so.0
#3 0xHEXADDR in apr_socket_sendv () from /usr/lib/libapr-1.so.0
#4 0xHEXADDR in ap_bucket_eoc_create () from /proc/PID/exe
#5 0xHEXADDR in ap_core_output_filter () from /proc/PID/exe
#6 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#7 0xHEXADDR in ap_http_outerror_filter () from /proc/PID/exe
#8 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#9 0xHEXADDR in ap_content_length_filter () from /proc/PID/exe
#10 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#11 0xHEXADDR in ap_old_write_filter () from /proc/PID/exe
#12 0xHEXADDR in ap_pass_brigade () from /proc/PID/exe
#13 0xHEXADDR in ap_rflush () from /proc/PID/exe
#14 0xHEXADDR in ws_flush () from /etc/httpd/modules/mod_jk.so
#15 0xHEXADDR in ajp_get_reply () from /etc/httpd/modules/mod_jk.so
#16 0xHEXADDR in ajp_service () from /etc/httpd/modules/mod_jk.so
#17 0xHEXADDR in service () from /etc/httpd/modules/mod_jk.so
#18 0xHEXADDR in jk_handler () from /etc/httpd/modules/mod_jk.so
#19 0xHEXADDR in ap_run_handler () from /proc/PID/exe

5 Threads other states, non-jk related
The busyness reported at the time of the dump was around 300, though given the amount of inflation seen in earlier checks, 68 sounds reasonable for the actual connections at the time. We are running 60 threads per child. I think it's reasonable for us to run optimistic locking, and I'll see if we can change to that this evening. As for 1.2.26, we're doing some compiling and testing on that right now... any thoughts on the use of --enable-flock for the compile? Going to 1.2.26 will take a few days, though, as we will need to move slowly and test things out.
So unfortunately we now *know* that the number is wrong, i.e. it doesn't reflect the internal state of mod_jk. I had hoped that 68 was already the inflated value. That means all my comments on reply_timeout are true, but will most likely *not* help with the wrong busy values. So I'm a little clueless on how to proceed further ...

With the busy count in the access log, we could see whether the unwanted increase in busy mostly happens during busy hours (assuming that your load is not totally evenly distributed over the day). If we really have a problem with non-atomic increments/decrements, we should expect it to happen mostly during the peak hours.

Possible workarounds:

- Don't use lb method busyness (yes, I know you've got good reasons to use it, but unfortunately now also one reason not to use it).

- (Gracefully) restart the web server, as sketched below. If you can live with the problem for a day, you could restart the web server once every night. Caution: this will reset all settings changed in mod_jk via the status worker during the day back to the values configured in the config files, e.g. activation status (disabled, stopped).

- I could find out whether one can easily change the busy value from the outside in the shm file. Of course we don't know the correct busy value, so we could only reset it to 0. If we decrement the value in mod_jk and it would go below 0, we keep it at zero. So there is some chance that after resetting it to zero, it will get close to the correct number after some time (we would only need a short moment where the load is statistically low).

No really nice option available at the moment ...

How many threads per process do you have configured for the MPM? 60?

No advice on --enable-flock.

Do you think you could reproduce this with a simple test scenario? That would open up the opportunity to test patches. As I said, we can't reproduce it and I don't know of any other such case.
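To make the restart workaround concrete, a nightly graceful restart can be scheduled via cron; a sketch only (the path and time are illustrative, and the caveat above about status-worker changes being reset still applies):

    # /etc/crontab entry: graceful restart of httpd at 04:30 every night
    30 4 * * * root /usr/sbin/apachectl graceful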
Yeah, I did a bunch of load testing against these systems when we moved to 2.2 and 1.2.25 (we were using 2.0 and an internally hacked version of 1.2.<something old, perhaps 12>), where I tossed a ton of requests at them, to the tune of 5000 req/sec through JK, with a couple of servers on the backend that soaked up the dummy requests with a random delay. These tests would run for over a day and the busy count would end up going back down to 0. I suspect that this is event driven, and that the load relationship exists simply because higher-loaded systems are more likely to experience the event.

Is there a case you can think of where the busy counter is incremented but the thread is aborted? Is there JK logging that could be turned on for a while that would log the increment and decrement of the counter per request? That way they could be paired up, and the specific requests not being decremented could be identified and examined for issues. It seems like the access log results would not tell us this, because they would only show the overall state at the completion of the request, not whether the thread actually incremented and decremented the counter. I don't really know what I'm talking about here, so let me know if something like that is possible or not very practical. Thanks, Andrew
Before I start to think deeper about your suggestion: could it be related to "apachectl restart" or "apachectl graceful"? Or would you say that doesn't fit what you see? Second possibility: are we sure that you don't experience any process crashes (segfaults etc.)? Normally those should get logged in the usual httpd error log.
Well, I'm not 100% sure what you mean by related to the Apache graceful/restart. When we do a graceful (we don't really do full restarts because of the live nature of the site), the counts do seem to reset, so I don't *think* there's a problem there. I don't see any segfaults in the log, just junk like "request failed: error reading the headers" from crappy bots and script kiddies.

Also, I just started looking through the code, and it seems like the difference between optimistic locking and pessimistic locking is that pessimistic locking just does the same locking/unlocking sequence in more places, is that right? I'm not a C programmer by any stretch of the imagination.
Optimistic/pessimistic: that's right. As I said, switching to optimistic will most likely not help with this issue, but I would expect less risk of other issues.

Graceful: I was thinking about graceful restarts, which let threads finish their request, but thought they might not be able to decrease the busy counter in the shared memory any more. As you said, that does not seem to be the reason for your problem.

Another comment: the busy count is not really what's used by the Busyness method. We actually use what's called "V" in the status output. It should be the same as busy in case all load factors are 1, and otherwise somewhat bigger numbers indicating weighted busyness. Since you observe problems in the balancing, I expect the "V" values to get inflated as well. Right? The factor by which V differs from busy is shown as the multiplicity "M" in the status output. Those are integers which have the same ratios for the different workers as 1/(load factor).
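As a worked illustration of the relationship described above (the numbers are invented for the example): with two workers having load factors 20 and 10, the multiplicities would be M=1 and M=2 respectively (ratios 1/20 : 1/10). A worker showing busy=5 and M=2 would then carry V = 5 * 2 = 10, so any inflation in busy shows up in V multiplied by the same factor M.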
Yeah, the relationship between M, V, and the Busy is correct (which is to say that V is inflated).
We've noticed this problem in our production environment as well. At present it is a low-load point in the day (when we are processing ~20 requests), yet the busy count is reported as 400+ (which is about MAX - 20 in jk-status). Nothing short of restarts seems to correct the count.

I've literally tried all the timeout values I could find (on both ends, mod_jk and JBoss), keeping of course in mind their intended purposes as described in the docs, but none seemed to have any effect on the busy count not decrementing.

I did a JBoss thread dump and found 146 threads with the word "8009" in them, but according to Rainer's analysis only a handful (11 to be exact; I grepped RUNNABLE threads only) of them are actually processing requests.

Curiously enough, we did not seem to have this issue in our test environment (with nearly identical configurations), even after intensive load testing. I'm not sure if this is something specific to our environment. For the record:

SunOS mega 5.10 Generic_127111-06 sun4v sparc SUNW,Sun-Fire-T200

Server Version: Apache/2.2.6 (Unix) mod_jk/1.2.25
JK Version: mod_jk/1.2.25

Server version: Apache/2.2.6 (Unix)
Server built: Jan 19 2008 22:12:59
Server's Module Magic Number: 20051115:5
Server loaded: APR 1.2.11, APR-Util 1.2.10
Compiled using: APR 1.2.11, APR-Util 1.2.10
Architecture: 32-bit
Server MPM: Worker
  threaded: yes (fixed thread count)
  forked: yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/worker"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_FCNTL_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr/local/httpd-cls01"
 -D SUEXEC_BIN="/usr/local/httpd-cls01/bin/suexec"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

We are currently running in degraded mode due to a session replication issue, so for the time being only one node is responding to requests. Perhaps this is part of the equation?
I forgot to mention the rate of the leak... we last did a restart about 48 hours ago. We seem to be incrementing the busy count by about 200 per day. (In reply to comment #15)
Hi, I wanted to know if there is any intent to fix this issue, as I have experienced EXACTLY the same symptoms in my production environment and am unable to reproduce them in a test environment by simply throwing a load tester at it. I'm using:

Server Version: Apache/2.0.52 (CentOS)
JK Version: mod_jk/1.2.26
JBoss Version: 4.2.2 GA

I have also set about every timeout I could find in the workers.properties documentation, as well as the timeout on the AJP connector inside Tomcat. I was incredibly happy to see this bug filed, as every time I see a forum post with the same issue, someone is quick to respond by telling the poster to set all these timeout settings. Thanks.
I too have been having this exact same issue in my PROD environment. I am using version 1.2.27 of mod_jk and running IHS on Solaris. The symptoms are identical to the ones described in this bug. We have spent a good part of the last 3 weeks trying to address the issue of the inflated busy count. Are there any plans to resolve this issue once and for all? Thank you
Having studied this at length: the locking mechanism is not the issue. fcntl() is likely the most reliable (not the fastest), and flock(), which is not used by default, would be far less reliable, as it isn't supported by NAS solutions. The faulty assumption that 'volatile' magically enables atomic math operations in SMP environments is the issue. 'int64' is not atomic on any platform, and it would appear that 'int' increments are acting appropriately for affected users while their decrements are failing. All operations on the scoreboard must be mutexed.
Thanks for looking into this, Bill! Should 32-bit operations be atomic? The busy counter could well be 32 bits. For most of the other pure statistics counters, like bytes transferred, total request number etc., I'm not too concerned about the correctness of the numbers, but the busy count influences the work of the balancer. Regards, Rainer
Well, we are already misusing the phrase optlock vs. real locking, so perhaps an additional pedantic mode that always locks? 32-bit math is not SMP-safe on most architectures. There was an apr_atomic interface that illustrated that; you can refer to the apr project sources if you like.
At the risk of stating the obvious: 32-bit reads and writes are guaranteed atomic by the JVM. However, an arithmetic operation (e.g. an increment) requires two operations, a read and then a write, and another thread can write the field between the two atomic operations. 64-bit reads and writes are not even guaranteed atomic. This was to allow for systems that do not have 64-bit operations and therefore have to perform two 32-bit operations. (Note: they probably are atomic on most modern systems, depending on alignment.) Adding volatile makes 64-bit reads and writes atomic, but does not protect multi-step operations such as increment.
(In reply to comment #22) > At the risk of stating the obvious: Interesting but irrelevant. This is httpd module code, not Java code.
Oops, sorry! Though a JVM is not involved here, increment still requires two separate memory operations. BTW, shared memory can behave very unexpectedly - e.g. writes by one CPU may appear out of order to another CPU unless appropriate instructions (memory barriers) are used. This is in addition to the problems caused by interleaved reads/writes, which can affect arithmetic operations potentially even on a single CPU system.
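A small stand-alone C demo of the interleaving problem described above (this is not mod_jk code; the names, thread count and iteration count are arbitrary). Built with gcc -pthread, the plain counter usually ends up short of the expected total, while the __sync_* variant does not:

    #include <pthread.h>
    #include <stdio.h>

    #define THREADS 4
    #define ITERS   1000000

    static volatile unsigned int plain_busy  = 0;  /* "volatile" alone, like the old busy counter */
    static volatile unsigned int atomic_busy = 0;  /* updated with an atomic builtin */

    static void *worker(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < ITERS; i++) {
            plain_busy++;                           /* read-modify-write: updates can be lost */
            __sync_add_and_fetch(&atomic_busy, 1);  /* gcc atomic builtin: no lost updates */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[THREADS];
        int i;
        for (i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        printf("plain=%u atomic=%u expected=%u\n",
               plain_busy, atomic_busy, (unsigned)THREADS * ITERS);
        return 0;
    }

The demo uses threads within a single process; mod_jk's counter additionally lives in shared memory across httpd processes, but the lost-update mechanism is the same.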
(In reply to comment #21) We are experiencing this problem with mod_jk 1.2.28 and Tomcat 6.0.26 on Ubuntu Lucid 64-bit, so I can confirm this is still very much an issue. When we graceful Apache, the counter seems to reset back to 0... however, an Apache graceful sets the counter back to 0 even if there *ARE* active connections (this is probably expected). When those active connections finish up, the busy counter does not go negative; it stays at zero (which is correct). Are there plans to fix this issue? We have built into our Tomcat init.d wrapper the ability to gracefully bounce a JVM without interrupting or losing any pending Apache requests, or requests that come in while the JVM is restarting. Unfortunately this bug makes that impossible, because we think there are active pending requests when there are not.
Created attachment 28807 [details] Mod_jk scheduling errors
For anyone having this issue, you should know that mod_proxy can essentially be used as a functional replacement for mod_jk. While mod_proxy is theoretically less efficient than mod_jk, we found that the added overhead was minimal and of essentially no consequence, as the Apache-Tomcat bridge was far from being the bottleneck. I worked around this issue by moving to mod_proxy; if this avenue is available to you, it is worth exploring.
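For reference, a minimal httpd sketch of such a mod_proxy-based setup (the host addresses, port and path are placeholders; it requires mod_proxy, mod_proxy_ajp and mod_proxy_balancer, which are available in httpd 2.2 and later):

    <Proxy balancer://appcluster>
        BalancerMember ajp://10.0.0.1:8009
        BalancerMember ajp://10.0.0.2:8009
    </Proxy>
    ProxyPass /myapp balancer://appcluster/myapp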
I added a warning to the docs in r1649182. The only real solution would be the introduction of atomics to "sync" increments and decrements of the busy counter. We could use APR atomics for mod_jk and would have to look for Windows atomics for IIS. That would leave NSAPI open. Since it seems to be used only very infrequently, we could leave the problem as-is for NSAPI.
I have now changed the implementation of the busy counter to use atomics (see r1649306). This is based on the builtin atomics on Windows, gcc atomics if gcc is used, or APR atomics if APR is available. In other cases, still no atomics are used. Some smaller problems might still exist; e.g. when doing a web server restart the busy number might not be correct. I wasn't able to reproduce such behavior, but it might happen. Most of the cases of skewed numbers should be fixed by the atomics, so I'm closing this ticket now.
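To illustrate the kind of dispatch described above, a hedged C sketch (the identifiers, the HAVE_APR guard and the overall structure are illustrative only and are not the actual mod_jk code from r1649306):

    #if defined(WIN32)
    #include <windows.h>
    typedef volatile LONG jk_busy_t;
    #define JK_BUSY_INC(p) InterlockedIncrement(p)      /* Windows builtin atomics */
    #define JK_BUSY_DEC(p) InterlockedDecrement(p)
    #elif defined(__GNUC__)
    typedef volatile unsigned int jk_busy_t;
    #define JK_BUSY_INC(p) __sync_add_and_fetch((p), 1) /* gcc builtin atomics */
    #define JK_BUSY_DEC(p) __sync_sub_and_fetch((p), 1)
    #elif defined(HAVE_APR)                             /* hypothetical "APR available" guard */
    #include <apr_atomic.h>
    typedef volatile apr_uint32_t jk_busy_t;
    #define JK_BUSY_INC(p) apr_atomic_inc32(p)          /* APR atomics */
    #define JK_BUSY_DEC(p) apr_atomic_dec32(p)
    #else
    typedef volatile unsigned int jk_busy_t;            /* no atomics available: */
    #define JK_BUSY_INC(p) (++*(p))                     /* plain, non-atomic fallback */
    #define JK_BUSY_DEC(p) (--*(p))
    #endif

    /* usage sketch: the counter itself lives in the shared memory segment */
    /* JK_BUSY_INC(&shm_worker->busy);   before forwarding the request     */
    /* JK_BUSY_DEC(&shm_worker->busy);   after the reply has been sent     */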
Should we add a WARN-level log message if the user requests "busyness" and the binary does not support atomic busyness-tracking? I think that's a reasonable warning to issue. We could just emit that log on startup... no need to spew logs repeatedly.
Warning added in r1650043:

[Wed Jan 07 12:53:10.882 2015] [6976:1] [warn] init::jk_lb_worker.c (1908): Missing support for atomics: LB method 'busyness' not recommended