Bug 36281 - problem with Failover in jk_lb_worker.c
Summary: problem with Failover in jk_lb_worker.c
Alias: None
Product: Tomcat Connectors
Classification: Unclassified
Component: Common (show other bugs)
Version: unspecified
Hardware: Sun Solaris
: P2 normal (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
Depends on: 36525
  Show dependency tree
Reported: 2005-08-19 21:08 UTC by Chuck Betts
Modified: 2008-10-05 03:09 UTC (History)
0 users

A full log file showing one failed request (68.15 KB, application/octet-stream)
2005-08-19 21:09 UTC, Chuck Betts
workers.properties (434 bytes, text/plain)
2005-08-19 22:21 UTC, Chuck Betts
Mod_jk configuration file (281 bytes, text/plain)
2005-08-19 22:22 UTC, Chuck Betts

Note You need to log in before you can comment on or make changes to this bug.
Description Chuck Betts 2005-08-19 21:08:37 UTC
Apache 1.3.33
Tomcat 5.0.30
mod_jk 1.2.14
Solaris 9


There seems to be a problem in supporting Failover of a failed Tomcat server on 
Solaris 9 with mod_jk 1.2.14.  When I simulate the failed Tomcat server by 
pulling the network cable, the worker will do all the appropriate 
socket_timeouts and retries, but when that fails, as it should, another worker 
is not chosen, and the failed worker is not put in_error.  Here is a snippet 
from the log:

[trace] service::jk_lb_worker.c (551): enter
[trace] get_most_suitable_worker::jk_lb_worker.c (453): enter
[debug] get_most_suitable_worker::jk_lb_worker.c (539): found best worker 
(giraffe) using by request method
[trace] get_most_suitable_worker::jk_lb_worker.c (543): exit
[debug] service::jk_lb_worker.c (587): service worker=giraffe jvm_route=giraffe
[trace] ajp_service::jk_ajp_common.c (1630): enter
[debug] ajp_service::jk_ajp_common.c (1670): processing with 3 retries
[info]  ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed,  
recoverable operation attempt=1
[info]  ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed,  
recoverable operation attempt=2
[info]  ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed,  
recoverable operation attempt=3
[error] ajp_service::jk_ajp_common.c (1758): Error connecting to tomcat. Tomcat 
is probably not started or is listening on the wrong port. worker=giraffe failed
[trace] ajp_service::jk_ajp_common.c (1768): exit
<log ends>

The ajp_service::jk_ajp_common.c method is called by the 
service::jk_lb_worker.c method, but there are no more log messages after 
ajp_service returns.  We should at least see an exit trace log for the service 
method.  I put in logging statements in the code, and found the offending lines.

From jk_lb_worker.c ln. 603-607:

service_stat = end->service(end, s, l, &is_service_error);
rec->s->readed += end->rd;
rec->s->transferred += end->wr;
end->done(&end, l);

A logging message directly after the end->service method is called is seen, but 
one right before the end->done method is called is not.
In any case, if I comment out the two lines that update the shared memory, 
everything works as expected.
Comment 1 Chuck Betts 2005-08-19 21:09:18 UTC
Created attachment 16117 [details]
A full log file showing one failed request
Comment 2 Rainer Jung 2005-08-19 21:33:06 UTC
I would like to see your workers.properties and the Jk directives in your httpd

Also: Could you please check, if the failing Apache child processes are still
there, or if they died? The log chows the PID of the relevant process directly
behind the date. If the process is still there, you can use

/usr/proc/bin/pstack PID

to write a thrad dump. That way we can confirm, in which function the hanging
process sits. You could attach the pstack.

If the process is not there any more, it might have crashed. In case you succeed
in dumping a core file of the process, you can again use pstack on that core file.

Finally it might be interesting to use truss (or if you are already familiar
with it dtrace) to find out, if the error is related to some system call.
Configure apache to only use very few children (like startng 2 servers and
having sparemax and min also equals to 2). Start the server via

truss -f -o some_output_file -w all -r all -v all \
/usr/local/apache/bin/apachectl start

and then redo your tests (apache will be a little slow). The file
some_output_file contains information about all system calls, about bytes read
and written, signals received, errnos etc.

To have a somewhat easier test case: In the log you attached one can see, that
the working request for the html produces to image requests going to two apache
processes in parallel which both fail. I ould be a simpler retest to only use
single requests, not something in parallel, e.g. not automatically reloading
embedded objevcts via the browser.
Comment 3 Chuck Betts 2005-08-19 22:21:00 UTC
Created attachment 16119 [details]
Comment 4 Chuck Betts 2005-08-19 22:22:49 UTC
Created attachment 16120 [details]
Mod_jk configuration file
Comment 5 Chuck Betts 2005-08-19 22:24:29 UTC
The pstack returned nothing, and I confirmed by watching TOP that the process 
is created then dies when the http request is over.

I am having trouble running truss, I'll get back to you with that when I can.
Comment 6 Rainer Jung 2005-08-19 23:43:35 UTC
Config is pretty basic and looks OK.
The process dead is strange and I don't see an immediate reason for it.

It would be good, if you could reproduce it with without parallel requests to
simplify the situation.

Did you build apache and mod_jk yourself? Have they been compiled using the same
compiler and the same CFLAGS?
Comment 7 Chuck Betts 2005-08-20 01:28:26 UTC
Sorry for all this.  We were building on a seperate machine using the gcc 
compiler, because our primary Sun build machine wasnt working at the time.  Now 
it is and I recompiled on it, and everything is hunkey-dorey now.
Comment 8 Chuck Betts 2005-12-09 02:14:27 UTC

The compiler we used was the Forte from Sun, the compiler flag is: 
CFLAGS="-mt -fast -xarch=v8plusa -xtarget=ultra2 -xO5 -KPIC"
LDFLAGS="-L/lib -R/usr/local/lib -R/usr/lib -R/lib -lpthread"

Comment 9 Rainer Jung 2005-12-09 03:23:33 UTC
Most likely this is a duplicate of 36525 and fixed in the already released
1.2.15. Since it was a 64Bit alignment bug, some compilers might show the bug,
others not.