Bug 37850 - Core dumps on Solaris under concurrent load.
Summary: Core dumps on Solaris under concurrent load.
Alias: None
Product: Tomcat Connectors
Classification: Unclassified
Component: Common
Version: unspecified
Hardware: Sun Solaris
Importance: P2 major (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
Depends on:
Reported: 2005-12-09 12:51 UTC by Jorge
Modified: 2008-10-05 03:09 UTC
0 users


Description Jorge 2005-12-09 12:51:49 UTC
Hi! We've been trying to set up Apache + Tomcat + mod_jk on both Solaris 8 and
9, but we keep coming across the following notice in Apache's error_log:

[Wed Dec 07 10:57:20 2005] [notice] child pid 2198 exit signal Segmentation
fault (11), possible coredump in [/apache's/server/root/directory]

Here's the setup:

UltraSPARC hardware (Sun ES 450 on 4GB RAM and 4 440MHz processors)
Solaris 9 (tried on another machine with Solaris 8 too)
Apache 2.0.54 mpm=worker
Tomcat 5.0.28
mod_jk 1.2.15 (we've tried compiling every version from 1.2.10 to 1.2.15, and
we've tried the binary versions available on the official download site. Same
results on ALL of them).

We've tried other scenarios, such as Apache on one machine and Tomcat on
another, or one Apache and several Tomcats on other machines in a load balancing
setup, and in every case we get the same problem. The server works just fine
under light load, but...

We are performing some load tests on the server. We get a core dump notice,
together with the actual core dump, every few seconds or so whenever the server
goes through heavy (or even fairly mild) loads. At first, we felt that we might
be pushing the limit on performance, and that the core dump was a result of
that. However, our tests show that the core dumps start appearing under very
light load for a server of this kind. Apparently, 10 or 20 concurrent users
throwing page requests is our threshold. Below that load, there are no errors.
Above it, they reappear consistently. The server seems to be able to work very
well on user loads much bigger than 20 concurrent users (i.e. 100-200), and the
core dumping hardly seems to affect performance, because the requests keep being
served anyway. We hardly noticed during the initial tests that something was
going so wrong - until we checked the logs.

We are blaming the problem on mod_jk because we've tried both Apache and Tomcat
standalone with no errors under the same load scenario. However, the mod_jk
log doesn't seem to show anything relevant - even in debug mode. Only the notice
in Apache's log shows up.

My knowledge of core dump analysis is next to none. Can anyone help me with that
so that we can work out how to fix it?
Comment 1 Jorge 2005-12-09 14:15:25 UTC
Here's the output of
gdb httpd -c core

#0  0xfee85f8c in __lwp_park () from /usr/lib/libthread.so.1
#1  0xfee81d08 in mutex_lock_queue () from /usr/lib/libthread.so.1
#2  0xfee82708 in slow_lock () from /usr/lib/libthread.so.1
#3  0xfef43c64 in free () from /usr/lib/libc.so.1
#4  0xfef54660 in tzcpy () from /usr/lib/libc.so.1
#5  0xfef54318 in _ltzset_u () from /usr/lib/libc.so.1
#6  0xfef533a4 in localtime_u () from /usr/lib/libc.so.1
#7  0xfedee5d0 in set_time_str (str=0xf93f8a2c "C\231\177╗worker1 from 10 ",
len=-17029004) at jk_util.c:134
#8  0xfedee9d0 in jk_log (l=0xd12e8, file=0xfee067e8 "mod_jk.c", line=1917,
funcname=0xfee07410 "jk_handler", level=4,
    fmt=0xfee074d8 "Could not get endpoint for worker=%s") at jk_util.c:309
#9  0xfede9a7c in jk_handler (r=0x39f5b8) at mod_jk.c:1917
#10 0x48fbc in ap_run_handler (r=0x39f5b8) at config.c:152
#11 0x49560 in ap_invoke_handler (r=0x39f5b8) at config.c:364
#12 0x34148 in ap_process_request (r=0x39f5b8) at http_request.c:249
#13 0x2f604 in ap_process_http_connection (c=0x395680) at http_core.c:251
#14 0x54634 in ap_run_process_connection (c=0x395680) at connection.c:43
#15 0x45c34 in process_socket (p=0x395558, sock=0x395590, my_child_num=0,
my_thread_num=87, bucket_alloc=0x39b570)
    at worker.c:521
#16 0x46304 in worker_thread (thd=0x114de0, dummy=0x395558) at worker.c:835
#17 0xff1d4aa8 in dummy_worker (opaque=0x114de0) at thread.c:105
Comment 2 Mladen Turk 2006-03-17 09:41:46 UTC
I have never observed such behavior.
There was a bug on Solaris dealing with shared memory that
was causing core dumps, but never something like this.

The interesting thing is that it fails in the string formatting, so try adjusting the
Comment 3 Rainer Jung 2006-03-17 20:21:01 UTC
Hi Jorge, Hi Mladen,

are you sure that the stack is from the right thread?

If so, the Solaris 9 man page says localtime() is not MT-Safe:

     The return values for  ctime(),  localtime(),  and  gmtime()
     point  to  static  data whose content is overwritten by each
     call.

     The asctime(), ctime(), gmtime(), and localtime()  functions
     are unsafe in multithread applications.  The asctime_r() and
     gmtime_r()   functions   are   MT-Safe.    The    ctime_r(),
     localtime_r(),  and  tzset()  functions  are MT-Safe in mul-
     tithread applications, as long as no  user-defined  function
     directly  modifies one of the following variables: timezone,
     altzone, daylight, and tzname.  These four variables are not
     MT-Safe to access. They are modified by the tzset() function
     in an MT-Safe manner.   The   mktime(),  localtime_r(),  and
     ctime_r() functions call tzset().
Comment 4 Rainer Jung 2008-01-03 05:32:00 UTC
We switched to a thread-safe variant of localtime() in version 1.2.27.
I still doubt that this was the cause.