Summary: Core dumps on Solaris under concurrent load.
Product: Tomcat Connectors
Reporter: Jorge <jorge.degraciasantos>
Component: Common
Assignee: Tomcat Developers Mailing List <dev>
Description Jorge 2005-12-09 12:51:49 UTC
Hi! We've been trying to set up Apache + Tomcat + mod_jk on both Solaris 8 and 9, but we keep coming across the following notice in Apache's error_log:

[Wed Dec 07 10:57:20 2005] [notice] child pid 2198 exit signal Segmentation fault (11), possible coredump in [/apache's/server/root/directory]

Here's the setup:

UltraSPARC hardware (Sun ES 450 with 4GB RAM and 4 440MHz processors)
Solaris 9 (tried on another machine with Solaris 8 too)
JDK1.4.2
Apache 2.0.54 mpm=worker
Tomcat 5.0.28
mod_jk1.2.15 (we've tried compiling every version from 1.2.10 to 1.2.15 and we've tried the binary versions available on the official download site. Same results on ALL of them).

We've tried other scenarios, such as Apache on one machine and Tomcat on another, or one Apache and several Tomcats on other machines in a load balancing setup, and in every case we get the same problem. The server works just fine under light load, but...

We are performing some load tests on the server. Whenever the server goes through heavy (or even fairly mild) load, we get a core dump notice, together with the actual core dump, every few seconds or so. At first, we felt that we might be pushing the limit on performance, and that the core dump was a result of that. However, our tests show that the core dumps start appearing under what is very light load for a server of this kind. Apparently, 10 to 20 concurrent users sending page requests is our threshold. Below that load, there are no errors. Above it, they reappear consistently.

The server seems to work very well under user loads much bigger than 20 concurrent users (i.e. 100-200), and the core dumping hardly seems to affect performance, because the requests keep being served anyway. During the initial tests we hardly noticed that anything was going wrong, until we checked the logs.

We are blaming the problem on mod_jk because we've tried both Apache and Tomcat standalone with no errors under the same load scenario. However, the mod_jk log doesn't seem to show anything relevant, even in debug mode. Only the notice in Apache's log shows up.

My knowledge of core dump analysis is next to none. Can anyone help me with that so that we can work out how to fix it?
Comment 1 Jorge 2005-12-09 14:15:25 UTC
Here's the output of "where" after running gdb httpd -c core:

#0 0xfee85f8c in __lwp_park () from /usr/lib/libthread.so.1
#1 0xfee81d08 in mutex_lock_queue () from /usr/lib/libthread.so.1
#2 0xfee82708 in slow_lock () from /usr/lib/libthread.so.1
#3 0xfef43c64 in free () from /usr/lib/libc.so.1
#4 0xfef54660 in tzcpy () from /usr/lib/libc.so.1
#5 0xfef54318 in _ltzset_u () from /usr/lib/libc.so.1
#6 0xfef533a4 in localtime_u () from /usr/lib/libc.so.1
#7 0xfedee5d0 in set_time_str (str=0xf93f8a2c "C\231\177╗worker1 from 10 ", len=-17029004) at jk_util.c:134
#8 0xfedee9d0 in jk_log (l=0xd12e8, file=0xfee067e8 "mod_jk.c", line=1917, funcname=0xfee07410 "jk_handler", level=4, fmt=0xfee074d8 "Could not get endpoint for worker=%s") at jk_util.c:309
#9 0xfede9a7c in jk_handler (r=0x39f5b8) at mod_jk.c:1917
#10 0x48fbc in ap_run_handler (r=0x39f5b8) at config.c:152
#11 0x49560 in ap_invoke_handler (r=0x39f5b8) at config.c:364
#12 0x34148 in ap_process_request (r=0x39f5b8) at http_request.c:249
#13 0x2f604 in ap_process_http_connection (c=0x395680) at http_core.c:251
#14 0x54634 in ap_run_process_connection (c=0x395680) at connection.c:43
#15 0x45c34 in process_socket (p=0x395558, sock=0x395590, my_child_num=0, my_thread_num=87, bucket_alloc=0x39b570) at worker.c:521
#16 0x46304 in worker_thread (thd=0x114de0, dummy=0x395558) at worker.c:835
#17 0xff1d4aa8 in dummy_worker (opaque=0x114de0) at thread.c:105
Comment 2 Mladen Turk 2006-03-17 09:41:46 UTC
I have never observed such behavior. There was a bug in the way Solaris deals with shared memory that was causing core dumps, but never anything like this. The interesting thing is that it fails in strformat, so try adjusting the JkLogStampFormat.
Comment 3 Rainer Jung 2006-03-17 20:21:01 UTC
Hi Jorge, Hi Mladen,

are you sure that the stack is from the right thread? If so, the Solaris 9 man page says localtime is not MT-safe:

The return values for ctime(), localtime(), and gmtime() point to static data whose content is overwritten by each call.
...
The asctime(), ctime(), gmtime(), and localtime() functions are unsafe in multithread applications. The asctime_r() and gmtime_r() functions are MT-Safe. The ctime_r(), localtime_r(), and tzset() functions are MT-Safe in multithread applications, as long as no user-defined function directly modifies one of the following variables: timezone, altzone, daylight, and tzname. These four variables are not MT-Safe to access. They are modified by the tzset() function in an MT-Safe manner. The mktime(), localtime_r(), and ctime_r() functions call tzset().
Comment 4 Rainer Jung 2008-01-03 05:32:00 UTC
We are switching to a thread-safe variant of localtime() in version 1.2.27. I still doubt that this was the cause.
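
For readers unfamiliar with the distinction discussed in comments 3 and 4, here is a minimal sketch of formatting a log timestamp with localtime_r() instead of localtime(). It is an illustration only, not the actual mod_jk change: the function name format_log_time and its parameters are hypothetical, and it assumes a POSIX system that provides localtime_r().

```c
#define _POSIX_C_SOURCE 200112L /* expose localtime_r() under strict compilers */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not taken from mod_jk): formats `when` into `buf`
 * according to the strftime() format string `fmt`.
 *
 * localtime() returns a pointer to static data shared by all threads.
 * localtime_r() instead fills in a struct tm supplied by the caller,
 * so each thread works on its own copy.
 *
 * Returns the number of characters written (excluding the terminating
 * NUL), or 0 if the conversion failed or the buffer was too small.
 */
size_t format_log_time(char *buf, size_t len, const char *fmt, time_t when)
{
    struct tm tms; /* per-call storage, lives on this thread's stack */
    size_t n;

    assert(buf != NULL && fmt != NULL); /* caller errors, not runtime ones */
    if (len == 0)
        return 0;

    buf[0] = '\0';
    if (localtime_r(&when, &tms) == NULL)
        return 0;

    n = strftime(buf, len, fmt, &tms);
    if (n == 0)
        buf[0] = '\0'; /* strftime leaves buf unspecified on overflow */
    return n;
}
```

A caller would pass its own buffer, for example format_log_time(stamp, sizeof(stamp), "[%a %b %d %H:%M:%S %Y] ", time(NULL)). Per the man page quoted in comment 3, localtime_r() still calls tzset(), so this removes the shared static result buffer but does not by itself prove that the crash in the trace above came from localtime().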