mod_jk occasionally crashes Apache due to a buffer overflow.

mod_jk 1.2.41 (also happens with 1.2.37)
Apache 2.4.7
Tomcat 6.0.39
Java 1.6.0_45 x86
Linux Ubuntu 14.04 x64 (3.13.0-91-generic)

Here is the error log from Apache:

```
*** buffer overflow detected ***: /usr/sbin/apache2 terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7fe9aa7de29f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fe9aa875bbc]
/lib/x86_64-linux-gnu/libc.so.6(+0x109a90)[0x7fe9aa874a90]
/lib/x86_64-linux-gnu/libc.so.6(+0x10ab07)[0x7fe9aa875b07]
/usr/lib/apache2/modules/mod_jk.so(jk_open_socket+0x8d8)[0x7fe9a7c60cb8]
/usr/lib/apache2/modules/mod_jk.so(ajp_connect_to_endpoint+0x65)[0x7fe9a7c7bf75]
/usr/lib/apache2/modules/mod_jk.so(+0x36422)[0x7fe9a7c7d422]
/usr/lib/apache2/modules/mod_jk.so(+0x1674c)[0x7fe9a7c5d74c]
/usr/sbin/apache2(ap_run_handler+0x40)[0x7fe9ab65fbe0]
/usr/sbin/apache2(ap_invoke_handler+0x69)[0x7fe9ab660129]
/usr/sbin/apache2(ap_process_async_request+0x20a)[0x7fe9ab6756ca]
/usr/sbin/apache2(+0x69500)[0x7fe9ab672500]
/usr/sbin/apache2(ap_run_process_connection+0x40)[0x7fe9ab669220]
/usr/lib/apache2/modules/mod_mpm_event.so(+0x681b)[0x7fe9a783981b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe9aab38184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe9aa86537d]
======= Memory map: ========
7fe688000000-7fe68806a000 rw-p 00000000 00:00 0
7fe68806a000-7fe68c000000 ---p 00000000 00:00 0
.......
7fffa6c27000-7fffa6c48000 rw-p 00000000 00:00 0 [stack]
7fffa6c86000-7fffa6c88000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
[Wed Jun 29 05:01:50.052325 2016] [core:notice] [pid 1747:tid 140641581987712] AH00051: child pid 17018 exit signal Aborted (6), possible coredump in /etc/apache2
```

I was able to trace it down to the function nb_connect in jk_connect.c.
In version 1.2.41 the issue is at line 291:

```
280>     do {
281>         rc = connect(sd, (const struct sockaddr *)&addr->sa.sin, addr->salen);
282>     } while (rc == -1 && errno == EINTR);
283>
284>     if ((rc == -1) && (errno == EINPROGRESS || errno == EALREADY)
285>                    && (timeout > 0)) {
286>         fd_set wfdset;
287>         struct timeval tv;
288>         socklen_t rclen = (socklen_t)sizeof(rc);
289>
290>         FD_ZERO(&wfdset);
291>         FD_SET(sd, &wfdset);
292>         tv.tv_sec = timeout / 1000;
293>         tv.tv_usec = (timeout % 1000) * 1000;
294>         rc = select(sd + 1, NULL, &wfdset, NULL, &tv);
```

From what I understand, FD_SET only overflows the buffer when it is passed a descriptor number of 1024 (FD_SETSIZE) or higher. I made sure that my ulimit for open files is set and applied large enough, so that's not it.

I tried switching from FD_SET/select to poll, and it now also works for sd greater than 1024:

```
struct pollfd pfd_read;
pfd_read.fd = sd;
pfd_read.events = POLLOUT;
rc = poll(&pfd_read, 1, timeout);
```

This would be a possible fix for the problem; at least it works fine in my setup. Also, poll() is already used elsewhere in this particular source file, so no extra include is necessary.
Here are more configuration files.

/etc/libapache2-mod-jk/httpd-jk.conf:

```
<IfModule jk_module>
    JkWorkersFile /etc/libapache2-mod-jk/workers.properties
    JkLogFile /var/log/apache2/mod_jk.log
    JkLogLevel warn
    JkShmFile /var/log/apache2/jk-runtime-status
</IfModule>
```

/etc/libapache2-mod-jk/workers.properties:

```
workers.tomcat_home=/usr/share/tomcat6
workers.java_home=/usr/lib/jvm/java-6-sun
ps=/
worker.list=loadbalancer
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=ajp13_worker,ajp13_worker2
worker.loadbalancer.sticky_session=0
worker.ajp13_worker.port=xxx
worker.ajp13_worker.host=localhost
worker.ajp13_worker.type=ajp13
worker.ajp13_worker.ping_mode=A
worker.ajp13_worker.secret=xxx
worker.ajp13_worker.fail_on_status=503
worker.ajp13_worker.connection_pool_size=32768
worker.ajp13_worker.redirect=ajp13_worker2
worker.ajp13_worker2.port=xxx
worker.ajp13_worker2.host=otherhost
worker.ajp13_worker2.type=ajp13
worker.ajp13_worker2.ping_mode=A
worker.ajp13_worker2.secret=xxx
worker.ajp13_worker2.fail_on_status=503
worker.ajp13_worker2.connection_pool_size=32768
worker.ajp13_worker2.activation=disabled
```

/etc/tomcat6/server.xml:

```
<Connector port="xxx" protocol="AJP/1.3" redirectPort="8443"
           enableLookups="false" maxThreads="65536"
           minSpareThreads="25" maxSpareThreads="75"
           connectionTimeout="300000" packetSize="65536"
           request.secret="xxx" />
```

Apache mpm_event:

```
StartServers            2
ServerLimit             16
MinSpareThreads         256
MaxSpareThreads         1280
ThreadLimit             1024
ThreadsPerChild         1024
MaxRequestWorkers       16384
MaxConnectionsPerChild  0
```

Please also see my question about this on the tomcat-users mailing list (continued in July): https://mail-archives.apache.org/mod_mbox/tomcat-users/201606.mbox/%3CCABVo0f+stYj9=Cxrb-t+bhJaf_a9hX2wdvHsBYmE-bge_vwxTg@mail.gmail.com%3E
One more thing to add: although Apache mpm_event is used, most connections are via SSL, so AFAIK it should behave like mpm_worker.
Created attachment 34417 [details]
[PATCH] Use poll(2) in posix nb_connect

This issue is caused by limitations of the select(2) system call. From the (Linux) manpage:

> POSIX allows an implementation to define an upper limit, advertised via the
> constant FD_SETSIZE, on the range of file descriptors that can be specified
> in a file descriptor set. The Linux kernel imposes no fixed limit, but the
> glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined
> as 1024, and the FD_*() macros operating according to that limit. To
> monitor file descriptors greater than 1023, use poll(2) instead.

As Michiel already noted, poll(2) is already used in jk_connect.c, so switching to it doesn't add any new dependencies. I've attached a patch that uses poll(2) if it is available at compile time; otherwise it falls back to the current select(2) implementation.

In the long run, it would probably be preferable to use an event library such as libuv or libevent that abstracts over the kernel interface and automatically picks the optimal mechanism available (e.g. epoll on Linux and kqueue on FreeBSD). This would improve both portability and performance, and possibly code simplicity.
I think this patch is worth serious consideration and testing. (I feel like we had this conversation elsewhere, too.)
Many thanks for the patch. Applied to 1.2.x for 1.2.44 onwards.