Summary: | Segmentation Fault in shmcb_cyclic_cton_memcpy | ||
---|---|---|---|
Product: | Apache httpd-2 | Reporter: | Ken Avery <ken.avery> |
Component: | mod_ssl | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
Status: | CLOSED FIXED | ||
Severity: | critical | CC: | geoff |
Priority: | P3 | ||
Version: | 2.0.48 | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | Linux |
Description
Ken Avery
2004-03-17 17:26:38 UTC
Here is information from one of our developers:

While attempting to locate the cause of what appears to be a memory consumption problem in the SSL code, the server segmentation faults. The first worker child and all of its child threads continue to consume memory while the parent stays the same or gets a little smaller. The child threads never give the memory back unless restarted. Please advise if this is expected behavior. Running with 'SSLSessionCache none' doesn't consume memory (and doesn't seg fault), but it performs poorly when using 2048 bit keys.

I observed the segmentation fault in mod_ssl while running the small script listed below. Based on the stack information, the issue appears to be in shmcb_cyclic_cton_memcpy() during an attempt to remove a session id. The server keeps on responding, but all the child threads die and are restarted. I am not sure what is happening, but the following variables seem to get corrupted.

The stack trace shows these are supposed to be:

    src_offset=6402
    src_len=10240

Inside the frame they have these values:

    (gdb) print src_offset    (in edi register)
    $55 = 3183473748
    (gdb) print src_len       (in edx register)
    $56 = 3183464512

The configuration file and my initial debug session are attached.

Apache error_log:

    ...
    [Mon Mar 15 11:21:33 2004] [notice] Apache/2.0.48 configured -- resuming normal operations
    [Mon Mar 15 11:25:28 2004] [error] server reached MaxClients setting, consider raising the MaxClients setting
    [Mon Mar 15 11:38:29 2004] [notice] child pid 1065 exit signal Segmentation fault (11)
    [Mon Mar 15 12:06:28 2004] [notice] child pid 1154 exit signal Segmentation fault (11)
    [Mon Mar 15 12:44:49 2004] [notice] child pid 1258 exit signal Segmentation fault (11)
    [Mon Mar 15 13:04:40 2004] [notice] child pid 1315 exit signal Segmentation fault (11)
    [Mon Mar 15 13:17:29 2004] [notice] child pid 1363 exit signal Segmentation fault (11)
    [Mon Mar 15 13:45:12 2004] [notice] child pid 1401 exit signal Segmentation fault (11)
    ...
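As a reading aid for the gdb values above, here is a minimal sketch that prints each observed value next to the expected one, in hexadecimal and reinterpreted as a signed 32-bit integer. The helper names (`to_signed32`, `describe`) are hypothetical and are not part of mod_ssl or gdb; only the four numbers come from the report.

```python
# Values copied from the gdb session in the report.
EXPECTED = {"src_offset": 6402, "src_len": 10240}
OBSERVED = {"src_offset": 3183473748, "src_len": 3183464512}


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe(name: str) -> str:
    """Format one variable as decimal, hex, and signed 32-bit."""
    seen = OBSERVED[name]
    return (f"{name}: expected {EXPECTED[name]}, observed {seen} "
            f"(hex {seen:#010x}, signed {to_signed32(seen)})")


if __name__ == "__main__":
    for name in OBSERVED:
        print(describe(name))
```

The two observed values differ by only 9236 and neither is close to 2**32, so they do not look like small negative numbers stored in an unsigned variable.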
OS RedHat 7.3
gcc-2.96-113
glibc-2.2.5-43
openssl-0.9.6b-35.7
Apache 2.0.48

Build Script:

    ./configure --with-program-name=leakd --with-port=9200 --with-mpm=worker --enable-ssl=shared --enable-maintainer-mode \
    --enable-proxy=shared --enable-cgi=shared --enable-setenvif=shared --enable-cgi=shared --enable-access=shared \
    --enable-rewrite=shared --enable-dir=shared --enable-actions=shared --enable-mime=shared --enable-proxy_connect=shared \
    --enable-proxy_http=shared --enable-negotiation=shared --enable-alias=shared --enable-env=shared --enable-dir=shared \
    --enable-mod-actions=shared --enable-log-config=shared --enable-imap=shared --enable-headers=shared \
    --enable-layout=webserver --disable-autoindex --disable-userdir --disable-usertrack --disable-cgid \
    --disable-asis --disable-auth --disable-auth_digest --disable-auth_dbm --disable-auth_anon --disable-dav \
    --disable-dav_fs --disable-vhost_alias --disable-unique_id --disable-speling --disable-cern_meta --disable-include \
    --disable-expires --enable-status=shared --enable-info=shared

ldd leakd:

    libssl.so.2 => /lib/libssl.so.2 (0x40024000)
    libcrypto.so.2 => /lib/libcrypto.so.2 (0x40052000)
    libaprutil-0.so.0 => /usr/webserver/lib/libaprutil-0.so.0 (0x40119000)
    libgdbm.so.2 => /usr/lib/libgdbm.so.2 (0x4012d000)
    libdb-3.3.so => /lib/libdb-3.3.so (0x40133000)
    libexpat.so.0 => /usr/lib/libexpat.so.0 (0x401c2000)
    libapr-0.so.0 => /usr/webserver/lib/libapr-0.so.0 (0x401e1000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x40200000)
    librt.so.1 => /lib/librt.so.1 (0x40215000)
    libm.so.6 => /lib/libm.so.6 (0x40226000)
    libcrypt.so.1 => /lib/libcrypt.so.1 (0x40247000)
    libnsl.so.1 => /lib/libnsl.so.1 (0x40274000)
    libdl.so.2 => /lib/libdl.so.2 (0x40288000)
    libc.so.6 => /lib/libc.so.6 (0x4028c000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

A simple script on an external machine downloads copies of the stock Apache index.html.en page from both the unsecure and secure sites:

    #!/bin/sh
    counter=0
    limit=32000
    while [ "$counter" -lt "$limit" ]
    do
        wget -O - http://myboxaddr:9200
        wget -O - https://myboxaddr:9201
        counter=`expr $counter + 1`
        echo "Count=> $counter"
    done

I added some log messages to the code, and turned on debugging. I tried both SSLMutex file:logs/ssl_mutex and SSLMutex default. It takes longer to seg fault with SSLMutex default, but the stack trace is basically the same. The debug error_log traces are available for both test runs if you want them.

Finally, the src_offset & src_len variables are not changing. GDB just doesn't reset the registers when you move back in the stack frame.

It seems to me that src_offset and src_len are getting corrupted somehow, but it's not obvious to me where or how this is happening. The versions you're using of redhat, glibc, gcc (etc) are a little dated, and though I'm reluctant to dismiss the issue as being old tools, it would certainly be something to consider - if you're able to build using a different gcc or mess with the optimisation levels, that might hint as to whether this is compiler sensitive or something more macabre.

Also, is it possible to insert some debugging lines in the last two frames around the problem area to dump the exact values being passed around? I'm curious how and where those values are getting mangled. As/when you hit a segfault, it would be useful to have something to help pinpoint where the corruption was introduced. (Another possible hint: could those "corrupt" values actually be some unsigned representation of a negative - indicating a possible bug in the "cyclic" logic?)

I've added myself to the CC line for this ticket, please let me know how you get on with this.

Logging messages were added into the function to print out the values for src_len and src_offset, and they were actually not changing. The seg fault is in memcpy() frame #0. When you move back to frame #1 to examine things, gdb 5.2-2 does not reload the registers.
Local variables were created inside the function, and assigned the values src_offset & src_len upon entry. The end result was the same (seg fault). It could be the tools, but everything is fine for 15-20 minutes. The function is called 305 times before a failure with the last three calls shown below:

    CALLER == shmcb_remove_session_id()
    CALLED == shmcb_cyclic_cton_memcpy()
    [Wed Mar 17 17:13:20 2004] [info] CALLER: header->cache_data_size=7190 src_offset=3972 src_len=10240
    [Wed Mar 17 17:13:20 2004] [info] CALLED: buff_size=7190 src_offset=3972, src_len=10240
    [Wed Mar 17 17:13:20 2004] [info] CALLER: header->cache_data_size=7190 src_offset=7166 src_len=10240
    [Wed Mar 17 17:13:20 2004] [info] CALLED: buff_size=7190 src_offset=7166, src_len=10240

I have two debug traces.

Ouch, ok - I have this gloomy sense that I'm about to dive back into apache code ...

I notice you're on apache 2.0.48 ... I could try to help track the problem in that version and worry about migrating it (if applicable) to cvs after, but to avoid the potential for logjams with other issues already fixed, are you able to move to 2.0.49, or better still, CVS (head or 2.0.**-stable)? At the least, have you diffed the ssl module source against later releases or CVS to check if any fixes have already been made that might cover this?

Whatever you do w.r.t. apache versions - please email me a copy of the first few pages of a *trace* log during startup (this should give me all the shmcb geometry settings), and then the last few pages leading up to your first crash. I noticed from the info you've already provided that you are caching sessions around ~10Kb, which would indicate that you're using client-authentication and probably with some biggish certs (or longish cert-chains). My hunch is that this is triggering some wrap-around issue, either in the cyclic logic itself or in the use of variables of insufficient size.

Please mail me the details privately, no point drowning the bugzilla database.
As/when I have potential suggestions/fixes, how should we handle that? Can I send you diffs to try? Can I shell to a box where this can be reproduced? Thanks again for the detailed report.

Geoff's fix for this is now committed to HEAD and the 2.0 branch - thanks for the report, and thanks to Geoff for tracking it down. |
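In the last logged call, src_len=10240 is larger than buff_size=7190, so the requested copy is longer than the cyclic buffer it reads from. Whether that mismatch is the root cause is not stated in this report. The sketch below is a hypothetical Python model of a wrap-around copy, not mod_ssl's implementation; the function name `cyclic_copy` and its bounds checks are assumptions added for illustration. It shows how such a request can be rejected instead of reading past the end of the buffer.

```python
def cyclic_copy(buf: bytes, src_offset: int, src_len: int) -> bytes:
    """Copy src_len bytes out of a cyclic buffer, wrapping at most once.

    Hypothetical model for illustration; this is not mod_ssl's implementation.
    """
    size = len(buf)
    if not 0 <= src_offset < size:
        raise ValueError(f"src_offset {src_offset} outside buffer of {size} bytes")
    if not 0 <= src_len <= size:
        raise ValueError(f"src_len {src_len} exceeds buffer of {size} bytes")
    first = min(src_len, size - src_offset)   # bytes available before the wrap
    rest = src_len - first                    # bytes taken from the start
    return buf[src_offset:src_offset + first] + buf[:rest]


if __name__ == "__main__":
    data = bytes(i % 256 for i in range(7190))   # buff_size from the log
    print(len(cyclic_copy(data, 7166, 100)))     # a copy that wraps once
    try:
        cyclic_copy(data, 7166, 10240)           # values from the last logged call
    except ValueError as err:
        print("rejected:", err)
```

With the logged values, only 24 bytes remain between src_offset=7166 and the end of the 7190-byte buffer, so a single wrap would still leave 10216 bytes to read from a buffer that holds 7190.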