Summary: | Segfault on startup when using mod_ssl with APR-crypto | ||
---|---|---|---|
Product: | Apache httpd-2 | Reporter: | Jacob Champion <jchampion> |
Component: | mod_ssl | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
Status: | NEW --- | ||
Severity: | normal | CC: | bz.apache, dimitry, paul.spangler |
Priority: | P2 | ||
Version: | 2.5-HEAD | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | Linux |
Description
Jacob Champion
2017-03-31 01:51:08 UTC
More research: 1) OpenSSL 1.1.0 no longer requires the use of CRYPTO_THREADID_set_callback() at all, and in fact the CRYPTO_THREADID_* stuff is now a no-op. See rsalz's post at https://www.openssl.org/blog/blog/2017/02/21/threads/ . 2) OpenSSL 1.0.2 (and possibly prior, but I haven't verified) contains default implementations for Windows and BeOS. For all other platforms, its fallback implementation uses the address of errno to tell threads apart. 3) Our current implementation of the threadid callback is possibly incorrect on some platforms, since we're using CRYPTO_THREADID_set_numeric() and truncating an apr_os_thread_t into the space of an unsigned long. This opens up the possibility of collisions. As a concrete example: this makes our implementation objectively worse than the default implementation on Windows, because we truncate a HANDLE into a long, whereas the default uses GetCurrentThreadId() directly. There is a CRYPTO_THREADID_set_pointer() that allows us to pass in a void*, and if it turns out we have to set a threadid callback, I think we should prefer the set_pointer() flavor if the native thread identifier is bigger than a long. This is all making me feel like we should try to dump the callback entirely on as many platforms as possible (i.e. those with an addressable, per-thread errno), and decide what to do with the remaining platforms (if any) on a case-by-case basis. This comment isn't related to the particular issue you've described with the callback, but is related to the title of the bug. We've also seen a rare segfault when mixing mod_ssl with APR-crypto, but in SSL_load_error_strings (which is supposedly fixed in 1.1.0, see https://rt.openssl.org/Ticket/Display.html?user=guest&pass=guest&id=2325): 1. SSL_load_error_strings sets up a static hash table in libeay containing error strings, some of which point to memory in ssleay. 2. libapr loads apr_crypto_openssl, which bumps the refcount on libeay. 3. Server executes normally... until say the child process crashes or a graceful restart. 4. All modules are unloaded, but APR drivers are not (i.e. apr_crypto_openssl remains loaded). 5. During unload, ssleay gets unloaded but libeay remains loaded due to step 2. 6. Modules are loaded again. 7. ssleay may load into a different location than it was in step 1. 8. The hash table from step 1 still exists (in libeay), but now if it tries to check one of the SSL error strings, it points to a now-invalid place in memory, crashing. Our research led to this (somewhat old) mailing list thread: http://mailing.openssl.dev.narkive.com/syUDbGxq/memory-corruption-after-libssl-is-unloaded-from-memory where one of the devs is "not sure it has ever been a feasible goal to make OpenSSL DSO/DLLs able to be unloaded (with the aim of subsequently loading)." In the end, we decided it was safest to pin ssleay and libeay in memory for the lifetime of the server via a module that preloads them and never unloads them. That "solution" wouldn't fix this particular callback issue since it's related to mod_ssl moving in memory, but I figured this might be worth sharing anyway. Feel free to ignore if it's irrelevant :) Maybe the "fix" here would also be to pin whatever the callback points to in memory (if that would even be feasible) so it survives a mod_ssl reload? (In reply to Paul Spangler from comment #2) > This comment isn't related to the particular issue you've described with the > callback, but is related to the title of the bug. We've also seen a rare > segfault when mixing mod_ssl with APR-crypto, but in SSL_load_error_strings > (which is supposedly fixed in 1.1.0, see > https://rt.openssl.org/Ticket/Display.html?user=guest&pass=guest&id=2325): Nice, thanks for the link! More evidence that perhaps the long-term fix for these sorts of problems is just to move to 1.1.0, where they've done away with a lot of the static global initialization stuff... > Maybe the "fix" here would also be to pin whatever the callback points to in > memory (if that would even be feasible) so it survives a mod_ssl reload? Interesting idea. I suppose we could have the callback(s) live in a completely separate DSO that never gets unloaded. Feels like the bang-for-buck might not be there, if this is a problem that's just going to go away after everyone finally updates to 1.1.0-or-later. *If* we have to keep supporting stuff like 0.9.8 forever, though, it might be worth a shot. Keep an eye on my recent dev thread [1]; maybe someone will have some input on that front. [1] https://lists.apache.org/thread.html/15c735e7c513f150a534bc0be69c101106b4d64d076f2c22e2a0ad52@%3Cdev.httpd.apache.org%3E Probably not really helpful, but this problem somewhat reminds me of the older https://bz.apache.org/bugzilla/show_bug.cgi?id=54357 (In reply to Rainer Jung from comment #4) > Probably not really helpful, but this problem somewhat reminds me of the > older > > https://bz.apache.org/bugzilla/show_bug.cgi?id=54357 Yep, and just like with that bug, there's no way to unregister the callback once it's been set. The only winning move is not to play... First attempt at a fix checked into trunk at r1791849. We've recently seen the same bug(crashing during graceful restart): ==== (gdb) bt #0 0x00002b2fac30f340 in ?? () #1 0x00002b2fac8b74d4 in ERR_get_state () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0 #2 0x00002b2fac8b793f in ERR_clear_error () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0 #3 0x00002b2fac8a669e in ENGINE_load_builtin_engines () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0 #4 0x00002b2facbf6fd5 in ssl_hook_pre_config (pconf=0x55ae04945138, plog=<optimized out>, ptemp=<optimized out>) at mod_ssl.c:407 #5 0x000055ae03f4986e in ap_run_pre_config (pconf=pconf@entry=0x55ae04945138, plog=0x55ae04972358, ptemp=0x55ae049c4608) at config.c:89 #6 0x000055ae03f2321c in main (argc=3, argv=0x7fff9928ab88) at main.c:775 (gdb) list ERR_get_state 1029 ERR_remove_thread_state(NULL); 1030 } 1031 #endif 1032 1033 ERR_STATE *ERR_get_state(void) 1034 { 1035 static ERR_STATE fallback; 1036 ERR_STATE *ret, tmp, *tmpp = NULL; 1037 int i; 1038 CRYPTO_THREADID tid; (gdb) l 1039 1040 err_fns_check(); 1041 CRYPTO_THREADID_current(&tid); 1042 CRYPTO_THREADID_cpy(&tmp.tid, &tid); 1043 ret = ERRFN(thread_get_item) (&tmp); 1044 1045 /* ret == the error state, if NULL, make a new one */ 1046 if (ret == NULL) { 1047 ret = (ERR_STATE *)OPENSSL_malloc(sizeof(ERR_STATE)); 1048 if (ret == NULL) (gdb) l CRYPTO_THREADID_current 481 return threadid_callback; 482 } 483 484 void CRYPTO_THREADID_current(CRYPTO_THREADID *id) 485 { 486 if (threadid_callback) { 487 threadid_callback(id); 488 return; 489 } 490 #ifndef OPENSSL_NO_DEPRECATED (gdb) ==== I saw a patch was committed to trunk, but it doesn't seem it ever made it into the 2.4.x branch: ==== https://github.com/apache/httpd/commit/d7dad162b7294df358284134c06d0889076d98af#diff-6b684bff3a841b11b1a51e5ab3c79761 ==== Before we include the change into our own distribution, is there any reason the patch wasn't merged into 2.4.x and this case closed? Another ping on this bug. It would be nice to have this officially backported to 2.4.x, as Apache + mod_ssl now segfaults by default, on various Dockerized platforms. In particular, running arm64 containers on x86_64 hardware, where qemu-user-static is used. Unfortunately r1791849 (and its other patches from the trunk-openssl-threadid branch) don't seem to apply cleanly to the current state of the 2.4.x branch... |