Bug 60947

Summary: Segfault on startup when using mod_ssl with APR-crypto
Product: Apache httpd-2 Reporter: Jacob Champion <jchampion>
Component: mod_sslAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: normal CC: bz.apache, dimitry, paul.spangler
Priority: P2    
Version: 2.5-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: Linux   

Description Jacob Champion 2017-03-31 01:51:08 UTC
mod_ssl's use of CRYPTO_THREADID_set_callback() is unfortunately bugged if another module has loaded OpenSSL, either via the apr_crypto routines or some other pathway.

When mod_ssl loads, it calls CRYPTO_THREADID_set_callback() as part of setting up the thread-safety routines in OpenSSL. On unload, it attempts to unset this callback by calling the registration function with a NULL argument. This always fails, and the function returns zero, because unfortunately this API is write-once.

*If* mod_ssl is the only OpenSSL consumer, then libcrypto will be unloaded and reloaded and the OpenSSL static initialization will reset the callback, masking any issue. But if something else has loaded OpenSSL as well (say, by using the apr_crypto_* API), that threadid callback will now be pointing at junk.

Even after this, there's still a way to "succeed" -- *if* mod_ssl is reloaded into exactly the same position, everything will proceed as normal. But if not, we'll segfault the next time we call into OpenSSL. Currently, our Ubuntu autotest machine is running into this crash.

Things we might consider:
- don't set any callback at all, and rely on OpenSSL's default threadid callback
- for platforms on which the default threadid callback is unsafe/nonfunctional, revert to the deprecated CRYPTO_set_id_callback(), which is not write-once
- talk to OpenSSL upstream to figure out the "right way" to deal with multiple (possibly competing) OpenSSL consumers in the same address space

See also:
- bug 54357, which similarly discusses an inability to reset a static callback in OpenSSL
- postgresql dev conversation on this: https://postgrespro.com/list/id/87zj6ht6ef.fsf@wulczer.org
Comment 1 Jacob Champion 2017-03-31 22:06:17 UTC
More research:

1) OpenSSL 1.1.0 no longer requires the use of CRYPTO_THREADID_set_callback() at all, and in fact the CRYPTO_THREADID_* stuff is now a no-op. See rsalz's post at https://www.openssl.org/blog/blog/2017/02/21/threads/ .

2) OpenSSL 1.0.2 (and possibly prior, but I haven't verified) contains default implementations for Windows and BeOS. For all other platforms, its fallback implementation uses the address of errno to tell threads apart.

3) Our current implementation of the threadid callback is possibly incorrect on some platforms, since we're using CRYPTO_THREADID_set_numeric() and truncating an apr_os_thread_t into the space of an unsigned long. This opens up the possibility of collisions. As a concrete example: this makes our implementation objectively worse than the default implementation on Windows, because we truncate a HANDLE into a long, whereas the default uses GetCurrentThreadId() directly.

There is a CRYPTO_THREADID_set_pointer() that allows us to pass in a void*, and if it turns out we have to set a threadid callback, I think we should prefer the set_pointer() flavor if the native thread identifier is bigger than a long.

This is all making me feel like we should try to dump the callback entirely on as many platforms as possible (i.e. those with an addressable, per-thread errno), and decide what to do with the remaining platforms (if any) on a case-by-case basis.
Comment 2 Paul Spangler 2017-04-10 21:21:37 UTC
This comment isn't related to the particular issue you've described with the callback, but is related to the title of the bug. We've also seen a rare segfault when mixing mod_ssl with APR-crypto, but in SSL_load_error_strings (which is supposedly fixed in 1.1.0, see https://rt.openssl.org/Ticket/Display.html?user=guest&pass=guest&id=2325):

1. SSL_load_error_strings sets up a static hash table in libeay containing error strings, some of which point to memory in ssleay.
2. libapr loads apr_crypto_openssl, which bumps the refcount on libeay.
3. Server executes normally... until say the child process crashes or a graceful restart.
4. All modules are unloaded, but APR drivers are not (i.e. apr_crypto_openssl remains loaded).
5. During unload, ssleay gets unloaded but libeay remains loaded due to step 2.
6. Modules are loaded again.
7. ssleay may load into a different location than it was in step 1.
8. The hash table from step 1 still exists (in libeay), but now if it tries to check one of the SSL error strings, it points to a now-invalid place in memory, crashing.

Our research led to this (somewhat old) mailing list thread: http://mailing.openssl.dev.narkive.com/syUDbGxq/memory-corruption-after-libssl-is-unloaded-from-memory where one of the devs is "not sure it has ever been a feasible goal to make OpenSSL DSO/DLLs
able to be unloaded (with the aim of subsequently loading)."

In the end, we decided it was safest to pin ssleay and libeay in memory for the lifetime of the server via a module that preloads them and never unloads them. That "solution" wouldn't fix this particular callback issue since it's related to mod_ssl moving in memory, but I figured this might be worth sharing anyway. Feel free to ignore if it's irrelevant :)

Maybe the "fix" here would also be to pin whatever the callback points to in memory (if that would even be feasible) so it survives a mod_ssl reload?
Comment 3 Jacob Champion 2017-04-10 21:37:00 UTC
(In reply to Paul Spangler from comment #2)
> This comment isn't related to the particular issue you've described with the
> callback, but is related to the title of the bug. We've also seen a rare
> segfault when mixing mod_ssl with APR-crypto, but in SSL_load_error_strings
> (which is supposedly fixed in 1.1.0, see
> https://rt.openssl.org/Ticket/Display.html?user=guest&pass=guest&id=2325):
Nice, thanks for the link! More evidence that perhaps the long-term fix for these sorts of problems is just to move to 1.1.0, where they've done away with a lot of the static global initialization stuff...

> Maybe the "fix" here would also be to pin whatever the callback points to in
> memory (if that would even be feasible) so it survives a mod_ssl reload?
Interesting idea. I suppose we could have the callback(s) live in a completely separate DSO that never gets unloaded. Feels like the bang-for-buck might not be there, if this is a problem that's just going to go away after everyone finally updates to 1.1.0-or-later.

*If* we have to keep supporting stuff like 0.9.8 forever, though, it might be worth a shot. Keep an eye on my recent dev thread [1]; maybe someone will have some input on that front.

[1] https://lists.apache.org/thread.html/15c735e7c513f150a534bc0be69c101106b4d64d076f2c22e2a0ad52@%3Cdev.httpd.apache.org%3E
Comment 4 Rainer Jung 2017-04-10 22:34:33 UTC
Probably not really helpful, but this problem somewhat reminds me of the older

https://bz.apache.org/bugzilla/show_bug.cgi?id=54357
Comment 5 Jacob Champion 2017-04-10 23:06:21 UTC
(In reply to Rainer Jung from comment #4)
> Probably not really helpful, but this problem somewhat reminds me of the
> older
> 
> https://bz.apache.org/bugzilla/show_bug.cgi?id=54357
Yep, and just like with that bug, there's no way to unregister the callback once it's been set. The only winning move is not to play...
Comment 6 Jacob Champion 2017-04-19 02:56:24 UTC
First attempt at a fix checked into trunk at r1791849.
Comment 7 sta 2020-01-07 13:44:57 UTC
We've recently seen the same bug(crashing during graceful restart):
====
(gdb) bt
#0  0x00002b2fac30f340 in ?? ()
#1  0x00002b2fac8b74d4 in ERR_get_state () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0
#2  0x00002b2fac8b793f in ERR_clear_error () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0
#3  0x00002b2fac8a669e in ENGINE_load_builtin_engines () from /opt/cpanel/ea-openssl/lib64/libcrypto.so.1.0.0
#4  0x00002b2facbf6fd5 in ssl_hook_pre_config (pconf=0x55ae04945138, plog=<optimized out>, ptemp=<optimized out>) at mod_ssl.c:407
#5  0x000055ae03f4986e in ap_run_pre_config (pconf=pconf@entry=0x55ae04945138, plog=0x55ae04972358, ptemp=0x55ae049c4608) at config.c:89
#6  0x000055ae03f2321c in main (argc=3, argv=0x7fff9928ab88) at main.c:775
(gdb) list ERR_get_state
1029     ERR_remove_thread_state(NULL);
1030    }
1031    #endif
1032
1033    ERR_STATE *ERR_get_state(void)
1034    {
1035     static ERR_STATE fallback;
1036     ERR_STATE *ret, tmp, *tmpp = NULL;
1037     int i;
1038     CRYPTO_THREADID tid;
(gdb) l
1039
1040     err_fns_check();
1041     CRYPTO_THREADID_current(&tid);
1042     CRYPTO_THREADID_cpy(&tmp.tid, &tid);
1043     ret = ERRFN(thread_get_item) (&tmp);
1044
1045     /* ret == the error state, if NULL, make a new one */
1046     if (ret == NULL) {
1047         ret = (ERR_STATE *)OPENSSL_malloc(sizeof(ERR_STATE));
1048         if (ret == NULL)
(gdb) l CRYPTO_THREADID_current
481     return threadid_callback;
482    }
483
484    void CRYPTO_THREADID_current(CRYPTO_THREADID *id)
485    {
486     if (threadid_callback) {
487         threadid_callback(id);
488         return;
489     }
490    #ifndef OPENSSL_NO_DEPRECATED
(gdb)
====

I saw a patch was committed to trunk, but it doesn't seem it ever made it into the 2.4.x branch:
====
https://github.com/apache/httpd/commit/d7dad162b7294df358284134c06d0889076d98af#diff-6b684bff3a841b11b1a51e5ab3c79761
====

Before we include the change into our own distribution, is there any reason the patch wasn't merged into 2.4.x and this case closed?
Comment 8 Dimitry Andric 2021-06-26 14:39:02 UTC
Another ping on this bug. It would be nice to have this officially backported to 2.4.x, as Apache + mod_ssl now segfaults by default, on various Dockerized platforms. In particular, running arm64 containers on x86_64 hardware, where qemu-user-static is used.

Unfortunately r1791849 (and its other patches from the trunk-openssl-threadid branch) don't seem to apply cleanly to the current state of the 2.4.x branch...