Bug 67927 - TLSCertificateReloadListener triggers race condition (?) in OpenSSL code which causes the JVM to die
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Connectors
Version: 9.0.x
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-25 14:57 UTC by Michael Osipov
Modified: 2023-11-07 18:39 UTC
CC List: 2 users



Description Michael Osipov 2023-10-25 14:57:24 UTC
This has been verified on Tomcat 9 from bec7a51d7fc3fb913c755b258169d1816b77bea5, but I guess this happens on all Tomcat versions.

* OS: Windows 10 and HP-UX
* Java 8 in both cases
* tomcat-native 1.2.39 and 1.2.38
* OpenSSL 3.0.11 and 1.1.1t

Reproducing this on Windows is harder; on HP-UX it fails quite quickly.

Reproducer:
* Take the connector configs from https://bz.apache.org/bugzilla/show_bug.cgi?id=67666#c0.
* Configure listener: 
> <Listener className="org.apache.catalina.security.TLSCertificateReloadListener" checkPeriod="30" daysBefore="1" />
* Generate a soon-to-expire or already-expired certificate, which will trigger the listener
* Start Tomcat and you should start to see the listener warning you
* Run:
> while true ; do for port in 20001 20002 20003 20004 30001 30002 ; do echo -n "$port: "; echo Q | openssl s_client -connect localhost:$port -brief 2>&1 | grep -e DONE -e errno  ; done; done
You should see repeating:
> ...
> 20001: DONE
> 20002: DONE
> 20003: DONE
> 20004: DONE
> 30001: DONE
> 30002: DONE
> ...
> 20004: DONE
> 30001: write:errno=232
> 30002: write:errno=232
> 20001: connect:errno=239
> 20002: connect:errno=239
> 20003: connect:errno=239
> 20004: connect:errno=239
> ...
* Run:
> tail -f logs/catalina.out
and see the JVM crash.

On Windows I see:
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000001ec4bf30f9a, pid=5240, tid=0x0000000000005b40
> #
> # JRE version: OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-win64) (8.0_362-b09) (build 1.8.0_362-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.362-b09 mixed mode windows-amd64 compressed oops)
> # Problematic frame:
> # C  [tcnative-1.dll+0x80f9a]
> #
> ....
> 
> Stack: [0x0000009785200000,0x0000009785300000],  sp=0x00000097852febb0,  free space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [tcnative-1.dll+0x80f9a]
> C  [tcnative-1.dll+0x2d580]
> C  [tcnative-1.dll+0x10256]
> C  [tcnative-1.dll+0x10c8f]
> C  0x000001ec2f0e9d8e
> 
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.apache.tomcat.jni.SSLSocket.attach(JJ)I+0
> j  org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(Lorg/apache/tomcat/util/net/SocketWrapperBase;)Z+114
> j  org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run()V+32
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(Lorg/apache/tomcat/util/threads/ThreadPoolExecutor$Worker;)V+92
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run()V+5
> j  org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run()V+4
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub

On HP-UX I see:
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (11) at pc=c97d00b0, pid=6253, tid=80
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0) (build 1.8.0.26-hp-ux-b1)
> # Java VM: Java HotSpot(TM) Server VM (25.26-b1 mixed mode hp-ux-ia64 )
> # Problematic frame:
> # C  [libcrypto.so.1.1+0x4ab727a0]  OPENSSL_sk_dup+0x4ac131b0
> #
> # Core dump written. Default location: /var/opt/tomcat90-eval/core or core.6253 (max size 2097151 kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java again
> #
> #  Please report this error to HPE customer support.
> #
> 
> ---------------  T H R E A D  ---------------
> 
> Current thread (018dd000):  JavaThread "https-openssl-apr-30002-exec-1" daemon [_thread_in_native, id=80, lwp_id=7915536, stack(54b01000,54b41000)]
> 
> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 00000008
> ...
> Stack: [54b01000,54b41000],  sp=54b40a90,  free space=254k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [libcrypto.so.1.1+0x4ab727a0]  OPENSSL_sk_dup+0x4ac131b0
> 
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.apache.tomcat.jni.SSLSocket.attach(JJ)I+0
> j  org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(Lorg/apache/tomcat/util/net/SocketWrapperBase;)Z+114
> j  org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run()V+32
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(Lorg/apache/tomcat/util/threads/ThreadPoolExecutor$Worker;)V+92
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run()V+5
> j  org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run()V+4
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub

While I have no idea how to approach this on Windows, I have loaded the core dump into GDB on HP-UX:
> # gdb /opt/java8/bin/java core
> HP gdb 6.8 for HP Itanium (32 or 64 bit) and target HP-UX 11iv2 and 11iv3.
> Copyright 1986 - 2011 Free Software Foundation, Inc.
> Hewlett-Packard Wildebeest 6.8 (based on GDB) is covered by the
> GNU General Public License. Type "show copying" to see the conditions to
> change it and/or distribute copies. Type "show warranty" for warranty/support.
> ..
> Core was generated by `java'.
> Program terminated with signal 6, Aborted.
> 
> #0  0x60000000c020ff10:0 in _lwp_killtype Method, field __vfp not found in target
> type ConstantPool, field __vfp not found in target
> constant CollectedHeap::G1CollectedHeap not found in target
> +0x30 ()
>    from /usr/lib/hpux32/libpthread.so.1
> (gdb) where
> #0  0x60000000c020ff10:0 in _lwp_kill+0x30 ()
>    from /usr/lib/hpux32/libpthread.so.1
> #1  0x60000000c0174ec0:0 in pthread_kill+0x9f0 ()
>    from /usr/lib/hpux32/libpthread.so.1
> #2  0x60000000c0403740:0 in raise+0xe0 () from /usr/lib/hpux32/libc.so.1
> #3  0x60000000c0528690:0 in abort+0x170 () from /usr/lib/hpux32/libc.so.1
> #4  0x60000000cd743de0:0 in os::abort ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os/hp-ux/vm/os_hp-ux.cpp:2090
> #5  0x60000000cdd605f0:0 in VMError::report_and_die ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/utilities/vmError.cpp:1421
> #6  0x60000000cdd61210:0 in crash_handler ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os/hp-ux/vm/vmError_hp-ux.cpp:84
> #7  <signal handler called>
> #8  0x60000000cd78a200:0 in ParallelScavengeHeap::block_start ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/oops/oop.inline.hpp:440
> #9  0x60000000cd7233d0:0 in os::print_location ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/os.cpp:1222
> #10 0x60000000cd759f50:0 in os::print_register_info ()
> ---Type <return> to continue, or q <return> to quit---
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os_cpu/hp-ux_ia64/vm/os_hp-ux_ia64.cpp:1199
> #11 0x60000000cdd5dbb0:0 in VMError::report ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/utilities/vmError.cpp:676
> #12 0x60000000cdd60190:0 in VMError::report_and_die ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/utilities/vmError.cpp:1318
> #13 0x60000000cd75c170:0 in os::Hpux::JVM_handle_hpux_signal ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os_cpu/hp-ux_ia64/vm/os_hp-ux_ia64.cpp:1157
> #14 0x60000000cd72c600:0 in os::Hpux::signalHandler ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os/hp-ux/vm/os_hp-ux.cpp:4435
> #15 <signal handler called>
> #16 0x60000000c97d00b0:0 in OPENSSL_sk_dup+0x70 ()
>    from /usr/lib/hpux32/libcrypto.so.1.1
> #17 0x60000000c5901750:0 in SSL_new+0x290 () from /usr/lib/hpux32/libssl.so.1.1
> #18 0x60000000c55908e0:0 in ssl_create () at src/sslnetwork.c:128
> #19 0x60000000c5591780:0 in Java_org_apache_tomcat_jni_SSLSocket_attach ()
>     at src/sslnetwork.c:603
> #20 0x200000007a800d60 in Java native_call_stub frame
> #21 0x200000007a815af0 in JNI frame: org.apache.tomcat.jni.SSLSocket::attach (lo---Type <return> to continue, or q <return> to quit---
> ng, long) ->int
> #22 0x200000007a806920 in interpreted frame: org.apache.tomcat.util.net.AprEndpoint::setSocketOptions (org.apache.tomcat.util.net.SocketWrapperBase) ->boolean bci: 114
> #23 0x200000007a8066e0 in interpreted frame: org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor::run () ->void bci: 32
> #24 0x200000007a806d80 in interpreted frame: org.apache.tomcat.util.threads.ThreadPoolExecutor::runWorker (org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker) ->void bci: 92
> #25 0x200000007a806d40 in interpreted frame: org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker::run () ->void bci: 5
> #26 0x200000007a806d80 in interpreted frame: org.apache.tomcat.util.threads.TaskThread$WrappingRunnable::run () ->void bci: 4
> #27 0x200000007a806d80 in interpreted frame: java.lang.Thread::run () ->void bci: 11
> #28 0x200000007a800470 in Java entry frame
> #29 0x60000000ccd30d70:0 in JavaCalls::call_helper ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/javaCalls.cpp:415
> #30 0x60000000cd7500f0:0 in os::os_exception_wrapper ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os/hp-ux/vm/os_hp-ux.cpp:4367
> #31 0x60000000ccd356e0:0 in JavaCalls::call ()
> ---Type <return> to continue, or q <return> to quit---
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/javaCalls.cpp:318
> #32 0x60000000ccd39c90:0 in JavaCalls::call_virtual ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/javaCalls.cpp:215
> #33 0x60000000ccd3d760:0 in JavaCalls::call_virtual ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/javaCalls.cpp:221
> #34 0x60000000cceefa60:0 in thread_entry ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/prims/jvm.cpp:3182
> #35 0x60000000cdbe3af0:0 in JavaThread::run ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/share/vm/runtime/thread.cpp:3204
> #36 0x60000000cd72afa0:0 in _start ()
>     at /eevee_home/home/jinteg/jenkins_tmp/java8/h8.0.26.rc1b1/hotspot/src/os/hp-ux/vm/os_hp-ux.cpp:1211
> #37 0x60000000c013b820:0 in __pthread_bound_body+0x1c0 ()
>    from /usr/lib/hpux32/libpthread.so.1

Line in question:
> if ((ssl = SSL_new(ctx->ctx)) == NULL) {

I bet that "ctx->ctx" is either in an invalid state or NULL. I am certain that this can also be reproduced on macOS, *BSD and GNU/Linux.
Is it possible that the connector is not locked during the reload, which opens a small window for a NULL pointer dereference?

Let me know what type of information I can provide.
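The suspected race can be sketched in Java. This is a hypothetical illustration, not Tomcat's actual classes: `ContextHolder` stands in for whatever object owns the native `SSL_CTX` pointer (modeled here as a `long`, with `0L` playing the role of NULL). The point is that if the reloader swaps the handle through an `AtomicReference`, an accepting thread can never observe a half-published or NULL handle, which is one way the window described above could be closed.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of safe publication of a native context handle.
// A long value stands in for the native SSL_CTX pointer; 0L means NULL.
public class ContextHolder {
    private final AtomicReference<Long> ctx;

    public ContextHolder(long initialCtx) {
        this.ctx = new AtomicReference<>(initialCtx);
    }

    // Reload path: atomically publish the new handle and hand the old one
    // back to the caller, which may free it once no connection still uses it.
    public long swap(long newCtx) {
        return ctx.getAndSet(newCtx);
    }

    // Accept path: always reads a fully published, non-NULL handle.
    public long current() {
        long c = ctx.get();
        if (c == 0L) {
            throw new IllegalStateException("SSL context handle is NULL");
        }
        return c;
    }
}
```

Note that atomic publication alone does not solve the lifetime problem: the old handle returned by `swap` must not be freed while an in-flight `SSL_new` call still holds it, which is where the crash above appears to originate.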
Comment 1 Michael Osipov 2023-10-25 15:03:34 UTC
Before you ask: yes, I know the low checkPeriod is silly, but it still should not cause a crash, at most a performance degradation during the reload phase.
Comment 2 Michael Osipov 2023-10-25 15:07:10 UTC
Mapping from errno.h on HP-UX:
> #  define ECONNRESET        232 /* Connection reset by peer */
> #  define ECONNREFUSED      239 /* Connection refused */
Comment 3 Mark Thomas 2023-10-25 16:44:18 UTC
Can you reproduce this with NioEndpoint+OpenSSL or is this AprEndpoint specific?
Comment 4 Michael Osipov 2023-10-25 17:27:37 UTC
(In reply to Mark Thomas from comment #3)
> Can you reproduce this with NioEndpoint+OpenSSL or is this AprEndpoint
> specific?

Unfortunately, yes:

> Stack: [0x0000004466a00000,0x0000004466b00000],  sp=0x0000004466afee10,  free space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [tcnative-1.dll+0x2d31c]
> C  [tcnative-1.dll+0xa9d2]
> C  0x00000232523b9d8e
> 
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.apache.tomcat.jni.SSL.newSSL(JZ)J+0
> j  org.apache.tomcat.util.net.openssl.OpenSSLEngine.<init>(JLjava/lang/String;ZLorg/apache/tomcat/util/net/openssl/OpenSSLSessionContext;ZZIZ)V+75
> j  org.apache.tomcat.util.net.openssl.OpenSSLContext.createSSLEngine()Ljavax/net/ssl/SSLEngine;+68
> j  org.apache.tomcat.util.net.AbstractJsseEndpoint.createSSLEngine(Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Ljavax/net/ssl/SSLEngine;+54
> j  org.apache.tomcat.util.net.SecureNioChannel.processSNI()I+368
> j  org.apache.tomcat.util.net.SecureNioChannel.handshake(ZZ)I+17
> j  org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun()V+118
> j  org.apache.tomcat.util.net.SocketProcessorBase.run()V+32
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(Lorg/apache/tomcat/util/threads/ThreadPoolExecutor$Worker;)V+92
> j  org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run()V+5
> j  org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run()V+4
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
Comment 5 Michael Osipov 2023-10-25 17:38:33 UTC
I will continue testing on Friday. I have an idea of how to provoke this. I noticed this especially when I tested https://github.com/apache/tomcat-native/pull/22 locally.
Comment 6 Remy Maucherat 2023-10-26 08:39:27 UTC
Since deallocating the SSL context should be tied to GC when using NIO, I don't understand how this happens. OTOH, actually doing it is quite risky, so maybe it would be better to leak it ...

It would be a good idea to try the FFM code to debug this kind of issue (shameless ad). It would produce cleaner traces and would be easier to experiment with, although I doubt the native code itself would need a fix in that particular case.
Comment 7 Michael Osipov 2023-10-26 09:35:44 UTC
(In reply to Remy Maucherat from comment #6)
> Since deallocating the ssl context should be tied to GC when using NIO, I
> don't understand how it happens. OTOH, actually doing it is quite risky, so
> maybe it could be better to leak it ...
> 
> It would be a good idea to try the FFM code to debug this kind of issue
> (shameless ad). It would produce cleaner traces and would be easier to
> experiment with, although I doubt the native code itself would need a fix in
> that particular case.

I will try the FFM code next week. I am still convinced that all requests need to be paused and the connector locked during the security context reload.
Comment 8 Christopher Schultz 2023-10-26 16:58:32 UTC
Michael, do you know if calling the JMX-based "reload" operation will cause the same crash? I haven't dug into the code, but it's possible this bug has existed far longer than the TLSCertificateReloadListener has.
Comment 9 Michael Osipov 2023-10-26 17:06:26 UTC
(In reply to Christopher Schultz from comment #8)
> Michael, do you know if calling the JMX-based "reload" operation will cause
> the same crash? I haven't dug-into the code, but it's possible this bug has
> existed far longer than the TLSCertificateReloadListener has.

I have never tried the JMX-based approach, but I guess if it uses the same code paths then it is problematic as well. I think that the listener just surfaces the bug.
Comment 10 Mark Thomas 2023-10-26 22:23:51 UTC
We need to allow in-progress usage of the old SSLContext to continue while new requests get the new SSLContext. We don't want new requests to have to wait for a long running request using the old SSLContext to complete.

I agree we need some locking to make sure the 'current' SSLContext is updated in a thread-safe manner but I wouldn't classify that as pausing all requests. That looks to be in place on the Java side.

It is rather hacky but I often find, if I can reproduce the issue, a few printf statements in the Native code can be very helpful.
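The scheme Mark describes, in-progress connections keep the old context while new connections get the new one, can be sketched with a reference count that delays destruction of the old context until its last user releases it. This is a hedged illustration only; `RefCountedContext` and its methods are hypothetical names, not Tomcat's actual implementation, and the native `SSL_CTX` pointer is again modeled as a `long`.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the old SSLContext stays alive for in-progress
// connections and is freed only when the last reference is released.
public class RefCountedContext {
    final long handle;                                 // stand-in for the native SSL_CTX pointer
    private final AtomicInteger refs = new AtomicInteger(1); // the endpoint's own reference
    private volatile boolean freed = false;

    public RefCountedContext(long handle) {
        this.handle = handle;
    }

    // Accept/handshake path: pin the context before handing it to SSL_new.
    public boolean retain() {
        for (int r = refs.get(); r > 0; r = refs.get()) {
            if (refs.compareAndSet(r, r + 1)) {
                return true;                           // pinned; cannot be freed underneath us
            }
        }
        return false;                                  // already retired; caller must fetch the new context
    }

    // Called when a connection finishes, and once by the reloader to drop ownership.
    public void release() {
        if (refs.decrementAndGet() == 0) {
            freed = true;                              // native code would free the SSL_CTX here
        }
    }

    public boolean isFreed() {
        return freed;
    }
}
```

With this shape, the reloader's `release()` of the old context is a no-op for destruction purposes until every in-flight handshake has also released it, which matches the "don't make new requests wait, but don't pull the context out from under old ones" requirement.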
Comment 11 Michael Osipov 2023-10-27 07:27:24 UTC
(In reply to Mark Thomas from comment #10)
> We need to allow in-progress usage of the old SSLContext to continue while
> new requests get the new SSLContext. We don't want new requests to have to
> wait for a long running request using the old SSLContext to complete.
> 
> I agree we need some locking to make sure the 'current' SSLContext is
> updated in a thread-safe manner but I wouldn't classify that as pausing all
> requests. That looks to be in place on the Java side.

Makes total sense. Keep at most two handles.

> It is rather hacky but I often find, if I can reproduce the issue, a few
> printf statements in the Native code can be very helpful.

Had the same idea. Were you able to reproduce it already?
Comment 12 Mark Thomas 2023-11-02 17:26:48 UTC
I can reproduce it and I think I have a fix. Whether it is a complete fix is TBD. I need to tidy things up before committing and then others can test.
Comment 13 Mark Thomas 2023-11-02 18:40:09 UTC
I've pushed the fix to all the branches. I've been running the test with a reload interval of 5s and have not been able to trigger the issue for a while with this patch in place.
Comment 14 Michael Osipov 2023-11-02 18:41:59 UTC
(In reply to Mark Thomas from comment #13)
> I've pushed the fix to all the branches. I've been running the test with a
> reload interval of 5s and have not been able to trigger the issue for a
> while with this patch in place.

Let me give it a try tomorrow...
Comment 15 Michael Osipov 2023-11-03 09:24:14 UTC
I have been running the change for a couple of hours now on HP-UX and Windows with Tomcat 9.0.x. I cannot make it crash anymore.

Anyone else? This issue may have been resolved.
Comment 16 Mark Thomas 2023-11-03 13:54:39 UTC
Thanks for testing. I'm going to assume this is fixed. We can always re-open it if not.