Bug 66669 - JVM crash in APR mode
Summary: JVM crash in APR mode
Status: RESOLVED FIXED
Alias: None
Product: Tomcat Native
Classification: Unclassified
Component: Library (show other bugs)
Version: 1.2.37
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-24 06:40 UTC by DigitalCat
Modified: 2023-07-31 09:06 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description DigitalCat 2023-06-24 06:40:09 UTC
Dear all:
	Last days, we found some jvm crash things when we use APR mode in tomcat.
	
	Version: Tomcat 9.0.65,APR 1.7.0 ,Tomcat Native library 1.2.33, OpenSSL 1.1.1n
	
	We used GDB to findout what happend. Show that crash is happend on this code ,  .
one crash:
 
org.apache.tomcat.util.buf.ByteBufferUtils#cleanDirectBuffer
org.apache.tomcat.util.net.AprEndpoint.AprSocketWrapper#doClose

another crash:

org.apache.tomcat.util.net.AprEndpoint#setSocketOptions(org.apache.tomcat.util.net.SocketWrapperBase<java.lang.Long>)
org.apache.tomcat.jni.SSLSocket#handshake

	
	To See some changelogs and code , I found there are some write lock an  read lock to do socket close or somethine, but some codes doesnt have same code , Im not sure wether this code need to add write lock.  like org.apache.tomcat.util.net.AprEndpoint.SocketProcessor#doRun


sorry that I have an doc file  to show stacktrace but due to safe reason , I cant add an attachment , If you neet more message ,I will try to add more details . 

And I am not a native speaker , hope that I made it clear
Comment 1 DigitalCat 2023-06-24 07:48:49 UTC
add crash stacktrace



Thread 8117: (state = IN_NATIVE)
 - org.apache.tomcat.jni.SSLSocket.handshake(long) @bci=0 (Compiled frame; information may be imprecise)
 - org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(org.apache.tomcat.util.net.SocketWrapperBase) @bci=169, line=738 (Compiled frame)
 - org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run() @bci=109, line=2105 (Compiled frame)
 - org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker) @bci=92, line=1191 (Compiled frame)
 - org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run() @bci=5, line=659 (Interpreted frame)
 - org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run() @bci=4, line=61 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)




Thread 16878: (state = IN_VM)
 - sun.misc.Unsafe.freeMemory(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.nio.DirectByteBuffer$Deallocator.run() @bci=17, line=94 (Compiled frame)
 - sun.misc.Cleaner.clean() @bci=12, line=143 (Compiled frame)
 - sun.reflect.GeneratedMethodAccessor72.invoke(java.lang.Object, java.lang.Object[]) @bci=36 (Compiled frame)
 - sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=6, line=43 (Compiled frame)
 - java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object[]) @bci=56, line=498 (Compiled frame)
 - org.apache.tomcat.util.buf.ByteBufferUtils.cleanDirectBuffer(java.nio.ByteBuffer) @bci=24, line=124 (Compiled frame)
 - org.apache.tomcat.util.net.SocketBufferHandler.free() @bci=11, line=229 (Compiled frame)
 - org.apache.tomcat.util.net.AprEndpoint$AprSocketWrapper.doClose() @bci=76, line=2385 (Compiled frame)
 - org.apache.tomcat.util.net.SocketWrapperBase.close() @bci=34, line=429 (Compiled frame)
 - org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun() @bci=32, line=2145 (Compiled frame)
 - org.apache.tomcat.util.net.SocketProcessorBase.run() @bci=21, line=49 (Compiled frame)
 - org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker) @bci=92, line=1191 (Compiled frame)
 - org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run() @bci=5, line=659 (Interpreted frame)
 - org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run() @bci=4, line=61 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)
Comment 2 Mark Thomas 2023-06-28 13:07:13 UTC
Note:
The APR/Tomcat Native HTTP and AJP connectors are deprecated in Tomcat 9 and have been removed in Tomcat 10.1.x onwards. You have plenty of time before 9.0.x reaches End-Of-Life but you might want to switch to one of the alternatives sooner rather than later. The NIO connector with OpenSSLImplementation is probably a good option.
Comment 3 Michael Osipov 2023-06-28 13:19:32 UTC
Can you reliably reproduce the issue? What OS are you using? I haven't seen so many crashes with APR for the past 10 years in such a short time frame. Especially #setSocketOptions() is weird.
Comment 4 Mark Thomas 2023-06-28 13:22:59 UTC
To investigate this further we need a test case that reproduces the crash reliably enough for us to be able to debug it.
Comment 5 Mark Thomas 2023-06-28 13:23:26 UTC
Reduce severity to normal.
Comment 6 Christopher Schultz 2023-06-28 16:02:42 UTC
Also, the native backtrace would be helpful (it should be found in the hs_pid_xxxx.txt file generated on crash). If you were able to inspect with gdb, anything you found in there would be helpful as well.
Comment 7 DigitalCat 2023-06-30 10:33:28 UTC
We found the way to reproduce this issue。

Upgrading Tomcat Native Library 1.2.23 to Tomcat Native Library 1.2.33, we set the tomcat in apr mode, In a high-concurrency scenario, the Tomcat process crashes.
When line 476 is commented out in native/src/sslcontext.c, the JVM crash does not occur.
We gauss the content in servername is not released in line 192 in native/src/sslcontext.c, causing the JVM to crash.
I hope you reply as soon as possible.


Here's our analysis, and I hope it's helpful.

First, the apr allocates and releases the memory of internal objects through malloc_pool. The allocated objects do not need to be released. When the memory pool is destroyed, all the objects allocated in the memory pool are destroyed and released. At the same time, the cleanup callback function registered in the memory pool is called back.
The local invoking method for the Tomcat to create and destroy a memory pool is Pool.create,Pool.destory.

When Tomcat functions as an HTTPS server, the SSLContext (org.apache.tomcat.util.net.openssl.OpenSSLContext#OpenSSLContext) is initialized using SSContext.make (native invoking) during startup. During server certificate initialization, an APR memory pool is created,
The memory pool will be destroyed when the certificate is released. Check the invoking code. It is found that the memory pool destruction method is invoked in the org.apache.tomcat.util.net.AprEndpoint#unbind method. In normal cases, the memory pool is released when the endpoint is destroyed (destory) or stopped.

PR based on GitHub
https://github.com/apache/tomcat-native/commit/d1885c202d987f3898ba72b7fc084413f0bdb1f9
In this modification, the callback function ssl_callback_ClientHello is registered when SSLContext.make is used. This function is invoked when the SSL handshake is performed inside SSLSocket.handshake each time the client requests.
The memory is allocated on the memory pool for each call, storing the servername, which is not shown here.

It is suspected that memory leakage occurs before the apr memory pool destruction or cleanup method is invoked.
Comment 8 DigitalCat 2023-06-30 10:37:56 UTC
log in jvm hs_err_.log

---------------  T H R E A D  ---------------

Current thread (0x00002b7a41865800):  JavaThread "catalina-exec-18000-120" daemon [_thread_in_native, id=16233, stack(0x00002b7b724e7000,0x00002b7b725e8000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

Registers:
RAX=0x322e332e372e3632, RBX=0x00002b7a8c017ac0, RCX=0x0000000000000000, RDX=0x0000000000000030
RSP=0x00002b7b725e6580, RBP=0x00002b7b725e65f0, RSI=0x00000000ffffffff, RDI=0x0000000000000000
R8 =0xbbb9b201d5303400, R9 =0x00002b7a8c0bb940, R10=0x12656c79b7059708, R11=0xb71a9a122dd39dad
R12=0x0000000000000000, R13=0x0000000738591598, R14=0x00002b7b725e6820, R15=0x00002b7a41865800
RIP=0x00002b7aac08aaf3, EFLAGS=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000000
  TRAPNO=0x000000000000000d

Top of Stack: (sp=0x00002b7b725e6580)
0x00002b7b725e6580:   00002b7b725e6650 00002b7a8c14f330
0x00002b7b725e6590:   0000003000000000 0000000000000000
0x00002b7b725e65a0:   0000000000000000 00002b7a42262a80
0x00002b7b725e65b0:   00002b7aac8256a0 0000000000000030 

Instructions: (pc=0x00002b7aac08aaf3)
0x00002b7aac08aad3:   10 05 00 00 bf 00 00 00 00 e8 17 36 f9 ff 48 89
0x00002b7aac08aae3:   83 e0 01 00 00 48 8b 45 98 48 8b 80 a8 00 00 00
0x00002b7aac08aaf3:   48 8b 80 e0 03 00 00 48 85 c0 0f 84 ee 00 00 00
0x00002b7aac08ab03:   48 8b 45 98 48 8b 80 10 05 00 00 48 8b 80 40 02 

Register to memory mapping:

RAX=0x322e332e372e3632 is an unknown value
RBX=0x00002b7a8c017ac0 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x0000000000000030 is an unknown value
RSP=0x00002b7b725e6580 is pointing into the stack for thread: 0x00002b7a41865800
RBP=0x00002b7b725e65f0 is pointing into the stack for thread: 0x00002b7a41865800
RSI=0x00000000ffffffff is an unknown value
RDI=0x0000000000000000 is an unknown value
R8 =0xbbb9b201d5303400 is an unknown value
R9 =0x00002b7a8c0bb940 is an unknown value
R10=0x12656c79b7059708 is an unknown value
R11=0xb71a9a122dd39dad is an unknown value
R12=0x0000000000000000 is an unknown value
R13=0x0000000738591598 is an oop
java.util.concurrent.locks.ReentrantLock 
 - klass: 'java/util/concurrent/locks/ReentrantLock'
 - ---- fields (total size 2 words):
 - private final 'sync' 'Ljava/util/concurrent/locks/ReentrantLock$Sync;' @12  a 'java/util/concurrent/locks/ReentrantLock$NonfairSync' (e70ad38b 385915b0)
R14=0x00002b7b725e6820 is pointing into the stack for thread: 0x00002b7a41865800
R15=0x00002b7a41865800 is a thread


Stack: [0x00002b7b724e7000,0x00002b7b725e8000],  sp=0x00002b7b725e6580,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libssl.so.1.1+0x8aaf3]
C  [libssl.so.1.1+0x6e230]
C  [libssl.so.1.1+0x6d5f7]
C  [libssl.so.1.1+0x6d093]
C  [libssl.so.1.1+0x4ec81]  SSL_do_handshake+0xfa
C  [libtcnative-1.so.0.2.36+0x2b540]  Java_org_apache_tomcat_jni_SSLSocket_handshake+0x81
J 14729  org.apache.tomcat.jni.SSLSocket.handshake(J)I (0 bytes) @ 0x00002b7a45eebb41 [0x00002b7a45eeba80+0xc1]
J 15421 C2 org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(Lorg/apache/tomcat/util/net/SocketWrapperBase;)Z (419 bytes) @ 0x00002b7a469e46b8 [0x00002b7a469e4540+0x178]
J 15420 C2 org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run()V (200 bytes) @ 0x00002b7a46a2f1a4 [0x00002b7a46a2f0e0+0xc4]
J 15712% C2 org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(Lorg/apache/tomcat/util/threads/ThreadPoolExecutor$Worker;)V (187 bytes) @ 0x00002b7a45e8d014 [0x00002b7a45e8ce40+0x1d4]
j  org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run()V+5
J 10304 C1 org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run()V (25 bytes) @ 0x00002b7a45a274d4 [0x00002b7a45a273c0+0x114]
J 10295 C1 java.lang.Thread.run()V (17 bytes) @ 0x00002b7a45a2cb14 [0x00002b7a45a2c9c0+0x154]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x7778a5]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0xc85
V  [libjvm.so+0x77509f]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x26f
V  [libjvm.so+0x77564d]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x5d
V  [libjvm.so+0x82b1de]  thread_entry(JavaThread*, Thread*)+0xae
V  [libjvm.so+0xc0adf7]  JavaThread::thread_main_inner()+0x137
V  [libjvm.so+0xa8b730]  java_start(Thread*)+0x170
C  [libpthread.so.0+0x96b4]  start_thread+0xc4
Comment 9 Christopher Schultz 2023-06-30 19:26:23 UTC
(In reply to DigitalCat from comment #7)
> When line 476 is commented out in native/src/sslcontext.c, the JVM crash
> does not occur.
> We guess the content in servername is not released in line 192 in
> native/src/sslcontext.c, causing the JVM to crash.

These line numbers do not match the current main branch of Tomcat. Are you able to:

1. Use sources from git/main (1.2.38-dev)
2. Show the patch which resolves the issue
3. Run the process long enough to ensure that you aren't leaking memory from elsewhere
Comment 10 Mark Thomas 2023-07-03 11:28:30 UTC
The line numbers match up for the quoted version (1.2.33) and the relevant code is unchanged in 1.2.x.

The description of a potential memory leak looks plausible but I am unable to reproduce it with 9.0.x and Tomcat Native 1.2.37. With 1,000,000+ requests I do not see the expected increase in any of the memory pools in a profiler.

A test case that demonstrates the issue would be very helpful. Keep in mind that ideally it needs to be something committers can run locally.
Comment 11 DigitalCat 2023-07-03 12:19:02 UTC
Yes branch 1.2.X can match line no.

We use wrk to reproduce this issue.

"./wrk -t 20 -c 1000 -d 5s --latency --timeout 1s https://XXXXXX/V3/LoginRoute"

wrk is an opensource project, you can try with this .

Due to our company's security  policy, I can not offer our test case . Sorry for that.
Comment 12 DigitalCat 2023-07-10 13:31:28 UTC
We are anticipating your reply
Comment 13 Mark Thomas 2023-07-27 23:51:56 UTC
I'm unable to reproduce the crash but adding some debug logging did confirm that there is a memory leak here. I have a fix that I just need to clean up before I commit it.
Comment 14 Mark Thomas 2023-07-31 08:08:35 UTC
Correcting Product etc
Comment 15 Mark Thomas 2023-07-31 08:08:47 UTC
Note 2.0.x is also affected
Comment 16 Mark Thomas 2023-07-31 09:06:16 UTC
Fixed in:
- 2.0.x for 2.0.5 onwards
- 1.2.x for 1.2.38 onwards