Created attachment 34512 [details] JVM crash log I recently upgraded my development setup from Tomcat 8.5.6 to 8.5.8 and experiences a lot of SIGSEGV in Tomcat Native since then. It always happens when calling SSLSocket.getInfos, but not on every call. My HTTPS connector is configured this way : <Connector port="8443" SSLEnabled="true" protocol="org.apache.coyote.http11.Http11AprProtocol" maxThreads="150" scheme="https" secure="true" sslProtocol="TLS" sslImplementationName="org.apache.tomcat.util.net.openssl.OpenSSLImplementation" SSLCertificateFile="${catalina.home}/conf/certificate.crt" SSLCertificateKeyFile="${catalina.home}/conf/privateKey.key" server="Apache-Coyote/1.1" URIEncoding="UTF-8"> <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" /> </Connector> I use Google Chrome 55 as HTTP client. Debugging the project with Netbeans and attaching GDB to the java process, I get the following stacktrace when SEGV : __GI_raise (sig=6, sig@entry=6) __GI_abort () os::abort(bool) () VMError::report_and_die() () JVM_handle_linux_signal () signalHandler(int, siginfo*, void*) () <signal handler called> () Java_org_apache_tomcat_jni_SSLSocket_getInfoS (e=0x7fc41861b1f8, o=0x7fc37a6f2b68, sock=140478657220768, what=2) So, the violation occurs on the last line of : TCN_IMPLEMENT_CALL(jstring, SSLSocket, getInfoS)(TCN_STDARGS, jlong sock, jint what) { tcn_socket_t *a = J2P(sock, tcn_socket_t *); tcn_ssl_conn_t *s; jstring value = NULL; apr_status_t rv = APR_SUCCESS; UNREFERENCED(o); TCN_ASSERT(sock != 0); s = (tcn_ssl_conn_t *)(a->opaque); Please find attached the JVM crash log. To debug, I configure native with : ./configure CC=${IDE_CC} CXX=${IDE_CXX} CFLAGS="-g3 -gdwarf-2" CXXFLAGS="-g3 -gdwarf-2" --with-java-home=/usr/lib/jvm/java-8-oracle --with-ssl=/home/lpenet/devoss/openssl-1.0.2j and configure openssl 1.0.2j with the following command line : ./config no-shared -fPIC
I did not explicitly mentionned it, but I use stock openssl sources and the native version included in tomcat 8.5.8 sources.
Maybe, but when I see that, I think you should complain to other people as well: j org.apache.catalina.connector.Request.getAttributeNames()Ljava/util/Enumeration;+17 j org.apache.myfaces.webapp.ManagedBeanDestroyerListener.requestDestroyed(Ljavax/servlet/ServletRequestEvent;)V+13 It's only cosmetic, bt why are you specifying sslImplementationName when you're using APR ? Do you expect it to use it JSSE by accident ? server="Apache-Coyote/1.1" is the default already, so no need to
Specifying sslImplementationName is just plain paranoia, and to make it clear for coworkers arriving on the project and not already knowing it is the default with APR.
The MyFaces code doesn't seem to be doing anything unreasonable (like retaining references to previous requests). I've been unable to reproduce this. A minimal (i.e. the smallest possible) test case that reproduces this (at least some of the time, ideally all of the time) is going to be required to investigate this further. It would also be worth testing with 8.5.9.
Thank you for your quick reply. I acknowledge your request and will do my best to prepare this. However, I have to finish urgent work before a long christmas break and I might not do this before the second half of January. For the time being, I stick to 8.5.6 as I saw no security patch of interest to me but will test with the latest version (8.5.9 or later) then.
I would be interested to know if Tomcat 8.5.8 with the libtcnative from 8.5.6 still exhibits the same problem. Is that something you can test easily?
Yes, in fact, I already tried it. And yes, it has the same behaviour.
Hi. I tested with Tomcat 8.5.10 (built from sources as the release is still in progress), and it works fine. So, I think we can close this issue.
Created attachment 34790 [details] 8.5.11 Windows Crash Log This crash still exists in 8.5.11 on Windows. I've uploaded my log.
Reopening, following mattcoz comment.
Steps to reproduce please on a clean install of the latest release of any of the supported Tomcat versions (7.0.x, 8.0.x, 8.5.x, 9.0.x) and the latest Tomcat Native release (1.2.x).
Unfortunately I have no idea what is causing it and I don't know how to reproduce it. I installed a fresh copy of 8.5.11 with tcnative 1.2.10 and then copied in my configuration files. It crashed only once since I installed it on the 27th. It had crashed 6 times in 8.5.6 since late November, but 3 of those were in the last week and I'm not sure what could have changed. I did install a new SSL certificate, but it seems to be working fine most of the time. I really wish I could be of more help.
Well it was going well for a while, then it crashed again last night. I'm at a loss here, what can I do to help figure this out?
Steps to reproduce still required.
I'm sorry, I have no steps, I don't know how to reproduce it. Is there any other information I can give you? Is there anything I should do to try to figure out how to reproduce it?
Looking at his log and my logs, the commonality is that it starts with a call to getAttributeNames on the request. Not sure if that helps at all, that's a call that works all the time without crashing. I have multiple sites running on the server and the crashes are not limited to just one of them. None of them are doing anything out of the ordinary, and nothing in any of my logs looks suspicious. I would LOVE to give you the info you need, but I just don't know what that could be. I need some guidance more than just NEEDINFO.
Another crash last night, please let me know what I can do to help.
Just crashed again, is there anything I can look for to help? Please?
Created attachment 35004 [details] Another crash log
Are you really going to update this BZ every two weeks whenever you experience a JVM crash ? Surely you realize posting the same thing ten times isn't going to help, right ? I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this isn't going to be fixed for a while, most likely.
(In reply to Remy Maucherat from comment #20) > Are you really going to update this BZ every two weeks whenever you > experience a JVM crash ? Surely you realize posting the same thing ten times > isn't going to help, right ? > > I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this > isn't going to be fixed for a while, most likely. I know it's not helping, but nobody is telling me what I can do to help. I was under the impression that NIO used the JSSE implementation, which is one of the reasons I was hesitant to make the switch. It looks like this is no longer the case though, so I'll give it a shot. Thanks, but hopefully somebody can eventually fix this.
*** Bug 61149 has been marked as a duplicate of this bug. ***
Some initial investigation based upon the initial bug report + crash dump. 1. The call from Java is AprSSLSupport.getCipherSuite 2. AprSSLSupport.getCipherSuite calls SSLSocket.getInfoS [native] ... with arg values [socketRef, a pointer to a tcn_socket_t, Java-long] and [SSL.SSL_INFO_CIPHER, value=0x02 Java-int] 3. The native call, for SSL_INFO_CIPHER should execute the following code (compiled without DEBUG): TCN_IMPLEMENT_CALL(jstring, SSLSocket, getInfoS)(JNIEnv *e, jobject o, jlong sock, jint what) { tcn_socket_t *a = J2P(sock, tcn_socket_t *); tcn_ssl_conn_t *s; jstring value = NULL; apr_status_t rv = APR_SUCCESS; (o)=(o); // Don't complain that we aren't referencing a function arg s = (tcn_ssl_conn_t *)(a->opaque); if(what == SSL_INFO_CIPHER) // this is actually a switch case value = tcn_new_string(e, SSL_get_cipher_name(s->ssl)); if (rv != APR_SUCCESS) // Doesn't happen tcn_ThrowAPRException(e, rv); return value; } There really aren't many opportunities for a crash, here: 1. Without DEBUG enabled, the value of "sock" isn't tested to be non-NULL after the (o)=(o) line, so theoretically the local 'a' could be NULL 2. The value of 's' is never checked against NULL, and could segfault when dereferencing s->ssl The dump says the fault is in *this* function, and e.g. not a function called by it. SSL_get_cipher_name returns NULL if there isn't any SSL session.[1] The tcn_new_string function returns NULL if the argument is NULL, so it's safe to call without checking the return value of SSL_get_cipher_name. In the Windows dump, the referenced address is spelled-out: "reading address 0x0000000000000130". That's not "NULL". In the Linux dump, the referenced address is not directly shown. The Linux dump tells us where in the function we are: Java_org_apache_tomcat_jni_SSLSocket_getInfoS+0x19 That's 25 bytes into the function, which is "not very far". I'm not entirely sure if the instructions for setting-up the stack for local variables are included in that 0x19, but if they are, then it's possible that we are talking about the very beginning of the function, here. Here's the gdb disassembly of the beginning of the function: (gdb) disassemble Java_org_apache_tomcat_jni_SSLSocket_getInfoS Dump of assembler code for function Java_org_apache_tomcat_jni_SSLSocket_getInfoS: 0x00000000000200f0 <+0>: mov QWORD PTR [rsp-0x30],rbx 0x00000000000200f5 <+5>: mov QWORD PTR [rsp-0x20],r12 0x00000000000200fa <+10>: mov ebx,ecx 0x00000000000200fc <+12>: mov QWORD PTR [rsp-0x18],r13 0x0000000000020101 <+17>: mov QWORD PTR [rsp-0x28],rbp 0x0000000000020106 <+22>: mov r12,rdi 0x0000000000020109 <+25>: mov QWORD PTR [rsp-0x10],r14 <---- crash here?! 0x000000000002010e <+30>: mov QWORD PTR [rsp-0x8],r15 At this point, no accesses have taken place other than things on the stack. Basically, we haven't actually done anything, yet. If these really are two reports of the same bug, then the Windows report tells us we are trying to read from a bad location. Everything up through getInfoS+0x19 reads from registers, which should never cause a SIGSEGV. :( I'm trying to make some sense of the stacks in these dumps but ... they aren't making any sense to me... things like stack frames that are thousands of bytes large, missing data (0x2 has to be in there somewhere!) but I'll have to save that for another day. [1] https://www.openssl.org/docs/man1.1.0/ssl/SSL_get_cipher_name.html
Are any of the reporters of this bug able to compile their own custom tcnative library if I provide a small instrumentation patch?
I can try to compile my own version of the tcnative library.
I can. For GNU/Linux.
(In reply to Remy Maucherat from comment #20) > Are you really going to update this BZ every two weeks whenever you > experience a JVM crash ? Surely you realize posting the same thing ten times > isn't going to help, right ? > > I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this > isn't going to be fixed for a while, most likely. Ok, so my server is now running NIO with OpenSSL and I got another crash. It seems to be related to this as it has a common origin. j org.apache.tomcat.jni.SSL.getCipherForSSL(J)Ljava/lang/String;+0 j org.apache.tomcat.util.net.openssl.OpenSSLEngine$OpenSSLSession.getCipherSuite()Ljava/lang/String;+30 j org.apache.tomcat.util.net.jsse.JSSESupport.getCipherSuite()Ljava/lang/String;+13 j org.apache.coyote.AbstractProcessor.populateSslRequestAttributes()V+11 vs J 63194 org.apache.tomcat.jni.SSLSocket.getInfoS(JI)Ljava/lang/String; (0 bytes) @ 0x00000000016e3a04 [0x00000000016e39c0+0x44] j org.apache.tomcat.util.net.AprSSLSupport.getCipherSuite()Ljava/lang/String;+24 j org.apache.coyote.AbstractProcessor.populateSslRequestAttributes()V+11 So it appears that the problem extends beyond tcnative.
This *is* tcnative. It extends beyond the scope of the APR connector. Can you reproduce this in a testing environment? I have a small patch for sslinfo.c: At line 293, add this: s = (tcn_ssl_conn_t *)(a->opaque); // <--- this is existing line 292 assert(NULL != sock); assert(NULL != a); assert(NULL != s); assert(NULL != s->ssl); Re-build and let me know if it aborts on any of those lines.
Ludovic, why are you building OpenSSL in "no-shared" mode? If you are building OpenSSL in no-shared mode, how is OpenSSL 1.0.2j being loaded into the JVM process? Are you statically-linking OpenSSL into libtcnative?
Christopher : yes, I statically link the openssl lib to avoid depending on the versions installed on the host system, as I update them much quicker.
Maybe I did something really stupid, but I get : jsvc.exec: symbol lookup error: /usr/share/java/tomcat-base_senateurs-commun/libtcnative-1.so.0.2.12: undefined symbol: assert
Created attachment 35051 [details] Test patch for OpenSSL engine As the OpenSSL engine is affected, I guess it's inevitable some SSL session accesses are asynchronous. As a result, sync/cache has to be added. As I am not reproducing it, this is only an experiment patch that can be used for further testing. The APR connector would need something similar as well. The performance impact should be quite minimal.
Created attachment 35056 [details] Patch for OpenSSLEngine
I improved the patch (hopefully) and committed it, it will be in 9M22 and 8.5.16. I haven't looked at the APR connector issue.
For the APR connector issue, since I believe it is always possible and legitimate to have a concurrent socket close especially with HTTP/2, the accesses from AprSSLSupport would need to sync on AprEndpoint.AprSocketWrapper.closedLock. At the moment, all methods check that socketWrapper.getSocket().longValue() is not 0, but that won't work if concurrency is possible.
Created attachment 35069 [details] APR SSL support sync For APR, here is what the patch would look like. Please test it if you can.
I think this should be fixed by syncing for now. The changes are now included for APR and OpenSSL in 9M23 and 8.5.17.
I'm updated to 8.5.16 using NIO, I'll be sure to let you know if the crashes continue. Thanks.
Is it known whether this bug was introduced in 8.5.x? Or did it also exist in 8.0.x or 7.0.x?