Bug 60461 - SIGSEGV in SSLSocket.getInfos
Summary: SIGSEGV in SSLSocket.getInfos
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 8.5.8
Hardware: PC Linux
: P2 regression with 5 votes (vote)
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
: 61149 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-12-09 13:00 UTC by Ludovic Pénet
Modified: 2019-10-03 17:43 UTC (History)
1 user (show)



Attachments
JVM crash log (109.27 KB, text/plain)
2016-12-09 13:00 UTC, Ludovic Pénet
Details
8.5.11 Windows Crash Log (59.25 KB, text/plain)
2017-03-01 23:05 UTC, mattcoz
Details
Another crash log (62.79 KB, text/plain)
2017-05-23 17:38 UTC, mattcoz
Details
Test patch for OpenSSL engine (4.96 KB, patch)
2017-06-13 19:33 UTC, Remy Maucherat
Details | Diff
Patch for OpenSSLEngine (5.29 KB, patch)
2017-06-14 11:54 UTC, Remy Maucherat
Details | Diff
APR SSL support sync (5.80 KB, patch)
2017-06-22 19:24 UTC, Remy Maucherat
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ludovic Pénet 2016-12-09 13:00:55 UTC
Created attachment 34512 [details]
JVM crash log

I recently upgraded my development setup from Tomcat 8.5.6 to 8.5.8 and experiences a lot of SIGSEGV in Tomcat Native since then.

It always happens when calling SSLSocket.getInfos, but not on every call.

My HTTPS connector is configured this way :

    <Connector port="8443" SSLEnabled="true"
                protocol="org.apache.coyote.http11.Http11AprProtocol"
                maxThreads="150" scheme="https" secure="true"
                sslProtocol="TLS"
                sslImplementationName="org.apache.tomcat.util.net.openssl.OpenSSLImplementation"
		SSLCertificateFile="${catalina.home}/conf/certificate.crt"
                SSLCertificateKeyFile="${catalina.home}/conf/privateKey.key"
		server="Apache-Coyote/1.1"
                URIEncoding="UTF-8">
      <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
    </Connector>

I use Google Chrome 55 as HTTP client.

Debugging the project with Netbeans and attaching GDB to the java process, I get the following stacktrace when SEGV :

__GI_raise (sig=6, sig@entry=6)	
__GI_abort ()	
os::abort(bool) ()	
VMError::report_and_die() ()	
JVM_handle_linux_signal ()	
signalHandler(int, siginfo*, void*) ()	
<signal handler called> ()	
Java_org_apache_tomcat_jni_SSLSocket_getInfoS (e=0x7fc41861b1f8, o=0x7fc37a6f2b68, sock=140478657220768, what=2)	

So, the violation occurs on the last line of :

TCN_IMPLEMENT_CALL(jstring, SSLSocket, getInfoS)(TCN_STDARGS, jlong sock,
                                                 jint what)
{
    tcn_socket_t   *a = J2P(sock, tcn_socket_t *);
    tcn_ssl_conn_t *s;
    jstring value = NULL;
    apr_status_t rv = APR_SUCCESS;

    UNREFERENCED(o);
    TCN_ASSERT(sock != 0);

    s = (tcn_ssl_conn_t *)(a->opaque);

Please find attached the JVM crash log.

To debug, I configure native with :

    ./configure CC=${IDE_CC} CXX=${IDE_CXX} CFLAGS="-g3 -gdwarf-2" CXXFLAGS="-g3 -gdwarf-2" --with-java-home=/usr/lib/jvm/java-8-oracle --with-ssl=/home/lpenet/devoss/openssl-1.0.2j

and configure openssl 1.0.2j with the following command line :

    ./config no-shared -fPIC
Comment 1 Ludovic Pénet 2016-12-09 13:12:21 UTC
I did not explicitly mentionned it, but I use stock openssl sources and the native version included in tomcat 8.5.8 sources.
Comment 2 Remy Maucherat 2016-12-09 13:17:05 UTC
Maybe, but when I see that, I think you should complain to other people as well:
j  org.apache.catalina.connector.Request.getAttributeNames()Ljava/util/Enumeration;+17
j  org.apache.myfaces.webapp.ManagedBeanDestroyerListener.requestDestroyed(Ljavax/servlet/ServletRequestEvent;)V+13

It's only cosmetic, bt why are you specifying sslImplementationName when you're using APR ? Do you expect it to use it JSSE by accident ? server="Apache-Coyote/1.1" is the default already, so no need to
Comment 3 Ludovic Pénet 2016-12-09 13:37:00 UTC
Specifying sslImplementationName is just plain paranoia, and to make it clear for coworkers arriving on the project and not already knowing it is the default with APR.
Comment 4 Mark Thomas 2016-12-12 11:59:30 UTC
The MyFaces code doesn't seem to be doing anything unreasonable (like retaining references to previous requests).

I've been unable to reproduce this. A minimal (i.e. the smallest possible) test case that reproduces this (at least some of the time, ideally all of the time) is going to be required to investigate this further.

It would also be worth testing with 8.5.9.
Comment 5 Ludovic Pénet 2016-12-12 12:14:49 UTC
Thank you for your quick reply.

I acknowledge your request and will do my best to prepare this.

However, I have to finish urgent work before a long christmas break and I might not do this before the second half of January.

For the time being, I stick to 8.5.6 as I saw no security patch of interest to me but will test with the latest version (8.5.9 or later) then.
Comment 6 Christopher Schultz 2016-12-12 15:35:27 UTC
I would be interested to know if Tomcat 8.5.8 with the libtcnative from 8.5.6 still exhibits the same problem. Is that something you can test easily?
Comment 7 Ludovic Pénet 2016-12-12 15:37:34 UTC
Yes, in fact, I already tried it.

And yes, it has the same behaviour.
Comment 8 Ludovic Pénet 2017-01-09 17:31:03 UTC
Hi.

I tested with Tomcat 8.5.10 (built from sources as the release is still in progress), and it works fine.

So, I think we can close this issue.
Comment 9 mattcoz 2017-03-01 23:05:22 UTC
Created attachment 34790 [details]
8.5.11 Windows Crash Log

This crash still exists in 8.5.11 on Windows. I've uploaded my log.
Comment 10 Ludovic Pénet 2017-03-02 08:23:39 UTC
Reopening, following mattcoz comment.
Comment 11 Mark Thomas 2017-03-02 21:10:53 UTC
Steps to reproduce please on a clean install of the latest release of any of the supported Tomcat versions (7.0.x, 8.0.x, 8.5.x, 9.0.x) and the latest Tomcat Native release (1.2.x).
Comment 12 mattcoz 2017-03-02 21:42:46 UTC
Unfortunately I have no idea what is causing it and I don't know how to reproduce it. I installed a fresh copy of 8.5.11 with tcnative 1.2.10 and then copied in my configuration files. It crashed only once since I installed it on the 27th. It had crashed 6 times in 8.5.6 since late November, but 3 of those were in the last week and I'm not sure what could have changed. I did install a new SSL certificate, but it seems to be working fine most of the time. I really wish I could be of more help.
Comment 13 mattcoz 2017-04-21 03:53:20 UTC
Well it was going well for a while, then it crashed again last night. I'm at a loss here, what can I do to help figure this out?
Comment 14 Mark Thomas 2017-04-21 16:22:06 UTC
Steps to reproduce still required.
Comment 15 mattcoz 2017-04-21 16:32:06 UTC
I'm sorry, I have no steps, I don't know how to reproduce it. Is there any other information I can give you? Is there anything I should do to try to figure out how to reproduce it?
Comment 16 mattcoz 2017-04-21 21:28:15 UTC
Looking at his log and my logs, the commonality is that it starts with a call to getAttributeNames on the request. Not sure if that helps at all, that's a call that works all the time without crashing. I have multiple sites running on the server and the crashes are not limited to just one of them. None of them are doing anything out of the ordinary, and nothing in any of my logs looks suspicious. I would LOVE to give you the info you need, but I just don't know what that could be. I need some guidance more than just NEEDINFO.
Comment 17 mattcoz 2017-05-11 17:15:17 UTC
Another crash last night, please let me know what I can do to help.
Comment 18 mattcoz 2017-05-23 17:37:28 UTC
Just crashed again, is there anything I can look for to help? Please?
Comment 19 mattcoz 2017-05-23 17:38:45 UTC
Created attachment 35004 [details]
Another crash log
Comment 20 Remy Maucherat 2017-05-23 21:15:06 UTC
Are you really going to update this BZ every two weeks whenever you experience a JVM crash ? Surely you realize posting the same thing ten times isn't going to help, right ?

I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this isn't going to be fixed for a while, most likely.
Comment 21 mattcoz 2017-05-23 22:09:46 UTC
(In reply to Remy Maucherat from comment #20)
> Are you really going to update this BZ every two weeks whenever you
> experience a JVM crash ? Surely you realize posting the same thing ten times
> isn't going to help, right ?
> 
> I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this
> isn't going to be fixed for a while, most likely.

I know it's not helping, but nobody is telling me what I can do to help. I was under the impression that NIO used the JSSE implementation, which is one of the reasons I was hesitant to make the switch. It looks like this is no longer the case though, so I'll give it a shot. Thanks, but hopefully somebody can eventually fix this.
Comment 22 Remy Maucherat 2017-06-02 17:45:59 UTC
*** Bug 61149 has been marked as a duplicate of this bug. ***
Comment 23 Christopher Schultz 2017-06-02 22:38:16 UTC
Some initial investigation based upon the initial bug report + crash dump.

1. The call from Java is AprSSLSupport.getCipherSuite
2. AprSSLSupport.getCipherSuite calls SSLSocket.getInfoS [native]
... with arg values [socketRef, a pointer to a tcn_socket_t, Java-long]
                and [SSL.SSL_INFO_CIPHER, value=0x02 Java-int]
3. The native call, for SSL_INFO_CIPHER should execute the following code (compiled without DEBUG):

TCN_IMPLEMENT_CALL(jstring, SSLSocket, getInfoS)(JNIEnv *e, jobject o,
                                                 jlong sock, jint what)
{
    tcn_socket_t   *a = J2P(sock, tcn_socket_t *);
    tcn_ssl_conn_t *s;
    jstring value = NULL;
    apr_status_t rv = APR_SUCCESS;

    (o)=(o); // Don't complain that we aren't referencing a function arg

    s = (tcn_ssl_conn_t *)(a->opaque);

    if(what == SSL_INFO_CIPHER) // this is actually a switch case
        value = tcn_new_string(e, SSL_get_cipher_name(s->ssl));

    if (rv != APR_SUCCESS) // Doesn't happen
        tcn_ThrowAPRException(e, rv);

    return value;
}

There really aren't many opportunities for a crash, here:

1. Without DEBUG enabled, the value of "sock" isn't tested to be non-NULL after the (o)=(o) line, so theoretically the local 'a' could be NULL
2. The value of 's' is never checked against NULL, and could segfault when dereferencing s->ssl

The dump says the fault is in *this* function, and e.g. not a function called by it.
SSL_get_cipher_name returns NULL if there isn't any SSL session.[1]
The tcn_new_string function returns NULL if the argument is NULL, so it's safe to call without checking the return value of SSL_get_cipher_name.

In the Windows dump, the referenced address is spelled-out: "reading address 0x0000000000000130". That's not "NULL".
In the Linux dump, the referenced address is not directly shown.

The Linux dump tells us where in the function we are:
    Java_org_apache_tomcat_jni_SSLSocket_getInfoS+0x19

That's 25 bytes into the function, which is "not very far". I'm not entirely sure if the instructions for setting-up the stack for local variables are included in that 0x19, but if they are, then it's possible that we are talking about the very beginning of the function, here.

Here's the gdb disassembly of the beginning of the function:

(gdb) disassemble Java_org_apache_tomcat_jni_SSLSocket_getInfoS
Dump of assembler code for function Java_org_apache_tomcat_jni_SSLSocket_getInfoS:
   0x00000000000200f0 <+0>:	mov    QWORD PTR [rsp-0x30],rbx
   0x00000000000200f5 <+5>:	mov    QWORD PTR [rsp-0x20],r12
   0x00000000000200fa <+10>:	mov    ebx,ecx
   0x00000000000200fc <+12>:	mov    QWORD PTR [rsp-0x18],r13
   0x0000000000020101 <+17>:	mov    QWORD PTR [rsp-0x28],rbp
   0x0000000000020106 <+22>:	mov    r12,rdi
   0x0000000000020109 <+25>:	mov    QWORD PTR [rsp-0x10],r14    <---- crash here?!
   0x000000000002010e <+30>:	mov    QWORD PTR [rsp-0x8],r15

At this point, no accesses have taken place other than things on the stack. Basically, we haven't actually done anything, yet.

If these really are two reports of the same bug, then the Windows report tells us we are trying to read from a bad location. Everything up through getInfoS+0x19 reads from registers, which should never cause a SIGSEGV.

:(

I'm trying to make some sense of the stacks in these dumps but ... they aren't making any sense to me... things like stack frames that are thousands of bytes large, missing data (0x2 has to be in there somewhere!) but I'll have to save that for another day.

[1] https://www.openssl.org/docs/man1.1.0/ssl/SSL_get_cipher_name.html
Comment 24 Christopher Schultz 2017-06-02 22:39:30 UTC
Are any of the reporters of this bug able to compile their own custom tcnative library if I provide a small instrumentation patch?
Comment 25 Thomas 2017-06-03 07:32:29 UTC
I can try to compile my own version of the tcnative library.
Comment 26 Ludovic Pénet 2017-06-06 09:19:58 UTC
I can. For GNU/Linux.
Comment 27 mattcoz 2017-06-06 17:34:09 UTC
(In reply to Remy Maucherat from comment #20)
> Are you really going to update this BZ every two weeks whenever you
> experience a JVM crash ? Surely you realize posting the same thing ten times
> isn't going to help, right ?
> 
> I'd recommend trying NIO(1 or 2) with OpenSSL at this point, since this
> isn't going to be fixed for a while, most likely.

Ok, so my server is now running NIO with OpenSSL and I got another crash. It seems to be related to this as it has a common origin.

j  org.apache.tomcat.jni.SSL.getCipherForSSL(J)Ljava/lang/String;+0
j  org.apache.tomcat.util.net.openssl.OpenSSLEngine$OpenSSLSession.getCipherSuite()Ljava/lang/String;+30
j  org.apache.tomcat.util.net.jsse.JSSESupport.getCipherSuite()Ljava/lang/String;+13
j  org.apache.coyote.AbstractProcessor.populateSslRequestAttributes()V+11

vs

J 63194  org.apache.tomcat.jni.SSLSocket.getInfoS(JI)Ljava/lang/String; (0 bytes) @ 0x00000000016e3a04 [0x00000000016e39c0+0x44]
j  org.apache.tomcat.util.net.AprSSLSupport.getCipherSuite()Ljava/lang/String;+24
j  org.apache.coyote.AbstractProcessor.populateSslRequestAttributes()V+11

So it appears that the problem extends beyond tcnative.
Comment 28 Christopher Schultz 2017-06-06 18:18:33 UTC
This *is* tcnative.

It extends beyond the scope of the APR connector. Can you reproduce this in a testing environment? I have a small patch for sslinfo.c:

At line 293, add this:

    s = (tcn_ssl_conn_t *)(a->opaque);    // <--- this is existing line 292
    assert(NULL != sock);
    assert(NULL != a);
    assert(NULL != s);
    assert(NULL != s->ssl);

Re-build and let me know if it aborts on any of those lines.
Comment 29 Christopher Schultz 2017-06-06 18:28:07 UTC
Ludovic, why are you building OpenSSL in "no-shared" mode? If you are building OpenSSL in no-shared mode, how is OpenSSL 1.0.2j being loaded into the JVM process? Are you statically-linking OpenSSL into libtcnative?
Comment 30 Ludovic Pénet 2017-06-07 09:05:32 UTC
Christopher : yes, I  statically link the openssl lib to avoid depending on the versions installed on the host system, as I update them much quicker.
Comment 31 Ludovic Pénet 2017-06-09 11:06:07 UTC
Maybe I did something really stupid, but I get : 

jsvc.exec: symbol lookup error: /usr/share/java/tomcat-base_senateurs-commun/libtcnative-1.so.0.2.12: undefined symbol: assert
Comment 32 Remy Maucherat 2017-06-13 19:33:26 UTC
Created attachment 35051 [details]
Test patch for OpenSSL engine

As the OpenSSL engine is affected, I guess it's inevitable some SSL session accesses are asynchronous. As a result, sync/cache has to be added. As I am not reproducing it, this is only an experiment patch that can be used for further testing.
The APR connector would need something similar as well.

The performance impact should be quite minimal.
Comment 33 Remy Maucherat 2017-06-14 11:54:47 UTC
Created attachment 35056 [details]
Patch for OpenSSLEngine
Comment 34 Remy Maucherat 2017-06-19 14:44:51 UTC
I improved the patch (hopefully) and committed it, it will be in 9M22 and 8.5.16. I haven't looked at the APR connector issue.
Comment 35 Remy Maucherat 2017-06-19 20:48:00 UTC
For the APR connector issue, since I believe it is always possible and legitimate to have a concurrent socket close especially with HTTP/2, the accesses from AprSSLSupport would need to sync on AprEndpoint.AprSocketWrapper.closedLock. At the moment, all methods check that socketWrapper.getSocket().longValue() is not 0, but that won't work if concurrency is possible.
Comment 36 Remy Maucherat 2017-06-22 19:24:00 UTC
Created attachment 35069 [details]
APR SSL support sync

For APR, here is what the patch would look like. Please test it if you can.
Comment 37 Remy Maucherat 2017-06-23 21:37:41 UTC
I think this should be fixed by syncing for now. The changes are now included for APR and OpenSSL in 9M23 and 8.5.17.
Comment 38 mattcoz 2017-06-30 14:54:27 UTC
I'm updated to 8.5.16 using NIO, I'll be sure to let you know if the crashes continue. Thanks.
Comment 39 mattk 2019-10-03 17:43:55 UTC
Is it known whether this bug was introduced in 8.5.x?  Or did it also exist in 8.0.x or 7.0.x?