Bug 58901 - Download of large file is aborted in SSL_write on EINTR
Summary: Download of large file is aborted in SSL_write on EINTR
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_ssl (show other bugs)
Version: 2.4.4
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-21 05:34 UTC by Mark O'Donohue
Modified: 2019-01-06 12:55 UTC (History)
0 users



Attachments
test script files, one to download 20x 9m file, one to download smaller file (806 bytes, application/x-zip-compressed)
2016-01-21 05:34 UTC, Mark O'Donohue
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark O'Donohue 2016-01-21 05:34:54 UTC
Created attachment 33473 [details]
test script files, one to download 20x 9m file, one to download smaller file

A misconfiguration has revealed problem with SSL_write not handling EINTR .

Problem found in 2.4.4 but code is the same in 2.4.18 so expect the issue still occurs there as well. 

The problem occurs when large files over https://  and when the httpd process gets a signal in our case SIGUSR1. The large SSL download is aborted. 

A possible precondition is that the large download hits threshold and then results in synchronous write call - dont know just from some of the doco I read : 

[Tue Jan 19 05:04:59.742393 2016] [core:trace6] [pid 1402:tid 4115659632] core_filters.c(525): [client 10.134.120.119:59922] core_output_filter: flushing because of THRESHOLD_MAX_BUFFER
[Tue Jan 19 05:04:59.955371 2016] [core:trace6] [pid 1402:tid 4115659632] core_filters.c(525): [client 10.134.120.119:59922] core_output_filter: flushing because of THRESHOLD_MAX_BUFFER

Then with getting the SIGUSR1 and the download failure we see the following entry in the apache error_log : 

[Tue Jan 19 05:04:59.997093 2016] [ssl:info] [pid 1402:tid 4115659632] (4)Interrupted system call: [client 10.134.120.119:59922] AH01993: SSL output filter write failed.


The code link for 2.4.4 is here : 

https://svn.apache.org/viewvc/httpd/httpd/tags/2.4.4/modules/ssl/ssl_engine_io.c?view=markup#l761

760    outctx = (bio_filter_out_ctx_t *)filter_ctx->pbioWrite->ptr;
761	    res = SSL_write(filter_ctx->pssl, (unsigned char *)data, len);
...
...
778	        else if (ssl_err == SSL_ERROR_SYSCALL) {
779	            ap_log_cerror(APLOG_MARK, APLOG_INFO, outctx->rc, c, APLOGNO(01993)
780	                          "SSL output filter write failed.");
781	        }


So we can see SSL_ERROR_SYSCALL was the cause. 

Checking SSL_write, looks like on write fail it needs to test if can retry.  

http://stackoverflow.com/questions/24188013/openssl-and-signals

int result = SSL_write(ssl, buff, length);
if ((result < 0) && BIO_should_retry(SSL_get_wbio(ssl)))
    // need to retry


Here is sample retry code from OpenSSL

https://github.com/openssl/openssl/blob/919ba009429b3617e975933f37a23be996a33b8d/demos/bio/sconnect.c#L88

    for (;;) {
        i = BIO_write(out, &(p[off]), len);
        if (i <= 0) {
            if (BIO_should_retry(out)) {
                fprintf(stderr, "write DELAY\n");
                sleep(1);
                continue;
            } else {
                goto err;
            }
        }


There are also other samples within mod_ssl itself for checking EINTR for retry, as per this code around SSL_read, where we can see handling for EINTR that will resume : 

https://svn.apache.org/viewvc/httpd/httpd/tags/2.4.4/modules/ssl/ssl_engine_io.c?view=markup#l653


591	        rc = SSL_read(inctx->filter_ctx->pssl, buf + bytes, wanted - bytes);
...
...
653            else if (ssl_err == SSL_ERROR_SYSCALL) {
654	                if (APR_STATUS_IS_EAGAIN(inctx->rc)
655	                        || APR_STATUS_IS_EINTR(inctx->rc)) {
656	                    /* Already read something, return APR_SUCCESS instead. */
657	                    if (*len > 0) {
658	                        inctx->rc = APR_SUCCESS;
659	                        break;
660	                    }
661	                    if (inctx->block == APR_NONBLOCK_READ) {
662	                        break;
663	                    }
664	                    continue;  /* Blocking and nothing yet?  Try again. */
665	                }
666	                else {
667	                    ap_log_cerror(APLOG_MARK, APLOG_INFO, inctx->rc, c, APLOGNO(01991)
668	                                  "SSL input filter read failed.");
669	                }



For completeness this was from strace 

strace -v -o /tmp/strace.log -f -tt -s 16384 httpd/bin/apachectl -k start -D SSL


13817 23:11:00.530835 <... futex resumed> ) = 0
13817 23:11:00.530889 munmap(0xf5720000, 10489856) = 0
13817 23:11:00.530968 tgkill(13817, 13821, SIGUSR1 <unfinished ...>


13821 23:11:00.531027 <... poll resumed> ) = ? ERESTART_RESTARTBLOCK (To be rest
arted)
13817 23:11:00.531047 <... tgkill resumed> ) = 0
13821 23:11:00.531061 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
13821 23:11:00.531138 sigreturn( <unfinished ...>
13817 23:11:00.531158 futex(0xf41fdbd8, FUTEX_WAIT, 13821, NULL <unfinished ...>
13821 23:11:00.531175 <... sigreturn resumed> ) = -1 EINTR (Interrupted system c
all)
13821 23:11:00.531210 gettimeofday({1453263060, 531221}, NULL) = 0
13821 23:11:00.531274 write(2, "[Tue Jan 19 23:11:00.531221 2016] [ssl:info] [pi
d 13817:tid 4095728496] (4)Interrupted system call: [client 10.132.160.35:59069]
 AH01993: SSL output filter write failed.\n", 170) = 170
13821 23:11:00.531333 time(NULL)        = 1453263060
13821 23:11:00.531365 shutdown(14, 1 /* send */) = 0
13821 23:11:00.531664 poll([{fd=14, events=POLLIN}], 1, 500) = 1 ([{fd=14, reven
ts=POLLIN}])


Reproduction steps : 

The misconfiguration that made this problem visible was adding MaxRequestsPerChild = 1. to httpd-mpm.conf  : 

<IfModule mpm_worker_module>
    ...
    MaxRequestsPerChild      1
</IfModule>


The byproduct of that was that when one request was completed it then sent SIGUSR1 to the process, interrupting any working threads (which included our long running download). 

Here is the complete, MPM settings for worker, we also kept number of httpd proceses down, so we were more likely to get multiple requests goign to the same httpd process. 

<IfModule mpm_worker_module>
    StartServers             1
    MinSpareThreads         5
    MaxSpareThreads        25
    ThreadsPerChild         25
    MaxRequestWorkers      400
    MaxConnectionsPerChild   0
# tuneing - as per client
    ThreadLimit             128
    MaxClients             800
    ServerLimit            800
    MaxRequestsPerChild      1
</IfModule>


Then we run two scripts 
1) A script that downloads a large file (9) over SSL, 20 times. 
2) A script that repeatedly downloads a small .png file. 

(I will attach them)  

We find we can regularly interrupt the SSL download, so that none of the large downloads succeed. 

Obviously the problem is resolved by commenting out the parameter : 
    MaxRequestsPerChild      1
Which results in signal being sent after the small request is complete (since excceded maxrequests), and while the long download is still proceeding. 

And reduced signals will mean much less likely hood of the problem in SSL_write occuring. 

But I am sure there are legitemate times in operation when this condition will occur, and will result in aborted SSL download. 


Notes:

a) Problem does not occur over HTTP , only over HTTPS 
b) Problem happens with much smaller files too (300k), larger files just give more
   chance for the problem to occur.
Comment 1 Mark O'Donohue 2016-01-21 06:19:00 UTC
large file size was 9meg - reviewing I see that was not clear - Mark