Created attachment 33473 [details] test script files, one to download 20x 9m file, one to download smaller file A misconfiguration has revealed problem with SSL_write not handling EINTR . Problem found in 2.4.4 but code is the same in 2.4.18 so expect the issue still occurs there as well. The problem occurs when large files over https:// and when the httpd process gets a signal in our case SIGUSR1. The large SSL download is aborted. A possible precondition is that the large download hits threshold and then results in synchronous write call - dont know just from some of the doco I read : [Tue Jan 19 05:04:59.742393 2016] [core:trace6] [pid 1402:tid 4115659632] core_filters.c(525): [client 10.134.120.119:59922] core_output_filter: flushing because of THRESHOLD_MAX_BUFFER [Tue Jan 19 05:04:59.955371 2016] [core:trace6] [pid 1402:tid 4115659632] core_filters.c(525): [client 10.134.120.119:59922] core_output_filter: flushing because of THRESHOLD_MAX_BUFFER Then with getting the SIGUSR1 and the download failure we see the following entry in the apache error_log : [Tue Jan 19 05:04:59.997093 2016] [ssl:info] [pid 1402:tid 4115659632] (4)Interrupted system call: [client 10.134.120.119:59922] AH01993: SSL output filter write failed. The code link for 2.4.4 is here : https://svn.apache.org/viewvc/httpd/httpd/tags/2.4.4/modules/ssl/ssl_engine_io.c?view=markup#l761 760 outctx = (bio_filter_out_ctx_t *)filter_ctx->pbioWrite->ptr; 761 res = SSL_write(filter_ctx->pssl, (unsigned char *)data, len); ... ... 778 else if (ssl_err == SSL_ERROR_SYSCALL) { 779 ap_log_cerror(APLOG_MARK, APLOG_INFO, outctx->rc, c, APLOGNO(01993) 780 "SSL output filter write failed."); 781 } So we can see SSL_ERROR_SYSCALL was the cause. Checking SSL_write, looks like on write fail it needs to test if can retry. http://stackoverflow.com/questions/24188013/openssl-and-signals int result = SSL_write(ssl, buff, length); if ((result < 0) && BIO_should_retry(SSL_get_wbio(ssl))) // need to retry Here is sample retry code from OpenSSL https://github.com/openssl/openssl/blob/919ba009429b3617e975933f37a23be996a33b8d/demos/bio/sconnect.c#L88 for (;;) { i = BIO_write(out, &(p[off]), len); if (i <= 0) { if (BIO_should_retry(out)) { fprintf(stderr, "write DELAY\n"); sleep(1); continue; } else { goto err; } } There are also other samples within mod_ssl itself for checking EINTR for retry, as per this code around SSL_read, where we can see handling for EINTR that will resume : https://svn.apache.org/viewvc/httpd/httpd/tags/2.4.4/modules/ssl/ssl_engine_io.c?view=markup#l653 591 rc = SSL_read(inctx->filter_ctx->pssl, buf + bytes, wanted - bytes); ... ... 653 else if (ssl_err == SSL_ERROR_SYSCALL) { 654 if (APR_STATUS_IS_EAGAIN(inctx->rc) 655 || APR_STATUS_IS_EINTR(inctx->rc)) { 656 /* Already read something, return APR_SUCCESS instead. */ 657 if (*len > 0) { 658 inctx->rc = APR_SUCCESS; 659 break; 660 } 661 if (inctx->block == APR_NONBLOCK_READ) { 662 break; 663 } 664 continue; /* Blocking and nothing yet? Try again. */ 665 } 666 else { 667 ap_log_cerror(APLOG_MARK, APLOG_INFO, inctx->rc, c, APLOGNO(01991) 668 "SSL input filter read failed."); 669 } For completeness this was from strace strace -v -o /tmp/strace.log -f -tt -s 16384 httpd/bin/apachectl -k start -D SSL 13817 23:11:00.530835 <... futex resumed> ) = 0 13817 23:11:00.530889 munmap(0xf5720000, 10489856) = 0 13817 23:11:00.530968 tgkill(13817, 13821, SIGUSR1 <unfinished ...> 13821 23:11:00.531027 <... poll resumed> ) = ? ERESTART_RESTARTBLOCK (To be rest arted) 13817 23:11:00.531047 <... tgkill resumed> ) = 0 13821 23:11:00.531061 --- SIGUSR1 (User defined signal 1) @ 0 (0) --- 13821 23:11:00.531138 sigreturn( <unfinished ...> 13817 23:11:00.531158 futex(0xf41fdbd8, FUTEX_WAIT, 13821, NULL <unfinished ...> 13821 23:11:00.531175 <... sigreturn resumed> ) = -1 EINTR (Interrupted system c all) 13821 23:11:00.531210 gettimeofday({1453263060, 531221}, NULL) = 0 13821 23:11:00.531274 write(2, "[Tue Jan 19 23:11:00.531221 2016] [ssl:info] [pi d 13817:tid 4095728496] (4)Interrupted system call: [client 10.132.160.35:59069] AH01993: SSL output filter write failed.\n", 170) = 170 13821 23:11:00.531333 time(NULL) = 1453263060 13821 23:11:00.531365 shutdown(14, 1 /* send */) = 0 13821 23:11:00.531664 poll([{fd=14, events=POLLIN}], 1, 500) = 1 ([{fd=14, reven ts=POLLIN}]) Reproduction steps : The misconfiguration that made this problem visible was adding MaxRequestsPerChild = 1. to httpd-mpm.conf : <IfModule mpm_worker_module> ... MaxRequestsPerChild 1 </IfModule> The byproduct of that was that when one request was completed it then sent SIGUSR1 to the process, interrupting any working threads (which included our long running download). Here is the complete, MPM settings for worker, we also kept number of httpd proceses down, so we were more likely to get multiple requests goign to the same httpd process. <IfModule mpm_worker_module> StartServers 1 MinSpareThreads 5 MaxSpareThreads 25 ThreadsPerChild 25 MaxRequestWorkers 400 MaxConnectionsPerChild 0 # tuneing - as per client ThreadLimit 128 MaxClients 800 ServerLimit 800 MaxRequestsPerChild 1 </IfModule> Then we run two scripts 1) A script that downloads a large file (9) over SSL, 20 times. 2) A script that repeatedly downloads a small .png file. (I will attach them) We find we can regularly interrupt the SSL download, so that none of the large downloads succeed. Obviously the problem is resolved by commenting out the parameter : MaxRequestsPerChild 1 Which results in signal being sent after the small request is complete (since excceded maxrequests), and while the long download is still proceeding. And reduced signals will mean much less likely hood of the problem in SSL_write occuring. But I am sure there are legitemate times in operation when this condition will occur, and will result in aborted SSL download. Notes: a) Problem does not occur over HTTP , only over HTTPS b) Problem happens with much smaller files too (300k), larger files just give more chance for the problem to occur.
large file size was 9meg - reviewing I see that was not clear - Mark