Bug 63534

Summary: Large downloads fail with mod_http2 behind a proxy with proxyprotocol
Product: Apache httpd-2 Reporter: Sander Hoentjen <sander>
Component: mod_http2Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 2.4.39   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: http2 debug log of failed download
Log with LogLevel http2:trace2 of failed download
2.4.x patch v2
2.4.x patch v3
2.4.x patch v4

Description Sander Hoentjen 2019-07-01 15:06:10 UTC
Created attachment 36639 [details]
http2 debug log of failed download

Since updating httpd from 2.4.29 to 2.4.39 large(?) downloads fail when using http2.

We have httpd behind a haproxy instance in tcp-mode, we use proxyprotocol support in mod_remoteip.
With 2.4.29 downloads are okay. With 2.4.39 after some time the download aborts. It is not always after the same amount of time or bytes.

I found that setting `H2CopyFiles on` works around the issue, with this setting downloads are okay.

Without haproxy in front, downloads are okay as well.

If I disable http2, downloads are okay as well.

If you need me to provide extra info, please let me know.
Comment 1 Stefan Eissing 2019-07-02 06:59:57 UTC
Does this fail sometimes during production use only or are you able to reproduce this on an isolated connection? If so, the easiest to analyze would be a LogLevel http2:trace2 error log.
Comment 2 Sander Hoentjen 2019-07-02 07:22:28 UTC
Created attachment 36640 [details]
Log with LogLevel http2:trace2 of failed download
Comment 3 Sander Hoentjen 2019-07-02 07:26:05 UTC
I can reproduce this on an isolated connection. I stripped down the config to:
Loaded Modules:
 core_module (static)
 so_module (static)
 http_module (static)
 authz_core_module (shared)
 mpm_event_module (shared)
 http2_module (shared)
 log_config_module (shared)
 remoteip_module (shared)
 socache_shmcb_module (shared)
 ssl_module (shared)
 unixd_module (shared)

I reported against http2, since downloads are okay when I disable http2, but it is of course possible the bug lies in mod_remoteip
Comment 4 Stefan Eissing 2019-07-02 07:59:08 UTC
Thanks, that helps.

Which values do you have configured for Timeout and KeepAliveTimeout?

From the log, I can see that you probably run mpm_worker and that the h2 session, having filled the flowcontrol window for the stream, goes into a blocking read on the connection, waiting for the next frame from haproxy.

The block only lasts a second which suggests you have "KeepAliveTimeout 1"?
Comment 5 Sander Hoentjen 2019-07-02 08:15:50 UTC
Indeed, I have:
Timeout 120
KeepAliveTimeout 1

I didn't think KeepAliveTimeout would apply to 1 download request, but I guess this is somehow different with HTTP/2?
Comment 6 Stefan Eissing 2019-07-02 08:21:31 UTC
There is an interaction between the legacy HTTP/1.1 connection state handling and the "new" HTTP/2 stuff.  H1 has the three states: read request, write response, wait for next. While H2 has a variety of things.

So, I tried to match this in the 2.4 line, but it has limitations.

Can you verify that a larger KeepAliveTimeout works for you? Would this be acceptable? As an alternative, you could evaluate if mpm_event works better in your setup. We really recommend that for various reasons.
Comment 7 Sander Hoentjen 2019-07-02 08:37:21 UTC
Setting KeepAliveTimeout to 2 still failed for me, with 5 the download succeeded. We set this value low for a reason in the past, but of course the commit message only says "set keepalive timeout to 1 second", very helpful :)
I do see that we changed this to 1 when we still ran Apache 2.2 with prefork, so it might be time to reconsider :)

Sorry I failed to read you mentioned mpm_worker. We do use mod_event, it is listed in:
Loaded Modules:
 core_module (static)
 so_module (static)
 http_module (static)
 authz_core_module (shared)
 mpm_event_module (shared) <---------
 http2_module (shared)
 log_config_module (shared)
 remoteip_module (shared)
 socache_shmcb_module (shared)
 ssl_module (shared)
 unixd_module (shared)


I can try to see how increasing KeepAliveTimeout will go in production, but it feels a bit wrong to me that this setting is mapped to a single transfer in HTTP/2
I do understand that it is hard to map these kind of things to existing stuff. Maybe a mod_http2 setting could be added, that if unset falls back to KeepAliveTimeout?
BTW, do you know what changed between 2.4.29 and 2.4.39 that causes this different behaviour?
Comment 8 Stefan Eissing 2019-07-02 08:47:09 UTC
There were a lot of changes, but one was that http2 returns to the mpm connection handling when flow control is exhausted. This allow mpm_event to park the connection and use workers for others.

The fix is not easy, since the mpm connection handling does not know anything about the protocol it runs (but its state handling reflects HTTP/1.1, of course).

I share your view and configuration should be improved here. This is a discussion that we need to take to the dev mailing list.
Comment 9 Sander Hoentjen 2019-07-02 10:32:51 UTC
Thanks, I saw the message on the ML.
I have one final question: What are the downsides of having set `H2CopyFiles on` to workaround this issue?
Comment 10 Stefan Eissing 2019-07-02 10:54:11 UTC
With "H2CopyFiles on" static file contents are copied from the request worker to the main connection chunk by chunk. This buffers more memory and keeps a worker allocated much longer.

With "H2CopyFiles off" (and no compression filters etc.) the file handle is given to the main connection, freeing the worker. The handle is then read as h2 needs it for sending our the data frames. On h2c, even sendfile() gets used.
Comment 11 Stefan Eissing 2019-07-03 14:08:16 UTC
Created attachment 36642 [details]
2.4.x patch v2
Comment 12 Stefan Eissing 2019-07-03 14:09:01 UTC
If you can build your own 2.4.x httpd, you could evaluate the attached patch. This fixes the Timeout vs. KeepAliveTimeout behaviour in my tests.
Comment 13 Sander Hoentjen 2019-07-04 09:33:18 UTC
I just tried the patch, but it doesn't change anything for me. I still see:
h2_session(0,IDLE,1): enter idle, timeout = 1 sec
(if I change KeepAliveTimeout the timeout value changes as well)
Comment 14 Stefan Eissing 2019-07-04 10:16:52 UTC
The log statement is happening on a fresh connection:

ap_log_cerror(APLOG_MARK, APLOG_TRACE1, 0, session->c, 
   H2_SSSN_LOG("", session, "enter idle, timeout = %d sec"), 
   (int)apr_time_sec(H2MAX(session->s->timeout, 
           session->s->keep_alive_timeout)));

That is the maximum of "Timeout" and "KeepAliveTimeout" on the server. (If I am not totally mistaken here).

Are you sure you see this with your Timeout 120?
Comment 15 Sander Hoentjen 2019-07-04 12:14:08 UTC
Actually this log is a little bit lower in the file:

                ap_log_cerror(APLOG_MARK, APLOG_TRACE1, 0, session->c, 
                              H2_SSSN_LOG("", session, "enter idle, %s = %d sec"), 
                              s, (int)apr_time_sec(timeout));

So I guess we hit:
                else if (session->open_streams) {
                    s = "timeout";
                    timeout = session->s->keep_alive_timeout;
                    update_child_status(session, SERVER_BUSY_KEEPALIVE, "idle");
                }
Comment 16 Stefan Eissing 2019-07-04 12:30:06 UTC
Created attachment 36644 [details]
2.4.x patch v3
Comment 17 Stefan Eissing 2019-07-04 12:30:48 UTC
Of course, you are totally correct. I attached v3 of the patch that sets the correct timeout configurations. Hope this works better for you.
Comment 18 Sander Hoentjen 2019-07-04 13:26:01 UTC
The v3 patch works for me. Actually before you posted it I changed the same thing, but I also changed:
update_child_status(session, SERVER_BUSY_KEEPALIVE, "idle");
to:
update_child_status(session, SERVER_BUSY_READ, "idle");

This worked, but your patch works as well, I don't know which is actually more correct.
Comment 19 Stefan Eissing 2019-07-04 13:31:11 UTC
Created attachment 36645 [details]
2.4.x patch v4
Comment 20 Stefan Eissing 2019-07-04 13:31:49 UTC
Excellent, added you change to v4 of the patch. Will apply like this to trunk and propose for backport into the next release.

Thanks a lot!
Comment 21 Stefan Eissing 2019-07-31 11:49:00 UTC
Fixed in r1862475 and r1862583 (patch v4 mentioned here). Proposed for backport to 2.4.x release.
Comment 22 Stefan Eissing 2019-08-02 07:03:24 UTC
Fix was backported in r1864126 to 2.4.x branch and will be in the next release.