Bug 56188

Summary: mod_proxy_fcgi does not send FCGI_ABORT_REQUEST on client disconnect
Product: Apache httpd-2 Reporter: Justin <justine>
Component: mod_proxy_fcgiAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: normal CC: jason.michaelson, jim, rlifshay+abug, szg0000, toscano.luca
Priority: P2 Keywords: PatchAvailable
Version: 2.4.7   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Bug fix patch
Updated patch
Previous patch with comments and rebased for trunk
Send FCGI_ABORT to the backend when the client disconnects
Send FCGI_ABORT to the backend when the client disconnects
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch

Description Justin 2014-02-25 19:15:26 UTC
Steps to reproduce:
1. Send an XHR to a fcgi handler that takes a while to complete
2. Abort the xhr request

Expected Result:
mod_proxy_fcgi should send a FCGI_ABOR_REQUEST message, as specified in the fast cgi specification: http://www.fastcgi.com/devkit/doc/fcgi-spec.html#S5.4

Actual Result:
mod_proxy_fcgi ignores client disconnect and attempts to continue processing the response.
Comment 1 Jason Michaelson 2015-12-10 14:58:46 UTC
Created attachment 33336 [details]
Bug fix patch

The attached patch fixes this bug. We have been shipping a patched version of Apache for nearly two years using this patch.
Comment 2 Jason Michaelson 2016-01-13 17:48:50 UTC
Created attachment 33433 [details]
Updated patch

Updated patch file against 2.4.18.
Comment 3 Luca Toscano 2016-07-26 08:12:04 UTC
I was able to repro the problem with a simple php script that sleeps for a while, you are definitely right. I checked very quickly the patch and it looks good, but I am not super expert with mod_proxy_fcgi so I'll need to review it more carefully and ask for some advice.

Thanks a lot for the patch and the effort!
Comment 4 Luca Toscano 2016-07-26 12:53:33 UTC
Created attachment 34073 [details]
Previous patch with comments and rebased for trunk

Adding the same patch provided before but with some comments and a fix for a make error (send_data taking too many arguments).
Comment 5 Jason Michaelson 2016-11-02 18:01:03 UTC
Is there any chance of this bug fix getting included in a 2.4 release in the near future?
Comment 6 Jim Jagielski 2016-12-19 15:35:03 UTC
Thanks. Will review. Can anyone confirm that this also affects trunk as well? I see no reason why it would be different, so I assume so.

Will try to recreate.
Comment 7 Luca Toscano 2016-12-21 11:20:09 UTC
It affects trunk, the patch name was only a way to say that I had to rework a bit Justin's patch to make trunk compile, sorry :)

I got bitten by this issue at work yesterday (long php jobs), in my case (to be verified) even the ProxyTimeout/Timeout settings might trigger this problem (so the FCGI backend will not get notified that httpd timed out the client's request).
Comment 8 Luca Toscano 2016-12-22 17:44:51 UTC
Another script that illustrates the issue:

<?php

// current time
echo date('h:i:s') . "\n";

// sleep for 10 seconds
sleep(10);

$myfile = fopen("/tmp/timetable", "w") or die("Unable to open file!");

for ($i=0; $i<1000000; $i++) {
   $txt = date('h:i:s') . "\n";
   fwrite($myfile, $txt);
}

fclose($myfile);

// wake up !
echo date('h:i:s') . "\n";

?>

The /tmp/timetable will be written every time, even if:

1) Timeout or ProxyTimeout are set to 5 seconds (random number less than 10).
2) A client disconnects happens during the sleep time.
Comment 9 Luca Toscano 2016-12-23 17:43:22 UTC
So while reading http://www.mit.edu/~yandros/doc/specs/fcgi-spec.html#S5.4 I came across this:

> A Web server aborts a FastCGI request when an HTTP client closes its transport connection while the FastCGI request is running on behalf of that client. 

> The Web server sends a FCGI_ABORT_REQUEST record to abort a request. After receiving {FCGI_ABORT_REQUEST, R}, the application responds as soon as possible with {FCGI_END_REQUEST, R, {FCGI_REQUEST_COMPLETE, appStatus}}. This is truly a response from the application, not a low-level acknowledgement from the FastCGI library.

As far as I can see the patch breaks out of the main while after sending the FCGI ABORT, not waiting for the FCGI_END_REQUEST response. I tried to change it a bit to allow this flow, but from my test I never get FCGI_END_REQUEST because mod_proxy_fcgi stops before it finding out that the client dropped the connection (probably when it tries to flush the output).

I also didn't manage to see the FCGI backend stopping after the FCGI ABORT is sent, but maybe I am missing something trivial. I'll try to work on this code more during the next days.

Jason: just want to make it clear, your patch looks good, I just want to make sure that we don't miss any corner case :)

Thanks!
Comment 10 Luca Toscano 2016-12-23 21:28:34 UTC
Another bit of info:

http://php.net/manual/en/function.ignore-user-abort.php

> PHP will not detect that the user has aborted the connection until an attempt is made to send information to the client.
> Simply using an echo statement does not guarantee that information is sent, see flush().

In the above example, adding ob_flush() and flush() triggers a stop of the PHP processing right after the curl HTTP request / connection is aborted.

Justin, Jason - would you mind to add a tiny example of your XHR use case and how it fails? Having a good repro will surely help in speeding up this review process (at least for me!).

Thanks!
Comment 11 Luca Toscano 2016-12-24 12:38:53 UTC
I believe that I have a better understanding of the issue, that can be probably summarized in two things:

1) mod_proxy_fcgi should detect as early as possible that the client connection dropped taking appropriate actions, for example sending FCGI_ABORT to the backend and cleaning its state.

2) The FCGI backend needs to react after receiving a FCGI ABORT appropriately.

About 1), I have some ideas still to verify:

- Rather than creating a new apr_pollfd_t, the client aborted use case might be handled by something simpler like the following (missing the FCGI_ABORT bit):

Index: modules/proxy/mod_proxy_fcgi.c
===================================================================
--- modules/proxy/mod_proxy_fcgi.c	(revision 1775858)
+++ modules/proxy/mod_proxy_fcgi.c	(working copy)
@@ -481,6 +481,12 @@
          * cause timeout errors. */
         apr_socket_timeout_get(conn->sock, &timeout);

+        if (r->connection->aborted) {
+            ap_log_rerror(APLOG_MARK, APLOG_TRACE1, rv, r,
+                          "The client aborted the connection.");
+            break;
+        }
+
         rv = apr_poll(&pfd, 1, &n, timeout);
         if (rv != APR_SUCCESS) {
             if (APR_STATUS_IS_EINTR(rv)) {

- The FCGI specs seems to indicate that after sending a FCGI ABORT, mod_proxy_fcgi should need to wait for a related FCGI_END_REQUEST before wrapping up (we are not doing it in the various patches). 

I am going to do some more research during the next days, but if anybody has more knowledge then me related to FCGI please let me know. I am far from a expert in the field and it would be great to deliver this patch as quickly as possible.
Comment 12 Luca Toscano 2016-12-27 19:26:35 UTC
Created attachment 34560 [details]
Send FCGI_ABORT to the backend when the client disconnects

Updated the patch with some trace error messages and a better handling of the response value rv. The 'dispatch' function checks the returned 'rv' value if not different from APR_SUCCESS, so the last one to return shouldn't always be the result of the send_data call for the FCGI_ABORT notification (that most of the times, if the backend is healthy, returns APR_SUCCESS).

Remaining unclear points:

1) A 503 is logged with this patch, meanwhile a 200 was logged for the previous version. What is the best thing to log in case a client connection is aborted?

2) When ProxyTimeout is hit, should we send a FCGI_ABORT notification to the backend?
Comment 13 Luca Toscano 2016-12-27 20:03:18 UTC
Auto-answering the following previous point:

> - The FCGI specs seems to indicate that after sending a FCGI ABORT, mod_proxy_fcgi should need to wait for a related FCGI_END_REQUEST before wrapping up (we are not doing it in the various patches). 

I haven't found any good implementation/handling of FCGI_ABORT in common fcgi backends, but some ad-hoc ones might want to rely on it. I don't think that httpd needs to strictly follow the protocol specification for this use case (that seems to be waiting for FCGI_END_REQUEST), but I am open to suggestions.
Comment 14 Luca Toscano 2016-12-28 12:55:54 UTC
To add some references (still to verify):

- https://github.com/facebook/hhvm/blob/HHVM-3.17/hphp/runtime/server/http-request-handler.cpp#L411

- https://github.com/dreamcat4/php-fpm/blob/master/cgi/fastcgi.h#L37

From the HHVM commentary:

"""
  /* FastCGI ABORT_REQUEST prematurely terminates a request, signaling that
   * the webserver is no longer interested in processing it. The contentData
   * for such recoreds is empty. The application server must reply with
   * an END_REQUEST record to indicate that it has aborted.
   */
"""

So php-fpm seems not supporting it, meanwhile HHVM returns a 503 by default. These are not the full spectrum of FCGI backends but two famous ones that I know.
Comment 15 Luca Toscano 2016-12-30 19:16:04 UTC
Created attachment 34572 [details]
Send FCGI_ABORT to the backend when the client disconnects

Explicitly added the code to instruct httpd that the r->connection is aborted. In trunk I can see the following strange log:

[Fri Dec 30 19:08:52.950149 2016] [proxy_fcgi:trace1] [pid 3966:tid 139835091949312] mod_proxy_fcgi.c(597): [client 127.0.0.1:57840] EOF detected from the main client connection.
[Fri Dec 30 19:08:52.950180 2016] [proxy_fcgi:trace1] [pid 3966:tid 139835091949312] mod_proxy_fcgi.c(893): (70014)End of file found: [client 127.0.0.1:57840] The client aborted the connection.
[Fri Dec 30 19:08:52.950186 2016] [proxy:debug] [pid 3966:tid 139835091949312] proxy_util.c(2232): AH00943: FCGI: has released connection for (127.0.0.1)
[Fri Dec 30 19:08:52.950210 2016] [headers:trace2] [pid 3966:tid 139835091949312] mod_headers.c(875): AH01502: headers: ap_headers_output_filter()
[Fri Dec 30 19:08:52.950242 2016] [http:trace3] [pid 3966:tid 139835091949312] http_filters.c(1034): [client 127.0.0.1:57840] Response sent with status 200, headers:
[Fri Dec 30 19:08:52.950245 2016] [http:trace5] [pid 3966:tid 139835091949312] http_filters.c(1041): [client 127.0.0.1:57840]   Date: Fri, 30 Dec 2016 19:08:51 GMT
[Fri Dec 30 19:08:52.950247 2016] [http:trace5] [pid 3966:tid 139835091949312] http_filters.c(1044): [client 127.0.0.1:57840]   Server: Apache/2.5.0-dev (Unix)
[Fri Dec 30 19:08:52.950253 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Vary: User-Agent
[Fri Dec 30 19:08:52.950256 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Content-Length: 0
[Fri Dec 30 19:08:52.950257 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Connection: close
[Fri Dec 30 19:08:52.950261 2016] [core:trace6] [pid 3966:tid 139835091949312] util_filter.c(803): [client 127.0.0.1:57840] reinstate empty brigade to full brigade in 'req_core' output filter
[Fri Dec 30 19:08:52.950264 2016] [core:trace8] [pid 3966:tid 139835091949312] util_filter.c(914): [client 127.0.0.1:57840] brigade contains: bytes: 145, non-file bytes: 145, eor buckets: 0, morphing buckets: 0
[Fri Dec 30 19:08:52.950272 2016] [http:trace3] [pid 3966:tid 139835091949312] http_filters.c(1034): [client 127.0.0.1:57840] Response sent with status 200, headers:
[Fri Dec 30 19:08:52.950274 2016] [http:trace5] [pid 3966:tid 139835091949312] http_filters.c(1041): [client 127.0.0.1:57840]   Date: Fri, 30 Dec 2016 19:08:51 GMT
[Fri Dec 30 19:08:52.950276 2016] [http:trace5] [pid 3966:tid 139835091949312] http_filters.c(1044): [client 127.0.0.1:57840]   Server: Apache/2.5.0-dev (Unix)
[Fri Dec 30 19:08:52.950278 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Vary: User-Agent
[Fri Dec 30 19:08:52.950280 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Content-Length: 0
[Fri Dec 30 19:08:52.950282 2016] [http:trace4] [pid 3966:tid 139835091949312] http_filters.c(863): [client 127.0.0.1:57840]   Connection: close
[Fri Dec 30 19:08:52.950284 2016] [core:trace6] [pid 3966:tid 139835091949312] util_filter.c(803): [client 127.0.0.1:57840] reinstate empty brigade to full brigade in 'req_core' output filter
[Fri Dec 30 19:08:52.950285 2016] [core:trace8] [pid 3966:tid 139835091949312] util_filter.c(914): [client 127.0.0.1:57840] brigade contains: bytes: 145, non-file bytes: 145, eor buckets: 0, morphing buckets: 0
[Fri Dec 30 19:08:53.895646 2016] [core:trace4] [pid 3964:tid 139835914131200] mpm_common.c(557): mpm child 3966 (gen 0/slot 0) exited
[Fri Dec 30 19:08:53.895721 2016] [core:trace4] [pid 3964:tid 139835914131200] mpm_common.c(557): mpm child 3969 (gen 0/slot 3) exited

So there is a double "HTTP 200 logged", meanwhile in 2.4.x this issue does not appear (only one 200 logged as expected). I'll try to run gdb to find what is the difference, but from a first glance it seems async conn handling.
Comment 16 Luca Toscano 2017-01-02 17:17:42 UTC
Created attachment 34574 [details]
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch

Added comments and renamed the patch since it does not send anymore any FCGI_ABORT.
Comment 17 Luca Toscano 2017-01-06 16:19:36 UTC
(In reply to Luca Toscano from comment #14)
> To add some references (still to verify):
> 
> -
> https://github.com/facebook/hhvm/blob/HHVM-3.17/hphp/runtime/server/http-
> request-handler.cpp#L411
> So php-fpm seems not supporting it, meanwhile HHVM returns a 503 by default.
> These are not the full spectrum of FCGI backends but two famous ones that I
> know.

I was probably wrong about HHVM:

https://github.com/facebook/hhvm/blob/HHVM-3.17/hphp/runtime/server/fastcgi/fastcgi-session.cpp#L475-L480

void FastCGISession::onRecordImpl(const fcgi::abort_record* rec) {
  if (!m_requestId || rec->requestId != m_requestId) {
    // Garbage record
    dropConnection();
    return;
  }
}

It seems to just drop the connection if an abort is sent. Again, I am not an expert so if anybody has more info please reply :)
Comment 18 Luca Toscano 2017-01-24 12:45:19 UTC
Feedback from the dev@ mailing list, summarized by the words of Jacob Champion:

"""
I don't think we should implement FCGI_ABORT until/unless we decide to implement FCGI multiplexing. I understand that the bug's OP is operating happily with an FCGI_ABORT patch, but until they respond to you to clarify what backend they're using, we don't know if their approach is correct even for their own use case.

I don't think we should send FCGI_ABORT and then immediately close the connection without waiting for the client to respond. In my mind, sending FCGI_ABORT is a contract: "we will accept and discard a reasonable number of messages for the current request ID while we wait for you to send FCGI_END_REQUEST".

I do think that closing the FCGI connection when the client closes theirs is valuable. Eric's suggestion on #httpd to hide this behind a directive is probably wise; we've broken enough FCGI backends recently...
"""

So the next step is to modify my patch to include a module directive that guards this new behavior.
Comment 19 Luca Toscano 2017-01-27 18:44:44 UTC
Created attachment 34685 [details]
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch

First attempt to implement a conservative ProxyFCGIDetectClientDisconnect (server) directive.
Comment 20 Luca Toscano 2017-01-29 09:40:31 UTC
Created attachment 34692 [details]
httpd-trunk-mod_proxy_fcgi-client_disconnected.patch

Replaced (apr_socket_t*) ap_get_module_config(r->connection->conn_config, &core_module); with ap_get_conn_socket(r->connection);
Comment 21 Luca Toscano 2017-01-29 09:48:22 UTC
Jacob in #httpd-dev asked two interesting questions:

1) Would we get a EOF even when the client's TCP connection is in a state like half-closed? (So basically doing a correct shutdown).

2) How many times is the POLLIN (registered to on the client's TCP connection to detect disconnects) case executed for a "typical" POST request? It seems like that code would get triggered for every incoming group of bytes.
Comment 22 Yann Ylavic 2017-01-29 17:30:36 UTC
(In reply to Luca Toscano from comment #21)
> 
> 1) Would we get a EOF even when the client's TCP connection is in a state
> like half-closed? (So basically doing a correct shutdown).

Half-closed by whom (and which way)?

A shutdown() for write sends a FIN to the peer, so a poll() there would return with POLLIN and/or POLLHUP (depending on the system), and read()/recv() returns 0 (EOF).
It's still possible to poll()/read() an half-closed connection (shut down for write only), hence wait for the peer's data or FIN (read only connection).

A shutdown() for read is not "visible" remotely, so it actually has little value (i.e. just stop reading).

But I don't know how you'd like to use half-closing, so please clarify ;)

> 
> 2) How many times is the POLLIN (registered to on the client's TCP
> connection to detect disconnects) case executed for a "typical" POST
> request? It seems like that code would get triggered for every incoming
> group of bytes.

Yes, poll() is woken up for each piece of data (or FIN) available in the socket, and until there is nothing more to read immediately (i.e. unless subsequent recv()s empty the socket, poll() won't block).

So it depends on the size of the buffer used for reading, or if multiple nonblocking reads will empty the queue.
Comment 23 Luca Toscano 2017-01-30 10:42:48 UTC
Hi Yann! 

I took some notes in this task without adding too many info, my bad. I am currently trying to understand if the method used to detect a client disconnect (i.e. poll() the client TCP connection for POLLIN events) can lead to bugs or performance regressions. Following up inline:

(In reply to Yann Ylavic from comment #22)
> (In reply to Luca Toscano from comment #21)
> > 
> > 1) Would we get a EOF even when the client's TCP connection is in a state
> > like half-closed? (So basically doing a correct shutdown).
> 
> Half-closed by whom (and which way)?
> 
> A shutdown() for write sends a FIN to the peer, so a poll() there would
> return with POLLIN and/or POLLHUP (depending on the system), and
> read()/recv() returns 0 (EOF).
> It's still possible to poll()/read() an half-closed connection (shut down
> for write only), hence wait for the peer's data or FIN (read only
> connection).
> 
> A shutdown() for read is not "visible" remotely, so it actually has little
> value (i.e. just stop reading).
> 
> But I don't know how you'd like to use half-closing, so please clarify ;)

My question was if there was any corner case in which if, after a client has initiated a TCP connection close, mod-proxy-fcgi wrongly assume that the connection dropped (because of POLLIN and EOF) taking its own remediation (error logs, etc..).


> > 
> > 2) How many times is the POLLIN (registered to on the client's TCP
> > connection to detect disconnects) case executed for a "typical" POST
> > request? It seems like that code would get triggered for every incoming
> > group of bytes.
> 
> Yes, poll() is woken up for each piece of data (or FIN) available in the
> socket, and until there is nothing more to read immediately (i.e. unless
> subsequent recv()s empty the socket, poll() won't block).
> 
> So it depends on the size of the buffer used for reading, or if multiple
> nonblocking reads will empty the queue.


In this case I am wondering if my patch would introduce a performance regression for use cases like big requests coming in (like POSTs for example). We don't really care about reading those bytes (in the code introduced by my patch) but only to check if the connection with the client is still up, so there might be a better way to approach the issue.

Let me know if my questions are clearer or if you need more info! :)
(Or if I need to study more the subject!)
Comment 24 Jacob Champion 2017-01-30 18:13:34 UTC
(In reply to Luca Toscano from comment #21)
> 1) Would we get a EOF even when the client's TCP connection is in a state
> like half-closed? (So basically doing a correct shutdown).

I actually want to semi-retract this question... I don't *think* HTTP has a provision for half-closes since it's not required to be implemented over TCP. At the very least, requests cannot make use of a half-close to delimit the end of a request message body, which was my primary concern.
Comment 25 Luca Toscano 2017-02-04 15:16:03 UTC
From a discussion in dev@ about polling the client's TCP connection:

"""
It could be achievable, but not easy, with care taken to pull the
client's socket out of the pollset if anything but an error or a
connection/TLS close is detected.
It probably also shouldn't start before "last_stdin" is true.

Client side poll() may return with either:
1. HTTP data (pipelined) => client still alive, non-empty/meta brigade
=> don't abort
2. TCP close or reset => bucket EOS or an APR_E* (though
speculative-non-blocking reads won't return EOS, and may turn EOF into
SUCCESS with empty brigade!) => abort
3. TLS close notify => EOS/EOF? => abort
4. TLS renegotiation => rejected by httpd (since initiated by the
client) => an APR_E* => abort
n. ...

Probably not an exhaustive list...
"""

So even if the patch provided by Jason works and looks good, it might introduce some weird corner cases for the general consumption that might create regressions during the next release. This doesn't mean that the patch can't work for specific use cases, but it might not be the best one for all of them.

Another approach that was proposed by Yann is the following: use non blocking recv while reading from the FCGI backend, and in case of EAGAIN send a FLUSH bucket to the client in order to force ap_pass_brigade to detect if the client aborted the connection.

poll() => POLLIN => read() while data available => EAGAIN => flush => poll()

Remark: The "read() while data available => EAGAIN" may be replaced by "read()
while read buffer is full" so to avoid a (likely) useless last
non-blocking read.

Remark 2: https://httpd.apache.org/docs/2.4/developer/output-filters.html#basics

"""
FLUSH buckets are sent when the content generator (or an upstream filter) knows that there may be a delay before more content can be sent. By passing FLUSH buckets down the filter chain immediately, filters ensure that the client is not kept waiting for pending data longer than necessary.
Filters can create FLUSH buckets and pass these down the filter chain if desired. Generating FLUSH buckets unnecessarily, or too frequently, can harm network utilisation since it may force large numbers of small packets to be sent, rather than a small number of larger packets.
"""