Bug 63328 - Apparent race condition causes undeserved 500 / connection reset by peer errors
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_fcgid
Version: 2.4.25
Hardware: PC Linux
Importance: P2 major
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
Reported: 2019-04-09 03:15 UTC by tlhackque
Modified: 2019-04-09 03:15 UTC
Description tlhackque 2019-04-09 03:15:40 UTC
There appears to be a race condition in mod_fcgid.

Here's what I see:

A Perl (CGI::Fast) application decides (actually is told) to exit.

In response to a POST, it issues a (303) redirect to its status page and exit()s - expecting that a new instance will be started to service the redirect.

HTTPD reports:

[Sun Apr 07 13:24:13.991499 2019] [fcgid:warn] [pid 17236] (104)Connection reset by peer: [client ...] mod_fcgid: error reading data from FastCGI server, referer: ...
[Sun Apr 07 13:24:13.994622 2019] [core:error] [pid 17236] [...] End of script output before headers: notice.fcgi, referer: https://.../fastbrowser

Modsec's helpful audit log says:
Apache-Error: [file "fcgid_proc_unix.c"] [line 627] [level 4] [status 104] mod_fcgid: error reading data from FastCGI server
Apache-Error: [file "util_script.c"] [line 500] [level 3] %s: %s
Apache-Handler: fcgid-script


And the web browser gets a 500 error from httpd.

What's interesting is that if the server generates a 200 response, the error doesn't happen.

Further, if the application generates the 303 without doing anything else, the 500 isn't generated; the redirect works.

The failure seems to be timing-sensitive.  My working theory is that:

The (Perl application) server exits, closing the FCGI server connection.  If a 200 is provided before the exit, all goes as expected.  A redirect takes time - the browser sends the follow-up GET some time later.  If it arrives much later, it hits a new server instance.  If it arrives at just the wrong moment, it starts to get sent to the (now exiting) server; the connection close is noticed, and the request is lost to the 500.

This reproduces consistently with a real application.  I've tried to cut it down to a reproducer, but failed.

I tried various ways to prevent this - including sending 'LastCall' - none work in the real application.

httpd 2.4.25, mod_fcgid 2.3.9, CGI::Fast 2.15, FCGI 0.78.

Here is my attempt at a small reproducer.  While I haven't found the right magic to reproduce the problem, it clearly illustrates the failing application structure. (For simplicity, this is all done with GET, but that shouldn't matter.)

Usage:

Set up shutdown.fcgi to run as a script at, say, /test.fcgi
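
One way to do that setup is sketched below. It is only an illustration, not the configuration of the real application: the directory /var/www/fcgi is a made-up path, and limiting the class to one process is an assumption meant to mirror the "only one server" condition described further down.

```apache
# Hypothetical location of the reproducer; adjust to the local layout.
Alias /test.fcgi /var/www/fcgi/shutdown.fcgi

<Directory /var/www/fcgi>
    Options +ExecCGI
    AddHandler fcgid-script .fcgi
    Require all granted
</Directory>

# Assumption: keep a single server process so every request meets the same instance.
FcgidMaxProcessesPerClass 1
```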

Browse to /test.fcgi - hit refresh, you will see the Requests served counter increment.

Now browse to /test.fcgi/shutdown - the server issues a redirect and exits.  You will see that the response has a new PID, the requests served goes back to 1, and the URL in the address bar is now /test.fcgi/LoopExit.

Or (change the if( 01 ) to if( 0 )), it invokes LastCall - which tells the library explicitly not to accept more requests - then falls out of the loop synchronously to exit.  The GET invoked by the redirect should start a new server; instead you get the 500 error.  In the real application, the 500 errors are 100% reproducible.  I haven't found the right timing to make the reproducer fail - and if I did, I suspect that timing would not be portable to other machines.
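
For anyone hunting the timing window, a small driver that requests the shutdown URL repeatedly and follows the redirect immediately may be more productive than a browser.  The following is a minimal sketch, not something that has been shown to trigger the failure: tally_statuses and the FCGID_TEST_URL environment variable are made-up names, and the count of 100 attempts is arbitrary.  HTTP::Tiny is used because it ships with Perl and follows redirects on GET.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Call $fetch->($url) $tries times and count the final HTTP status codes.
# (Hypothetical helper; $fetch must return a hashref with a 'status' key.)
sub tally_statuses {
    my ( $fetch, $url, $tries ) = @_;
    my %seen;
    $seen{ $fetch->($url)->{status} }++ for 1 .. $tries;
    return \%seen;
}

# FCGID_TEST_URL is a made-up variable name, e.g.
#   FCGID_TEST_URL=http://localhost/test.fcgi/shutdown perl driver.pl
if ( my $url = $ENV{FCGID_TEST_URL} ) {
    my $http = HTTP::Tiny->new( max_redirect => 5 );    # follow the 303
    my $seen = tally_statuses( sub { $http->get( $_[0] ) }, $url, 100 );
    printf "%s: %d\n", $_, $seen->{$_} for sort keys %$seen;
}
```

Any 500 in the tally would correspond to the failure described above; a status of 599 is HTTP::Tiny's own code for a request it could not complete.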

What I expect is that once the server exits (and especially with LastCall invoked), mod_fcgid will pass incoming requests to another server instance, starting a new one if necessary.  (In the real app, it is guaranteed that there is only one server at this time.)  If one can't be found/started, the response should be something like "no servers available", not "Internal error" with logging that blames the server.

Here's the (very small) almost-reproducer.  The structure is the same as the real application.

#!/usr/bin/perl


use warnings;
use strict;

require CGI::Fast;

my $n = 0;    # requests served by this process
my $q;
while( ( $q = CGI::Fast->new ) ) {
    # Variable work here

    if( ( $ENV{PATH_INFO} // '' ) eq '/shutdown' ) {
        if( 01 ) {
            print( <<"xx" );
Status: 303 See other
Location: /test.fcgi/LoopExit

Server $$ shutdown after $n requests
xx
            exit(0);
        }
        no warnings 'once';
        $CGI::Fast::Ext_Request->LastCall;
        next;
    }

    $n++;
    print( <<"XX" );
Status: 200 OK
Content-Type: text/plain

Server $$, Requests served: $n
XX
}
# Here when CGI::Fast returns undef to shut down.
print STDERR ( "ERR: Server $$ shutdown after $n requests\n" ) if( 0 );


exit(0);

Finally, my workaround is to send a buffer page - it waits 15 seconds and then does a JavaScript redirect.  This works every time - but is a horrible user experience...
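
For illustration, a buffer page of that kind could look roughly like the sketch below.  This is not the real application's code: buffer_page, its parameters and the page text are all hypothetical.  It replaces the 303 with an ordinary 200 response whose script performs the redirect after the delay.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the complete CGI response for the buffer page.
# $target is the URL to load after the delay, $delay_s the delay in seconds.
sub buffer_page {
    my ( $target, $delay_s ) = @_;
    my $delay_ms = $delay_s * 1000;
    return <<"PAGE";
Status: 200 OK
Content-Type: text/html

<html><body>
<p>Server restarting; you will be redirected in $delay_s seconds.</p>
<script>setTimeout(function () { window.location.replace("$target"); }, $delay_ms);</script>
</body></html>
PAGE
}

# In the shutdown branch, in place of printing the 303:
#   print buffer_page( '/test.fcgi/LoopExit', 15 );
#   exit(0);
```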