Bug 57771

Summary: cleanup_script uses incorrect connection ID
Product: Apache httpd-2 Reporter: Michael Thorpe <mike>
Component: mod_cgidAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: normal Keywords: FixedInTrunk
Priority: P2    
Version: 2.4.12   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Attachments: Simple fix
Simple fix

Description Michael Thorpe 2015-03-27 17:58:02 UTC
Created attachment 32618 [details]
Simple fix

The cleanup_script function in mod_cgid.c (line 1347 in 2.4.12) initializes req.conn_id to info->r->connection->id when it should be using the saved conn_id in the cleanup_script_info structure. This causes the wrong process (or a nonexistant process) to be cleaned up if the connection ID changes during processing (e.g., due to waiting around).

Attached is a patch for 2.4.12 that fixes this.
Comment 1 Nick Warne 2016-03-28 11:22:56 UTC
Created attachment 33706 [details]
Simple fix

Updated for apache 2.4.18
Comment 2 Nick Warne 2016-03-28 11:24:15 UTC
This issue still exists for 2.4.18.
Comment 3 Eric Covener 2016-04-26 21:24:48 UTC
Anyone have an idea about what causes the ID to change? I don't think the pool cleanups should run on a different thread then where the cgid handler ran.
Comment 4 Michael Thorpe 2016-04-26 23:03:41 UTC
(In reply to Eric Covener from comment #3)
> Anyone have an idea about what causes the ID to change? I don't think the
> pool cleanups should run on a different thread then where the cgid handler
> ran.

Connection IDs can change any time the thread is put to sleep, which can happen basically any time Apache is waiting for something (like the output of an #exec). This bug was discovered on an SSI page that had an #exec followed by some #flastmod's. The #exec'ed script doesn't get cleaned up until the end of the page, and either due to load or scheduling issues the page would get handed off to another thread between the #flastmod's and the cleanup (although it could just as easily be handed off during the page generation itself).

Nick Warne has a trace showing the problem if you need to see the timing, but the exact timing of the request isn't the problem. The problem is that mod_cgid is using the *current* connection ID (which isn't tied to the existing CGI if the connection gets handed off) instead of using the saved connection ID that was used to create the CGI process.

This is why info->conn_id exists in the first place. You can see mod_cgid.c use it just 16 lines down from where it incorrectly uses info->r->connection->conn_id (line 1386 in 2.4.18). So when the thread has been handed off, mod_cgid will try to clean up the wrong ID then spit out an error message that refers to a different ID.

The patch is very simple and obviously correct, and should be applied to trunk. Is there a reason it hasn't been?
Comment 5 Eric Covener 2016-04-26 23:41:23 UTC
(In reply to Michael Thorpe from comment #4)
> (In reply to Eric Covener from comment #3)
> > Anyone have an idea about what causes the ID to change? I don't think the
> > pool cleanups should run on a different thread then where the cgid handler
> > ran.
> 
> Connection IDs can change any time the thread is put to sleep, which can
> happen basically any time Apache is waiting for something (like the output
> of an #exec). 

While my "cgid handler" comment is invalid because this is the INCLUDES filter calling into mod_cgid, I still don't see why there'd be any thread change here.  Can you capture "trace8" of this happening for a single request on a provided simple SSI?

For me, a #exec from an SSI that sleeps for a second runs and cleans up on the same thread and never has a different connection ID, as expected:

[Tue Apr 26 19:36:23.540275 2016] [cgid:trace8] [pid 2561:tid 139709191948032] mod_cgid.c(1385): [client 127.0.0.1:42188] COV: cleanup cid 2 cid 2 pid 2686
[Tue Apr 26 19:36:24.548188 2016] [cgid:trace8] [pid 2561:tid 139709191948032] mod_cgid.c(1385): [client 127.0.0.1:42188] COV: cleanup cid 2 cid 2 pid 2728
[Tue Apr 26 19:36:25.556125 2016] [cgid:trace8] [pid 2561:tid 139709191948032] mod_cgid.c(1385): [client 127.0.0.1:42188] COV: cleanup cid 2 cid 2 pid 2730
[Tue Apr 26 19:36:26.564173 2016] [cgid:trace8] [pid 2561:tid 139709191948032] mod_cgid.c(1385): [client 127.0.0.1:42188] COV: cleanup cid 2 cid 2 pid 2732
[Tue Apr 26 19:36:27.572111 2016] [cgid:trace8] [pid 2561:tid 139709191948032]
....
Index: modules/generators/mod_cgid.c
===================================================================
--- modules/generators/mod_cgid.c       (revision 51728)
+++ modules/generators/mod_cgid.c       (working copy)
@@ -1382,6 +1382,7 @@
         return stat;
     }
 
+    ap_log_rerror(APLOG_MARK, APLOG_TRACE8, 0, info->r, "COV: cleanup cid %lu cid %lu pid %d", info->r->connection->id, info->conn_id, pid);
     if (pid == 0) {
         ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, info->r, APLOGNO(01261)
                       "daemon couldn't find CGI process for connection %lu",
Comment 6 Michael Thorpe 2016-04-27 00:53:11 UTC
(In reply to Eric Covener from comment #5)
> While my "cgid handler" comment is invalid because this is the INCLUDES
> filter calling into mod_cgid, I still don't see why there'd be any thread
> change here.

As a simple scenario, assume there are two MPM event threads:
1) Request A comes in and thread #1 starts handling it.
2) Request B comes in and thread #2 starts handling it.
3) Request A creates a CGI using conn_id=1.
4) Thread #1 puts request A to sleep while it waits for the CGI.
5) Request C comes in and thread #1 starts handling it (using conn_id=1).
6) Thread #2 finishes handling request B.
7) The CGI finishes, and thread #2 picks it up (changing its conn_id to 2) since thread #1 is still handling request C.
8) Request A finishes and thread #2 enters the cleanup_script function.
9) Thread #2 stuffs the *current* connection ID (in this case, 2) and asks the daemon for the PID for that conn_id, but that fails because no CGI was created with that conn_id.
10) Thread #2 prints an error message saying that it couldn't clean up the CGI for conn_id #1, because the error message code is correct even though the daemon-request code is wrong.

> Can you capture "trace8" of this happening for a single request on a provided simple SSI?

On a single request you'll never see the problem.  It's only when the server is loaded that requests can get moved around between threads.
Comment 7 Eric Covener 2016-04-27 01:21:03 UTC
> 4) Thread #1 puts request A to sleep while it waits for the CGI.

This bit doesn't make sense to me, but maybe someone else can clarify.
Comment 8 Nick Warne 2016-04-27 15:41:20 UTC
(In reply to Eric Covener from comment #7)
> > 4) Thread #1 puts request A to sleep while it waits for the CGI.
> 
> This bit doesn't make sense to me, but maybe someone else can clarify.

FYI Eric, I see the issue on my NRTG page, which uses a lot of #flastmod.  This small patch totally fixes it up, and I have to apply it manually each new apache release.

Nick
Comment 9 Eric Covener 2016-08-27 23:17:49 UTC
(In reply to Nick Warne from comment #8)
> (In reply to Eric Covener from comment #7)
> > > 4) Thread #1 puts request A to sleep while it waits for the CGI.
> > 
> > This bit doesn't make sense to me, but maybe someone else can clarify.
> 
> FYI Eric, I see the issue on my NRTG page, which uses a lot of #flastmod. 
> This small patch totally fixes it up, and I have to apply it manually each
> new apache release.
> 
> Nick

Sorry I didn't get back to this. Can you try http://svn.apache.org/viewvc?view=revision&revision=1758083 in your env? It's in the same neighborhood, but I think a different cause (IMO the cause ITT is not known even if the one-liner works.)
Comment 10 Eric Covener 2016-12-31 00:26:15 UTC
Fixed in 2.4.25 (no feedback but very similar problem fixed and confirmed)