Bug 63503 - Reverse proxy server - SIGSEGV
Summary: Reverse proxy server - SIGSEGV
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy
Version: 2.4.39
Hardware: PC Linux
Importance: P2 major with 6 votes
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk, PatchAvailable
Depends on:
Blocks:
 
Reported: 2019-06-14 19:00 UTC by Don Poitras
Modified: 2020-01-14 19:28 UTC
CC List: 3 users



Attachments
Output of 'diff -u old new' (4.09 KB, patch)
2019-06-14 19:00 UTC, Don Poitras
Avoid double initialization (6.23 KB, patch)
2019-08-28 10:49 UTC, Ruediger Pluem
Use a sub-pool for DNS resolutions (4.53 KB, patch)
2019-08-29 16:10 UTC, Ruediger Pluem
63503_revert_r1865938_r1865944.diff (3.81 KB, patch)
2019-09-13 19:18 UTC, Ruediger Pluem
63503_merge.diff (9.74 KB, patch)
2019-09-13 19:19 UTC, Ruediger Pluem

Description Don Poitras 2019-06-14 19:00:42 UTC
Created attachment 36627 [details]
Output of 'diff -u old new'

Hello. Recently, we ran a stress test (using LoadRunner) against our backend server with Apache acting as a reverse-proxy server. We've been running this way for some years. It's not clear when this stress test was last run (or whether it ever was), but we noticed that we were getting SegVs in various locations (mainly in mod_proxy). We're running httpd 2.4.39 and APR 1.7.0 on RHEL 7.6. We're using the worker MPM; if we switch to the event or prefork MPMs, we don't get any errors. Our program also runs on AIX, where the event MPM isn't available, so we needed a fix. We haven't actually run the test on AIX, just Linux. We were able to reproduce the problem at 2.4.38 and 2.4.34 and didn't try any other versions.

Looking at the core files, we could see corrupted memory chains and sometimes we see a pointer to thread_mutex_cleanup(). Since this pointer is only referenced in the code as an argument to apr_pool_cleanup_register(), this led us to believe that we were seeing a case where two threads were accessing the same pool concurrently. We rebuilt httpd using --enable-pool-concurrency-check and the SegVs turned into aborts emitted by the pool concurrency check.

Looking at these new core files, it could be seen that two threads were concurrently accessing the same worker proxy connection pool. While proxy_util.c has one mutex call around the call to apr_sockaddr_info_get() (where the proxy connection pool is used), there were other calls in this module that needed the same protection.
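
For illustration, the one guarded call looks roughly like this (a simplified sketch of the pattern, not the exact proxy_util.c source; error handling trimmed):

  /* ap_proxy_determine_connection(), simplified: the DNS lookup that
   * allocates from the shared per-worker connection pool is wrapped in
   * the worker's thread mutex... */
  if ((err = PROXY_THREAD_LOCK(worker)) != APR_SUCCESS) {
      return HTTP_INTERNAL_SERVER_ERROR;
  }
  err = apr_sockaddr_info_get(&(worker->cp->addr),
                              conn->hostname, APR_UNSPEC,
                              conn->port, 0,
                              worker->cp->pool); /* shared pool */
  PROXY_THREAD_UNLOCK(worker);
  /* ...while other code paths allocate from worker->cp->pool without
   * taking this (or any) lock, which is the race we hit. */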

I found some previous bug reports that might be due to this same issue:

https://bz.apache.org/bugzilla/show_bug.cgi?id=58491
https://bz.apache.org/bugzilla/show_bug.cgi?id=58529
https://bz.apache.org/bugzilla/show_bug.cgi?id=62357

We've coded a fix and I'll attach the patch. The change is against the top of the tree, with the same changes I've tested at 2.4.39. I left the APLOGNO() calls empty, i.e. I have not run the update-log-msg-tags Perl script. Let me know if there are questions or concerns. Thanks.
Comment 1 Don Poitras 2019-06-20 19:08:23 UTC
Added the "PatchAvailable" keyword.
Comment 2 Jim Jagielski 2019-06-21 19:50:03 UTC
From what I can see, you are basically just protecting some areas w/ mutexes. Is that right? There seems to be quite a bit of fluff w/ additional variables being defined and set that seem superfluous.
Comment 3 Don Poitras 2019-06-21 20:10:34 UTC
Yes, it's just mutex protection. The only variables I added were the return code fields and the indicator that determines whether this is the first call to ap_proxy_initialize_worker() (in which case the extra mutex calls are not needed) or a subsequent call. The error logging code around PROXY_THREAD_LOCK/UNLOCK is just copied from the call that already exists in ap_proxy_determine_connection(). I tried to pattern the calls as closely as possible on what's already in the code.
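
A hypothetical sketch of the shape described above (the name first_init and the way it is derived are made up for illustration; the actual change is in the attached patch):

  apr_status_t rv = APR_SUCCESS;
  /* On the first initialization no other thread can be using
   * worker->cp->pool yet, so the extra locking is skipped. */
  int first_init = !(worker->local_status & PROXY_WORKER_INITIALIZED);

  if (!first_init) {
      if ((rv = PROXY_THREAD_LOCK(worker)) != APR_SUCCESS) {
          ap_log_error(APLOG_MARK, APLOG_ERR, rv, s, APLOGNO() "lock");
          return rv;
      }
  }
  /* ... code that allocates from worker->cp->pool ... */
  if (!first_init) {
      PROXY_THREAD_UNLOCK(worker);
  }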
Comment 4 Don Poitras 2019-07-02 17:52:38 UTC
Is there a date set for 2.4.40? It would be nice to see this patch included, but I realize that may not be reasonable if .40 is going out very soon. Let me know if there are other questions. Thanks!
Comment 5 Bob Huemmer 2019-07-20 15:37:21 UTC
It's been some time since there has been an update on this issue. It would really be nice to have this fix in place for the 2.4.40 release. Thanks.
Comment 6 Eric Covener 2019-08-26 14:30:51 UTC
trunk: http://svn.apache.org/r1865938 and proposed for 2.4.x
Comment 7 Ruediger Pluem 2019-08-27 15:57:50 UTC
(In reply to Don Poitras from comment #0)

> 
> Looking at these new core files, it could be seen that two threads were
> concurrently accessing the same worker proxy connection pool. While
> proxy_util.c has one mutex call around the call to apr_sockaddr_info_get()
> (where the proxy connection pool is used), there were other calls in this
> module that needed the same protection.

I know I am late with this, but can you please provide the stack traces of both threads that access the pool concurrently? I think the solution you provide creates too much locking overhead, but knowing where it clashes could be helpful to propose an alternative solution.
Comment 8 Eric Covener 2019-08-27 16:38:09 UTC
from the thread:

pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use

172256 .  Thawed  libc:gsignal (+0x37)                                               <--- abort
  1 libc:gsignal (+0x37)                                         
  2 libc:abort (+0x143)                                          
  3 libapr-1:\apr_pools\pool_concurrency_abort 768               
  4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)                    
  5 libapr-1:\thread_cond\apr_thread_cond_create 44              
  6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)      
  7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
  8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)               
  9 httpd:UNKNOWN at 0x00000000004543A0  

Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:

172271    Thawed  libc:__GI_strncmp (+0x1680)                                           <-- in use
  1 libc:__GI_strncmp (+0x1680)
  2 libc:getenv (+0xBD)
  3 libc:__nscd_getai (+0x3D3)
  4 libc:gaih_inet.constprop.8 (+0x15F2)
  5 libc:getaddrinfo (+0x11F) 
  6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
  7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
  8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
  9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
 11 httpd:UNKNOWN at 0x00000000004543A0


Another possible fix would be to use the global mutex everywhere it's needed and remove the creation and use of the worker thread lock altogether, but that would mean changing the code that currently uses the worker proxy thread lock, and it seemed cleaner to change just proxy_util.c rather than two source modules.

The two functions in ap_proxy_acquire_connection() that I serialized were found from other runs. Both apr_reslist_acquire() and connection_constructor() end up using the same worker pool that is being used by ap_proxy_determine_connection(). e.g.

pool concurrency check: pool 0x1d93198(proxy_worker_cp), thread cur 7f019bfff700 in use by 7f019abfd700, state in use -> in use

aborting thread:
1    Thread 0x7f019bfff700 (LWP 106622) 0x00007f01a735c207 in raise () from /lib64/libc.so.6
  3 libapr-1:\apr_pools\pool_concurrency_abort 768
  4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)
  5 libaprutil-1:\apr_reslist\get_container 100 (+0x8)                               
  6 libaprutil-1:\apr_reslist\apr_reslist_acquire 121 (+0x3)
  7 mod_proxy:\proxy_util\ap_proxy_acquire_connection 2327 (+0x3)
  8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1933 (+0x19)
  9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
 11 httpd:UNKNOWN at 0x00000000004543A0

And the "in-use" thread:

5    Thread 0x7f019abfd700 (LWP 106624) 0x00007f01a7424bed in connect () from /lib64/libc.so.6
  8 mod_proxy:\proxy_util\ap_proxy_determine_connection 2537 (+0x15)                 
  9 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
 10 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 11 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)

Maybe an alternative would be to use a subpool for the resolver calls?
Comment 9 Ruediger Pluem 2019-08-28 06:08:42 UTC
(In reply to Eric Covener from comment #8)
> from the thread:
> 
> pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur
> 7f25a21fc700 in use by 7f2598bed700, state in use -> in use
> 
> 172256 .  Thawed  libc:gsignal (+0x37)                                      
> <--- abort
>   1 libc:gsignal (+0x37)                                         
>   2 libc:abort (+0x143)                                          
>   3 libapr-1:\apr_pools\pool_concurrency_abort 768               
>   4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)                    
>   5 libapr-1:\thread_cond\apr_thread_cond_create 44              
>   6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)      
>   7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
>   8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)               
>   9 httpd:UNKNOWN at 0x00000000004543A0  
> 
> Here's the thread the diagnostic said had the pool 'in use' when apr_palloc()
> was called:
> 
> 172271    Thawed  libc:__GI_strncmp (+0x1680)                               
> <-- in use
>   1 libc:__GI_strncmp (+0x1680)
>   2 libc:getenv (+0xBD)
>   3 libc:__nscd_getai (+0x3D3)
>   4 libc:gaih_inet.constprop.8 (+0x15F2)
>   5 libc:getaddrinfo (+0x11F) 
>   6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
>   7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
>   8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
>   9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
>  10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
>  11 httpd:UNKNOWN at 0x00000000004543A0
> 

Thanks. My current gut feeling is that there is something rotten elsewhere and that adding the locks just fixes the symptoms of it, with a performance penalty. But I would need to investigate more closely.


> 
> Maybe an alternative would be to use a subpool for the resolver calls?

I will investigate this option, but I need to study first which allocators are used.
Comment 10 Ruediger Pluem 2019-08-28 10:48:15 UTC
(In reply to Ruediger Pluem from comment #9)
> (In reply to Eric Covener from comment #8)
> > [stack traces from comment 8 snipped]
> 
> Thanks. My current gut feeling is that there is something rotten elsewhere
> and that adding the locks just fixes the symptoms of it, with a performance
> penalty. But I would need to investigate more closely.

Here I suspect a double-initialization issue. Can you please revert r1865938 and apply the patch I will attach next? If I am correct, it should fix the above issue. The second example will not be fixed by this. BTW: I am not sure if we can backport the volatile change in mod_proxy.h. I guess it is needed to ensure that we really recheck, but this can be discussed later.


> 
> > 
> > Maybe an alternative would be to use a subpool for the resolver calls?
> 
> I will investigate this option, but I need to study first which allocators
> are used.

The connection pool uses the global pool allocator, and this allocator has a mutex. Hence creating a subpool and allocating memory in it are thread-safe in the sense that the parent pool and the subpool can be used concurrently in different threads. So a subpool would be an option here as well.
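
A minimal sketch of that option (illustrative only; the variable names are made up, and the real change is in the attached patches):

  /* Resolve into a dedicated sub-pool of the worker's connection pool.
   * The parent pool's allocator is mutex-protected, so creating the
   * sub-pool and allocating from it is safe even if another thread is
   * allocating from worker->cp->pool at the same time. */
  apr_pool_t *dns_pool;
  apr_sockaddr_t *addr;
  apr_status_t rv;

  rv = apr_pool_create(&dns_pool, worker->cp->pool);
  if (rv == APR_SUCCESS) {
      rv = apr_sockaddr_info_get(&addr, conn->hostname, APR_UNSPEC,
                                 conn->port, 0, dns_pool);
  }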
Comment 11 Ruediger Pluem 2019-08-28 10:49:02 UTC
Created attachment 36740 [details]
Avoid double initialization
Comment 12 Ruediger Pluem 2019-08-28 10:50:54 UTC
For easier review, here is the patch without whitespace changes:

Index: modules/proxy/proxy_util.c
===================================================================
--- modules/proxy/proxy_util.c  (revision 1864214)
+++ modules/proxy/proxy_util.c  (working copy)
@@ -2045,10 +2045,12 @@
                      ap_proxy_worker_name(p, worker));
     }
     else {
+        apr_global_mutex_lock(proxy_mutex);
+        /* Check again after we got the lock if we are still uninitialized */
+        if (!(worker->local_status & PROXY_WORKER_INITIALIZED)) {
         ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, s, APLOGNO(00927)
                      "initializing worker %s local",
                      ap_proxy_worker_name(p, worker));
-        apr_global_mutex_lock(proxy_mutex);
         /* Now init local worker data */
 #if APR_HAS_THREADS
         if (worker->tmutex == NULL) {
@@ -2101,12 +2103,15 @@
                  "initialized single connection worker in child %" APR_PID_T_FMT " for (%s)",
                  getpid(), worker->s->hostname_ex);
         }
+            if (rv == APR_SUCCESS) {
+                worker->local_status |= (PROXY_WORKER_INITIALIZED);
+            }
+        }
         apr_global_mutex_unlock(proxy_mutex);

     }
     if (rv == APR_SUCCESS) {
         worker->s->status |= (PROXY_WORKER_INITIALIZED);
-        worker->local_status |= (PROXY_WORKER_INITIALIZED);
     }
     return rv;
 }
Index: modules/proxy/mod_proxy.h
===================================================================
--- modules/proxy/mod_proxy.h   (revision 1864214)
+++ modules/proxy/mod_proxy.h   (working copy)
@@ -470,7 +470,7 @@
 /* Worker configuration */
 struct proxy_worker {
     proxy_hashes    hash;       /* hash of worker name */
-    unsigned int local_status;  /* status of per-process worker */
+    volatile unsigned int local_status;  /* status of per-process worker */
     proxy_conn_pool     *cp;    /* Connection pool to use */
     proxy_worker_shared   *s;   /* Shared data */
     proxy_balancer  *balancer;  /* which balancer am I in? */
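
Restated without the diff markers, the else branch becomes a classic double-checked locking pattern, roughly:

    else {
        apr_global_mutex_lock(proxy_mutex);
        /* Check again after we got the lock if we are still uninitialized */
        if (!(worker->local_status & PROXY_WORKER_INITIALIZED)) {
            /* ... initialize the per-process worker data (thread mutex,
             * reslist / single connection) exactly once ... */
            if (rv == APR_SUCCESS) {
                worker->local_status |= PROXY_WORKER_INITIALIZED;
            }
        }
        apr_global_mutex_unlock(proxy_mutex);
    }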
Comment 13 Don Poitras 2019-08-28 19:29:30 UTC
Testing a new fix requires multiple people to be involved and some hours of work to get the test done. Seeing that the proposed fix won't address the "second example", it's not something we'd be able to test. If you'd like us to test something meant to fix all the failures, we'd probably be able to do that.
Comment 14 Ruediger Pluem 2019-08-29 08:02:24 UTC
(In reply to Don Poitras from comment #13)
> Testing a new fix requires multiple people to be involved and some hours of
> work to get the test done. Seeing that the proposed fix won't address the
> "second example", it's not something we'd be able to test. If you'd like us
> to test something meant to fix all the failures, we'd probably be able to do
> that.

Ok. Will try to do this.
@Eric any comments from you reviewing my approach? Do you think it goes in the right direction from your point of view?
Comment 15 Eric Covener 2019-08-29 10:50:00 UTC
(In reply to Ruediger Pluem from comment #14)
> (In reply to Don Poitras from comment #13)
> > Testing a new fix requires multiple people to be involved and some hours of
> > work to get the test done. Seeing that the proposed fix won't address the
> > "second example", it's not something we'd be able to test. If you'd like us
> > to test something meant to fix all the failures, we'd probably be able to do
> > that.
> 
> Ok. Will try to do this.
> @Eric any comments from you reviewing my approach? Do you think it goes in
> the right direction from your point of view?

I did also think a 2nd thread was concurrently in ap_proxy_initialize_worker() which would leave both using the original worker->cp->pool due to the inconsistent way it's guarded.

I guess the fourth backtrace has some missing frames (from the numbering) and is also in apr_sockaddr_info_get()? I am not sure. But if it's right then both are DNS. 

If we can use a sub-pool, we can add it as volatile and key off of it solving the (potential?) ABI problem with the double-check after grabbing the global mutex.
Comment 16 Ruediger Pluem 2019-08-29 11:22:28 UTC
(In reply to Eric Covener from comment #15)
> (In reply to Ruediger Pluem from comment #14)
> > [...]
> 
> I did also think a 2nd thread was concurrently in
> ap_proxy_initialize_worker() which would leave both using the original
> worker->cp->pool due to the inconsistent way it's guarded.
> 
> I guess the fourth backtrace has some missing frames (from the numbering)
> and is also in apr_sockaddr_info_get()? I am not sure. But if it's right
> then both are DNS. 

I agree that lines are missing in the fourth backtrace. What do you mean by both? 2nd and 4th backtrace?

> 
> If we can use a sub-pool, we can add it as volatile and key off of it
> solving the (potential?) ABI problem with the double-check after grabbing
> the global mutex.

So with a sub-pool of worker->cp->pool dedicated for the apr_sockaddr_info_get call you propose to mark worker->cp->addr as volatile and recheck it after getting the lock just as I did in my patch with worker->local_status?
Comment 17 Eric Covener 2019-08-29 11:49:47 UTC
(In reply to Ruediger Pluem from comment #16)
> (In reply to Eric Covener from comment #15)
> > (In reply to Ruediger Pluem from comment #14)
> > > [...]
> > 
> > I did also think a 2nd thread was concurrently in
> > ap_proxy_initialize_worker() which would leave both using the original
> > worker->cp->pool due to the inconsistent way it's guarded.
> > 
> > I guess the fourth backtrace has some missing frames (from the numbering)
> > and is also in apr_sockaddr_info_get()? I am not sure. But if it's right
> > then both are DNS. 
> 
> I agree that lines are missing in the fourth backtrace. What do you mean by
> both? 2nd and 4th backtrace?

both pairs of threads / both crash scenarios.

> > If we can use a sub-pool, we can add it as volatile and key off of it
> > solving the (potential?) ABI problem with the double-check after grabbing
> > the global mutex.
> 
> So with a sub-pool of worker->cp->pool dedicated for the
> apr_sockaddr_info_get call you propose to mark worker->cp->addr as volatile
> and recheck it after getting the lock just as I did in my patch with
> worker->local_status?


That is what I was thinking, but maybe this is not as simple/similar due to the way worker->cp is already lazily created?
Comment 18 Ruediger Pluem 2019-08-29 16:10:08 UTC
Created attachment 36745 [details]
Use a sub-pool for DNS resolutions

How about the attached for using a sub-pool? The complete patch needs to be a merge of this one and my previous one.
Comment 19 Ruediger Pluem 2019-09-04 06:49:07 UTC
Any feedback on the proposed approach? If the concept looks fine I would create a merged patch for testing.
Comment 20 Yann Ylavic 2019-09-04 09:54:49 UTC
(In reply to Ruediger Pluem from comment #19)
> Any feedback on the proposed approach? If the concept looks fine I would
> create a merged patch for testing.
I share the same analysis: DNS resolution races with the constructors, hence pool corruption and subsequent crashes.

Regarding the volatile(s) added to structs, I think the compatibility issue can be addressed by using volatile accesses instead of volatile declaration, i.e. define something like:

#define AP_VOLATILIZE_T(T, x) (*(T volatile *)&(x))

and use it wherever we access worker->local_status or worker->cp->addr (which then need to be declared volatile).
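
For illustration, a hypothetical call site using such a macro (not from the committed patches):

  /* Read the possibly-racing fields through volatile accesses. */
  if (!AP_VOLATILIZE_T(apr_sockaddr_t *, worker->cp->addr)) {
      /* take the lock, re-check, then resolve */
  }
  if (AP_VOLATILIZE_T(unsigned int, worker->local_status)
      & PROXY_WORKER_INITIALIZED) {
      /* worker already initialized in this process */
  }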
Comment 21 Yann Ylavic 2019-09-04 10:00:28 UTC
(In reply to Yann Ylavic from comment #20)
> (which then need to be declared volatile).
typo: .. need *not* be ..
Comment 22 Eric Covener 2019-09-04 10:52:35 UTC
(In reply to Ruediger Pluem from comment #19)
> Any feedback on the proposed approach? If the concept looks fine I would
> create a merged patch for testing.

+1. I was hoping to be able to recreate it, but it's looking less and less likely that I will be able to give it any time.
Comment 23 Ruediger Pluem 2019-09-13 19:17:41 UTC
I will attach 2 patches in a second. They are both against trunk.
63503_revert_r1865938_r1865944.diff will revert the changes already done in trunk with the additional locking (current proposal to solve the issue). This is needed to apply the second patch, which contains the alternate proposal for solving this issue and which will hopefully get tested.
63503_merge.diff contains the alternate proposal for solving this issue.

r1865938 contains some additional goodness for better logging. Once the alternate solution is seen as the way to go, I will take care that these improvements won't get lost when committing to trunk.
Comment 24 Ruediger Pluem 2019-09-13 19:18:25 UTC
Created attachment 36782 [details]
63503_revert_r1865938_r1865944.diff
Comment 25 Ruediger Pluem 2019-09-13 19:19:03 UTC
Created attachment 36783 [details]
63503_merge.diff
Comment 26 Ruediger Pluem 2019-10-09 08:31:59 UTC
(In reply to Ruediger Pluem from comment #23)
> I will attach 2 patches in a second. They are both against trunk.
> 63503_revert_r1865938_r1865944.diff will revert the changes already done in
> trunk with the additional locking (current proposal to solve the issue).
> This is needed to apply the second patch, which contains the alternate
> proposal for solving this issue and which will hopefully get tested.
> 63503_merge.diff contains the alternate proposal for solving this issue.
> 
> r1865938 contains some additional goodness for better logging. Once the
> alternate solution is seen as the way to go, I will take care that these
> improvements won't get lost when committing to trunk.

Any reviewers / testers?
Comment 27 Don Poitras 2019-10-09 12:52:57 UTC
Ruediger,
  Yes, we are trying to test the patch. We needed to update to 2.4.41 to address open CVEs at 2.4.39, and we're getting errors that we hadn't seen before (this is before applying your patch), so we need to get that figured out before running the stress tests that show the concurrency problem. We might have to fall back to 2.4.39 and apply your changes there for the test if we can't get the 2.4.41 issue figured out soon (the assumption at this point is that we've misconfigured something in the backend server rather than hit a real issue with 2.4.41). I'll update this tracker with our progress. I didn't want to post till we had some actual results.
Comment 28 Don Poitras 2019-10-11 13:25:11 UTC
Good news! We were able to reproduce the error, applied the patch, re-ran the test, and saw no error messages and no core files. We're satisfied that this fix will work for us and we'll include it in our next shipment. Thanks Ruediger for writing the comprehensive fix. Much appreciated.
Comment 29 Ruediger Pluem 2019-10-11 15:12:45 UTC
Committed to trunk as r1868294, r1868295 and r1868296.
Comment 30 Ruediger Pluem 2019-11-23 08:08:19 UTC
Proposed for backport as r1870213.
Comment 31 Ruediger Pluem 2020-01-14 19:28:53 UTC
Backported to 2.4.x as r1872790. Will be part of 2.4.42.