Bug 69110 - coredumps (pthread_mutex_lock)
Summary: coredumps (pthread_mutex_lock)
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_dav_fs
Version: 2.5-HEAD
Hardware: PC Linux
Importance: P2 major
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-04 21:46 UTC by KC Tessarek
Modified: 2024-07-03 20:30 UTC
CC List: 0 users



Attachments
apr global lock forked/threaded stress case (6.91 KB, patch)
2024-06-12 13:17 UTC, Joe Orton

Description KC Tessarek 2024-06-04 21:46:30 UTC
The config.log, error_log and the coredumps are available at: https://evermeet.cx/pub/logs/httpd/
(That is, if httpd is not coredumping...)

Linux atvie01s 6.8.10-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 17 21:20:54 UTC 2024 x86_64 GNU/Linux

This happens after 1 or 2 days of running httpd. It never happened before upgrading the OS to Fedora 40; I recompiled httpd after the upgrade to F40.

[2024-06-02 13:07:46.916495] [core:notice] [pid 1512889] AH00051: child pid 1690224 exit signal Segmentation fault (11), possible coredump in /usr/local/apache
Fatal glibc error: pthread_mutex_lock.c:450 (__pthread_mutex_lock_full): assertion failed: e != ESRCH || !robust
[2024-06-02 13:07:59.930976] [core:notice] [pid 1512889] AH00051: child pid 1690223 exit signal Abort (6), possible coredump in /usr/local/apache
Fatal glibc error: pthread_mutex_lock.c:450 (__pthread_mutex_lock_full): assertion failed: e != ESRCH || !robust
[2024-06-02 13:08:26.960042] [core:notice] [pid 1512889] AH00051: child pid 1690222 exit signal Abort (6), possible coredump in /usr/local/apache
Fatal glibc error: pthread_mutex_lock.c:450 (__pthread_mutex_lock_full): assertion failed: e != ESRCH || !robust

The event MPM config:

 StartServers             5
 MinSpareThreads         25
 MaxSpareThreads        125
 ThreadsPerChild         25
 MaxRequestWorkers      150
 MaxConnectionsPerChild   0


Please let me know what other info I can provide. Please note that this is a prod machine.
Comment 1 Joe Orton 2024-06-05 07:08:20 UTC
We have the backported patch for r1914438 in Fedora 40 - are you using WebDAV? Can you get a backtrace? e.g. "coredumpctl gdb 1690222" for that pid.
Comment 2 Joe Orton 2024-06-05 07:18:31 UTC
Ah, I missed the link. And it looks like you're using a self-built httpd, not the Fedora httpd, so r1914438 is not relevant. You have some third-party modules linked in:
                                                                                           
warning: Can't open file /var/local/apache/modules/mod_authnz_pam.so during file-backed mapping note processing
warning: Can't open file /var/local/apache/modules/mod_markdown.so during file-backed mapping note processing

Please obtain a backtrace using gdb - the core dump isn't useful without access to the exact binary you're using.
Comment 3 KC Tessarek 2024-06-05 07:19:50 UTC
I am not using a Fedora package for httpd, which is why I also put the config.log output along with the coredumps at the address I posted.

Unfortunately I removed the coredumps from my machine, since there were thousands of them.

Yes, I am using WebDAV, even though it is not heavily used - maybe one PROPFIND operation every 10 minutes or so. (I'm using it as a sync endpoint for my note-taking app.)

Next time I won't remove the coredumps...

But the one you are looking for should be one of the dumps larger than 10 MB at https://evermeet.cx/pub/logs/httpd/
Comment 4 KC Tessarek 2024-06-05 07:24:12 UTC
I'll copy the dumps back to my machine and get you the backtrace.

Btw, regarding the 3rd-party modules: they are only loaded, not used by any directives. I used them 2 years ago, stopped using them shortly after, and just forgot to remove them from the config file.

I'll comment them out the next time I have to restart the server.

I am going to get you the backtrace tomorrow. I'm heading to bed. It's way past my bedtime. ;-)

And thank you for looking into this!
Comment 5 KC Tessarek 2024-06-05 23:27:50 UTC
Here's the backtrace:

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
Downloading source file /usr/src/debug/glibc-2.39-13.fc40.x86_64/nptl/pthread_kill.c
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7f362d6006c0 (LWP 1690344))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f365f42c1b3 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x00007f365f3d465e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f365f3bc902 in __GI_abort () at abort.c:79
#4  0x00007f365f3bd767 in __libc_message_impl (fmt=fmt@entry=0x7f365f544b20 "Fatal glibc error: %s:%s (%s): assertion failed: %s\n") at ../sysdeps/posix/libc_fatal.c:132
#5  0x00007f365f3cc7a7 in __libc_assert_fail (assertion=assertion@entry=0x7f365f540e73 "e != ESRCH || !robust", file=file@entry=0x7f365f540e5e "pthread_mutex_lock.c", line=line@entry=450,
    function=function@entry=0x7f365f5496f0 <__PRETTY_FUNCTION__.1> "__pthread_mutex_lock_full") at __libc_assert_fail.c:31
#6  0x00007f365f42d66c in __pthread_mutex_lock_full (mutex=0x7f365fe5e000) at pthread_mutex_lock.c:450
#7  0x00007f365f42d745 in ___pthread_mutex_lock (mutex=<optimized out>) at pthread_mutex_lock.c:86
#8  0x00007f365f5ac0b9 in proc_mutex_pthread_acquire_ex (timeout=-1, mutex=0x16054e8) at locks/unix/proc_mutex.c:780
#9  proc_mutex_pthread_acquire (mutex=0x16054e8) at locks/unix/proc_mutex.c:843
#10 0x00007f365f5a39bc in apr_global_mutex_lock (mutex=0x16054d0) at locks/unix/global_mutex.c:106
#11 0x00000000004e6c9f in ssl_mutex_on ()
#12 0x00000000004eb544 in ssl_scache_store ()
#13 0x00000000004e4d9f in ssl_callback_NewSessionCacheEntry ()
#14 0x00007f365fdb7315 in ssl_update_cache (s=0x7f35c801c4b0, mode=2) at ssl/ssl_lib.c:4536
#15 0x00007f365fe26c02 in tls_construct_new_session_ticket (s=0x7f35c801c4b0, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4309
#16 0x00007f365fe096c9 in write_state_machine (s=0x7f35c801c4b0) at ssl/statem/statem.c:894
#17 state_machine (s=0x7f35c801c4b0, server=1) at ssl/statem/statem.c:487
#18 0x00000000004de15e in ssl_io_filter_handshake ()
#19 0x00000000004dfa2f in ssl_io_filter_input ()
#20 0x00000000004d40d9 in ssl_hook_process_connection ()
#21 0x0000000000474bf0 in ap_run_process_connection ()
#22 0x000000000054bb95 in process_socket ()
#23 0x000000000054c4ff in worker_thread ()
#24 0x00007f365f42a1b7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#25 0x00007f365f4ac39c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Comment 6 KC Tessarek 2024-06-06 02:21:45 UTC
Just crashed again with:

#0  0x00007f76975462b3 in __pthread_mutex_lock_full (mutex=0x7f7697f74000) at pthread_mutex_lock.c:514
514                 ENQUEUE_MUTEX_PI (mutex);
[Current thread is 1 (Thread 0x7f767ea006c0 (LWP 2011368))]
(gdb) bt
#0  0x00007f76975462b3 in __pthread_mutex_lock_full (mutex=0x7f7697f74000) at pthread_mutex_lock.c:514
#1  0x00007f7697546745 in ___pthread_mutex_lock (mutex=<optimized out>) at pthread_mutex_lock.c:86
#2  0x00007f76976c50b9 in proc_mutex_pthread_acquire_ex (timeout=-1, mutex=0x19b14e8) at locks/unix/proc_mutex.c:780
#3  proc_mutex_pthread_acquire (mutex=0x19b14e8) at locks/unix/proc_mutex.c:843
#4  0x00007f76976bc9bc in apr_global_mutex_lock (mutex=0x19b14d0) at locks/unix/global_mutex.c:106
#5  0x00000000004e6c9f in ssl_mutex_on ()
#6  0x00000000004eb544 in ssl_scache_store ()
#7  0x00000000004e4d9f in ssl_callback_NewSessionCacheEntry ()
#8  0x00007f7697ecd315 in ssl_update_cache (s=0x7f7658000f30, mode=2) at ssl/ssl_lib.c:4536
#9  0x00007f7697f3cc02 in tls_construct_new_session_ticket (s=0x7f7658000f30, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4309
#10 0x00007f7697f1f6c9 in write_state_machine (s=0x7f7658000f30) at ssl/statem/statem.c:894
#11 state_machine (s=0x7f7658000f30, server=1) at ssl/statem/statem.c:487
#12 0x00000000004de15e in ssl_io_filter_handshake ()
#13 0x00000000004dfa2f in ssl_io_filter_input ()
#14 0x00000000004d40d9 in ssl_hook_process_connection ()
#15 0x0000000000474bf0 in ap_run_process_connection ()
#16 0x000000000054bb95 in process_socket ()
#17 0x000000000054c4ff in worker_thread ()
#18 0x00007f76975431b7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#19 0x00007f76975c539c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Comment 7 Joe Orton 2024-06-06 07:29:40 UTC
Thanks a lot. For the first case of the assert() failure, can you get the variable information for frame 6 below the assert():

#5  0x00007f365f3cc7a7 in __libc_assert_fail (assertion=assertion@entry=0x7f365f540e73 "e != ESRCH || !robust", file=file@entry=0x7f365f540e5e "pthread_mutex_lock.c", line=line@entry=450,
    function=function@entry=0x7f365f5496f0 <__PRETTY_FUNCTION__.1> "__pthread_mutex_lock_full") at __libc_assert_fail.c:31
#6  0x00007f365f42d66c in __pthread_mutex_lock_full (mutex=0x7f365fe5e000) at pthread_mutex_lock.c:450

e.g. in gdb do -

up 6
info locals

There is nothing that's changed in this area of httpd recently (mod_ssl session cache locking) so it's not obvious what the problem is. It looks like you're using the Fedora apr package and nothing has changed there either.
Comment 8 KC Tessarek 2024-06-06 10:38:21 UTC
Thanks for looking into this. Yes, I am using the Fedora apr and apr-util packages.

Here's the info you requested:

(gdb) up 6
#6  0x00007f365f42d66c in __pthread_mutex_lock_full (mutex=0x7f365fe5e000) at pthread_mutex_lock.c:450
450                     assert (e != ESRCH || !robust);
(gdb) info locals
private = 128
e = <optimized out>
kind = 0
robust = <optimized out>
newval = 1690344
assume_other_futex_waiters = <optimized out>
oldval = <optimized out>
id = 1690344
__PRETTY_FUNCTION__ = "__pthread_mutex_lock_full"
(gdb)
Comment 9 KC Tessarek 2024-06-07 11:30:23 UTC
FYI: As I suspected, the 3rd-party modules have nothing to do with the crash. It would have been strange, since they were only loaded but not used.

I removed them from the config, but the server just coredumped again.
Comment 10 KC Tessarek 2024-06-09 05:54:46 UTC
Any idea what is going on? I'd appreciate any insight. It is rather problematic that my server coredumps almost every other day.
Comment 11 Yann Ylavic 2024-06-10 13:41:46 UTC
Could it be a linkage issue, i.e. is httpd running with the same APR version it was compiled against?
Comment 12 KC Tessarek 2024-06-11 03:20:38 UTC
There is only one version of APR on my machine, and APR almost never changes, so this httpd binary is definitely using the same APR version it was linked against.
I compiled it before this all started, and neither APR nor any other dependency has been updated since.

However, it is dynamically linked, so using another (compatible) version should be possible. If not, there is a problem with the dynamic linking concept.
Comment 13 Joe Orton 2024-06-11 09:20:01 UTC
I have been able to reproduce similar crashes under load testing on Fedora 40. I wonder if there are two problems here: a segfault like your comment 6 in the ENQUEUE_MUTEX_PI() macro, and then, once that thread fails, the mutex is left in a bad state and other threads start failing the assert() call.

#0  0x00007fac1bb932b3 in __pthread_mutex_lock_full (mutex=0x7fac1b8e7000) at pthread_mutex_lock.c:514
514		    ENQUEUE_MUTEX_PI (mutex);
(gdb) where
#0  0x00007fac1bb932b3 in __pthread_mutex_lock_full (mutex=0x7fac1b8e7000) at pthread_mutex_lock.c:514
#1  0x00007fac1bb93745 in ___pthread_mutex_lock (mutex=<optimized out>) at pthread_mutex_lock.c:86
#2  0x00007fac1bd120c9 in proc_mutex_pthread_acquire_ex (mutex=0x5594e5e67440, timeout=-1) at locks/unix/proc_mutex.c:787
#3  proc_mutex_pthread_acquire (mutex=0x5594e5e67440) at locks/unix/proc_mutex.c:850
#4  0x00007fac1bd099bc in apr_global_mutex_lock (mutex=0x5594e5e67428) at locks/unix/global_mutex.c:106
#5  0x00007fac1b3f11ce in ssl_mutex_on.isra () from /etc/httpd/modules/mod_ssl.so
#6  0x00007fac1b3e80a5 in ssl_scache_store () from /etc/httpd/modules/mod_ssl.so
#7  0x00007fac1b3e8197 in ssl_callback_NewSessionCacheEntry () from /etc/httpd/modules/mod_ssl.so
#8  0x00007fac1b307315 in ssl_update_cache () from /lib64/libssl.so.3
#9  0x00007fac1b376c02 in tls_construct_new_session_ticket () from /lib64/libssl.so.3
#10 0x00007fac1b3596c9 in state_machine () from /lib64/libssl.so.3
#11 0x00007fac1b3de25e in ssl_io_filter_handshake () from /etc/httpd/modules/mod_ssl.so
#12 0x00007fac1b3df68c in ssl_io_filter_input () from /etc/httpd/modules/mod_ssl.so
#13 0x00007fac1b3d3cc2 in ssl_hook_process_connection () from /etc/httpd/modules/mod_ssl.so
#14 0x00005594e43d7b2a in ap_run_process_connection (c=c@entry=0x7fab84005160) at server/connection.c:42
#15 0x00007fac1b6b1c16 in process_socket (thd=thd@entry=0x5594e5e71c88, p=<optimized out>, sock=<optimized out>, cs=<optimized out>, my_child_num=my_child_num@entry=0, 
    my_thread_num=my_thread_num@entry=17) at /usr/src/debug/httpd-2.4.59-2.fc40.x86_64/server/mpm/event/event.c:1086
#16 0x00007fac1b6b2636 in worker_thread (thd=0x5594e5e71c88, dummy=<optimized out>) at /usr/src/debug/httpd-2.4.59-2.fc40.x86_64/server/mpm/event/event.c:2179
#17 0x00007fac1bb901b7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#18 0x00007fac1bc121a4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(gdb) up 4
#4  0x00007fac1bd099bc in apr_global_mutex_lock (mutex=0x5594e5e67428) at locks/unix/global_mutex.c:106
106	    rv = apr_proc_mutex_lock(mutex->proc_mutex);
(gdb) print *mutex->proc_mutex->os->pthread_interproc 
$8 = {__data = {__lock = -2147436777, __count = 1, __owner = 0, __nusers = 4294967295, __kind = 178, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\027\267\000\200\001\000\000\000\000\000\000\000\377\377\377\377\262", '\000' <repeats 22 times>, __align = 6442497815}
Comment 14 KC Tessarek 2024-06-11 09:34:45 UTC
It seems that Fedora 40 is the culprit. Before the upgrade I didn't have any issues, no matter how much traffic my server got.
The question now is what in Fedora 40 is responsible for these crashes. Is it glibc, a 3rd-party lib like openssl, nghttp2, ...?
Comment 15 Joe Orton 2024-06-11 09:54:50 UTC
I've asked our glibc maintainers for help debugging it. It's possible that APR/httpd is doing something wrong and some assumption we made was wrong - or that something has changed/broken in libc.
Comment 16 Yann Ylavic 2024-06-11 09:56:26 UTC
Joe, is there an (ungraceful) restart or so involved in your reproducer - something that exercises EOWNERDEAD, or just the pthread mutex refcounting?
Comment 17 Joe Orton 2024-06-12 10:29:25 UTC
Working theory - quite speculative -

1) it is a refcounting issue; the mutex is getting destroyed too soon
2) it is happening whenever an event child exits

Yann, is it correct that the _child_init function creates another cleanup? 

https://github.com/apache/apr/blob/trunk/locks/unix/proc_mutex.c#L693

I can't see why. I'm also not convinced that the munmap() shouldn't be done only inside the _unref call rather than on every cleanup.
Comment 18 Yann Ylavic 2024-06-12 12:47:47 UTC
Yes, proc_mutex_pthread_child_init() creates another cleanup; supposedly we want each child using pthread shared mutexes to decrement the refcount when exiting (i.e. when pchild is destroyed), since apr_proc_mutex_child_init() won't register the [apr_proc_mutex_]cleanup by itself?
apr_proc_mutex_create() does register apr_proc_mutex_cleanup(), but even though it's inherited by fork in httpd, we don't destroy pconf in the children processes.

Maybe proc_mutex_pthread_child_init() should register proc_mutex_pthread_cleanup() rather than proc_mutex_pthread_unref() so that munmap() is called on cleanup too, but I think any mmap()ing is torn down on exit anyway, so it wouldn't change much.
What we want is to call pthread_mutex_destroy() when the last user exits, should pthread_mutex_init() "leak" something, and proc_mutex_pthread_unref() is enough for that.

> Working theory - quite speculative -
> 
> 1) it is a refcounting issue, the mutex is getting destroyed to soon
> 2) it is happening whenever an event child exits

Do you mean that the crash/UAF happens on every child exit?
The only case where a child should destroy the mutex is on graceful restart when the last child of a generation exits after the parent process has already started the new generation (i.e. pconf of the previous generation was cleared in the parent), but then nothing should have destroyed the mutex since this child holds a ref still.
For ungraceful restart, pchild can be killed while connections are being handled still, but since the parent process shouldn't have cleared pconf before all the children are waitpid()ed, proc_mutex_pthread_unref() in the child should be a noop here.

Sorry, so far I don't see where this issue comes from...
Do you have a reproducer? I tried to stress test a bit but couldn't trigger it.
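
For reference, the pattern under discussion looks roughly like this (a simplified, hypothetical sketch - names and details do NOT match the actual locks/unix/proc_mutex.c code):

#include <pthread.h>
#include <stdatomic.h>
#include <sys/mman.h>

typedef struct {
    pthread_mutex_t mutex;    /* lives in an anonymous shared mapping */
    atomic_uint     refcount; /* one reference per process using it */
} shared_mutex_t;

static shared_mutex_t *shared_mutex_create(void)
{
    pthread_mutexattr_t attr;
    shared_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    /* error handling omitted; attributes match what the glibc
     * backtraces above suggest (process-shared, robust, PI) */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    atomic_init(&m->refcount, 1);
    return m;
}

/* Cleanup registered at create time: the last unref destroys the
 * mutex, and every process unmaps its own view. */
static void shared_mutex_cleanup(shared_mutex_t *m)
{
    if (atomic_fetch_sub(&m->refcount, 1) == 1)
        pthread_mutex_destroy(&m->mutex);
    munmap(m, sizeof(*m));
}

/* child_init registers *another* cleanup after taking an extra ref,
 * so each forked child decrements the count when its pool is
 * destroyed.  If a cleanup ever runs once too often (or too early),
 * the refcount hits zero while sibling processes still use the mutex,
 * and their next lock attempt sees a destroyed robust mutex (the
 * ESRCH/EINVAL failures above). */
static void shared_mutex_child_init(shared_mutex_t *m)
{
    atomic_fetch_add(&m->refcount, 1);
    /* ...register shared_mutex_cleanup (or just an unref) on pchild... */
}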
Comment 19 Joe Orton 2024-06-12 13:12:08 UTC
> inherited by fork in httpd we don't destroy pconf for children processes.

The mutexes are (all?) created in pglobal, not pconf - the unixd.c atexit() handler installed by ap_unixd_mpm_set_signals is inherited into forked children and destroys pglobal - no? So it looks to me like a double-destroy.

atm I'm working on a modified APR test case which tries to roughly replicate what event does, will attach it.

I can trigger an EINVAL from pthread_mutex_unlock(), and from tracing/printf-debugging it is definitely calling pthread_mutex_destroy() before all the users have gone away. Switching exit() for _exit() in the child also stopped the crashes.

When I remove the extra cleanup, that seems to go away.
Comment 20 Joe Orton 2024-06-12 13:13:28 UTC
Not a double-destroy, but "destroy too soon".
Comment 21 Joe Orton 2024-06-12 13:17:07 UTC
Created attachment 39773 [details]
apr global lock forked/threaded stress case
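
For readers without attachment access, the rough shape of such a stress case is sketched below (hypothetical code, not the attachment itself, assuming the APR 1.x APIs):

#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include "apr_general.h"
#include "apr_pools.h"
#include "apr_global_mutex.h"
#include "apr_thread_proc.h"

#define NCHILDREN 4
#define NTHREADS  8
#define NITERS    10000

static apr_global_mutex_t *gmutex;

static void * APR_THREAD_FUNC worker(apr_thread_t *thd, void *data)
{
    int i;
    (void)thd; (void)data;
    for (i = 0; i < NITERS; i++) {
        if (apr_global_mutex_lock(gmutex) != APR_SUCCESS
            || apr_global_mutex_unlock(gmutex) != APR_SUCCESS)
            abort();
    }
    return NULL;
}

int main(void)
{
    apr_pool_t *pool;
    int c;

    apr_initialize();
    apr_pool_create(&pool, NULL);
    /* APR_LOCK_PROC_PTHREAD selects the process-shared pthread mutex,
     * the mechanism implicated in this report. */
    apr_global_mutex_create(&gmutex, NULL, APR_LOCK_PROC_PTHREAD, pool);

    for (c = 0; c < NCHILDREN; c++) {
        if (fork() == 0) {
            apr_thread_t *t[NTHREADS];
            apr_status_t rv;
            int i;

            apr_global_mutex_child_init(&gmutex, NULL, pool);
            for (i = 0; i < NTHREADS; i++)
                apr_thread_create(&t[i], NULL, worker, NULL, pool);
            for (i = 0; i < NTHREADS; i++)
                apr_thread_join(&rv, t[i]);
            /* Destroying the inherited pool runs the cleanup registered
             * by apr_global_mutex_create() in this child too - the
             * suspected "destroy too soon" path; siblings still using
             * the mutex then fail.  Calling _exit(0) here instead
             * avoids the cleanups. */
            apr_pool_destroy(pool);
            exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    apr_pool_destroy(pool);
    apr_terminate();
    return 0;
}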
Comment 22 KC Tessarek 2024-06-12 13:18:03 UTC
This is an interesting issue, but why does it trigger on Fedora yet apparently nowhere else?
The issue you described does not seem distro-specific.
Comment 23 Yann Ylavic 2024-06-12 13:19:20 UTC
> The mutexes are (all?) created in pglobal not pconf - the unixd.c atexit()
> handler  installed by ap_unixd_mpm_set_signals is inherited into forked
> children and destroys pglobal - no?

The atexit() in ap_unixd_mpm_set_signals() is for -X/ONE_PROCESS mode only; it does destroy pglobal on exit, but there we are exiting the single httpd process.
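i.e. roughly (a hypothetical sketch, not unixd.c verbatim):

#include <stdlib.h>
#include "apr_pools.h"

static apr_pool_t *pglobal;

static void destroy_pglobal(void)
{
    apr_pool_destroy(pglobal);  /* runs every cleanup registered on it */
}

static void set_signals_one_process(void)
{
    atexit(destroy_pglobal);    /* not installed for forked MPM children */
}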

> 
> atm I'm working on a modified APR test case which tries to roughly replicate
> what event does, will attach it.

Thanks, will look at it.
Comment 24 Joe Orton 2024-06-12 13:28:29 UTC
Ah, good point, thanks. In that case, yes, switching exit() for _exit() likely means this test case is not reproducing any crashes either.
Comment 25 Joe Orton 2024-06-13 07:59:18 UTC
It seems the reason that this is triggering on F40 is that we are using the LMDB backend (backported) for apr-util apr_dbm by default. F39 and below used Berkeley DB.

We do *not* have r1915094 applied in Fedora apr-util, though - the change that tries to roughly emulate the same default locking provision in apr_dbm* as before. It seems like the pthread mutex use in LMDB is triggering this, though it's weird that the crashes seem to occur in the mutexes used by httpd rather than the LMDB-internal ones.

With r1915094 applied to Fedora apr-util I can no longer trigger any crashes, at least with r1914438 applied to httpd. But if we apply r1915094 to Fedora apr-util it will likely degrade the DAV locking safety for people who build upstream httpd on Fedora system apr... a fine mess!

The test case I was using to trigger the crashes was a combination of ab -c against https plus the stress test I have for the DAV locking issues: https://people.redhat.com/~jorton/lockbomb.c
Comment 26 KC Tessarek 2024-06-13 09:07:36 UTC
> With r1915094 applied to Fedora apr-util I can no longer trigger any crashes, at least with r1914438 applied to httpd. But if we apply r1915094 to Fedora apr-util it will likely degrade the DAV locking safety for people who build upstream httpd on Fedora system apr... a fine mess!

This means we need a new apr-util release and a new httpd release at the same time (with the respective commits you linked), and Fedora has to provide an apr-util package ASAP. Am I correct? Well, the Fedora apr-util can always include that patch. But apparently we just need a new httpd release that includes the httpd patch.

I'm also not quite sure how DAV plays into this. While I am using DAV, it is not heavily used. Most of the traffic (which could be seen as stress) on my httpd is either standard h2/http1.1 or a reverse-proxy code path.

Well, either way, since I am not afraid to compile stuff myself, I guess I will have to apply those 2 fixes and compile apr-util myself. That should fix the issue for me until new releases are out in whatever timeframe. Releases are not very frequent for httpd or apr*...

I am still a bit puzzled as to why nobody else has encountered this. My system is not special. It's rather a standard setup. Weird.

Anyway, thanks a bunch for looking into this. I'll start patching and compiling tomorrow. ;-)
Comment 27 Joe Orton 2024-06-13 09:30:55 UTC
I guess it's a bit unusual to use upstream httpd on Fedora-provided apr*. WebDAV is relevant because mod_dav_fs uses apr_dbm (and hence LMDB) for storing DAV lock data.
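
For context, that call path looks roughly like this (an illustrative sketch; the "lmdb" type string, path, and keys are assumptions, not mod_dav_fs source):

#include "apr_dbm.h"

static apr_status_t store_dav_lock(apr_pool_t *pool)
{
    apr_dbm_t *dbm;
    apr_datum_t key = { "opaquelocktoken", sizeof("opaquelocktoken") - 1 };
    apr_datum_t val = { "lock-record",     sizeof("lock-record") - 1 };
    apr_status_t rv;

    /* With the backported LMDB driver, this opens an LMDB environment;
     * LMDB's own reader/writer locking uses process-shared pthread
     * mutexes - the suspected interaction in this report. */
    rv = apr_dbm_open_ex(&dbm, "lmdb", "/var/www/var/DAVLockDB",
                         APR_DBM_RWCREATE, APR_OS_DEFAULT, pool);
    if (rv != APR_SUCCESS)
        return rv;
    rv = apr_dbm_store(dbm, key, val);
    apr_dbm_close(dbm);
    return rv;
}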
Comment 28 KC Tessarek 2024-06-13 20:48:59 UTC
Well, it is upstream httpd, but it is still a release tarball (2.4.59), not git master. I can't remember setting the version to 2.5-HEAD. Maybe something went wrong when I was finally able to submit the bug (I had problems and Daniel helped me to overcome the system's blocking attempts).

The WebDAV component is hardly used, which is why I find it interesting that the crash is triggered by it. I thought it was only happening when there's a certain stress level on the server, and that is certainly not coming from WebDAV - at least not on my server.
Comment 29 KC Tessarek 2024-06-13 22:07:15 UTC
Hmm, I just downloaded apr-util-1.6.3, but there is no dbm/apr_dbm_lmdb.c.
I guess I will have to download the Fedora spec file and create an RPM package with that fix included instead. (I missed the backported part.)

Will r1914438 be included in httpd 2.4.60?
Comment 30 KC Tessarek 2024-06-14 00:32:15 UTC
Apart from my previous question: the stack trace does not even mention WebDAV, only OpenSSL. How do you see from the stack trace that WebDAV is the problem?

Additionally, I looked at the Fedora httpd package and it does not include r1914438.
Now I am slightly confused. Does that mean the Fedora package is also prone to these coredumps?

I was able to patch apr-util, but applying https://github.com/apache/httpd/commit/455147a36049efc443921ea523d01aa62e047fa3.patch to 2.4.59 does not work. I guess I have to backport the code from trunk.

Anyway, what is the solution to my problem now? Since the stack trace does not mention mod_dav_fs, and Fedora's httpd package does not include any WebDAV patches, I am not sure anymore what is going on. I thought I understood your explanation, but now I am rather lost and don't know what exactly I have to patch.
Comment 31 KC Tessarek 2024-06-17 19:50:32 UTC
Something else must have changed in Fedora 40. I haven't had a crash in 8 days.
Comment 32 KC Tessarek 2024-06-21 18:52:16 UTC
Ok, it just crashed again. I am very puzzled by this.

Is there any progress? What exactly shall I patch? Can you please reply to my previous comments? Sorry for all the comments, but it is impossible to edit one.
Comment 33 Joe Orton 2024-06-25 07:17:52 UTC
I have pushed an apr-util update to Fedora updates-testing which disables locking in the LMDB backend - can you try it?

https://bodhi.fedoraproject.org/updates/FEDORA-2024-6a7e2a7d47
Comment 34 KC Tessarek 2024-06-25 20:49:02 UTC
Yes, I'll install it within the next 10 minutes.

I haven't checked the build yet. Does it only include:
#define DEFAULT_ENV_FLAGS (MDB_NOSUBDIR|MDB_NOSYNC|MDB_NOLOCK)

If so, I already built that as mentioned in https://bz.apache.org/bugzilla/show_bug.cgi?id=69110#c30 and was only waiting for you to tell me whether this was the only thing I have to patch and install.
Comment 35 KC Tessarek 2024-06-25 20:55:12 UTC
Installed. Let's see what happens...
Comment 36 KC Tessarek 2024-07-02 05:32:39 UTC
It's been a few days and everything has been OK so far. I've upgraded httpd to 2.4.60, so Apache was "only" running for 6 days.
At one point it took 9 days before the coredump happened, so I can't say if it really is fixed with the new apr-util. 
But I am optimistic. ;-)
Comment 37 Joe Orton 2024-07-03 09:59:29 UTC
Thanks for testing & the updates. To answer your question above: yes, the only thing that apr-util build changes is adding MDB_NOLOCK to the #define DEFAULT_ENV_FLAGS line.
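i.e. in apr-util's dbm/apr_dbm_lmdb.c (the "before" line is inferred from the define quoted in comment 34 minus MDB_NOLOCK):

/* before: */
#define DEFAULT_ENV_FLAGS (MDB_NOSUBDIR|MDB_NOSYNC)
/* after (MDB_NOLOCK disables LMDB's internal locking): */
#define DEFAULT_ENV_FLAGS (MDB_NOSUBDIR|MDB_NOSYNC|MDB_NOLOCK)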
Comment 38 KC Tessarek 2024-07-03 20:30:21 UTC
Thanks for your reply.

I upgraded to 2.4.61 today.

If no coredump occurs within the next 15 days, we can close this issue. Fingers crossed.