Bug 68830 - Deadlock when writing to file
Summary: Deadlock when writing to file
Status: RESOLVED DUPLICATE of bug 61786
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: All (show other bugs)
Version: 2.4.56
Hardware: Sun Solaris
: P2 major (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-26 07:55 UTC by thutop.dorje
Modified: 2024-05-21 13:24 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description thutop.dorje 2024-03-26 07:55:26 UTC
libapr 1.6.2, compiled with flags -m64 -KPIC, with Sun C++ 5.12 SunOS_sparc
httpd 2.4.56
Solaris SPARC 11.4 

When starting http, it spawns 5 processes. Occasionally (once a month) it hangs and does not accept requests anymore. It's even impossible to telnet port 80. I did a core dump of each of the 5 processes and found out that two of them seems to be stuck in the same function apr_file_write(). 

------------  lwp# 6 / thread# 6  ---------------
 ffffffff730cb048 write    (b, ffffffff7e75b9ff, 1)
 ffffffff57d28ca8 apr_file_write (0x100267338?, 0xffffffff7e75b9ff?, 0xffffffff7e75b9e0?, 0x0?, 0x0?, 0x0?) + 1c8
 ffffffff57d28ff4 apr_file_putc (0x1?, 0x100267338?, 0x0?, 0x0?, 0x1?, 0x0?) + 1c
 ffffffff57d392c8 apr_pollset_wakeup (0x100266a50?, 0x0?, 0x0?, 0x0?, 0x100266a18?, 0x0?) + 20
 00000001000b4888 TO_QUEUE_APPEND (0x100221228?, 0x10032b700?, 0x0?, 0x0?, 0x100221258?, 0x10032b7b0?) + f0
 00000001000b6f5c process_socket (0x1002675c0?, 0x10032b378?, 0x10032b400?, 0x0?, 0x1?, 0x3?) + 8bc
 00000001000ba14c worker_thread (0x1002675c0?, 0x100215af0?, 0xffffffff7f582240?, 0xfffc00?, 0x1?, 0x3?) + 494
 ffffffff57d407a0 dummy_worker (0x1002675c0?, 0x0?, 0x1?, 0xffffffff57d40788?, 0x0?, 0x0?) + 18
 ffffffff730c58b4 _lwp_start (0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)

2/5 processes have this same stack. From /proc/path, the only similarities apart from the libraries is the log file. So I guess they are dead locked on the httpd's log file but I'm not 100% sure. Here is how I configured the log:

ErrorLog /data/log/HTTP.apache.log
LogLevel info
<IfModule log_config_module>
    LogFormat "%h %f %t \t%I \t%O \t%D" common
    CustomLog "|/usr/bin/logger -tHTTP -pdaemon.info" common
</IfModule>

It seems this issue started to happen when we migrated from Solaris 11.3 to 11.4. 
I have not yet had the opportunity to try with libapr 1.7.4, mainly because I did not find a way to reproduce the issue systematically, I have to wait for it to happen. Also there seems to be not difference between 1.6.2 and 1.7.4 for this part of the core.

I'm no expert in this field but as far as I know, regarding the write() function, POSIX does not guarantee the concurrency between processes, and it seems there is not locking mechanism for unbuffered writes.
Comment 1 Yann Ylavic 2024-04-15 10:19:42 UTC
Possibly a duplicate of bug 61786, can you reproduce with libapr-1.6.5?
Comment 2 Yann Ylavic 2024-04-15 10:30:32 UTC
Or apr-1.7.4 which includes the fix too (r1819937 in 1.7.x, r1819938 in 1.6.x).
Comment 3 thutop.dorje 2024-04-15 12:31:09 UTC
Not sure it's the same issue because the call stacks are different but I will test it anyway with apr 1.7.4. I've not found a way to reproduce the problem so I will have to let it run until it happen again.
Comment 4 Yann Ylavic 2024-04-15 15:05:56 UTC
It could be same issue because if the listener thread does not poll/empty the wakeup pipe after the first wake up (which is the bug), the following wake ups will fill the pipe and block indefinitely when it's full (your backtrace for lwp/thread #6).
Comment 5 thutop.dorje 2024-05-21 13:24:30 UTC
We tested in parallel multiple Apache instances, half of them with libapr 1.6.2 and the other half with 1.7.4. After a few weeks, all the 1.6.2 instances were blocking, and to this day the 1.7.4 did not block again.

As we are satisfied with 1.7.4 and don't have other way to reproduce the issue I will mark the case as resolved. Thanks.

*** This bug has been marked as a duplicate of bug 61786 ***