Bug 58008

Summary: Crash when Apache is shutting down
Product: Rivet
Component: mod_rivet
Version: 2.2.3
Hardware: PC
OS: FreeBSD
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Target Milestone: mod_rivet
Reporter: Mikhail T. <mi+apache>
Assignee: Apache Rivet Mailing list account <rivet-dev>
CC: mxmanghi
Attachments: no channel unregistering

Description Mikhail T. 2015-06-07 12:15:33 UTC
When trying to shut Apache down, it crashes with a self-induced SIGABRT and the following stack:

#0  0x00000008019501aa in thr_kill () from /lib/libc.so.7
#1  0x0000000801950116 in raise () from /lib/libc.so.7
#2  0x000000080194e8f9 in abort () from /lib/libc.so.7
#3  0x0000000808b7d0ff in Rivet_Panic (arg1=<optimized out>) at apache-2/mod_rivet.c:1049
#4  0x00000008088e0d84 in Tcl_PanicVA (format=<optimized out>, argList=<optimized out>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclPanic.c:99
#5  0x00000008088e0e72 in Tcl_Panic (format=0x18c3b <error: Cannot access memory at address 0x18c3b>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclPanic.c:153
#6  0x00000008088be542 in CutChannel (chan=<optimized out>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclIO.c:3118
#7  CloseChannel (interp=<optimized out>, chanPtr=<optimized out>, errorCode=<optimized out>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclIO.c:2978
#8  FlushChannel (interp=<optimized out>, chanPtr=<optimized out>, calledFromAsyncFlush=<optimized out>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclIO.c:2863
#9  0x00000008088bb7ec in Tcl_Close (interp=0x81c582010, chan=<optimized out>) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclIO.c:3411
#10 0x00000008088bbfdc in Tcl_UnregisterChannel (interp=0x81c582010, chan=0x81c5c4390) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclIO.c:1247
#11 0x0000000808b7c8d1 in Rivet_ChildHandlers (s=0x80246b050, init=0) at apache-2/mod_rivet.c:1325
#12 0x0000000808b7c97f in Rivet_ChildExit (data=0x80246b050) at apache-2/mod_rivet.c:1399
#13 0x00000008011b24be in apr_pool_destroy () from /opt/lib/libapr-1.so.0
#14 0x0000000805733872 in clean_child_exit (code=0) at prefork.c:218
#15 0x0000000805733807 in just_die (sig=15) at prefork.c:344
#16 0x00000008015f38eb in ?? () from /lib/libthr.so.3
#17 0x00000008015f2ffc in ?? () from /lib/libthr.so.3
#18 <signal handler called>
#19 0x000000080195136a in _select () from /lib/libc.so.7
#20 0x00000008015f0f72 in ?? () from /lib/libthr.so.3
#21 0x000000080891be70 in NotifierThreadProc (clientData=<optimized out>) at tclUnixNotfy.c:1216
#22 0x00000008015ee775 in ?? () from /lib/libthr.so.3

(I'm using rivet-2.2.3, but this latest version is not yet mentioned in the list of valid versions in Bugzilla.)
Comment 1 Massimo Manghi 2015-06-09 14:54:02 UTC
Please, would you tell us what the global parameters in your configuration are? For example, the output of

puts [::rivet::inspect -all]

or 

array set inspect_data [::rivet::inspect -all]
parray inspect_data

would help. 

Consider also updating your trunk working copy, or re-exporting from trunk if you didn't create your own working copy.
Comment 2 Massimo Manghi 2015-06-09 16:03:30 UTC
Hold on, your code is from the 2.2 series, not trunk! What version of Rivet are you actually using?
Comment 3 Mikhail T. 2015-06-09 16:21:53 UTC
(In reply to Massimo Manghi from comment #1)
> puts [::rivet::inspect -all]

Here it is in a table: http://aldan.algebra.com/~mi/tmp/t.rvt

(In reply to Massimo Manghi from comment #2)
> What version of rivet are you using actually?

As I wrote from the very beginning:

>> (I'm using rivet-2.2.3, but this latest version is not yet mentioned in the list of valid versions in Bugzilla.)

Hope this helps.
Comment 4 Massimo Manghi 2015-06-09 16:27:05 UTC
Yes, you're right; I've added it now. How many virtual hosts do you have in your configuration?
Comment 5 Mikhail T. 2015-06-09 16:31:27 UTC
(In reply to Massimo Manghi from comment #4)
> Yes, you're right, I added it now. How many virtual hosts do you have in
> your configuration?

Quite a few (not even sure -- easily 10).

Worse, some of them use Websh... I had a problem with their coexistence before (you may remember Bug 54162) -- maybe we are witnessing a regression of that once-fixed problem?
Comment 6 Massimo Manghi 2015-06-09 16:39:18 UTC
I remember that bug. You have a method at hand that might give a strong indication of whether there is a conflict: disable your WebSh-based web sites and see what happens.
Comment 7 Massimo Manghi 2015-06-09 16:41:39 UTC
Forgive the wrong wording of my recommendation: I meant disabling the WebSh-based stuff by not loading WebSh at all, and seeing what happens.
Comment 8 Mikhail T. 2015-06-09 16:44:12 UTC
No, removing the websh-related pieces (the LoadModule directive and the websh-using vhost definition) did not help :-/

(If you'd like to investigate in situ, send me your passwd-entry and ssh-key.)
Comment 9 Mikhail T. 2015-06-09 16:59:38 UTC
More specifically, ALL of the child httpd processes crash. One of them often dies from a SEGFAULT (signal 11):

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000008011b28c3 in apr_pool_clear () from /opt/lib/libapr-1.so.0
(gdb) where
#0  0x00000008011b28c3 in apr_pool_clear () from /opt/lib/libapr-1.so.0
#1  0x000000080573342f in child_main (child_num_arg=5) at prefork.c:597
#2  0x00000008057327fe in make_child (s=0x80246b050, slot=5) at prefork.c:800
#3  0x0000000805732d35 in perform_idle_server_maintenance (p=0x802421028) at prefork.c:902
#4  0x00000008057315b5 in prefork_run (_pconf=0x802421028, plog=0x80246f028, s=0x80246b050) at prefork.c:1090
#5  0x000000000043c63b in ap_run_mpm (pconf=0x802421028, plog=0x80246f028, s=0x80246b050) at mpm_common.c:94
#6  0x0000000000430649 in main (argc=2, argv=0x7fffffffe640) at main.c:777
(gdb) info thread
  Id   Target Id         Frame 
* 4    Thread 802406400 (LWP 101604) 0x00000008011b28c3 in apr_pool_clear () from /opt/lib/libapr-1.so.0
  3    Thread 802407c00 (LWP 100778) 0x0000000801879cea in chdir () from /lib/libc.so.7
* 1    Thread 802406400 (LWP 101604) 0x00000008011b28c3 in apr_pool_clear () from /opt/lib/libapr-1.so.0

while all others die in Rivet_Panic as already submitted.
Comment 10 Massimo Manghi 2015-06-09 21:59:31 UTC
Created attachment 32801
no channel unregistering

I can't reproduce the problem, but I think it's probably pedantic to unregister a channel just before the child exits (at which point all of its memory is released anyway).

Would you please try this patch and check whether it has any side effects?

If you had a chance to test the code in trunk using the worker MPM, the feedback would be highly valuable.
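[Editor's note: the attachment body is not reproduced in this thread. Based on its title ("no channel unregistering") and the backtrace, which shows Tcl_UnregisterChannel being called from Rivet_ChildHandlers at apache-2/mod_rivet.c:1325, the change plausibly amounts to something like the sketch below. The exact variable names are assumptions, not the actual patch.]

```diff
--- a/apache-2/mod_rivet.c
+++ b/apache-2/mod_rivet.c
@@ Rivet_ChildHandlers @@
-        Tcl_UnregisterChannel(rsc->server_interp, *(rsc->outchannel));
+        /* Skip explicit channel unregistering on child exit: the process
+         * is terminating and apr_pool_destroy is about to release all of
+         * its memory anyway, and unregistering here can trigger a
+         * Tcl_Panic in the channel flush/close path. */
```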
Comment 11 Massimo Manghi 2015-06-10 08:36:35 UTC
I tested your case on two machines, with mixed results: it was impossible to reproduce the problem on my PC at home, whereas a quick test on a PC at work showed a similar backtrace. In that case one of the calls in the backtrace was into TclX, which was not called at all from either mod_rivet or the application code. I uninstalled TclX, but the problem persisted with a slightly different backtrace output.
The problem seems to be around Tcl_UnregisterChannel, but given your configuration I don't understand how, since it should be called only once.

Notice also that the other backtrace you showed has no calls to mod_rivet in the stack. We should understand whether there is still some inter-module interference.

Are you running Apache 2.2 or 2.4? Which patchlevel version exactly?
Comment 12 Mikhail T. 2015-06-10 13:21:28 UTC
With the patch applied, the crash happens a few steps later, when the interpreter itself is deleted:

#0  0x00000008019501aa in thr_kill () from /lib/libc.so.7
#1  0x0000000801950116 in raise () from /lib/libc.so.7
#2  0x000000080194e8f9 in abort () from /lib/libc.so.7
#3  0x0000000808b7d08f in ?? () from /opt/libexec/apache24/mod_rivet.so
#4  0x00000008088e0d84 in Tcl_PanicVA (format=<optimized out>, argList=<optimized out>)
    at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclPanic.c:99
#5  0x00000008088e0e72 in Tcl_Panic (format=0x18b90 <error: Cannot access memory at address 0x18b90>)
    at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclPanic.c:153
#6  0x000000080881a60f in Tcl_AsyncDelete (async=0x8024fbed0) at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclAsync.c:283
#7  0x000000080881c021 in DeleteInterpProc (interp=0x81c582010)
    at /home/ports/lang/tcl86/work/tcl8.6.4/generic/tclBasic.c:1423
#8  0x0000000808b7c90f in ?? () from /opt/libexec/apache24/mod_rivet.so
#9  0x00000008011b24be in apr_pool_destroy () from /opt/lib/libapr-1.so.0
#10 0x0000000805733872 in clean_child_exit (code=0) at prefork.c:218
#11 0x0000000805733807 in just_die (sig=15) at prefork.c:344
#12 0x00000008015f38eb in ?? () from /lib/libthr.so.3
#13 0x00000008015f2ffc in ?? () from /lib/libthr.so.3
#14 <signal handler called>
#15 0x000000080195136a in _select () from /lib/libc.so.7
#16 0x00000008015f0f72 in ?? () from /lib/libthr.so.3
#17 0x000000080891be70 in NotifierThreadProc (clientData=<optimized out>) at tclUnixNotfy.c:1216
#18 0x00000008015ee775 in ?? () from /lib/libthr.so.3
#19 0x0000000000000000 in ?? ()

> Are you running Apache 2.2 or 2.4? Which patchlevel version exactly?

Using apache-2.4.12.
Comment 13 Massimo Manghi 2015-12-05 11:22:04 UTC
It should be OK now with Rivet 2.2.4.