Issue 107490 - cppu: various crashes during exit of remote uno bridge process
Summary: cppu: various crashes during exit of remote uno bridge process
Status: CLOSED FIXED
Alias: None
Product: udk
Classification: Code
Component: code (show other issues)
Version: OOO320m7
Hardware: All Linux, all
: P3 Trivial with 1 vote (vote)
Target Milestone: 4.0.0
Assignee: Stephan Bergmann
QA Contact: issues@udk
URL:
Keywords:
: 111223 112305 113591 (view as issue list)
Depends on: 112602 112603
Blocks: 90439 112782
  Show dependency tree
 
Reported: 2009-12-07 10:53 UTC by caolanm
Modified: 2010-10-06 14:23 UTC (History)
2 users (show)

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
suggested collection of changes (10.46 KB, patch)
2009-12-07 10:54 UTC, caolanm
no flags Details | Diff
update patch, in addition ThreadAdmin has to outlive ORequestThread (11.03 KB, text/plain)
2010-06-22 14:33 UTC, caolanm
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description caolanm 2009-12-07 10:53:57 UTC
a) uno_initEnvironment in bridges/source/remote/urp/urp_environment.cxx spawns
off two threads, a reader and a writer. 

If we take the writer thread as the simpler case of the two, then 
OWriterThread::run runs until m_bAbort is true. 
m_bAbort is only true when OWriterThread::abortThread is called.
Only RemoteEnvironment_thisDispose calls abortThread

In the case of a remote uno-bridge using process, such as a command line python
script that connects to a OOo server then this remote uno-environment will
belong to the EnvironmentsData singleton in cppu/source/uno/lbenv.cxx 

That singleton's dtor will be called on exit (or dlclose). Due to the
CPPU_LEAK_STATIC_DATA define we never call dispose on the bridge. So there is no
route to call dispose on the remote bridge, and no way that the thread can be
shutdown. 

So it's not just a leak of static data, but a failure to shut down threads. The
threads remain blocked on mutexes that end up getting deleted. Threads enter
after the mutexes go away when the main process exits, everything is mush at
this stage so various crashes happen randomly in the still-alive threads.

b) So, assuming for the moment that there's no way to avoid the reader/writer
threading issues except to call dispose on them during shutdown then we have
some other issues.

We have a DisposedCallerAdmin, a ThreadPool and a DisposedCallerAdmin singleton.
They are destroyed correctly in the reverse order of creation during shutdown.
The EnvironmentsData singleton is created first, then the ThreadPool, then the
DisposedCallerAdmin, so dtor order is DisposedCallerAdmin, ThreadPool and
EnvironmentsData.

b.1) The ThreadPool uses the DisposedCallerAdmin, but the DisposedCallerAdmin
will be destroyed before it is.
b.2) The remote uno environment's dispose that should be called from the
DisposedCallerAdmin dtor uses the ThreadPool and DisposedCallerAdmin. But at the
time of the EnvironmentsData dtor they have already been destroyed.

c) The remote uno-environments's dispose wants to re-enter EnvironmentsData
during dispose.

So.. attached is a patch which removes the -DCPPU_LEAK_STATIC_DATA from cppu to
make sure that the remote uno-env reader/writer threads get a chance to exit,
and then uses some boost::shared_ptr stuff to ensure a sane dtor-ordering cycle
on the dependencies. i.e. 

theThreadPoolHolder holds a reference on theDisposedCallerAdminHolder, so the
threadpool can be guaranteed that thedisposedcalleradmin exists. Classes like
ORequestThread also hold a reference on theThreadPoolHolder. 

When we put a thread into our g_pThreadpoolHashSet, then we add a reference to
the theThreadPoolHolder in there as well, so the threadpool itself exists until
all its threads exit. In the patch, I just went for the simple solution of
making it a hash_set of references to the holder.

for (c) I'm not at sure that it makes complete sense to hold a mutex in the
EnvironmentsData::~EnvironmentsData as attempt to reenter EnvironmentsData from
any dispose connected calls block (the reader thread does this). But for
simplicity here I just went for a isDisposing boolean and returned early from
the offending method before attempting to acquire the offending mutex when its
set. Alternatively removing the mutexguard from the dtor might be a solution.

https://bugzilla.redhat.com/show_bug.cgi?id=541876 has the practical bug details
of crashes during exit of a sample python script
Comment 1 caolanm 2009-12-07 10:54:55 UTC
Created attachment 66532 [details]
suggested collection of changes
Comment 2 olistraub 2010-01-15 12:06:47 UTC
I applied the patch to the OOG680_m5 we're using, and it helped to reduce the
number of crashes. However, I now see a similar crash:

==30720== Thread 6:
==30720== Invalid read of size 4
==30720==    at 0x39D8A08293: pthread_mutex_lock (in /lib64/libpthread-2.5.so)
==30720==    by 0x7210593: osl_acquireMutex (mutex.c:144)
==30720==    by 0x720AB9B: osl::Mutex::acquire() (mutex.hxx:74)
==30720==    by 0x720ABF2: osl::Guard<osl::Mutex>::Guard(osl::Mutex*)
(mutex.hxx:153)
==30720==    by 0x7245AFB: rtl_moduleCount_release (unload.cxx:160)
==30720==    by 0xE6C69A9: bridges::cpp_uno::shared::Bridge::~Bridge()
(bridge.cxx:231)
==30720==    by 0xE6C6B85: bridges::cpp_uno::shared::freeMapping(_uno_Mapping*)
(bridge.cxx:60)
==30720==    by 0x6FC3E5A: uno_revokeMapping (lbmap.cxx:705)
==30720==    by 0xE6C69F8: bridges::cpp_uno::shared::Bridge::release()
(bridge.cxx:197)
==30720==    by 0xE6C8ED9:
bridges::cpp_uno::shared::freeUnoInterfaceProxy(_uno_ExtEnvironment*, void*)
(unointerfaceproxy.cxx:65)
==30720==    by 0x6FC8154: s_stub_defenv_revokeInterface (lbenv.cxx:417)
==30720==    by 0x6FCEC95: s_environment_invoke_v(_uno_Environment*,
_uno_Environment*, void (*)(__va_list_tag (*) [1]), __va_list_tag (*) [1])
(EnvStack.cxx:300)
==30720==    by 0x6FCED31: uno_Environment_invoke_v (EnvStack.cxx:319)
==30720==    by 0x6FCEDCD: uno_Environment_invoke (EnvStack.cxx:328)
==30720==    by 0x6FC6DDE: defenv_revokeInterface (lbenv.cxx:473)
==30720==    by 0xE6C8D93:
bridges::cpp_uno::shared::releaseProxy(_uno_Interface*) (unointerfaceproxy.cxx:104)
==30720==    by 0x10C1EC60: bridges_remote::Uno2RemoteStub::~Uno2RemoteStub()
(stub.cxx:336)
==30720==    by 0x10C1F263:
bridges_remote::freeUno2RemoteStub(_uno_ExtEnvironment*, void*) (stub.cxx:296)
==30720==    by 0x6FC8154: s_stub_defenv_revokeInterface (lbenv.cxx:417)
==30720==    by 0x6FCEC95: s_environment_invoke_v(_uno_Environment*,
_uno_Environment*, void (*)(__va_list_tag (*) [1]), __va_list_tag (*) [1])
(EnvStack.cxx:300)
==30720==    by 0x6FCED31: uno_Environment_invoke_v (EnvStack.cxx:319)
==30720==    by 0x6FCEDCD: uno_Environment_invoke (EnvStack.cxx:328)
==30720==    by 0x6FC6DDE: defenv_revokeInterface (lbenv.cxx:473)
==30720==    by 0x10C1FFBF: thisRelease (stub.cxx:66)
==30720==    by 0x10C0A5BB: bridges_urp::ServerMultiJob::execute() (urp_job.cxx:701)
==30720==    by 0x10C0CAC0: doit (urp_job.cxx:77)
==30720==    by 0x6FD323F: cppu_threadpool::JobQueue::enter(long, unsigned char)
(jobqueue.cxx:126)
==30720==    by 0x6FD3C53: cppu_threadpool::ORequestThread::run() (thread.cxx:199)
==30720==    by 0x6FD3E9A: cppu_requestThreadWorker (thread.cxx:56)
==30720==    by 0x72111FE: osl_thread_start_Impl (thread.c:279)
==30720==  Address 0xab9a098 is 16 bytes inside a block of size 40 free'd
==30720==    at 0x4A06B3E: free (vg_replace_malloc.c:323)
==30720==    by 0x7210540: osl_destroyMutex (mutex.c:125)
==30720==    by 0x720C16D: osl::Mutex::~Mutex() (mutex.hxx:65)
==30720==    by 0x7244E71: __tcf_0 (unload.cxx:140)
==30720==    by 0x39D7E33354: exit (in /lib64/libc-2.5.so)
==30720==    by 0x39D7E1D97A: (below main) (in /lib64/libc-2.5.so)

Can we do something similar here?
Comment 3 caolanm 2010-01-15 12:49:06 UTC
Do you have a route to reproduce (some of the time at least) this that you could
share ? In this case its ORequestThread that still running after main, a similar
problem undoubtedly.
Comment 4 olistraub 2010-01-18 11:24:59 UTC
Sorry, I need quite a big setup to reproduce this. However, I've now understood
what happens here:
Process A (from which I reported the stack trace) wants to exit. At this point
in time, the remote bridge still has threads open for communication. Depending
on the timing, it will result in a crash/deadlock (directly related to your
fix). It seems that with different timing, it crashes here.

I was able to fix this issue by explicitly destroying the remote bridges (using
XBridgeFactory and calling ->dispose() on all living bridges) before exitting.
Is this something that could be done internally?
Comment 5 olistraub 2010-04-07 08:44:12 UTC
Another crash, related to this fix sometimes happens in my environment:
Crash in ThreadAdmin::remove(ORequestThread) because this (ThreadAdmin) is NULL
Called by ORequestThread::onTerminated()
Called by cppu_requestThreadWorker(void*)
Called by osl_thread_start_Impl(void*)
Called by start_thread

It seems that the ORequestThread is performing its cleanup routine
(onTerminated()) after the ThreadAdmin has already been released (the main
thread is already in its cleanup code during exit().
Pereventing the crash should be quite easy, by checking for
ThreadAdmin::getInstance() != NULL in onTerminated.
Comment 6 caolanm 2010-04-07 08:57:29 UTC
do you have a reproducible scenario ?
Comment 7 olistraub 2010-04-07 09:22:51 UTC
In my environment it is currently reproducible (but it's based on the Openoffice
SDK, not OpenOffice itself).
Comment 8 Stephan Bergmann 2010-05-03 15:25:25 UTC
*** Issue 111223 has been marked as a duplicate of this issue. ***
Comment 9 caolanm 2010-06-22 14:33:25 UTC
Created attachment 70147 [details]
update patch, in addition ThreadAdmin has to outlive ORequestThread
Comment 10 caolanm 2010-06-22 15:08:43 UTC
cmc->olistraub: valgrinding a 1000 iteration loop of e.g.

unopkg remove --shared com.sun.PresenterScreen-linux_x86_64
unopkg list --shared
unopkg add --shared presenter-screen.oxt
unopkg list --shared

regularly gave me something like your last trace. Indeed ORequestThread can
outlive the ThreadAdmin singleton. Updated patch adds in that dependency as well. 

With the two subissues attached also added in to the mix, I don't see any error
output so far. I'd be interested if you find this version makes a difference to you
Comment 11 olistraub 2010-06-28 07:29:54 UTC
I tried to apply to patches to the OOG680_m5 source (yes, it's already ancient,
but it works for us...):
-couldn't apply patch to 112603, since the directory structure is different
-applied patch to 112602
-applied your new patch

I had to remove the patch to cppu/util/target.pmk (CPPU_LEAK_STATIC_DATA)
Comment 12 olistraub 2010-06-28 07:31:41 UTC
Accidentally hit the "Submit" button...

The CPPU_LEAK_STATIC_DATA stuff caused a crash in regcomp during OO build in
OOG680_m5. I assume it works in newer versions...

However, after removing the patch to this makefile, everything seems to work. I
have not seen any crash since using the new build.

Thanks, Oli
Comment 13 Stephan Bergmann 2010-06-28 09:58:35 UTC
*** Issue 112305 has been marked as a duplicate of this issue. ***
Comment 14 Stephan Bergmann 2010-06-28 10:01:30 UTC
@kr: Any reason not to move on with this issue?
Comment 15 kay.ramme 2010-06-28 10:17:25 UTC
Stephan, if the patches fix the problems, just go ahead.
Comment 16 Stephan Bergmann 2010-06-28 10:24:14 UTC
.
Comment 17 Stephan Bergmann 2010-07-07 13:54:51 UTC
applied attached openoffice.org-3.3.0.ooo107490.cppu.lifecycle.patch (plus
removal of now obsolete CPPU_LEAK_STATIC_DATA) as
<http://hg.services.openoffice.org/cws/sb126/rev/93a8d0356e64>; found no
problems so far (with build/subsequenttests on unxsoli4)
Comment 18 Stephan Bergmann 2010-07-12 10:50:31 UTC
+ <http://hg.services.openoffice.org/cws/sb126/rev/ba3c777f49bc>
Comment 19 Stephan Bergmann 2010-07-15 09:27:22 UTC
.
Comment 20 Stephan Bergmann 2010-08-09 14:30:58 UTC
*** Issue 113591 has been marked as a duplicate of this issue. ***
Comment 21 Stephan Bergmann 2010-10-06 14:23:52 UTC
.