112605 – [sw] crash during shutdown: no SolarMutex

Issue 112605 - [sw] crash during shutdown: no SolarMutex

Summary: [sw] crash during shutdown: no SolarMutex

Status:	CONFIRMED

Alias:	None

Product:	Writer
Classification:	Application
Component:	code (show other issues)
Version:	DEV300m80
Hardware:	All All

Importance:	P4 Trivial (vote)
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-06-22 16:03 UTC by mst.ooo
Modified:	2017-05-20 11:19 UTC (History)
CC List:	8 users (show)

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description mst.ooo 2010-06-22 16:03:51 UTC

just captured this most peculiar critter:

alDisplay::~SalDisplay()
SalDisplay::DeInitRandR()
GtkXLib::~GtkXLib()
==26317== Thread 7:
==26317== Invalid read of size 4
==26317==    at 0xA6B2081: Application::GetSolarMutex() (svapp.cxx:528)
==26317==    by 0x1DE39D19: sw::UnoImplPtr<SwXTextCursor::Impl>::~UnoImplPtr()
(unobaseclass.hxx:126)
==26317==    by 0x1DE34AF1: SwXTextCursor::~SwXTextCursor() (unoobj.cxx:919)
==26317==    by 0x74F7C0D: cppu::OWeakObject::release() (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppuhelpergcc3.so.3)
==26317==    by 0x1DE39A38:
cppu::WeakImplHelper12<com::sun::star::lang::XServiceInfo,
com::sun::star::beans::XPropertySet, com::sun::star::beans::XPropertyState,
com::sun::star::beans::XMultiPropertyStates,
com::sun::star::container::XEnumerationAccess,
com::sun::star::container::XContentEnumerationAccess,
com::sun::star::util::XSortable, com::sun::star::document::XDocumentInsertable,
com::sun::star::text::XSentenceCursor, com::sun::star::text::XWordCursor,
com::sun::star::text::XParagraphCursor,
com::sun::star::text::XRedline>::release() (implbase12.hxx:120)
==26317==    by 0x1DE2C714: SwXTextCursor::release() (unoobj.cxx:3199)
==26317==    by 0x101932F7: ??? (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libgcc3_uno.so)
==26317==    by 0x75ABAA9: ??? (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppu.so.3)
==26317==    by 0x75B4108: ??? (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppu.so.3)
==26317==    by 0x75B4181: uno_Environment_invoke_v (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppu.so.3)
==26317==    by 0x75B41B4: uno_Environment_invoke (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppu.so.3)
==26317==    by 0x75A73A9: ??? (in
inst/OO_m80_pooltest_li/OOo_DEV300m80_Linux_x86_install-arc_en-US/openoffice.org/ure/lib/libuno_cppu.so.3)
==26317==  Address 0x4 is not stack'd, malloc'd or (recently) free'd
==26317== 
==26317== 
==26317== Process terminating with default action of signal 11 (SIGSEGV)
==26317==  Access not within mapped region at address 0x4
==26317==    at 0xA6B2081: Application::GetSolarMutex() (svapp.cxx:528)
==26317==    by 0x1DE39D19: sw::UnoImplPtr<SwXTextCursor::Impl>::~UnoImplPtr()
(unobaseclass.hxx:126)
==26317==    by 0x1DE34AF1: SwXTextCursor::~SwXTextCursor() (unoobj.cxx:919)


so it seems that during shutdown the SolarMutex may actually be 0... but
Application::GetSolarMutex() returns a reference, not a pointer...
[ but a crash during shutdown won't impact the user much and isn't so important... ]

Comment 1 philipp.lohmann 2010-06-23 07:58:51 UTC

Reference or no; any call to any vcl class or even object after DeInitVCL has
been called is an error and likely to crash. Your problem is more that there is
a reference of SwXTextCursor flying around after the application (including the
framework) is already gone.

Comment 2 Mathias_Bauer 2010-06-23 08:43:32 UTC

It can't be avoided that such references exist - if they are held in a Java
program it's the garbage collection that decides when they are released.

The real problem is that we never implemented a life time control of UNO
objects. VCL shouldn't be shut down as long as any external reference exist. COM
has such a mechanism and many years ago I suggested something similar to Markus
Meyer for UNO objects and services. Unfortunately that wasn't seen as important
enough to judge the effort of its implementation.

Alternatively the life time of the SolarMutex could be controlled outside of
VCL. But AFAIK and as you probably know even better, that most probably also
isn't easy.

Comment 3 philipp.lohmann 2010-06-23 09:03:17 UTC

easy would not be the problem. Just that would be no solution. So you have a
SolarMutex outside vcl's lifetime and then what ? The mutex is just for
protecting the underlying vcl (or vcl derived) object, which you would probably
want to access -> crash in object instead of solar mutex. No gain.

Comment 4 Mathias_Bauer 2010-06-23 10:00:29 UTC

The SolarMutex protects all code in VCL and all other code where the developers
where too lazy or not able to create their own mutex. You are right if the
object in question has some dependency on VCL code. But if not (and that might
be something easier to achieve as we are only talking about the code in a dtor),
a SolarMutex outside of VCL could help. 

Besides that, I still would prefer the other solution: if a refcounted object
requires something from VCL (e.g. SolarMutex) there must be a way to avoid VCL
shutdown as long as the object exists. A possible way could be to add a member
to all UNO objects using the SolarMutex that "refcounts" the VCL based
application and prevents that it is shut down.

We also could fix that in UNO - every bridge can hold such a reference so that
the application can't go down as long as an external bridge exist. That would be
something as the COM solution I mentioned. In the olden times we counted an
reference on our Application class for every COM connection plus one reference
if the application also was started by a user action. Even if the user called
"File-Exit", the application did not terminate until all external connections
had been released.

Comment 5 philipp.lohmann 2010-06-23 10:10:15 UTC

So any number of code gone wrong (hanging server, forgotten refcount) can then
prevent the application from ever quitting again. Just saying.

However preventing vcl "shutdown" is simple, just don't call DeInitVCL until you
mean it. That is prevent framework from calling DeInitVCL before all such
references are cleared.

Comment 6 Mathias_Bauer 2010-06-23 10:25:42 UTC

> So any number of code gone wrong (hanging server, forgotten refcount) can then
> prevent the application from ever quitting again. Just saying.

Sure. So as all code in office that loops prevents the office from being shut down. 

> That is prevent framework from calling DeInitVCL before all such references are 
> cleared.

I would even prevent OOo from leaving the message loop as then it still can
receive user requests. That's exactly as we did it when the only external
connections we had were COM connections.

Comment 7 philipp.lohmann 2010-06-23 10:43:51 UTC

Just that looping code (aka bugs) is still possible should not prevent us from
being at least moderately safe from forgotten references; no need to make the
situation worse IMHO. And WRT "exactly as we did", why should external
connections preventing the user from exiting the application a good thing ?
He'll invariably think that OOo hangs.

If this "exit prevention" need be I'd at least want to have some kind of message
box querying "Connections are still open, do you really want to quit" or some such.

Comment 8 Mathias_Bauer 2010-06-23 10:54:07 UTC

The user will still be able to close the application (means: all its windows) -
but it will stay in memory (and in its message loop) until all external
connections are closed. In the meantime the user also will be able to open new
documents from his desktop, thus again establishing the "internal" refcount on
the application.

So a hanging external connection in the worst case would just prevent OOo from
being removed from memory - it would act as a UI less quickstarter. ;-)

Comment 9 philipp.lohmann 2010-06-23 12:31:05 UTC

... with the added benefit that you cannot logout or shutdown anymore because of
a still running application. Great :-(

Comment 10 caolanm 2010-06-24 09:25:37 UTC

My own 2 cents is that I'd like to see a much longer stack to see some more
details of this particular example, i.e. --num-callers=50 or something. We get
quite a few crash-on-shutdown traces in fedora as our crash-reporting
infrastructure is global over all apps and over their entire lifetime. And they
fall into two main categories. 

I've seen ones like this one with a11y shutdown where the a11y shutdown and
release of uno references is called effectively from "spi_atk_bridge_exit_func"
which nightmarishly is called from atexit from the accessibility libs which
makes avoiding this tricky.

The other family is due to the current cppu shutdown being crippled to attempt
to "leak" resources to avoid accessing dead ones, but this has the side effect
of not calling the dtors which cleanly bring down the remotebridge reader/writer
threads leaving them generally blocked on mutexes and then after those global
mutex gets destroyed on main processes exit the threads unblock and trample all
over the corrupt corpse of OOo. The common case is unopkg crashes which are no
fault of unopkg. See issue 107490 for some thoughts on fixing that up.

Comment 11 kay.ramme 2010-06-24 09:35:16 UTC

... somehow I have been added to this issue.

Regarding remote references to (Uno) objects in the office process, it is either 
correct to 
* programmatically shutdown any remote bridges (including in process runtimes 
such as Java), or
* to ensure that reachable objects are valid.

We need to define the meaning of "file/exit" accordingly, as already suggested.

The object life cycle in OOo is (partly) ref-count based, objects reachable are 
alive, objects becoming unreachable are (mostly synchronously) destroyed. 
Ideally VCL would follow this approach, e.g. create a "Display" with some 
controls on it, the controls hold references to the "Display", ensuring that it 
is kept alive at least wrt memory.

Philipp is right about VCL's SolarMutex, the SolarMutex as well as VCL must not 
be in use after DeInitVCL (by they way, wasn't there any debugging aid to ensure 
this?). This bears the question what the sense of Tools SolarMutex is. Either 
use the VCL SolarMutex or the OSL GlobalMutex.

Comment 12 Mathias_Bauer 2010-06-24 10:01:01 UTC

The reason for tools::SolarMutex is that you can use it without linking against
vcl. This was necessary to remove vcl build dependencies from some libraries.

Comment 13 kay.ramme 2010-06-24 10:09:05 UTC

Matthias, thanks for clarifying. That's what I expected. 

What about doing it "right" and utilizing the OSL GlobalMutex for VCL and 
friends? The minimum would be to move the implementation of the VCL SolarMutex to 
the Tools, AFAIR VCL depends on Tools anyway.

Comment 14 Mathias_Bauer 2010-06-24 10:32:01 UTC

That would indeed be a very welcomed change.
AFAIR the implementation if SolarMutex is special and can't be changed that way.
pl should know that better than anyone else. :-)

Comment 15 kay.ramme 2010-06-24 10:44:25 UTC

I remember, you are right. The SolarMutex is platform native implemented, ...

Comment 16 mst.ooo 2010-06-24 10:53:47 UTC

just for the record: the sw::UnoImplPtr that takes the SolarMutex is new in
DEV300m71.
i've introduced it to solve issue 105557, which led to crashes due to race
conditions.
the entire UNO API implementation of the writer uses the SolarMutex for locking.
and this crash doesn't seem to happen "often", i think i've seen it only once.

hmmm... i wonder if the mythical UNO threading framework would solve this problem :)

Comment 17 kay.ramme 2010-06-24 12:22:08 UTC

It could, the steps are:

-1- consolidate / reduce the SalYieldMutex implementations in VCL: There is 
still some craft in there from a mythical Java client.
-2- Create a Uno Purpose Bridge for the SolarMutex: http://wiki.services.openoffice.org/wiki/Uno/Cpp/Spec/Purpose_Bridge_Implementat
ion_Helper (seems that some of may pages have been deleted by WikiBot).
-3- Adapt the belonging components to be loaded into the "solarmutex" 
environment.
-4- Remove any left overs, including all the guards etc.

Step -1- is not trivial and error prone.

Comment 18 kay.ramme 2010-06-24 12:26:55 UTC

SolarMutex implementations can be found in:

vcl/os2/source/app/salinst.cxx
vcl/unx/headless/svpinst.cxx
vcl/unx/source/app/salinst.cxx
vcl/unx/gtk/app/gtkinst.cxx
vcl/win/source/app/salinst.cxx
vcl/aqua/source/app/salinst.cxx

Comment 19 Marcus 2017-05-20 11:19:45 UTC

Reset assigne to the default "issues@openoffice.apache.org".