Issue 80300 - regcomp failed during final packaging again
Summary: regcomp failed during final packaging again
Status: CLOSED IRREPRODUCIBLE
Alias: None
Product: porting
Classification: Code
Component: code
Version: 680m223
Hardware: All Linux, all
Importance: P2 Trivial
Target Milestone: ---
Assignee: kay.ramme
QA Contact: issues@porting
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-02 23:06 UTC by pavel
Modified: 2008-12-06 20:23 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Patch to make dl_fini wait 10s after destructing the vtablefactory ... (592 bytes, patch)
2007-08-29 12:11 UTC, kay.ramme
Patch to let the pyuno gc threads wait 5s before actually trying to release an object (686 bytes, patch)
2007-08-29 12:13 UTC, kay.ramme

Description pavel 2007-08-02 23:06:55 UTC
Hi,

in my SRC680_m224 build for *many languages*, it failed again:

ERROR:  /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/regcomp -register -br /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/types.rdb -br /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/pyuno_services.rdb -r /data/oo/BuildDir/tmp/ooopackaging/i_74791186091826/unxlngx6.pro/OpenOffice/deb/services.rdb/fi_inprogress_1/services.rdb -c vnd.openoffice.pymodule:mailmerge -l com.sun.star.loader.Python 2>&1 |

This time, the language was fr. fi was OK, cs was OK, ...

My dmesg output contains:

typesconfig[26487]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffba5180 error 6
typesconfig[16372]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffff8db160 error 4
typesconfig[16373]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffff8db160 error 6
typesconfig[31234]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffe2e640 error 4
typesconfig[31235]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffe2e640 error 6
typesconfig[16504]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffdb6dc0 error 4
typesconfig[16505]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffdb6dc0 error 6
typesconfig[28670]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffe45100 error 4
typesconfig[28671]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffe45100 error 6
regcomp.bin[8648]: segfault at 00002aaaabec81a0 rip 00002aaaac89d3b2 rsp 0000000040400fe0 error 4

so regcomp.bin SIGSEGVed.

I can provide the IP/login as always.
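
A reading aid for the dmesg lines above, not part of the original report: assuming these are x86 page-fault error codes, the low three bits say whether the page was present, whether the access was a write, and whether it came from user mode. Under that assumption, "error 4" is a user-mode read and "error 6" a user-mode write of an unmapped address. The helper name `describe_fault` is made up for this sketch.

```cpp
#include <string>

// Decode the low three bits of an x86 page-fault error code, as printed by
// the kernel in "segfault at ... error N" lines.
//   bit 0: 0 = page not present, 1 = protection violation
//   bit 1: 0 = read access,      1 = write access
//   bit 2: 0 = kernel mode,      1 = user mode
// describe_fault is a hypothetical helper, not part of any OOo tool.
std::string describe_fault(unsigned code) {
    const std::string who    = (code & 4u) ? "user-mode" : "kernel-mode";
    const std::string access = (code & 2u) ? "write" : "read";
    const std::string cause  = (code & 1u) ? "protection violation"
                                           : "page not present";
    return who + " " + access + ", " + cause;
}
```

With this decoding, the typesconfig entries are reads and writes at address 0, while the regcomp.bin entry is a read of an address that is no longer mapped.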
Comment 1 pavel 2007-08-02 23:07:52 UTC
It happened to me also once in m223 during my builds.
Comment 2 pavel 2007-08-02 23:10:05 UTC
That machine has been running for 11 days and the log is full of typesconfig crashes - thus we know where I started 
builds 8) And there are also 3 regcomp.bin crashes right now.

cmc,kendy: have you seen something similar?
Comment 3 jkim 2007-08-06 17:40:53 UTC
FYI, ko failed for me on FreeBSD/amd64 -CURRENT.
Comment 4 pavel 2007-08-07 06:52:38 UTC
jkim: with the same message in the log?

What gcc are you using?

I use gcc version 4.0.2 20050901 (prerelease) (SUSE Linux).

kendy, cmc: can you please try to build for all languages and see if it fails with similar problem?
Comment 5 pavel 2007-08-07 08:34:30 UTC
just a note: my m225 build was OK, no break. I'll start several other builds to get it again...
Comment 6 jkim 2007-08-07 19:04:17 UTC
pjanik: yes, it was exactly the same error as far as I remember although I
didn't save the error message.

GCC version was:

%cc --version
cc (GCC) 4.2.1 20070719 [FreeBSD]
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Interestingly, I have since done the m224 build twice for ko and was not able to
reproduce it.
Comment 7 caolanm 2007-08-09 07:51:36 UTC
m225 all languages built for me on x86_64
Comment 8 kay.ramme 2007-08-13 15:29:59 UTC
Looking into this ...
Comment 9 pavel 2007-08-20 14:25:01 UTC
Kay: do you have any news on this?

I'd like to see this fixed for 2.3... I know this is a bit tough, but...
Comment 10 kay.ramme 2007-08-20 16:39:28 UTC
Pavel, is this still happening? jkim and cmc could build successfully ... and
your last comment was about trying to reproduce it.

Though, I may have an ugly fix for this.
Comment 11 pavel 2007-08-20 16:40:44 UTC
Yes, it unfortunately still happens - I had to re-run the build of OOG680_m1 for it.
Comment 12 pavel 2007-08-20 16:42:57 UTC
My build machine (you still have an account there):

 06:38:11 up 1 day,  2:04,  1 user,  load average: 1.19, 1.23, 1.25

regcomp.bin[21942]: segfault at 00002aaaabec81a0 rip 00002aaaac683693 rsp 0000000040400fa8 error 4

-> It was OOG680_m2 build, not m1 build.
Comment 13 pavel 2007-08-21 07:03:32 UTC
My next build of many languages failed for language sv:

I'm now rebuilding module instsetoo_native again...
Comment 14 kay.ramme 2007-08-21 08:15:41 UTC
Pavel, I just logged into your box, is there a core dump or something flying
around I can use to see what went wrong ? Is this only x64 or are you observing
this on other platforms as well?
Comment 15 kay.ramme 2007-08-23 09:49:42 UTC
Thanks to Pavel I have a core dump and a stack trace, and guess what, it dies
because of the Python GC again :-( :

#0  ... in ~ImplIntrospectionAccess (this=0x2aaaacb19c00) at Reference.hxx:115
#1  ... in ~Invocation_Impl (this=0x2aaaacb04710) at Reference.hxx:115
#2  ... in pyuno::PyUNO_del (self=0x2aaaac61b120) at Reference.hxx:115
#3  ... in PyDict_Next () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libpython2.3.so.1.0
#4  ... in PyMethod_Fini () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libpython2.3.so.1.0
#5  ... in pyuno::GCThread::run (this=0x2aaaacb02c10) at /data/oo/BuildDir/ooo_OOG680_m2_src/pyuno/source/module/pyuno_gc.cxx:88
#6  ... in threadFunc (param=0x2aaaacb02c10) at thread.hxx:200
#7  ... in osl_thread_start_Impl (pData=<value optimized out>) at thread.c:279
#8  ... in start_thread () from /lib64/tls/libpthread.so.0
#9  ... in clone () from /lib64/tls/libc.so.6

The other thread shows that the process has nearly terminated by the time the GC
finally wakes up:

#0  ... in rtl_arena_free (arena=0x2aec72ea21a0, addr=0x2aaaaaacd000, size=<value optimized out>) at alloc_arena.c:456
#1  ... in rtl_cache_slab_free (cache=0x2aec72ea0b20, addr=<value optimized out>) at alloc_cache.c:594
#2  ... in rtl_cache_magazine_clear (cache=0x2aec72ea0b20, mag=0x2aaaac48d870) at alloc_cache.c:663
#3  ... in rtl_cache_deactivate (cache=0x2aec72ea0b20) at alloc_cache.c:1004
#4  ... in rtl_cache_fini () at alloc_cache.c:1709
#5  ... in __do_global_dtors_aux () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libuno_sal.so.3
#6  ... in ?? ()
#7  ... in _fini () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libuno_sal.so.3
#8  ... in ?? ()
#9  ... in _dl_fini () from /lib64/ld-linux-x86-64.so.2

So, it seems that we need to find a quick fix for the Python GC thingy, while
later on we also need to address the C++ statics vs. atexit "feature" once and
for all :-)


Comment 16 kay.ramme 2007-08-23 10:31:30 UTC
Adding Joerg (the pyuno maintainer) to cc: ...

->Joerg: The g_destructorsOfStaticObjectsHaveBeenCalled variable is actually true :

#5  0x00002aaaac15c6ea in pyuno::GCThread::run (this=0x2aaaacb02c10) at /data/oo/BuildDir/ooo_OOG680_m2_src/pyuno/source/module/pyuno_gc.cxx:88
Current language:  auto; currently c++
(gdb) p g_destructorsOfStaticObjectsHaveBeenCalled   
$1 = true
(gdb) 

This is obviously racy, couldn't we just join the GC threads in the d'tor of the
StaticDestructorGuard ?

Thanks for your help


        Kay
Comment 17 kay.ramme 2007-08-23 14:58:02 UTC
.
Comment 18 joergbudi 2007-08-23 19:48:52 UTC
Hi,

a synchronisation will not work anymore, when _dl_fini() is already running,
because destructors of static objects for other shared libraries have already
been called (thus e.g. the uno objects, that shall be destroyed by the gc
threads may crash even then).

Just add an _exit() call in regcomp, and everything is fine until we have a
destructor API.

Bye,

Joerg
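
To illustrate why the _exit() suggestion sidesteps the teardown: a minimal sketch, assuming a POSIX system, showing that exit() runs registered cleanup handlers (and, likewise, static destructors) while _exit() skips them. The names `cleanup_ran` and `report_cleanup` are hypothetical and not taken from regcomp; the check runs in a forked child so that the calling process survives.

```cpp
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int g_report_fd = -1;  // write end of the pipe, used by the handler

// Cleanup handler: tells the parent process that process teardown ran.
static void report_cleanup() {
    const char mark = 'x';
    const ssize_t n = write(g_report_fd, &mark, 1);
    (void)n;
}

// Forks a child that registers a cleanup handler and then terminates via
// exit() or _exit(). Returns true if the handler ran in the child.
bool cleanup_ran(bool use_underscore_exit) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {                       // child process
        close(fds[0]);
        g_report_fd = fds[1];
        std::atexit(report_cleanup);
        if (use_underscore_exit) _exit(0);  // no handlers, no static d'tors
        std::exit(0);                       // runs handlers and static d'tors
    }
    close(fds[1]);                        // parent process
    char buf = 0;
    const ssize_t n = read(fds[0], &buf, 1);  // 0 bytes means: handler skipped
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1;
}
```

Calling `cleanup_ran(false)` should report that the handler ran, `cleanup_ran(true)` that it did not. The price is that _exit() also skips any cleanup one actually wants, which is why it is described as a stopgap here.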
Comment 19 kay.ramme 2007-08-28 15:33:10 UTC
I put some more thoughts on this: The point basically is, that any
multi-threaded program may face exactly this issue, in case it does not join all
threads before termination and one of these threads executes in the context of a
shared library.

Actually "pthread_exit" called in the main thread seems to address this problem,
though I am not sure that any result other than "0" may be passed back to the
parent process ... ->Jörg: This may serve the needs you actually described in
the other issue under the terms of a "process termination synchronization API".

To make a long story short, I am still working on this ... ;-)
Comment 20 kay.ramme 2007-08-29 11:54:57 UTC
I created a short program actually showing the race:

http://wiki.services.openoffice.org/wiki/User:Kr/exit_race.c

The race is observable on Linux and Solaris; I did not try Windows yet.

The next step is to make this easily reproducible in our build scenario ...
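
The following is not the exit_race.c program linked above, only a minimal sketch of the same idea, assuming POSIX and C++11: a detached thread keeps running while exit() is already executing cleanup handlers in the main thread. All names (`worker_runs_during_exit`, `probe_during_exit`, `g_ticks`) are hypothetical. The experiment runs in a forked child and reports back through a pipe.

```cpp
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static std::atomic<long> g_ticks{0};  // advanced by the worker thread
static int g_result_fd = -1;          // pipe back to the parent process

// Runs inside exit(): checks whether the worker still makes progress.
static void probe_during_exit() {
    const long before = g_ticks.load();
    char verdict = '0';
    for (int i = 0; i < 2000 && verdict == '0'; ++i) {  // wait up to ~2 s
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        if (g_ticks.load() != before) verdict = '1';
    }
    const ssize_t n = write(g_result_fd, &verdict, 1);
    (void)n;
}

// Returns true if a detached thread was observed running while the main
// thread was already inside exit().
bool worker_runs_during_exit() {
    int fds[2];
    if (pipe(fds) != 0) return false;
    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {                       // child process
        close(fds[0]);
        g_result_fd = fds[1];
        std::thread([] {                  // never joined, like a GC thread
            for (;;) {
                ++g_ticks;
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }).detach();
        while (g_ticks.load() == 0) std::this_thread::yield();  // worker is up
        std::atexit(probe_during_exit);
        std::exit(0);                     // main thread starts the teardown
    }
    close(fds[1]);                        // parent process
    char verdict = '0';
    const ssize_t n = read(fds[0], &verdict, 1);
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1 && verdict == '1';
}
```

The sketch only touches a trivially destructible atomic, so nothing crashes. In the scenario of this issue the still-running thread uses objects and allocator caches that the teardown has already destroyed.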
Comment 21 kay.ramme 2007-08-29 12:11:40 UTC
Created attachment 47839
Patch to make dl_fini wait 10s after destructing the vtablefactory ...
Comment 22 kay.ramme 2007-08-29 12:13:53 UTC
Created attachment 47840
Patch to let the pyuno gc threads wait 5s before actually trying to release an object
Comment 23 kay.ramme 2007-08-29 12:17:10 UTC
I just added two patches for bridges and pyuno, delaying the pyuno GC for 5s and
the dl_fini after destructing the vtables for 10s. Applying these two patches
leads to reliable crashes in regcomp when trying to register a pyuno component
... any fix for pyuno must survive these patches ;-)
Comment 24 Stephan Bergmann 2007-08-29 13:11:41 UTC
#desc20:  "[...] and one of these threads executes in the context of a
shared library."  To what extent are shared libraries relevant here?

Also, this issue might be considered a duplicate of issue 63473.
Comment 25 kay.ramme 2007-08-29 13:39:49 UTC
->SB: Good question. You probably have seen my little proof regarding the race
between any threads and "exit", happening without any shared libraries (except
the standard ones) being involved. 

Actually, the proof is only about data and not about text (code), which is
somewhat harder to check. Assuming that the mapping of an executable's image is
not handled differently than the image of any shared library, then there is _no_
dependency on using shared libraries at all.

So, you are right, my previous comment was wrong in both respects: neither would
using "pthread_exit" in main help, nor does there seem to be a special
relationship to shared libraries ;-)
Comment 26 kay.ramme 2007-08-29 13:58:20 UTC
->SB: This issue is a duplicate of

http://udk.openoffice.org/issues/show_bug.cgi?id=63473

in the sense, that it wouldn't happen, if pyuno was terminating its threads
correctly. 

Despite that, I would expect 63473 to be more a documentation / best practice
issue ...
Comment 27 kay.ramme 2007-09-27 12:31:21 UTC
I recently posted a proposal regarding the thread lifecycle
(http://wiki.services.openoffice.org/wiki/User:Kr/A_Thread%27s_Life) 
on dev@OOo (http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=20781) .

Following this proposal, the pyuno bridge must join the GC threads during
de-initialization at the latest. Going to make a patch for this ...
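
As a rough illustration of "join the threads during de-initialization", and not the pyuno patch itself: a minimal C++11 sketch in which a bridge-like object starts helper threads for releases and blocks in `shutdown()` until all of them have finished. The class `Bridge` and all of its members are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a bridge that releases objects on helper threads.
class Bridge {
public:
    // Hand one "release" to a new helper thread.
    void release_async() {
        workers_.emplace_back([this] { ++released_; });
    }

    // De-initialization: block until every helper thread has finished, so
    // that none of them can still run once the teardown continues.
    void shutdown() {
        for (std::thread& t : workers_) {
            if (t.joinable()) t.join();
        }
        workers_.clear();
    }

    int released() const { return released_.load(); }

    ~Bridge() { shutdown(); }

private:
    std::atomic<int> released_{0};
    std::vector<std::thread> workers_;
};

// Starts `count` releases, shuts the bridge down, and returns how many
// releases had completed by the time shutdown() returned.
int release_all_then_shutdown(int count) {
    Bridge bridge;
    for (int i = 0; i < count; ++i) bridge.release_async();
    bridge.shutdown();
    return bridge.released();
}
```

Because `shutdown()` returns only after every join, the count it reports always equals the number of releases started. That ordering is what a flag checked by the GC thread cannot provide.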


Comment 28 pavel 2007-11-10 13:55:51 UTC
kr: do you have a patch?
Comment 29 kay.ramme 2007-11-12 08:28:28 UTC
Hi Pavel,

A workaround (though ugly) for this is to use "_exit" in regcomp.bin. A clean
solution (as well as a fix) is on the way, see my proposal regarding daemon
thread termination at
http://wiki.services.openoffice.org/wiki/User:Kr/A_Thread%27s_Life respectively
my mail on dev@OOo
(http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=20781).

Unfortunately there are still minor issues in the example implementation wrt the
Windows loader lock, which I hope to be able to tackle this week.

Sorry for being slow

     Kay
Comment 30 pavel 2007-12-02 09:46:05 UTC
kr: sorry to bring this up again, but now I have two machines building x86_64.

On the old one, I have 100% probability that regcomp.bin crashes.

On the new one, I have 30% probability that it crashes.

By patching regcomp.bin you mean simply adding _exit as the last command in SAL_IMPLEMENT_MAIN_WITH_ARGS(argc, argv) in cpputools/source/registercomponent/registercomponent.cxx?

Comment 31 kay.ramme 2007-12-03 10:41:43 UTC
->Pavel: Yes, exactly.
Comment 32 pavel 2007-12-04 05:27:22 UTC
kr: OK, I'm now using

http://ftp.linux.cz/pub/localization/OpenOffice.org/devel/build/Patches/SRC680/i80300-workaround-crashing-regcomp.diff

and the first build:

regcomp.bin[3631] general protection rip:2b7514dbef46 rsp:401fff80 error:0

Comment 33 kay.ramme 2007-12-04 15:16:37 UTC
->Pavel, if I understand correctly, this means it still dies abnormally, right?
What if you move the "_exit" before the "if ( xComponent.is() )" ?
Comment 34 joergbudi 2007-12-04 19:55:26 UTC
@pjanik: can u provide a callstack from a core ?
Comment 35 pavel 2007-12-05 09:45:35 UTC
kr: I applied this patch

http://ftp.linux.cz/pub/localization/OpenOffice.org/devel/build/Patches/SRC680/i80300-workaround-crashing-regcomp-1.diff

and regcomp.bin still segfaults randomly when building *many* languages.

Comment 36 pavel 2008-11-26 09:24:57 UTC
JFYI: I haven't seen this bug for a long time now.

Comment 37 kay.ramme 2008-11-28 12:27:58 UTC
Pavel, I suggest closing this for the moment ... if it happens again, we can
reopen it.
Comment 38 pavel 2008-11-28 13:47:29 UTC
Yes.
Comment 39 Mechtilde 2008-12-06 20:23:16 UTC
worksforme -> closed