Apache OpenOffice (AOO) Bugzilla – Issue 80300
regcomp failed during final packaging again
Last modified: 2008-12-06 20:23:16 UTC
Hi, in my SRC680_m224 build for *many languages*, it failed again:

  ERROR: /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/regcomp -register -br /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/types.rdb -br /data/oo/BuildDir/ooo_SRC680_m224_src/solver/680/unxlngx6.pro/bin/pyuno_services.rdb -r /data/oo/BuildDir/tmp/ooopackaging/i_74791186091826/unxlngx6.pro/OpenOffice/deb/services.rdb/fi_inprogress_1/services.rdb -c vnd.openoffice.pymodule:mailmerge -l com.sun.star.loader.Python 2>&1 |

This time, the language was fr. fi was OK, cs was OK, ...

My dmesg output contains:

  typesconfig[26487]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffba5180 error 6
  typesconfig[16372]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffff8db160 error 4
  typesconfig[16373]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffff8db160 error 6
  typesconfig[31234]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffe2e640 error 4
  typesconfig[31235]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffe2e640 error 6
  typesconfig[16504]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffdb6dc0 error 4
  typesconfig[16505]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffdb6dc0 error 6
  typesconfig[28670]: segfault at 0000000000000000 rip 0000000000400f9b rsp 00007fffffe45100 error 4
  typesconfig[28671]: segfault at 0000000000000000 rip 0000000000400fe1 rsp 00007fffffe45100 error 6
  regcomp.bin[8648]: segfault at 00002aaaabec81a0 rip 00002aaaac89d3b2 rsp 0000000040400fe0 error 4

So regcomp.bin SIGSEGVed. I can provide the IP/login as always.
It happened to me also once in m223 during my builds.
That machine is running 11 days and the log is full of typesconfig crashes - thus we know where I started builds 8) And there are also 3 regcomp.bin crashes right now. cmc,kendy: have you seen something similar?
FYI, ko failed for me on FreeBSD/amd64 -CURRENT.
jkim: with the same message in the log? What gcc are you using? I use gcc version 4.0.2 20050901 (prerelease) (SUSE Linux). kendy, cmc: can you please try to build for all languages and see if it fails with similar problem?
just a note: my m225 build was OK, no break. I'll start several other builds to get it again...
pjanik: yes, it was exactly the same error as far as I remember, although I didn't save the error message. GCC version was:

  %cc --version
  cc (GCC) 4.2.1 20070719 [FreeBSD]
  Copyright (C) 2007 Free Software Foundation, Inc.
  This is free software; see the source for copying conditions. There is NO
  warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Interestingly, I have done the m224 build twice for ko and I was not able to reproduce it again.
m225 all languages built for me on x86_64
Looking into this ...
Kay: do you have any news on this? I'd like to see this fixed for 2.3... I know this is a bit tough, but...
Pavel, is this still happening? jkim and cmc could build successfully ... and your last comment was about trying to reproduce it. Though, I may have an ugly fix for this.
Yes, it unfortunately still happens - I had to re-run the build of OOG680_m1 for it.
My build machine (you still have an account there):

  06:38:11 up 1 day, 2:04, 1 user, load average: 1.19, 1.23, 1.25
  regcomp.bin[21942]: segfault at 00002aaaabec81a0 rip 00002aaaac683693 rsp 0000000040400fa8 error 4

-> It was the OOG680_m2 build, not the m1 build.
My next build of many languages failed for language sv: I'm now rebuilding module instsetoo_native again...
Pavel, I just logged into your box, is there a core dump or something flying around I can use to see what went wrong ? Is this only x64 or are you observing this on other platforms as well?
Thanks to Pavel I have a core dump and a stack trace, and guess what, it dies because of the Python GC again :-( :

  #0 ... in ~ImplIntrospectionAccess (this=0x2aaaacb19c00) at Reference.hxx:115
  #1 ... in ~Invocation_Impl (this=0x2aaaacb04710) at Reference.hxx:115
  #2 ... in pyuno::PyUNO_del (self=0x2aaaac61b120) at Reference.hxx:115
  #3 ... in PyDict_Next () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libpython2.3.so.1.0
  #4 ... in PyMethod_Fini () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libpython2.3.so.1.0
  #5 ... in pyuno::GCThread::run (this=0x2aaaacb02c10) at /data/oo/BuildDir/ooo_OOG680_m2_src/pyuno/source/module/pyuno_gc.cxx:88
  #6 ... in threadFunc (param=0x2aaaacb02c10) at thread.hxx:200
  #7 ... in osl_thread_start_Impl (pData=<value optimized out>) at thread.c:279
  #8 ... in start_thread () from /lib64/tls/libpthread.so.0
  #9 ... in clone () from /lib64/tls/libc.so.6

The other thread shows that the process had nearly terminated when the GC finally woke up:

  #0 ... in rtl_arena_free (arena=0x2aec72ea21a0, addr=0x2aaaaaacd000, size=<value optimized out>) at alloc_arena.c:456
  #1 ... in rtl_cache_slab_free (cache=0x2aec72ea0b20, addr=<value optimized out>) at alloc_cache.c:594
  #2 ... in rtl_cache_magazine_clear (cache=0x2aec72ea0b20, mag=0x2aaaac48d870) at alloc_cache.c:663
  #3 ... in rtl_cache_deactivate (cache=0x2aec72ea0b20) at alloc_cache.c:1004
  #4 ... in rtl_cache_fini () at alloc_cache.c:1709
  #5 ... in __do_global_dtors_aux () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libuno_sal.so.3
  #6 ... in ?? ()
  #7 ... in _fini () from /data/oo/BuildDir/ooo_OOG680_m2_src/solver/680/unxlngx6.pro/lib/libuno_sal.so.3
  #8 ... in ?? ()
  #9 ... in _dl_fini () from /lib64/ld-linux-x86-64.so.2

So, it seems that we need to find a quick fix for the Python GC thingy, while later on we also need to address the C++ statics vs. atexit "feature" once and for all :-)
Adding Joerg (the pyuno maintainer) to cc: ...

->Joerg: The g_destructorsOfStaticObjectsHaveBeenCalled variable is actually true:

  #5 0x00002aaaac15c6ea in pyuno::GCThread::run (this=0x2aaaacb02c10) at /data/oo/BuildDir/ooo_OOG680_m2_src/pyuno/source/module/pyuno_gc.cxx:88
  Current language: auto; currently c++
  (gdb) p g_destructorsOfStaticObjectsHaveBeenCalled
  $1 = true
  (gdb)

This is obviously racy; couldn't we just join the GC threads in the d'tor of the StaticDestructorGuard?

Thanks for your help
Kay
Hi, synchronisation will no longer work once _dl_fini() is already running, because the destructors of static objects in other shared libraries have already been called (so e.g. the UNO objects that the GC threads are supposed to destroy may crash even then). Just add an _exit() in regcomp, and everything is fine until we have a destructor API. Bye, Joerg
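For readers unfamiliar with the difference, here is a minimal sketch (not the regcomp code) of what the suggested _exit() changes: exit() runs atexit handlers and static teardown, while _exit() skips them. The names teardown_handler and teardown_ran_in_child are hypothetical, and the demo forks a child only so the calling process survives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int g_pipe_wr = -1; /* hypothetical: pipe end the handler reports on */

/* Stands in for static destructors / atexit work that is unsafe to run
   while other threads are still alive. */
static void teardown_handler(void)
{
    const char mark = 'x';
    assert(g_pipe_wr >= 0);
    if (write(g_pipe_wr, &mark, 1) != 1) {
        /* nothing useful to do during process exit */
    }
}

/* Runs a child that terminates via exit() or _exit().
   Returns 1 if the child's atexit handler ran, 0 if it was skipped,
   -1 on error. */
int teardown_ran_in_child(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    fflush(NULL); /* avoid duplicating buffered output in the child */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        g_pipe_wr = fds[1];
        if (atexit(teardown_handler) != 0)
            _exit(2);
        if (use_underscore_exit) {
            fflush(NULL); /* _exit() does not flush stdio buffers */
            _exit(0);     /* skips atexit handlers and static teardown */
        }
        exit(0);          /* runs atexit handlers first */
    }
    close(fds[1]);
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1); /* 0 bytes at EOF if handler skipped */
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return n == 1 ? 1 : 0;
}
```

Because _exit() also skips flushing stdio buffers, any pending output has to be flushed explicitly before calling it.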
I put some more thought into this: The point basically is that any multi-threaded program may face exactly this issue if it does not join all threads before termination and one of these threads executes in the context of a shared library. Actually, "pthread_exit" called in the main thread seems to address this problem, though I am not sure that any result other than "0" may be passed back to the parent process ... ->Jörg: This may serve the needs you actually described in the other issue under the terms of a "process termination synchronization API". To make a long story short, I am still working on this ... ;-)
I created a short program actually showing the race: http://wiki.services.openoffice.org/wiki/User:Kr/exit_race.c The race is observable on Linux and Solaris; I did not try Windows yet. The next step is to make this easily reproducible in our build scenario ...
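The following is not the linked exit_race.c, only a hypothetical sketch of the same kind of race under simplifying assumptions: a detached thread keeps running while exit() is already executing atexit handlers. All names (race_worker, race_teardown, worker_runs_during_teardown) are made up for illustration, and the demo runs in a forked child so the caller survives. Build with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_int g_torn_down;  /* set once exit() starts running handlers */
static atomic_int g_seen_after; /* set by the worker if it ran after that */
static int g_report_fd = -1;

static void nap_1ms(void)
{
    struct timespec ts = {0, 1000000L};
    nanosleep(&ts, NULL);
}

/* Detached background thread that is never joined (like a GC thread). */
static void *race_worker(void *arg)
{
    (void)arg;
    for (;;) {
        if (atomic_load(&g_torn_down))
            atomic_store(&g_seen_after, 1); /* touching "destroyed" state */
        nap_1ms();
    }
    return NULL;
}

/* Stands in for a static destructor: marks shared state as destroyed,
   then waits up to about 1s to see whether the worker is still running. */
static void race_teardown(void)
{
    atomic_store(&g_torn_down, 1);
    for (int i = 0; i < 1000 && !atomic_load(&g_seen_after); ++i)
        nap_1ms();
    char mark = atomic_load(&g_seen_after) ? '1' : '0';
    assert(g_report_fd >= 0);
    if (write(g_report_fd, &mark, 1) != 1) {
        /* nothing useful to do during process exit */
    }
}

/* Returns 1 if the worker was observed running during teardown,
   0 if not, -1 on error. */
int worker_runs_during_teardown(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        pthread_t t;
        close(fds[0]);
        g_report_fd = fds[1];
        if (atexit(race_teardown) != 0)
            _exit(2);
        if (pthread_create(&t, NULL, race_worker, NULL) != 0)
            _exit(2);
        pthread_detach(t);
        exit(0); /* handlers run while the worker is still alive */
    }
    close(fds[1]);
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (n == 1 && buf == '1') ? 1 : 0;
}
```

In this sketch the handler deliberately waits for the worker, so the overlap shows up on every run; in the real failure the window is narrow, which would be consistent with the crash appearing only in some builds.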
Created attachment 47839 [details] Patch to make dl_fini wait 10s after destructing the vtablefactory ...
Created attachment 47840 [details] Patch to let the pyuno gc threads wait 5s before actually trying to release an object
I just added two patches for bridges and pyuno, delaying the pyuno GC for 5s and the dl_fini after destructing the vtables for 10s. Applying these two patches leads to reliable crashes in regcomp when trying to register a pyuno component ... any fix for pyuno must survive these patches ;-)
#desc20: "[...] and one of these threads executes in the context of a shared library." To what extent are shared libraries relevant here? Also, this issue might be considered a duplicate of issue 63473.
->SB: Good question. You probably have seen my little proof regarding the race between any threads and "exit", happening without any shared libraries (except the standard ones) being involved. Actually, the proof is only about data and not about text (code), which is somewhat harder to check. Assuming that the mapping of an executable's image is not handled differently from the image of any shared library, there is _no_ dependency on using shared libraries at all. So, you are right, my previous comment was wrong in both respects: neither would using "pthread_exit" in main help, nor does there seem to be a special relationship to shared libraries ;-)
->SB: This issue is a duplicate of http://udk.openoffice.org/issues/show_bug.cgi?id=63473 in the sense that it wouldn't happen if pyuno were terminating its threads correctly. Despite that, I would expect 63473 to be more of a documentation / best practice issue ...
I recently posted a proposal regarding thread lifecycle (http://wiki.services.openoffice.org/wiki/User:Kr/A_Thread%27s_Life) on dev@OOo (http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=20781). Following this proposal, the pyuno bridge must join the GC threads during de-initialization at the latest. Going to make a patch for this ...
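As an illustration of "join during de-initialization", here is a minimal sketch using plain pthreads. It is not the pyuno patch: gc_worker, gc_worker_start and gc_worker_shutdown are hypothetical names, and the real bridge uses its own thread classes. Build with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for a background GC thread and its state. */
typedef struct {
    pthread_t  thread;
    atomic_int stop;   /* set to 1 to ask the thread to finish */
    atomic_int passes; /* number of work passes completed */
} gc_worker;

static void *gc_worker_run(void *arg)
{
    gc_worker *w = arg;
    while (!atomic_load(&w->stop)) {
        atomic_fetch_add(&w->passes, 1); /* placeholder for releasing objects */
        struct timespec ts = {0, 1000000L}; /* 1 ms */
        nanosleep(&ts, NULL);
    }
    return NULL;
}

/* Starts the worker; returns 0 on success (pthread_create's result). */
int gc_worker_start(gc_worker *w)
{
    atomic_init(&w->stop, 0);
    atomic_init(&w->passes, 0);
    return pthread_create(&w->thread, NULL, gc_worker_run, w);
}

/* De-initialization: request stop and wait until the thread has finished.
   After this returns, the thread can no longer touch any shared state.
   Returns the number of passes the worker completed. */
int gc_worker_shutdown(gc_worker *w)
{
    atomic_store(&w->stop, 1);
    int rc = pthread_join(w->thread, NULL);
    assert(rc == 0);
    (void)rc;
    return atomic_load(&w->passes);
}
```

The point of the design is ordering: the shutdown call has to happen before the process enters exit(), since, as noted earlier in this issue, synchronising from inside a static destructor is already too late.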
kr: do you have a patch?
Hi Pavel, a workaround (though ugly) for this is to use "_exit" in regcomp.bin. A clean solution (as well as a fix) is on the way, see my proposal regarding daemon thread termination at http://wiki.services.openoffice.org/wiki/User:Kr/A_Thread%27s_Life and my mail on dev@OOo (http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=20781). Unfortunately there are still minor issues in the example implementation wrt the Windows loader lock, which I will hopefully be able to tackle this week. Sorry for being slow, Kay
kr: sorry to bring this up again, but now I have two machines building x86_64. On the old one, I have 100% probability that regcomp.bin crashes. On the new one, I have 30% probability that it crashes. By patching regcomp.bin you mean simply adding _exit as the last command in SAL_IMPLEMENT_MAIN_WITH_ARGS(argc, argv) in cpputools/source/registercomponent/registercomponent.cxx?
->Pavel: Yes, exactly.
kr: OK, I'm now using http://ftp.linux.cz/pub/localization/OpenOffice.org/devel/build/Patches/SRC680/i80300-workaround-crashing-regcomp.diff and the first build:

  regcomp.bin[3631] general protection rip:2b7514dbef46 rsp:401fff80 error:0
->Pavel, if I understand correctly, this means it still dies abnormally, right? What if you move the "_exit" before the "if ( xComponent.is() )" ?
@pjanik: can u provide a callstack from a core ?
kr: I tried this patch: http://ftp.linux.cz/pub/localization/OpenOffice.org/devel/build/Patches/SRC680/i80300-workaround-crashing-regcomp-1.diff and regcomp.bin still segfaults randomly when building *many* languages.
JFYI: I haven't seen this bug for a long time now.
Pavel, I suggest to close this for the moment ... if it re-happens, we may want to open it again.
Yes.
worksforme -> closed