Apache OpenOffice (AOO) Bugzilla – Issue 70166
prune textenc bloat
Last modified: 2013-07-29 17:55:10 UTC
So cloned this part of the discussion from i#60696, > A small test on Windows revealed that after removing all textenc functions > from the export and the textenc.lib from the makefile map the following > symbols show up as unresolved: Ah - so tooling is everything ;-) if you poke at http://go-oo.org/ooo-build/bin/oosize - there is a (Linux only) tool that shows object size breakdown by source directory: eg. you can look at 'svx' and say: wow that is a big pile of code, then break it down by source directory and source file to find the offendors (some of which are quite suprising); anyhow it shows: michael@t60p:/opt/OpenOffice/ood680-m4/sal/textenc> /opt/OpenOffice/HEAD/bin/oosize --ls Found module '/data/OpenOffice/ood680-m4/sal' Size breakdown context.c 664 tenchelp.c 1036 unichars.c 1064 textcvt.c 1128 converter.c 1144 convertsinglebytetobmpunicode.cxx 1244 converteuctw.c 2140 convertgb18030.c 2176 tcvtutf8.c 2176 convertiso2022kr.c 2316 convertiso2022jp.c 2340 convertbig5hkscs.c 2420 tcvtutf7.c 2428 tcvtmb.c 2728 convertiso2022cn.c 3472 tcvtbyte.c 3672 tencinfo.c 9108 textenc.cxx 1490820 so - as you see, 1 file is a huge proportion of the problem here; luckily that file exports only 1 symbol ;-) ImplTextEncodingData const * Impl_getTextEncodingData(rtl_TextEncoding nEncoding) By commenting out the contents of that I get: 1797832 ../../unxlngi6.pro.before/lib/libuno_sal.so 273544 libuno_sal.so # after So - AFAICS this is really the obvious place to cut to loose 85% of the code at a single stroke ;-) [ of course, we'd prolly need to re-instate / in-line at least some of the common ASCII / common Win32 code-page / UTF8 stuff ] but it should be fairly trivial to save a nice whack of code there in the common case. I'm even interested enough to have a go myself now ;-)
So - the punch line (from several runs of different components & text entry) is that on Unix / UTF-8 we need the following encodings: ASCII_US, MS_1252, UTF8, ISO_8859_1, IBM_863, IBM_860, IBM_850, IBM_857, IBM_861 What -amazes- me is the number of esoteric IBM encodings required to just start OO.o, how did we manage that for heaven's sake ? Matthias - any chance of a dump from Impl_getTextEncodingData on your system to see what text encodings are commonly required on Win32 ? [ or do you think this should be a unix-only optimisation ? ].
Created attachment 39618 [details] pretty numbers
So - another curious issue - digging into why we get the strange IBM textencodings I got stumped, an (accurate) trace for one is: #0 Impl_getTextEncodingData (nEncoding=6) at /data/OpenOffice/ood680-m4/sal/textenc/textenc.cxx:244 #1 0xb744a749 in rtl_createTextToUnicodeConverter () from ./libuno_sal.so.3 #2 0xb7439d27 in rtl_string2UString () from ./libuno_sal.so.3 #3 0xb77c84f4 in String (this=0x8679b28, pByteStr=0xb337b740 "symbol", eTextEncoding=6, nCvtFlags=819) at ./strucvt.cxx:101 #4 0xb324336a in SwViewOption (this=0x8679b28) at /data/OpenOffice/ood680-m4/sw/source/ui/config/viewopt.cxx:448 The code is: SwViewOption::SwViewOption() : sSymbolFont( RTL_CONSTASCII_STRINGPARAM( "symbol" ) ), nZoom( 100 ), where sSymbolFont is a 'String': UniString( const sal_Char* pByteStr, xub_StrLen nLen, rtl_TextEncoding eTextEncoding, sal_uInt32 nCvtFlags = BYTESTRING_TO_UNISTRING_CVTFLAGS ); So - it looks like we should get an error (right?) ;-> some more head scratching, no warning on compilation, poking at the assembler: addl $_GLOBAL_OFFSET_TABLE_, %ebx pushl $819 pushl $6 leal 20(%esi), %edi leal .LC0@GOTOFF(%ebx), %eax pushl %eax pushl %esi call String::String(char const*, unsigned short, unsigned long)@PLT We do indeed pass 6 - and we call the method with that signature; Then you re-read the UniString constructor signatures and slap the forehead; we're calling: UniString( const sal_Char* pByteStr, rtl_TextEncoding eTextEncoding, sal_uInt32 nCvtFlags = BYTESTRING_TO_UNISTRING_CVTFLAGS ); instead - but the 'eTextEncoding' is swallowing the length ;-) - doh [!] Now - admittedly this line comes from a patch of ours ;-) but there are plenty of other silly examples where this is not so: SdrLayerAdmin::SdrLayerAdmin(SdrLayerAdmin* pNewParent): aLayer(1024,16,16), aLSets(1024,16,16), pModel(NULL) { sal_Char aTextControls[] = "Controls"; aControlLayerName = String(aTextControls, sizeof(aTextControls-1)); pParent=pNewParent; } void SdrFormatter::TakeUnitStr(MapUnit eUnit, XubString& rStr) { switch(eUnit) { // Metrisch case MAP_100TH_MM : { sal_Char aText[] = "/100mm"; rStr = UniString(aText, sizeof(aText-1)); ... etc. etc. So - on 1 hand good news, we don't -really- use that many text encodings ;-) On the other hand: bad news, there are most likely a large number of cockups in the code here [ isn't C++ polymorphism sexy ;-] I'd like to break the UniString constructor set to remove the ambiguity; the same problem exists for ByteString [FWIW], where (perhaps) it causes more damage. I -suspect- that removing the plain: ByteString( const sal_Unicode* pUniStr, rtl_TextEncoding eTextEncoding, sal_uInt32 nCvtFlags = UNISTRING_TO_BYTESTRING_CVTFLAGS ); constructor without the length is what we want to do (?)
Michael, I think we can address two things here. First we can make sure that any table will be loaded only when one of the 23 symbols in textenc is used at all. The simplest approach would be making the functions a small wrapper that loads the textenc.lib on demand and passes the call along. This will enable us to link against sal statically without loading the famous tables. Then we can try to optimize the table usage in a second step or try to avoid superfluous textenc codes in the startup. What do you think?
> Michael, I think we can address two things here. :-) sure, the optimisation, and then fixing the code. I can hack on the optimisation easily enough - what concerns me is finding a good solution for the evil polymorphism problem. It seems to me that at root the problem is: typedef sal_uInt16 rtl_TextEncoding; Which is not a -real- type, the compiler will happily coerce 'strlen("foo")' to type rtl_TextEncoding without batting an eye-lid. I'm not enough of a C++ guru to know what combination of wrapper classes, 'explicit' keywords [etc.] will let us a) find all instances of bad behavior here, and b) stop them from happening in the future. [ I'm personally inclined to bin all the *String polymorphic constructors but ... ;-]. Unfortunately - it seems that the method with just the string & textenc is (correctly) used in several places (even inside tools/ eg.) but also used incorrectly a -lot- on startup we have: encoding count IBM_863 140 IBM_860 64 IBM_850 26 IBM_857 6 IBM_861 2 from a good few sites: #4 0xb293dba3 in SdrLayerAdmin::SdrLayerAdmin () from ./libsvx680li.so #4 0xb2940344 in SdrModel::TakeUnitStr () from ./libsvx680li.so #4 0xb2819535 in CreateGraphicObjectFromURL () from ./libsvx680li.so etc. [ and presumably myriad as yet undetected versions too ]. The 2nd problem is easy enough - doing the dynamic loading of that data, I'll knock that off in the next 20 mins before bed. > This will enable us to link against sal statically without loading > the famous tables. quite :-) but what I'm -really- concerned about is the code cleanliness / error-prone-ness that I see here, it scares me to think that a load of strings may be being strangely mangled at some computational expense for no good reason :-)
Created attachment 39620 [details] patch
simple patch, under-tested :-) needs someone who can grok the effort to make double-check-locking to do that good stuff, and testing on Win32 too I guess. Oh - and I forgot the vital 'scp' piece ... [ bother ], too late now anyway.
updated patch with map file, scp2 patch etc. at: http://go-oo.org/ooo-build/patches/src680/size-sal-textenc.diff
Created attachment 39676 [details] Updated patch, now builds and works on Win32, too
Thanks for the patch. Stephan Bergmann will review the code when he is back from vacation (should happen this week). I've set the target to "2.x" but depending on Stephan's opinion we can change this.
1 If I understand correctly, the original motivation for this issue (comming from issue 60696) was to reduce size of sal so that some other lib can statically link against it and not grow too large. A nice side benefit of only demand-loading most part of sal/textenc might be that loading sal itself is faster then, as many of the tables in sal/textenc contain pointers that need relocation upon library loading. (I know this problem for a long time, but never got around to do anything about it.) 2 I think this issue should be split in two: Keep this issue for splitting off the new demand-loaded library. Create a new issue for the tools string constructor problems. Michael, can you please create that new issue? In my opinion, it would be best to make this issue depend on the new issue then (and not fix this issue before the new issue is fixed, i.e., drop the TOOLS_STRING_MISSUSE from the patch), but if anybody needs this issue solved quickly, then I am also fine with any other strategy. 3 I am not sure there is a reasonable static list of text encodings that are needed early on (and are thus special-cased in the patched Impl_getTextEncodingData), as the list presumably depends on the user's locale (though I did not check that out yet). (However, the exact list is irrelevant when the goal is to reduce the size of the sal library; it is only important when the goal is to reduce startup time by postponing loading of the sal/textenc tables.)
Hi Stephan: > 1 If I understand correctly, the original motivation ... Yep - should be good huh ;-) > 2 I think this issue should be split in two: Keep this issue for > splitting off the new demand-loaded library. Sure - the whole bug-spawns-bug effect here is quite acute though. I'm not convinced we'll ever get them all fixed :-) > 3 I am not sure there is a reasonable static list of text encodings > that are needed early on (and are thus special-cased in the patched > Impl_getTextEncodingData), as the list presumably depends on the > user's locale Empirically that list works well for me. I can believe on Win32 that you have a random / custom encoding per language. However on Unix/Linux on a correctly configured system -everything- should be in UTF-8, so we can feasibly avoid loading any of that other library until it's really needed. So - hopefully, this is not -so- controversial ? I've split off the other issue (which is more so), can we get this in as-is ? and rip out the #ifdef FIXED_WORLD later ? :-)
@mmeeks: (1) "on Unix/Linux on a correctly configured system -everything- should be in UTF-8" Not sure about East Asian conventions, but I can imagine they prefer some specific CJK encodings over UTF-8. Anyway, we agree that the point is valid at least for Windows. (2) "can we get this in as-is?" Do you mean: into OOo 2.1? I do not see a chance: I just hurried to get my last CWS for OOo 2.1 in (with a bunch of other patches) before last CWS integration on November 2; there simply is no time to include this as well. Plus, as far as I understand, this issue is a prerequisite for issue 60696, but that issue is only targeted OOo 2.x, so why hurry? Plus, I am still not sure (1) above is addressed adequately (if we want to shrink size of sal, no need to even special-case UTF-8 etc.; if we want to avoid loading tables early, have to make sure special cases are adequate; which of the two *do* we want to achieve with this issue, anyway?).
mmeeks "I've split off the other issue": Appears to be issue 70752.
Hi Stephan: > (1) "on Unix/Linux on a correctly configured system -everything- should > be in UTF-8" Not sure about East Asian conventions I am sure - and on Linux (and Unix) they -should- be using UTF-8, elsewhere lies total screaming madness. > Anyway, we agree that the point is valid at least for Windows. Sure, but we at least do no more harm there. > (2) "can we get this in as-is?" Do you mean: into OOo 2.1? Nope, I mean into a suitable CWS, and actually integrated into HEAD ;-) > so why hurry ? We have your attention for now, and doing other work depends on knowing that this will be included. > Plus, I am still not sure (1) above is addressed adequately (if we want > to shrink size of sal, no need to even special-case UTF-8 etc.; if we > want to avoid loading tables early, have to make sure special cases > are adequate; which of the two *do* we want to achieve with this > issue, anyway?). We want to shrink the size of sal, and also (in passing) get a nice win on Unix. On Unix, as you see - the special cases should be adequate; I measured each component on startup (cf. the textenc.ods spreadsheet) and also the 'special' cases are rather small in comparison with what we split out - this seems like hair splitting. My concern is that I -really- don't want to get another issue jammed in the "but we could do something nicer for Win32, so I reject this improvement for Unx" ;-) state [ as the bug we split this from was from Aug 4th -> Oct 6th ]. Perhaps you can reassure me :-)
oh; and as an extra cleanup by adjusting the makefile.mk to move the following into the 'textenc' library, we save another 4k or so [ but it's a little cleaner ;-] $(SLO)$/convertiso2022kr.obj \ $(SLO)$/convertsinglebytetobmpunicode.obj \ $(SLO)$/convertbig5hkscs.obj \ $(SLO)$/converteuctw.obj \ $(SLO)$/convertgb18030.obj \ $(SLO)$/convertiso2022cn.obj \ $(SLO)$/convertiso2022jp.obj \ $(SLO)$/tcvtutf7.obj \ $(SLO)$/tcvtmb.obj \ do you want me to create a CWS for this to commit to ? hopefully it could be aggregated with something else to save pain ? :-)
@mmeeks: Your arguments convinced me. :) Just go ahead and integrate the patch as-is into a CWS of yours.
sb: So, the patch has changed a bit, the newest version is at http://svn.gnome.org/viewcvs/*checkout*/ooo-build/trunk/patches/src680/size-sal-textenc.diff . And I need your advice ;-) There was a problem, because osl_loadModule() used UnicodeToText() -> rtl_createUnicodeToTextConverter() -> Impl_getTextEncodingData() -> osl_loadModule() again. To solve it, I used a hack that can be seen in other sal files as well. I declared: extern "C" oslModule SAL_CALL osl_psz_loadModule(const sal_Char *pszModuleName, sal_Int32 nRtldMode); and used it instead of osl_loadModule(). Of course this does not work for Win32, so this is enclosed in #ifdef UNX, and the normal osl_loadModule() is used in the Win32 case. Could you please have a look & test it for Win32? I have no idea if it works there or not :-( Thank you in advance!
Created attachment 42041 [details] Patch for http://svn.gnome.org/viewcvs/*checkout*/ooo-build/trunk/patches/src680/size-sal-textenc.diff
1 With my attached ooo-build.patch your patch compiles and smoketests on SRC680m199 wntmsci10.pro. 2 pTables = (TextEncodingFunction)osl_psz_getSymbol(aModule, pSymbol); will not compile on -enable-werror Solaris. You need to make available a osl_psz_getFunctionSymbol variant of osl_getFunctionSymbol instead. 3 osl_psz_loadModule and osl_psz_getFunctionSymbol should really be declared in a (module local) header. 4 This starts to get somewhat ugly; Impl_getTextEncodingData needs to make sure it does not call itself recursively, and does so building on different, platform specific assumptions. Very brittle.
1) Thank you! :-) So - would the patch be acceptable for you if I incorporated 2) and 3)? Do you have idea what to do better about 4), please?
I believe I already solved 2) in (not yet integrated) issue 76025.
@kendy: I do not have an idea for (4) at the moment. I have mixed feelings about integrating with (4) open. On the one hand, I would like to avoid introducing such IMO brittle things. On the other hand, if you do want this to be integrated and nobody comes up with a decent solution for (4): so be it.
we should get this integrated I guess ...
sb: I have a new version of the patch, could you please check it? It introduces osl_loadModuleAscii() for bypassing the conversion that is dangerous (and not working) in this case. On unx, it is just a renamed osl_psz_loadModule, on Win32 I had to do a new function for that. Unfortunately - I have no Win32 machine now [and not enough Win32 knowledge either ;-)] - could you please have a look & fix it if not correct? If everything's OK, I can commit it to CWS kendy18. Thank you in advance!
Created attachment 47133 [details] The newest version.
[I forgot - I also made osl_loadModuleAscii() public, because osl_getAsciiFunctionSymbol() is now public as well.]
- sal/inc/osl/module.h: -- osl_loadModuleAscii does not only differ from osl_loadModule in that it receives an ASCII string instead of a Unicode one, but also in that it receives a system path instead of a URL. That needs to be made more explicit (function name, parameter name, documentation). -- An appropriate @since tag is missing from the documentation of osl_loadModuleAscii. - sal/osl/w32/module.c: -- hro, please comment on the Windows-version of osl_loadModuleAscii. -- (void) nRtldMode; /* avoid warnings */ has to be moved after the local variable definitions to be valid C89. - sal/textenc/makefile.mk: -- The duplication of SAL_COMMON_OBJFILES in sal and sal_textenc is bad. -- For the URE-internal sal_textenc lib no UDK_MAJOR is needed (e.g., just sal_textenc.so on Unix/ELF and sal_textenc.dll on Windows), so the CDEFS+=-DPLUGIN_NAME stuff is not necessary and the STRING(PLUGIN_NAME) stuff in textenc.cxx can be simplified. - sal/textenc/tables.cxx: -- Is a verbatim copy of sal/textenc/textenc.cxx:1.6, right? Wouldn't it be better to keep this file named textenc.cxx as part of sal_textenc and instead create a new file for sal? -- The data for those textencodings that are handled directly by sal is now doubled in sal and sal_textenc. - sal/textenc/textenc.cxx: -- aImpl8090SameToUniTab etc. is now doubled in textenc.cxx and tables.cxx. -- Tabs instead of spaces. -- Use rtl/instance.hxx to initialize pTables. -- On CWS sb71 I found out that osl_loadModule with a plain filename is a bad thing, so introduced osl_loadModuleRelative (see <http://wiki.services.openoffice.org/wiki/ODF_Toolkit/Efforts/OOo_without_URE>). That means that this use of osl_loadModuleAscii with a plain filename would also need to be changed. -- No C-style casts in C++ code, please. -- The fprintf error handling is not acceptable. (std::abort instead?) - sal/util/sal.map: -- osl_loadModuleAscii would go into a OOo 2.4 section. - sal/util/saltextenc.map: -- It is somewhat unfortunate that a symbol starting "Impl_" is exported, as that is traditionally used to denote non-exported symbols. - The new URE-internal lib sal_textenc would also have to be added to ure/source/README and scp2/source/ure/ure.scp.
@hro: "please comment on the Windows-version of osl_loadModuleAscii."
@keyndy: Patch for windows looks good (just a copy of osl_loadModule without the converting stuff). But you should use the explicit Ansi API instead of using the macros. So use LoadLibraryA instead of LoadLibrary and LoadLibraryExA instead of LoadLibraryEx.
set target 3.0
ping?
I'm adding this comment to all open issues with Issue Type == PATCH. We have 220 such issues, many of them quite old. I apologize for that. We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0. If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know. On the other hand, if the patch is no longer relevant, please let us know that as well. If you have any general questions or want to discuss this further, please send a note to our dev mailing list: dev@openoffice.apache.org Thanks! -Rob