Issue 70166 - prune textenc bloat
Summary: prune textenc bloat
Status: CONFIRMED
Alias: None
Product: porting
Classification: Code
Component: code (show other issues)
Version: 680m185
Hardware: All Linux, all
: P3 Trivial (vote)
Target Milestone: 4.x
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-06 20:59 UTC by mmeeks
Modified: 2013-07-29 17:55 UTC (History)
9 users (show)

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
pretty numbers (13.48 KB, application/vnd.oasis.opendocument.spreadsheet)
2006-10-06 21:04 UTC, mmeeks
no flags Details
patch (22.91 KB, patch)
2006-10-06 23:25 UTC, mmeeks
no flags Details | Diff
Updated patch, now builds and works on Win32, too (23.23 KB, patch)
2006-10-10 18:11 UTC, tml
no flags Details | Diff
Patch for http://svn.gnome.org/viewcvs/*checkout*/ooo-build/trunk/patches/src680/size-sal-textenc.diff (1.47 KB, patch)
2007-01-09 16:27 UTC, Stephan Bergmann
no flags Details | Diff
The newest version. (27.88 KB, patch)
2007-07-27 10:40 UTC, kendy
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this issue.
Description mmeeks 2006-10-06 20:59:49 UTC
So cloned this part of the discussion from i#60696,

> A small test on Windows revealed that after removing all textenc functions
> from the export and the textenc.lib from the makefile map the following
> symbols show up as unresolved:

Ah - so tooling is everything ;-) if you poke at
http://go-oo.org/ooo-build/bin/oosize - there is a (Linux only) tool that shows
object size breakdown by source directory: eg. you can look at 'svx' and say:
wow that is a big pile of code, then break it down by source directory and
source file to find the offendors (some of which are quite suprising); anyhow it
shows:

michael@t60p:/opt/OpenOffice/ood680-m4/sal/textenc>
/opt/OpenOffice/HEAD/bin/oosize --ls
Found module '/data/OpenOffice/ood680-m4/sal'
Size breakdown
context.c       664
tenchelp.c      1036
unichars.c      1064
textcvt.c       1128
converter.c     1144
convertsinglebytetobmpunicode.cxx       1244
converteuctw.c  2140
convertgb18030.c        2176
tcvtutf8.c      2176
convertiso2022kr.c      2316
convertiso2022jp.c      2340
convertbig5hkscs.c      2420
tcvtutf7.c      2428
tcvtmb.c        2728
convertiso2022cn.c      3472
tcvtbyte.c      3672
tencinfo.c      9108
textenc.cxx     1490820

so - as you see, 1 file is a huge proportion of the problem here; luckily that
file exports only 1 symbol ;-)

ImplTextEncodingData const *
Impl_getTextEncodingData(rtl_TextEncoding nEncoding)

By commenting out the contents of that I get:

1797832 ../../unxlngi6.pro.before/lib/libuno_sal.so
 273544  libuno_sal.so # after

So - AFAICS this is really the obvious place to cut to loose 85% of the code at
a single stroke ;-) [ of course, we'd prolly need to re-instate / in-line at
least some of the common ASCII / common Win32 code-page / UTF8 stuff ] but it
should be fairly trivial to save a nice whack of code there in the common case.
I'm even interested enough to have a go myself now ;-)
Comment 1 mmeeks 2006-10-06 21:03:26 UTC
So - the punch line (from several runs of different components & text entry) is
that on Unix / UTF-8 we need the following encodings:

ASCII_US, MS_1252, UTF8, ISO_8859_1, IBM_863, IBM_860, IBM_850, IBM_857, IBM_861

What -amazes- me is the number of esoteric IBM encodings required to just start
OO.o, how did we manage that for heaven's sake ?

Matthias - any chance of a dump from Impl_getTextEncodingData on your system to
see what text encodings are commonly required on Win32 ? [ or do you think this
should be a unix-only optimisation ? ].
Comment 2 mmeeks 2006-10-06 21:04:10 UTC
Created attachment 39618 [details]
pretty numbers
Comment 3 mmeeks 2006-10-06 21:58:17 UTC
So - another curious issue - digging into why we get the strange IBM
textencodings I got stumped, an (accurate) trace for one is:

#0  Impl_getTextEncodingData (nEncoding=6) at
/data/OpenOffice/ood680-m4/sal/textenc/textenc.cxx:244
#1  0xb744a749 in rtl_createTextToUnicodeConverter () from ./libuno_sal.so.3
#2  0xb7439d27 in rtl_string2UString () from ./libuno_sal.so.3
#3  0xb77c84f4 in String (this=0x8679b28, pByteStr=0xb337b740 "symbol",
eTextEncoding=6, nCvtFlags=819) at ./strucvt.cxx:101
#4  0xb324336a in SwViewOption (this=0x8679b28) at
/data/OpenOffice/ood680-m4/sw/source/ui/config/viewopt.cxx:448

The code is:

SwViewOption::SwViewOption() :
	sSymbolFont( RTL_CONSTASCII_STRINGPARAM( "symbol" ) ),
	nZoom( 100 ),

where sSymbolFont is a 'String':

UniString( const sal_Char* pByteStr, xub_StrLen nLen,
	   rtl_TextEncoding eTextEncoding,
	   sal_uInt32 nCvtFlags = BYTESTRING_TO_UNISTRING_CVTFLAGS );

So - it looks like we should get an error (right?) ;-> some more head
scratching, no warning on compilation, poking at the assembler:

	addl	$_GLOBAL_OFFSET_TABLE_, %ebx
	pushl	$819
	pushl	$6
	leal	20(%esi), %edi
	leal	.LC0@GOTOFF(%ebx), %eax
	pushl	%eax
	pushl	%esi
	call	String::String(char const*, unsigned short, unsigned long)@PLT

We do indeed pass 6 - and we call the method with that signature;

Then you re-read the UniString constructor signatures and slap the forehead;
we're calling:

UniString( const sal_Char* pByteStr,
	   rtl_TextEncoding eTextEncoding,
	   sal_uInt32 nCvtFlags = BYTESTRING_TO_UNISTRING_CVTFLAGS );

instead - but the 'eTextEncoding' is swallowing the length ;-) - doh [!]

Now - admittedly this line comes from a patch of ours ;-) but there are plenty
of other silly examples where this is not so:

SdrLayerAdmin::SdrLayerAdmin(SdrLayerAdmin* pNewParent):
	aLayer(1024,16,16),
	aLSets(1024,16,16),
	pModel(NULL)
{
	sal_Char aTextControls[] = "Controls";
	aControlLayerName = String(aTextControls, sizeof(aTextControls-1));
	pParent=pNewParent;
}

void SdrFormatter::TakeUnitStr(MapUnit eUnit, XubString& rStr)
{
	switch(eUnit)
	{
		// Metrisch
		case MAP_100TH_MM   :
		{
			sal_Char aText[] = "/100mm";
			rStr = UniString(aText, sizeof(aText-1));
...

etc. etc.

So - on 1 hand good news, we don't -really- use that many text encodings ;-)
On the other hand: bad news, there are most likely a large number of cockups in
the code here [ isn't C++ polymorphism sexy ;-]

I'd like to break the UniString constructor set to remove the ambiguity; the
same problem exists for ByteString [FWIW], where (perhaps) it causes more
damage. I -suspect- that removing the plain:

ByteString( const sal_Unicode* pUniStr,
	    rtl_TextEncoding eTextEncoding,
	    sal_uInt32 nCvtFlags = UNISTRING_TO_BYTESTRING_CVTFLAGS );

constructor without the length is what we want to do (?)
Comment 4 Mathias_Bauer 2006-10-06 22:08:10 UTC
Michael, I think we can address two things here.
First we can make sure that any table will be loaded only when one of the 23
symbols in textenc is used at all. The simplest approach would be making the
functions a small wrapper that loads the textenc.lib on demand and passes the
call along. This will enable us to link against sal statically without loading
the famous tables.

Then we can try to optimize the table usage in a second step or try to avoid
superfluous textenc codes in the startup.

What do you think?
Comment 5 Mathias_Bauer 2006-10-06 22:08:15 UTC
Michael, I think we can address two things here.
First we can make sure that any table will be loaded only when one of the 23
symbols in textenc is used at all. The simplest approach would be making the
functions a small wrapper that loads the textenc.lib on demand and passes the
call along. This will enable us to link against sal statically without loading
the famous tables.

Then we can try to optimize the table usage in a second step or try to avoid
superfluous textenc codes in the startup.

What do you think?
Comment 6 mmeeks 2006-10-06 22:20:47 UTC
> Michael, I think we can address two things here.

   :-) sure, the optimisation, and then fixing the code. I can hack on the
optimisation easily enough - what concerns me is finding a good solution for the
evil polymorphism problem. It seems to me that at root the problem is:

typedef sal_uInt16 rtl_TextEncoding;

   Which is not a -real- type, the compiler will happily coerce 'strlen("foo")'
to type rtl_TextEncoding without batting an eye-lid. I'm not enough of a C++
guru to know what combination of wrapper classes, 'explicit' keywords [etc.]
will let us a) find all instances of bad behavior here, and b) stop them from
happening in the future. [ I'm personally inclined to bin all the *String
polymorphic constructors but ... ;-]. Unfortunately - it seems that the method
with just the string & textenc is (correctly) used in several places (even
inside tools/ eg.) but also used incorrectly a -lot- on startup we have:

encoding  count
IBM_863	140
IBM_860	64
IBM_850	26
IBM_857	6
IBM_861	2

from a good few sites:

  #4  0xb293dba3 in SdrLayerAdmin::SdrLayerAdmin () from ./libsvx680li.so
  #4  0xb2940344 in SdrModel::TakeUnitStr () from ./libsvx680li.so
  #4  0xb2819535 in CreateGraphicObjectFromURL () from ./libsvx680li.so

etc. [ and presumably myriad as yet undetected versions too ].

The 2nd problem is easy enough - doing the dynamic loading of that data, I'll
knock that off in the next 20 mins before bed.

> This will enable us to link against sal statically without loading
> the famous tables.

   quite :-) but what I'm -really- concerned about is the code cleanliness /
error-prone-ness that I see here, it scares me to think that a load of strings
may be being strangely mangled at some computational expense for no good reason :-)
Comment 7 mmeeks 2006-10-06 23:25:35 UTC
Created attachment 39620 [details]
patch
Comment 8 mmeeks 2006-10-06 23:29:40 UTC
simple patch, under-tested :-) needs someone who can grok the effort to make
double-check-locking to do that good stuff, and testing on Win32 too I guess.
Oh - and I forgot the vital 'scp' piece ... [ bother ], too late now anyway.
Comment 9 mmeeks 2006-10-07 18:55:24 UTC
updated patch with map file, scp2 patch etc. at:
http://go-oo.org/ooo-build/patches/src680/size-sal-textenc.diff
Comment 10 tml 2006-10-10 18:11:24 UTC
Created attachment 39676 [details]
Updated patch, now builds and works on Win32, too
Comment 11 Mathias_Bauer 2006-10-16 11:59:28 UTC
Thanks for the patch. Stephan Bergmann will review the code when he is back from
vacation (should happen this week).
I've set the target to "2.x" but depending on Stephan's opinion we can change this.
Comment 12 Stephan Bergmann 2006-10-19 15:04:05 UTC
1  If I understand correctly, the original motivation for this issue (comming
from issue 60696) was to reduce size of sal so that some other lib can
statically link against it and not grow too large.  A nice side benefit of only
demand-loading most part of sal/textenc might be that loading sal itself is
faster then, as many of the tables in sal/textenc contain pointers that need
relocation upon library loading.  (I know this problem for a long time, but
never got around to do anything about it.)

2  I think this issue should be split in two:  Keep this issue for splitting off
the new demand-loaded library.  Create a new issue for the tools string
constructor problems.  Michael, can you please create that new issue?  In my
opinion, it would be best to make this issue depend on the new issue then (and
not fix this issue before the new issue is fixed, i.e., drop the
TOOLS_STRING_MISSUSE from the patch), but if anybody needs this issue solved
quickly, then I am also fine with any other strategy.

3  I am not sure there is a reasonable static list of text encodings that are
needed early on (and are thus special-cased in the patched
Impl_getTextEncodingData), as the list presumably depends on the user's locale
(though I did not check that out yet).  (However, the exact list is irrelevant
when the goal is to reduce the size of the sal library; it is only important
when the goal is to reduce startup time by postponing loading of the sal/textenc
tables.)
Comment 13 mmeeks 2006-10-23 17:41:44 UTC
Hi Stephan:

> 1  If I understand correctly, the original motivation ...

    Yep - should be good huh ;-)

> 2  I think this issue should be split in two:  Keep this issue for
> splitting off the new demand-loaded library.

    Sure - the whole bug-spawns-bug effect here is quite acute though. I'm not
convinced we'll ever get them all fixed :-)

> 3  I am not sure there is a reasonable static list of text encodings
> that are needed early on (and are thus special-cased in the patched
> Impl_getTextEncodingData), as the list presumably depends on the
> user's locale

    Empirically that list works well for me. I can believe on Win32 that you
have a random / custom encoding per language. However on Unix/Linux on a
correctly configured system -everything- should be in UTF-8, so we can feasibly
avoid loading any of that other library until it's really needed.

    So - hopefully, this is not -so- controversial ? I've split off the other 
issue (which is more so), can we get this in as-is ? and rip out the #ifdef
FIXED_WORLD later ? :-)
Comment 14 Stephan Bergmann 2006-10-24 08:38:40 UTC
@mmeeks:

(1)  "on Unix/Linux on a correctly configured system -everything- should be in
UTF-8"  Not sure about East Asian conventions, but I can imagine they prefer
some specific CJK encodings over UTF-8.  Anyway, we agree that the point is
valid at least for Windows.

(2)  "can we get this in as-is?"  Do you mean: into OOo 2.1?  I do not see a
chance:  I just hurried to get my last CWS for OOo 2.1 in (with a bunch of other
patches) before last CWS integration on November 2; there simply is no time to
include this as well.  Plus, as far as I understand, this issue is a
prerequisite for issue 60696, but that issue is only targeted OOo 2.x, so why
hurry?  Plus, I am still not sure (1) above is addressed adequately (if we want
to shrink size of sal, no need to even special-case UTF-8 etc.; if we want to
avoid loading tables early, have to make sure special cases are adequate; which
of the two *do* we want to achieve with this issue, anyway?).
Comment 15 Stephan Bergmann 2006-10-24 08:42:33 UTC
mmeeks "I've split off the other issue":  Appears to be issue 70752.
Comment 16 mmeeks 2006-10-24 12:16:29 UTC
Hi Stephan:

> (1)  "on Unix/Linux on a correctly configured system -everything- should
> be in UTF-8"  Not sure about East Asian conventions

    I am sure - and on Linux (and Unix) they -should- be using UTF-8, elsewhere
lies total screaming madness.

> Anyway, we agree that the point is valid at least for Windows.

     Sure, but we at least do no more harm there.

> (2)  "can we get this in as-is?"  Do you mean: into OOo 2.1?

     Nope, I mean into a suitable CWS, and actually integrated into HEAD ;-)

> so why hurry ?

     We have your attention for now, and doing other work depends on knowing
that this will be included.

>  Plus, I am still not sure (1) above is addressed adequately (if we want
> to shrink size of sal, no need to even special-case UTF-8 etc.; if we
> want to avoid loading tables early, have to make sure special cases
> are adequate; which of the two *do* we want to achieve with this
> issue, anyway?).

     We want to shrink the size of sal, and also (in passing) get a nice win on
Unix. On Unix, as you see - the special cases should be adequate; I measured
each component on startup (cf. the textenc.ods spreadsheet) and also the
'special' cases are rather small in comparison with what we split out - this
seems like hair splitting.

     My concern is that I -really- don't want to get another issue jammed in the
"but we could do something nicer for Win32, so I reject this improvement for
Unx" ;-) state [ as the bug we split this from was from Aug 4th -> Oct 6th ].
Perhaps you can reassure me :-)
Comment 17 mmeeks 2006-10-24 12:19:42 UTC
oh; and as an extra cleanup by adjusting the makefile.mk to move the following
into the 'textenc' library, we save another 4k or so [ but it's a little cleaner ;-]

    $(SLO)$/convertiso2022kr.obj \
    $(SLO)$/convertsinglebytetobmpunicode.obj \
    $(SLO)$/convertbig5hkscs.obj \
    $(SLO)$/converteuctw.obj \ 
    $(SLO)$/convertgb18030.obj \ 
    $(SLO)$/convertiso2022cn.obj \
    $(SLO)$/convertiso2022jp.obj \
    $(SLO)$/tcvtutf7.obj \
    $(SLO)$/tcvtmb.obj \

do you want me to create a CWS for this to commit to ? hopefully it could be
aggregated with something else to save pain ? :-)
Comment 18 Stephan Bergmann 2006-10-24 16:09:23 UTC
@mmeeks:  Your arguments convinced me.  :)  Just go ahead and integrate the
patch as-is into a CWS of yours.
Comment 19 kendy 2007-01-08 14:45:00 UTC
sb: So, the patch has changed a bit, the newest version is at 
http://svn.gnome.org/viewcvs/*checkout*/ooo-build/trunk/patches/src680/size-sal-textenc.diff .

And I need your advice ;-)  There was a problem, because osl_loadModule() used 
UnicodeToText() -> rtl_createUnicodeToTextConverter() -> 
Impl_getTextEncodingData() -> osl_loadModule() again.  To solve it, I used a 
hack that can be seen in other sal files as well.  I declared:

extern "C" oslModule SAL_CALL osl_psz_loadModule(const sal_Char *pszModuleName, 
sal_Int32 nRtldMode);

and used it instead of osl_loadModule().  Of course this does not work for 
Win32, so this is enclosed in #ifdef UNX, and the normal osl_loadModule() is 
used in the Win32 case.  Could you please have a look & test it for Win32?  I 
have no idea if it works there or not :-(

Thank you in advance!
Comment 21 Stephan Bergmann 2007-01-09 16:33:32 UTC
1  With my attached ooo-build.patch your patch compiles and smoketests on
SRC680m199 wntmsci10.pro.

2  pTables = (TextEncodingFunction)osl_psz_getSymbol(aModule, pSymbol); will not
compile on -enable-werror Solaris.  You need to make available a
osl_psz_getFunctionSymbol variant of osl_getFunctionSymbol instead.

3  osl_psz_loadModule and osl_psz_getFunctionSymbol should really be declared in
a (module local) header.

4  This starts to get somewhat ugly; Impl_getTextEncodingData needs to make sure
it does not call itself recursively, and does so building on different, platform
specific assumptions.  Very brittle.
Comment 22 kendy 2007-06-01 17:58:01 UTC
1) Thank you! :-)

So - would the patch be acceptable for you if I incorporated 2) and 3)?  Do you 
have idea what to do better about 4), please?
Comment 23 nospam4obr 2007-06-04 06:59:53 UTC
I believe I already solved 2) in (not yet integrated) issue 76025.
Comment 24 Stephan Bergmann 2007-06-06 12:45:38 UTC
@kendy:  I do not have an idea for (4) at the moment.  I have mixed feelings
about integrating with (4) open.  On the one hand, I would like to avoid
introducing such IMO brittle things.  On the other hand, if you do want this to
be integrated and nobody comes up with a decent solution for (4): so be it.
Comment 25 mmeeks 2007-07-09 13:57:36 UTC
we should get this integrated I guess ...
Comment 26 kendy 2007-07-27 10:39:46 UTC
sb: I have a new version of the patch, could you please check it?  It 
introduces osl_loadModuleAscii() for bypassing the conversion that is dangerous 
(and not working) in this case.  On unx, it is just a renamed 
osl_psz_loadModule, on Win32 I had to do a new function for that.  
Unfortunately - I have no Win32 machine now [and not enough Win32 knowledge 
either ;-)] - could you please have a look & fix it if not correct?

If everything's OK, I can commit it to CWS kendy18.  Thank you in advance!
Comment 27 kendy 2007-07-27 10:40:39 UTC
Created attachment 47133 [details]
The newest version.
Comment 28 kendy 2007-07-27 10:44:48 UTC
[I forgot - I also made osl_loadModuleAscii() public, because 
osl_getAsciiFunctionSymbol() is now public as well.]
Comment 29 Stephan Bergmann 2007-08-08 12:51:16 UTC
- sal/inc/osl/module.h:

-- osl_loadModuleAscii does not only differ from osl_loadModule in that it
receives an ASCII string instead of a Unicode one, but also in that it receives
a system path instead of a URL.  That needs to be made more explicit (function
name, parameter name, documentation).

-- An appropriate @since tag is missing from the documentation of
osl_loadModuleAscii.

- sal/osl/w32/module.c:

-- hro, please comment on the Windows-version of osl_loadModuleAscii.

-- (void) nRtldMode; /* avoid warnings */ has to be moved after the local
variable definitions to be valid C89.

- sal/textenc/makefile.mk:

-- The duplication of SAL_COMMON_OBJFILES in sal and sal_textenc is bad.

-- For the URE-internal sal_textenc lib no UDK_MAJOR is needed (e.g., just
sal_textenc.so on Unix/ELF and sal_textenc.dll on Windows), so the
CDEFS+=-DPLUGIN_NAME stuff is not necessary and the STRING(PLUGIN_NAME) stuff in
textenc.cxx can be simplified.

- sal/textenc/tables.cxx:

-- Is a verbatim copy of sal/textenc/textenc.cxx:1.6, right?  Wouldn't it be
better to keep this file named textenc.cxx as part of sal_textenc and instead
create a new file for sal?

-- The data for those textencodings that are handled directly by sal is now
doubled in sal and sal_textenc.

- sal/textenc/textenc.cxx:

-- aImpl8090SameToUniTab etc. is now doubled in textenc.cxx and tables.cxx.

-- Tabs instead of spaces.

-- Use rtl/instance.hxx to initialize pTables.

-- On CWS sb71 I found out that osl_loadModule with a plain filename is a bad
thing, so introduced osl_loadModuleRelative (see
<http://wiki.services.openoffice.org/wiki/ODF_Toolkit/Efforts/OOo_without_URE>).
 That means that this use of osl_loadModuleAscii with a plain filename would
also need to be changed.

-- No C-style casts in C++ code, please.

-- The fprintf error handling is not acceptable.  (std::abort instead?)

- sal/util/sal.map:

-- osl_loadModuleAscii would go into a OOo 2.4 section.

- sal/util/saltextenc.map:

-- It is somewhat unfortunate that a symbol starting "Impl_" is exported, as
that is traditionally used to denote non-exported symbols.

- The new URE-internal lib sal_textenc would also have to be added to
ure/source/README and scp2/source/ure/ure.scp.
Comment 30 Stephan Bergmann 2007-08-09 08:34:58 UTC
@hro:  "please comment on the Windows-version of osl_loadModuleAscii."
Comment 31 hennes.rohling 2007-08-09 17:10:49 UTC
@keyndy: Patch for windows looks good (just a copy of osl_loadModule without the
converting stuff).

But you should use the explicit Ansi API instead of using the macros. So use
LoadLibraryA instead of LoadLibrary and LoadLibraryExA instead of LoadLibraryEx.
Comment 32 Martin Hollmichel 2008-01-25 16:58:41 UTC
set target 3.0
Comment 33 Martin Hollmichel 2008-08-29 13:34:47 UTC
ping?
Comment 34 Rob Weir 2013-03-11 14:59:40 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list:  dev@openoffice.apache.org

Thanks!

-Rob