Apache OpenOffice (AOO) Bugzilla – Issue 88162
Removing blank lines from a .xcu file to reduce its size
Last modified: 2009-10-09 08:12:00 UTC
Phenomenon:
.xcu files in the directory /opt/openoffice.org2.4/share/registry/res/ include many blank lines. That might affect performance.

Facts:
For instance, res/ja/org/openoffice/Office/DataAccess.xcu has 3247 lines. Only 128 of them have content; the rest are blank. The rate, 128 divided by 3247, is 3.94%.

In total, the .xcu files in the directory res/ja amount to 312221 lines and 2875301 bytes (approx. 2.7MB). If the blank lines are removed, the total could be reduced to 18846 lines and 747025 bytes (approx. 0.7MB).

The attached file lines.txt shows the statistics for OOo 2.4 Linux English plus a Japanese language pack. Each line has four columns: rate, the number of meaningful (non-blank) lines, the total number of lines, and the file name. A part of the file is shown below.

Rate     Non-blank  Total    Filename
=======  =========  =======  ===========================================
03.94%   128        3247     res/ja/org/openoffice/Office/DataAccess.xcu
05.01%   4342       86650    res/ja/org/openoffice/Office/TableWizard.xcu
...(omitted)...
100.00%  8668       8668     data/org/openoffice/Office/TableWizard.xcu
100.00%  14113      14113    data/org/openoffice/Office/Labels.xcu

The above file can be produced with the following commands:

cd /opt/openoffice.org2.4/share/registry
f=`find * -name '*.xcu'`
perl -ne '$n++ unless m/\A\s*\Z/; if (eof) { $r=$n*100.0/$.; printf "%05.2f%%\t%d\t%d\t%s\n", $r, $n, $., $ARGV; $n=0; close ARGV }' $f | sort -n -k 1 -n -k 3 > lines.txt

Proposal:
To remove the blank lines, the XSLT stylesheet officecfg/util/alllang.xsl could be slightly tweaked by adding the following line at the beginning.

<xsl:strip-space elements="*" />
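For readers who prefer not to decode the Perl one-liner, the per-file statistic can be sketched in Python as follows. This is only an illustration of the same counting rule (a line is blank when it holds nothing but whitespace). The function names and the sample text are made up for this sketch and are not part of the attachment.

```python
import re

# Same blank-line test as the Perl one-liner: a line counts as blank
# when it contains nothing but whitespace.
BLANK = re.compile(r"\A\s*\Z")


def line_stats(text):
    """Return (rate_percent, non_blank_lines, total_lines) for one file's text."""
    lines = text.split("\n")
    # A trailing newline terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    total = len(lines)
    non_blank = sum(1 for line in lines if not BLANK.match(line))
    rate = non_blank * 100.0 / total if total else 0.0
    return rate, non_blank, total


def format_row(text, filename):
    """Format one row in the four-column layout used by lines.txt."""
    rate, non_blank, total = line_stats(text)
    return "%05.2f%%\t%d\t%d\t%s" % (rate, non_blank, total, filename)


# Hypothetical sample: 2 meaningful lines out of 5.
sample = "<a>\n\n   \n\t\n</a>\n"
print(format_row(sample, "sample.xcu"))
```

Applied to every .xcu file under share/registry and sorted by the first column, rows of this shape would give a listing comparable to lines.txt.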
Created attachment 52745 [details] statistics of OOo 2.4
Created attachment 52746 [details] An experimental patch
kso->sb: Stephan, please take over.
The attached patch_88162_2008-04-11.diff is problematic, as it would also strip space that should be preserved, for example if an element's text content consists entirely of spaces. According to mib (who had a quick glance at alllang.xsl), the problem is the way the XSL file is written: it copies everything by default and removes unwanted element content, instead of dropping everything by default and keeping only wanted elements. Changing that would probably be more work and risk than I am willing to put into OOo 3.0 at this time.
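The concern about stripping too much can be illustrated with a small Python sketch. It does not run XSLT. It only imitates, on a DOM tree, what removing whitespace-only text nodes from all elements would do. The XML fragment and the property names in it are hypothetical.

```python
import xml.dom.minidom as minidom

XML_WHITESPACE = " \t\r\n"


def strip_whitespace_only_text(node):
    """Remove every whitespace-only text node below `node`, in place.

    This imitates the effect of <xsl:strip-space elements="*"/> on the
    source tree; it is an illustration, not an XSLT implementation.
    """
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        else:
            strip_whitespace_only_text(child)


# Hypothetical fragment: the first value is a single space on purpose.
SOURCE = (
    '<node>\n'
    '  <prop name="Separator">\n'
    '    <value> </value>\n'
    '  </prop>\n'
    '  <prop name="Title">\n'
    '    <value>Install fonts</value>\n'
    '  </prop>\n'
    '</node>'
)

doc = minidom.parseString(SOURCE)
strip_whitespace_only_text(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

The indentation between elements disappears, which is the intended effect, but the value that consisted of a single space becomes empty as well. That is the kind of content that should be preserved.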
Thank you for your consideration. Here is another patch, patch_88162_2008-06-09.diff. Let's see how it works.

Reproduce the phenomenon with a command based on the output of dmake.
========================================================================
cd $SRC_ROOT/officecfg/registry/data/org/openoffice/Office
xsltproc --verbose \
  --nonet \
  --stringparam xcs ../../../../../registry/schema/org/openoffice/Office/Common.xcs \
  --stringparam schemaRoot ../../../../../registry/schema \
  --stringparam locale en-US \
  ../../../../../util/alllang.xsl \
  Common.xcu 2>&1 | less
========================================================================

The output produced with the current XSLT stylesheet alllang.xsl includes many blank lines (not reproduced here).
========================================================================
<?xml version="1.0"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry" xmlns:install="http://openoffice.org/2004/installation" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" oor:name="Common" oor:package="org.openoffice.Office">
  <node oor:name="Menus">
    <node oor:name="Wizard">
      <node oor:name="m16">
        <prop oor:name="Title">
          <value xml:lang="en-US">Install fonts from the web...</value>
        </prop>
      </node>
    </node>
  </node>
</oor:component-data>
========================================================================

Look at the diagnostic messages that the --verbose option adds. They imply that a template for text nodes is missing.
========================================================================
xsltProcessOneNode: no template found for text
xsltDefaultProcessOneNode: copy text
xsltCopyText: copy text
========================================================================

A template that takes care of text nodes has been experimentally added to the XSLT stylesheet alllang.xsl.
========================================================================
<!-- catch any unprocessed texts and dispose them -->
<xsl:template match = "text()" mode="locale"/>
========================================================================

Now the output produced with the revised alllang.xsl no longer contains any unwanted blank lines.
========================================================================
<?xml version="1.0"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry" xmlns:install="http://openoffice.org/2004/installation" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" oor:name="Common" oor:package="org.openoffice.Office">
  <node oor:name="Menus">
    <node oor:name="Wizard">
      <node oor:name="m16">
        <prop oor:name="Title">
          <value xml:lang="en-US">Install fonts from the web...</value>
        </prop>
      </node>
    </node>
  </node>
</oor:component-data>
========================================================================
Created attachment 54325 [details] A revised patch
Some of my earlier attempts, such as adding <xsl:template match = "text()"/>, might not be on the right track. It seems that the missing <xsl:template match = "/"/> is the real cause of this issue. I will look into this again.
A patch for the following situation is attached:

xsltproc ... --stringparam locale XX .../util/alllang.xsl ... \
  .../misc/merge/org/openoffice/Office/Common.xcu
Created attachment 62483 [details] A patch for xsltproc ... --stringparam locale XX .../util/alllang.xsl .../misc/merge/org/openoffice/Office/Common.xcu
With the current implementation, DEV300_m48/officecfg/unxsoli4.pro/misc/registry/res/ja/org/openoffice has 42 files with, in total, 372303 lines and 3215964 bytes. With the patch, patch_88162_2009-05-25_officecfg_util_alllang.xsl_about_locale_to_DEV300_m48.diff, the figures become 42 files with, in total, 18530 lines and 706034 bytes.
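As a quick check of what these figures mean in relative terms, the ratios can be computed as below. This is plain arithmetic on the numbers quoted in this comment. The helper name is made up for this sketch.

```python
def remaining_percent(before, after):
    """Return what percentage of `before` is still present in `after`."""
    return after * 100.0 / before


# Figures quoted in this comment (res/ja, 42 files).
lines_before, lines_after = 372303, 18530
bytes_before, bytes_after = 3215964, 706034

lines_left = remaining_percent(lines_before, lines_after)
bytes_left = remaining_percent(bytes_before, bytes_after)

print("lines kept: %.2f%%, removed: %.2f%%" % (lines_left, 100 - lines_left))
print("bytes kept: %.2f%%, removed: %.2f%%" % (bytes_left, 100 - bytes_left))
```

The byte count shrinks less than the line count because the removed lines are short: roughly 7 bytes each on average, against roughly 38 bytes for the lines that remain.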
According to "XSL Transformations (XSLT) Version 1.0" by W3C [1], text in the source file, including whitespace, is copied to the destination file by the built-in template rules whenever the text nodes are processed.

=== cited ===
5.8 Built-in Template Rules

There is a built-in template rule to allow recursive processing to continue in the absence of a successful pattern match by an explicit template rule in the stylesheet.
...(omitted)...
There is also a built-in template rule for text and attribute nodes that copies text through:

<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>
=============

What we have learned here is that, to prevent unnecessary copying of whitespace, it is better to specify explicitly which nodes should be processed.

<xsl:apply-templates/>
does not restrict which child nodes are processed. Thus all children, including whitespace-only text nodes, are processed. Because no explicit template matches those text nodes, the built-in rule with match="text()" applies and copies the unnecessary whitespace from source to destination.

<xsl:apply-templates select = "node|prop|value"/>
explicitly selects the child elements to be processed. Whitespace-only text nodes are never selected, so the built-in rule is not applied to them and no unnecessary whitespace is copied.

This approach might be applied to other existing stylesheets in the module.

[1] http://www.w3.org/TR/xslt#built-in-rule
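The difference between the two forms of xsl:apply-templates can be imitated in Python. The sketch below is not XSLT and is not taken from alllang.xsl. It walks a DOM tree twice, once visiting every child (text nodes are copied, as the built-in rule would do) and once visiting only child elements named node, prop or value. The function names and the sample fragment are assumptions made for illustration, and attribute and text escaping is ignored for brevity.

```python
import xml.dom.minidom as minidom

WANTED = ("node", "prop", "value")


def open_tag(elem):
    """Serialize the start tag of `elem` (no escaping; fine for the sample)."""
    attrs = "".join(' %s="%s"' % (k, v) for k, v in elem.attributes.items())
    return "<%s%s>" % (elem.tagName, attrs)


def copy_all_children(elem):
    """Mimic a copying template whose body is <xsl:apply-templates/>.

    Every child is processed. Text nodes have no explicit template here,
    so they are copied through, as the built-in text() rule does.
    """
    out = [open_tag(elem)]
    for child in elem.childNodes:
        if child.nodeType == child.TEXT_NODE:
            out.append(child.data)  # built-in rule: copy text
        elif child.nodeType == child.ELEMENT_NODE:
            out.append(copy_all_children(child))
    out.append("</%s>" % elem.tagName)
    return "".join(out)


def copy_selected_children(elem):
    """Mimic <xsl:apply-templates select="node|prop|value"/>.

    Only child elements named node, prop or value are processed, so the
    whitespace-only text nodes between them are never visited. The text
    inside <value> is copied deliberately, as an explicit template would.
    """
    out = [open_tag(elem)]
    if elem.tagName == "value":
        out.extend(c.data for c in elem.childNodes
                   if c.nodeType == c.TEXT_NODE)
    else:
        for child in elem.childNodes:
            if (child.nodeType == child.ELEMENT_NODE
                    and child.tagName in WANTED):
                out.append(copy_selected_children(child))
    out.append("</%s>" % elem.tagName)
    return "".join(out)


# Hypothetical source fragment with blank lines between the elements.
SOURCE = (
    '<node name="Wizard">\n\n\n'
    '  <prop name="Title">\n\n'
    '    <value>Install fonts from the web...</value>\n\n'
    '  </prop>\n\n'
    '</node>'
)

root = minidom.parseString(SOURCE).documentElement
before = copy_all_children(root)
after = copy_selected_children(root)
print(repr(before))
print(repr(after))
```

In the first variant the blank lines travel from source to result unchanged. In the second they are never selected, while the content of value, even if it were a single space, is still kept.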
@tora: Thank you for your new patch. I will see that it gets integrated into OOo 3.2.
@sb: Thank you for your consideration.
applied attached patch_88162_2009-05-25_officecfg_util_alllang.xsl_about_locale_to_DEV300_m48.diff as <http://hg.services.openoffice.org/hg/cws/sb113/rev/84d8584def62>
@sb: Thanks a lot.
@jsk: please verify that localizations other than "en-US" are still correct (this change impacts the "fallback" locale "en-US" much less than all the others, so I also provided "de" installsets)
Close