Issue 88162 - Removing blank lines from a .xcu file to reduce its size
Summary: Removing blank lines from a .xcu file to reduce its size
Status: CLOSED FIXED
Alias: None
Product: utilities
Classification: Unclassified
Component: code
Version: OOo 2.4.0
Hardware: All All
Importance: P3 Trivial
Target Milestone: OOo 3.2
Assignee: joerg.skottke
QA Contact: Unknown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-11 07:35 UTC by tora3
Modified: 2009-10-09 08:12 UTC

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
statistics of OOo 2.4 (14.93 KB, text/plain)
2008-04-11 07:38 UTC, tora3
An experimental patch (562 bytes, patch)
2008-04-11 07:41 UTC, tora3
A revised patch (1.04 KB, patch)
2008-06-08 17:21 UTC, tora3
A patch for xsltproc ... --stringparam locale XX .../util/alllang.xsl .../misc/merge/org/openoffice/Office/Common.xcu (1.30 KB, patch)
2009-05-25 06:06 UTC, tora3

Description tora3 2008-04-11 07:36:00 UTC
Phenomenon:
.xcu files in the directory /opt/openoffice.org2.4/share/registry/res/ 
contain many blank lines. This might affect performance.

Facts:
For instance, res/ja/org/openoffice/Office/DataAccess.xcu has 3247 lines.
Only 128 of them have content; the rest are blank. 
The ratio, 128 divided by 3247, is 3.94%. 

The .xcu files in the directory res/ja total 312221 lines 
and 2875301 bytes (approx. 2.7 MB). If the blank lines were removed, the 
total could be reduced to 18846 lines and 747025 bytes (approx. 0.7 MB).
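
As a rough cross-check of the figures above, here is a minimal Python sketch. The helper names percent and mib are made up for illustration, and the sketch assumes that "MB" above means units of 1024*1024 bytes.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return part * 100.0 / whole


def mib(size_in_bytes: int) -> float:
    """Convert a byte count to units of 1024*1024 bytes (an assumption about 'MB')."""
    return size_in_bytes / (1024 * 1024)


# Figures quoted in the description.
content_rate = percent(128, 3247)        # DataAccess.xcu: lines with content
lines_kept = percent(18846, 312221)      # res/ja: lines left after removal
bytes_kept = percent(747025, 2875301)    # res/ja: bytes left after removal

print(f"content rate: {content_rate:.2f}%")
print(f"lines kept:   {lines_kept:.2f}%")
print(f"bytes kept:   {bytes_kept:.2f}%")
print(f"size: {mib(2875301):.1f} -> {mib(747025):.1f}")
```

The byte count shrinks less than the line count because a blank line holds only a few bytes of indentation.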

The attached file lines.txt shows the statistics for OOo 2.4 Linux 
English plus a Japanese language pack. Each line has four columns: 
rate, the number of lines with content, the total number of lines, and 
file name. Part of the file is shown below.

Rate    Content Total   Filename
======= ======= ======= ===========================================
03.94%  128     3247    res/ja/org/openoffice/Office/DataAccess.xcu
05.01%  4342    86650   res/ja/org/openoffice/Office/TableWizard.xcu
...(omitted)...
100.00% 8668    8668    data/org/openoffice/Office/TableWizard.xcu
100.00% 14113   14113   data/org/openoffice/Office/Labels.xcu

The above file can be produced with the following commands:
cd /opt/openoffice.org2.4/share/registry
f=`find * -name '*.xcu'`
perl -ne '$n++ unless m/\A\s*\Z/; if (eof) { $r=$n*100.0/$.; printf "%05.2f%%\t%d\t%d\t%s\n", $r, $n, $., $ARGV; $n=0; close ARGV }' $f \
  | sort -n -k 1 -n -k 3 > lines.txt
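
For readers without perl at hand, the same statistics can be approximated with a short Python sketch. This is not the script used to produce lines.txt; the function names are made up for illustration, and the sort order (rate first, then total lines) is only an approximation of the sort call above.

```python
from pathlib import Path


def line_stats(data: bytes) -> tuple[int, int]:
    """Return (lines_with_content, total_lines) for the content of one file."""
    lines = data.splitlines()
    # A line counts as blank when it is empty or holds only ASCII whitespace.
    with_content = sum(1 for line in lines if line.strip())
    return with_content, len(lines)


def format_row(with_content: int, total: int, name: str) -> str:
    """Format one row in the same four-column layout as lines.txt."""
    rate = with_content * 100.0 / total if total else 0.0
    return f"{rate:05.2f}%\t{with_content}\t{total}\t{name}"


def report(root: Path) -> list[str]:
    """Collect one row per .xcu file below root, sorted by rate, then size."""
    stats = []
    for path in root.rglob("*.xcu"):
        with_content, total = line_stats(path.read_bytes())
        stats.append((with_content, total, path.relative_to(root).as_posix()))
    stats.sort(key=lambda s: (s[0] / s[1] if s[1] else 0.0, s[1]))
    return [format_row(*s) for s in stats]
```

Calling report(Path("/opt/openoffice.org2.4/share/registry")) and joining the rows with newlines should give a listing comparable to lines.txt.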

Proposal:
To remove blank lines, the XSLT stylesheet officecfg/util/alllang.xsl 
could be slightly tweaked by adding the following line near the beginning.

  <xsl:strip-space elements="*" />
Comment 1 tora3 2008-04-11 07:38:21 UTC
Created attachment 52745
statistics of OOo 2.4
Comment 2 tora3 2008-04-11 07:41:30 UTC
Created attachment 52746
An experimental patch
Comment 3 kai.sommerfeld 2008-05-13 09:21:42 UTC
kso->sb: Stephan, please take over.
Comment 4 Stephan Bergmann 2008-05-13 09:29:47 UTC
.
Comment 5 Stephan Bergmann 2008-06-04 08:48:33 UTC
The attached patch_88162_2008-04-11.diff is problematic, as it would also strip
space that should be preserved, for example if an element's text content consists
entirely of spaces.  According to mib (who had a quick glance at alllang.xsl),
the problem is the way the xsl file is written (to copy everything by default
and remove unwanted element content, instead of dropping everything by default
and only keeping wanted elements), but changing that would probably be more work
and risk than I am willing to put into OOo 3.0 at this time.
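
The concern can be illustrated with a small Python sketch. It does not run alllang.xsl or any XSLT processor; it only mimics the effect of stripping whitespace-only text nodes with xml.etree.ElementTree, and the "Separator" property is a made-up example, not taken from a real .xcu file.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"


def strip_whitespace_only_text(elem: ET.Element) -> None:
    """Drop every whitespace-only text node below elem.

    This mimics what <xsl:strip-space elements="*"/> does to the source
    tree before any template is applied.
    """
    if elem.text is not None and not elem.text.strip(XML_WHITESPACE):
        elem.text = None
    for child in elem:
        strip_whitespace_only_text(child)
        if child.tail is not None and not child.tail.strip(XML_WHITESPACE):
            child.tail = None


# Hypothetical property whose value is a single space character.
SOURCE = '<prop name="Separator">\n  <value> </value>\n</prop>'

root = ET.fromstring(SOURCE)
before = root.find("value").text      # the single space is still there
strip_whitespace_only_text(root)
after = root.find("value").text       # the value has been stripped as well

print(repr(before), "->", repr(after))
```

The indentation between elements disappears as intended, but so does the value that consists only of a space, which is the kind of content that has to be preserved.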
Comment 6 tora3 2008-06-08 17:16:15 UTC
Thank you for your consideration.

Here is another patch patch_88162_2008-06-09.diff. 
Let's see how it works.

The phenomenon can be reproduced with the following command, taken from the output of dmake.
========================================================================
cd $SRC_ROOT/officecfg/registry/data/org/openoffice/Office

xsltproc --verbose \
  --nonet \
  --stringparam xcs ../../../../../registry/schema/org/openoffice/Office/Common.xcs \
  --stringparam schemaRoot ../../../../../registry/schema \
  --stringparam locale en-US \
  ../../../../../util/alllang.xsl \
  Common.xcu 2>&1 | less
========================================================================

The output produced with the current XSLT stylesheet alllang.xsl includes 
many blank lines.
========================================================================
<?xml version="1.0"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry"
xmlns:install="http://openoffice.org/2004/installation"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" oor:name="Common"
oor:package="org.openoffice.Office">
  
  
  
  
  <node oor:name="Menus">
    
    <node oor:name="Wizard">
      
      
      
      
      
      
      
      
      
      
      
      
      
      <node oor:name="m16">
        
        <prop oor:name="Title">
          
          <value xml:lang="en-US">Install fonts from the web...</value>
        </prop>
        
        
      </node>
    </node>
  </node>
  
  
  
  
  
</oor:component-data>
========================================================================

Look at the diagnostic messages that are additionally printed with the --verbose option.
They imply that a template for text nodes is missing.
========================================================================
xsltProcessOneNode: no template found for text
xsltDefaultProcessOneNode: copy text 
  
xsltCopyText: copy text 
========================================================================

A template that handles text nodes has been experimentally added 
to the XSLT stylesheet alllang.xsl.
========================================================================
	<!-- catch any unprocessed texts and dispose them -->
	<xsl:template match = "text()" mode="locale"/>
========================================================================

Now the output produced with the revised alllang.xsl does not have any 
unwanted blank lines.
========================================================================
<?xml version="1.0"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry"
xmlns:install="http://openoffice.org/2004/installation"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" oor:name="Common"
oor:package="org.openoffice.Office">
  <node oor:name="Menus">
    <node oor:name="Wizard">
      <node oor:name="m16">
        <prop oor:name="Title">
          <value xml:lang="en-US">Install fonts from the web...</value>
        </prop>
      </node>
    </node>
  </node>
</oor:component-data>
========================================================================
Comment 7 tora3 2008-06-08 17:21:51 UTC
Created attachment 54325
A revised patch
Comment 8 tora3 2009-05-22 09:08:23 UTC
Some attempts that I made before, such as adding 
<xsl:template match = "text()"/>, might not have been on the right track. 

It seems that a missing <xsl:template match = "/"/> may be the real cause 
of this issue. 

I will look into this again. 
Comment 9 tora3 2009-05-25 06:04:13 UTC
A patch for the following situation is being attached:
  xsltproc ... --stringparam locale XX .../util/alllang.xsl ... \
  .../misc/merge/org/openoffice/Office/Common.xcu
Comment 10 tora3 2009-05-25 06:06:59 UTC
Created attachment 62483
A patch for xsltproc ... --stringparam locale XX .../util/alllang.xsl .../misc/merge/org/openoffice/Office/Common.xcu
Comment 11 tora3 2009-05-25 06:24:55 UTC
With the current implementation,
DEV300_m48/officecfg/unxsoli4.pro/misc/registry/res/ja/org/openoffice 
has 
 42 files with, in total, 372303 lines and 3215964 bytes.

With the patch,
patch_88162_2009-05-25_officecfg_util_alllang.xsl_about_locale_to_DEV300_m48.diff, 
the figures have become 
 42 files with, in total,  18530 lines and  706034 bytes.
Comment 12 tora3 2009-05-26 08:20:28 UTC
According to "XSL Transformations (XSLT) Version 1.0" by W3C [1], text in the 
source file, including whitespace, is copied by the built-in template rules
to the destination file if the text nodes are processed. 

=== cited ===
5.8 Built-in Template Rules

There is a built-in template rule to allow recursive processing to continue 
in the absence of a successful pattern match by an explicit template rule in 
the stylesheet. ...(omitted)... 

There is also a built-in template rule for text and attribute nodes that copies 
text through:

  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
=============

What we have learned here is that, to prevent unnecessary copying of whitespace, 
it would be better to specify explicitly which nodes should be processed. 

  <xsl:apply-templates/>
    does not restrict which child nodes are processed. Thus, every child 
    node, including whitespace-only text nodes, is processed, and the 
    built-in template rule with match="text()" copies the unnecessary 
    whitespace from source to destination.

  <xsl:apply-templates select = "node|prop|value"/>
    explicitly specifies which child nodes are processed. Therefore, text 
    nodes are never selected, the built-in rule is not applied to them, and 
    no unnecessary whitespace is copied.
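
The difference between the two forms can be sketched in Python. This is only a simulation with xml.etree.ElementTree, not a real XSLT processor and not the logic of alllang.xsl: the sample document, the unprefixed "name" attribute, the dropped <info> element and the rule that a <value> is copied verbatim are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

WANTED = ("node", "prop", "value")

# Hypothetical source: one unwanted element plus indentation whitespace.
SOURCE = """<node name="Menus">

  <info>dropped</info>

  <node name="m16">
    <prop name="Separator">
      <value> </value>
    </prop>
  </node>
</node>"""


def emit(elem: ET.Element, out: list, select_only: bool) -> None:
    """Copy a wanted element to out and process its children.

    select_only=False imitates <xsl:apply-templates/>: text nodes between
    children are visited too and copied through, as the built-in rule does.
    select_only=True imitates select="node|prop|value": text nodes between
    children are never visited.
    """
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    out.append(f"<{elem.tag}{attrs}>")
    if elem.tag == "value":
        # Assumption: the content of a value is always copied verbatim.
        out.append(escape(elem.text or ""))
    else:
        if not select_only and elem.text:
            out.append(escape(elem.text))
        for child in elem:
            if child.tag in WANTED:
                emit(child, out, select_only)
            if not select_only and child.tail:
                out.append(escape(child.tail))
    out.append(f"</{elem.tag}>")


def transform(source: str, select_only: bool) -> str:
    """Run the simulated transformation over a source document."""
    out = []
    emit(ET.fromstring(source), out, select_only)
    return "".join(out)


loose = transform(SOURCE, select_only=False)
compact = transform(SOURCE, select_only=True)
print(loose)
print(compact)
```

In this sketch the first form keeps the blank lines that surrounded the dropped element, while the second form emits no whitespace between elements and still preserves the value that consists of a single space. A real processor would add its own indentation if the stylesheet requests it.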

This approach might be applied to other existing stylesheets in the module.

[1] http://www.w3.org/TR/xslt#built-in-rule
Comment 13 Stephan Bergmann 2009-05-26 08:23:17 UTC
@tora:  Thank you for your new patch.  I will see to getting it integrated into OOo 3.2.
Comment 14 tora3 2009-05-26 09:03:46 UTC
@sb: Thank you for your consideration. 
Comment 15 Stephan Bergmann 2009-08-26 10:26:15 UTC
applied attached
patch_88162_2009-05-25_officecfg_util_alllang.xsl_about_locale_to_DEV300_m48.diff as
<http://hg.services.openoffice.org/hg/cws/sb113/rev/84d8584def62>
Comment 16 tora3 2009-08-27 05:21:07 UTC
@sb: Thanks a lot.
Comment 17 Stephan Bergmann 2009-09-01 09:13:46 UTC
@jsk: please verify that localizations other than "en-US" are still correct
(this change impacts the "fallback" locale "en-US" much less than all the
others, so I also provided "de" installsets)
Comment 18 joerg.skottke 2009-09-09 07:01:03 UTC
.
Comment 19 joerg.skottke 2009-09-09 07:02:46 UTC
.
Comment 20 joerg.skottke 2009-10-09 08:12:00 UTC
Close