Bug 56854

Summary: XMLBeans performance when using getXXXList() and other proxy methods
Product: POI Reporter: Yaniv Kunda <yaniv>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.11-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: 56854-1
56854-2

Description Yaniv Kunda 2014-08-14 14:06:00 UTC
The XMLBeans recommendation is that when accessing long XML documents, or
when performance matters, one should use XmlCursor and not getXXXList().

In order to provide a short-term solution, I propose we sweep the
code and replace all usages of getXXXList() with getXXXArray() -  which is
much faster since it creates a snapshot of the xml data instead of a live
view.

This will be trivial for read-only operations (e.g. bug #56556) and easy
for write operations – just read the array, work on it or create a new one,
and set the result back to the proxy.

Also, some attempts to minimize this – using toArray() on the returned
list, can be reduced to getXXXArray(), since toArray() iterates over the
elements, which is very inefficient in the XmlBeans impl.

We should decide how to document the new (and old) usages of getXXXArray(), which are deprecated (and probably will be as long as the XMLBeans project is dead) -
either by ignoring it altogether, or by adding @SuppressWarnings("deprecation") on all usages (thus always using a variable) optionally with a comment describing why the deprecated method is used (e.g. a link to this bug).

In addition, other methods (e.g. CTCol.getMax()) are sometimes called several times in the same scope when their result is not changing, which is another redundant piece of XMLBeans code being called.
Comment 1 Yaniv Kunda 2014-08-17 14:27:04 UTC
Created attachment 31921 [details]
56854-1

This is a first sweep of proxy method usages, which includes mostly usage of array-based proxy methods instead of list-based when appropriate, and other repeat calls of methods inlined.

test-all runtime dropped from ~80 seconds to ~60 on my machine, mostly due to the changes in XSSF and ColumnHelper.
Comment 2 Andreas Beeker 2014-08-28 00:12:47 UTC
Thank you for the patch - applied with r1620997

Most changes were trivial and the tests for ColumnHelper were successful, so should be ok. Added a few deprecation-ignores.

for the time being, I'll leave the bug entry open, waiting for the next diff.
Comment 3 Yaniv Kunda 2014-09-14 10:31:18 UTC
Created attachment 32020 [details]
56854-2

Additional optimizations, mainly in XSSFSheet
Comment 4 Andreas Beeker 2014-09-14 23:03:47 UTC
Thank you for the second patch - applied with r1624922
XSSFSheet.removeMergedRegion(int) looked too complicated, I haven't seen a reason, why not to use the default remove-element-by-index impl.
Comment 5 Yaniv Kunda 2014-09-15 14:45:40 UTC
You're right - sometimes it takes development to visit the "complicated" station when travelling from "a mess" to "elegant" :-)

Note: the patch for XSSFSheet.removeMergedRegion/s is what I originally done in https://issues.apache.org/bugzilla/attachment.cgi?id=31960&action=diff
for issue 55280 (https://issues.apache.org/bugzilla/show_bug.cgi?id=55280)

(In reply to Andreas Beeker from comment #4)
> Thank you for the second patch - applied with r1624922
> XSSFSheet.removeMergedRegion(int) looked too complicated, I haven't seen a
> reason, why not to use the default remove-element-by-index impl.
Comment 6 Andreas Beeker 2014-09-15 15:48:41 UTC
Thanks for linking the other bug - I've already had a quick look yesterday where these XSSFSheet.removeMergedRegion(s) methods came from ... :)

I haven't setup a profile environment for the test cases, yet (... the builds are now faster, when looking at builds.apache.org, but this only a very rough estimation ...)

Yaniv, in case you already have it in place (and also have some old stats to compare to), it would be good, if you also validate/profile the derivation from your patch - sometimes it's hard to decide if something is just refurbished old code or if it's like this on purpose.
Comment 7 Dominik Stadler 2015-09-13 20:18:11 UTC
It seems the things discussed here are mostly done and applied since some time already, I think it best to open new bugs for any additional optimizations that are proposed.