Bug 61630 - [PATCH] Performance Tweaks in XSSFExportToXML
Summary: [PATCH] Performance Tweaks in XSSFExportToXML
Status: REOPENED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.17-FINAL
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-18 13:09 UTC by Daniel
Modified: 2017-10-26 18:10 UTC (History)
0 users



Attachments
Patch improving Performance (1.07 KB, application/x-gzip)
2017-10-18 13:09 UTC, Daniel
Details
patch improving performance, rebased (4.15 KB, patch)
2017-10-18 16:00 UTC, Javen O'Neal
Details | Diff
Proposing patch for further Performance improvements (1.95 KB, application/tar+gzip)
2017-10-20 09:45 UTC, Daniel
Details
Last Patch rebased against trunk r1813152 (1.40 KB, application/tar+gzip)
2017-10-24 09:34 UTC, Daniel
Details
patch improving performance by using hashmap (1.44 KB, application/tar+gzip)
2017-10-25 10:55 UTC, Daniel
Details
Patch using correct index for comparing (366 bytes, application/tar+gzip)
2017-10-26 12:58 UTC, Daniel
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel 2017-10-18 13:09:02 UTC
Created attachment 35433 [details]
Patch improving Performance

When exporting a large .xlsx with XSSExportToXML, it takes very long. 

Find attached Patch, which reduced the runtime on an xlsx with 42k xpaths by 80%.

The idea for the improve is taken from:  https://www.ibm.com/developerworks/library/x-perfap2/index.html
Comment 1 Javen O'Neal 2017-10-18 15:57:18 UTC
Did the private selectNode method, which is called by private getNodeByXPath, and, in turn, other methods, affect the performance?

It looks like the general idea here is to avoid creating a NodeList, since that eagerly traverses the XML nodes. A lazy traversal approach seems to be significantly faster here, based on your findings.
Comment 2 Javen O'Neal 2017-10-18 16:00:58 UTC
Created attachment 35434 [details]
patch improving performance, rebased

I resolved some recent merge conflicts with your patch. Rebased patch is attach. I'll apply this once I run it against our unit test suite.
Comment 3 Javen O'Neal 2017-10-18 16:04:18 UTC
Committed in r1812546.
Comment 4 Daniel 2017-10-20 09:44:21 UTC
profiling the large .xlsx file didn't mark your mentioned private methods as hotspots. further profiling marked the private method indexOfElementInComplexType as bottleneck. This because of many calls performing the sort.
Introducing a Cache - just a hashmap - storing the index for an entry minimizes the operations on the DOM. See attached my proposed patch. It significantly Speeds up the Export for large .xlsx
Comment 5 Daniel 2017-10-20 09:45:11 UTC
Created attachment 35445 [details]
Proposing patch for further Performance improvements
Comment 6 Javen O'Neal 2017-10-20 23:23:47 UTC
Comment on attachment 35445 [details]
Proposing patch for further Performance improvements

>íY[SÛFΫ3ÃØðÊX¶,0(Î¥Z¦Ü&L¼ÆJdI]­Á4ñï9»+iµ2IÚ>ø<`i/çòÛ®H<î:|Êý8êºÝîV¯Gº_··½¾EÜ
w{}£·¾±ã®ÛÛÜ|Dº?P§&)÷Ø£î7Ë2ûªýt
èt¤Ìwâx:Þ­çÄìÆñÏQ'g¦CN9ó|3çêââð`Ä_ÆWã°[V÷¿V·Ûío×¥a1z¤A­íÞvsåq«Õú|ïbö1n'÷ÀõåKÒ~¶aoøûòåÊc°àÖã@hñÀ'à òBr~vtßÜPFÏ~%}õ~(ÄÝwn(³)Ù½4mþ¬
î¸ðÄKÈØK`¶
F6²))ñ7/Á½Î@iEç$@§ãÖ>èÝÜu䬭©'²Fö)ñÄf*T«¼h@RÊSÂGuJ(kJ&)»È
(Pp
>]xÂ	'ØiÄs9^¨/þìgĽëHYiØX~<ÇÑÕ¹ÇG¶Î@	[FÀK=fíÇaH}ñvR0Î&À.µù(³]ã0%Ñ$b/Ù¿i|+i[ë´­;&Cbù4ô£IÊýóè4PâOÅsO¿ÜH¤%6`w°ïp`5qÌzaZÆË$ðë>(s!S¡­Éííy[·Ñq|KI|RÀQBÇ	¿%1@A@}¤4Ñ£"2Üza0XÀ>Z««ú×,´4KÈúýb>âV³I>%Æs/{³ä	äÌÑXÀ¡#
¸à°ØÅÝÙ\a@ôøÅw(¬])aÁõ0«ò]©2G¢è¤qmxQ@èJTØíé 
¿):çÑweø{ö//íÄçú[¿Ìà¯ÿ¢'tÝKÜ´ùrÎ&ahl´¼«¶ìƬTß.iËð\ßĸl­on×ç´Ó±GRùÓ'CÕÞ@²ÒxÂüjy	ìÁZÊ⩯øàî|ÞªlÍ7tԵЦ×g'B¢Y)MÕrä8~CGói"ª?ö,(AÌ Ù¢ædmimÑaOXD8ÐRß!>´­£³;¡ÂÓgö3ÀôÙ6üÔ¥ü¬<d¼^CCþ¨Û§=ú^JÉéÛ7Gû»ÒèVí)ñ5`ùa'H± àëaÌÆçt êÆÜh&ªýä;p»ÜUM¡Ðz2<z]c¬×ÍOUÚÙWßÿIB:#ÒG
>~ê+ln4	n­:«%1ù.Ümâ­f_[I5òÔSì¡°é¼±èF7bºP¬±=]ª{Q]§ê:ãd!Úzû$ë8	éôò>¡oâ8ë÷ë2ÑT7é¹ë¶Xº=ÛukV'äCûÍ ª¼þ4ÔÈ"mжeÖd'ã¤R¸sH[SÒ"¿ºª¹Àxl|Åbeªi0ÙsùU¢´ÑJKÃ@Y+Çó¡yiì+º,_+HûI¦¾Rr¯ô¤ßvñdTÈÆf-Ùf
=mãüMHª~¶Ýy'ÁúùײϹf|WÌë­V<y!rÛn»uQS¤Ö:Ç!§g»¢KÉKÇ ¦iôÔ&l=pg£¦ÏCqÆE²¹Ö°­9fè«b²áÚ;PLÖ·íM£($ºYÖÈ4»bð.BJÕ"ª"o4Ŧî¯ä[8B
Tü1§Ú¤VÜt¸\hs¤¯ÏJ¥Çv«f¢[:1ø	õÑ,
¡o$¢¦yéÃLh4ª°ÿð\®ð1M<_tqiÉG,mu©zÀÅ-óÖ.¬­\Bþîg:`¥Z+Ы^E8R1h¤)O]Ö¥¿ U¥«jèKzÕ,ñ@ZeèyòùsÙ~×Võ~-b"àù4eMý1.Û ?NAaBMá«üªÕæüº+Q­S_#Ç{C³TT$¡Lµ¾£¾]¼Ã³d!vA.PDº(óL°è^§kïådyWsëí2lÇ5ÿ°²(iÁ¨¥²õõ©ËЩ
øúÓWFåd
>æÝTgåSqIý9ÊÔGùbm
>æA6«k\Õ¶UÖ³Õ2­SE2óS:åÁu>¶ÌyÀ¾t³t0÷JÆ4ÉÌÞ]bYFú«º°È®Õw»JeYÔâ*(áo,T7RÖ1ÑÝ¢ã!¾Æ÷ ÈüNï-oóE±?æveSËý!3YTºËû\aÓWîÉ¥´á´
¯8s=á4µd&¯Á§ckÑaù	ÇÜo4
>«b©lØ3QÿÊ5[-}2<Jmºÿú¿JKZÒ´¤%-iIKZÒþô>?¦ö(
Comment 7 Javen O'Neal 2017-10-20 23:43:20 UTC
Can you rebase your patch to http://svn.apache.org/repos/asf/poi/trunk? r96747 looks like it might be a private repo. This patch at least predates r1812097.

If you could, please exclude whitespace changes from the patch, which makes it quicker to review the changes (fortunately this one's short), but more importantly reduces the likelihood of merge conflicts due to sweeping whitespace changes.
Comment 8 Daniel 2017-10-24 09:33:27 UTC
Find attached the patch rebased to trunk. Your assumption of a private repo was right. Sorry about that.

I just used ant -f patch.xml to create the patch, don't know how to specifically exclude whitespaces.
Comment 9 Daniel 2017-10-24 09:34:33 UTC
Created attachment 35450 [details]
Last Patch rebased against trunk r1813152
Comment 10 Javen O'Neal 2017-10-24 23:00:41 UTC
Thanks! Applied in r1813240 without modification.
Comment 11 Javen O'Neal 2017-10-24 23:01:25 UTC
These changes will be included in the first snapshot release of POI 4.0.0.
Comment 12 Javen O'Neal 2017-10-24 23:40:42 UTC
Looks like this broke one of our integration tests (which passes when r1813240 is reverted).
https://svn.apache.org/viewvc/poi/trunk/test-data/spreadsheet/59026.xlsx

Testcase: testAllFiles[1684: spreadsheet/59026.xlsx using org.apache.poi.stress.XSSFFileHandler@23223dd8] took 0.016 sec
        Caused an ERROR
While handling spreadsheet/59026.xlsx
java.lang.Exception: While handling spreadsheet/59026.xlsx
        at org.apache.poi.TestAllFiles.testAllFiles(TestAllFiles.java:438)
Caused by: org.xml.sax.SAXParseException; cvc-complex-type.2.4.a: Invalid content was found starting with element 'EmployeeInfo'. One of '{ExpenseItem}' is expected.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:396)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:284)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(XMLSchemaValidator.java:452)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(XMLSchemaValidator.java:3230)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(XMLSchemaValidator.java:1790)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(XMLSchemaValidator.java:740)
        at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.beginNode(DOMValidatorHelper.java:277)
        at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.validate(DOMValidatorHelper.java:244)
        at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.validate(DOMValidatorHelper.java:190)
        at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:109)
        at javax.xml.validation.Validator.validate(Validator.java:124)
        at org.apache.poi.xssf.extractor.XSSFExportToXml.isValid(XSSFExportToXml.java:259)
        at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:219)
        at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:105)
        at org.apache.poi.stress.XSSFFileHandler.exportToXML(XSSFFileHandler.java:156)
        at org.apache.poi.stress.XSSFFileHandler.handleFile(XSSFFileHandler.java:111)
        at org.apache.poi.TestAllFiles.testAllFiles(TestAllFiles.java:410)
Comment 13 Javen O'Neal 2017-10-25 00:02:46 UTC
Looks to be due to the left and right xpaths being different lengths

Reading spreadsheet/59026.xlsx with class org.apache.poi.stress.XSSFFileHandler
compare() called with:
    leftXpath=/Root/EmployeeInfo/DateOfBirth
    rightXpath=/Root/EmployeeInfo/Name
indexOfElementInComplexType() called with:
    samePath = //Root/EmployeeInfo
    leftElementName=DateOfBirth
    rightElementName=Name
    complexType=null

compare() called with:
    leftXpath=/Root/EmployeeInfo/Code
    rightXpath=/Root/EmployeeInfo/DateOfBirth
indexOfElementInComplexType() called with:
    samePath = //Root/EmployeeInfo
    leftElementName=Code
    rightElementName=DateOfBirth
    complexType=null

compare() called with:
    leftXpath=/Root/ExpenseItem
    rightXpath=/Root/EmployeeInfo/Code
indexOfElementInComplexType() called with:
    samePath = //Root
    leftElementName=ExpenseItem
    rightElementName=EmployeeInfo
    complexType=null

Caused by: org.xml.sax.SAXParseException; cvc-complex-type.2.4.a: Invalid content was found starting with element 'EmployeeInfo'. One of '{ExpenseItem}' is expected.
Failed: spreadsheet/59026.xlsx
Comment 14 Javen O'Neal 2017-10-25 00:04:04 UTC
I reverted r1813240. I'm waiting for a patch that passes the integration test suite.
Comment 15 Daniel 2017-10-25 10:55:13 UTC
Created attachment 35461 [details]
patch improving performance by using hashmap

Patch which won't break the Integration Tests. I ran 'ant test' and TestAllFiles.java in Eclipse.
Comment 16 Javen O'Neal 2017-10-25 18:35:06 UTC
(In reply to Daniel from comment #15)
> Created attachment 35461 [details]
Applied in r1813332 without modification. Thank you!
Comment 17 Daniel 2017-10-26 12:58:39 UTC
Created attachment 35468 [details]
Patch using correct index for comparing

While further investigating the XSSFExportToXML functionality I encountered a bug that i made in the previous patch. 
This patch corrects that bug.
Comment 18 Daniel 2017-10-26 13:00:51 UTC
Before applying patch 'Patch using correct index for comparing': 

java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at java.util.TimSort.mergeLo(Unknown Source)
	at java.util.TimSort.mergeAt(Unknown Source)
	at java.util.TimSort.mergeCollapse(Unknown Source)
	at java.util.TimSort.sort(Unknown Source)
	at java.util.Arrays.sort(Unknown Source)
	at java.util.Vector.sort(Unknown Source)
	at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:150)
	at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:105)
Comment 19 Javen O'Neal 2017-10-26 18:10:41 UTC
(In reply to Daniel from comment #17)
> Created attachment 35468 [details]
> Patch using correct index for comparing
Applied in r1813443 without modification. Thanks for testing!

(In reply to Daniel from comment #18)
> java.lang.IllegalArgumentException: Comparison method violates its general
> contract!
Is there a simplified excel file that you can share that tests for this? This would be good to add into our unit test suite, since XSSFExportToXML has relatively poor test coverage.