Created attachment 35433 [details] Patch improving Performance When exporting a large .xlsx with XSSExportToXML, it takes very long. Find attached Patch, which reduced the runtime on an xlsx with 42k xpaths by 80%. The idea for the improve is taken from: https://www.ibm.com/developerworks/library/x-perfap2/index.html
Did the private selectNode method, which is called by private getNodeByXPath, and, in turn, other methods, affect the performance? It looks like the general idea here is to avoid creating a NodeList, since that eagerly traverses the XML nodes. A lazy traversal approach seems to be significantly faster here, based on your findings.
Created attachment 35434 [details] patch improving performance, rebased I resolved some recent merge conflicts with your patch. Rebased patch is attach. I'll apply this once I run it against our unit test suite.
Committed in r1812546.
profiling the large .xlsx file didn't mark your mentioned private methods as hotspots. further profiling marked the private method indexOfElementInComplexType as bottleneck. This because of many calls performing the sort. Introducing a Cache - just a hashmap - storing the index for an entry minimizes the operations on the DOM. See attached my proposed patch. It significantly Speeds up the Export for large .xlsx
Created attachment 35445 [details] Proposing patch for further Performance improvements
Comment on attachment 35445 [details] Proposing patch for further Performance improvements >íY[SÛFΫ3ÃØðÊX¶,0(Î¥Z¦Ü&L¼ÆJdI]Á4ñï9»+iµ2IÚ>ø<`i/çòÛ®H<î:|Êý8êºÝîV¯Gº_··½¾EÜ w{}£·¾±ã®ÛÛÜ|Dº?P§&)÷Ø£î7Ë2ûªýt èt¤Ìwâx:ÞçÄìÆñÏQ'g¦CN9ó|3çêââð`Ä_ÆWã°[V÷¿V·Ûío×¥a1z¤AíÞvsåq«Õú|ïbö1n'÷ÀõåKÒ~¶aoøûòåÊc°àÖã@hñÀ'à òBr~vtßÜPFÏ~%}õ~(ÄÝwn(³)Ù½4mþ¬ î¸ðÄKÈØK`¶ F6²))ñ7/Á½Î@iEç$@§ãÖ>èÝÜu䬩'²Fö)ñÄf*T«¼h@RÊSÂGuJ(kJ&)»È (Pp >]x 'ØiÄs9^¨/þìgĽëHYiØX~<ÇÑÕ¹ÇG¶Î@ [FÀK=fíÇaH}ñvR0Î&À.µù(³]ã0%Ñ$b/Ù¿i|+i[ë´;&Cbù4ô£IÊýóè4PâOÅsO¿ÜH¤%6`w°ïp`5qÌzaZÆË$ðë>(s!S¡ÉÃíy[·Ñq|KI|RÀQBÇ ¿%1@A@}¤4Ñ£"2Üza0XÀ>Z««ú×,´4KÈúýb>âV³I>%Æs/{³ä äÌÑXÀ¡# ¸à°ØÅÝÙ\a@ôøÅw(¬])aÁõ0«ò]©2G¢è¤qmxQ@èJTØíé ¿):çÑweø{ö//íÄçú[¿Ìà¯ÿ¢'tÝKÜ´ùrÎ&ahl´¼«¶ìƬTß.iËð\ßĸlon×ç´Ó±GRùÓ'CÕÞ@²ÒxÂüjy ìÁZÊ⩯øàî|ÞªlÍ7tԵЦ×g'B¢Y)MÕrä8~CGói"ª?ö,(AÌ Ù¢ædmimÑaOXD8ÐRß!>´£³;¡ÂÓgö3ÀôÙ6üÔ¥ü¬<d¼^CCþ¨Û§=ú^JÉéÛ7Gû»ÒèVí)ñ5`ùa'H± àëaÌÆçt êÆÜh&ªýä;p»ÜUM¡Ðz2<z]c¬×ÍOUÚÙWßÿIB:#ÒG >~ê+ln4 n:«%1ù.Ümâf_[I5òÔSì¡°é¼±èF7bºP¬±=]ª{Q]§ê:ãd!Úzû$ë8 éôò>¡oâ8ë÷ë2ÑT7é¹ë¶Xº=ÛukV'äCûÍ ª¼þ4ÔÈ"mжeÖd'ã¤R¸sH[SÒ"¿ºª¹Àxl|Åbeªi0ÙsùU¢´ÑJKÃ@Y+Çó¡yiì+º,_+HûI¦¾Rr¯ô¤ßvñdTÈÆf-Ùf =mãüMHª~¶Ýy'ÁúùײϹf|WÌëV<y!rÛn»uQS¤Ö:Ç!§g»¢KÉKÇ ¦iôÔ&l=pg£¦ÏCqÆE²¹Ö°9fè«b²áÚ;PLÖ·íM£($ºYÖÈ4»bð.BJÕ"ª"o4Ŧî¯ä[8B Tü1§Ú¤VÜt¸\hs¤¯ÏJ¥Çv«f¢[:1ø õÑ, ¡o$¢¦yéÃLh4ª°ÿð\®ð1M<_tqiÉG,mu©zÀÅ-óÖ.¬\Bþîg:`¥Z+Ы^E8R1h¤)O]Ö¥¿ U¥«jèKzÕ,ñ@ZeèyòùsÙ~×Võ~-b"àù4eMý1.Û ?NAaBMá«üªÕæüº+QS_#Ç{C³TT$¡Lµ¾£¾]¼Ã³d!vA.PDº(óL°è^§kïådyWsëí2lÇ5ÿ°²(iÁ¨¥²õõ©ËЩ øúÓWFåd >æÝTgåSqIý9ÊÔGùbm >æA6«k\Õ¶UÖ³Õ2SE2óS:åÁu>¶ÌyÀ¾t³t0÷JÆ4ÉÌÞ]bYFú«º°È®Õw»JeYÔâ*(áo,T7RÖ1ÑÝ¢ã!¾Æ÷ ÈüNï-oóE±?æveSËý!3YTºËû\aÓWîÉ¥´á´ ¯8s=á4µd&¯Á§ckÑaù ÇÜo4 >«b©lØ3QÿÊ5[-}2<Jmºÿú¿JKZÒ´¤%-iIKZÒþô>?¦ö(
Can you rebase your patch to http://svn.apache.org/repos/asf/poi/trunk? r96747 looks like it might be a private repo. This patch at least predates r1812097. If you could, please exclude whitespace changes from the patch, which makes it quicker to review the changes (fortunately this one's short), but more importantly reduces the likelihood of merge conflicts due to sweeping whitespace changes.
Find attached the patch rebased to trunk. Your assumption of a private repo was right. Sorry about that. I just used ant -f patch.xml to create the patch, don't know how to specifically exclude whitespaces.
Created attachment 35450 [details] Last Patch rebased against trunk r1813152
Thanks! Applied in r1813240 without modification.
These changes will be included in the first snapshot release of POI 4.0.0.
Looks like this broke one of our integration tests (which passes when r1813240 is reverted). https://svn.apache.org/viewvc/poi/trunk/test-data/spreadsheet/59026.xlsx Testcase: testAllFiles[1684: spreadsheet/59026.xlsx using org.apache.poi.stress.XSSFFileHandler@23223dd8] took 0.016 sec Caused an ERROR While handling spreadsheet/59026.xlsx java.lang.Exception: While handling spreadsheet/59026.xlsx at org.apache.poi.TestAllFiles.testAllFiles(TestAllFiles.java:438) Caused by: org.xml.sax.SAXParseException; cvc-complex-type.2.4.a: Invalid content was found starting with element 'EmployeeInfo'. One of '{ExpenseItem}' is expected. at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:396) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:284) at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(XMLSchemaValidator.java:452) at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(XMLSchemaValidator.java:3230) at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(XMLSchemaValidator.java:1790) at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(XMLSchemaValidator.java:740) at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.beginNode(DOMValidatorHelper.java:277) at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.validate(DOMValidatorHelper.java:244) at com.sun.org.apache.xerces.internal.jaxp.validation.DOMValidatorHelper.validate(DOMValidatorHelper.java:190) at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:109) at javax.xml.validation.Validator.validate(Validator.java:124) at org.apache.poi.xssf.extractor.XSSFExportToXml.isValid(XSSFExportToXml.java:259) at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:219) at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:105) at org.apache.poi.stress.XSSFFileHandler.exportToXML(XSSFFileHandler.java:156) at org.apache.poi.stress.XSSFFileHandler.handleFile(XSSFFileHandler.java:111) at org.apache.poi.TestAllFiles.testAllFiles(TestAllFiles.java:410)
Looks to be due to the left and right xpaths being different lengths Reading spreadsheet/59026.xlsx with class org.apache.poi.stress.XSSFFileHandler compare() called with: leftXpath=/Root/EmployeeInfo/DateOfBirth rightXpath=/Root/EmployeeInfo/Name indexOfElementInComplexType() called with: samePath = //Root/EmployeeInfo leftElementName=DateOfBirth rightElementName=Name complexType=null compare() called with: leftXpath=/Root/EmployeeInfo/Code rightXpath=/Root/EmployeeInfo/DateOfBirth indexOfElementInComplexType() called with: samePath = //Root/EmployeeInfo leftElementName=Code rightElementName=DateOfBirth complexType=null compare() called with: leftXpath=/Root/ExpenseItem rightXpath=/Root/EmployeeInfo/Code indexOfElementInComplexType() called with: samePath = //Root leftElementName=ExpenseItem rightElementName=EmployeeInfo complexType=null Caused by: org.xml.sax.SAXParseException; cvc-complex-type.2.4.a: Invalid content was found starting with element 'EmployeeInfo'. One of '{ExpenseItem}' is expected. Failed: spreadsheet/59026.xlsx
I reverted r1813240. I'm waiting for a patch that passes the integration test suite.
Created attachment 35461 [details] patch improving performance by using hashmap Patch which won't break the Integration Tests. I ran 'ant test' and TestAllFiles.java in Eclipse.
(In reply to Daniel from comment #15) > Created attachment 35461 [details] Applied in r1813332 without modification. Thank you!
Created attachment 35468 [details] Patch using correct index for comparing While further investigating the XSSFExportToXML functionality I encountered a bug that i made in the previous patch. This patch corrects that bug.
Before applying patch 'Patch using correct index for comparing': java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeLo(Unknown Source) at java.util.TimSort.mergeAt(Unknown Source) at java.util.TimSort.mergeCollapse(Unknown Source) at java.util.TimSort.sort(Unknown Source) at java.util.Arrays.sort(Unknown Source) at java.util.Vector.sort(Unknown Source) at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:150) at org.apache.poi.xssf.extractor.XSSFExportToXml.exportToXML(XSSFExportToXml.java:105)
(In reply to Daniel from comment #17) > Created attachment 35468 [details] > Patch using correct index for comparing Applied in r1813443 without modification. Thanks for testing! (In reply to Daniel from comment #18) > java.lang.IllegalArgumentException: Comparison method violates its general > contract! Is there a simplified excel file that you can share that tests for this? This would be good to add into our unit test suite, since XSSFExportToXML has relatively poor test coverage.
Closing this one for now, it does not look like we will get more test-files here for now.