Bug 53508 - Performance XSSFCell.setCellValue(String) is very slow
Summary: Performance XSSFCell.setCellValue(String) is very slow
Status: RESOLVED INVALID
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.8-FINAL
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-04 08:17 UTC by velten
Modified: 2013-09-26 16:11 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description velten 2012-07-04 08:17:05 UTC
We generate large Documents. We found that especially Sheets with String-Cells are very slow. 

CPU-Usage was several minutes 100%. Memory footprint rather low.
(24 Minutes for a Sheet with around 31000 rows) 

I take a look in Source code and figured out, that XSSFCell uses shared Strings Method. 

I debugged a little trough and found that the generation of the HashKey for the shared Strings seems to be very expensive. (SharedStringsTable.addEntry)

For me, I solved the problem by using Inline Strings instead of Shared Strings. It is very much faster. (only some seconds now instead of 24 minutes)

But Api for this is not at top-level. I use this now:

CTCell ctCell = ((XSSFCell)cell).getCTCell();
ctCell.setT(STCellType.INLINE_STR);
XSSFRichTextString xssfRichTextString = new XSSFRichTextString(value);
ctCell.setIs(xssfRichTextString.getCTRst());
Comment 1 Yegor Kozlov 2012-07-04 10:01:10 UTC
Can you post sample code that compares two methods and demonstrates that current implementation is slow? 

SharedStringsTable.getKey() can be slow, but I'd like to see a proof. Every time a string is added to the SST, SharedStringsTable.getKey() computes its hash code which actually results to invoking xmlBean.toString().hashCode() where xmlBean is the bean holding the string data. 

Inlining strings has a side effect: it speeds up building of worksheets, but the resulting file can be much bigger and it will take significantly more time to open it in Excel.

My idea is to tweak SharedStringsTable.getKey() as follows:

    private String getKey(CTRst st) {
        if(st.sizeOfRArray() == 0) {
            // for plain text return the text value
            return st.getT();
        } else {
            //  for rich text return xml representation of the CTRst bean
            return st.xmlText(options); 
        }
    }

if a string has no rich text runs then we return  its text value, it should be much faster that invoking st.xmlText(options). 

It would be great if you apply this fix , build POI and meausure the performace boost. 

Yegor
Comment 2 velten 2012-07-05 11:24:37 UTC
Hello Yegor

Thank you for very fast response!

I must admit, that I tried to create a testcase, but failed. 

But problem on our server persisted.

I recognized that the problem disappeared only with server restart. First I recognized this on our testing server. Then I proofed this a second time at our production server.

I was able to measure the difference: Long running server performance was around 30 times slower than fresh started server. 

After some over night XMLX-Exports on our production server the performance is down again (perhaps 20 times slower). Test server has had no workload and is unchanged fast.


We figured out only with thread dumps, that the SharedStringsTable Hash Key Generation is perhaps the problem, because threads was hanging in this method very often.

Obviously the POI code causes not the 30 times slowdown in performance. Perhaps there is a bad cache implementation or something else in the XML-Beans code?

Currently I have not tested your code snippet. Surely this could help to speed up POI anyway.

I will try to search something further..   


Before Server restart:
2012-07-04 16:44:42,423 ERROR table.Table.serializeXLS() (729) - serializeXLS for 31681 rows with 20 columns for user 'XXX' lasted 473392 with an avg of 14.942457624443673 per row

After Server restart:
2012-07-04 18:56:29,686 ERROR table.Table.serializeXLS() (729) - serializeXLS for 31681 rows with 20 columns for user 'XXX' lasted 14919 with an avg of 0.4709131656197721 per row

After some over night export workload this morning:
2012-07-05 10:52:00,446 ERROR table.Table.serializeXLS() (729) - serializeXLS for 31680 rows with 20 columns for user 'XXX' lasted 281254 with an avg of 8.877967171717172 per row

(Each of this durations was stable reproducible over several test export runs)

Best wishes

Daniel
Comment 3 velten 2013-09-25 13:25:59 UTC
I think we have found the real bug.

With JDK 7 there was implemented a new init method of HashMap and some other Maps.

This new Init Method used the standard Random-Generator to calculate the hashcode.

But this Random-Generator is synchronized. Causing problems in multithreaded environment. There is no performance bug, when there is only one Thread. But in multithreaded environment there can be a huge loss in performance.

This JDK-Bug is fixed with version 1.7.0_40.

After update to this version all seem to be fine now.

Please look at:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8006593

http://stackoverflow.com/questions/14010906/given-that-hashmaps-in-jdk1-6-and-above-cause-problems-with-multi-threading-how

https://blogs.oracle.com/henrik/entry/migrating_from_java_se_6

http://java-performance.info/changes-to-string-java-1-7-0_06/
Comment 4 velten 2013-09-26 16:11:16 UTC
Putting perhaps an hint in documentation would be very good.