Bug 54593 - Poor performance when adding borders
Summary: Poor performance when adding borders
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on: 58787
Blocks:
  Show dependency tree
 
Reported: 2013-02-21 10:18 UTC by Nikola Štohanzl
Modified: 2016-04-05 17:10 UTC (History)
1 user (show)



Attachments
generated XSSF workbooks with and witout bordering (9.95 KB, application/zip)
2013-02-21 10:18 UTC, Nikola Štohanzl
Details
test case (4.19 KB, text/plain)
2013-05-30 16:59 UTC, Nikola Štohanzl
Details
Screnshot showing where most of the time is spent (167.62 KB, image/png)
2013-05-30 21:09 UTC, Dominik Stadler
Details
Initial try of caching the toString() result (2.59 KB, patch)
2013-08-12 21:17 UTC, Dominik Stadler
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Nikola Štohanzl 2013-02-21 10:18:52 UTC
Created attachment 29974 [details]
generated XSSF workbooks with and witout bordering

Hi, 

i just started with Apache POI and created a simple test XSSF workbook (see attached files, sheet is locked, password is "kuku"). I noticed a big gap in elapsed times after adding borders. Small benchmark (average of 10 consecutive runs) on my i5@2,5GHz notebook (Oracle JDK 1.7, JVM setting -Xmx256M) shows

XSSF no borders: 700ms
XSSF borders: 2500ms
HSSF borders: 290ms

I use RegionUtil for bordering (as is suggested in the quick guide) and my bordering helper method looks like

private static void thinBorder(Workbook wb, Sheet sh, CellRangeAddress rng) {
	RegionUtil.setBorderBottom(CellStyle.BORDER_THIN, rng, sh, wb);
	RegionUtil.setBorderLeft(CellStyle.BORDER_THIN, rng, sh, wb);
	RegionUtil.setBorderRight(CellStyle.BORDER_THIN, rng, sh, wb);
	RegionUtil.setBorderTop(CellStyle.BORDER_THIN, rng, sh, wb);		
} 

and i use that approach for bordering single cells as well. After a small debugging, i found that underlaying calls to CellUtil method

private static void setFormatProperties(CellStyle style, Workbook workbook, Map<String, Object> properties)

take 30-70ms each. Combined with a creation of different styles during the bordering, that might be the cause of my problem. Without borders, there are 9 different styles (based on Workbook.getNumCellStyles()), with borders, there are 66 styles.

Perhaps my bordering strategy can be optimised, but creation of approx. 25 styles per second (if i understand the code logic well) seems like poor performance and could be a stopper for big workbooks with formatting. I understand that the HSSF implementation migh be more effective avoiding XML, but the difference simply seems too big.
Comment 1 Dominik Stadler 2013-05-30 15:42:31 UTC
Do you have sample code that can be used as test-case which reproduces the problem?
Comment 2 Nikola Štohanzl 2013-05-30 16:59:57 UTC
Created attachment 30345 [details]
test case
Comment 3 Dominik Stadler 2013-05-30 21:09:16 UTC
Created attachment 30346 [details]
Screnshot showing where most of the time is spent

I did a quick performance analysis using dynaTrace, it seems really the underlying XML handling is causing this behavior. It seems XSSFCellBorder calls toString() on the CTBorder element both in hashCode() and equals(). The call is done a large number of times and every time causes XML Text to be generated for the border-element, so almost all the exeuction time of the test case is spent somewhere in there.

So yes, using HSSF instead of XSSF would probably speed it up considerably. Otherwise some caching might help, but for this I still lack details of POI to see if it would be possible. I assume the real issue is in the openformats-framework that is used by POI which seems to have an expensive toString() implementation.
Comment 4 Nick Burch 2013-05-30 22:22:56 UTC
Suggestions gratefully received on a smarter way to do the hashCode and equals methods for XSSFCellBorder!
Comment 5 Dominik Stadler 2013-08-12 21:17:09 UTC
Created attachment 30723 [details]
Initial try of caching the toString() result

I gave it a try and implemented caching of the toString() value in XSSFCellBorder and compared elapsed times, below are the raw results, patch is attached. It improves a lot, but there is still a chance that the changes break stuff as there is a getBorder() method which is used elsewhere and could modify the contents of the border whereas the borderStr might be kept, so this mostly shows the potential of doing toString() differently here, however I still think the base implementation is not optimal by causing a lot of stuff in toString() including synchronization, the same might be the case in many other types that derive from XmlObjectBase as well...

Before:

fileName = /tmp/01.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 531
fileName = /tmp/02.xlsx, doBorders = true
  elapsed bordering: 3424
  styles count: 68
  elapsed total: 3699
fileName = /tmp/01.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 73
fileName = /tmp/02.xlsx, doBorders = true
  elapsed bordering: 2181
  styles count: 68
  elapsed total: 2300
fileName = /tmp/03.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 73
fileName = /tmp/04.xlsx, doBorders = true
  elapsed bordering: 1819
  styles count: 68
  elapsed total: 1902

After:

fileName = /tmp/01.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 269
fileName = /tmp/02.xlsx, doBorders = true
  elapsed bordering: 1580
  styles count: 68
  elapsed total: 1856
fileName = /tmp/01.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 193
fileName = /tmp/02.xlsx, doBorders = true
  elapsed bordering: 643
  styles count: 68
  elapsed total: 770
fileName = /tmp/03.xlsx, doBorders = false
  elapsed bordering: 0
  styles count: 4
  elapsed total: 55
fileName = /tmp/04.xlsx, doBorders = true
  elapsed bordering: 424
  styles count: 68
  elapsed total: 532
Comment 6 Javen O'Neal 2016-04-05 17:10:42 UTC
Bug 58787 might solve some of these efficiency problems, since styles are built up in a standalone data structure then applied in one go, creating fewer intermediate styles and searching through the styles table for matching styles fewer times (setCellStyleProperties).