Bug 58260

Summary: Formatting Fails after 32768 Rows. Works again after 65538
Product: POI Reporter: Craig <cgleckman>
Component: SXSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: awm, mshlayen
Priority: P2    
Version: 3.12-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Bug Depends on: 58005    
Bug Blocks:    
Attachments: stand alone program that creates an excel document with 70K rows
Input file for sample.java

Description Craig 2015-08-18 17:18:27 UTC
When creating large excel spreadsheets with SXSSF, it seems that cell formatting stops being applied at row 32769 and works again on row 65538. The formatting I am trying to apply is right-align using setAlignment. Additionally, if I attempt to add borders, foreground color, bold, or any other sort of style, the same thing occurs.
Comment 1 Nick Burch 2015-08-18 17:36:42 UTC
Any chance you could create a small junit unit test that shows the problem?
Comment 2 Craig 2015-08-18 18:17:12 UTC
(In reply to Nick Burch from comment #1)
> Any chance you could create a small junit unit test that shows the problem?

Unfortunately this is a work program and cannot use jUnit (do not have the jar files or a way to add them to the system due to security protocols). I can make a sample java program that generates a 70k row excel document based off of a notepad txt document. Would that suffice?
Comment 3 Nick Burch 2015-08-18 18:27:32 UTC
Ideally, we'd want a small program that used SXSSF to write the large number of rows, then read it back with XSSF to show that the formattings are missing on that row range. Doesn't have to be a junit test, if you can do a small standalone program to do that we can convert it. We do want it to check programatically though, rather than by eye, so we can use it to ensure the problem gets fixed + stays fixed!
Comment 4 Craig 2015-08-18 18:46:39 UTC
(In reply to Nick Burch from comment #3)
> Ideally, we'd want a small program that used SXSSF to write the large number
> of rows, then read it back with XSSF to show that the formattings are
> missing on that row range. Doesn't have to be a junit test, if you can do a
> small standalone program to do that we can convert it. We do want it to
> check programatically though, rather than by eye, so we can use it to ensure
> the problem gets fixed + stays fixed!

I will go ahead and create a program to do this . I have never tried reading it back in to check the formatting since our heap size is restricted to 512 and when we tried to open the excel before (for a different reason) it blew the heap. Please give me a few hours and I will have it for you.
Comment 5 Craig 2015-08-18 20:26:34 UTC
(In reply to Craig from comment #4)
> (In reply to Nick Burch from comment #3)
> > Ideally, we'd want a small program that used SXSSF to write the large number
> > of rows, then read it back with XSSF to show that the formattings are
> > missing on that row range. Doesn't have to be a junit test, if you can do a
> > small standalone program to do that we can convert it. We do want it to
> > check programatically though, rather than by eye, so we can use it to ensure
> > the problem gets fixed + stays fixed!
> 
> I will go ahead and create a program to do this . I have never tried reading
> it back in to check the formatting since our heap size is restricted to 512
> and when we tried to open the excel before (for a different reason) it blew
> the heap. Please give me a few hours and I will have it for you.

Nick,

I apologize but I am unable to figure out how to check if a cell, read from an existing excel document, has a format applied to it. Could you please provide guidance? If not I can submit the program I have now (which writes the excel from an input file) and you can add the code to check the formatting.
Comment 6 Dominik Stadler 2015-08-18 20:29:09 UTC
FYI, Bug 58005 talks about a similar problem with XSSF as well, maybe it is not only happening for SXSSF.
Comment 7 Craig 2015-08-18 20:38:14 UTC
(In reply to Dominik Stadler from comment #6)
> FYI, Bug 58005 talks about a similar problem with XSSF as well, maybe it is
> not only happening for SXSSF.

Thank you Dominik. I did not see this before. This does seem like a similar issue. The only difference is their formatting fails much earlier then mine. I am not sure why. Additionally, my formatting seems to reapply itself at the 65538 row mark, which is makes this even more odd. My guess is that the issues are linked but I have been unable to resolve the issue myself after almost a week, which is why I made this bug report.
Comment 8 Craig 2015-08-18 20:49:54 UTC
Created attachment 33013 [details]
stand alone program that creates an excel document with 70K rows

The code creates a single column with 70k rows. The rows are printed as strings and numbers to mimic the code I am using where I found the issue. If you look at the rows between 32768 and 65538 you will see that the right align formatting does not get applied since the rows printed as strings are aligned to the left. The rows printed as numbers are still aligned right, but that is because of the default formatting used within excel. 

Note: I am using the following jars which will need to be added for the program to run:
poi-3.12-20150511.jar
poi-ooxml-3.12-20150511.jar
poi-ooxml-schemas-3.12-20150511.jar
xmlbeans-2.6.0.jar
Comment 9 Craig 2015-08-18 20:50:53 UTC
Created attachment 33014 [details]
Input file for sample.java

My apologies, I forgot to include the sample txt file that is used for input in sample.java
Comment 10 Craig 2015-08-18 20:51:42 UTC
I am providing the code as it stands since I am going into a meeting and will not be able to do anything regarding this issue for the next few hours.
Comment 11 Nick Burch 2015-08-18 23:36:45 UTC
In your small program, you seem to be creating one cell style per row. Cell Styles are workbook-scoped, so you should create them up-front and re-use them when needed. What happens if you create your right-aligned cell style once, and re-use it on each row you want styled?
Comment 12 Dominik Stadler 2015-08-19 10:28:55 UTC
Yes, there should rarely be a reason to have that many styles if you properly re-use them. Also the resulting file will be much smaller.

I tried this:
* For SXSSF and XSSF with > 65535 the file is reported as "corrupt" by Excel, so it seems 65k is the hard limit here
* For XSSF using up to 65k styles works as expected now
* For SXSSF using up to 32k works, up to 65k does not work
* For HSSF there is a different limit of 4k which is enforced correctly already.

So I see two things that we can look at here:
* For SXSSF fix styles up to 65k
* For XSSF and SXSSF fail with an error if more styles are added
Comment 13 Dominik Stadler 2015-08-19 13:38:54 UTC
We have fixed this in r1696586 as follows:
* The check for the maximum number of styles was not effective for XSSF/SXSSF, this should now work and throw an exception if you try to create more than 64000 styles (you should really re-use them anyway!)
* Fixed writing out an SXSSFWorkbook with more than 32767 styles, unfortunately the API uses "short" in some places which needs to be handled with care unless we can adjust the API to "int" here.
Comment 14 Dominik Stadler 2015-08-19 13:42:42 UTC
*** Bug 58005 has been marked as a duplicate of this bug. ***
Comment 15 Craig 2015-08-19 15:11:52 UTC
Hi Nick and Dominik,

Thank you for your responses. I was waiting to reply until I was at work and able to double check my results. I was unaware that you could reuse styles, so that is my mistake. When I stop creating the styles in each loop my problems were resolved. I apologize for creating a bug ticket for something that was clearly not actually a bug. I am glad to see, however, that some adjustments were made for anyone who needs more than 32768 styles. 

Thank you and sorry again.
Comment 16 Dominik Stadler 2015-09-03 19:28:40 UTC
*** Bug 53209 has been marked as a duplicate of this bug. ***
Comment 17 Javen O'Neal 2015-09-05 18:13:46 UTC
Spreadsheets that use naive style creation end up with very large styles tables with duplicate entries. Perhaps the root cause here is we don't have a great way to do either of the following:
1) Make a function to query the styles table if for a given style. The problem here is that you can't create a style to query the styles table without necessarily adding that style to the style table. If I want to start with an existing style, and slightly modify it (say, change the font to bold), I don't have a great way to see if there is another style that is bold in the styles table. Out of laziness for writing my own code that can compare styles, I just create a new style. Do this in a for loop and you've got problems.
2) when saving a workbook or calling a function on-demand, find all duplicate styles in the styles table and deduplicate them, updating all cell references to the deduplicated style. Microsoft Excel seems to deduplucate styles on workbook save.

I can open up a new bug if these fixes would help prevent issues like seen in this bug.
Comment 18 Nick Burch 2015-09-06 22:34:21 UTC
For HSSF we have HSSFOptimiser - https://poi.apache.org/apidocs/org/apache/poi/hssf/usermodel/HSSFOptimiser.html - but there's no SS generic or XSSF equivalent