Bug 47543

Summary: Using many HSSFRichTextString cells causes Excel 2003 and lower to crash
Product: POI Reporter: Curtis Browning <curtis>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: critical CC: curtis
Priority: P1    
Version: 3.5-dev   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: Demonstrates the problem
Generated by the attached test program using POI
Generated manually in Excel - same content as test program
driver.vbs - a VBA script for testing
vba-100.xls - a 10x10 grid produced by driver.vbs
Driver2.java - a version of Driver.java that produces output modifiable by Excel 2003
driver2-10000.xls - 100x100 grid produced by Driver2.java

Description Curtis Browning 2009-07-16 09:44:06 UTC
Creating too many HSSFRichTextString cells causes Excel 2003 and lower to crash when changing the color of the cells and then saving the Excel file.

Run the attached Driver program, which creates a file that demonstrates the Excel 2003 errors.
 
 1. Run the program, then open the resulting file in Excel 2003 or lower.
 2. Select all cells by clicking on the corner cell
 3. Press the color toolbar button to change the color of all cells
 4. Press the save button in Excel and it will crash.

Note: The row and column count is only 10 in the attached example, but if you increse the ROWS and COLS constants to 100 in the makeData()method below, then Excel 2003 will crash after step 3 above.

This is a critical bug for our product, with very large customers complaining vociferously. We are running POI 3.1 now but I tested this with 3.5 beta and get the same results.

We often generate large workbooks that contain tables with many footnotes, hence the need for HSSFRichTextString. To work around the HSSFRichTextString problem for other cells, we have been forced to fallback to using the deprecated version of HSSFCell.setCellValue() that takes a plain String instead of the recommended HSSFRichTextString.
Comment 1 Curtis Browning 2009-07-16 09:47:31 UTC
Created attachment 23995 [details]
Demonstrates the problem

I thought this was attached with the first submittal, but it didn't seem to take it.
Comment 2 Yegor Kozlov 2009-07-19 23:40:27 UTC
The problem confirmed in current trunk. 

Changing font color for a 10x10 grid works fine but crashes Excel for a 100x100 grid. 

Excel 2007 has no problems with such 100x100 files, the problem is observed in Excel 2003 and lower. 

Regards,
Yegor
Comment 3 Curtis Browning 2009-07-20 07:32:13 UTC
I would like to note also that in the 10x10 example, Excel 2003 will also crash if one attempts to save the Excel file immediately after changing the cell colors.
Comment 4 Nick Burch 2009-07-27 08:04:06 UTC
Just to check - if you create a file like the one that crashes Excel 2003 using excel itself, does that open fine in excel 2003? (i.e. is it a problem with excel and lots of rich text formatting, or just in how poi writes out the rich text formatting?)

If it is a poi issue, any chance you could create two files:
* the smallest file produced by poi with rich text that crashes excel 2003
* a file with the same contents produced by excel

(If you have time, it'd be great if you could use BiffViewer to try to spot what excel and poi have done differently between these two files)
Comment 5 Curtis Browning 2009-07-27 09:03:22 UTC
Created attachment 24042 [details]
Generated by the attached test program using POI 

This file was generated by the test program attached to this report. If one follows the steps in the report, Excel will crash when the save button is pressed.
Comment 6 Curtis Browning 2009-07-27 09:05:19 UTC
Created attachment 24043 [details]
Generated manually in Excel - same content as test program

This file was (painstakingly) created in Excel 2003 by manually editing each cell. It has the same visual content as the file generated by the POI test program, but it doesn't crash Excel 2003 when the cell colors are changed and the file is saved.
Comment 7 Curtis Browning 2009-07-27 09:13:57 UTC
Hello, 

In response to your first question:
No, this is not a problem with many rich strings in Excel 2003. 

See the two attached Excel documents which contain the same visual information (a 10x10 cell matrix with superscripts on each cell value). The one named test_poi_export.xls was generated using the attached test program. When one highlights all the cells, changes their text color, then presses Save in Excel 2003, Excel 2003 will crash. When one follows the same steps with the second attached Excel workbook (from_excel.xls), the text color changes correctly and pressing the Save button in Excel 2003 does not crash the application.

I attempted to open these files in BiffViewer, but whereas the one created by POI opens fine, the second file ("from_excel.xls", created manually in Excel 2003), crashes the BiffViewer application with a general protection fault. I don't care that the one created in Excel 2003 crashes BiffViewer, what we need is for POI to generate a file that does not crash Excel 2003.

Regards,
Curtis
Comment 8 Nick Burch 2009-07-28 06:32:03 UTC
Thanks for the info and files Curtis

Step one will be to get BiffViewer to be able to open the excel file without crashing. 

Once we've sorted that, we can then compare the two files at the record level, and see what Excel has done differently about encoding the rich text. 

Finally, when we know that, we can make a stab at changing how poi saves large amounts to rich text to closer match excel
Comment 9 Curtis Browning 2009-07-28 07:53:30 UTC
Thanks Nick,

Are you asking me to track down the reason why the file created in Excel 2003 does not open in BiffViewer, or is that something that you will pursue?

My guess is that it has something to do with the CONTINUE record in the SST. When I was writing Excel export code in C++ many years ago, I found that there are documentation problems with the way that CONTINUE records are handled by Excel. It took a long time and much effort to figure out what Excel really does when rich text strings are continued in the SST, but we finally got it working correctly.

Regards,
Curtis
Comment 10 Yegor Kozlov 2009-07-28 22:45:53 UTC
Guys,

Creating a text .xls file manually is  not a good idea - Excel creates a lot of garbage in this case. I created a simple VBA script that does a job similar to the attached Driver.java. The output is always readable by BiffViewer AND tolerant to changing font colors, no matter how large the grid is - 10x10, 100x100, etc. For all tests produced by the VBA script Excel allows changing the font color. 

I compared POI- and VBA- versions and didn't see anything suspicious about CONTINUE records - all looks sane.  

I won't be able to look into it more earlier than this weekend. If anyone wants to track this bug, I suggest the following strategy:

 - modify Driver.java and driver.vbs to produce equivalent output
 - for both POI- and VBA- code use a blank template XLS to minimize the differences. 
 - start with a small grid, say 5x5. 
 - start tweaking POI to produce output as much close to VBA as possible. For testing purposes you may need to temporary modify record classes.
 
I won't be surprised if the culprit is a record absolutely irrelevant to SST - you never know it with Excel :)
 
Regards,
Yegor
Comment 11 Yegor Kozlov 2009-07-28 22:46:52 UTC
Created attachment 24056 [details]
driver.vbs - a VBA script for testing
Comment 12 Yegor Kozlov 2009-07-28 22:48:05 UTC
Created attachment 24057 [details]
vba-100.xls - a 10x10 grid produced by driver.vbs
Comment 13 Yegor Kozlov 2009-07-31 12:11:06 UTC
Quite an interesting bug.

Here is the problem in a nutshell:

For reach text containing N runs Excel saves font the first run in the cell style and subsequent N-1 runs override the font in the cell style.

Compare two snippets that produce equivalent output:

1. POI approach: 

 HSSFCell hssfCell = row.createCell(idx);
 //rich text consists of two runs
 HSSFRichTextString richString = new HSSFRichTextString( "Bug 47543" );
 richString.applyFont( 0, 4, font1 );
 richString.applyFont( 4, 8, font2 );
 hssfCell.setCellValue( richString );

2. Excel approach

 //create a cell style and assign the first font to it
 HSSFCellStyle style = workbook.createCellStyle();
 style.setFont(font1);

 HSSFCell hssfCell = row.createCell(idx);
 hssfCell.setCellStyle(style);

 //rich text consists of one run overriding the cell style
 HSSFRichTextString richString = new HSSFRichTextString( cellValue );
 richString.applyFont( 4, 8, font2 );
 hssfCell.setCellValue( richString );

It turns out that the way you create HSSFRichTextString cells is important, (1) results in Excel crash while (2) always seems to work. 

See Driver2 - a version of the attached Driver program that produces valid output readable and modifiable by Excel 2003. 

driver2-10000.xls is a sample output of 100x100 grid.

Regards,
Yegor
Comment 14 Yegor Kozlov 2009-07-31 12:13:03 UTC
Created attachment 24075 [details]
Driver2.java - a version of Driver.java that produces output modifiable by Excel 2003
Comment 15 Yegor Kozlov 2009-07-31 12:13:50 UTC
Created attachment 24076 [details]
driver2-10000.xls - 100x100 grid produced by Driver2.java
Comment 16 Yegor Kozlov 2009-07-31 12:16:34 UTC
Curtis,

Can you try the suggested workaround and let me know whether it works for you or not. 

Yegor
Comment 17 Curtis Browning 2009-08-03 08:59:17 UTC
Thanks very much Yegor, the suggested change did indeed address the problem. It makes sense also given the way that Excel stores font information for rich strings in the SST. 

I would suggest also that the POI documentation on using rich text strings be updated to reflect your findings regarding the correct usage. When I looked back at some of the internet questions and postings related to problems using POI with large numbers of HSSFRichTextString objects, I now suspect that the same issue may be involved.

Thanks again for your help, 
Curtis
Comment 18 Yegor Kozlov 2009-09-13 07:56:46 UTC
Finally, I updated the javadoc for HSSFRichTextString  to reflect my research on this issue.

Yegor