Bug 44827

Summary: corrupt xls file after modifying one column
Product: POI Reporter: John Rodriguez <john.rodriguez>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED WORKSFORME    
Severity: major    
Priority: P2    
Version: 3.0-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description John Rodriguez 2008-04-15 13:13:59 UTC
I am using HSSF to read 14000+ line xls file and save it into a copy, with one modification: stripping RTF control chars from one column.  When I tried to open this output xls, I get "Excel found unreadable content in [filename]."  If I allow Excel (v 2003 SP3) to repair the file, it changes my date and currency columns to general cell format.

Code:

Iterator rit = sheet.rowIterator();
rit.next();
for(int i = 1; rit.hasNext(); i++) {
  row = (HSSFRow)rit.next();
  HSSFCell commentsCell = row.getCell(commentsIdx);
  if(commentsCell != null) {
    rtfComments = commentsCell.getRichStringCellValue().getString();

    RTFEditorKit kit = new RTFEditorKit();
    Document doc = kit.createDefaultDocument();
    kit.read(new StringReader(rtfComments), doc, 0);
    txtComments = doc.getText(0, doc.getLength());

    commentsCell.setCellValue(new HSSFRichTextString(txtComments));
  }
}

// write converted workbook to file
FileOutputStream fileOut = new FileOutputStream(outfile);
wb.write(fileOut);
fileOut.close();
Comment 1 Josh Micich 2008-04-15 14:57:11 UTC
You'll probably need to upload an example XLS file to help diagnose this bug.  Please cut down the XLS file as much as possible, while still reproducing the error.

BTW does the code you have below still cause a corrupt file if you exit the loop after modifying only one cell?
Comment 2 John Rodriguez 2008-04-16 11:44:56 UTC
I debugged the problem a little further.  When I run the program and break the loop after a given number of iterations, I noticed that the output file was valid for all rows until the row in which txtComments is "", i.e.,  

commentsCell.setCellValue(new HSSFRichTextString(""));

I tried to trim the file to a few thousand rows where this case occurs, but I couldn't duplicate the problem.  Perhaps the large file size contributes to the problem.

Actually, I tried another solution.  I saved the original file under a different name in Excel, same content, and the file size decreased significantly.  I think the export tool we use bloats the Excel file and HSSF has trouble resaving it?

The original excel file is here: http://www.columbia.edu/~jr534/dbo_contractproducts.original.zip
Comment 3 Josh Micich 2008-04-18 10:39:06 UTC
I took a look at the example.  Same behaviour as you pointed out.  When excel saves the original file it goes from 24M down to 12M.  POI can manipulate the second file OK.

So there is still the problem with the original file.  POI is taking a file that Excel can read, and transforming it into one that excel can't read.  In the example code, if you stop iteration at row 611 everything is OK. Row 612 is the first row where the cell F text does not begin with "{\rtf". The code you provided probably has a small bug because it transforms this non-rtf string into empty string.

So, an easy fix to your specific problem might be to handle that situation properly.

I re-wrote the test code to just replace one cell (F3) text with empty string.  The output file has the same problem.  This is the simplest way I have found to reproduce the bug.  If I replace the cell text with "a", everything works fine.

I'm not sure what's so special about empty string.  I took a look at the various files (24M before + after, 12M before + after) using BiffView, and could not see anything weird that POI is doing.  My suspicion is that there is something wrong with the rest of the file (that POI doesn't touch) that only causes Excel to fail when the SST record gets an extra entry of empty string.  Perhaps it has something to do with (not) purging unused values from the shared string table.


For the record, which utility are you using to create this original Excel file?

Please post back if you still need resolution to this bug.  If you are OK with not replacing the cell text with empty string, this bug may be of much lower priority.
Comment 4 David Fisher 2008-12-29 16:51:24 UTC
Hard to know how important this is until the OP replies to Josh.
Comment 5 Dominik Stadler 2013-08-05 09:08:21 UTC
no response in a long time => resolving for now, please reopen with more information ifnthis is still an issue