Bug 30978

Summary: removeSheetAt() method corrupts workbook
Product: POI Reporter: forevernd
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: major CC: phamlen
Priority: P3    
Version: 2.5-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: Zip with original workbook - corrupt workbooks - error screenshot
Affected worksheet (almost blank) + BiffViewer output
Exel 2003 get this error while viewing. Used removeSheetAt() method
Zip with test code and Excel files

Description forevernd 2004-09-01 01:00:09 UTC
I have an excel workbook A with 3 worksheets. I need to create 3 additional  
workbooks each contains a worksheet from workbook A. Because currently HSSF 
model doesn't allow copy sheets, I thought I could workaround by openning the 
workbook A, remove 2 worksheets, write it back out as workbook A1, (and keep 
doing the samething until all workbooks have been produced). Here is my code:

public void testRemoveSheets() {
  String savePath = "C:/temp/Doc/";
  String sheetNames[] = {"Sheet1", "Sheet2", "Sheet3"};
  try {
     for (int i = 0; i < sheetNames.length; i++) {
       String sn  = sheetNames[i];
       POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(savePath 
+ "TestRemove2.xls"));
       HSSFWorkbook wb = new HSSFWorkbook(fs);
       FileOutputStream fos = new FileOutputStream(savePath + "TestRemove2-" + 
sn + ".xls");
		    	
       for (int j = 0; j < sheetNames.length; j++) {
         String s = sheetNames[j];
         if (!s.equals(sn)) wb.removeSheetAt(wb.getSheetIndex(s));
       }
				
       wb.write(fos);
       fos.close();
    }
  }
  catch (Exception e) {System.out.println(e.toString());}
}

Everything seamed to be OK because I received no error. However, when I tried 
to open the newly generated workbooks, MS Excel (ver 2002) gave me an error. 
Here is the error log
**************
Microsoft Excel File Repair Log

Errors were detected in file 'C:\temp\Doc\TestRemove2-Sheet1.xls'
The following is a list of repairs:

Removed one or more invalid names.
**************

Data seems to be OK but the error is annoying/scary to users. And the file 
size is always bigger than it would be if generated individually.

This bug is similar to bug # 29619 - Excel file becomes corrupt after sheet 
deletion.
Comment 1 forevernd 2004-09-01 01:07:20 UTC
Created attachment 12586 [details]
Zip with original workbook - corrupt workbooks - error screenshot
Comment 2 Damien Conlon 2005-04-28 15:46:53 UTC
(In reply to comment #0)
...
> Data seems to be OK but the error is annoying/scary to users. And the file 
> size is always bigger than it would be if generated individually.
...

I have just worked around this issue in a project I'm working on. I haven't been
successful in creating a test case, but still thought I'd describe what I know -
hopefully it may help isolate what's causing Excel to display this error.

* We have Excel 2000 and Excel 2003 in the office. The "Removed one or more
invalid names." error only appears in Excel 2003. However, it appears in 2003
even if the input XLS file was created and saved in 2000.

* I deleted all charts and macros from the workbook and re-saved, but the error
still occurs.

* I worked around the error by making sure never to write to any sheets before
removing a sheet that precedes them. So e.g. don't write to Sheet3 and then
delete Sheet2; instead, delete Sheet2 and *then* write to any sheets that come
after it. However, this factor by itself does not cause this error to be
displayed. I'm not sure what else also needs to happen.
Comment 3 Andy Oliver 2005-04-28 16:11:12 UTC
note: if someone attaches an affected workbook, I imagine running BiffViewer
will show that there are still objects referring to the sheet (look for "Sheet1"
or the name of the deleted sheet).  If that doesn't pan out run POIFSViewer and
there must be embedded objects related to the sheet.  If we can figure out what
those typically are, we should be able to fix this.
Comment 4 Damien Conlon 2005-05-04 21:46:27 UTC
Created attachment 14934 [details]
Affected worksheet (almost blank) + BiffViewer output

Zip file contains a workbook with most of its content removed, but which still
gives the "Removed one or more invalid names" error upon opening in Excel 2003.


Also contains BiffViewer dump obtained by calling BiffViewer.run() on the
affected workbook.
Comment 5 Suresh H 2005-12-22 07:51:24 UTC
guys 

please update how to fix this bug
Comment 6 Subhasish 2006-01-12 08:00:05 UTC
Created attachment 17395 [details]
Exel 2003 get this error while viewing. Used  removeSheetAt() method
Comment 7 Matt Hill 2007-05-08 09:47:42 UTC
Any update on how to fix or work around this bug?

If more info is still needed, let me know.  It's very easy to recreate.
Comment 8 David Fisher 2007-05-08 10:25:09 UTC
The developers are concentrating on any issues affecting the 3.0 release.

The release candidate 4 of POI 3.0 is now available at:

http://people.apache.org/~nick/POI-3.0-RC4/

Would you please test if this bug is still an issue.

Comment 9 Matt Hill 2007-05-08 10:58:02 UTC
Thanks for the quick response.

I downloaded release candidate 4 of POI 3.0, and I still get the error.
Comment 10 David Fisher 2007-05-08 11:40:14 UTC
Are you using one of the files attached to this issue, or some other file?

If some other file then you should attach the file.

Also, are you sure of the provenance of the file - do you know where it came
from? What version(s) of Excel, and on operating system(s)

All this will help my developer decide if they can do anything quickly, or if
there is a way to determine what Excel objects are missed after the
removeSheetAt call.

I think that there needs to be an exception thrown if there are any dependencies
in the rest of the Workbook to the Worksheet that is being removed. If POI
throws a meaningful exception then that will allow you to inform your user, etc.

I'm at the project management level. I'll see what Yegor thinks, but I can't say
when we will have more information.

Regards,
Dave
Comment 11 Matt Hill 2007-05-08 14:54:28 UTC
Thanks, Dave.

I attached a zip containing test code and a test Excel file.  I also included
the resulting Excel file with the error.

I created the attached test Excel file using Excel 2002 SP3 on Windows XP.

Let me know what you find.
cheers!
matt
Comment 12 Matt Hill 2007-05-08 14:54:32 UTC
Created attachment 20149 [details]
Zip with test code and Excel files
Comment 13 Robert Baric 2007-10-15 04:56:03 UTC
I can tell that this bug is still present in POI 3.0.1 final with Excel Versions
2003 SP3 (Viewer) and Excel 2007.
regards 
 robert

Comment 14 Robert Baric 2007-11-20 04:57:16 UTC
I wonder why this bug has still "needinfo" status. It has been opened three
years ago and seems to be well documented.
If this issue cannot be reproduced by the developers, please let us know what
information is needed.

Comment 15 Sreedhar Seelam 2008-01-30 03:00:27 UTC
Any update on fix for this bug?

Comment 16 Nick Burch 2008-02-14 07:03:39 UTC
Unfortunately, to fix this bug, we're going to have to find a needle in a
haystack (the record with the reference to the deleted sheet - since poi will
open the spreadsheet with the removed sheet, it isn't a simple corruption issue)

In order to find our needle, we'll initially need three files:
* the original file with all the sheets in it
* the file after removing the sheet in poi
* the file after removing the sheet in excel

Compare the latter two files, see the differences, and in theory you've found
the record(s) we need to update. Alas it isn't quite that simple...

Next, we have some fun with the fact that saving a file in excel and saving it
in poi will give slightly different files. So, we need to open files 1 and 3 in
poi, and save those again without changes. Now, find the difference between 1
and 1-from-poi, and 3 and 3-from-poi. Knowing those, you can see what
differences there are from just the saving.

Finally, find the differences between files 2 and 3, and ignore those
differences that can be tracked back to excel save vs poi save. Hopefully that
will leave you with only one or two differences, which are your needles. Once we
know what those records are, we can work to fix.

Anyone affected by this issue fancy spending an hour or two hunting down the needle?
Comment 17 Peter Hamlen 2008-06-11 17:54:39 UTC
I have some additional information that may help some people with a workaround.

I'm still working on generating a neat and tidy testcase but, in the meantime, here's what I know:

1)  I have a spreadsheet generated with Excel 2003 on a Windows XP machine.  It contains a total of 19 sheets.  

2)  I read in the 19 sheets and try to delete 12 of them.

3)  I experience the "corrupted workbook" error whenever I delete the 9th sheet - REGARDLESS of which sheet it is.  In other words, I can delete any 8 sheets but if I delete the 9th, it breaks.

4)  By adding an additional sheet to my workbook, my code then failed after deleting 10 worksheets.

5)  I was able to delete all 12 of the worksheets I needed by simply adding extra blank worksheets at the end.

Therefore, for people trying to solve this via a workaround, you might try adding additional blank worksheets (I also hide them using EXCEL so they don't display.)
Comment 18 Michael Zalewski 2008-06-13 12:21:15 UTC
I was looking at the BIFF dumps of some of these files that don't open after sheet deletion.

I noticed what may be the problem.

If you have defined a print area when a sheet is printed, Excel remembers the dimensions of the print area in a NAME record. The NAME record has a 0 based index to the sheet where the area is located. Of course, if you delete a sheet, these indexes must be adjusted for any NAME record referring to a sheet after (with a higher index than) the sheet which was deleted. Apparently, POI does not make this adjustment.

So for example, the second attachment to this report shows the BIFF dump for a corrupted file. The file has 5 sheets, so the valid sheet indexes are 0 .. 4

There are 4 NAME records, with sheet indexes 1, 2, 4 and 6. The last one is the culprit (I believe)

In the case of the fourth attachment, the file result.xls shows 2 sheets named 'Page 1' and 'Page 3'. So the valid sheet indexes would be 0 or 1

There is 1 NAME record with a sheet index of 3.
Comment 19 Nick Burch 2008-06-16 06:26:21 UTC
(In reply to comment #18)
> So for example, the second attachment to this report shows the BIFF dump for a
> corrupted file. The file has 5 sheets, so the valid sheet indexes are 0 .. 4

NamedRanges are slightly special though - they reference ExternalSheetIndex not the normal Sheet index/order. These ExternalSheetIndexes are allowed to point to sheets that no longer exist

If you create a named range pointing at a sheet, then delete that sheet, excel is quite happy to keep that named range. It will have "#REF" for the sheet part, but otherwise be all there

Take a look at src/testcases/org/apache/poi/hssf/data/30978-deleted.xls - this has a named range pointing to a deleted sheet, created by excel

So, I think this isn't necesserily the problem
Comment 20 Michael Zalewski 2008-06-16 12:03:08 UTC
(In reply to comment #19)
That's not what I see

The *formula* for the named range might refer to deleted sheets. But the *sheet index* field of the BIFF (a short integer at offset 8-9), I believe, must refer to an existing sheet.

When I look at 30978-deleted.xls, I find three NAME records, for user defined ranges "OnOne", "On2" and "OnThree". Apparently "On2" refers to the deleted Sheet2 somehow. I think the 'formula' part of the NAME record still refers to the deleted sheet. But the Sheet Index field of the record has been adjusted.

All three NAME records in 30978-deleted.xls have a Sheet Index of 0. Since there are two sheets in the workbook, the allowed range for the sheet index value should be 0..1, and these NAME records meet that criterion.

I believe, to fix this bug, the 'sheet index' field of all NAME records must be appropriately adjusted, and NAME records which belong to a sheet which is deleted must also be removed.

Side Note: Even if you do this, deleting a sheet with a macro will still not work, because there is yet another structure that exposes sheet names as property names to the macro object 'ThisWorkbook'. Macros are not involved in this bug, but the presence of macros will cause deleteSheet() to corrupt the workbook.(In reply to comment #19)
Comment 21 Nick Burch 2008-06-18 04:39:56 UTC
Ah, done some more digging and I see what you mean about the sheet index. It can be zero (my test file only had zero), but if it's non-zero then it's a 1-based index to the sheet

I've committed a patch to POI to fix up the 1-based sheet indexes as needed when deleting the sheet. However, excel still gives an error about names when opening

I think the issue is down to us changing the Ptg when we write the name out. There's a disabled test in svn (usermodel/TestBugs.java#BROKENtest30978)

Josh - any ideas why the Ptg is being changed?
Comment 22 Josh Micich 2008-06-20 00:26:40 UTC
Fixed in svn r669809

Ptg class hierarchy needed adjustment for DeletedRef3DPtg and DeletedArea3DPtg.  These are not true sub-classes of Ref3DPtg and Area3DPtg.  Similar to the fix for bug 45091.
Comment 23 Nick Burch 2008-06-20 02:34:08 UTC
To confirm - I have re-run the test from Matt Hill (attachment 20149 [details]) now that Josh's fix is in. Excel 2003 has no issues when opening the output result.xls, so it looks like we've finally got this one fixed :)