Bug 57593

Summary: Unable to read an Excel file. After editing nothing and resaving in the Office - no problem, the file is readable by POI
Product: POI Reporter: Vasili <gavr145>
Component: HSSFAssignee: POI Developers List <dev>
Status: CLOSED FIXED    
Severity: critical CC: gavr145
Priority: P5    
Version: 3.11-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   

Description Vasili 2015-02-18 15:31:31 UTC
java.lang.Class - java.lang.IllegalArgumentException: The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. Is it really an excel file?
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:223)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:245)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:188)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:305)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:286)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:60)


I can't send the problematic file right now because it containts private information and has a lot of stuff. After removal of a single byte (or not editing at all) and saving - the problem is gone. I can speculate that the file had encryption or disabled cells which were removed. So, something was not cleared completely. After re-saving - the office fixes something automatically, so the problem is automatically gone. But we need this to be fixed since users may submit ANY file state and from the user's point of view the file is completely normal and no way to distinguish the wrong file which causes the crash from the good files.
Comment 1 Nick Burch 2015-02-18 15:34:09 UTC
What is the type of the file? If you grab the Apache Tika 1.7 CLI and run it with --detect on the file, what does that report the type as being?
Comment 2 Vasili 2015-02-18 16:07:55 UTC
From our perspective, this xlsm is normal, like thousands of other files. 
What I noticed, MS Office shows yellow "Security Warning Some active content has been disabled. Click for more details." Pressing that button shows nothing, and the message is gone. After any re-saving in the Office (even without pressing on the warning) the problem is also gone.



Here is what Apache Tika shows:

>java -jar tika-app-1.7.jar --detect 096113YE06.xlsm
application/x-tika-ooxml-protected
Comment 3 Nick Burch 2015-02-18 16:13:40 UTC
As Tika has shown, the file in question is an encrypted OOXML, though possibly with the default password. As such, HSSF can't read it

See http://poi.apache.org/encryption.html#XML-based+formats+-+Decryption for details of how to decrypt it to pass to XSSF

(If you use a newer version of POI, you'll get a more helpful exception, but the underlying issue is that you're calling the wrong bit of POI)
Comment 4 Vasili 2015-02-18 16:29:04 UTC
OK, thanks for your fast help! 
I'll try to deal with that

Regards
Vasili
Comment 5 Vasili 2015-02-20 18:39:21 UTC
Look, how it behaves...

>java -jar tika-app-1.7.jar --detect 096113YE06.xlsm
application/x-tika-ooxml-protected

but after I open and immediately close the file in Excel (neither changing anything nor changing the file type), it becomes normal (from the point of view of tika and POI)

>java -jar tika-app-1.7.jar --detect 096113YE06_just_resaved.xlsm
application/vnd.ms-excel.sheet.macroenabled.12


After the re-saving HSSF api works as for thousands of other files derived from the same template. So, users of the template, during filling it, probably encrypted it (as I guess) and then decrypted. And the file remained in some volatile temp state which was automatically fixed by the re-saving.

I'm wondering whether you can detect the file correctly from HSSF by other means (not raising the Exception, I guess BIFF8 may exist even in the 1st case - if it's not the Office which adds it). Is it possible?

Thanks
Comment 6 Vasili 2015-02-20 19:21:11 UTC
Seems there is a bug here (unfortunately, due to another problem, mentioned in another ticket - I can't try to fix it until we fix the build):
we fall through, and if we had EncryptedPackage - we wouldn't have fallen through the 2nd block. I guess, under the 1st catch we should return something, may be a default.


HSSFWorkbook.java, lines 237+, poi-3.11, we use
-----------------------------------------------------
    public static String getWorkbookDirEntryName(DirectoryNode directory) {

        for (int i = 0; i < WORKBOOK_DIR_ENTRY_NAMES.length; i++) {
            String wbName = WORKBOOK_DIR_ENTRY_NAMES[i];
            try {
                directory.getEntry(wbName);
                return wbName;
            } catch (FileNotFoundException e) {
                                       
            }
        }
        
        // check for an encrypted .xlsx file - they get OLE2 wrapped
        try {
        	directory.getEntry("EncryptedPackage");
        	throw new EncryptedDocumentException("The supplied spreadsheet seems to be an Encrypted .xlsx file. " +
        			"It must be decrypted before use by XSSF, it cannot be used by HSSF");
        } catch (FileNotFoundException e) {
            // fall through
        }

        // check for previous version of file format
        try {
            directory.getEntry("Book");
            throw new OldExcelFormatException("The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. "
                    + "POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)");
        } catch (FileNotFoundException e) {
            // fall through
        }

        throw new IllegalArgumentException("The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. "
            + "Is it really an excel file?");
    }
------------------------------------------------------------
Comment 7 Nick Burch 2015-02-21 12:00:44 UTC
(In reply to Vasili from comment #5)
> Look, how it behaves...
> 
> >java -jar tika-app-1.7.jar --detect 096113YE06.xlsm
> application/x-tika-ooxml-protected
> 
> I'm wondering whether you can detect the file correctly from HSSF by other
> means (not raising the Exception, I guess BIFF8 may exist even in the 1st
> case - if it's not the Office which adds it). Is it possible?

I'd suggest you try opening it with XSSFWorkbook. If it's really XSSF, it'll open. If it's encrypted, you'll get an EncryptedDocumentException and you know to try decrytping it. If it's really a .xls file, you'll get (IIRC) an InvalidFormatException to tell you to retry with HSSF

Otherwise, you should be fine using WorkbookFactory.create and catching EncryptedDocumentException. (Use a nightly / 3.12 beta 1 when out though, some improvements have happened)
Comment 8 Vasili 2015-02-23 14:28:06 UTC
Hi Nick,

It seems that trying with different API should be hidden by facade in the framework, because applications have no idea - what type of the file arrives.

If it is the only way - I could try and if it succeeds I will have to do the same wrapper, hiding the abovementioned decisions, but it would be more correct if poi were doing that.

Right now we use the most universal way, i.e.

WorkbookFactory.create(fis);

And a lot of business code (including 3rd party, which we do not control) is already using this approach, so it will be hard to change...

I will return to this later, after the tests.

Thanks
Comment 9 Vasili 2015-02-23 17:16:09 UTC
As you suggested, and to let you know, I tried to open the file with XSSFWorkbook first. It failed since it is not zip, it is BIFF8.
Even if I fall through into HSSF, I'm still getting the same exception. 
So, no sense to use XSSFWorkbook for that, and I use WorkbookFactory as before.

The last does not throws EncryptedDocumentException throwing the above-mentioned exception instead.
I believe, it is rather a bug.

I just tried nighly builds from Jenkins integration server (having downloaded poi-3.12-beta2 tarballs) and got the same pointing to those binaries (unfortunately, I can't debug and step into the code right now due to the build problem 57612):

------8<------
java.lang.IllegalArgumentException: The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. Is it really an excel file?
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:223)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:245)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:188)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:305)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:286)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:60)
------8<------
Comment 10 Vasili 2015-02-23 21:35:33 UTC
Just a correction here (due to resolving of a CLASSPATH issue on my side, which was taking poi-3.7-20101029.jar)

So, now the file in question is throwing the EncryptedDocumentException, as you had expected. 

org.apache.poi.EncryptedDocumentException: The supplied spreadsheet seems to be an Encrypted .xlsx file. It must be decrypted before use by XSSF, it cannot be used by HSSF
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:252)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:308)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:288)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:223)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:381)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:363)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:84)



However the file seems to be not encrypted, although detected by all the tools as encrypted

$ file 096113YE06.xlsm
096113YE06.xlsm: Composite Document File V2 Document, Encrypted

while after opening and immediate closing without any other internal modifications (and no passwords known) it is determined as:

$ file 096113YE06_just_resaved.xlsm
096113YE06_just_resaved.xlsm: Microsoft Excel 2007+

Any suggestions? (We can't know the password because there is no password there)

Thanks and sorry for the confusion with the old version.
Comment 11 Nick Burch 2015-02-24 09:33:29 UTC
Can you see if the file can be decrypted by POI using the default Excel password? (VelvetSweatshop). My guess is that the file has been "protected" with that default password, which is why you can read it in Excel without being prompted for a password, but why it is stored in the encrypted form

On a related note, I wonder if we should have overloaded / altenate WorkbookFactory methods which also take a password? Your code could then be something like:

try {
   wb = WorkbookFactory.create(file);
} catch (EncryptedDocumentException e) {
   String password = promptUserForPassword(file);
   wb = WorkbookFactory.create(file, password);
}
Comment 12 Vasili 2015-02-24 17:22:01 UTC
I've tried Biff8EncryptionKey.setCurrentUserPassword("VelvetSweatshop");
before opening through WorkbookFactory.create(fis);

and the result was the same: 

org.apache.poi.EncryptedDocumentException: The supplied spreadsheet seems to be an Encrypted .xlsx file. It must be decrypted before use by XSSF, it cannot be used by HSSF
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:252)

I agree with you that the file might had been protected, then unprotected, or may have some particular protected parts, nobody knows, may be only MS guys/developers. I assume, if it can be opened in MS Excel (without any indication about the encryption), it is a valid BIFF8 file.
Excel is doing something automatically (fixes?) after any saving:

$ file 096113YE06.xlsm
096113YE06.xlsm: Composite Document File V2 Document, Encrypted

$ file 096113YE06_just_resaved.xlsm
096113YE06_just_resaved.xlsm: Microsoft Excel 2007+

What exact usage/API are you suggesting for HSSF? It is for sure not xml file, it is a binary file.
Comment 13 Nick Burch 2015-02-24 17:23:57 UTC
For an encrypted .xlsx file, you need to follow the approach detailed here - http://poi.apache.org/encryption.html#XML-based+formats+-+Decryption - against an open NPOIFSFileSystem. Do not call any HSSF code, that won't work!
Comment 14 Vasili 2015-02-24 17:29:51 UTC
But that is xml decryption, isn't it?
The encryption, you are suggesting, is encryption over xml files, and so is the API.
That file seems to be binary, BIFF8, so any xml (XSSF) code will fail.
If it is binary, it must be opened with HSSF only. Although, I understand, that we might have caught some undocumented MS binary format feature here (?)

Am I wrong here?
Comment 15 Nick Burch 2015-02-24 17:32:39 UTC
As the docs state, and as I've said several times, you need to follow those directions to decrypt the encrypted .xlsx, then pass the decrypted input stream to xssf to load. Please follow the docs given!
Comment 16 Vasili 2015-02-24 17:54:31 UTC
ok, finally, I succeeded, using the following (xml) API and using the password you provided: 


		POIFSFileSystem fs = new POIFSFileSystem(fis);
	        EncryptionInfo info = new EncryptionInfo(fs);
	        Decryptor d = Decryptor.getInstance(info);
	        
	        System.out.println(d.verifyPassword("VelvetSweatshop")); 
	        
	        InputStream decryptedDataStream = d.getDataStream(fs);
	        
	        OPCPackage pkg = OPCPackage.open(decryptedDataStream);
	        XSSFWorkbook workbook = new XSSFWorkbook(pkg);

so this can be closed now. (so lot of confusion with different APIs!)

Thanks for your support.
Comment 17 Nick Burch 2015-02-25 05:40:14 UTC
If you'd care to suggest the method signature of a WorkbookFactory method which took a password, along with the behaviours for:
 * Call with a null password
 * Called with a password and a non-encrypted file
Then we can look at wrapping up the decryption code in there to make life easy for others!

(I know how to write the code, I'm just not sure on what would make most sense for a user in answer to those two questions...)
Comment 18 Vasili 2015-02-25 14:38:41 UTC
I think, the most sense for us will be maximum backward compatibility, i.e. the old factory methods should remain in-place working as before with 
additional functionality for the default password (the issue described) in the catch block of EncryptedDocumentException .
Exactly what your have proposed in your overloaded implementation. WorkbookFactory.create(fis, password) might be a separate method.
I'm working on something else right now and will move to the wrapper for this issue later, and I might comment with my wrapper 
implementation later - if you will not implement it earlier. Also, I haven't looked deeply into the differences between HSSF and XSSF, so may be mistaken in something. 
I guess, the input stream might be reopened in the catch block while the first stream should be closed (depending on the buffer, it's resetting and how the first call uses the stream).
So, the usage which gives the least number of headaches (at least for us) is something like the following (not tested completely!), backward compatible:


Workbook wb;
try{
   wb = WorkbookFactory.create(fis);
}catch (EncryptedDocumentException e) {
   POIFSFileSystem fs = new POIFSFileSystem(fis);
   EncryptionInfo info = new EncryptionInfo(fs);
   Decryptor d = Decryptor.getInstance(info);
   if(!d.verifyPassword("VelvetSweatshop"))
	throw new Exception("Please use WorkbookFactory.create(file, password) method and supply password"); 
   InputStream decryptedDataStream = d.getDataStream(fs);
   
   OPCPackage pkg = OPCPackage.open(decryptedDataStream);
   wb = new XSSFWorkbook(pkg);
}

An additional method is to be implemented: org.apache.poi.ss.usermodel.WorkbookFactory.create(InputStream is, String password) - to supply additional password
(it will not break the old code since it is additional functionality that had not been assumed before - to supply the password other from the default)

This is my current understanding (before looking into this more deeply)

Thanks!
Comment 19 Nick Burch 2015-04-29 20:36:50 UTC
As of r1676853, overloaded WorkbookFactory.create methods which take a password have been added. Thanks for the idea!