Bug 58061 - Switch to NPOIFS causes simple XLS created by POI to be corrupt
Summary: Switch to NPOIFS causes simple XLS created by POI to be corrupt
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.13-dev
Hardware: PC Linux
: P2 regression (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Blocks: 56791
  Show dependency tree
Reported: 2015-06-20 14:12 UTC by Dominik Stadler
Modified: 2015-07-07 07:49 UTC (History)
0 users

results from POIFSDump in two directories (1.76 KB, application/zip)
2015-06-22 11:19 UTC, Dominik Stadler
Log when parsing the corrupted file (400 bytes, application/xml)
2015-06-22 14:44 UTC, Dominik Stadler
Log when parsing the good file (2.55 KB, application/xml)
2015-06-22 14:45 UTC, Dominik Stadler
File created with NPOIFS (4.50 KB, application/octet-stream)
2015-06-22 14:45 UTC, Dominik Stadler
File created with OPOIFS (4.00 KB, application/octet-stream)
2015-06-22 14:46 UTC, Dominik Stadler
libgsf-1 python program to check the NPOIFS and OPOIFS files (790 bytes, text/x-python)
2015-06-23 16:33 UTC, Nick Burch

Note You need to log in before you can comment on or make changes to this bug.
Description Dominik Stadler 2015-06-20 14:12:05 UTC
.xls files produced by POI on trunk are not valid right now, it seems a recent change broke HSSF completely.

When running "git bissect", the following is found as culprit:

9d4d015a344dbe95918dcb7b2daa5af2f0677da0 is the first bad commit
commit 9d4d015a344dbe95918dcb7b2daa5af2f0677da0
Author: nick <nick@13f79535-47bb-0310-9956-ffa450edef68>
Date:   Mon May 11 18:04:30 2015 +0000

    #56791 More updates from OPOIFS to NPOIFS
    git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@1678783 13f79535-47bb-0310-9956-ffa450edef68

:040000 040000 277fa7aea6ea9ed9bfaa53cf45c4d390437ac1e3 26deba68dd564b5b790679dab7eeb99abbe550bf M      src
'bisect run' erfolgreich ausgeführt

I used the following command in the "git bissect" to decide if it still works or not, i.e. a simple XLS that is produced by POI will not open in LibreOffice anymore. It's the same for Excel:

java -cp build/classes:build/examples-classes org.apache.poi.hssf.usermodel.examples.CreateCells && \
test -f workbook.xls && \
unoconv -vvv --timeout=10 --doctype=spreadsheet --output=workbook.pdf workbook.xls
Comment 1 Nick Burch 2015-06-21 19:06:35 UTC
The files won't be identical at the byte level, as the code for assigning blocks is different in NPOIFS. Streams should be the same though

It's odd that things are breaking, as I believe people have been using NPOIFS for ages with HSSF without problems.

Does it happen with all files written, or only certain ones? What's the smallest file we can generate with OPOIFS and NPOIFS that shows it? (It's much easier to compare at the byte/raw level something small!)

If you take a problematic file, generate it with OPOIFS and with NPOIFS, then run POIFSDump on the two, are the various streams the same?

Can NPOIFS open and read the problematic file OK? Can OPOIFS read it? Can HSSF read it?
Comment 2 Dominik Stadler 2015-06-22 11:18:30 UTC
The following produces an invalid xls for me both on Linux and Windows:

    public void test58061() throws IOException {
        Workbook wb = new HSSFWorkbook();
        OutputStream out = new FileOutputStream("C:\\temp\\58061.xls");
        try {
        } finally {

I will attach the result of POIFSDump for the a good file (using OPOIFSFileSystem) and a bad one (using NPOIFSFileSystem).
Comment 3 Dominik Stadler 2015-06-22 11:19:00 UTC
Created attachment 32842 [details]
results from POIFSDump in two directories
Comment 4 Dominik Stadler 2015-06-22 11:20:46 UTC
BiffViewer can open both without error and the only difference is in the username that is written to the file. The corrupt one uses the login-name, the other the first- and last-name of the current user.
Comment 5 Nick Burch 2015-06-22 13:00:17 UTC
Any chance you could try running the latest Microsoft Binary File Format Validator on the two files, and see if either gets reported as having problems? If not, we'll have to dive into the hex dumps, which won't be fun or quick...
Comment 6 Dominik Stadler 2015-06-22 14:44:23 UTC
Created attachment 32845 [details]
Log when parsing the corrupted file

bffvalidator c:\temp\58061corrupt.xls
BFFValidator: "c:\temp\58061corrupt.xls" NOT RECOGNIZED (The Microsoft Office Binary File Format Validator encountered an error reading the file you specified, OR The Microsof
t Office Binary File Format Validator supports Word, Excel, and PowerPoint binary file formats only. The file you specified is an unsupported file type.) at 06/22/15 16:42:14
Log at: c:\temp\58061corrupt.xls.bffvalidator.06-22-15_16-42-14.xml
Comment 7 Dominik Stadler 2015-06-22 14:45:02 UTC
Created attachment 32846 [details]
Log when parsing the good file

bffvalidator c:\temp\58061good.xls
BFFValidator: "c:\temp\58061good.xls" FAILED at 06/22/15 16:42:09
Log at: c:\temp\58061good.xls.bffvalidator.06-22-15_16-42-09.xml
See: http://msdn.microsoft.com/en-us/library/A6FFF2B4-470A-463D-A6E9-9DAD9676CD44 for more information
Comment 8 Dominik Stadler 2015-06-22 14:45:53 UTC
Created attachment 32847 [details]
File created with NPOIFS
Comment 9 Dominik Stadler 2015-06-22 14:46:10 UTC
Created attachment 32848 [details]
File created with OPOIFS
Comment 10 Nick Burch 2015-06-23 16:33:23 UTC
Created attachment 32852 [details]
libgsf-1 python program to check the NPOIFS and OPOIFS files

I've had a bit of a play with libgsf-1, in python, listing script attached. It reports no problems reading the OPOIFS generated file, but on the NPOIFS one it gives the error 

libgsf:msole-WARNING **: failure reading block 0 for 'Workbook'

That, coupled with the BFF validator error, does make me think we must be doing something wrong with how we're writing the file. Odd that our code reads it fine, but the other libraries don't... May need to step through the libgsf code to see what it isn't liking, then re-check the matching NPOIFS write code against the spec for that section
Comment 11 Nick Burch 2015-06-28 19:00:23 UTC
I've noticed that most of the other tools pop their Properties Table into block 0, which NPOIFS wasn't. I've changed that in r1688038. More tweaks may be required though
Comment 12 Nick Burch 2015-06-29 09:56:59 UTC
Larger files seem to open OK with OpenOffice and libgsf, so it looks like remaining issues are around the ministream. If someone with handy access to a copy of windows could try re-creating the small test file on trunk, and seeing if the BFFV gives any more helpful messages about the file or not, that'd help!

(If not, I've got a c program working with libgsf-1, so next step otherwise is to use gdb to step into it to see why it doesn't like the file)
Comment 13 Nick Burch 2015-07-01 18:52:45 UTC
I've spent quite a bit more time on this, but I'm still stumped as to what we're doing wrong. OpenMcdf is able to read the files without error or issue, but libgsf objects. If you open the file in OpenMcdf, save the Workbook stream, then create a new file with OpenMcdf, Excel opens that file. (However, that file is wrong in other ways, bugs reported to the OpenMcdf project!) If you save the Workbook stream out using POIFSDump, then use python + ctypes + libgsf to write it, then Excel is happy with the file, but in some situations libgsf gives the same error about its file that it does for NPOIFS generated ones! Bug reported to libgsf as well. And I still can't work out what NPOIFS is doing that upsets Excel / OpenOffice...
Comment 14 Nick Burch 2015-07-02 22:57:59 UTC
No response on my libgsf or OpenMcdf bug reports

I had a play with the Node.JS OLE2 library js-cfb <https://github.com/SheetJS/js-cfb> - this is able to read the NPOIFS ministream without error, just like OpenMcdf can do. I've also tried the Perl module Spreadsheet::Read, and it reads the NPOIFS file fine too

Microsoft released a new version of the [MS-CFB] file format spec last week, but sadly that didn't seem to have any changes in it that cover this case

At this point, it looks like a subtle issue where one reading of the format spec differs slightly from the one Excel + OO has. Unless someone else can spot something, it's probably going to need setting up a Windows C development environment, writing a simple lister with the same OLE2 functions that Excel uses, and stepping into that with a debugger to see why it doesn't like our ministream.... Any takers? :)
Comment 15 Nick Burch 2015-07-06 22:12:54 UTC
With the help of Morten Welinder from libgsf, I think we might have this cracked now as of r1689505! Would someone be able to confirm?
Comment 16 Dominik Stadler 2015-07-07 07:49:00 UTC
Thanks a lot, it seems to work fine now also with very small files!