Bug 59746 - Unable to correctly read data in XLSX file
Summary: Unable to correctly read data in XLSX file
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.14-FINAL
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2016-06-23 09:22 UTC by Rob
Modified: 2016-06-23 19:00 UTC (History)
0 users

Suporting docs with xlsx files (18.53 KB, application/x-zip-compressed)
2016-06-23 09:34 UTC, Rob
Full data that will not open (249.36 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2016-06-23 09:36 UTC, Rob
Full data saved by excel (895.95 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2016-06-23 09:36 UTC, Rob
Smaller file that does not work (5.79 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2016-06-23 15:47 UTC, Rob

Note You need to log in before you can comment on or make changes to this bug.
Description Rob 2016-06-23 09:22:56 UTC
I have an excel xlsx file that was exported by some mass spec software (ThemoScientific I think) that looks to contain hierarchical or grouped data. There are two examples of these files in the attached, one that is the full export and one where only the top level data is exported. If you try to open these files using POI, the file opens correctly, but the lastRowNum for the only sheet is -1 (example code is also in the attachment). However, excel can open the file and if you immediately hit save on the file (no changes needed) and then try to open with POI, the file opens and the correct number of rows are found. 

Looking at the XLM for the saved file, there are definitely changes that excel has made, but I am not sure where POI is failing in parsing the original XML. Not sure where the problem lies, but I have found a python package (https://pypi.python.org/pypi/xlrd) that will correctly read the files that POI is unable to read, so the XML does not appear to be completely nonstandard (or only one that excel itself can read). 

This appears still be a problem with 3.15-beta1 as well.
Comment 1 Rob 2016-06-23 09:34:56 UTC
Created attachment 33977 [details]
Suporting docs with xlsx files

I cannot attached the larger files as it goes over the allowed file size.
Comment 2 Rob 2016-06-23 09:36:05 UTC
Created attachment 33978 [details]
Full data that will not open
Comment 3 Rob 2016-06-23 09:36:31 UTC
Created attachment 33979 [details]
Full data saved by excel
Comment 4 Nick Burch 2016-06-23 11:12:31 UTC
The problem is that all the rows in your file neglect to say which one they are!

In a normal file, the sheet xml should be like:

    <x:row r="1">
      <x:c r="A1" t="str">
      <x:c r="B1" t="str">
        <x:v>Dial Code</x:v>
    <x:row r="2">
      <x:c r="A2" t="str">
        <x:v>Abkhazia - Fixed</x:v>
      <x:c r="B2" t="str">

Where each row indicates the row number. However, in your files, the row numbers are missing:

    <x:row hidden="0" outlineLevel="0" collapsed="0">
      <x:c s="1" t="inlineStr">
      <x:c s="1" t="inlineStr">
    <x:row hidden="0" outlineLevel="0" collapsed="0">
      <x:c s="4" t="b">
      <x:c s="4" t="inlineStr">

My guess is that Excel assumes that a row with no row number should be treated as the next row, and if all have no row number start from row 1.

Any chance you could produce a much smaller problematic excel file from the other software, perhaps with just a few 10s of lines, which we could use for unit testing and a possible fix?
Comment 5 Rob 2016-06-23 11:15:37 UTC
Thanks for the rapid follow up. I will find a smaller file and attached it ASAP
Comment 6 Rob 2016-06-23 15:47:03 UTC
Created attachment 33980 [details]
Smaller file that does not work

Here is a smaller file. Software did not make it easy so it might be slightly biger than you were hoping
Comment 7 Nick Burch 2016-06-23 19:00:05 UTC
Hopefully fixed in r1749971 - I can now read back the values the same as shown in OpenOffice for the file

Do let us know if there are other issues with these files due to them having missed other parts of the XML out though!