Bug 65096 - Apache POI Excel XLSX Streaming XML not correctly reading multiple inline Strings
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-21 17:27 UTC by Jack
Modified: 2021-01-21 21:06 UTC
CC List: 0 users



Attachments
Example (8.20 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-01-21 17:27 UTC, Jack
Screenshot (6.02 KB, image/png)
2021-01-21 17:27 UTC, Jack
Sheet.xml (662 bytes, text/xml)
2021-01-21 17:28 UTC, Jack
SampleApplication.java (2.61 KB, text/plain)
2021-01-21 17:28 UTC, Jack
Sample Output (205 bytes, text/plain)
2021-01-21 17:28 UTC, Jack

Description Jack 2021-01-21 17:27:32 UTC
Created attachment 37706 [details]
Example

This was raised from a question on Stack Overflow: https://stackoverflow.com/q/65789807

I've got an XLSX Excel file with a single cell.

When loaded using POI's WorkbookFactory, it's read correctly as a single cell.

When read using POI's XSSFSheetXMLHandler, it's read as though it were two separate cells.

In the underlying sheet.xml you'd expect to see a single item of text per cell, but here the text is split into two blocks, one formatted in a different font from the other.
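
To illustrate the shape involved, here is a minimal sketch that uses only the JDK's SAX parser, not POI's XSSFSheetXMLHandler. The XML fragment is a hypothetical reconstruction of an inline-string cell with two runs (it is not the attached Sheet.xml), and the class and method names are made up for the example. It shows the behaviour one would expect from a streaming reader: the text of every <t> inside a cell is accumulated and the cell is reported once, when it closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class InlineStringDemo {

    // Hypothetical sheet fragment (an assumption, not the attached file):
    // one inlineStr cell whose text is split across two runs, each with its own <t>.
    static final String SHEET_XML =
        "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
      + "<sheetData><row r=\"1\">"
      + "<c r=\"A1\" t=\"inlineStr\"><is>"
      + "<r><rPr><b/></rPr><t>Hello </t></r>"
      + "<r><t>World</t></r>"
      + "</is></c>"
      + "</row></sheetData></worksheet>";

    /** Returns one "ref=text" entry per cell, joining all <t> text in that cell. */
    static List<String> readCells(String xml) {
        List<String> cells = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();
                private String ref;
                private boolean inText;

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if ("c".equals(local)) {
                        ref = atts.getValue("r");
                        text.setLength(0); // reset per cell, not per <t>
                    } else if ("t".equals(local)) {
                        inText = true;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inText) {
                        text.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if ("t".equals(local)) {
                        inText = false;
                    } else if ("c".equals(local)) {
                        cells.add(ref + "=" + text); // emit once, when the cell closes
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(readCells(SHEET_XML)); // prints [A1=Hello World]
    }
}
```

The sketch ignores everything a real reader has to handle (shared strings, formulas, phonetic runs, number formats); it only demonstrates emitting on the end of the cell rather than on the end of each <t>.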
Comment 1 Jack 2021-01-21 17:27:59 UTC
Created attachment 37707 [details]
Screenshot
Comment 2 Jack 2021-01-21 17:28:10 UTC
Created attachment 37708 [details]
Sheet.xml
Comment 3 Jack 2021-01-21 17:28:22 UTC
Created attachment 37709 [details]
SampleApplication.java
Comment 4 Jack 2021-01-21 17:28:50 UTC
Created attachment 37710 [details]
Sample Output
Comment 5 PJ Fanning 2021-01-21 20:01:00 UTC
This is probably a bug but do you have any idea what produced this xlsx file? The sheet1.xml is formatted and the namespace declarations are different from most xlsx files I've seen. This is just out of interest.
Comment 6 Jack 2021-01-21 20:06:59 UTC
(In reply to PJ Fanning from comment #5)
> This is probably a bug but do you have any idea what produced this xlsx
> file? The sheet1.xml is formatted and the namespace declarations are
> different from most xlsx files I've seen. This is just out of interest.

It was produced by a piece of proprietary software; unfortunately, that's all I can disclose.
I extracted this segment from a larger document - I formatted the XML but the namespace is as it was in the original file.
Comment 7 PJ Fanning 2021-01-21 20:20:28 UTC
In the short term, can you ask the owners of the proprietary software not to use multiple <t> elements for a cell?
Comment 8 PJ Fanning 2021-01-21 20:23:04 UTC
The same bug exists in excel-streaming-reader. I have added a fix: https://github.com/pjfanning/excel-streaming-reader/pull/29
Comment 9 Jack 2021-01-21 20:34:03 UTC
(In reply to PJ Fanning from comment #7)
> In the short term, can you ask the owners of the proprietary software not to
> use multiple <t> elements for a cell?

Unfortunately, these files already exist, so I need to be able to load them.

I worked around this by checking whether the "is" tag is still open (a state accessed via reflection), storing the values while it is, and reading them once it's closed.
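
For anyone who wants to avoid reflection, a different buffering approach can be sketched as below. This is not the workaround I actually used. It assumes the duplicate callbacks arrive with the same cell reference, which this report does not confirm, and `CellSink` is a hypothetical stand-in interface rather than a POI type, so it would need adapting to whatever handler interface is in use.

```java
import java.util.ArrayList;
import java.util.List;

final class MergingCellSinkDemo {

    /** Hypothetical stand-in for a streaming cell callback; not a POI type. */
    interface CellSink {
        void cell(String cellReference, String value);
        void endRow();
    }

    /** Buffers consecutive values that share a cell reference and forwards them as one cell. */
    static final class MergingCellSink implements CellSink {
        private final CellSink delegate;
        private final StringBuilder pending = new StringBuilder();
        private String pendingRef;

        MergingCellSink(CellSink delegate) {
            this.delegate = delegate;
        }

        @Override
        public void cell(String cellReference, String value) {
            if (!cellReference.equals(pendingRef)) {
                flush(); // a new cell started, so the previous one is complete
                pendingRef = cellReference;
            }
            if (value != null) {
                pending.append(value);
            }
        }

        @Override
        public void endRow() {
            flush(); // the last cell of the row has no successor to trigger a flush
            delegate.endRow();
        }

        private void flush() {
            if (pendingRef != null) {
                delegate.cell(pendingRef, pending.toString());
                pendingRef = null;
                pending.setLength(0);
            }
        }
    }

    /** Replays a made-up callback sequence in which A1 is reported in two pieces. */
    static List<String> replay() {
        List<String> seen = new ArrayList<>();
        CellSink collector = new CellSink() {
            @Override
            public void cell(String cellReference, String value) {
                seen.add(cellReference + "=" + value);
            }

            @Override
            public void endRow() {
                // nothing to do for this demo
            }
        };
        CellSink merging = new MergingCellSink(collector);
        merging.cell("A1", "Hello ");
        merging.cell("A1", "World");
        merging.cell("B1", "x");
        merging.endRow();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [A1=Hello World, B1=x]
    }
}
```

The trade-off is that each cell is delivered one callback late, and the approach only helps if the split pieces really do share a reference.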
Comment 10 PJ Fanning 2021-01-21 21:06:18 UTC
I've tried a fix in r1885770. So far, it looks like the streaming xlsx parser code is somewhat undertested, so I hope I haven't broken other use cases while fixing this one.