|Summary:||XSSFSheetXMLHandler is using qName instead of localName and missing cells/rows|
|Product:||POI||Reporter:||Tim Allison <tallison>|
|Component:||XSSF||Assignee:||POI Developers List <dev>|
|Attachments:||Excel file broken with the change|
Description Tim Allison 2016-02-17 20:46:16 UTC
On TIKA-1859, Movses raised an issue that he can extract content from a specific xlsx file with POI but not with Tika. I confirmed that the content is extractable with XSSFWorkbook. However, Tika does a streaming read with XSSFSheetXMLHandler. XSSFSheetXMLHandler relies on qName to find "row" and "c". In the submitted problematic file, the qName includes the namespace prefix (i.e. "x:row", "x:c"), and the sheet handler entirely skips that content. When I switched the string processing in startElement and endElement in XSSFSheetXMLHandler to rely on localName instead of qName, content was correctly extracted. Movses ranked this a blocker on Tika. It would be great if we could get the fix in before we cut 3.14... I should have time tonight to make the fix in trunk.
Comment 1 Tim Allison 2016-02-17 20:50:20 UTC
I'm hesitant to rely on localName alone, because telling same-named elements apart is the whole point of namespaces. Should we test that the URI == "http://schemas.openxmlformats.org/spreadsheetml/2006/main" in startElement and endElement to make sure we're dealing with the right "c" and "row"? Or is there a more elegant option?
Comment 2 Nick Burch 2016-02-17 23:38:04 UTC
It'll be a bit more verbose, but checking both the namespace and the tag name is the safest way to do it.
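For illustration, the "check both the namespace and the tag name" approach could look roughly like the sketch below. This is not the code committed to POI: the class name NamespaceCheckSketch and the helper cellRefs are hypothetical, and a plain JAXP SAX parser with namespace awareness switched on is assumed instead of POI's own reader setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not POI's XSSFSheetXMLHandler.
public class NamespaceCheckSketch {
    // SpreadsheetML main namespace, as quoted in comment 1.
    static final String NS_SPREADSHEETML =
            "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    /** Collects the "r" attribute of every SpreadsheetML <c> element in the given XML. */
    static List<String> cellRefs(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Assumption: namespace awareness is enabled, so uri and localName are populated.
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        List<String> refs = new ArrayList<>();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Match on namespace URI plus local name; the prefix in qName is ignored.
                if (NS_SPREADSHEETML.equals(uri) && "c".equals(localName)) {
                    // Unprefixed attributes are in no namespace, hence the empty URI.
                    refs.add(attrs.getValue("", "r"));
                }
            }
        });
        return refs;
    }

    public static void main(String[] args) throws Exception {
        String prefixed =
                "<x:worksheet xmlns:x=\"" + NS_SPREADSHEETML + "\">"
              + "<x:sheetData><x:row r=\"1\">"
              + "<x:c r=\"A1\"><x:v>1</x:v></x:c><x:c r=\"B1\"/>"
              + "</x:row></x:sheetData></x:worksheet>";
        System.out.println(cellRefs(prefixed));
    }
}
```

With this check, "x:c" and an unprefixed "c" under a default namespace declaration are treated the same, while a "c" element from an unrelated namespace is skipped.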
Comment 4 Ed Chu 2016-09-22 19:29:29 UTC
Created attachment 34294
Excel file broken with the change

The attached Excel file was generated by Microsoft Excel 2016 for Mac. It can no longer be extracted as of 3.14, after the change in this ticket was introduced. When I debug the startElement() call, uri is empty but not null (""). localName is also empty, while qName is present.
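One known way to see an empty uri and an empty localName alongside a populated qName is a SAX parser that is not namespace-aware. Whether that is what happened with this attachment is not established in this thread. The sketch below only shows the general SAX behaviour with the JDK's built-in parser; the class name ParserModeSketch and the helper describeRoot are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: compares what startElement receives in the two parser modes.
public class ParserModeSketch {

    /** Returns "uri|localName|qName" as reported for the root element. */
    static String describeRoot(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // JAXP factories are NOT namespace-aware unless this is set to true.
        factory.setNamespaceAware(namespaceAware);

        String[] seen = new String[1];
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        if (seen[0] == null) { // record the root element only
                            seen[0] = uri + "|" + localName + "|" + qName;
                        }
                    }
                });
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<x:row xmlns:x="
                + "\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"/>";
        System.out.println("namespace-aware:     " + describeRoot(xml, true));
        System.out.println("not namespace-aware: " + describeRoot(xml, false));
    }
}
```

In the non-namespace-aware case, a handler that matches on uri and localName finds nothing to match, which is consistent with the symptom described above.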
Comment 5 Nick Burch 2016-09-22 21:01:28 UTC
(In reply to Ed Chu from comment #4)
> Created attachment 34294
> Excel file broken with the change

Can you try with a more recent version of POI? I've just tried with the most recent, and I see the same text from POI and Tika as I do in OpenOffice ("Table 1", but that's about it).