On TIKA-1859, Movses reported that he can extract content from a specific xlsx file with POI but not with Tika. I confirmed that the content is extractable with XSSFWorkbook. However, Tika does a streaming read with XSSFSheetXMLHandler, which relies on qName to find "row" and "c". In the submitted problematic file, the qName includes the namespace prefix (i.e. "x:row", "x:c"), so the sheet handler skips that content entirely. When I switched the string comparisons in startElement and endElement in XSSFSheetXMLHandler to rely on localName instead of qName, the content was correctly extracted. Movses ranked this a blocker on Tika. It would be great if we could get the fix in before we cut 3.14... I should have time tonight to make the fix in trunk.
I'm hesitant to rely on localName because that's the whole point of namespaces. Should we test that the URI == "http://schemas.openxmlformats.org/spreadsheetml/2006/main" in startElement and endElement to make sure we're dealing with the right "c" and "row"? Or is there a more elegant option?
It'll be a bit more verbose, but checking both the namespace and the tag name is the safest way to do it.
Thank you, Nick! Done. r1730992
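For readers following along, here is a minimal sketch of the kind of check discussed above. It is not the change committed in r1730992 and it does not use XSSFSheetXMLHandler itself. The class name PrefixedSheetDemo, the count helper, and the sample XML are all hypothetical, and the parser is assumed to be namespace aware. It compares matching on qName against matching on namespace URI plus localName for a sheet whose elements carry an "x:" prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of POI or Tika.
class PrefixedSheetDemo {
    static final String NS_SPREADSHEETML =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Made-up stand-in for a sheet part whose elements carry an "x:" prefix.
    static final String PREFIXED_SHEET =
        "<x:worksheet xmlns:x=\"" + NS_SPREADSHEETML + "\">"
      + "<x:sheetData>"
      + "<x:row r=\"1\"><x:c r=\"A1\"><x:v>1</x:v></x:c>"
      + "<x:c r=\"B1\"><x:v>2</x:v></x:c></x:row>"
      + "<x:row r=\"2\"><x:c r=\"A2\"><x:v>3</x:v></x:c></x:row>"
      + "</x:sheetData></x:worksheet>";

    // Returns {rowCount, cellCount} using the chosen matching strategy.
    static int[] count(String xml, boolean matchOnQName) {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (matchOnQName) {
                    // Prefix-sensitive: "x:row" never equals "row".
                    if ("row".equals(qName)) counts[0]++;
                    else if ("c".equals(qName)) counts[1]++;
                } else {
                    // Check the namespace first, then the local tag name.
                    if (!NS_SPREADSHEETML.equals(uri)) return;
                    if ("row".equals(localName)) counts[0]++;
                    else if ("c".equals(localName)) counts[1]++;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // assumption: namespace-aware parsing
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] byQName = count(PREFIXED_SHEET, true);
        int[] byNamespace = count(PREFIXED_SHEET, false);
        System.out.println("qName match:          rows=" + byQName[0] + " cells=" + byQName[1]);
        System.out.println("URI + localName match: rows=" + byNamespace[0] + " cells=" + byNamespace[1]);
    }
}
```

With this sample input, the qName strategy finds no rows or cells, while the URI plus localName strategy finds two rows and three cells. Testing the URI as well as the local name keeps an unrelated "c" or "row" from another namespace from being picked up.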
Created attachment 34294: Excel file broken with the change. The attached Excel file was generated by Microsoft Excel 2016 for Mac. It can no longer be extracted as of 3.14, after the change in this ticket was introduced. When I debug the startElement() call, uri is empty but not null (""). Also, localName is empty while qName is present.
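This comment does not establish why uri and localName were empty for that file. One situation known to produce exactly this combination is a SAX parser that is not namespace aware, which is the SAXParserFactory default. The sketch below is an illustration under that assumption, not a diagnosis of the attached file. The class name NamespaceAwarenessDemo, the describeRow helper, and the sample XML are hypothetical. The behaviour shown is that of the parser bundled with the JDK; other parsers on the classpath may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of POI or Tika.
class NamespaceAwarenessDemo {
    // Made-up sheet fragment that uses the default (unprefixed) namespace.
    static final String SHEET =
        "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
      + "<sheetData><row r=\"1\"/></sheetData></worksheet>";

    // Returns {uri, localName, qName} as reported for the first <row> element.
    static String[] describeRow(String xml, boolean namespaceAware) {
        final String[] seen = new String[3];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (seen[2] == null && "row".equals(qName)) {
                    seen[0] = uri;
                    seen[1] = localName;
                    seen[2] = qName;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (boolean aware : new boolean[] {false, true}) {
            String[] r = describeRow(SHEET, aware);
            System.out.println("namespaceAware=" + aware
                + " uri=[" + r[0] + "] localName=[" + r[1] + "] qName=[" + r[2] + "]");
        }
    }
}
```

If the parser in use is not namespace aware, a handler that compares uri and localName will never match. In that case the parser configuration needs to change, or the handler needs a fallback.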
(In reply to Ed Chu from comment #4)
> Created attachment 34294
> Excel file broken with the change

Can you try with a more recent version of POI? I've just tried with the most recent, and I see the same text from POI and Tika as I do in OpenOffice ("Table 1", but that's about it).