On TIKA-2147 and TIKA-2149, Seva Alekseyev and Sharath Kumar shared two documents that throw:

    java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
        at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
        at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
        at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
        at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)

I think the issue is that, because the footnotes are within a glossary, when we call getXWPFDocument() we're invoking .getParent(), which returns a plain POIXMLDocumentPart rather than the XWPFDocument. If we get the grandparent in this case, we actually get the XWPFDocument. I propose something along these lines:

    public XWPFDocument getXWPFDocument() {
        if (document != null) {
            return document;
        } else {
            Object parent = getParent();
            if (parent != null) {
                if (parent instanceof XWPFDocument) {
                    return (XWPFDocument) parent;
                } else if (parent instanceof POIXMLDocumentPart) {
                    Object grandParent = ((POIXMLDocumentPart) parent).getParent();
                    if (grandParent instanceof XWPFDocument) {
                        return (XWPFDocument) grandParent;
                    }
                }
            }
            throw new IllegalStateException("couldn't find the parent");
        }
    }
On further review, and given TIKA-2163, it looks like this is a whole new kettle of worms. The proposed fix is incorrect duct tape over a far larger issue. We aren't currently handling the glossaryDocument as a special relationship type. Anyone have experience with glossaryDocument? Looks like an entire other document stored within the document...
Does anyone have a recommendation for a more graceful outcome than a ClassCastException for files with a GlossaryDocument? I suspect the actual fix will take a nontrivial amount of work. I don’t want to hide or forget the issue, but I would also prefer a different outcome... logging, perhaps? This issue was recently raised on https://issues.apache.org/jira/browse/TIKA-2769 via an Elasticsearch issue. Our current workaround on Tika is to recommend the SAX-based docx parser.
I would opt for handling this more gracefully. Even though POI does not support the feature, it would be nice if it could still handle the document to some degree, so logging would probably be more appropriate for now.
Thank you, Dominik. Unless there are objections, I'll try to add logging as a first step. I'll leave this ticket open for when someone has time to add the new capability.
In r1845517, I added a check+log+skip to avoid a ClassCastException until we have time to implement correct handling of a glossary document.
I shouldn't have skipped "template" types. I should have skipped "glossary" types. This leads to a regression where headers/footers are not extracted from template documents. Will commit fix and new unit test once local build/test/test-integration completes successfully.
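For anyone following along, here is a rough, self-contained sketch of the intended check+log+skip behaviour: glossary parts are logged and skipped, and everything else (headers, footers, footnotes) is still loaded. It is not the code committed to POI. Part, GlossarySkipSketch and load() are hypothetical names made up for illustration, and matching on the glossaryDocument relationship type is an assumption about how such a check could be written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for a package part; this is NOT a POI class.
final class Part {
    final String name;
    final String relationType;
    final List<Part> children = new ArrayList<>();

    Part(String name, String relationType) {
        this.name = name;
        this.relationType = relationType;
    }

    // Adds a child part and returns this part so calls can be chained.
    Part add(Part child) {
        children.add(child);
        return this;
    }
}

// Hypothetical loader illustrating check+log+skip; not the actual POI change.
final class GlossarySkipSketch {
    private static final Logger LOG =
            Logger.getLogger(GlossarySkipSketch.class.getName());

    // OOXML relationship type used for glossary documents.
    static final String GLOSSARY_REL =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/glossaryDocument";

    // Returns the names of the parts that would be loaded, depth first.
    static List<String> load(Part root) {
        List<String> loaded = new ArrayList<>();
        visit(root, loaded);
        return loaded;
    }

    private static void visit(Part part, List<String> loaded) {
        loaded.add(part.name);
        for (Part child : part.children) {
            if (GLOSSARY_REL.equals(child.relationType)) {
                // Check + log + skip: the glossary subtree is never visited,
                // so nothing beneath it can trigger the bad cast.
                LOG.warning("Skipping unsupported glossary document part: " + child.name);
                continue;
            }
            visit(child, loaded);
        }
    }
}
```

The check is made on the child before descending, so the whole glossary subtree (including its own footnotes part) is dropped, while sibling parts such as headers are unaffected.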
Fixed in r1847263