Bug 60316 - Handle Glossary in XWPFDocument
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.16-dev
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2016-10-28 16:24 UTC by Tim Allison
Modified: 2018-11-23 13:33 UTC

Description Tim Allison 2016-10-28 16:24:19 UTC
On TIKA-2147 and TIKA-2149, Seva Alekseyev and Sharath Kumar shared two documents that throw:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)

I think the issue is that, because the footnotes are within a glossary, the .getParent() call inside getXWPFDocument() returns a plain POIXMLDocumentPart rather than the XWPFDocument.  If we get the grandparent in this case, we actually get the XWPFDocument.

I propose something along these lines:

    public XWPFDocument getXWPFDocument() {
        if (document != null) {
            return document;
        }
        // Usual case: the footnotes part hangs directly off the main document.
        Object parent = getParent();
        if (parent instanceof XWPFDocument) {
            return (XWPFDocument) parent;
        }
        // Glossary case: the parent is a generic part, so look one level higher.
        if (parent instanceof POIXMLDocumentPart) {
            Object grandParent = ((POIXMLDocumentPart) parent).getParent();
            if (grandParent instanceof XWPFDocument) {
                return (XWPFDocument) grandParent;
            }
        }
        throw new IllegalStateException("couldn't find the parent");
    }
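
The parent/grandparent check above can be generalized to a walk up the whole parent chain. The following is only a sketch: Part and Doc are hypothetical stand-ins for POIXMLDocumentPart and XWPFDocument, not the real POI classes, and it says nothing about whether the main document is the right owner for parts that live inside a glossary.

```java
// Stand-in types (assumptions, NOT the real POI classes): just enough
// structure to model a chain of parts that each know their parent.
class Part {
    private final Part parent;

    Part(Part parent) {
        this.parent = parent;
    }

    Part getParent() {
        return parent;
    }
}

// Stand-in for the top-level document, which has no parent.
class Doc extends Part {
    Doc() {
        super(null);
    }
}

class AncestorLookup {
    /** Walks up from start until a Doc is found; returns null if there is none. */
    static Doc findOwningDoc(Part start) {
        for (Part p = start; p != null; p = p.getParent()) {
            if (p instanceof Doc) {
                return (Doc) p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        Part glossary = new Part(doc);        // intermediate generic part
        Part footnotes = new Part(glossary);  // its parent is not a Doc
        System.out.println(findOwningDoc(footnotes) == doc);
    }
}
```

Returning null instead of throwing leaves it to the caller to decide whether a missing owner is an error or something to log and skip.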
Comment 1 Tim Allison 2016-11-04 16:56:18 UTC
On further review, and given TIKA-2163, it looks like this is a whole new kettle of worms.  The proposed fix is incorrect duct tape over a far larger issue.

We aren't currently handling the glossaryDocument as a special relationship type.  Anyone have experience with glossaryDocument?  Looks like an entire other document stored within the document...
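
For anyone who wants to check whether a given .docx carries a glossary without going through POI, here is a rough sketch using only the JDK (Java 9+ for readAllBytes). It assumes the main part's relationships live at word/_rels/document.xml.rels, which is the usual layout but not guaranteed, and it matches on the tail of the relationship type URI so that both the transitional and strict forms are caught. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class GlossaryProbe {
    // Assumed location of the main document part's relationships.
    static final String MAIN_RELS = "word/_rels/document.xml.rels";
    // Tail of the glossaryDocument relationship type URI.
    static final String GLOSSARY_SUFFIX = "/relationships/glossaryDocument";

    /** Returns true if the given .rels XML declares a glossaryDocument relationship. */
    static boolean declaresGlossary(byte[] relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DTDs, since the file may come from an untrusted source.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(relsXml));
            NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                String type = ((Element) rels.item(i)).getAttribute("Type");
                if (type.endsWith(GLOSSARY_SUFFIX)) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable .rels part", e);
        }
    }

    /** Opens the .docx as a zip and inspects the main part's relationships. */
    static boolean hasGlossary(File docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry(MAIN_RELS);
            if (entry == null) {
                return false;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return declaresGlossary(in.readAllBytes());
            }
        }
    }
}
```

Calling GlossaryProbe.hasGlossary(new File("sample.docx")) on a suspect file (the file name is a placeholder) is a quick way to tell whether it falls into the glossary case described here.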
Comment 2 Tim Allison 2018-10-31 12:42:54 UTC
Does anyone have a recommendation for a more graceful outcome than a ClassCastException for files with a GlossaryDocument?

I suspect the actual fix will take a nontrivial amount of work. I don’t want to hide/forget the issue, but I also would prefer a different outcome...logging perhaps?

This issue was recently raised on https://issues.apache.org/jira/browse/TIKA-2769 via an Elasticsearch issue. Our current workaround in Tika is to recommend the SAX-based docx parser.
Comment 3 Dominik Stadler 2018-10-31 19:12:20 UTC
I would opt for handling this more gracefully. Even if POI does not support a feature, it would be nice if it could still handle the document to some degree, so a log message would probably be more appropriate for now.
Comment 4 Tim Allison 2018-10-31 19:33:57 UTC
Thank you, Dominik.

Unless there are objections, I'll try to add logging as a first step.  I'll leave this ticket open for when someone has time to add the new capability.
Comment 5 Tim Allison 2018-11-01 21:23:47 UTC
In r1845517, I added a check+log+skip to avoid a ClassCastException until we have time to implement correct handling of a glossary document.
Comment 6 Tim Allison 2018-11-23 13:15:53 UTC
I shouldn't have skipped "template" types.  I should have skipped "glossary" types.  This leads to a regression where headers/footers are not extracted from template documents.

Will commit fix and new unit test once local build/test/test-integration completes successfully.
Comment 7 Tim Allison 2018-11-23 13:33:34 UTC
Fixed in r1847263
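
For readers following along, a check+log+skip of the kind described in comments 5 and 6 might look roughly like the sketch below. This is not the code from r1845517 or r1847263: ChildPart and partsToRead are hypothetical stand-ins, and the only point is that the skip condition must match the glossary relationship type and nothing else, so that headers, footers and footnotes are still read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

class GlossarySkipSketch {
    private static final Logger LOG = Logger.getLogger(GlossarySkipSketch.class.getName());

    // Relationship type URIs as used in transitional OOXML packages.
    static final String REL_BASE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";
    static final String GLOSSARY = REL_BASE + "glossaryDocument";
    static final String HEADER = REL_BASE + "header";
    static final String FOOTNOTES = REL_BASE + "footnotes";

    /** Hypothetical stand-in for a related part; not a POI class. */
    static final class ChildPart {
        final String relationType;
        final String name;

        ChildPart(String relationType, String name) {
            this.relationType = relationType;
            this.name = name;
        }
    }

    /** Returns the names of the parts to read; glossary parts are logged and skipped. */
    static List<String> partsToRead(List<ChildPart> children) {
        List<String> kept = new ArrayList<>();
        for (ChildPart child : children) {
            if (GLOSSARY.equals(child.relationType)) {
                // Check + log + skip: unsupported for now, but not fatal.
                LOG.warning("Skipping unsupported glossary part: " + child.name);
                continue;
            }
            kept.add(child.name);
        }
        return kept;
    }
}
```

Comparing against one exact relationship type, rather than a broader condition, is what keeps the skip from swallowing parts that are supported.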