Bug 60471 - Not loading AlternateContent in XWPF
Summary: Not loading AlternateContent in XWPF
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.16-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2016-12-12 14:53 UTC by Tim Allison
Modified: 2016-12-28 19:55 UTC

triggering file based on testWORD_2006ml.docx in Tika (23.05 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2016-12-12 14:53 UTC, Tim Allison

Description Tim Allison 2016-12-12 14:53:56 UTC
Created attachment 34522

XWPFDocument's onDocumentLoad() looks for paragraphs, tables and SDTs at the top level of the body. As we saw with Bug 54849 (SDTs), there can be other intervening structures between the body and text-containing elements.

I recently noticed that AlternateContent elements can also appear at the body level, and we should probably add those to our document model.
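
To illustrate what "at the body level" means here, the following is a minimal sketch that lists the direct element children of w:body using only the JDK's DOM parser. It is not POI code: the class name BodyChildLister and the method bodyChildren are hypothetical, and the input is assumed to be the raw document.xml text taken from a .docx package. An mc:AlternateContent element shows up in this list as a sibling of w:p and w:tbl, which is the case the current loader does not model.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration only; not part of POI.
public class BodyChildLister {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** Returns the names of the direct element children of w:body, in document order. */
    static List<String> bodyChildren(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document dom = dbf.newDocumentBuilder().parse(
            new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        Element body = (Element) dom.getElementsByTagNameNS(W_NS, "body").item(0);

        List<String> names = new ArrayList<>();
        for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            String prefix;
            if (W_NS.equals(n.getNamespaceURI())) {
                prefix = "w:";
            } else if (MC_NS.equals(n.getNamespaceURI())) {
                prefix = "mc:";
            } else {
                prefix = "?:"; // some other namespace
            }
            names.add(prefix + n.getLocalName());
        }
        return names;
    }
}
```

Anything in the returned list other than w:p, w:tbl and w:sdt is a body-level structure that a loader limited to those three element types would pass over.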

To create this test file, I added a title page via Word's default "add a title page" function.

In the SAX parser that I added to Tika, I chose to extract text from the Fallback section on the theory that that would have the more easily parseable content.  If we're modeling read/write in our DOM/XWPFDocument, we'll probably want to point to both Fallback and Choice?
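
The Fallback-only approach can be sketched with a plain JDK SAX handler. This is a simplified illustration, not the actual Tika parser: the class FallbackTextSketch and its extract method are hypothetical, and it assumes the input is the raw document.xml text. It skips any w:t text while inside an mc:Choice and keeps text found everywhere else, including mc:Fallback.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch for illustration only; not the Tika implementation.
public class FallbackTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    static class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int choiceDepth = 0;    // greater than 0 while inside any mc:Choice
        private boolean inText = false; // true while inside a w:t element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                choiceDepth++;
            } else if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                choiceDepth--;
            } else if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = false;
            } else if (W_NS.equals(uri) && "p".equals(localName) && choiceDepth == 0) {
                text.append('\n'); // end of a paragraph that is not inside a Choice
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText && choiceDepth == 0) {
                text.append(ch, start, length);
            }
        }
    }

    /** Extracts w:t text, ignoring everything inside mc:Choice. */
    static String extract(String documentXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Handler handler = new Handler();
        factory.newSAXParser().parse(new InputSource(new StringReader(documentXml)), handler);
        return handler.text.toString();
    }
}
```

Because the Choice branch is dropped, text that appears in both branches is emitted once. A read/write document model could not discard a branch this way, which is why it would probably need to expose both.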

Unit test:

    public void testAlternateContent() throws IOException {
        XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("testAlternateContent.docx");
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);

        String txt = extractor.getText();
        // Each string should appear exactly once in the extracted text
        assertContainsSpecificCount("engaging abstract", txt, 1);
        assertContainsSpecificCount("MyDocumentTitle", txt, 1);
        assertContainsSpecificCount("MyDocumentSubtitle", txt, 1);
    }

    // Asserts that needle occurs exactly expectedCount times in haystack
    private void assertContainsSpecificCount(String needle, String haystack, int expectedCount) {
        int index = haystack.indexOf(needle);
        int found = 0;
        while (index > -1) {
            found++;
            index = haystack.indexOf(needle, index+1);
        }
        assertEquals(expectedCount, found);
    }