Created attachment 34522
Triggering file, based on testWORD_2006ml.docx in Tika

XWPFDocument's onDocumentLoad() looks for paragraphs, tables and SDTs at the main level of the body. As we saw with Bug 54849 (SDTs), there can be other intervening structures between the body and the text-containing elements.

I recently noticed that AlternateContent elements can also appear at the body level, and we should probably add those to our document model. To create this test file, I added a title page via Word's default "add a title page" function.

In the SAX parser that I added to Tika, I chose to extract text from the Fallback section, on the theory that it would have the more easily parseable content. If we're modeling read/write in our DOM/XWPFDocument, we'll probably want to point to both Fallback and Choice?

Unit test:

public void testAlternateContent() throws IOException {
    XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("testAlternateContent.docx");
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    String txt = extractor.getText();
    // Each title-page string should be extracted exactly once.
    assertContainsSpecificCount("engaging abstract", txt, 1);
    assertContainsSpecificCount("MyDocumentTitle", txt, 1);
    assertContainsSpecificCount("MyDocumentSubtitle", txt, 1);
}

// Counts occurrences of needle in haystack and asserts the expected total.
private void assertContainsSpecificCount(String needle, String haystack, int expectedCount) {
    int index = haystack.indexOf(needle);
    int found = 0;
    while (index > -1) {
        found++;
        index = haystack.indexOf(needle, index + 1);
    }
    assertEquals(expectedCount, found);
}
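
For illustration, the Fallback-only extraction described above could look roughly like the sketch below. It is not the Tika handler and not a proposed POI patch: it uses only the JDK SAX parser, and the class name FallbackTextSketch, the method extractText and the SAMPLE fragment are all hypothetical. SAMPLE is a hand-written stand-in for a body with one AlternateContent element and one ordinary paragraph; real Word output nests the text inside drawing or VML markup. The sketch assumes that the same text appears in both Choice and Fallback, so it skips everything under mc:Choice and collects w:t text from everywhere else.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FallbackTextSketch {
    static final String MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, heavily simplified body: the same title in Choice and Fallback,
    // followed by a normal body-level paragraph.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\" xmlns:mc=\"" + MC_NS + "\"><w:body>"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires=\"wps\"><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Choice>"
        + "<mc:Fallback><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Fallback>"
        + "</mc:AlternateContent>"
        + "<w:p><w:r><w:t>body text</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    static class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int choiceDepth = 0;     // > 0 while inside any mc:Choice
        boolean inText = false;  // true while inside a w:t we want to keep

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (MC_NS.equals(uri) && "Choice".equals(local)) {
                choiceDepth++;
            } else if (W_NS.equals(uri) && "t".equals(local) && choiceDepth == 0) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (MC_NS.equals(uri) && "Choice".equals(local)) {
                choiceDepth--;
            } else if (W_NS.equals(uri) && "t".equals(local)) {
                inText = false;
            } else if (W_NS.equals(uri) && "p".equals(local) && choiceDepth == 0) {
                text.append('\n'); // one line per paragraph outside Choice
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                text.append(ch, start, length);
            }
        }
    }

    // Returns the text of all w:t elements that are not inside an mc:Choice.
    public static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URI + local name
        SAXParser parser = factory.newSAXParser();
        Handler handler = new Handler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(SAMPLE));
    }
}
```

Skipping Choice rather than Fallback is only the choice made in the Tika parser; a read/write model in XWPFDocument would presumably need to keep both branches and leave the decision to the extractor.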