Created attachment 35274
example docx

In regression tests for 3.17-rc2, I found some duplicated content in Tika's output, and the duplication can be reproduced with POI's XWPFWordExtractor:

    XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("dupe1.docx");
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);

In the attached file, "When readers open..." should appear only once, but it appears twice.

Full reports are here: http://162.242.228.174/reports/poi-3.17-rc2-docx.tar.gz

Roughly 8,000 of ~170k docx files apparently have at least some duplicated content. Some of the extra content can be explained by the phonetic/ruby issue, but not the majority.
My fault on 61740. The appending of the picture text slipped into the loop instead of being applied after it.

    // Any picture text?
    if (pictureText != null && pictureText.length() > 0) {
        text.append("\n").append(pictureText);
    }
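For anyone reading along, here is a minimal sketch of the kind of mistake described above. It is not the POI source: the class name, both method names, and the modeling of runs and their picture text as plain string arrays are assumptions made purely for illustration. It only shows why appending accumulated picture text inside the loop repeats content, while appending it once after the loop does not.

```java
class PictureTextAppendSketch {

    // Hypothetical buggy shape: the picture-text check sits inside the loop,
    // so the accumulated picture text is appended again on every iteration.
    static String appendInsideLoop(String[] runTexts, String[] runPictureTexts) {
        StringBuilder text = new StringBuilder();
        StringBuilder pictureText = new StringBuilder();
        for (int i = 0; i < runTexts.length; i++) {
            text.append(runTexts[i]);
            pictureText.append(runPictureTexts[i]);
            // Any picture text? (misplaced: runs once per iteration)
            if (pictureText.length() > 0) {
                text.append("\n").append(pictureText);
            }
        }
        return text.toString();
    }

    // Hypothetical fixed shape: collect picture text in the loop,
    // then append it exactly once after the loop has finished.
    static String appendAfterLoop(String[] runTexts, String[] runPictureTexts) {
        StringBuilder text = new StringBuilder();
        StringBuilder pictureText = new StringBuilder();
        for (int i = 0; i < runTexts.length; i++) {
            text.append(runTexts[i]);
            pictureText.append(runPictureTexts[i]);
        }
        // Any picture text?
        if (pictureText.length() > 0) {
            text.append("\n").append(pictureText);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] runTexts = {"a", "b"};
        String[] runPictureTexts = {"P", ""};
        System.out.println(appendInsideLoop(runTexts, runPictureTexts));
        System.out.println("----");
        System.out.println(appendAfterLoop(runTexts, runPictureTexts));
    }
}
```

With two runs where only the first carries picture text, the first method emits that picture text twice and the second emits it once, which matches the symptom in the report in spirit, though the real document structure is more involved.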
r1806839