Bug 61475 - Duplication of content in some XWPF
Summary: Duplication of content in some XWPF
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2017-08-31 18:52 UTC by Tim Allison
Modified: 2017-08-31 19:15 UTC

example docx (60.70 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-08-31 18:52 UTC, Tim Allison

Description Tim Allison 2017-08-31 18:52:04 UTC
Created attachment 35274
example docx

In regression tests for 3.17-rc2, I found some duplication of content in Tika, and this is replicated with POI's XWPFWordExtractor.

        XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("dupe1.docx");
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
        // The duplicated passage shows up in the extracted text.
        String text = extractor.getText();

In the attached file, "When readers open..." should only appear once, but it appears twice.
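One way the duplication might be checked in a test is to count how often the phrase occurs in the extracted text. The following is only a minimal sketch: `countOccurrences` and the `DuplicateCheck` class are hypothetical helpers, not part of POI, and the string in `main` is a made-up stand-in for the text the extractor would return for the attached file.

```java
public class DuplicateCheck {

    // Hypothetical helper (not part of POI): counts non-overlapping
    // occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the extractor's output; the real text
        // of dupe1.docx is not reproduced here.
        String extracted = "Intro\nWhen readers open...\nBody\nWhen readers open...\n";

        // A count above 1 would indicate the duplication described above.
        System.out.println(countOccurrences(extracted, "When readers open")); // prints 2
    }
}
```

In a regression test, the same count could be asserted to equal 1 on the text extracted from the attached file.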

Full reports are here:

Roughly 8,000 of ~170k docx files apparently have at least some duplicated content.  Some of the extra content can be explained by the phonetic/ruby issue, but not the majority.
Comment 1 Tim Allison 2017-08-31 19:11:28 UTC
My fault on 61740.

The appending of the picture text slipped into the loop instead of being applied after it.

        // Any picture text?
        if (pictureText != null && pictureText.length() > 0) {
            text.append("\n").append(pictureText);
        }
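The effect of that misplacement can be illustrated with a simplified sketch. This is not the actual POI source: the class, the method names, and the `parts` list are hypothetical, and only the null/length check mirrors the snippet above. With the append inside the loop, the picture text is emitted once per iteration; with the append after the loop, it is emitted once.

```java
import java.util.Arrays;
import java.util.List;

public class PictureTextAppend {

    // Buggy shape: the picture text is appended on every loop iteration.
    static String appendInsideLoop(List<String> parts, String pictureText) {
        StringBuilder text = new StringBuilder();
        for (String part : parts) {
            text.append(part);
            if (pictureText != null && pictureText.length() > 0) {
                text.append("\n").append(pictureText);
            }
        }
        return text.toString();
    }

    // Intended shape: the picture text is appended once, after the loop.
    static String appendAfterLoop(List<String> parts, String pictureText) {
        StringBuilder text = new StringBuilder();
        for (String part : parts) {
            text.append(part);
        }
        if (pictureText != null && pictureText.length() > 0) {
            text.append("\n").append(pictureText);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration only.
        List<String> parts = Arrays.asList("a", "b");
        System.out.println(appendInsideLoop(parts, "pic")); // "pic" appears twice
        System.out.println(appendAfterLoop(parts, "pic"));  // "pic" appears once
    }
}
```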
Comment 2 Tim Allison 2017-08-31 19:15:07 UTC