Bug 58067 - getText() of XWPFParagraph returns deleted text if in "review" mode
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: Macintosh All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2015-06-22 11:36 UTC by femmer
Modified: 2016-01-03 13:29 UTC

A test file to reproduce the problem with (28.04 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2015-06-22 11:36 UTC, femmer
Patch (1.65 KB, application/x-gzip)
2015-06-22 11:41 UTC, femmer

Description femmer 2015-06-22 11:36:48 UTC
Created attachment 32843
A test file to reproduce the problem with

Dear all,

I'm looking for a simple way to parse only the newest version of an XWPF file (as if all tracked changes had been accepted). As far as I could tell from googling and browsing the Javadoc, Apache POI has no such functionality. Is that correct?

Steps to reproduce:
- Open an MS Word document
- Turn on Track Changes
- Remove text from the document (with change tracking on)
- Save (see the attached file)

- Open the file with Apache POI
- Iterate through the paragraphs
- Call getText() on each paragraph

Outcome: The removed text is returned.
Expected: Only text of the "final version" of the document is returned.
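
To illustrate the difference between the outcome and the expectation without depending on POI, here is a minimal sketch that uses only the JDK's DOM parser. It assumes tracked deletions are stored the way Word usually writes them: deleted runs wrapped in a w:del element, with their text in w:delText instead of w:t. The class and method names (TrackedTextSketch, extractText) and the sample XML are made up for illustration; this is neither the POI implementation nor the attached patch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Apache POI.
class TrackedTextSketch {
    // Main WordprocessingML namespace (the "w" prefix in document.xml).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Sample paragraph: one kept run and one run deleted with tracking on.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">Kept text. </w:t></w:r>"
        + "<w:del w:id=\"1\" w:author=\"someone\">"
        + "<w:r><w:delText>Removed text.</w:delText></w:r>"
        + "</w:del>"
        + "</w:p>";

    /**
     * Concatenates run text in document order. With includeDeleted == false,
     * w:delText elements and anything nested in w:del are skipped.
     */
    static String extractText(String xml, boolean includeDeleted) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            NodeList elements = doc.getElementsByTagNameNS(W_NS, "*");
            for (int i = 0; i < elements.getLength(); i++) {
                Node node = elements.item(i);
                String name = node.getLocalName();
                if (!"t".equals(name) && !"delText".equals(name)) {
                    continue; // not a text-carrying element
                }
                boolean deleted = "delText".equals(name) || insideDeletion(node);
                if (!deleted || includeDeleted) {
                    out.append(node.getTextContent());
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // True if any ancestor of the node is a w:del element.
    static boolean insideDeletion(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && "del".equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Outcome:  " + extractText(SAMPLE, true));
        System.out.println("Expected: " + extractText(SAMPLE, false));
    }
}
```

With includeDeleted set to true the sketch mirrors the reported outcome (the removed text comes back); with false it mirrors the expected "final version" text.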

Comment 1 femmer 2015-06-22 11:41:04 UTC
Created attachment 32844

Here is a patch that checks whether a deletion item is associated with a run before adding the run's text. I'm not sure which other items could carry such a deletion, so I only check XWPFRun instances.
Comment 2 femmer 2015-06-22 11:41:45 UTC
The fix is a simple check:

 if (run instanceof XWPFRun) {
+                XWPFRun xRun = (XWPFRun) run;
+                if (xRun.getCTR().getRsidDel() == null) {
+                    out.append(xRun.toString());
+                }
+            }
Comment 3 Dominik Stadler 2015-06-22 14:43:25 UTC
Here is the output:

bffvalidator c:\temp\58061good.xls
BFFValidator: "c:\temp\58061good.xls" FAILED at 06/22/15 16:42:09
Log at: c:\temp\58061good.xls.bffvalidator.06-22-15_16-42-09.xml
See: http://msdn.microsoft.com/en-us/library/A6FFF2B4-470A-463D-A6E9-9DAD9676CD44 for more information

bffvalidator c:\temp\58061corrupt.xls
BFFValidator: "c:\temp\58061corrupt.xls" NOT RECOGNIZED (The Microsoft Office Binary File Format Validator encountered an error reading the file you specified, OR The Microsoft Office Binary File Format Validator supports Word, Excel, and PowerPoint binary file formats only. The file you specified is an unsupported file type.) at 06/22/15 16:42:14
Log at: c:\temp\58061corrupt.xls.bffvalidator.06-22-15_16-42-14.xml
Comment 4 Dominik Stadler 2015-06-22 14:43:34 UTC
Sorry, wrong bug!
Comment 5 Dominik Stadler 2016-01-03 13:29:19 UTC
Thanks for the patch; it has now been applied in r1722715.