Bug 61787

Summary: Text extraction omitting text incorrectly
Product: POI
Reporter: Mark Murphy <jmarkmurphy>
Component: XWPF
Assignee: POI Developers List <dev>
Severity: normal
CC: gaeremyncks
Priority: P2
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: All
Bug Depends on: 58067
Bug Blocks:
Attachments: A .docx with rsidDel attributes

Description Mark Murphy 2017-11-20 13:53:54 UTC
Text extraction omits a run's text when the run carries an rsidDel attribute. This is incorrect: rsid* attributes are simply revision session identifiers, so the attribute can be present while the run text is still valid. Instead of keying on revision session id attributes, text extraction should key on specific revision tags to determine which text to omit. The appropriate tag to omit is <delText>.
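To illustrate the intended behaviour, here is a minimal sketch that uses only the JDK's DOM parser, not POI itself. The class name RsidDelDemo, the method extractText, and the sample XML fragment are made up for this example; it is not how XWPFWordExtractor is implemented. It collects text from <w:t> elements only, so <w:delText> content is left out while a run that merely carries rsidDel keeps its text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; plain DOM stands in for the real extractor.
class RsidDelDemo {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the content of every <w:t> element in document order.
    // <w:delText> has a different local name, so it is never collected,
    // and rsidDel attributes are not consulted at all.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                sb.append(texts.item(i).getTextContent());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up fragment: a run with rsidDel, a tracked deletion, a plain run.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r w:rsidDel=\"00A1B2C3\"><w:t>kept </w:t></w:r>"
                + "<w:del w:id=\"1\"><w:r><w:delText>removed </w:delText></w:r></w:del>"
                + "<w:r><w:t>text</w:t></w:r>"
                + "</w:p>";
        System.out.println(extractText(xml)); // prints "kept text"
    }
}
```

The run with rsidDel contributes "kept " because the attribute is ignored; only the element name decides what is dropped.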
Comment 1 Mark Murphy 2017-11-20 13:56:52 UTC
This issue was introduced by Bug #58067
Comment 2 Simon Gaeremynck 2017-11-20 16:33:07 UTC
Created attachment 35540
A .docx with rsidDel attributes
Comment 3 Dominik Stadler 2017-12-28 08:50:19 UTC
Adjusted this with r1819405 as follows:
* Instead of checking for rsidDel, check for delText to exclude deleted content
* Also include runs from tracked-change insertions so that inserted text is extracted correctly

Hopefully this now works better across the various ways documents can contain text content.
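A rough sketch of the two adjustments listed above, written against the JDK's DOM API rather than POI's classes. The names TrackedChangeRuns, parse, isW, paragraphText, and appendRunText, as well as the sample paragraph, are assumptions for illustration only; the actual change in r1819405 may be structured differently. The walk visits runs that are direct children of the paragraph plus runs nested inside <w:ins>, and within a run it appends only <w:t> content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not taken from the POI sources.
class TrackedChangeRuns {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Parses a fragment and returns its root element (here, a <w:p>).
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // True if the node is an element <w:localName>.
    static boolean isW(Node node, String localName) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())
                && localName.equals(node.getLocalName());
    }

    // Text of direct runs plus runs inside tracked insertions (<w:ins>).
    // <w:del> children are not visited, so their runs contribute nothing.
    static String paragraphText(Element paragraph) {
        StringBuilder sb = new StringBuilder();
        for (Node c = paragraph.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, "r")) {
                appendRunText(c, sb);
            } else if (isW(c, "ins")) {
                for (Node n = c.getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (isW(n, "r")) appendRunText(n, sb);
                }
            }
        }
        return sb.toString();
    }

    // Appends <w:t> content only; a <w:delText> inside the run is skipped.
    static void appendRunText(Node run, StringBuilder sb) {
        for (Node n = run.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (isW(n, "t")) sb.append(n.getTextContent());
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:t>one </w:t></w:r>"
                + "<w:ins w:id=\"1\"><w:r><w:t>two </w:t></w:r></w:ins>"
                + "<w:del w:id=\"2\"><w:r><w:delText>gone </w:delText></w:r></w:del>"
                + "<w:r w:rsidDel=\"00A1B2C3\"><w:t>three</w:t></w:r>"
                + "</w:p>";
        System.out.println(paragraphText(parse(xml))); // prints "one two three"
    }
}
```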