Bug 61787 - Text extraction omitting text incorrectly
Summary: Text extraction omitting text incorrectly
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on: 58067
Reported: 2017-11-20 13:53 UTC by Mark Murphy
Modified: 2017-12-28 08:50 UTC

A .docx with rsidDel attributes (6.88 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-11-20 16:33 UTC, Simon Gaeremynck

Description Mark Murphy 2017-11-20 13:53:54 UTC
Text extraction omits a run's text when the run carries an rsidDel attribute. This is incorrect, because rsid* attributes are simply revision session identifiers: the attribute can be present while the run text is still valid. Instead of keying on the revision session id attributes, text extraction should key on the specific revision tags to decide which text to omit. The appropriate tag to omit is <delText>.
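
To illustrate the distinction, here is a minimal sketch that uses the JDK's DOM parser on a hand-written WordprocessingML fragment. It is not POI code and not the extractor's implementation; the class name DelTextSketch, the helper visibleText, and the SAMPLE fragment are hypothetical. The run that merely carries w:rsidDel keeps its <w:t> text, while the run inside <w:del> stores its text in <w:delText> and is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Apache POI.
class DelTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written fragment: the first run has an rsidDel attribute but live
    // text; the second run is a tracked deletion and uses w:delText.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
      + "<w:r w:rsidDel=\"00A1B2C3\"><w:t>kept </w:t></w:r>"
      + "<w:del w:id=\"1\" w:author=\"a\">"
      + "<w:r><w:delText>removed </w:delText></w:r>"
      + "</w:del>"
      + "<w:r><w:t>text</w:t></w:r>"
      + "</w:p>";

    // Concatenates every w:t element in document order. w:delText is a
    // different element name, so deleted text is never collected, and the
    // rsidDel attribute is never consulted.
    static String visibleText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                out.append(texts.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(visibleText(SAMPLE)); // prints "kept text"
    }
}
```

An extractor that skipped every run with rsidDel would instead return only "text" for this fragment, which is the behaviour reported here.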
Comment 1 Mark Murphy 2017-11-20 13:56:52 UTC
This issue was introduced by Bug #58067
Comment 2 Simon Gaeremynck 2017-11-20 16:33:07 UTC
Created attachment 35540
A .docx with rsidDel attributes
Comment 3 Dominik Stadler 2017-12-28 08:50:19 UTC
Adjusted this with r1819405 as follows:
* Instead of checking for rsidDel, check for delText to exclude deleted content
* Also include runs from track-changes insertions so that inserted text is extracted correctly

Hopefully this now works better across the various ways documents can contain text content.
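
The two adjustments listed above can be sketched as a walk over a paragraph's direct children. This is only an approximation with the JDK's DOM parser, not the code from r1819405; the class TrackedChangeWalkSketch, its methods, and the SAMPLE fragment are hypothetical. Plain runs are read, runs nested in <w:ins> are read as well, and everything else (including <w:del>) is skipped, regardless of any rsidDel attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Apache POI.
class TrackedChangeWalkSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph: a plain run, an inserted run, a deleted run,
    // and a run that only carries an rsidDel attribute.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
      + "<w:r><w:t>old </w:t></w:r>"
      + "<w:ins w:id=\"2\" w:author=\"a\"><w:r><w:t>new </w:t></w:r></w:ins>"
      + "<w:del w:id=\"3\" w:author=\"a\">"
      + "<w:r><w:delText>gone </w:delText></w:r>"
      + "</w:del>"
      + "<w:r w:rsidDel=\"00A1B2C3\"><w:t>end</w:t></w:r>"
      + "</w:p>";

    static String paragraphText(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so getLocalName() is populated
            Element paragraph = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collect(paragraph, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse paragraph", e);
        }
    }

    // Visits direct children only: w:r is read, w:ins is descended into,
    // and any other element (w:del, w:pPr, ...) is ignored.
    private static void collect(Element parent, StringBuilder out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !W_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            Element child = (Element) n;
            switch (child.getLocalName()) {
                case "r":
                    appendRunText(child, out);
                    break;
                case "ins":
                    collect(child, out); // inserted runs count as live text
                    break;
                default:
                    break;
            }
        }
    }

    // Appends the w:t children of a run; w:delText is never appended.
    private static void appendRunText(Element run, StringBuilder out) {
        for (Node n = run.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && "t".equals(n.getLocalName())) {
                out.append(n.getTextContent());
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphText(SAMPLE)); // prints "old new end"
    }
}
```

Descending only into <w:ins> is a deliberate simplification; real documents nest runs in other containers (hyperlinks, structured document tags, and so on) that a full extractor also has to handle.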