Bug 61354 - Tika fails to get full HTML
Summary: Tika fails to get full HTML
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.17-dev
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2017-07-27 21:49 UTC by Karthik Ramachandran
Modified: 2017-07-28 11:03 UTC

MultipleBodyBug (99.71 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-07-27 21:49 UTC, Karthik Ramachandran
Patch for reading all body (98.52 KB, patch)
2017-07-27 21:52 UTC, Karthik Ramachandran

Description Karthik Ramachandran 2017-07-27 21:49:09 UTC
Created attachment 35184

Apache Tika fails to extract the full HTML if the Word document has multiple body elements. We only get the data from the first body.
Comment 1 Karthik Ramachandran 2017-07-27 21:52:26 UTC
Created attachment 35185
Patch for reading all body
Comment 2 PJ Fanning 2017-07-28 07:44:56 UTC
Merged using https://svn.apache.org/repos/asf/poi/trunk@1803250
Comment 3 Tim Allison 2017-07-28 11:03:25 UTC
Karthik, Thank you for sharing a patch and triggering document!  PJ, thank you for fixing this so quickly!

As a side note, Tika's experimental SAX parser for docx does extract everything, and this is exactly one of the reasons that I added it: if we don't account for structural rarities, we'll still get the text. With our DOM model, we're looking for some specific things in specific places (see also TIKA-1130).

Make no mistake, we need to fix our DOM parser when people find problems, and I'm grateful that you opened this!
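
To illustrate the DOM versus SAX difference described above, here is a minimal sketch using only the JDK's XML APIs. It is not the POI or Tika code. The class name MultiBodySketch, its methods, and the sample XML are hypothetical, and the markup is a simplified stand-in for word/document.xml (real files use namespaced elements such as w:body and w:t). The DOM-style method reads only the first body element, which mirrors the reported symptom. The SAX-style method collects text runs wherever they occur.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class MultiBodySketch {

    // Hypothetical, simplified stand-in for a document with two body elements.
    public static final String SAMPLE =
        "<document><body><p><t>first</t></p></body>"
      + "<body><p><t>second</t></p></body></document>";

    // DOM-style: looks at one specific place (the first <body>), so text in
    // any later <body> is silently dropped.
    public static String firstBodyText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList bodies = doc.getElementsByTagName("body");
        return bodies.getLength() == 0 ? "" : bodies.item(0).getTextContent();
    }

    // SAX-style: streams through the whole file and keeps the text of every
    // <t> element, regardless of which <body> contains it.
    public static String allText(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("t".equals(qName)) {
                    inText = false;
                    out.append(' '); // separate consecutive text runs
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM, first body only: " + firstBodyText(SAMPLE));
        System.out.println("SAX, all text runs:   " + allText(SAMPLE));
    }
}
```

In this sketch, fixing the DOM-style method would mean iterating over every item in the NodeList instead of taking item(0). Whether the attached patch takes that approach is not stated in this report.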