Created attachment 35184 [details] MultipleBodyBug Apache Tika fails to extract the full HTML if the Word document has multiple body elements. We only get the data from the first body.
Created attachment 35185 [details] Patch for reading all body
Merged using https://svn.apache.org/repos/asf/poi/trunk@1803250
Karthik, thank you for sharing a patch and a triggering document! PJ, thank you for fixing this so quickly! As a side note, Tika's experimental SAX parser for docx does extract everything, and this is exactly one of the reasons that I added it: if we don't account for structural rarities, we'll still get the text. With our DOM model, we're looking for some specific things in specific places (see also TIKA-1130). Make no mistake, we need to fix our DOM parser when people find problems, and I'm grateful that you opened this!
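
To illustrate why a streaming approach is less sensitive to unexpected structure, here is a minimal sketch using only the JDK's built-in SAX API. It is not Tika's or POI's actual code: the class name AllBodiesText, the extract helper, and the inline sample XML are hypothetical and stand in for a real word/document.xml part. The handler collects the text of every w:t element it sees, so text in a second body is picked up without any special handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not Tika's implementation.
public class AllBodiesText {
    // WordprocessingML main namespace, as used by docx document parts.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:t element, with a newline after each w:p.
    public static String extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        StringBuilder out = new StringBuilder();
        parser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                private boolean inText;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // React to w:t wherever it occurs; no fixed path is assumed.
                    if (W_NS.equals(uri) && "t".equals(localName)) {
                        inText = true;
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if (!W_NS.equals(uri)) {
                        return;
                    }
                    if ("t".equals(localName)) {
                        inText = false;
                    } else if ("p".equals(localName)) {
                        out.append('\n'); // paragraph boundary
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inText) {
                        out.append(ch, start, length);
                    }
                }
            });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: a document part with two w:body elements.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\">"
            + "<w:body><w:p><w:r><w:t>first</w:t></w:r></w:p></w:body>"
            + "<w:body><w:p><w:r><w:t>second</w:t></w:r></w:p></w:body>"
            + "</w:document>";
        System.out.print(extract(xml));
    }
}
```

A DOM-style extractor that navigates to a single expected body node would stop after "first" in this sample, which is the failure mode reported here.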