Bug 58093

Summary: Rework of getDocumentText() in HWPFDocument
Product: POI
Reporter: Andreas Meier <andreas.meier>
Component: HWPF
Assignee: POI Developers List <dev>
Severity: minor    
Priority: P2    
Version: 3.12-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Example

Description Andreas Meier 2015-07-02 09:59:33 UTC
Created attachment 32878

I couldn't find a dedicated place to submit a change request for POI, so I am filing my request here:

If Word documents (doc/docx) are embedded in a Word document (doc) [Word 97(-2007)], getDocumentText() returns an identifier for each embedded document (EMBED Word.Document.12) along with control characters ("end-of-text", "end-of-transmission"), as you can see on the left side of the attached image.

Is this the intended behavior of the method?
Why is there no option to control the structure and content of the returned document text?
In my opinion, the document text should be the document content only. That means header, content, and footer, but no metadata.

I recommend reworking the getDocumentText() method in HWPFDocument as follows:
- Add a boolean flag "suppressEmbeddedInformation" that suppresses metadata such as embedded-object identifiers (EMBED Word.Document.X) and the control characters that accompany them.
- Add a boolean flag "recursiveExtraction". When "true", getDocumentText() or getText() is called on every embedded document and its content is returned as a string. When "false", only the content of the main document (header, content, footer) is extracted.
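
To make the proposal concrete, here is a minimal sketch of how the two flags might behave. It does not use POI at all: the `Doc` class, the `extractText` method, and the `[EMBED]` marker are hypothetical stand-ins invented for illustration, and a real implementation inside HWPFDocument would look different.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractionSketch {
    /** Hypothetical toy model: a document has its own text plus embedded documents. */
    static final class Doc {
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    // Hypothetical placeholder for an identifier such as "EMBED Word.Document.12".
    static final String EMBED_MARKER = "[EMBED]";

    /**
     * Returns the text of doc. The two parameters mirror the flags proposed
     * in this report; neither exists in POI.
     */
    static String extractText(Doc doc,
                              boolean suppressEmbeddedInformation,
                              boolean recursiveExtraction) {
        StringBuilder sb = new StringBuilder(doc.text);
        for (Doc child : doc.embedded) {
            if (!suppressEmbeddedInformation) {
                // Current behavior as described above: the marker leaks into the text.
                sb.append(EMBED_MARKER);
            }
            if (recursiveExtraction) {
                // Proposed behavior: include the embedded document's own content.
                sb.append(extractText(child, suppressEmbeddedInformation, recursiveExtraction));
            }
        }
        return sb.toString();
    }
}
```

With both flags set to "true", the result would contain only readable text from the main and embedded documents, which corresponds to the right side of the attached image.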

The attached image shows two results for extracting a docx document embedded in a doc document:
The left side shows the current result of the getDocumentText() method. The right side shows one possible (clean) result that I would like to get.

What do you think about it?

Comment 1 Nick Burch 2015-07-02 10:39:52 UTC
All sorts of control sequences and fields can come through in the text, because the .doc format handles a lot of things that way.

If you don't want these and only want the text, use a utility method such as https://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields%28java.lang.String%29 to have them removed.
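
As a rough illustration of what such field stripping involves, here is a self-contained sketch that does not depend on POI. It assumes the usual .doc convention that a field starts with character 0x13, optionally separates its code from its displayed result with 0x14, and ends with 0x15. The class and method names are made up for this example, and the real stripFields may behave differently in edge cases.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldStripSketch {
    static final char FIELD_BEGIN = 0x13;     // starts a field
    static final char FIELD_SEPARATOR = 0x14; // splits field code from field result
    static final char FIELD_END = 0x15;       // ends a field

    /**
     * Drops field codes (e.g. " EMBED Word.Document.12 ") and the three
     * field marker characters, keeping ordinary text and field results.
     */
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are still in its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields whose code part is not finished

        for (char c : text.toCharArray()) {
            if (c == FIELD_BEGIN) {
                inCode.push(true);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty() && inCode.peek()) {
                    // Switch the innermost field from code part to result part.
                    inCode.pop();
                    inCode.push(false);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty() && inCode.pop()) {
                    codeDepth--; // field ended without a separator
                }
            } else if (codeDepth == 0) {
                out.append(c); // plain text, or a result not nested in any code part
            }
        }
        return out.toString();
    }
}
```

In real code, prefer the POI utility linked above over a hand-rolled version like this one.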

Note that the javadocs for https://poi.apache.org/apidocs/org/apache/poi/hwpf/HWPFDocumentCore.html#getDocumentText%28%29 explicitly state that fields are included in the returned text. Other methods (e.g. via WordExtractor, or Apache Tika) are provided to give content text only, for those who want it.