Bug 58093 - Rework of getDocumentText() in HWPFDocument
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.12-FINAL
Hardware: All All
Importance: P2 minor
Target Milestone: ---
Assignee: POI Developers List
Reported: 2015-07-02 09:59 UTC by Andreas Meier
Modified: 2015-07-02 10:39 UTC

Attachments
Example (45.74 KB, image/jpeg)
2015-07-02 09:59 UTC, Andreas Meier

Description Andreas Meier 2015-07-02 09:59:33 UTC
Created attachment 32878
Example

I could not find a dedicated channel for POI change requests, so I am filing my request here:

If a Word document (doc/docx) is embedded in a Word document (doc) [Word 97(-2007)], getDocumentText() returns an identifier for the embedded document (EMBED Word.Document.12) together with control characters ("end-of-text", "end-of-transmission"), as you can see on the left side of the attached image.

Is this method meant to behave this way?
Why is there no option to control the structure and content of the returned document text?
In my opinion, the document text should be the document content only. That means: header, content, footer, but no metadata/metainformation!

I recommend reworking the getDocumentText() method in HWPFDocument as follows:
- Add a boolean flag "suppressEmbeddedInformation" to suppress metainformation such as embedded-object identifiers (EMBED Word.Document.X) and the control characters that accompany them.
- Add a boolean flag "recursiveExtraction". If "true", getDocumentText() or getText() is called on every embedded document and its content is returned as a string. If "false", only the content of the main document (header, content, footer) is extracted.

The attached image shows two results for extracting a docx document embedded in a doc document:
On the left side you see the current result of the getDocumentText() method. On the right side you see one possible (clean) result that I would like to have.

What do you think about it?
Comment 1 Nick Burch 2015-07-02 10:39:52 UTC
There are all sorts of control sequences / fields that can come through in the text, as the .doc format handles loads of things that way.

If you don't want these, and only want the text, then use a utility method such as WordExtractor.stripFields(String) (https://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields%28java.lang.String%29) to have them removed.
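
To illustrate what stripping fields roughly means, here is a minimal, self-contained sketch. It is not POI's implementation: FieldStripSketch and stripFieldCodes are hypothetical names, and the sketch assumes that fields in the .doc text stream are delimited by the characters U+0013 (field begin), U+0014 (separator between field code and field result) and U+0015 (field end). With POI on the classpath, the linked method would be called instead, as noted in the comment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; in real code use POI's WordExtractor.stripFields(String). */
public final class FieldStripSketch {
    // Assumed field marks of the Word binary (.doc) text stream.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    private FieldStripSketch() {}

    /**
     * Drops the field marks and the field code (e.g. " EMBED Word.Document.12 "),
     * but keeps the field result text. Other control characters are left untouched.
     */
    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while we are still inside its code part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int fieldsInCodePart = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                fieldsInCodePart++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    fieldsInCodePart--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it may have had no result part at all.
                if (!openFields.isEmpty() && openFields.pop()) {
                    fieldsInCodePart--;
                }
            } else if (fieldsInCodePart == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // With POI available, the equivalent call would be:
        //   String clean = WordExtractor.stripFields(document.getDocumentText());
        String raw = "Intro \u0013 EMBED Word.Document.12 \\s \u0014text\u0015 After";
        System.out.println(stripFieldCodes(raw)); // prints "Intro text After"
    }
}
```

The sketch only removes field codes and field marks. It does not extract the content of embedded documents, so it covers the "suppressEmbeddedInformation" part of the request at most, not "recursiveExtraction".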

Note that the javadocs for https://poi.apache.org/apidocs/org/apache/poi/hwpf/HWPFDocumentCore.html#getDocumentText%28%29 explicitly state that you get the fields included in the response. Other methods (e.g. via WordExtractor, or Apache Tika) are provided to give content-text only, for those who want it.