Bug 64164 - (POI 3.17) - Embedded files in .doc text extracted automatically - how to skip these
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: unspecified
Hardware: PC Mac OS X 10.1
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2020-02-20 16:01 UTC by Rob Squire
Modified: 2020-02-20 17:29 UTC

Sample input file (128.00 KB, application/msword)
2020-02-20 16:01 UTC, Rob Squire

Description Rob Squire 2020-02-20 16:01:47 UTC
Created attachment 37029
Sample input file

Hi there,

We recently realised that when a Word document (.doc, not .docx) contains embedded Excel spreadsheets, the text of those spreadsheets is automatically included in the output of the text extraction process.

  // pass an input stream (.doc sample containing an embedded excel file with 
  // some text in the cells)
  POITextExtractor t = 

  // produces the text of the .doc document BUT also the contents of the
  // embedded Excel document - is there a way to turn this feature off?

Please let us know if there is something we can do to get around this and turn this feature off for the text extractor.

Comment 1 Nick Burch 2020-02-20 16:15:18 UTC
For text extraction, you would be better off using Apache Tika. Tika wraps POI, but gives full control over the processing of text, metadata, and embedded resources.
Comment 2 Rob Squire 2020-02-20 16:33:27 UTC
Thanks a lot for the quick reply Nick!

I think in the meantime we can try to do some gymnastics with the paragraphText and build up what we need to give back to the user (as POI is used quite a lot throughout our app already).

Much appreciated!
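
The paragraphText workaround mentioned above might look roughly like the following. This is only a sketch under assumptions that this thread does not confirm: that the String[] comes from WordExtractor.getParagraphText(), that those paragraphs do not carry the embedded spreadsheet's text, and that field instructions sit between the Word field markers 0x13 and 0x14. ParagraphTextJoiner and its methods are hypothetical names; the helper uses only the JDK, so it can be tried without POI on the classpath.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: rebuilds body text from HWPF paragraph strings. */
final class ParagraphTextJoiner {
    // Control characters that delimit fields in Word binary text.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /** Drops field instructions and field markers, keeps field results. */
    static String stripFieldInstructions(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        int hidden = 0; // number of enclosing fields that are in their instruction part
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inInstruction.isEmpty() && inInstruction.peek()) {
                    inInstruction.pop();
                    inInstruction.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                if (!inInstruction.isEmpty() && inInstruction.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Cleans each paragraph and joins them with a single newline each. */
    static String join(String[] paragraphs) {
        StringBuilder out = new StringBuilder();
        for (String paragraph : paragraphs) {
            String cleaned = stripFieldInstructions(paragraph);
            int end = cleaned.length();
            // Trim the trailing paragraph terminator (CR and/or LF).
            while (end > 0 && (cleaned.charAt(end - 1) == '\r' || cleaned.charAt(end - 1) == '\n')) {
                end--;
            }
            out.append(cleaned, 0, end).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In real use the array would come from WordExtractor.getParagraphText() (assumption).
        String[] sample = {
            "Intro\r\n",
            "See \u0013 HYPERLINK \"http://example.org\" \u0014the site\u0015.\r\n"
        };
        System.out.print(join(sample));
    }
}
```

POI also ships a static WordExtractor.stripFields(String), which may be preferable to a hand-rolled version; either way, the result should be checked against the attached sample before relying on it.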
Comment 3 Tim Allison 2020-02-20 17:29:57 UTC
To follow up on Nick's advice, download from here: https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.23.jar

and try it with these options:

java -jar tika-app-1.23.jar -J -t input_file.doc
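
Since POI is already used throughout the application, the same command could also be launched from Java. The following is a minimal sketch, not a recommended integration: TikaAppRunner, buildCommand, and run are hypothetical names, the jar is assumed to have been downloaded already, `java` is assumed to be on the PATH, and the child process is assumed to write its output in the platform default charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical wrapper around the tika-app command line shown above. */
final class TikaAppRunner {

    /** Builds the same command as above: java -jar <jar> -J -t <input>. */
    static List<String> buildCommand(Path tikaJar, Path inputFile) {
        return List.of("java", "-jar", tikaJar.toString(), "-J", "-t", inputFile.toString());
    }

    /** Runs tika-app and returns whatever it printed to standard output. */
    static String run(Path tikaJar, Path inputFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaJar, inputFile))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let Tika's warnings pass through
                .start();
        byte[] stdout = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with code " + exitCode);
        }
        // Assumption: output is encoded in the platform default charset.
        return new String(stdout, Charset.defaultCharset());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Path.of("tika-app-1.23.jar"), Path.of("input_file.doc")));
    }
}
```

With -J the output should be JSON with one entry per document, the container first and the embedded files after it, so the caller can keep the container's text and ignore the rest.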