Bug 64164

Summary:	(POI 3.17) - Embedded files in .doc text extracted automatically - how to skip these
Product:	POI	Reporter:	Rob Squire <squirer7492>
Component:	POI Overall	Assignee:	POI Developers List <dev>
Status:	NEW ---
Severity:	normal	CC:	squirer7492
Priority:	P2
Version:	unspecified
Target Milestone:	---
Hardware:	PC
OS:	Mac OS X 10.1
Attachments:	Sample input file

Description Rob Squire 2020-02-20 16:01:47 UTC

Created attachment 37029 [details]
Sample input file

Hi there,

We recently realised that for documents (.doc, not .docx) with embedded Excel spreadsheets, the text of the embedded spreadsheets is automatically included in the output of the text extraction process.

  // bis is an InputStream over the .doc sample, which contains an embedded
  // Excel file with some text in the cells
  POITextExtractor t =
       org.apache.poi.extractor.ExtractorFactory.createExtractor(bis);

  // Produces the text of the .doc document BUT also the contents of the
  // embedded Excel document - is there a way to turn this feature off?
  String text = t.getText();


Please let us know if there is something we can do to get around this and turn this feature off for the text extractor.

Thanks,
Rob

Comment 1 Nick Burch 2020-02-20 16:15:18 UTC

For text extraction, you would be better off using Apache Tika. Tika wraps POI, but gives full control over the processing of text, metadata and embedded resources.

Comment 2 Rob Squire 2020-02-20 16:33:27 UTC

Thanks a lot for the quick reply Nick!

I think in the meantime we can try to do some gymnastics with the paragraphText and build up what we need to give back to the user (as POI is used quite a lot throughout our app already).

Much appreciated!
Rob
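
A rough sketch of the paragraphText idea from Comment 2, not a confirmed fix. It assumes the strings come from WordExtractor.getParagraphText() in POI's HWPF module, and that this array leaves out the embedded spreadsheet's cell text, which should be checked against the attached sample. ParagraphTextBuilder and buildBodyText are hypothetical helper names, not POI API; the block uses only the JDK so it runs without POI on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphTextBuilder {

    // Hypothetical helper (not part of POI): joins per-paragraph strings into
    // one body text, stripping trailing line terminators and skipping
    // paragraphs that are empty once those terminators are removed.
    static String buildBodyText(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String p : paragraphs) {
            if (p == null) {
                continue;
            }
            String cleaned = p.replaceAll("[\\r\\n]+\\z", "");
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Assumption: in the application this array would come from POI, e.g.
        //   new org.apache.poi.hwpf.extractor.WordExtractor(bis).getParagraphText()
        // The literal values below are made-up sample data.
        String[] paragraphs = { "First paragraph\r\n", "\r\n", "Second paragraph\r\n" };
        System.out.println(buildBodyText(paragraphs));
    }
}
```

Keeping the joining logic separate from the POI call makes it easy to unit test and to adjust later, for example if headers or footnotes need to be appended as well.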

Comment 3 Tim Allison 2020-02-20 17:29:57 UTC

To follow up on Nick's advice, download from here: https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.23.jar

and try it with these options:

java -jar tika-app-1.23.jar -J -t input_file.doc
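
For an application that is already written in Java, the same command could also be launched from code. The following is a minimal sketch using only the JDK's ProcessBuilder; TikaAppRunner, buildCommand and run are hypothetical names, and it assumes that a java executable is on the PATH and that the jar and input file paths are valid on the local machine.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class TikaAppRunner {

    // Builds the same argument list as the command line shown above.
    static List<String> buildCommand(String tikaJar, String inputFile) {
        return Arrays.asList("java", "-jar", tikaJar, "-J", "-t", inputFile);
    }

    // Starts tika-app as a child process, forwards its output to this
    // process's stdout/stderr, and returns the child's exit code.
    static int run(String tikaJar, String inputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaJar, inputFile))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed local paths, taken from the command above.
        int exitCode = run("tika-app-1.23.jar", "input_file.doc");
        System.out.println("tika-app exit code: " + exitCode);
    }
}
```

Running Tika as a separate process keeps it off the application's classpath, at the cost of a JVM start-up per document; whether that trade-off is acceptable depends on the workload.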