Bug 64164

Summary: (POI 3.17) - Embedded files in .doc text extracted automatically - how to skip these
Product: POI
Component: POI Overall
Status: NEW
Severity: normal
Priority: P2
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Mac OS X 10.1
Reporter: Rob Squire <squirer7492>
Assignee: POI Developers List <dev>
CC: squirer7492
Attachments: Sample input file

Description Rob Squire 2020-02-20 16:01:47 UTC
Created attachment 37029
Sample input file

Hi there,

We recently realised that when a document (.doc, not .docx) contains embedded Excel spreadsheets, the text of those spreadsheets is automatically included in the output of text extraction.

  // bis is an input stream over the sample .doc, which contains an embedded
  // Excel file with some text in its cells
  POITextExtractor t =
       org.apache.poi.extractor.ExtractorFactory.createExtractor(bis);

  // returns the text of the .doc document BUT also the contents of the
  // embedded Excel document - is there a way to turn this feature off?
  String text = t.getText();


Please let us know if there is something we can do to get around this and turn this feature off for the text extractor.

Thanks,
Rob
Comment 1 Nick Burch 2020-02-20 16:15:18 UTC
For text extraction, you would be better off using Apache Tika. Tika wraps POI, but gives full control over the processing of text, metadata and embedded resources.
Comment 2 Rob Squire 2020-02-20 16:33:27 UTC
Thanks a lot for the quick reply Nick!

I think in the meantime we can try to do some gymnastics with the paragraphText and build up what we need to give back to the user (as POI is used quite a lot throughout our app already).

Much appreciated!
Rob
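
A minimal sketch of the paragraph-based workaround mentioned above. It assumes that the String[] returned by WordExtractor.getParagraphText() (org.apache.poi.hwpf.extractor) does not include the embedded spreadsheet's text, which should be checked against the sample file. The names ParagraphTextJoiner and join are hypothetical, and the helper itself uses only the JDK:

```java
// Hypothetical helper: rebuilds plain text from per-paragraph strings.
// Assumed input (needs POI on the classpath, not shown compiled here):
//   String[] paragraphs = new WordExtractor(inputStream).getParagraphText();
final class ParagraphTextJoiner {
    private ParagraphTextJoiner() {}

    static String join(String[] paragraphs) {
        StringBuilder out = new StringBuilder();
        for (String p : paragraphs) {
            if (p == null || p.isEmpty()) {
                continue; // skip missing or empty entries
            }
            // Normalise Windows and bare carriage-return line endings to "\n".
            String normalized = p.replace("\r\n", "\n").replace('\r', '\n');
            out.append(normalized);
            // Make sure every paragraph ends with exactly one line break of its own.
            if (!normalized.endsWith("\n")) {
                out.append('\n');
            }
        }
        return out.toString();
    }
}
```

Headers, footers and footnotes are probably not part of the paragraph text, so they would have to be collected separately if the application needs them.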
Comment 3 Tim Allison 2020-02-20 17:29:57 UTC
To follow up on Nick's advice, download from here: https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.23.jar

and try it with these options:

java -jar tika-app-1.23.jar -J -t input_file.doc
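
If the result is needed inside an existing Java application, the same command could be launched as a child process. This is only a sketch under stated assumptions: TikaAppRunner and its methods are hypothetical names, the jar and input paths are whatever the caller supplies, and the output is decoded with the platform default charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the tika-app command shown above.
final class TikaAppRunner {
    private TikaAppRunner() {}

    // Builds: java -jar <tikaJar> -J -t <inputFile>
    static List<String> command(String tikaJar, String inputFile) {
        return Arrays.asList("java", "-jar", tikaJar, "-J", "-t", inputFile);
    }

    // Runs the command and returns everything it wrote to standard output.
    static String run(String tikaJar, String inputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tikaJar, inputFile))
                // Pass stderr through so a full pipe cannot block the child.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stdout.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tika-app exited with code " + exit);
        }
        // Assumption: the child writes in the platform default charset.
        return new String(buffer.toByteArray(), Charset.defaultCharset());
    }
}
```

The output should be JSON with a separate entry for the container document and for each embedded file, so the caller can keep the container's text and discard the rest.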