Summary: | (POI 3.17) - Embedded files in .doc text extracted automatically - how to skip these | ||
---|---|---|---|
Product: | POI | Reporter: | Rob Squire <squirer7492> |
Component: | POI Overall | Assignee: | POI Developers List <dev> |
Status: | NEW --- | ||
Severity: | normal | CC: | squirer7492 |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | Mac OS X 10.1 | ||
Attachments: | Sample input file |
For text extraction, you would be better off using Apache Tika. Tika wraps POI, but gives full control over the processing of text, metadata, and embedded resources.

Thanks a lot for the quick reply, Nick! I think in the meantime we can try to do some gymnastics with the paragraphText and build up what we need to give back to the user (as POI is already used quite a lot throughout our app). Much appreciated!

Rob

To follow up on Nick's advice, download tika-app from here:

https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.23.jar

and try it with these options:

    java -jar tika-app-1.23.jar -J -t input_file.doc
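If that command needs to be driven from an existing Java application rather than typed by hand, a minimal sketch using only the JDK could look like the one below. The class name `TikaCli`, its method names, and the output-file parameter are hypothetical, chosen for illustration; the only parts taken from this thread are the jar name and the `-J -t` options. It assumes `java` is on the PATH and that the jar has been downloaded next to the application.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of POI or Tika: it only shells out to tika-app.
final class TikaCli {

    // Builds the same command line quoted in the comment above.
    static List<String> buildCommand(String tikaJar, String inputFile) {
        return Arrays.asList("java", "-jar", tikaJar, "-J", "-t", inputFile);
    }

    // Runs tika-app, sends its standard output to outputFile, and returns the exit status.
    static int run(String tikaJar, String inputFile, File outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaJar, inputFile))
                .redirectOutput(outputFile)                     // capture stdout in a file
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let warnings reach the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TikaCli <input.doc> <output.json>");
            return;
        }
        int status = run("tika-app-1.23.jar", args[0], new File(args[1]));
        System.err.println("tika-app exited with status " + status);
    }
}
```

As far as I understand the tika-app usage text, `-J` requests recursive JSON output with a separate entry for the container document and for each embedded file, and `-t` selects plain text as the content type. If that holds for your Tika version, the caller can keep the entry for the .doc itself and discard the entries for the embedded spreadsheets.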
Created attachment 37029
Sample input file

Hi there,

We recently realised that documents (.doc, not .docx) with embedded Excel spreadsheets have the spreadsheets' text extracted automatically as part of the text extraction process.

    // Pass an input stream (.doc sample containing an embedded Excel file with
    // some text in the cells).
    POITextExtractor t = org.apache.poi.extractor.ExtractorFactory.createExtractor(bis);

    // Produces the text of the .doc document BUT also the embedded Excel
    // document's contents. Is there a way to turn this feature off?
    t.getText();

Please let us know if there is something we can do to work around this and turn this feature off for the text extractor.

Thanks,
Rob