Bug 60519

Summary: Extractor for *SSF embeddings
Product: POI
Reporter: Andreas Beeker <kiwiwings>
Component: SS Common
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Bug Depends on: 60520    
Bug Blocks:    
Attachments: embedded extractor - changes not related to common ss

Description Andreas Beeker 2016-12-26 22:36:03 UTC
Created attachment 34555
embedded extractor - changes not related to common ss

Find attached an extractor for various kinds of objects embedded in Excel files.

This is based on the work for [1] and [2].
Apart from evaluating the ClassIDs of Ole10Native objects, it also finds PDFs hidden in EMFs, which seems to be a peculiarity of Mac Excel 2011.
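
To illustrate the "PDF hidden in an EMF" case, here is a minimal, self-contained sketch of one way to pull a PDF out of an arbitrary byte blob by scanning for the `%PDF-` header and the last `%%EOF` trailer. It is an assumption about how such a search could work, not a copy of the attached patch; the class name `PdfInEmfSketch` and its methods are hypothetical and are not part of the POI API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not a POI class).
final class PdfInEmfSketch {
    private static final byte[] PDF_START = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_END = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    private PdfInEmfSketch() {}

    /** True if needle occurs in haystack starting exactly at offset pos. */
    private static boolean matchesAt(byte[] haystack, int pos, byte[] needle) {
        if (pos < 0 || pos + needle.length > haystack.length) {
            return false;
        }
        for (int j = 0; j < needle.length; j++) {
            if (haystack[pos + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    /** Index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, int from, byte[] needle) {
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            if (matchesAt(haystack, i, needle)) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the last occurrence of needle at or after from, or -1. */
    static int lastIndexOf(byte[] haystack, int from, byte[] needle) {
        for (int i = haystack.length - needle.length; i >= Math.max(from, 0); i--) {
            if (matchesAt(haystack, i, needle)) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the bytes from the first "%PDF-" up to and including the last
     * "%%EOF" that follows it, or null if the blob contains no such range.
     * The last trailer is used because a PDF with incremental updates
     * contains more than one "%%EOF".
     */
    static byte[] extractPdf(byte[] blob) {
        int start = indexOf(blob, 0, PDF_START);
        if (start < 0) {
            return null;
        }
        int end = lastIndexOf(blob, start, PDF_END);
        if (end < 0) {
            return null;
        }
        return Arrays.copyOfRange(blob, start, end + PDF_END.length);
    }
}
```

In this sketch the raw bytes of the EMF picture would be passed to `extractPdf`; a `null` result means the picture carries no PDF payload. A marker scan like this ignores the EMF record structure, so it is only a heuristic.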

I'm not sure whether the extraction part in org.apache.poi.ss.extractor.EmbeddedExtractor should be part of POI or rather of Tika, but we didn't make this distinction for other kinds of extraction helpers either.

The code depends on changes to Common SS, which I document in a separate issue but need to commit together with this one.

I'll commit the code on 30.12.2016 if no one objects before then.

[1] http://stackoverflow.com/questions/41101012
[2] http://stackoverflow.com/questions/27011634

Comment 1 Andreas Beeker 2016-12-26 23:05:44 UTC
The test data for EMF with embedded PDF can be found under

Comment 2 Andreas Beeker 2016-12-31 21:58:10 UTC
Applied via r1776819