Bug 60519 - Extractor for *SSF embeddings
Summary: Extractor for *SSF embeddings
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common
Version: 3.16-dev
Hardware: All All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on: 60520
Blocks:
Reported: 2016-12-26 22:36 UTC by Andreas Beeker
Modified: 2016-12-31 21:58 UTC (History)
0 users



Attachments
embedded extractor - changes not related to common ss (74.25 KB, patch)
2016-12-26 22:36 UTC, Andreas Beeker

Description Andreas Beeker 2016-12-26 22:36:03 UTC
Created attachment 34555
embedded extractor - changes not related to common ss

Find attached an extractor for various kinds of objects embedded in Excel files.

This is based on the work for [1] and [2].
Apart from evaluating the ClassIDs of Ole10Native objects, this also finds PDFs hidden in EMFs, which seems to be a specialty of Mac Excel 2011.
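
To illustrate the "PDF hidden in an EMF" case: a minimal sketch, assuming the PDF is stored as one contiguous byte run inside the EMF picture data. It scans for the standard "%PDF-" header and the last "%%EOF" trailer. The class and method names (PdfInEmfSketch, findPdf) are hypothetical, and this is not necessarily how the attached patch does it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the POI API.
final class PdfInEmfSketch {
    private static final byte[] PDF_START = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_END = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the embedded PDF bytes, or null if no complete PDF is found. */
    static byte[] findPdf(byte[] blob) {
        int start = indexOf(blob, PDF_START);
        if (start < 0) {
            return null;
        }
        // Take the last trailer: incrementally saved PDFs may contain several.
        int end = lastIndexOf(blob, PDF_END);
        if (end < start) {
            return null;
        }
        return Arrays.copyOfRange(blob, start, end + PDF_END.length);
    }

    // Naive forward search for the first occurrence of pattern in data.
    private static int indexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Naive backward search for the last occurrence of pattern in data.
    private static int lastIndexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = data.length - pattern.length; i >= 0; i--) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

The input would be the raw bytes of the EMF picture; a null result means the picture carries no recognizable PDF.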

I'm not sure whether the extraction part in org.apache.poi.ss.extractor.EmbeddedExtractor should be part of POI or maybe Tika, but we didn't make this distinction for other types of extraction helpers either.

The code depends on changes to Common SS, which I document in a separate issue (bug 60520), but both need to be committed together.

I'll commit the code on 2016-12-30 if no one objects before then ...


[1] http://stackoverflow.com/questions/41101012
[2] http://stackoverflow.com/questions/27011634
Comment 1 Andreas Beeker 2016-12-26 23:05:44 UTC
The test data for an EMF with an embedded PDF can be found at
https://people.apache.org/~kiwiwings/Basic_Expense_Template_2011.xls
Comment 2 Andreas Beeker 2016-12-31 21:58:10 UTC
Applied via r1776819