Bug 60570 - Add rudimentary EMF read-only capability
Summary: Add rudimentary EMF read-only capability
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: 3.16-dev
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-10 17:18 UTC by Tim Allison
Modified: 2017-01-19 16:27 UTC

Attachments
initial patch (77.37 KB, patch)
2017-01-10 17:18 UTC, Tim Allison
test file 1 (130.20 KB, image/x-emf)
2017-01-10 17:19 UTC, Tim Allison
test file 2 (27.21 KB, image/x-emf)
2017-01-10 17:19 UTC, Tim Allison

Description Tim Allison 2017-01-10 17:18:45 UTC
Created attachment 34605
initial patch

It would be useful to start building up some EMF parsing functionality. EMFs can contain text as well as fully embedded documents.

A full EMF parser would take some time; let's focus on extraction first.
Comment 1 Tim Allison 2017-01-10 17:19:08 UTC
Created attachment 34606
test file 1
Comment 2 Tim Allison 2017-01-10 17:19:26 UTC
Created attachment 34607
test file 2
Comment 3 Tim Allison 2017-01-10 17:20:32 UTC
I made everything @Internal and dumped this all in scratchpad.  Let me know what you think.

Obviously, I have to strip out the static calls in the test files... <face_palm/>
Comment 4 Tim Allison 2017-01-19 16:27:36 UTC
r1779493

This patch adds the capability to perform a rudimentary parse of EMF and EMFPlus records, with the goal of extracting embedded PDFs (and other binary files) as well as WMFs.
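
For anyone unfamiliar with the format, here is a minimal sketch of what a "rudimentary parse" of EMF records can look like. It is not the code in the patch. The class and method names (EmfRecordWalker, RecordInfo, walk) are made up for illustration. It assumes only the basic EMF layout: each record starts with a 4-byte little-endian type and a 4-byte little-endian size that covers the whole record, and the stream ends with an EMR_EOF record.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not the classes added by the patch. */
final class EmfRecordWalker {
    // Record type values defined by the EMF format.
    static final int EMR_EOF = 14;
    static final int EMR_COMMENT = 70; // comment records can carry EMF+ or other binary payloads

    /** Type, start offset and total size (header included) of one record. */
    static final class RecordInfo {
        final int type;
        final int offset;
        final int size;

        RecordInfo(int type, int offset, int size) {
            this.type = type;
            this.offset = offset;
            this.size = size;
        }
    }

    /**
     * Walks the record headers without interpreting the payloads.
     * Stops after EMR_EOF, or when fewer than 8 bytes remain.
     */
    static List<RecordInfo> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<RecordInfo> records = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            long size = buf.getInt(offset + 4) & 0xFFFFFFFFL; // unsigned 32-bit
            // The size includes the 8-byte header and must be 4-byte aligned.
            if (size < 8 || size % 4 != 0 || offset + size > emf.length) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            records.add(new RecordInfo(type, offset, (int) size));
            offset += (int) size;
            if (type == EMR_EOF) {
                break;
            }
        }
        return records;
    }
}
```

A caller interested in embedded payloads would then look at the bytes of the EMR_COMMENT records, from offset + 8 up to offset + size, and skip everything else.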

This offers a start towards text extraction, although more work remains, including:
1) parsing and tracking the fonts to handle EXTTEXTOUTA and POLYTEXTOUTA
2) implementing the POLYTEXTOUT records (I couldn't find examples)
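
To illustrate why the fonts need tracking for the "A" (ANSI) text records: the text bytes are single- or multi-byte encoded, and the encoding is assumed here to follow the charset field of the currently selected font. The sketch below is hypothetical and not part of the patch. AnsiTextDecoder, charsetFor and decode are made-up names, and the table covers only a handful of the Windows LOGFONT charset values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of the patch. */
final class AnsiTextDecoder {
    // Partial mapping from LOGFONT charset values to Java charset names.
    private static final Map<Integer, String> CHARSETS = new HashMap<>();
    static {
        CHARSETS.put(0, "windows-1252");   // ANSI_CHARSET
        CHARSETS.put(161, "windows-1253"); // GREEK_CHARSET
        CHARSETS.put(162, "windows-1254"); // TURKISH_CHARSET
        CHARSETS.put(204, "windows-1251"); // RUSSIAN_CHARSET
        CHARSETS.put(238, "windows-1250"); // EASTEUROPE_CHARSET
    }

    /** Unknown values fall back to windows-1252 (an assumption, not a rule of the format). */
    static Charset charsetFor(int logFontCharSet) {
        String name = CHARSETS.get(logFontCharSet);
        if (name == null) {
            name = "windows-1252";
        }
        // Guard against JREs that ship without the extended charsets.
        return Charset.isSupported(name) ? Charset.forName(name) : StandardCharsets.ISO_8859_1;
    }

    /** Decodes the raw bytes of an ANSI text record using the font's charset. */
    static String decode(byte[] raw, int logFontCharSet) {
        return new String(raw, charsetFor(logFontCharSet));
    }
}
```

Without the font lookup, the same byte decodes differently depending on the guess: 0xCF is a Cyrillic letter under RUSSIAN_CHARSET but an accented Latin letter under ANSI_CHARSET.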

I developed this code with EMFs and WMFs extracted from Common Crawl and govdocs1. I only included unit tests for EMFs/WMFs that I could extract from POI's test files and/or Tika's test files.

If we're OK adding Common Crawl and/or govdocs1 docs to our unit test suite, I can add more unit tests.