Bug 60570

Summary: Add rudimentary EMF read-only capability
Product: POI
Reporter: Tim Allison <tallison>
Component: POI Overall
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: initial patch
test file 1
test file 2

Description Tim Allison 2017-01-10 17:18:45 UTC
Created attachment 34605
initial patch

It would be useful to start building up some EMF parsing functionality.  EMFs can contain text as well as fully embedded documents.

A full EMF parser would take some time; let's focus on extraction first.
Comment 1 Tim Allison 2017-01-10 17:19:08 UTC
Created attachment 34606
test file 1
Comment 2 Tim Allison 2017-01-10 17:19:26 UTC
Created attachment 34607
test file 2
Comment 3 Tim Allison 2017-01-10 17:20:32 UTC
I made everything @Internal and dumped this all in scratchpad.  Let me know what you think.

Obviously, I have to strip out the static calls in the test files... <face_palm/>
Comment 4 Tim Allison 2017-01-19 16:27:36 UTC
r1779493

This patch adds a rudimentary parse of EMF and EMFPlus records, with the goal of extracting embedded PDFs (and other binary files) as well as WMFs.

This offers a start towards text extraction, although more work remains, including: 
1) parsing and tracking the fonts to handle exttextouta and polytexta
2) implementation of the polytexts (I couldn't find examples)
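
For anyone following along, here is a minimal sketch of the record walk that this kind of parse builds on. It is not the code from the patch. It relies only on the EMF layout in which every record starts with a 4-byte little-endian type followed by a 4-byte size that counts the whole record. The class name EmfRecordWalker and the method recordTypes are hypothetical names made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI.
public class EmfRecordWalker {
    // EMR_EOF marks the last record in an EMF stream.
    static final int EMR_EOF = 14;

    /** Returns the record type ids in stream order, stopping at EMR_EOF. */
    public static List<Integer> recordTypes(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> types = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            // The size field is unsigned and includes the 8-byte type/size prefix.
            long size = buf.getInt(offset + 4) & 0xFFFFFFFFL;
            if (size < 8 || offset + size > emf.length) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            types.add(type);
            if (type == EMR_EOF) {
                break;
            }
            offset += (int) size;
        }
        return types;
    }
}
```

A real extractor would dispatch on the type at each step (for example, looking inside comment records for embedded payloads) rather than just collecting the ids; the bounds check matters because files pulled from a crawl can be truncated or corrupt.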

I developed this code with EMFs and WMFs extracted from commoncrawl and govdocs1.  I only included unit tests for EMFs/WMFs that I could extract from POI's test files and/or Tika's test files.

If we're OK with adding commoncrawl and/or govdocs1 docs to our unit test suite, I can add more unit tests.