Created attachment 34605: initial patch

It would be useful to start building up some EMF parsing functionality. EMFs can contain text as well as fully embedded documents. A full EMF parser would take some time; let's focus on extraction first.
Created attachment 34606: test file 1
Created attachment 34607: test file 2
I made everything @Internal and dumped this all in scratchpad. Let me know what you think. Obviously, I have to strip out the static calls in the test files... <face_palm/>
r1779493

This patch adds the capability to perform a rudimentary parse of EMF and EMFPlus records, with the goals of extracting embedded PDFs (and other binary files) as well as WMFs. This offers a start towards text extraction, although more work remains, including:
1) parsing and tracking the fonts to handle exttextouta and polytexta
2) implementing the polytexts (I couldn't find examples)

I developed this code with EMFs and WMFs extracted from commoncrawl and govdocs1. I only included unit tests for the EMFs/WMFs that I could extract from POI's test files and/or Tika's test files. If we're OK with adding commoncrawl and/or govdocs1 docs to our unit test suite, I can add more unit tests.
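For anyone following along, here is a minimal sketch of what a "rudimentary parse" of EMF records can look like. It is not the code in the patch. `EmfRecordWalker`, `Rec`, and `walk` are hypothetical names made up for illustration. It assumes only the general EMF record layout: a 4-byte little-endian type, then a 4-byte size that includes the 8-byte header and is a multiple of 4. It also flags comment records (type 70) whose identifier is the "EMF+" signature, which is where EMFPlus records live.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of POI: walks the top-level records of an EMF byte stream.
class EmfRecordWalker {
    static final int EMR_EOF = 14;      // last record in an EMF stream
    static final int EMR_COMMENT = 70;  // carries EMF+ records and other private data
    static final int EMF_PLUS_SIGNATURE = 0x2B464D45; // bytes "EMF+" read as a little-endian int

    /** One record: type, total size in bytes, offset in the stream, and whether it wraps EMF+ data. */
    static final class Rec {
        final int type;
        final int size;
        final int offset;
        final boolean emfPlus;

        Rec(int type, int size, int offset, boolean emfPlus) {
            this.type = type;
            this.size = size;
            this.offset = offset;
            this.emfPlus = emfPlus;
        }
    }

    static List<Rec> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<Rec> records = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            int size = buf.getInt(offset + 4);
            // Reject sizes that cannot be valid so a corrupt file cannot loop forever.
            if (size < 8 || size % 4 != 0 || size > emf.length - offset) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            // Comment layout: type, size, dataSize, then an optional 4-byte identifier at +12.
            boolean emfPlus = type == EMR_COMMENT
                    && size >= 16
                    && buf.getInt(offset + 12) == EMF_PLUS_SIGNATURE;
            records.add(new Rec(type, size, offset, emfPlus));
            if (type == EMR_EOF) {
                break;
            }
            offset += size;
        }
        return records;
    }
}
```

An extractor would then look at the payload of each record of interest (for example, the data after the identifier in a comment record) rather than interpreting every drawing record, which is why extraction is much cheaper than a full parser.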