Summary: Add rudimentary EMF read-only capability
Product: POI
Reporter: Tim Allison <tallison>
Component: POI Overall
Assignee: POI Developers List <dev>
test file 1
test file 2
Description Tim Allison 2017-01-10 17:18:45 UTC
Created attachment 34605: initial patch

It would be useful to start building up some EMF parsing functionality. EMFs can contain text as well as fully embedded documents. A full EMF parser would take some time; let's focus on extraction first.
Comment 3 Tim Allison 2017-01-10 17:20:32 UTC
I made everything @Internal and dumped this all in scratchpad. Let me know what you think. Obviously, I have to strip out the static calls in the test files... <face_palm/>
Comment 4 Tim Allison 2017-01-19 16:27:36 UTC
r1779493

This patch adds the capability to perform a rudimentary parse of EMF and EMFPlus records, with the goals of extracting embedded PDFs (and other binary files) as well as WMFs. This offers a start towards text extraction, although more work remains, including:

1) parsing and tracking the fonts to handle exttextouta and polytexta
2) implementing the polytexts (I couldn't find examples)

I developed this code with EMFs and WMFs extracted from commoncrawl and govdocs1. I only included unit tests for EMFs/WMFs that I could extract from POI's test files and/or Tika's test files. If we're OK adding commoncrawl and/or govdocs1 docs to our unit test suite, I can add more unit tests.
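
For anyone reading along who has not looked at the format: a rudimentary parse of this kind mostly comes down to walking the record stream. Each EMF record starts with a 4-byte little-endian type and a 4-byte little-endian size that includes those 8 header bytes. The sketch below is not the code from the patch and does not use any POI class. `EmfRecordWalker`, `RecordInfo`, `walk` and `sample` are hypothetical names made up for illustration, and the sample bytes are a synthetic stream with a truncated header record, not a valid EMF file. The record type values (1 for the header, 70 for comments, 14 for EOF) are taken from the EMF format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: lists the records in an EMF byte stream. */
public class EmfRecordWalker {
    public static final int EMR_HEADER = 1;
    public static final int EMR_EOF = 14;
    public static final int EMR_COMMENT = 70; // carries EMFPlus data and embedded payloads

    /** Type, start offset and total size (header included) of one record. */
    public static final class RecordInfo {
        public final int type;
        public final int offset;
        public final int size;

        RecordInfo(int type, int offset, int size) {
            this.type = type;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Walks the stream record by record until EMR_EOF or the end of the data. */
    public static List<RecordInfo> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<RecordInfo> records = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            // The size field is unsigned; widen it before checking bounds.
            long size = buf.getInt(offset + 4) & 0xFFFFFFFFL;
            if (size < 8 || offset + size > emf.length) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            records.add(new RecordInfo(type, offset, (int) size));
            offset += (int) size;
            if (type == EMR_EOF) {
                break;
            }
        }
        return records;
    }

    /** Synthetic three-record stream for demonstration only (not a valid EMF). */
    public static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(EMR_HEADER).putInt(8);                        // truncated header record
        b.putInt(EMR_COMMENT).putInt(16).putInt(4).putInt(0);  // 4 bytes of comment data
        b.putInt(EMR_EOF).putInt(20).putInt(0).putInt(0).putInt(20);
        return b.array();
    }

    public static void main(String[] args) {
        for (RecordInfo r : walk(sample())) {
            System.out.println("type=" + r.type + " offset=" + r.offset + " size=" + r.size);
        }
    }
}
```

An extractor would then look inside the records of interest (for example the data of EMR_COMMENT records) rather than just listing them. Failing fast on an impossible size is one possible choice; a more lenient reader for messy real-world files might stop and return what it has.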