Summary: | Add rudimentary EMF read-only capability | ||
---|---|---|---|
Product: | POI | Reporter: | Tim Allison <tallison> |
Component: | POI Overall | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | ||
Priority: | P2 | ||
Version: | 3.16-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | initial patch, test file 1, test file 2 | ||
Created attachment 34606 [details]
test file 1
Created attachment 34607 [details]
test file 2
I made everything @Internal and dumped this all in scratchpad. Let me know what you think. Obviously, I have to strip out the static calls in the test files... <face_palm/>

r1779493

This patch adds the capability to perform a rudimentary parse of EMF and EMFPlus records, with the goals of extracting embedded PDFs (and other binary files) as well as WMFs. This offers a start towards text extraction, although more work remains, including:

1) parsing and tracking the fonts to handle EMR_EXTTEXTOUTA and EMR_POLYTEXTOUTA
2) implementing the PolyTextOut records (I couldn't find examples)

I developed this code with EMFs and WMFs extracted from Common Crawl and govdocs1. I only included unit tests for EMFs/WMFs that I could extract from POI's test files and/or Tika's test files. If we're OK with adding Common Crawl and/or govdocs1 docs to our unit test suite, I can add more unit tests.
Created attachment 34605 [details]
initial patch
It would be useful to start building up some EMF parsing functionality. EMFs can contain text as well as fully embedded documents. A full EMF parser would take some time; let's focus on extraction first.
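For anyone following along, here is a minimal sketch of what an "extraction first" pass over an EMF stream might look like. It is not the code in the attached patch and does not use any POI API; `EmfRecordWalker` and `Rec` are hypothetical names made up for illustration. It relies only on the record framing in the EMF format: each record starts with a 4-byte little-endian type and a 4-byte little-endian size, where the size covers the whole record and is a multiple of 4.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI): walks the top-level records of an EMF stream. */
final class EmfRecordWalker {
    // Record type constants from the EMF format.
    static final int EMR_HEADER = 0x01;
    static final int EMR_EOF = 0x0E;
    static final int EMR_COMMENT = 0x46;

    /** One record: its type plus the location of the bytes after the 8-byte type/size prefix. */
    static final class Rec {
        final int type;
        final int payloadOffset;
        final int payloadLength;

        Rec(int type, int payloadOffset, int payloadLength) {
            this.type = type;
            this.payloadOffset = payloadOffset;
            this.payloadLength = payloadLength;
        }
    }

    /** Returns the records in stream order, stopping after EMR_EOF or when the data runs out. */
    static List<Rec> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<Rec> records = new ArrayList<>();
        int offset = 0;
        while (emf.length - offset >= 8) {
            int type = buf.getInt(offset);
            // The size field is unsigned and includes the 8 bytes of type and size.
            long size = buf.getInt(offset + 4) & 0xFFFFFFFFL;
            if (size < 8 || size % 4 != 0 || size > emf.length - offset) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            records.add(new Rec(type, offset + 8, (int) size - 8));
            if (type == EMR_EOF) {
                break;
            }
            offset += (int) size;
        }
        return records;
    }
}
```

With a walker like this, an extractor could filter for `EMR_COMMENT` records, which carry EMF+ data and application-private data, and inspect their payloads for embedded content. How a given producer packs an embedded document into those payloads is not covered by this sketch.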