Bug 61296 - Bring over missing constants from Tika
Summary: Bring over missing constants from Tika
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: 3.17-dev
Hardware: All All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-13 20:31 UTC by Nick Burch
Modified: 2017-07-14 14:54 UTC

Attachments
a quick comparison of Tika and POI constants (19.85 KB, text/tab-separated-values)
2017-07-14 03:15 UTC, Javen O'Neal
a quick comparison of Tika and POI constants (19.85 KB, text/tab-separated-values)
2017-07-14 03:43 UTC, Javen O'Neal

Description Nick Burch 2017-07-13 20:31:45 UTC
In Apache Tika, under tika-parsers/src/main/java/org/apache/tika/parser/microsoft/, there is now a surprisingly large number of POI and OOXML constants in the parser codebase.

We should review these, add our own constants where we don't already have them (e.g. relationships or types we don't have defined), then switch the Tika classes over to our constants after a release.
Comment 1 Javen O'Neal 2017-07-14 03:15:49 UTC
Created attachment 35138
a quick comparison of Tika and POI constants

https://github.com/apache/tika/tree/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/

git clone https://github.com/apache/tika.git apache-tika
pushd apache-tika
cd tika-parsers/src/main/java/org/apache/tika/parser/microsoft/
grep -r -P "(static final|final static|http://schemas|vnd|urn)" .

Most notably:
* ./ooxml/AbstractOOXMLExtractor.java has 8 relationship schema URLs and 1 OOXML MIME type
* ./ooxml/OOXMLWordAndPowerPointTextHandler.java has 6 schema URLs and 2 URNs
* ./POIFSContainerDetector.java has several MIME types
A few other files have constants as well.
See the attachment for a list of current constants that could be copied over.
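
For anyone who wants to repeat the survey with file and line information, the grep command above could be approximated by a small script along these lines. This is only a sketch and not how the attachment was produced. The function names `match_lines` and `scan_tree` are made up for illustration, the scan is limited to `*.java` files, and the `urn` and `vnd` alternatives are tightened to `urn:` and `vnd.` (an assumption) so that words such as `return` do not show up as matches.

```python
import re
from pathlib import Path

# Same alternatives as the grep pattern, except that `urn` and `vnd` are
# matched as `urn:` and `vnd.` (assumption) to cut down on false positives
# such as the `urn` inside `return`.
PATTERN = re.compile(r"static final|final static|http://schemas|vnd\.|\burn:")


def match_lines(text):
    """Return (line_number, stripped_line) for every line matching PATTERN."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


def scan_tree(root):
    """Yield (relative_path, line_number, line) for matches in *.java files under root."""
    root = Path(root)
    for path in sorted(root.rglob("*.java")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in match_lines(text):
            yield path.relative_to(root).as_posix(), number, line
```

Calling `scan_tree(".")` from the tika-parsers microsoft directory and joining each tuple with tabs would give rows that can be saved as tab-separated values for comparison against the POI sources.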
Comment 2 Javen O'Neal 2017-07-14 03:16:27 UTC
r1801901
Comment 3 Javen O'Neal 2017-07-14 03:40:12 UTC
r1801903
Comment 4 Javen O'Neal 2017-07-14 03:43:44 UTC
Created attachment 35139
a quick comparison of Tika and POI constants
Comment 5 Tim Allison 2017-07-14 11:10:03 UTC
Yup.  Sorry.  I've been meaning to do this.  Thank you, Nick and Javen!

Speaking of which...is there any interest in moving over the SAX-based docx/pptx code from Tika into POI?
Comment 6 Javen O'Neal 2017-07-14 14:54:48 UTC
Yes, absolutely!