Bug 54332 - WMF extraction failing in Tika for older PowerPoint Files
Summary: WMF extraction failing in Tika for older PowerPoint Files
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-19 19:53 UTC by Dave Meikle
Modified: 2015-06-25 19:41 UTC
CC List: 0 users



Attachments
file that triggers issue from govdocs1 (49.00 KB, application/vnd.ms-powerpoint)
2015-06-25 11:56 UTC, Tim Allison

Description Dave Meikle 2012-12-19 19:53:44 UTC
For the PowerPoint 97 file attached to the JIRA issue below, the embedded WMF file fails to be read within the WMF class, as it appears not to be compressed.
 
See https://issues.apache.org/jira/browse/TIKA-1046 for more details.
Comment 1 Tim Allison 2015-06-24 16:38:24 UTC
Similar issue re-discovered in https://issues.apache.org/jira/browse/TIKA-1612 during analysis of the most commonly caught exceptions in govdocs1. However, for the file I posted in TIKA-1612, I'm not sure that a valid WMF file is being extracted.

See TIKA-1612 for the example file and the result of getRawBytes().
Comment 2 Andreas Beeker 2015-06-24 23:40:02 UTC
The error was that some pictures have two checksum/UID fields, which was ignored until now; see [MS-ODRAW] 2.2.25 OfficeArtBlipWMF.

fixed with r1687398

... hopefully I don't forget to merge it back when I merge the common_sl branch ... :|
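
For readers following the fix, here is a minimal sketch of the record layout described in [MS-ODRAW] 2.2.25. It is not the actual change in r1687398. It assumes that record instance 0x216 means one 16-byte UID and 0x217 means two, and that the UIDs are followed by a 34-byte metafile header whose byte at offset 32 is the compression flag (0x00 for DEFLATE, 0xFE for none). The class and method names are hypothetical and not part of the POI API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not POI code: locates the WMF payload in an
// OfficeArtBlipWMF record body (the bytes after the 8-byte record header).
final class WmfBlipSketch {
    static final int UID_SIZE = 16;              // each checksum/UID field
    static final int METAFILE_HEADER_SIZE = 34;  // cbSize, bounds, size, cbSave, compression, filter
    static final int COMPRESSION_OFFSET = 32;    // offset of the flag inside the metafile header
    static final int COMPRESSION_DEFLATE = 0x00;
    static final int COMPRESSION_NONE = 0xFE;

    // Assumed mapping from the record instance to the number of UID fields.
    static int uidCount(int recInstance) {
        switch (recInstance) {
            case 0x216: return 1;
            case 0x217: return 2;
            default:
                throw new IllegalArgumentException(
                        "not a WMF blip instance: 0x" + Integer.toHexString(recInstance));
        }
    }

    // Skips one or two UIDs and the metafile header, then inflates the
    // payload only if the compression flag says it is DEFLATE-compressed.
    static byte[] extractWmf(int recInstance, byte[] body) {
        int headerStart = uidCount(recInstance) * UID_SIZE;
        int dataStart = headerStart + METAFILE_HEADER_SIZE;
        if (body.length < dataStart) {
            throw new IllegalArgumentException("record body too short");
        }
        int compression = body[headerStart + COMPRESSION_OFFSET] & 0xFF;
        byte[] raw = Arrays.copyOfRange(body, dataStart, body.length);
        if (compression == COMPRESSION_NONE) {
            return raw;
        }
        if (compression != COMPRESSION_DEFLATE) {
            throw new IllegalArgumentException("unknown compression flag: " + compression);
        }
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(raw))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the second UID is ignored, the header is read 16 bytes too early, so the compression flag and the start of the payload are both wrong. That would be consistent with the data looking uncompressed or corrupt.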
Comment 3 Tim Allison 2015-06-25 11:53:45 UTC
Thank you for fixing this! With the ppt on TIKA-1612, I'm no longer getting an exception. Great! However, the bytes that I'm extracting (with .getData()) aren't valid PNG (or any other image format, as far as I can tell).

Is there something else going on, too?  Should I open a separate issue?
Comment 4 Tim Allison 2015-06-25 11:56:36 UTC
Created attachment 32856
file that triggers issue from govdocs1
Comment 5 Andreas Beeker 2015-06-25 13:06:40 UTC
The new testcase uses the two tika files [TIKA-1612]/[TIKA-1046].
I could extract the WMF and open it with IrfanView.
Please drop me an email and I'll check it next week - I don't have much WIFI available until Sunday night ...

Andi
Comment 6 Dominik Stadler 2015-06-25 13:44:40 UTC
Tim, WMF is its own format, not PNG or JPEG; see https://en.wikipedia.org/wiki/Windows_Metafile. Windows machines will display it; I'm not sure about Java support, though.
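
A quick way to tell which format extracted bytes are in is to look at the leading magic bytes. The sketch below is only an illustration with a hypothetical class name. It assumes the usual signatures: 89 50 4E 47 0D 0A 1A 0A for PNG, FF D8 FF for JPEG, D7 CD C6 9A for a placeable WMF, and 01 00 09 00 or 02 00 09 00 for a WMF without the placeable header. It does not validate the rest of the file.

```java
// Hypothetical helper, not part of POI or Tika: guesses an image format
// from the first few bytes only.
final class ImageSniffSketch {

    static String sniff(byte[] data) {
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpeg";
        }
        // Placeable WMF key 0x9AC6CDD7, stored little-endian.
        if (startsWith(data, 0xD7, 0xCD, 0xC6, 0x9A)) {
            return "wmf";
        }
        // Plain WMF header: type 1 or 2, header size of 9 words.
        if (startsWith(data, 0x01, 0x00, 0x09, 0x00)
                || startsWith(data, 0x02, 0x00, 0x09, 0x00)) {
            return "wmf";
        }
        return "unknown";
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```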
Comment 7 Tim Allison 2015-06-25 19:41:58 UTC
Doh!  User error, of course...WMF not PNG.  Sorry!  And thank you, again!