Bug 21775

Summary: Non-MS Office Docs with Valid Header Signature
Product: POI
Reporter: Jacob Zwiers <apache_bugzilla>
Component: HPSF
Assignee: POI Developers List <dev>
Status: CLOSED FIXED    
Severity: normal    
Priority: P3    
Version: 3.0-dev   
Target Milestone: ---   
Hardware: All   
OS: other   
Attachments: An offending Corel Presentation File
Source Code to demonstrate ClassCastException on SummaryInformation.getWordCount() for .shw files

Description Jacob Zwiers 2003-07-21 17:02:59 UTC
The Corel Presentation software (at least versions 8 and 9) writes files that 
start with the header signature POIFS recognizes (0xE11AB1A1E011CFD0L), so the 
properties are read from such a file, and that causes a problem.  The net result 
is that SummaryInformation.getWordCount() throws a ClassCastException in this 
situation.  I will attach (if possible on a later screen) a small test class 
and .shw file for demonstration purposes.
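
For illustration, a minimal sketch of the signature test described above. It is not the POIFS implementation; the class and method names are hypothetical. It assumes only that the first eight bytes of the file, read as a little-endian long, are compared against the value quoted in the report.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Ole2SignatureSketch {
    // Signature value quoted in the report. On disk this corresponds to the
    // byte sequence D0 CF 11 E0 A1 B1 1A E1 (little-endian).
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Hypothetical helper: true if the buffer starts with the signature.
    static boolean hasOle2Signature(byte[] header) {
        if (header == null || header.length < 8) {
            return false;
        }
        long first8 = ByteBuffer.wrap(header, 0, 8)
                                .order(ByteOrder.LITTLE_ENDIAN)
                                .getLong();
        return first8 == OLE2_SIGNATURE;
    }
}
```

A matching signature only says the container format looks like OLE 2; it says nothing about whether the property streams inside hold the expected value types.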
Comment 1 Jacob Zwiers 2003-07-21 17:04:14 UTC
Created attachment 7425
An offending Corel Presentation File
Comment 2 Jacob Zwiers 2003-07-21 17:08:24 UTC
Created attachment 7426
Source Code to demonstrate ClassCastException on SummaryInformation.getWordCount() for .shw files
Comment 3 Andy Oliver 2003-07-21 17:25:04 UTC
Just because it doesn't have SummaryInformation, or does SummaryInformation wrong, doesn't 
mean POIFS is wrong to recognize it as an OLE 2 Compound Document format file.   Isn't this 
just a problem of HPSF not recognizing that the SummaryInformation stream isn't what it thinks it is?
Comment 4 Jacob Zwiers 2003-07-21 18:14:34 UTC
Here's what I've figured out.  I'll let you decide which it is.   

If I spin through all the properties that I get back, the following is returned 
for the word count (ID #15 according to the PropertyIDMap.PID_WORDCOUNT 
constant; element 13 in the array of properties).

DocumentPropertyReader - props[13].getID() = 15
DocumentPropertyReader - props[13].getType() = 0
DocumentPropertyReader - props[13].getValue() = [B@9eca9c26
DocumentPropertyReader - props[13].getValue().getClass() = class [B

The type read here (handled in the default branch of the switch in the 
org.apache.poi.hpsf.Property constructor, based on the return value of 
org.apache.poi.hpsf.littleendian.DWord.intValue()) was not recognized as 
Variant.VT_I4.  Instead, it's zero == VT_EMPTY.  The value gets put into the 
properties as a byte array, which causes the ClassCastException when calling 
getWordCount().

I'm not sure (not knowing the nuts and bolts) if the Corel doc is actually 
behaving properly as an OLE2 doc.  If it isn't, I guess the problem is that 
POIFS thinks it is.  If it is, then the problem is either that VT_EMPTY 
doesn't get treated as a null OR that getWordCount() doesn't properly take 
this into account.
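
The failure mode described in this comment can be reproduced in isolation. The sketch below does not use the HPSF classes; the map and the getter are hypothetical stand-ins that assume a getter which casts the stored value straight to Integer, while the reader stored a byte array for the unrecognized type 0.

```java
import java.util.HashMap;
import java.util.Map;

class VtEmptyCastSketch {
    // Property ID shown in the log output above.
    static final int PID_WORDCOUNT = 15;

    // Hypothetical getter that assumes the stored value is always an Integer.
    static int naiveWordCount(Map<Integer, Object> props) {
        return (Integer) props.get(PID_WORDCOUNT);
    }

    // Returns true if the naive getter fails the way the report describes.
    static boolean failsWithClassCast() {
        Map<Integer, Object> props = new HashMap<>();
        // Stand-in for what the default branch stored: raw bytes, not an Integer.
        props.put(PID_WORDCOUNT, new byte[0]);
        try {
            naiveWordCount(props);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```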
Comment 5 Andy Oliver 2003-07-24 17:24:02 UTC
POIFS is right, HPSF is wrong.
Comment 6 Rainer Klute 2003-07-25 09:30:24 UTC
HPSF does not yet support VT_EMPTY. A proper implementation would be to return
null. See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/automat/htm/chap6_7zdz.asp
for an explanation of the variant types. I'll prepare a patch.
Comment 7 Rainer Klute 2003-07-26 21:52:57 UTC
HPSF is now able to read properties which are given in the property set stream
but which don't have a value. The type of such properties is VT_EMPTY.
PropertySet's getXXX methods return either null or 0, whichever is
appropriate. Details about return types can be found in the API documentation.
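
A minimal sketch of the behavior described in this comment, not the actual HPSF patch: a value of type VT_EMPTY decodes to null, and a primitive-returning getter falls back to 0. The class and method names are hypothetical; only the variant type codes VT_EMPTY (0) and VT_I4 (3) are taken from the variant type definitions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class VtEmptyReadSketch {
    static final int VT_EMPTY = 0;
    static final int VT_I4 = 3;

    // Hypothetical decoder: maps a variant type and its raw bytes to a Java value.
    static Object decode(int type, byte[] raw) {
        switch (type) {
            case VT_EMPTY:
                return null; // property is present but has no value
            case VT_I4:
                return Integer.valueOf(
                    ByteBuffer.wrap(raw, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt());
            default:
                return raw; // unsupported types stay as raw bytes in this sketch
        }
    }

    // Hypothetical int getter: 0 when the value is missing or not an Integer.
    static int intOrZero(Object value) {
        return (value instanceof Integer) ? (Integer) value : 0;
    }
}
```

With this shape, a caller that needs to tell "no value" apart from a real 0 would have to check the decoded object for null rather than rely on the int getter.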