Bug 44886 - Format of PICT records seems different to other metafile blips
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF
Version: 3.0-FINAL
Hardware: PC Windows Vista
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-27 18:38 UTC by Trejkaz (pen name)
Modified: 2008-04-29 23:27 UTC (History)



Attachments
Hex dump of PICT blip (5.35 KB, text/plain)
2008-04-27 18:38 UTC, Trejkaz (pen name)
A version of EscherMetafileBlip which correctly processes primary blip UID (10.83 KB, application/octet-stream)
2008-04-28 01:59 UTC, Yegor Kozlov

Description Trejkaz (pen name) 2008-04-27 18:38:01 UTC
Created attachment 21864
Hex dump of PICT blip

Now that Bug 44857 is resolved and I can fully parse the Escher records in one of our test files, I have been investigating a subtle change in the MD5 of the returned data.  The reason turned out to be obvious, but the contents of the picture were unusable both before and after the fix: the picture record makes no sense internally, or at least not when parsed as a WMF/EMF blip.

I've attached a hex dump of the blip in question, with the header parsed out by hand.  I have already confirmed that the declared length of this blip is correct, but as you can see, the contents don't match what the code currently used for EMF and WMF expects, so the format must be something else entirely.

The remaining data doesn't appear to contain a declaration of the length of itself, so I've been wondering if perhaps it's an opaque blob of PICT data.  I don't know the format of PICT though so I can't confirm this right now.
Comment 1 Trejkaz (pen name) 2008-04-27 18:56:58 UTC
Okay here's some more analysis.  It isn't a raw PICT, but it isn't the same as the other blip either.  However, it's remarkably similar unless I have this all wrong.

After the header, we have...

   57 32 7B 91 23 5D DB 36 7A DB FF 17 FE F3 A7 05
   C7 15 69 2D E5 89 A3 6F 66 03 D6 24 F7 DB 1D 13 (32 bytes unknown)

   72 A1 00 00                                       <-- cb (uncompressed size)

   00 00 00 00 00 00 00 00 A3 00 00 00 40 00 00 00   <-- rcBounds

   25 ED 1F 00 6A B1 0C 00                           <-- ptSize

   23 04 00 00                                       <-- cbSave (compressed size)

   00                                                <-- fCompression

   FE                                                <-- fFilter

cbSave using this scheme does exactly match the remaining data in the blip.

So I take it this is the same as EMF/WMF but with 32 bytes of UID instead of 16?
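
A minimal sketch of the layout above, for illustration only. MetafileHeaderSketch and its parse method are hypothetical names, not POI code, and little-endian byte order is an assumption based on how the values in the dump read. The UID length is a parameter, so the same code covers both the 16-byte and the 32-byte case.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; this is not a POI class.
final class MetafileHeaderSketch {
    final int cb;           // uncompressed size
    final int[] rcBounds;   // four 32-bit values
    final int[] ptSize;     // two 32-bit values
    final int cbSave;       // compressed size
    final int fCompression; // read as an unsigned byte
    final int fFilter;      // read as an unsigned byte

    private MetafileHeaderSketch(ByteBuffer in) {
        cb = in.getInt();
        rcBounds = new int[] { in.getInt(), in.getInt(), in.getInt(), in.getInt() };
        ptSize = new int[] { in.getInt(), in.getInt() };
        cbSave = in.getInt();
        fCompression = in.get() & 0xFF;
        fFilter = in.get() & 0xFF;
    }

    // offset: where the UID bytes start; uidBytes: 16 or 32 (assumed values).
    static MetafileHeaderSketch parse(byte[] data, int offset, int uidBytes) {
        ByteBuffer in = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        in.position(offset + uidBytes); // skip the UID(s)
        return new MetafileHeaderSketch(in);
    }
}
```

Fed the bytes listed above with uidBytes = 32, this gives cb = 41330 (0xA172) and cbSave = 1059 (0x423).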
Comment 2 Trejkaz (pen name) 2008-04-27 19:51:05 UTC
Someone emailed me from the POI project saying they're looking into it.
Comment 3 Josh Micich 2008-04-27 20:16:14 UTC
(In reply to comment #2)
> Someone emailed me from the POI project saying they're looking into it.
> 

I'm having trouble finding documentation for the Escher file stream.  This is the best I have found, from this page:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
this file:
http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/OfficeDrawing97-2007BinaryFormatSpecification.pdf

My understanding is that it is completely OK for POI contributors to use these documents.  Does anyone know of a better resource describing the Escher file format?  Perhaps we could update the POI source with a reference/URL to that document.

It looks like this particular record (recordId == RECORD_ID_PICT) is described on page 16 of the above document and from what I can tell, the unknown binary data might be in zlib/deflate format.
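
If the payload really is zlib/deflate, which is only a guess at this point, it could be checked with the standard java.util.zip classes. A minimal sketch, assuming the compressed bytes have already been sliced out of the record; BlipInflateSketch is a hypothetical name, not a POI class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only; this is not a POI class.
final class BlipInflateSketch {
    // Inflates a zlib-wrapped deflate stream into a byte array.
    static byte[] inflate(byte[] compressed) {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Thrown, for example, when the data is not a valid zlib stream.
            throw new UncheckedIOException(e);
        }
    }
}
```

If the guess is right, the length of the result should equal the uncompressed size declared in the record.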

Hope this helps.
Comment 4 Trejkaz (pen name) 2008-04-27 21:12:47 UTC
If we can trust the comments in that document, then:
  1. EMF, WMF and PICT are the same after all.
  2. Any of these may have a second UID after the first.
  3. The means of determining this is
     blip_instance ^ blip_signature == 1, where both of these values appear to be
     nontrivial to compute (to me anyway.)
Comment 5 David Fisher 2008-04-27 21:40:54 UTC
Hi Guys,

Yegor has worked through these formats for me and he can tell you what is up. If I recall, the PICT format may require that you download QuickTime for Java from Apple. Also, Yegor had success with either WMF or EMF, but not the other.

Also, AFAIK the OSP should cover the use of the format spec, but that won't help with PICT. That is Apple's QuickDraw format as grown from what 24 years ago. Apple has always published the format. It wouldn't be too hard to format. I do have generative code in FORTRAN if there becomes a desire to generate.

Regards,
Dave
Comment 6 Trejkaz (pen name) 2008-04-27 23:23:26 UTC
In terms of getting the actual PICT data into a renderable image, that's a separate problem IMO, and one which lies outside of POI.

For this bug record, the problem is determining when to read the extra 16 bytes of UID.  If someone can figure that out, then we'll have a way to get out the byte[] data, and some other library can read the PICT data, just like some other library reads the WMF and EMF.
Comment 7 Josh Micich 2008-04-28 01:38:16 UTC
(In reply to comment #4)
> If we can trust the comments in that document, then:
>   3. The means of determining this is
>      blip_instance ^ blip_signature == 1, where both of these values appear to
> be nontrivial to compute (to me anyway.)

I guess you got this from page 17.  The full text is:

"The primary UID is only saved to disk if (blip_instance ^ blip_signature == 1). Blip_instance is MSOFBH.inst and  blip_signature is one of the values defined in MSOBI"

MSOFBH seems to be the common record header from page 8.  I believe the POI class EscherRecordHeader corresponds to this:
MSOFBH.ver,inst <=> EscherRecordHeader.options
MSOFBH.fbt      <=> EscherRecordHeader.recordId 
MSOFBH.cbLength <=> EscherRecordHeader.remainingBytes

So the inst field probably corresponds to EscherRecord.getInstance()

MSOBI enum is mentioned on page 15.  It's not clear to me how to calculate blip_signature.  The exclusive or operator giving a result of 1 is also a bit weird here.  Note that none of the constants from MSOBI have the LSB set. So perhaps the test for writing the extra UID is whether the LSB of EscherRecord.getInstance() is set.  Perhaps the expression was written as such to emphasize that this rule only works when blip_signature == EscherRecord.getInstance() & 0x0FFE.

This is all speculation on my part.  It might be best to verify the behaviour empirically.  Two existing POI junits hit the method EscherMetafileBlip.fillFields() 4 times:
TestHSSFPictureData.testPictures() line: 45	"SimpleWithImages.xls"
TestOLE2Embeding.testEmbeding() line: 36	"ole2-embedding.xls"
- so perhaps with these files, and your current examples you can decipher Microsoft's cryptic description of the m_rgbUidPrimary field.
Comment 8 Yegor Kozlov 2008-04-28 01:57:11 UTC
//  3. The means of determining this is
//     blip_instance ^ blip_signature == 1, where both of these values appear to be
//     nontrivial to compute (to me anyway.)

I figured out how to do this check. See what we have:

Metafile signatures are defined in the spec as follows:

typedef enum
   {
   msobiWMF  = 0x216,      // Metafile header then compressed WMF
   msobiEMF  = 0x3D4,      // Metafile header then compressed EMF
   msobiPICT = 0x542,      // Metafile header then compressed PICT
   }
MSOBI;

In your test data EscherMetafileBlip.Options=0x5430, so the instance (the upper 12 bits of Options) is 0x543.

According to the spec:
 0x543 ^ 0x542 == 1; //bingo! need to read extra 16 bytes 
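
The check can be sketched in Java as follows. This is not the attached EscherMetafileBlip. BlipUidCheckSketch and its method names are hypothetical, and it assumes the instance is the upper 12 bits of the 16-bit options value, which is consistent with 0x5430 giving 0x543.

```java
// Hypothetical helper for illustration only; this is not a POI class.
final class BlipUidCheckSketch {
    // Signatures from the MSOBI enum quoted above.
    static final int SIGNATURE_WMF  = 0x216;
    static final int SIGNATURE_EMF  = 0x3D4;
    static final int SIGNATURE_PICT = 0x542;

    // Assumption: the low 4 bits of options are the version, the rest the instance.
    static int instanceOf(int options) {
        return (options & 0xFFFF) >> 4;
    }

    // True when the spec's condition (blip_instance ^ blip_signature == 1) holds.
    static boolean hasExtraUid(int options, int signature) {
        return (instanceOf(options) ^ signature) == 1;
    }
}
```

For the test data, hasExtraUid(0x5430, SIGNATURE_PICT) is true, so the extra 16 bytes need to be read. With options = 0x5420 it would be false.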


I attached my version of EscherMetafileBlip. Please exercise it against your test data and confirm it works OK. If it does, I will commit the fix.
Note, I reverted your previous fix. EscherMetafileBlip.field_2_cb always defines the correct metafile size.   

Also, it would be good to have test data where blip_instance ^ blip_signature != 1. Please attach a sample if you find one.

Regards,
Yegor
Comment 9 Yegor Kozlov 2008-04-28 01:59:40 UTC
Created attachment 21867
A version of EscherMetafileBlip which correctly processes primary blip UID
Comment 10 Yegor Kozlov 2008-04-28 02:06:30 UTC
> 
> Also, AFAIK the OSP should cover the use of the format spec, but that won't
> help with PICT. That is Apple's QuickDraw format as grown from what 24 years
> ago. Apple has always published the format. It wouldn't be too hard to format.
> I do have generative code in FORTRAN if there becomes a desire to generate.
> 

I don't think we will encounter legal issues with it.
We don't create or interpret metafiles. We only extract metafiles from existing documents or insert them into xls or ppt.
Comment 11 Trejkaz (pen name) 2008-04-28 16:08:00 UTC
That version of EscherMetafileBlip fixes the problem for me.

Also I stepped through all our test files looking for a blip where the result was 0x00, but I couldn't find one.
Comment 12 Trejkaz (pen name) 2008-04-28 16:22:45 UTC
Somewhat related to this, is it possible that suggestFileExtension() using the format mask directly is also slightly incorrect?
Comment 13 Yegor Kozlov 2008-04-29 08:16:32 UTC
(In reply to comment #12)
> Somewhat related to this, is it possible that suggestFileExtension() using the
> format mask directly is also slightly incorrect?
> 

Good catch. The correct version should use blip.getRecordId():

    public String suggestFileExtension()
    {
        switch (blip.getRecordId())
        {
            case EscherMetafileBlip.RECORD_ID_WMF:
                return "wmf";
            case EscherMetafileBlip.RECORD_ID_EMF:
                return "emf";
            case EscherMetafileBlip.RECORD_ID_PICT:
                return "pict";
            case EscherBitmapBlip.RECORD_ID_PNG:
                return "png";
            case EscherBitmapBlip.RECORD_ID_JPEG:
                return "jpeg";
            case EscherBitmapBlip.RECORD_ID_DIB:
                return "dib";
            default:
                return "";
        }
    }

Yegor
Comment 14 Yegor Kozlov 2008-04-29 23:27:27 UTC
Thanks for the patch. 
I committed my version and a unit test.

Yegor