Bug 46392

Summary: Code for Reading Ole10Native Data
Product: POI
Reporter: Rainer Schwarze <rsc>
Component: POI Overall
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: enhancement
CC: bonniot
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows 2000   
Attachments:
  source file for reading the ole10native data
  source code for tests
  test data file
  Some Package files which don't work :)
  Another one which doesn't work

Description Rainer Schwarze 2008-12-13 14:45:06 UTC
Created attachment 23013
source file for reading the ole10native data

I finally found the time to submit some code for reading Ole10Native structures (for instance, for extracting ZIP files embedded in Word files).

As I am not sure to which package it should belong, I put it in the default package.

The test file is a DOCX, so the test case depends on openxml4j. If necessary, this dependency can be removed by extracting the files from the "/word/embeddings" location and working on them directly.
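
A minimal sketch of that workaround, assuming plain java.util.zip is acceptable in place of openxml4j: it collects every entry under word/embeddings/ from a DOCX stream. The names DocxEmbeddings and extract are hypothetical and are not part of the attached code. The entries found this way are typically OLE2 compound files, so the Ole10Native stream inside each one would still have to be opened with POIFS afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the attached code.
final class DocxEmbeddings {
    /** Returns entry name -> raw bytes for every file under word/embeddings/. */
    static Map<String, byte[]> extract(InputStream docx) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // ZIP entry names carry no leading slash.
                if (entry.isDirectory() || !entry.getName().startsWith("word/embeddings/")) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                result.put(entry.getName(), out.toByteArray());
            }
        }
        return result;
    }
}
```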

Best wishes,
Rainer

Comment 1 Rainer Schwarze 2008-12-13 14:46:09 UTC
Created attachment 23014
source code for tests

Comment 2 Rainer Schwarze 2008-12-13 14:46:51 UTC
Created attachment 23015
test data file

Comment 3 Trejkaz (pen name) 2009-03-24 17:10:08 UTC
Created attachment 23413
Some Package files which don't work :)

Here are some package files from our own test data suite which don't match the format this parser expects.

The \u0001Ole10Native stream contains just a 4-byte length followed by the data. However, there are also 3 additional OLE2 entries which may or may not contain the data missing from the \u0001Ole10Native stream.

The creator in this case was Microsoft Word 8.0 (Word 97).

I'd like to say it's a different format, but the items have the same CLSID as other items which do have the full structure you have submitted code for, and they are handled by the same DLL (packager.dll).  So I suspect MS consider them to be two versions of the same format.

Comment 4 Trejkaz (pen name) 2009-03-24 17:17:56 UTC
I will also add what I have discovered so far.  All the new files I can create using WordPad or Word 2003 have 6 bytes in \u0003ObjInfo, whereas these old ones have 4.  But I only have these 6 old files and don't have a sufficiently old version of Windows and/or Office to get a copy of any more of them...

Comment 5 Trejkaz (pen name) 2009-03-24 18:16:36 UTC
Created attachment 23414
Another one which doesn't work

Here's one I created by dropping a .txt file into a Word 2003 document.  It doesn't work because the code which reads the 8 bytes between the first two strings and the third string apparently makes a false assumption about the meaning of the third byte.

Disassembled stream:

@0x0: 36 01 00 00
  Length of remaining data.

@0x4: 02 00
  Our original theory was that this is TYMED_FILE, but TYMED values are normally 4 bytes.
  My new theory is that this is actually a header for the string array that follows,
  as there happen to be two strings.

@0x6: 50 61 63 6b 61 67 65 64 20 66 69 6c 65 00
  "Packaged file" + NUL

@0x14: 44 3a 5c 70 61 63 6b 61 67 65 64 2e 74 78 74 00
  "D:\packaged.txt" + NUL

@0x24: 00 00
  Unknown.  Possibly terminates the string array, assuming that theory was correct.

@0x26: 03 00
  Unknown.
  Matches the value in \u0003ObjInfo but this may just be coincidence.
  Also matches the number of strings which follow the data but this may be coincidence too.

@0x28: 30 00 00 00
  Unknown.  Matches the length of the following string, possibly a coincidence.

@0x2c: 43 3a 5c 55 73 65 72 73 5c 64 61 6e 69 65 6c 5c 41 70 70 44 61 74 61 5c
       4c 6f 63 61 6c 5c 54 65 6d 70 5c 70 61 63 6b 61 67 65 64 2e 74 78 74 00
  "C:\Users\daniel\AppData\Local\Temp\packaged.txt" + NUL

@0x5c: 38 00 00 00
  Length of actual data following.

@0x98: 2f 00 00 00
  Number of UTF-16LE chars following.

@0xfa: 0d 00 00 00
  Number of UTF-16LE chars following.

@0x118: 0f 00 00 00
  Number of UTF-16LE chars following.
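
To make the offsets above easier to check, here is a rough sketch that reads the leading part of the stream (everything up to and including the two bytes at 0x26) with a little-endian ByteBuffer. It assumes all multi-byte values are little-endian, as the dump suggests. The class and field names (Ole10NativeHeaderSketch, leadingValue, formatValue, and so on) are hypothetical, and the meaning of the two-byte fields is still unconfirmed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical names; field meanings follow the guesses in the dump above.
final class Ole10NativeHeaderSketch {
    int remainingLength; // 4 bytes at 0x0: length of the remaining data
    int leadingValue;    // 2 bytes at 0x4: meaning unconfirmed
    String label;        // first NUL-terminated string
    String fileName;     // second NUL-terminated string
    int unknown;         // 2 bytes after the two strings
    int formatValue;     // next 2 bytes ("03 00" in the dump)
    int bodyOffset;      // offset of the first byte after formatValue

    static Ole10NativeHeaderSketch parse(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        Ole10NativeHeaderSketch h = new Ole10NativeHeaderSketch();
        h.remainingLength = buf.getInt();
        h.leadingValue = buf.getShort() & 0xFFFF;
        h.label = readAsciiz(buf);
        h.fileName = readAsciiz(buf);
        h.unknown = buf.getShort() & 0xFFFF;
        h.formatValue = buf.getShort() & 0xFFFF;
        h.bodyOffset = buf.position();
        return h;
    }

    /** Reads single-byte characters up to a NUL and consumes the NUL. */
    static String readAsciiz(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        for (byte b = buf.get(); b != 0; b = buf.get()) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }
}
```

Fed the first 0x28 bytes of the stream above, bodyOffset should come out as 0x28, which is where the 4-byte "30 00 00 00" value sits.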

Comment 6 Trejkaz (pen name) 2009-03-24 19:18:41 UTC
The value @0x26, which was "03 00" in my file, is "01 00" or "00 00" throughout the existing test data in this issue.  I'm starting to think it's a format number of some kind.

00 00, followed by:
  00 00 00 00 (and then EOF)

01 00, followed by:
  00 00
  ASCIIZ command line
  00 00

03 00, followed by:
  4 bytes - length of command line
  ASCIIZ command line (we know the length already though)
  4 bytes - attachment length
  attachment data itself

  Optionally in here, some multiple (normally 3 it seems) of this:
    4 bytes - length of Unicode string (in UTF-16LE chars)
    Unicode string value (no null termination on these)

  00 00

What worries me is the missing 02 00.  I doubt there will be any values newer than 04 00 yet if Office 2007 on Vista is still generating 03 00, but I haven't yet seen a file with 02 00, which suggests a big gap in my knowledge.  So far the above is consistent for all files I have actually seen.
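
The three layouts listed above could be dispatched on roughly like this. It is only a sketch of the current theory, not a confirmed description of the format. It assumes the input starts at the two-byte format value, that the command-line length in the 03 00 case includes the terminating NUL (as in the dump in comment 5), and that the Unicode strings are UTF-16LE with the length counted in chars. Any other value, including the unseen 02 00, is rejected. The class and field names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; layouts follow the guesses listed above.
final class Ole10NativeBodySketch {
    int format;
    String commandLine; // stays null for format 0
    byte[] data;        // stays null unless format is 3
    final List<String> unicodeStrings = new ArrayList<>();

    /** Parses bytes starting at the 2-byte format value. */
    static Ole10NativeBodySketch parse(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        Ole10NativeBodySketch r = new Ole10NativeBodySketch();
        r.format = buf.getShort() & 0xFFFF;
        switch (r.format) {
            case 0:
                buf.getInt(); // 00 00 00 00, then EOF
                break;
            case 1:
                buf.getShort(); // 00 00
                r.commandLine = readAsciiz(buf);
                buf.getShort(); // 00 00
                break;
            case 3:
                int cmdLen = buf.getInt();
                byte[] cmd = new byte[cmdLen];
                buf.get(cmd);
                // Assumption: the length covers the trailing NUL, so drop it.
                int end = (cmdLen > 0 && cmd[cmdLen - 1] == 0) ? cmdLen - 1 : cmdLen;
                r.commandLine = new String(cmd, 0, end, StandardCharsets.ISO_8859_1);
                r.data = new byte[buf.getInt()];
                buf.get(r.data);
                // Optional Unicode strings; a trailing "00 00" is too short
                // to be mistaken for another 4-byte length.
                while (buf.remaining() >= 4) {
                    byte[] utf16 = new byte[buf.getInt() * 2];
                    buf.get(utf16);
                    r.unicodeStrings.add(new String(utf16, StandardCharsets.UTF_16LE));
                }
                break;
            default:
                throw new IllegalArgumentException("Unseen format value: " + r.format);
        }
        return r;
    }

    /** Reads single-byte characters up to a NUL and consumes the NUL. */
    static String readAsciiz(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        for (byte b = buf.get(); b != 0; b = buf.get()) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }
}
```

Throwing on unknown values is a deliberate choice here: if a 02 00 file ever turns up, it fails loudly instead of being misread as one of the known layouts.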

Comment 7 Maxim Valyanskiy 2010-09-09 09:19:07 UTC
Added in r995415 with some changes. Thank you.