Bug 41898 - Word 2003 pictures cannot be extracted
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: Other other
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2007-03-19 22:34 UTC by Trejkaz (pen name)
Modified: 2011-08-09 13:03 UTC

Test document containing a single image, possibly in EMF+ format (23.50 KB, application/octet-stream)
2007-03-19 22:34 UTC, Trejkaz (pen name)

Description Trejkaz (pen name) 2007-03-19 22:34:10 UTC
While trying to generate an EMF+ image file by creating a picture in Office, I
found something slightly interesting.

Calling getDocument().getPicturesTable().getAllPictures() on a Word 2003
document returns an image whose content is only four bytes of data, equal to
0x80000000 (little endian).

This is probably a placeholder indicating that the actual data is stored
elsewhere, but I'm not yet sure where.
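
For illustration, here is a minimal sketch of how such a four-byte payload
could be recognised. It uses only the JDK, not POI; the class and method names
are hypothetical, and it assumes the bytes in the file are 00 00 00 80, which
read as a little-endian 32-bit integer gives 0x80000000. Whether that value
really is a placeholder is still a guess at this point.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI.
class PlaceholderCheck {
    // Value observed in this bug; treating it as a marker is an assumption.
    static final int SUSPECTED_PLACEHOLDER = 0x80000000;

    // Returns true if the content is exactly four bytes that decode,
    // as a little-endian int, to the suspected placeholder value.
    static boolean isSuspectedPlaceholder(byte[] content) {
        if (content == null || content.length != 4) {
            return false;
        }
        int value = ByteBuffer.wrap(content)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        return value == SUSPECTED_PLACEHOLDER;
    }

    public static void main(String[] args) {
        byte[] fourBytes = {0x00, 0x00, 0x00, (byte) 0x80};
        System.out.println(isSuspectedPlaceholder(fourBytes));
    }
}
```

A check like this could be applied to the content of each extracted picture to
flag the ones that did not yield real image data.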
Comment 1 Trejkaz (pen name) 2007-03-19 22:34:49 UTC
Created attachment 19749
Test document containing a single image, possibly in EMF+ format
Comment 2 Nick Burch 2007-03-20 03:29:52 UTC
Is it possible to also get the original image you added to the document?

(We can then go hunting for byte sequences from the original image, which might
make it easier to figure out how to turn the 4 byte value into a way to get at
the image)
Comment 3 Trejkaz (pen name) 2007-03-20 03:46:50 UTC
It was drawn in the picture editor.  I found the bytes in there (search 
for "This" encoded as UTF-16 LE.)

If I had the raw bytes for the image already then I wouldn't go through this 
convoluted mechanism of trying to obtain them.  My own goal is just to acquire 
an EMF+ image.  I think I've failed to do so but in the process I found this 
interesting situation. :-)
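
The search mentioned above can be sketched roughly as follows. This is plain
JDK code with hypothetical names, not anything from POI; it assumes the whole
file (or stream) has already been read into a byte array and simply looks for
the first occurrence of a string encoded as UTF-16 LE.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
class Utf16Search {
    // Returns the offset of the first occurrence of text, encoded as
    // UTF-16 LE (no byte order mark), within haystack, or -1 if absent.
    static int indexOf(byte[] haystack, String text) {
        byte[] needle = text.getBytes(StandardCharsets.UTF_16LE);
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

For the attached document, something like
Utf16Search.indexOf(Files.readAllBytes(path), "This") would give a starting
point for inspecting the surrounding bytes in a hex editor.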
Comment 4 Nick Burch 2007-03-29 10:37:49 UTC
I've added your Word doc to svn, and included a test of the current (incorrect)
behaviour in usermodel.TestPictures

Did you have any luck finding a real offset anywhere? I wonder if the 0x8000...
is some sort of mask you have to apply to a value to get the real offset, but
that could be wrong.
Comment 5 Trejkaz (pen name) 2007-03-30 01:05:04 UTC
I can't see another offset no matter how hard I look.

Maybe it's stored elsewhere, like in the WordDocument stream where the image 
is embedded (which would mean we'd have to either read through the 
WordDocument to find the images, or locate images by scanning for that magic 
number instead of assuming the next one starts at the same spot.)
Comment 6 Sergey Vladimirov 2011-08-09 13:03:30 UTC
The image is not stored as EMF, but in Office Drawing format.

You can access the shape id (1044) using the new OfficeDrawings interface. After that, you will need additional work with the org.apache.poi.ddf package to draw/extract/convert the picture.

Thus I'm closing this issue as INVALID.