While trying to generate an EMF+ image file by creating a picture in Office, I found something slightly interesting. Calling getDocument().getPicturesTable().getAllPictures() on a Word 2003 document returns an image that consists of only four bytes of data, equal to 0x80000000 (little endian). This is probably a placeholder indicating that the actual data is stored elsewhere, but I'm not yet sure where.
Created attachment 19749: Test document containing a single image, possibly in EMF+ format
Is it possible to also get the original image you added to the document? (We can then go hunting for byte sequences from the original image, which might make it easier to figure out how to turn the 4-byte value into a way to get at the image.)
It was drawn in the picture editor. I found the bytes in there (search for "This" encoded as UTF-16 LE.) If I had the raw bytes for the image already then I wouldn't go through this convoluted mechanism of trying to obtain them. My own goal is just to acquire an EMF+ image. I think I've failed to do so but in the process I found this interesting situation. :-)
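For anyone repeating that search, here is a minimal sketch of finding a UTF-16 LE encoded string in a raw byte array. It uses only the Java standard library. The class name Utf16LeSearch and the sample bytes are made up for illustration and are not part of POI; in practice the byte array would be a stream read out of the attached document.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: locates a UTF-16 LE encoded string in raw bytes.
final class Utf16LeSearch {

    /** Returns the byte offset of the first match of text (encoded as UTF-16 LE) in data, or -1. */
    static int indexOf(byte[] data, String text) {
        byte[] needle = text.getBytes(StandardCharsets.UTF_16LE);
        if (needle.length == 0) {
            return 0;
        }
        // Naive scan: compare the needle against every possible start position.
        for (int i = 0; i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a stream from the document: three filler bytes, then "This" in UTF-16 LE.
        byte[] sample = {0x01, 0x02, 0x03, 'T', 0, 'h', 0, 'i', 0, 's', 0};
        System.out.println(indexOf(sample, "This"));
    }
}
```

The scan is deliberately naive; for a document of a few hundred kilobytes that should be more than fast enough.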
I've added your Word doc to svn, and included a test of the current (incorrect) behaviour in usermodel.TestPictures. Did you have any luck finding a real offset anywhere? I wonder if the 0x8000... is some sort of mask you have to apply to a value to get the real offset, but that could be wrong.
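The mask idea can be checked with a little arithmetic. The sketch below assumes the four bytes on disk are 00 00 00 80, so that they read as 0x80000000 in little endian order; that byte layout is an assumption based on the description above. The class and method names are made up for illustration and are not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI: inspects the four-byte value returned as "picture data".
final class PlaceholderValue {

    static final int HIGH_BIT = 0x80000000;

    /** Reads the first four bytes of the array as a little endian 32-bit integer. */
    static int readLittleEndianInt(byte[] fourBytes) {
        return ByteBuffer.wrap(fourBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** True if the top bit is set, which is what a flag or mask bit would look like. */
    static boolean hasHighBit(int value) {
        return (value & HIGH_BIT) != 0;
    }

    /** The remaining 31 bits, i.e. the candidate offset if the top bit were only a flag. */
    static int withoutHighBit(int value) {
        return value & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        // Assumed on-disk byte order for 0x80000000 stored little endian.
        byte[] raw = {0x00, 0x00, 0x00, (byte) 0x80};
        int value = readLittleEndianInt(raw);
        System.out.println(Integer.toHexString(value));
        System.out.println(hasHighBit(value));
        System.out.println(withoutHighBit(value));
    }
}
```

Under that assumption the top bit is set but the remaining 31 bits are all zero, so if the value is a flag plus an offset, the offset part is empty. Any real location would have to come from somewhere else.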
I can't see another offset no matter how hard I look. Maybe it's stored elsewhere, like in the WordDocument stream where the image is embedded (which would mean we'd have to either read through the WordDocument to find the images, or locate images by scanning for that magic number instead of assuming the next one starts at the same spot.)
The image is not stored as EMF, but in Office Drawing format. You can access the shape ID (1044) using the new OfficeDrawings interface. After that, you will need to do additional work with the org.apache.poi.ddf package to draw, extract, or convert the picture. I'm therefore closing this issue as INVALID.