Bug 46220 - Regression: Some embedded images being lost
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.2-FINAL
Hardware: PC Windows Vista
Importance: P2 regression
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2008-11-16 14:05 UTC by Trejkaz (pen name)
Modified: 2011-06-24 08:19 UTC


Description Trejkaz (pen name) 2008-11-16 14:05:10 UTC
Some of our own test cases have been failing since an update from POI 3.1 to 3.2.

The embedded images now come out incorrectly or are not found at all.
Comment 1 Trejkaz (pen name) 2008-11-16 14:15:42 UTC
Test file 1 (too big to attach. :-( )

This one now gets 3 images instead of 4.

Correct MD5 digests for each image (first three do match up so the images it does pick up must be okay.)

72d07b8db5fad7099d90bc4c304b4666   <-- this is the missing one
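
For anyone wanting to repeat the digest comparison, here is a minimal sketch of the check, using only the JDK. It assumes the image bytes have already been pulled out of the document; with HWPF that would presumably be something like `HWPFDocument.getPicturesTable().getAllPictures()` followed by `Picture.getContent()`, but that part is left out here. The class name `ImageDigestCheck` and both helper names are made up for illustration, and the byte array in `main` is a placeholder, not one of the real images.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of POI: compares expected MD5 digests
// against the digests of whatever images were actually extracted.
final class ImageDigestCheck {

    // Lower-case hex MD5 of a byte array.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the expected digests that no extracted image matches.
    static List<String> missingDigests(List<String> expected, List<byte[]> extracted) {
        Set<String> found = new LinkedHashSet<>();
        for (byte[] image : extracted) {
            found.add(md5Hex(image));
        }
        List<String> missing = new ArrayList<>();
        for (String digest : expected) {
            if (!found.contains(digest)) {
                missing.add(digest);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder data standing in for one extracted image.
        List<byte[]> extracted = new ArrayList<>();
        extracted.add(new byte[] {1, 2, 3});

        List<String> expected = Arrays.asList(
                md5Hex(new byte[] {1, 2, 3}),
                "72d07b8db5fad7099d90bc4c304b4666"); // the digest reported missing

        System.out.println("Missing digests: " + missingDigests(expected, extracted));
    }
}
```

Comparing digests rather than image counts also catches the case where an image is found but comes out with the wrong bytes.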
Comment 2 Nick Burch 2008-11-17 02:02:02 UTC
I wonder if this is another area of HWPF that assumes it has byte offsets when it is actually being given character offsets, or vice versa. (I tried to make it a bit more sane for 3.2, so that Unicode text extraction worked more reliably, but the file format is really crazy about this sort of thing.)

One test that'd be interesting is creating a few small test files with images, and seeing how hwpf copes with them:
* non unicode, one image near start
* non unicode, one image near end
* non unicode, image near start, image near end
* unicode, one image near start
* unicode, one image near end
* unicode, image near start, image near end

If files 5 and 6 (the two Unicode ones with an image near the end) have issues with their later images, we'll know it's another byte/character problem. If it's something different, maybe it'll help us track it down.

As a bonus, the files will make a good regression test for the future :)
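
To make the byte/character mix-up concrete, here is a small sketch using only the JDK. It is not HWPF code; the class `OffsetDemo` and the method `byteOffset` are made up for illustration, and ISO-8859-1 is used as a stand-in for Word's 8-bit text. The point is only that the same character position maps to different byte positions depending on whether the text is stored as 8-bit or as UTF-16LE, so an offset interpreted in the wrong unit drifts further off the later it falls in the document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: how far into the encoded bytes a character offset lands.
final class OffsetDemo {

    // Number of bytes occupied by the first charCount characters of text
    // when encoded with the given charset.
    static int byteOffset(String text, int charCount, Charset charset) {
        return text.substring(0, charCount).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "text before the image";
        int charOffset = 12;

        int eightBit = byteOffset(text, charOffset, StandardCharsets.ISO_8859_1);
        int unicode = byteOffset(text, charOffset, StandardCharsets.UTF_16LE);

        // In the 8-bit case the two units coincide; in the UTF-16LE case the
        // byte offset is twice the character offset for this plain text.
        System.out.println("chars=" + charOffset
                + " 8-bit bytes=" + eightBit
                + " UTF-16LE bytes=" + unicode);
    }
}
```

This is why the test files with an image near the end are the interesting ones: an error of this kind is small near the start of the text and grows with the offset.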
Comment 3 Trejkaz (pen name) 2008-11-17 17:38:35 UTC
Any suggestions on how to go about this?  I tried doing what was suggested on the list, and incrementally adding images to existing documents, but even for "broken" ones, Word saved them in a form which fixed any warnings.
Comment 4 Nick Burch 2008-11-19 04:49:23 UTC
The fact that re-saving them in Word fixes the warning does tend to make me think these files really are slightly dodgy, and it isn't just us.

First up, could you try older versions of Word, to see if maybe there was one earlier version which produced these dodgy files?

Otherwise, if you could find a couple of different files which do trigger the warning, but which don't all have image issues, that'd be a big help. Especially if we could have two versions of each: the original (with the warning) and the newer Word re-saved version (without warnings). We can then compare them, see the differences, and hopefully figure out what needs fixing or working around.
Comment 5 Trejkaz (pen name) 2008-11-19 14:59:23 UTC
Same file as before, with one more image added in and the number edited from 4 to 5 in the document...

72d07b8db5fad7099d90bc4c304b4666 <-- missing (same image as before)


A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
Comment 6 Trejkaz (pen name) 2008-11-19 15:01:12 UTC
Actually, for comparison, these are the same warnings I'm getting on the original file in this case, i.e. the Unicode one isn't appearing for this test case, only for the other ones (which I can't redistribute, because someone checked in IP as test cases.) :-(
Comment 7 Yegor Kozlov 2011-06-24 08:19:54 UTC
Images are properly read with current trunk. I included your sample in our collection of test documents and added a unit test.