Bug 46220

Summary: Regression: Some embedded images being lost
Product: POI
Reporter: Trejkaz (pen name) <trejkaz>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: regression    
Priority: P2    
Version: 3.2-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows Vista   

Description Trejkaz (pen name) 2008-11-16 14:05:10 UTC
Some of our own test cases have been failing since an update from POI 3.1 to 3.2.

The embedded images now come out incorrectly or are not found at all.
Comment 1 Trejkaz (pen name) 2008-11-16 14:15:42 UTC
Test file 1 (too big to attach. :-( )
http://dl.getdropbox.com/u/50201/nonconsecutive-images.doc

This one now gets 3 images instead of 4.

Correct MD5 digests for each image (first three do match up so the images it does pick up must be okay.)

851be142bce6d01848e730cb6903f39e
7fc6d8fb58b09ababd036d10a0e8c039
a7dc644c40bc2fbf17b2b62d07f99248
72d07b8db5fad7099d90bc4c304b4666   <-- this is the missing one
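
A check like this can be sketched in Java: hash each extracted image and list the expected digests that never turn up. The class and method names here (ImageDigestCheck, md5Hex, missingDigests) are made up for illustration and are not part of POI. The sketch assumes the image bytes have already been pulled out of the document by whatever extraction code is under test.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a POI class.
class ImageDigestCheck {

    /** Returns the lowercase hex MD5 digest of one image's raw bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the expected digests that no extracted image produced, in expected order. */
    static List<String> missingDigests(List<String> expected, List<byte[]> extracted) {
        Set<String> found = new HashSet<>();
        for (byte[] image : extracted) {
            found.add(md5Hex(image));
        }
        List<String> missing = new ArrayList<>();
        for (String digest : expected) {
            if (!found.contains(digest)) {
                missing.add(digest);
            }
        }
        return missing;
    }
}
```

Passing the four digests above as the expected list, along with the bytes of each image the extractor returns, should report only the digest of the image that was lost.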
Comment 2 Nick Burch 2008-11-17 02:02:02 UTC
I wonder if this is another area of HWPF that is assuming bytes/characters but getting characters/bytes. (I tried to make it a bit more sane for 3.2, so that Unicode text extraction worked more reliably, but the file format is really crazy about this sort of thing.)

One test that'd be interesting is creating a few small test files with images, and seeing how hwpf copes with them:
* non unicode, one image near start
* non unicode, one image near end
* non unicode, image near start, image near end
* unicode, one image near start
* unicode, one image near end
* unicode, image near start, image near end

If the fifth and sixth files (the Unicode ones with an image near the end) have issues with their later images, we'll know it's another byte/character problem. If it's something different, maybe it'll help us track it down.

As a bonus, the files will make a good regression test for the future :)
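
To illustrate the kind of byte/character mix-up meant here, a minimal Java sketch follows. It is not HWPF code: the class OffsetUnits and its methods are hypothetical, and it assumes text is stored either as 1 byte per character or as UTF-16LE at 2 bytes per character.

```java
// Hypothetical illustration, not part of POI.
class OffsetUnits {

    /** Bytes per stored character: 2 for 16-bit Unicode text, 1 for 8-bit text. */
    static int bytesPerChar(boolean unicode) {
        return unicode ? 2 : 1;
    }

    /** Converts a byte offset within a text run to a character offset. */
    static int byteToCharOffset(int byteOffset, boolean unicode) {
        return byteOffset / bytesPerChar(unicode);
    }

    /** Converts a character offset within a text run to a byte offset. */
    static int charToByteOffset(int charOffset, boolean unicode) {
        return charOffset * bytesPerChar(unicode);
    }

    /** Byte offset of the first occurrence of marker in 16-bit Unicode text, or -1 if absent. */
    static int byteOffsetOfMarker(String text, char marker) {
        int charOffset = text.indexOf(marker);
        return charOffset < 0 ? -1 : charToByteOffset(charOffset, true);
    }
}
```

Under these assumptions, a marker at character 100 of Unicode text sits at byte 200. Code that treats that 200 as a character offset looks far past the real position, and the gap grows with distance into the text, which is why the files with an image near the end are the interesting ones.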
Comment 3 Trejkaz (pen name) 2008-11-17 17:38:35 UTC
Any suggestions on how to go about this?  I tried doing what was suggested on the list, and incrementally adding images to existing documents, but even for "broken" ones, Word saved them in a form which fixed any warnings.
Comment 4 Nick Burch 2008-11-19 04:49:23 UTC
The fact that re-saving them in Word fixes the warning does tend to make me think these files really are slightly dodgy, and it isn't just us.

First up, could you try older versions of Word, to see if maybe there was one earlier version which produced these dodgy files?

Otherwise, if you could find a couple of different files which do trigger the warning, but not all have image issues, that'd be a big help. Especially if we could have two versions of each: the original (with warning) and the newer Word re-saved version (without warnings). We can then compare them, see the differences, and hopefully figure out what needs fixing or working around.
Comment 5 Trejkaz (pen name) 2008-11-19 14:59:23 UTC
Same file as before, with one more image added in and the number edited from 4 to 5 in the document...

851be142bce6d01848e730cb6903f39e
7fc6d8fb58b09ababd036d10a0e8c039
a7dc644c40bc2fbf17b2b62d07f99248
5eee0af68b7856b731a7775db8a6e6e2
72d07b8db5fad7099d90bc4c304b4666 <-- missing (same image as before)

http://dl.getdropbox.com/u/50201/nonconsecutive-images-2.doc

Warnings:
A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
Comment 6 Trejkaz (pen name) 2008-11-19 15:01:12 UTC
Actually, for comparison, these are the same warnings I'm getting on the original file in this case, i.e. the Unicode one isn't appearing for this test case, only for the other ones (which I can't redistribute, because someone checked in IP as test cases.) :-(
Comment 7 Yegor Kozlov 2011-06-24 08:19:54 UTC
Images are properly read with current trunk. I included your sample in our collection of test documents and added a unit test.

Yegor